Methodology

The interpretive layer of SCM-Arena: why it is structured this way, what the metrics capture, and how to make sense of the results.

New here? Start with the 3-minute guide for quick metric definitions, thresholds, and practitioner interpretation rules. This page is the deeper technical reference and design rationale.

Key takeaways
  • SCM-Arena evaluates behavior over time, not single-shot correctness.
  • Results are reported in two contexts: baseline (standard conditions) and stress-tested (deliberately constrained).
  • Large metric values are diagnostic signals of system-level breakdown, not noise.
  • Comparisons are diagnostic, not winner-take-all rankings.
  • v1 is frozen and fully reproducible; seeds, configs, and run artifacts are versioned.

On this page
  1. What this benchmark measures
  2. Design overview
  3. Contexts and regimes
  4. Outcome metrics
  5. Behavioral descriptors
  6. Interpretation pitfalls
  7. Reading the results
  8. Reproducibility

What this benchmark measures

SCM-Arena measures how models behave over time when their decisions interact inside a supply chain system.

Unlike knowledge or question-answering benchmarks, SCM-Arena does not evaluate whether a model produces a correct response in isolation. It evaluates how a model's decisions propagate through a dynamic system, affecting inventory, backlog, and downstream behavior across multiple roles and time periods.

This distinction matters because supply chains are governed by feedback, delay, and coordination. Small differences in ordering behavior compound over time, determining whether the system stabilizes, oscillates, or breaks down. SCM-Arena is designed to surface these dynamics explicitly.

A model can score highly on knowledge benchmarks and still perform poorly when its decisions interact with delayed information, partial visibility, other decision-makers, and uncertain demand over time. In operational settings, these interactions determine outcomes.

Design overview

SCM-Arena is built on variants of the Beer Game, a canonical multi-tier supply chain simulation used in operations research for decades to study coordination, information delays, and the bullwhip effect. The Beer Game provides a four-tier supply chain, delayed feedback, inventory and backlog dynamics, and a well-understood human baseline (Sterman).

Each model is evaluated across a fully crossed experimental design. Conditions vary along five factors:

  • Scenario: demand pattern (classic steady, random, shock, seasonal).
  • Visibility: what each tier can see (local only, adjacent neighbors, full chain).
  • Memory: how much history the model receives (none, short window, full).
  • Prompt framing: minimal guidance (neutral) vs. structured objectives (specific).
  • Game mode: which variant of the simulation is run (two modes).

This produces 144 unique experimental conditions per model (4 scenarios × 3 visibility levels × 3 memory strategies × 2 prompt types × 2 game modes). Each condition is replicated 5 times with fixed seeds, and each episode runs for 52 rounds of ordering decisions. Across 75 models, this yields over 53,000 total episodes and 10,800 model-condition cells.

  • Models: 75
  • Conditions per model: 144
  • Replications per cell: 5
  • Rounds per episode: 52
  • Total episodes: 53,000+
  • Total model-condition cells: 10,800
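
To make the crossing concrete, the sketch below reproduces the condition and episode counts above; the factor labels are paraphrased from this page, and the two game-mode names are placeholders.

```python
from itertools import product

# Factor levels as described on this page (game-mode names are placeholders).
scenarios  = ["classic", "random", "shock", "seasonal"]  # 4 demand patterns
visibility = ["local", "adjacent", "full"]               # 3 visibility levels
memory     = ["none", "short", "full"]                   # 3 memory strategies
prompts    = ["neutral", "specific"]                     # 2 prompt framings
game_modes = ["mode_a", "mode_b"]                        # 2 game modes

conditions = list(product(scenarios, visibility, memory, prompts, game_modes))
assert len(conditions) == 144              # conditions per model

replications = 5                           # fixed-seed replications per condition
rounds_per_episode = 52                    # ordering decisions per episode
episodes_per_model = len(conditions) * replications   # 720 episodes per model
```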

This design enables within-model sensitivity analysis (how does behavior change as conditions tighten?) and across-model comparison (which models degrade more under constraint?). Fixed seeds ensure full reproducibility.

Contexts and regimes

SCM-Arena reports results in two complementary contexts.

Baseline reflects the canonical Beer Game: steady demand, local visibility (each tier sees only its own state), no decision memory, and neutral prompts. This matches how the game has been played in research settings for decades: limited information, delayed feedback, no shared state. It answers: how does this model behave under the standard conditions that produce the bullwhip effect in human play?

Stress-tested contexts vary demand patterns, expand or restrict visibility, and alter memory access to probe how behavior changes across the design space. They answer: what improves when models get more information, and what breaks when scenarios become more volatile?
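
As a rough illustration, the baseline context corresponds to one cell of the design space, and a stress-tested cell varies one or more factors from it; the field names below are illustrative, not the benchmark's actual configuration schema.

```python
# Illustrative only: the baseline cell of the design space.
baseline_context = {
    "scenario": "classic",   # steady demand
    "visibility": "local",   # each tier sees only its own state
    "memory": "none",        # no decision history provided
    "prompt": "neutral",     # minimal guidance
}

# A hypothetical stress-tested cell: demand shock, full-chain visibility,
# short memory window.
stress_example = {**baseline_context,
                  "scenario": "shock", "visibility": "full", "memory": "short"}
```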

Under stress, models can enter distinct behavioral regimes. Some maintain stable ordering with graceful degradation. Others cross into failure regimes where costs and demand amplification grow by orders of magnitude. Identifying where these transitions occur, and what drives them, is the central purpose of the stress-tested view.

For threshold definitions that separate normal, failure, and severe regimes, see the Guide.

Outcome metrics

Cost ratio

Cost ratio compares total system cost under a model's decisions to the Sterman human reference baseline. It is the primary outcome measure. Very large cost ratios under stress are common and expected; they typically reflect runaway ordering, persistent backlog, or loss of coordination across tiers. These are not measurement errors. They are signals of system-level breakdown.

Cost ratio is useful for identifying whether a model's behavior produces acceptable outcomes, but it does not explain why. Two models can have similar cost ratios through very different behavioral mechanisms.
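
As a sketch of how the metric is constructed, assuming total system cost is the familiar Beer Game sum of per-round holding and backlog charges across tiers (the actual cost coefficients and Sterman reference values are specified in the technical release files):

```python
def total_system_cost(holding, backlog, holding_cost=0.5, backlog_cost=1.0):
    """Sum holding and backlog charges over tiers and rounds.

    holding, backlog: per-tier lists of per-round unit counts.
    The cost coefficients shown are classic textbook defaults,
    used here only for illustration.
    """
    return sum(holding_cost * h + backlog_cost * b
               for tier_h, tier_b in zip(holding, backlog)
               for h, b in zip(tier_h, tier_b))


def cost_ratio(model_cost, sterman_reference_cost):
    # > 1 means costlier than the human reference; very large values under
    # stress typically reflect runaway ordering or persistent backlog.
    return model_cost / sterman_reference_cost
```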

Bullwhip

Bullwhip measures demand amplification across the supply chain. It captures dynamic instability: higher values indicate that small changes in downstream demand are magnified as they propagate upstream. Bullwhip complements cost ratio by revealing how a model destabilizes the system, not just the final cost outcome.

A model can have a moderate cost ratio but high bullwhip, indicating that ordering behavior is volatile even if costs happen to net out. Conversely, a model can have high cost ratio and low bullwhip if it consistently over-orders or under-orders without amplification.
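
One common formulation of demand amplification, consistent with the description above, is the ratio of order variance to end-customer demand variance; the benchmark's exact per-tier treatment and aggregation are documented in the technical release files, so treat this as a sketch.

```python
import statistics

def bullwhip(orders_by_tier, customer_demand):
    """Mean amplification of order variance relative to end-customer demand.

    orders_by_tier: one list of per-round order quantities per tier.
    Values above 1 indicate amplification as signals move upstream.
    """
    demand_var = statistics.pvariance(customer_demand)
    tier_ratios = [statistics.pvariance(orders) / demand_var
                   for orders in orders_by_tier]
    return sum(tier_ratios) / len(tier_ratios)   # aggregation choice is illustrative
```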

For plain-language definitions and numeric thresholds, see the Guide.

Behavioral descriptors

In addition to outcome metrics, SCM-Arena reports two behavioral descriptors that characterize how models behave, independent of whether outcomes are good or bad.

Entropy

A normalized Shannon entropy (0 to 1) measuring how spread out ordering decisions are across supply chain tiers. Low entropy means tiers order similarly (tight coordination). High entropy means orders are dispersed across tiers (fragmented or role-differentiated behavior). Entropy describes the structure of decisions, not their quality.
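
One plausible construction, consistent with the description above, treats the order quantities placed by the tiers in a round as an empirical distribution and normalizes the Shannon entropy by the log of the number of tiers; the benchmark's exact binning and aggregation may differ.

```python
import math
from collections import Counter

def round_order_entropy(tier_orders):
    """0 when all tiers place the same order (tight coordination),
    1 when every tier orders a different quantity (dispersed)."""
    counts = Counter(tier_orders)
    probs = [c / len(tier_orders) for c in counts.values()]
    h = -sum(p * math.log(p) for p in probs)
    return h / math.log(len(tier_orders))   # normalize to [0, 1]

def episode_entropy(orders_per_round):
    # Average the per-round entropy over the 52-round episode.
    return sum(round_order_entropy(r) for r in orders_per_round) / len(orders_per_round)
```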

Complexity

A standardized score (approximately -1 to 1) reflecting how information-intensive a model's decision behavior is, relative to all models in the benchmark. Negative values indicate simpler, more rigid decision patterns than average. Positive values indicate richer, more adaptive behavior that may also be harder to stabilize. Zero is the benchmark median. Complexity captures behavioral richness, not performance.
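
The page specifies only the properties of this scale (population-relative, roughly -1 to 1, zero at the benchmark median). A transform with those properties is sketched below purely for intuition; the actual definition of the raw measure and its standardization lives in the technical release files.

```python
import statistics

def standardize_complexity(raw_scores):
    """raw_scores: {model_name: raw complexity value}.

    Center on the benchmark median and scale by the largest absolute
    deviation, so results land in roughly [-1, 1] with 0 at the median.
    """
    med = statistics.median(raw_scores.values())
    scale = max(abs(v - med) for v in raw_scores.values()) or 1.0
    return {model: (v - med) / scale for model, v in raw_scores.items()}
```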

For scale explanations and intuitive glosses, see the Guide.

Interpretation pitfalls

Several patterns in SCM-Arena results can be misleading if read too quickly.

  • Low entropy does not guarantee low cost. A model can coordinate tightly on the wrong ordering policy. All tiers may order similarly, but if that policy over-orders or under-orders, costs will be high despite strong coordination.
  • Good baseline performance does not predict stress robustness. Some models perform well under full visibility and memory but collapse when either is restricted. Baseline results alone are insufficient for evaluating operational suitability.
  • Averages can hide regime transitions. A model's aggregate cost ratio may look moderate if it performs well under most conditions but catastrophically fails under a few. The condition-level views in the explorer and comparison tool are designed to surface these regime changes.
  • Bullwhip and cost ratio can diverge. A model with high bullwhip may have moderate costs (volatile but offsetting), and a model with low bullwhip may have high costs (stable but consistently wrong). Both metrics are needed.
  • Complexity is relative, not absolute. A model's complexity score depends on the benchmark population. Adding or removing models in future versions could shift the distribution.

Reading the results

SCM-Arena is designed for comparison, not ranking. There is no single best model across all contexts.

The benchmark supports three modes of exploration:

  • Explore models: browse all models, filter by family or constraint, toggle between baseline and stress-tested views.
  • Compare models: select 2-4 models and see side-by-side sensitivity plots for visibility and memory degradation.
  • Model detail pages: drill into a single model's profile, degradation curves, and condition-level data.

Sorting and filtering are provided to support exploration. Comparisons are diagnostic, not a winner-take-all ranking.

Reproducibility

SCM-Arena v1 is frozen and fully reproducible.

All seeds (global, bootstrap, t-SNE), configuration hashes, and run artifacts are versioned. The website consumes pre-computed static JSON from the analysis pipeline and does not recompute any metrics at runtime.
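
As a purely hypothetical illustration of the kind of check these versioned artifacts support (the file layout and field names below are invented for this sketch, not the benchmark's schema):

```python
import hashlib
import json

def verify_config_hash(artifact_path, expected_hash):
    """Recompute an artifact's configuration hash and compare it to the
    versioned value (artifact structure is illustrative)."""
    with open(artifact_path, encoding="utf-8") as f:
        artifact = json.load(f)
    canonical = json.dumps(artifact["config"], sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest() == expected_hash
```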

Exact definitions, implementation choices, and run artifacts are documented in the technical release files associated with this version. These materials are intentionally separated from the UI to keep the focus here on interpretation rather than configuration.

If you use SCM-Arena in your work, a citation reference will be provided with the public release.