The interpretive layer of SCM-Arena: why it is structured this way, what the metrics capture, and how to make sense of the results.
New here? Start with the 3-minute guide for quick metric definitions, thresholds, and practitioner interpretation rules. This page is the deeper technical reference and design rationale.
SCM-Arena measures how models behave over time when their decisions interact inside a supply chain system.
Unlike knowledge or question-answering benchmarks, SCM-Arena does not evaluate whether a model produces a correct response in isolation. It evaluates how a model's decisions propagate through a dynamic system, affecting inventory, backlog, and downstream behavior across multiple roles and time periods.
This distinction matters because supply chains are governed by feedback, delay, and coordination. Small differences in ordering behavior can compound into stability, oscillation, or breakdown. SCM-Arena is designed to surface these dynamics explicitly.
A model can score highly on knowledge benchmarks and still perform poorly when its decisions interact with delayed information, partial visibility, other decision-makers, and uncertain demand over time. In operational settings, these interactions determine outcomes.
SCM-Arena is built on variants of the Beer Game, a canonical multi-tier supply chain simulation used in operations research for decades to study coordination, information delays, and the bullwhip effect. The Beer Game provides a four-tier supply chain, delayed feedback, inventory and backlog dynamics, and a well-understood human baseline (Sterman).
Each model is evaluated across a fully crossed experimental design. Conditions vary along five factors: demand scenario, visibility level, memory strategy, prompt type, and game mode.
This produces 144 unique experimental conditions per model (4 scenarios × 3 visibility levels × 3 memory strategies × 2 prompt types × 2 game modes). Each condition is replicated 5 times with fixed seeds, and each episode runs for 52 rounds of ordering decisions. Across 75 models, this yields over 53,000 total episodes and 10,800 model-condition cells.
This design enables within-model sensitivity analysis (how does behavior change as conditions tighten?) and across-model comparison (which models degrade more under constraint?). Fixed seeds ensure full reproducibility.
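For readers who want to see the arithmetic, the sketch below enumerates the crossed design. The factor counts and episode counts follow the description above; the level labels and helper names are illustrative placeholders, not SCM-Arena's internal identifiers.

```python
# Illustrative enumeration of the crossed design. Factor counts match the text
# above; the level labels are placeholders, not SCM-Arena's internal identifiers.
from itertools import product

SCENARIOS   = ["steady", "random", "shock", "seasonal"]   # 4 demand scenarios (labels assumed)
VISIBILITY  = ["local", "adjacent", "full"]                # 3 visibility levels (labels assumed)
MEMORY      = ["none", "short", "full"]                    # 3 memory strategies (labels assumed)
PROMPTS     = ["neutral", "structured"]                    # 2 prompt types (labels assumed)
GAME_MODES  = ["classic", "modern"]                        # 2 game modes (labels assumed)

REPLICATIONS = 5    # fixed-seed replications per condition
ROUNDS = 52         # ordering decisions per episode

conditions = list(product(SCENARIOS, VISIBILITY, MEMORY, PROMPTS, GAME_MODES))
assert len(conditions) == 144                              # 4 x 3 x 3 x 2 x 2

episodes_per_model = len(conditions) * REPLICATIONS        # 720 episodes per model
print(f"{len(conditions)} conditions, {episodes_per_model} episodes per model")
```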
SCM-Arena reports results in two complementary contexts.
The baseline context reflects the canonical Beer Game: steady demand, local visibility (each tier sees only its own state), no decision memory, and neutral prompts. This matches how the game has been played in research settings for decades: limited information, delayed feedback, no shared state. It answers: how does this model behave under the standard conditions that produce the bullwhip effect in human play?
Stress-tested contexts vary demand patterns, expand or restrict visibility, and alter memory access to probe how behavior changes across the design space. They answer: what improves when models get more information, and what breaks when scenarios become more volatile?
Under stress, models can enter distinct behavioral regimes. Some maintain stable ordering with graceful degradation. Others cross into failure regimes where costs and demand amplification grow by orders of magnitude. Identifying where these transitions occur, and what drives them, is the central purpose of the stress-tested view.
For threshold definitions that separate normal, failure, and severe regimes, see the Guide.
Cost ratio compares total system cost under a model's decisions to the Sterman human reference baseline. It is the primary outcome measure. Very large cost ratios under stress are common and expected; they typically reflect runaway ordering, persistent backlog, or loss of coordination across tiers. These are not measurement errors. They are signals of system-level breakdown.
Cost ratio is useful for identifying whether a model's behavior produces acceptable outcomes, but it does not explain why. Two models can have similar cost ratios through very different behavioral mechanisms.
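As a concrete illustration, the sketch below computes a cost ratio from per-round inventory and backlog, assuming the classic Beer Game cost structure (a holding cost per unit of inventory and a penalty cost per unit of backlog, summed across all four tiers and all rounds). The parameter values shown are placeholders; the configured costs and the Sterman reference value are documented in the release files.

```python
# Minimal sketch of the cost-ratio computation, assuming classic Beer Game
# per-round costs. Parameter values are placeholders, not the benchmark's
# configured values.
def total_system_cost(inventory, backlog, holding_cost=0.5, backlog_cost=1.0):
    """inventory[t][tier] and backlog[t][tier] are end-of-round unit counts."""
    return sum(
        holding_cost * inv + backlog_cost * bl
        for round_inv, round_bl in zip(inventory, backlog)
        for inv, bl in zip(round_inv, round_bl)
    )

def cost_ratio(model_cost, sterman_reference_cost):
    """Values above 1.0 mean the model's decisions were costlier than the human baseline."""
    return model_cost / sterman_reference_cost
```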
Bullwhip measures demand amplification across the supply chain. It captures dynamic instability: higher values indicate that small changes in downstream demand are magnified as they propagate upstream. Bullwhip complements cost ratio by revealing how a model destabilizes the system, not just the final cost outcome.
A model can have a moderate cost ratio but high bullwhip, indicating that ordering behavior is volatile even if costs happen to net out. Conversely, a model can have a high cost ratio and low bullwhip if it consistently over-orders or under-orders without amplification.
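The sketch below shows one common operationalization of bullwhip, offered here as an assumption: the ratio of the variance of orders a tier places upstream to the variance of the demand it receives, so values above 1 indicate amplification. The exact formula SCM-Arena reports is defined in the Guide and release files.

```python
# One common bullwhip operationalization (variance amplification at a tier),
# shown as an assumption rather than SCM-Arena's exact formula.
from statistics import pvariance

def bullwhip_ratio(orders_placed, demand_received):
    """Both arguments are per-round series for a single tier."""
    demand_var = pvariance(demand_received)
    if demand_var == 0:
        return float("inf") if pvariance(orders_placed) > 0 else 1.0
    return pvariance(orders_placed) / demand_var
```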
For plain-language definitions and numeric thresholds, see the Guide.
In addition to outcome metrics, SCM-Arena reports two behavioral descriptors that characterize how models behave, independent of whether outcomes are good or bad.
The first descriptor is a normalized Shannon entropy (0 to 1) measuring how spread out ordering decisions are across supply chain tiers. Low entropy means tiers order similarly (tight coordination). High entropy means orders are dispersed across tiers (fragmented or role-differentiated behavior). Entropy describes the structure of decisions, not their quality.
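One plausible construction, shown purely as an assumption rather than SCM-Arena's actual implementation, is to bin the ordering decisions observed across tiers, take the Shannon entropy of the resulting histogram, and normalize so the value lies in [0, 1]: concentrated (similar) orders yield low entropy, dispersed orders yield high entropy.

```python
# Assumed sketch of a normalized Shannon entropy over ordering decisions;
# the benchmark's exact construction is documented in the release files.
import math
from collections import Counter

def normalized_order_entropy(orders, bin_width=2):
    """orders: flat list of order quantities across tiers and rounds."""
    if len(orders) < 2:
        return 0.0
    bins = Counter(int(q // bin_width) for q in orders)
    total = sum(bins.values())
    probs = [count / total for count in bins.values()]
    entropy = -sum(p * math.log(p) for p in probs)
    max_entropy = math.log(len(orders))   # upper bound: every decision in its own bin
    return entropy / max_entropy
```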
The second descriptor is a standardized score (approximately -1 to 1) reflecting how information-intensive a model's decision behavior is, relative to all models in the benchmark. Negative values indicate simpler, more rigid decision patterns than average. Positive values indicate richer, more adaptive behavior that may also be harder to stabilize. Zero is the benchmark median. Complexity captures behavioral richness, not performance.
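The sketch below illustrates only the scaling convention just described (benchmark median mapped to zero, values falling roughly between -1 and 1). The underlying raw complexity measure and SCM-Arena's actual standardization procedure are not reproduced here.

```python
# Illustration of the scaling convention only; not the benchmark's actual
# complexity computation.
from statistics import median

def standardize_complexity(raw_scores):
    """raw_scores: mapping of model name -> raw complexity value."""
    values = list(raw_scores.values())
    center = median(values)
    spread = max(abs(v - center) for v in values) or 1.0
    return {model: (value - center) / spread for model, value in raw_scores.items()}
```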
For scale explanations and intuitive glosses, see the Guide.
Several patterns in SCM-Arena results can be misleading if read too quickly.
SCM-Arena is designed for comparison, not ranking. There is no single best model across all contexts.
The benchmark supports three modes of exploration:
Sorting and filtering are provided to support exploration. Comparisons are diagnostic, not a winner-take-all ranking.
SCM-Arena v1 is frozen and fully reproducible.
All seeds (global, bootstrap, t-SNE), configuration hashes, and run artifacts are versioned. The website consumes pre-computed static JSON from the analysis pipeline and does not recompute any metrics at runtime.
Exact definitions, implementation choices, and run artifacts are documented in the technical release files associated with this version. These materials are intentionally separated from the UI to keep the focus here on interpretation rather than configuration.
If you use SCM-Arena in your work, a citation reference will be provided with the public release.