A 3-minute orientation before you start exploring. Covers the key metrics, what the contexts mean, and how to interpret what you see.
Cost ratio compares total system cost under a model's ordering decisions to a human reference baseline (Sterman). A value of 1.0 matches the baseline; values above 1 mean the model incurred more cost than human players. Values above 10 suggest coordination is breaking down; above 100, the system has likely entered a failure regime.
Threshold guidance: >10 = failure likely, >100 = severe breakdown.
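To make the bands concrete, here is a minimal sketch in Python. The function names are illustrative, not SCM-Arena's API; the thresholds are the ones above.

```python
# Illustrative sketch, not SCM-Arena's API: classify a cost ratio
# against the interpretation bands described above.

def cost_ratio(model_total_cost: float, baseline_total_cost: float) -> float:
    """Ratio of the model's total system cost to the human (Sterman) baseline."""
    return model_total_cost / baseline_total_cost

def cost_regime(ratio: float) -> str:
    """Map a cost ratio onto the bands used in this guide."""
    if ratio > 100:
        return "severe breakdown"
    if ratio > 10:
        return "failure likely"
    return "within normal range"

print(cost_regime(cost_ratio(model_total_cost=5_400.0, baseline_total_cost=420.0)))
# -> "failure likely" (ratio is about 12.9)
```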
Bullwhip measures demand amplification across supply chain tiers. Small changes in customer demand get magnified as they move upstream. Higher bullwhip means more instability in ordering behavior.
Bullwhip >3 = elevated instability, >10 = unstable.
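Bullwhip is conventionally computed as the variance of the orders a tier places divided by the variance of the customer demand it ultimately serves. Assuming SCM-Arena follows that convention, a minimal sketch with made-up numbers:

```python
# Hedged sketch: the standard variance-ratio definition of bullwhip.
# That SCM-Arena uses exactly this formula is an assumption here.
from statistics import pvariance

def bullwhip(orders: list[float], demand: list[float]) -> float:
    """Variance of upstream orders relative to variance of customer demand.
    Values near 1 mean orders track demand; larger values mean amplification."""
    return pvariance(orders) / pvariance(demand)

demand = [4, 4, 8, 8, 8, 8, 8, 8]              # the classic Beer Game demand step
factory_orders = [4, 4, 10, 16, 12, 2, 0, 8]   # hypothetical amplified upstream orders

print(round(bullwhip(factory_orders, demand), 2))
# -> 8.67: above the "elevated instability" threshold of 3
```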
Baseline matches the canonical Beer Game: steady demand, local visibility (each tier sees only its own state), no decision memory, and neutral prompts. This is the constrained, informationally limited setting that produces the bullwhip effect in human play. Stress-tested contexts then vary demand patterns, visibility, and memory to probe how behavior changes across the design space.
Stress tests are not edge cases. They reveal how models respond when given more information, and where they break under volatile demand.
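As a hypothetical illustration of this design space (the field names and values are invented for clarity, not SCM-Arena's actual configuration schema):

```python
# Invented schema for illustration only; not SCM-Arena's real config.
from dataclasses import dataclass

@dataclass(frozen=True)
class Context:
    demand: str      # "steady" in the baseline; e.g. "step" or "seasonal" under stress
    visibility: str  # "local" (own tier only) in the baseline vs. "full" (all tiers)
    memory: str      # "none" in the baseline; e.g. "short" decision history under stress
    prompt: str      # "neutral" in the baseline

BASELINE = Context(demand="steady", visibility="local", memory="none", prompt="neutral")
STRESS_EXAMPLE = Context(demand="seasonal", visibility="full", memory="short", prompt="neutral")
```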
Entropy reflects how spread out ordering decisions are across tiers. It ranges from 0 to 1 because it is a normalized Shannon entropy: 0 means all tiers order identically (perfect coordination), 1 means orders are maximally dispersed (no coordination).

Complexity reflects how information-intensive a model's behavior is, measured as a standardized score across all models in the benchmark. It typically ranges from roughly -1 to 1: negative values mean simpler, more rigid responses than average; positive values mean richer but harder-to-stabilize behavior than average; zero is the benchmark median.
Both describe behavioral style, not whether outcomes are good or bad. A model can have low entropy (coordinated) and still fail on cost if it coordinates on the wrong policy.
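A worked sketch of both style metrics, under stated assumptions: entropy is taken as Shannon entropy over the distribution of per-tier order quantities, normalized by the log of the number of tiers, and complexity as a median-centered, spread-scaled score. Both are plausible readings of the definitions above, not confirmed formulas.

```python
# Assumed constructions of the two style metrics; see the lead-in above.
import math
from collections import Counter
from statistics import median

def order_entropy(tier_orders: list[float]) -> float:
    """Normalized Shannon entropy over per-tier order quantities.
    0 when every tier orders the same amount; 1 when all amounts differ."""
    n = len(tier_orders)
    counts = Counter(tier_orders)
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    return h / math.log(n)

def complexity_score(x: float, benchmark_values: list[float]) -> float:
    """Median-centered, spread-scaled score: 0 at the benchmark median,
    roughly -1 to 1 at the extremes. An assumed construction."""
    med = median(benchmark_values)
    spread = max(abs(v - med) for v in benchmark_values) or 1.0
    return (x - med) / spread

print(order_entropy([8, 8, 8, 8]))     # 0.0 -> perfect coordination
print(order_entropy([2, 7, 13, 30]))   # 1.0 -> maximally dispersed
print(round(complexity_score(0.8, [0.2, 0.5, 0.8, 1.4]), 2))  # 0.2 -> slightly above median
```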
Large cost ratios and bullwhip values are diagnostic signals, not noise. Under stress-tested conditions, SCM-Arena is designed to push models into regimes where coordination can fail. When this happens, costs and demand amplification can grow rapidly. These outcomes help distinguish models that degrade gradually from those that collapse under pressure.
There is no single best model across all contexts. SCM-Arena is designed for exploring how behavior changes across conditions, where stability gives way to breakdown, and how different models trade off robustness and responsiveness. Sorting and filtering support exploration. Comparisons are diagnostic, not a winner-take-all ranking.
SCM-Arena evaluates each model across many repeated episodes per condition, not single runs. When interpreting results, look for consistency: low variance in cost ratio across episodes, stable bullwhip across scenarios, and gradual (not sudden) degradation as conditions tighten. These patterns indicate systematic behavior, not lucky or unlucky draws. The sensitivity plots on model detail pages and the comparison tool show these degradation paths directly. Failure frequency across conditions (how often a model enters a failure regime, not just its average) is a key robustness signal.
These are diagnostic signals of systematic behavior under stress, not cherry-picked illustrative charts.
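A minimal sketch of these robustness checks, assuming you have per-episode cost ratios for one model under one condition (the numbers are made up; the failure threshold of 10 follows the guidance earlier on this page):

```python
# Sketch of the robustness checks described above; episode data is invented.
from statistics import mean, stdev

def episode_summary(cost_ratios: list[float], failure_threshold: float = 10.0) -> dict:
    """Summarize consistency and failure frequency across repeated episodes."""
    return {
        "mean_cost_ratio": mean(cost_ratios),
        "stdev_cost_ratio": stdev(cost_ratios),   # low = systematic behavior
        "failure_frequency": sum(r > failure_threshold for r in cost_ratios)
                             / len(cost_ratios),  # how often, not just how bad
    }

# An average alone can hide collapse episodes:
steady  = [1.8, 2.1, 2.4, 2.0, 2.2, 1.9]       # degrades gradually
erratic = [1.1, 1.0, 1.2, 1.1, 14.0, 120.0]    # collapses in some episodes

print(episode_summary(steady)["failure_frequency"])   # 0.0
print(episode_summary(erratic)["failure_frequency"])  # ~0.33
```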
Practical rules of thumb for interpreting SCM-Arena results. These do not recommend specific models; they describe what behavioral patterns suggest about deployment and oversight.
If a model's costs stay manageable with full visibility but blow up when tiers see only their own state, the model is brittle under information constraints. Treat it as unsuitable for autonomous operation in low-visibility settings, and enforce human-in-the-loop review of ordering decisions whenever upstream or downstream information is limited.
If bullwhip is high while the cost ratio looks acceptable, the model amplifies volatility even when final costs happen to net out. This pattern can destabilize upstream partners. Human oversight is warranted during demand transitions, and order-smoothing rules may be needed as guardrails.
If a model fails in the baseline context, treat that as a strong disqualifying signal. The baseline is the most studied, most constrained setting. Failure here indicates fundamental issues with the model's ordering behavior that are unlikely to be resolved by giving it more information or memory.
If entropy is low but the cost ratio is still high, the model coordinates tightly across tiers, but on a poor policy. All tiers order similarly, yet the system still incurs high costs, which suggests the model has learned a consistent but wrong heuristic. Intervention should target the policy itself, not the coordination mechanism.
If a model fails only under one specific condition, it has a narrow failure mode. Identify the triggering condition (e.g., seasonal demand, memory restriction) and either sandbox the model away from that condition or implement scenario-specific fallback rules.
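As a rough illustration, the rules of thumb above could be triaged mechanically. This is a hypothetical decision aid, not part of SCM-Arena: the cost-ratio and bullwhip thresholds follow this page, while the entropy cutoff and the visibility comparison are invented for the sketch.

```python
# Hypothetical triage of the patterns above; not part of SCM-Arena.
def triage(baseline_ratio: float, full_vis_ratio: float,
           bullwhip_val: float, entropy: float) -> str:
    """Map benchmark readings onto the rules of thumb above. The 0.2 entropy
    cutoff and the order-of-magnitude visibility comparison are illustrative."""
    if entropy < 0.2 and baseline_ratio > 10:
        return "coordinated on a poor policy: target the policy, not coordination"
    if baseline_ratio > 10:
        return "fails the canonical baseline: strong disqualifying signal"
    if baseline_ratio > 10 * full_vis_ratio:
        return "brittle under information constraints: human-in-the-loop review"
    if bullwhip_val > 3:
        return "volatility amplifier: oversee demand transitions, add smoothing"
    return "no dominant pattern: check for narrow, scenario-specific failures"

print(triage(baseline_ratio=8.0, full_vis_ratio=0.6, bullwhip_val=2.1, entropy=0.6))
# -> brittle under information constraints
```

In practice these patterns overlap; the sensitivity plots on the model detail pages give a fuller picture than any point threshold.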
SCM-Arena results help inform three categories of operational decisions: whether a model can be trusted to order autonomously, how much human oversight its decisions require, and what guardrails or fallback rules should surround it.
For complete methodology details, metric definitions, and reproducibility notes, see the methodology page.