SCM-Arena v1
A behavioral benchmark for LLM decision-making in supply chains

See how LLMs behave when they run a supply chain, not just what they know.

SCM-Arena is a behavioral benchmark where models play the Beer Game (a canonical multi-tier supply chain simulation) under uncertainty, partial information, and stress. Each model is evaluated repeatedly across scenarios and constraints to surface stability, breakdown, and coordination failure.

This is not a winner-take-all leaderboard. It is a way to see where behavior holds, where it breaks, and what drives the difference.

Models tested
Open-weight and frontier LLMs, 144 conditions each
Episodes run
5 replications per cell, 52 rounds per episode
Experimental conditions
Fully crossed: scenario, visibility, memory, prompt, mode
What you can learn here

SCM-Arena results help inform deployment and oversight decisions for LLM-driven ordering systems: when autonomous operation is viable, when human-in-the-loop is warranted, and which operating conditions require guardrails.

  • Rapid cost ratio growth as visibility decreases suggests the model is brittle under information constraints. Avoid autonomous deployment in low-visibility settings; enforce human review.
  • High bullwhip under demand shocks indicates sensitivity to volatility. Models showing this pattern may amplify disruptions rather than absorb them; human oversight is warranted during demand transitions.
  • Stable cost ratio across conditions but high entropy means the model achieves acceptable costs, but through fragmented, uncoordinated ordering. This pattern may mask risks that surface at longer horizons or larger scale.
  • Failure in baseline (canonical Beer Game) conditions is a strong signal against deployment. If a model cannot maintain stability under the standard, well-studied setting, it is unlikely to perform reliably under real-world constraints.

These are behavioral patterns observed across repeated episodes, not single-run anecdotes. See the guide for interpretation rules of thumb.
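To make these rules of thumb concrete, here is a minimal sketch of how they could be checked against exported episode results. The file name, column names (model, visibility, cost_ratio, entropy), and thresholds are illustrative assumptions, not SCM-Arena's actual export schema or policy.

```python
import pandas as pd

# Hypothetical per-episode export; file and column names are assumptions,
# not SCM-Arena's actual schema.
results = pd.read_csv("episodes.csv")

# Rule of thumb: cost ratio that grows sharply as visibility shrinks
# signals brittleness under information constraints.
by_visibility = (
    results.groupby(["model", "visibility"])["cost_ratio"]
    .mean()
    .unstack("visibility")
)
brittle = by_visibility["local"] > 1.5 * by_visibility["full"]  # 1.5x is an arbitrary threshold

# Rule of thumb: acceptable average cost but high order entropy can mask
# fragmented, uncoordinated ordering.
per_model = results.groupby("model")[["cost_ratio", "entropy"]].mean()
masked_risk = (per_model["cost_ratio"] < 1.2) & (per_model["entropy"] > per_model["entropy"].median())

print("Brittle under low visibility:", list(brittle[brittle].index))
print("Acceptable cost, fragmented ordering:", list(masked_risk[masked_risk].index))
```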

How it works
01
Models play the Beer Game

A canonical supply chain simulation with four tiers, information delays, and inventory dynamics. Used in operations research for decades.
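For orientation, here is a stripped-down sketch of one tier's bookkeeping in a single round. The starting inventory, the two-round shipping delay, the assumption that upstream always ships in full, and the cost coefficients are simplifications drawn from the classic classroom version, not SCM-Arena's implementation.

```python
from dataclasses import dataclass, field
from collections import deque

@dataclass
class Tier:
    """One tier (retailer, wholesaler, distributor, or factory) in a simplified Beer Game."""
    inventory: int = 12
    backlog: int = 0
    # Two-round shipping pipeline; assumes the upstream tier always ships in full.
    pipeline: deque = field(default_factory=lambda: deque([4, 4]))
    total_cost: float = 0.0

    def step(self, incoming_order: int, order_placed: int) -> int:
        """Advance one round; return the quantity shipped downstream."""
        self.inventory += self.pipeline.popleft()   # delayed shipment arrives
        self.pipeline.append(order_placed)          # our new order enters the pipeline
        demand = incoming_order + self.backlog      # current order plus unfilled backlog
        shipped = min(self.inventory, demand)
        self.inventory -= shipped
        self.backlog = demand - shipped
        # Classic classroom costs: $0.50/unit holding, $1.00/unit backlog per round.
        self.total_cost += 0.5 * self.inventory + 1.0 * self.backlog
        return shipped
```

In the full game, four such tiers are chained together, so one tier's order becomes the next tier's incoming demand after a delay, which is what lets small decision errors amplify upstream.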

02
We vary the conditions

Demand patterns (steady, random, shock, seasonal). Visibility (local, adjacent, full). Memory (none, short, full). Prompt framing. Interaction mode.
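The crossing itself is simple to picture. In the sketch below, the scenario, visibility, and memory levels match the description above; the prompt-framing and mode levels are placeholder assumptions chosen so the product matches the stated 144 conditions.

```python
from itertools import product

scenarios  = ["steady", "random", "shock", "seasonal"]
visibility = ["local", "adjacent", "full"]
memory     = ["none", "short", "full"]
prompts    = ["framing_a", "framing_b"]   # placeholder levels, assumed
modes      = ["mode_a", "mode_b"]         # placeholder levels, assumed

conditions = list(product(scenarios, visibility, memory, prompts, modes))
print(len(conditions))  # 4 * 3 * 3 * 2 * 2 = 144
```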

03
We measure what happens

Cost ratio vs. a human baseline. Demand amplification (bullwhip). Coordination (entropy). Behavioral complexity.
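As a rough guide to what these quantities measure (not necessarily the exact formulas SCM-Arena reports): bullwhip is conventionally the variance of orders placed relative to the variance of demand faced, cost ratio is total cost divided by the human baseline's cost, and a simple coordination proxy is the Shannon entropy of the order-quantity distribution.

```python
import math
from collections import Counter

def variance(xs: list[float]) -> float:
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def bullwhip(orders: list[float], demand: list[float]) -> float:
    """Demand amplification: variance of orders placed over variance of demand faced."""
    return variance(orders) / variance(demand)

def cost_ratio(model_cost: float, human_baseline_cost: float) -> float:
    """Total cost relative to the human baseline; 1.0 means parity, higher is worse."""
    return model_cost / human_baseline_cost

def order_entropy(orders: list[int]) -> float:
    """Shannon entropy (bits) of the order-quantity distribution; higher suggests more fragmented ordering."""
    counts = Counter(orders)
    n = len(orders)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```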

04
You explore the results

Filter by model, scenario, or constraint. Compare side by side. See where behavior is stable and where it breaks.

Start here
What decisions this informs

SCM-Arena does not recommend specific models. It surfaces behavioral patterns that inform how you deploy, constrain, and oversee LLM-based ordering systems.

Autonomous vs. supervised

Models that maintain stable cost ratios across visibility and memory conditions may be candidates for autonomous operation in constrained settings. Models that degrade sharply need human-in-the-loop oversight or tighter operating boundaries.

Failure mode awareness

Sensitivity plots reveal whether a model fails gradually (costs rise smoothly) or catastrophically (regime transition to runaway ordering). Catastrophic failure profiles warrant sandboxing and monitoring before any deployment.

Condition-specific guardrails

A model may be reliable under steady demand but brittle under shocks. SCM-Arena results help identify which operating conditions require fallback rules, safety limits, or human override.
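One illustration of what such a guardrail can look like in practice (an example pattern, not something SCM-Arena prescribes): cap the model's order at a multiple of recent demand and route anything above the cap to human review.

```python
def guarded_order(llm_order: int, recent_demand: list[int], cap_multiplier: float = 2.0) -> tuple[int, bool]:
    """Clamp the model's order to a multiple of recent average demand.

    Returns the order to place and a flag indicating the cap was hit
    (a natural trigger for human override). The multiplier is arbitrary.
    """
    avg_demand = sum(recent_demand) / len(recent_demand)
    cap = int(cap_multiplier * avg_demand)
    clipped = max(0, min(llm_order, cap))
    return clipped, clipped != llm_order
```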

Robustness signals

Consistency across repeated episodes (low variance in cost ratio, stable bullwhip) is a stronger signal than a single good run. SCM-Arena evaluates each model across many episodes to distinguish reliable behavior from lucky outcomes.
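A minimal sketch of this robustness check over exported results, again under an assumed schema: the five replications per cell come from the design described above, but the column names and thresholds do not.

```python
import pandas as pd

# Hypothetical per-episode export; "cell" identifies one experimental condition.
results = pd.read_csv("episodes.csv")

robustness = (
    results.groupby(["model", "cell"])["cost_ratio"]
    .agg(mean="mean", std="std", runs="count")
    .reset_index()
)

# Flag cells where a good average hides high run-to-run variance;
# thresholds are arbitrary illustrations, not benchmark policy.
lucky_looking = robustness[(robustness["mean"] < 1.2) & (robustness["std"] > 0.5)]
print(lucky_looking[["model", "cell", "mean", "std"]])
```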

Why a behavioral benchmark

Traditional LLM benchmarks evaluate isolated responses: a question in, an answer out. They do not capture what happens when a model's decisions feed back into a system that changes over time.

In supply chain settings, decisions interact. One tier's order becomes another tier's demand. Information is delayed. Mistakes compound. A model that answers well in isolation can still destabilize a system.

SCM-Arena is designed to surface these dynamics. To our knowledge, it is the first benchmark focused on repeated, behavioral evaluation of LLM decision policies in a dynamic supply chain testbed.

Comparisons are diagnostic, not a winner-take-all ranking.