SCM-Arena is a behavioral benchmark where models play the Beer Game (a canonical multi-tier supply chain simulation) under uncertainty, partial information, and stress. Models are evaluated repeatedly across scenarios and constraints to surface stability, breakdown, and coordination failure.
This is not a winner-take-all leaderboard. It is a way to see where behavior holds, where it breaks, and what drives the difference.
SCM-Arena results help inform deployment and oversight decisions for LLM-driven ordering systems: when autonomous operation is viable, when human-in-the-loop is warranted, and which operating conditions require guardrails.
These are behavioral patterns observed across repeated episodes, not single-run anecdotes. See the guide for interpretation rules of thumb.
A canonical supply chain simulation with four tiers (retailer, wholesaler, distributor, factory), information delays, and inventory dynamics. Used in operations research for decades.
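For intuition, here is a minimal Python sketch of those dynamics. The `Tier` class, the two-period delay, and the starting values are illustrative assumptions, not SCM-Arena's implementation:

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Tier:
    inventory: int = 12
    backlog: int = 0
    # Pipeline of orders already placed; each arrives as a shipment after
    # a fixed two-period delay (order and shipping lags collapsed into one).
    pipeline: deque = field(default_factory=lambda: deque([4, 4]))

    def step(self, incoming_demand: int, order_placed: int) -> None:
        self.inventory += self.pipeline.popleft()  # oldest order arrives
        owed = incoming_demand + self.backlog      # new demand plus unfilled past demand
        shipped = min(owed, self.inventory)
        self.inventory -= shipped
        self.backlog = owed - shipped              # unmet demand carries over as backlog
        self.pipeline.append(order_placed)         # new order enters the pipeline

def simulate(policy, demand, n_tiers=4):
    """One episode: each tier's order becomes the next tier's demand."""
    tiers = [Tier() for _ in range(n_tiers)]       # retailer ... factory
    for d in demand:                               # customer demand per period
        for tier in tiers:
            order = policy(tier, d)                # the model's decision point
            tier.step(d, order)
            d = order                              # feeds the upstream tier
```

A pass-through policy like `lambda tier, d: d` simply forwards demand upstream; in SCM-Arena, that decision point is where the LLM places its orders.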
Demand patterns (steady, random, shock, seasonal). Visibility (local, adjacent, full). Memory (none, short, full). Prompt framing.
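A rough sketch of the resulting condition grid. The identifiers, and the framing values in particular, are assumptions rather than SCM-Arena's config schema:

```python
from itertools import product

DEMAND     = ["steady", "random", "shock", "seasonal"]
VISIBILITY = ["local", "adjacent", "full"]
MEMORY     = ["none", "short", "full"]
FRAMING    = ["neutral", "cost_focused"]  # framing values are placeholders

scenarios = [
    {"demand": d, "visibility": v, "memory": m, "framing": f}
    for d, v, m, f in product(DEMAND, VISIBILITY, MEMORY, FRAMING)
]
# 4 x 3 x 3 x 2 = 72 cells; each cell is run as many repeated episodes.
```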
Cost ratio vs. a human baseline. Demand amplification (bullwhip). Coordination (entropy). Behavioral complexity.
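These follow standard definitions. A hedged sketch of how they could be computed from an episode trace (SCM-Arena's exact formulas may differ):

```python
import numpy as np

def cost_ratio(model_cost: float, human_baseline: float) -> float:
    """Total episode cost relative to the human baseline (1.0 = parity)."""
    return model_cost / human_baseline

def bullwhip(orders: np.ndarray, demand: np.ndarray) -> float:
    """Demand amplification: order variance over demand variance (>1 amplifies)."""
    return float(orders.var() / demand.var())

def order_entropy(orders: np.ndarray) -> float:
    """Shannon entropy of the order distribution; higher = less settled behavior."""
    _, counts = np.unique(orders, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())
```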
Filter by model, scenario, or constraint. Compare side by side. See where behavior is stable and where it breaks.
Browse all models and filter by family, scenario, or constraint level. See how each one performs under baseline and stress-tested conditions.
Select 2-4 models and compare them side by side. See how cost, bullwhip, and behavioral style differ, and how each model degrades as visibility or memory is restricted.
What cost ratio and bullwhip mean. Why large values are diagnostic signals, not noise. How to interpret entropy and complexity as behavioral descriptors.
SCM-Arena does not recommend specific models. It surfaces behavioral patterns that inform how you deploy, constrain, and oversee LLM-based ordering systems.
Models that maintain stable cost ratios across visibility and memory conditions may be candidates for autonomous operation in constrained settings. Models that degrade sharply warrant human-in-the-loop oversight or tighter operating boundaries.
Sensitivity plots reveal whether a model fails gradually (costs rise smoothly) or catastrophically (regime transition to runaway ordering). Catastrophic failure profiles warrant sandboxing and monitoring before any deployment.
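One illustrative way to separate the two profiles from a sensitivity curve; the jump threshold and function names are assumptions, not SCM-Arena's method:

```python
import numpy as np

def failure_profile(cost_ratios: np.ndarray, jump_threshold: float = 2.0) -> str:
    """cost_ratios: mean cost ratio at each increasingly restricted condition."""
    steps = np.diff(cost_ratios)
    if (steps > jump_threshold).any():
        return "catastrophic"  # a discontinuous jump: runaway-ordering regime
    return "gradual"           # costs rise smoothly with restriction
```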
A model may be reliable under steady demand but brittle under shocks. SCM-Arena results help identify which operating conditions require fallback rules, safety limits, or human override.
Consistency across repeated episodes (low variance in cost ratio, stable bullwhip) is a stronger signal than a single good run. SCM-Arena evaluates each model across many episodes to distinguish reliable behavior from lucky outcomes.
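A minimal sketch of that aggregation, assuming per-episode cost ratios are available (names are illustrative):

```python
import numpy as np

def summarize(episode_cost_ratios) -> dict:
    """Aggregate repeated episodes; low spread distinguishes reliability from luck."""
    cr = np.asarray(episode_cost_ratios, dtype=float)
    return {
        "mean_cost_ratio": float(cr.mean()),
        "std_cost_ratio": float(cr.std(ddof=1)),  # episode-to-episode spread
        "worst_episode": float(cr.max()),         # tail risk matters for deployment
    }
```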
Traditional LLM benchmarks evaluate isolated responses: a question in, an answer out. They do not capture what happens when a model's decisions feed back into a system that changes over time.
In supply chain settings, decisions interact. One tier's order becomes another tier's demand. Information is delayed. Mistakes compound. A model that answers well in isolation can still destabilize a system.
SCM-Arena is designed to surface these dynamics. To our knowledge, it is the first benchmark focused on repeated, behavioral evaluation of LLM decision policies in a dynamic supply chain testbed.
Comparisons are diagnostic, not a winner-take-all ranking.