AI Can Run the Catalog. A New Test Says Not the Budget.
A 132-month benchmark that put language models in a CFO's chair found that only 15.4% of trials survived, with agents failing at the timing and sizing of capital rather than the work around it. That is the exact decision retailers are preparing to hand their open-to-buy agents.
Neritus Vale
The part that merchandising AI has learned to do well is the catalog; the part a new benchmark says it cannot yet do is the budget behind it. In March, researchers released EnterpriseArena, a simulator that seats a language-model agent in the chief financial officer’s chair of a lending firm and runs it for 132 months of shifting macroeconomic weather. Across 23 models and four agent frameworks, only 15.4% of trials survived the full run; the rest steered the company into insolvency. The decision it scores is the open-to-buy call in another uniform: how much capital to commit, and when. Retailers are wiring agents toward exactly that decision now.
The assortment layer is genuinely strong, and it is strong for a reason. Toolio generates receipt plans that hold to inventory targets; Blue Yonder runs demand forecasting and automates pricing across a product’s life. These are pattern-completion problems with dense, fast feedback: last week’s sell-through grades last week’s plan, and the model corrects. Choosing what to stock, and how deep, is a bounded, well-lit task on which machine learning has earned its place on the floor.
Open-to-buy is not a forecast; it is a capital-allocation call made under uncertainty and held open across a season. A buyer decides how much of a finite budget to commit now, how much to keep in reserve for in-season chase, and when to release it. That is sizing and timing, the two motions the benchmark scores, and neither is pattern completion. The agentic-retail pitch for 2026 treats the budget as one more workflow to hand over. Vendors now describe agents that autonomously manage pricing, inventory and promotions, and they invoke a Gartner estimate, relayed by Airia, that 75% of organisations plan to deploy multi-agent frameworks within eighteen months. The motion being delegated is the one the benchmark just measured.
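For readers who want the arithmetic behind that call, here is a minimal sketch in Python. It assumes the common textbook form of open-to-buy (planned sales plus planned markdowns plus planned closing stock, less opening stock and stock already on order). The `MonthPlan` fields, the `reserve_share` knob and all the figures are hypothetical illustrations, not numbers from either benchmark or from any vendor's product.

```python
from dataclasses import dataclass


@dataclass
class MonthPlan:
    planned_sales: float       # retail value expected to sell this month
    planned_markdowns: float   # retail value expected to be lost to markdowns
    planned_end_stock: float   # inventory the plan wants on hand at month end
    beginning_stock: float     # inventory on hand at month start
    on_order: float            # receipts already committed for this month


def open_to_buy(plan: MonthPlan) -> float:
    """Uncommitted budget under the textbook open-to-buy formula (an assumption)."""
    needed = plan.planned_sales + plan.planned_markdowns + plan.planned_end_stock
    available = plan.beginning_stock + plan.on_order
    # A negative result means the plan is already overbought; report zero to spend.
    return max(needed - available, 0.0)


def split_commitment(otb: float, reserve_share: float) -> tuple[float, float]:
    """Split open-to-buy into (commit_now, hold_in_reserve).

    reserve_share is a hypothetical policy knob for in-season chase.
    """
    if not 0.0 <= reserve_share <= 1.0:
        raise ValueError("reserve_share must be between 0 and 1")
    reserve = otb * reserve_share
    return otb - reserve, reserve


# Illustrative figures only.
plan = MonthPlan(
    planned_sales=500_000,
    planned_markdowns=40_000,
    planned_end_stock=300_000,
    beginning_stock=350_000,
    on_order=200_000,
)
otb = open_to_buy(plan)
commit_now, reserve = split_commitment(otb, reserve_share=0.25)
print(f"open-to-buy: {otb:,.0f}; commit now: {commit_now:,.0f}; reserve: {reserve:,.0f}")
```

The formula itself is trivial to compute. The judgment sits in the inputs and in `reserve_share`: how much to hold back, and when to let it go, is the part no line of this sketch decides.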
The benchmark is exact about how the agents fail, and that precision is what should unsettle a planning team. They do not lose to a hard market; they lose to their own choices. Failed runs raise money too late, moving to fundraise only after cash has peaked and begun to fall, and they ask for too little when they finally move. The environment handed the bankrupt agents the same opportunities; they simply allocated worse. The failures, the authors find, cascade across observation, action timing, and capital sizing. Long-horizon resource allocation under uncertainty is named as a distinct capability gap, not a feature of the market in which the agent was placed.
Neither a bigger model nor a better wrapper reliably rescues this. Survival rates shifted sharply with the agent framework used — a configuration choice no buyer would ever see. Larger models did not reliably outperform smaller ones. A capability this sensitive to its scaffolding is not one to which a retailer should hand a season’s budget.
A second benchmark, from a different team, names the cognitive failure beneath the financial one. BAGEN, posted in May, asked whether agents can tell when they are about to run out of budget, and found that they cannot. Across five frontier models and four settings, the agents grew consistently over-optimistic as their runway drained, continuing to spend on tasks unlikely to succeed rather than alerting early. Task skill barely tracked budget awareness, the two correlating at roughly 0.35. A system that cannot sense itself running short will not pull back before it overcommits, which is the precise moment an open-to-buy plan needs a hand on the brake.
The strongest objection is that none of this is about retail, and that the failure is an engineering artifact. EnterpriseArena is a lending firm, not a buying office, and the sharp variation by framework suggests a wrapper problem the next scaffold will solve. For that objection to win, two things would have to be true: that capital-sizing failure is a packaging defect and not a reasoning limit, and that retail’s feedback loop is dense enough to drill it out. Neither is: the failure landed in sizing and timing, the open-to-buy motions themselves, not the bookkeeping around them. Even the best-performing scaffold left survival far short of robust, and below the human line. Retail is the harder case, not the easier one, because a season’s buy is graded once, months after the capital is committed, long after the over-optimism has done its damage.
The line this benchmark draws is simple: automate the catalog, keep the capital. The vendors, to their credit, mostly say as much: Toolio sells AI that “helps you build merchandise plans,” with the planner still choosing. Anthropologie’s president has gone further, placing human judgment at the decision node rather than the escape hatch. The risk is the quiet promotion of that assistant into a decider, which is what the agentic pitch for next year is selling. Hand a model the assortment and a retailer gains a tireless analyst; hand it the budget and the retailer has staked a season’s capital on the one task two independent benchmarks agree it does worst. The model has earned its chair on the floor; the question is which chair.