Technology Evidence Brief (Crabstone)
A waistcoated crab in a monocle inspects a puppeteer's control bar whose strings work a small clockwork shopping-agent pushing a cart of dresses.

The Shopping Agent Answers to a Harness No Brand Has Seen

New agent research finds performance is governed by the harness mediating model and environment, not the model itself. Fashion's agentic-commerce bets therefore rest on a reliability layer brands neither see nor control.

Sir John Crabstone

Fashion has spent a year asking which model will run its shopping agents. It was the wrong question. New research argues an agent’s performance is governed less by the model than by the harness wrapped around it: the layer that decides what the model sees of the catalogue and what it makes of its own mistakes. The cleverest model on the market is doing less than half the work.

The clearest demonstration this year changed nothing about the model at all. LangChain kept a coding agent’s weights fixed and rebuilt only the scaffolding around it. The agent climbed from 52.8% to 66.5% on Terminal-Bench 2.0, from outside the top thirty to the top five. The fixes were unglamorous: a checklist forcing the agent to verify its own work, a guard that interrupts it when it loops. Same mind; better manners.

The claim now carries a name. A position paper posted to arXiv in May calls it the Binding Constraint Thesis: with frontier models roughly matched on long-horizon tasks, what moves performance is the harness, not the model. The authors borrow from control theory and cast the harness as the controller in a closed loop, the part that decides how the system recovers when a step goes wrong. Swap the model and you change a component; rebuild the harness and you change the system.

Reliability is where this stops being academic. Sierra’s τ-bench drops an agent into a retail service loop: a simulated shopper on one side, the retailer’s policies and tools on the other. It asks the agent to finish the same task eight times over. Sierra’s figures imply GPT-4o completed single attempts in the retail domain roughly three times in five; asked to clear all eight, it managed barely a quarter. The model’s weights did not change between trials. Its answers did.

Translate this into a shop. A customer no longer types into a search bar. She tells an assistant to find white trainers under eighty pounds and to handle the rest. Whether her brand of choice is surfaced, sized right, or swapped for a substitute is settled in tool calls between the model and the retailer’s systems. That negotiation is the harness, and the brand wrote none of it. A defect there never registers as an error; the brand records a lost sale and never learns why.

The brand owns the product and the price; it does not own the layer that decides whether either reaches the buyer.

Everyone else keeps score by model — as though the harness were a constant: which one leads a benchmark, which one a retailer has “chosen” for its assistant. The harness paper’s quieter finding is that these rankings say little while the scaffolding stays undisclosed, because the same model’s score moves with the harness beneath it. A leaderboard that hides the harness credits the cloth for the work of the tailor.

The reliability layer belongs to whoever runs the agent, and the brand is not running the agent. The dress is the label’s; the loop the shopper speaks to belongs to whoever built it. When that layer fails, the brand has no log to read and no setting to change. Fashion has spent the season choosing models. The choice that mattered was never theirs to make.

Related Coverage