Retailers Shipped The Apps. The Benchmark Is Still A Preprint.
Retailers have launched apps inside ChatGPT and Claude faster than academic researchers can build evaluation frameworks for them. The closest thing to a public benchmark shows the same product earning a 13x spread in agent selection across frontier models, and it is still a preprint.
Neritus Vale
Walmart, Target and Etsy have launched commerce apps inside ChatGPT since OpenAI opened its Apps SDK. The closest thing to a public benchmark for what those apps actually do to demand was last revised on 17 December 2025. No retailer announcement we have seen has cited it. The launch curve is steeper than the evaluation curve, which means most of these apps are being tested in production on shoppers who do not know they are the eval set.
Since then the storefronts have arrived in waves. OpenAI’s Instant Checkout launched with Etsy; Walmart joined on 14 October. By March 2026 OpenAI had conceded that the original flow had stumbled and pivoted to in-ChatGPT apps with Sparky; Target, Sephora, Nordstrom and others integrated separately through the Agentic Commerce Protocol. Google’s Universal Commerce Protocol added Ulta on 22 April 2026; the Ask Macy’s agent integrated using Google’s commerce technology the following day. Each launch came with a press release and a partnership quote; none came with a published evaluation framework.
The standard reference for shopping-agent evaluation is still WebShop, a Princeton paper from NeurIPS 2022 built around 1.18 million products and crowdsourced purchase instructions. WebMall from August 2025 extends the trajectory length and the multi-shop surface; Princeton’s Holistic Agent Leaderboard, accepted to ICLR 2026, provides cost-aware evaluation infrastructure across agent benchmarks. None of them has crossed into vendor talking points. Retailers are launching against agents the academic literature has not yet caught up with.
Same product, three models, three different stores.
A working paper from August 2025, revised in October and again in December, asked what happens when different AI agents shop the same simulated marketplace. “What Is Your AI Agent Buying?”, by Allouah, Besbes, Figueroa, Kanoria and Kumar, ran multiple frontier models through ACES, an agentic e-commerce simulator with randomised positions, prices and badges. For one fitness tracker, the Fitbit Inspire, the selection rate sat at 6% under GPT-5.1, the kind of number that in retail terms describes a long-tail SKU about to be delisted. Swap the agent for Claude Opus 4.5 and the same product cleared 77%, a share normally reserved for exclusives. The merchandise did not change between runs; the agent fronting it did, and the demand distribution moved by more than an order of magnitude.
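To see why that number cannot be reconstructed from order data, it helps to sketch what an ACES-style harness actually does. The Python below is not the paper’s code: the catalogue, the model names and the stub agent are placeholders so it runs offline, and a real harness would replace stub_agent with calls to each vendor’s API. The mechanics are the point: randomise the grid on every trial, ask each model to pick, and tally selection share per product per model.

```python
import random
from collections import Counter, defaultdict

# Hypothetical catalogue; product names are illustrative, not taken from ACES.
CATALOG = ["Fitbit Inspire", "Tracker B", "Tracker C", "Tracker D"]

def stub_agent(products, model):
    """Placeholder for a real model call. A real harness would send the
    randomised grid to the vendor API and parse the chosen product; here each
    'model' simply has a different position preference so the script runs
    offline and still produces divergent selection shares."""
    preferred_slot = {"model-a": 0, "model-b": len(products) // 2}[model]
    return products[preferred_slot] if random.random() < 0.7 else random.choice(products)

def selection_rates(models, trials=1_000):
    """Selection share per product per model over many randomised grids."""
    counts = defaultdict(Counter)
    for model in models:
        for _ in range(trials):
            grid = random.sample(CATALOG, len(CATALOG))  # new random positions every trial
            counts[model][stub_agent(grid, model)] += 1
    return {m: {p: c[p] / trials for p in CATALOG} for m, c in counts.items()}

if __name__ == "__main__":
    for model, shares in selection_rates(["model-a", "model-b"]).items():
        print(model, {p: round(s, 2) for p, s in shares.items()})
```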
A retailer integrating with both ChatGPT and Claude has effectively built two storefronts whose conversion behaviour is decoupled from its merchandising decisions. The ACES paper documents that all tested models penalise sponsored badges by roughly 20%, a uniform suppression that already breaks the standard retail-media playbook. An “Overall Pick” endorsement, by contrast, lifts selection by 65 to 138% depending on the model, which is a different shape of merchandising lever altogether. Position bias compounds the asymmetry: GPT-4.1 favours the first column, Claude Sonnet 4 the middle, and GPT-5.1 displays preferences the authors describe as near-opposite to GPT-4.1’s. None of this is testable from a retailer-side dashboard. The retailer sees the orders the agent forwarded, not the products the agent quietly skipped, and the absence of impressions is not the absence of a problem.
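The asymmetry between those two levers is easy to put into numbers. The arithmetic below is purely illustrative: the 10% baseline selection share is an assumption, while the multipliers are the effect sizes quoted above.

```python
# Illustrative arithmetic only: the 10% baseline share is an assumption;
# the effect sizes are the ranges reported in the ACES paper.
baseline_share = 0.10

sponsored = baseline_share * (1 - 0.20)          # ~20% penalty, roughly uniform across models
overall_pick_low = baseline_share * (1 + 0.65)   # weakest model response to the endorsement
overall_pick_high = baseline_share * (1 + 1.38)  # strongest model response

print(f"sponsored badge: {sponsored:.3f} on every tested model")
print(f"'Overall Pick':  {overall_pick_low:.3f} to {overall_pick_high:.3f}, depending on the model")
```

One lever moves share the same way on every surface; the other’s effect depends on which agent happens to be in front of the shopper.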
The fragmentation across vendors makes the eval problem worse rather than averaging it out. ChatGPT routes Walmart, Target and Etsy queries to retailer-owned apps; Claude routes queries through category-owned apps; Google routes them through Universal Commerce Protocol partners like Ulta and, via its broader commerce infrastructure, Macy’s. Each surface front-loads a different decision: ChatGPT’s preserves the retailer brand, Claude’s preserves the category brand, Google’s preserves the protocol. A merchandising lead building for all three is choosing how to compete across three different demand functions with three different selection biases, none of them visible in their own analytics.
The defensible counterargument is that production A/B testing is the real evaluation, and academic benchmarks always lag. That is true if the system you are testing is stable. The condition that has to hold for the argument to work is that model behaviour does not shift discontinuously between vendor releases. ACES shows that it does: GPT-5.1’s position bias runs near-opposite to GPT-4.1’s. A/B test on Tuesday and the answer is stale by Wednesday’s model update. Conversion data tells you which variant of your prompt sold more; it cannot tell you that the agent stopped showing your product altogether after a model swap.
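Catching that failure mode means sampling the agent’s behaviour directly rather than waiting on conversion data. Below is a rough sketch of such a check, assuming the retailer replays scripted shopping queries against each surface and logs which model version handled each sampled session; the session schema, field names and the 50% drop threshold are all invented for illustration.

```python
from collections import Counter

def selection_share(sessions, product, model_version):
    """Share of sampled agent sessions on a given model version in which
    `product` was the agent's pick. `sessions` is assumed to be a list of
    dicts with 'model_version' and 'selected_product' keys (hypothetical schema)."""
    relevant = [s for s in sessions if s["model_version"] == model_version]
    if not relevant:
        return None
    picks = Counter(s["selected_product"] for s in relevant)
    return picks[product] / len(relevant)

def flag_release_shift(sessions, product, old_version, new_version, drop_threshold=0.5):
    """Flag a relative fall in a product's selection share across a model
    release larger than `drop_threshold`, the kind of discontinuity an A/B
    test run entirely on the old version would never surface."""
    before = selection_share(sessions, product, old_version)
    after = selection_share(sessions, product, new_version)
    if before in (None, 0) or after is None:
        return False
    return (before - after) / before > drop_threshold

# Hypothetical sampled sessions; in practice these would come from replaying
# scripted shopping queries against each surface after every vendor release.
sessions = [
    {"model_version": "gpt-4.1", "selected_product": "Fitbit Inspire"},
    {"model_version": "gpt-4.1", "selected_product": "Fitbit Inspire"},
    {"model_version": "gpt-4.1", "selected_product": "Tracker B"},
    {"model_version": "gpt-5.1", "selected_product": "Tracker B"},
    {"model_version": "gpt-5.1", "selected_product": "Tracker B"},
    {"model_version": "gpt-5.1", "selected_product": "Tracker C"},
]

print(flag_release_shift(sessions, "Fitbit Inspire", "gpt-4.1", "gpt-5.1"))  # True
```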
What the retailers have shipped is a tenancy in someone else’s tasting menu. If evaluation infrastructure stays this far behind launch infrastructure, the next OpenAI or Anthropic model release functions as an unannounced merchandising decision the retailer cannot review. The price of being early is being measured by your vendor’s release notes rather than by your own benchmarks. ACES could become the public test every commerce integration has to clear, the way SOC 2 became table stakes for SaaS. The alternative is the Amazon outcome: retailers learn what their app does by watching the share they used to have evaporate over a quarter, with no log of which model swap caused it.