When Shopping Agents Forget Their Job, the Cart Breaks
New research cataloguing 14 multi-agent LLM failure modes gives agentic commerce a specific name for its most consequential risk: role disobedience, where agents drift outside assigned responsibilities and produce incoherent purchasing sequences that erode buyer trust.
Neritus Vale
Sources used:
- Cemri et al., “Why Do Multi-Agent LLM Systems Fail?” (arXiv 2503.13657) — MAST taxonomy, 1,642 traces, 14 failure modes, role disobedience at 1.5%
- Ao, Gao, & Simchi-Levi, “On the Reliability Limits of LLM-Based Multi-Agent Planning” (arXiv 2603.26993) — theoretical proof that distributed agent networks are dominated by centralized decision-makers
- Human Security, “The Definitive Guide to Adopting Agentic Commerce” (2026) — more than 6,900% agent traffic growth over eight months of 2025, 2.2% cart interaction rate
- Forrester, “What It Means That The Leader In Agentic Commerce Just Pulled Back” — OpenAI Instant Checkout retreat, ~30 merchants live
- Shopify agentic commerce blog — 15x AI-driven order growth in 2025
- Mastercard Agent Pay and Verifiable Intent
The biggest threat to agentic commerce is not a missing protocol or an underpowered model. It is a coordination failure that researchers can now name: role disobedience. A taxonomy of multi-agent LLM failures published by Cemri, Pan, Yang, and colleagues catalogues 14 distinct failure modes drawn from 1,642 annotated execution traces across seven frameworks, including ChatDev, MetaGPT, and Magentic-One. One of those modes, labelled FM-1.2, describes agents that stop adhering to their assigned responsibilities and start behaving like a different agent in the pipeline. In software-engineering benchmarks, role disobedience accounts for 1.5% of observed failures. In a purchasing flow, where each agent controls a separate trust boundary, it is the failure most likely to collapse the transaction.
The MAST taxonomy sorts multi-agent failures into three categories: system design flaws at 44% of all observed failures, inter-agent misalignment at 32%, and task verification gaps at 24%. Step repetition is the most frequent single mode at 15.7%, followed by reasoning-action mismatch at 13.2%. Role disobedience sits near the bottom of the frequency table. But frequency is not severity. The researchers documented a case in ChatDev where a chief product officer agent terminated a conversation without consensus from the CEO agent, overriding the workflow hierarchy the system depended on. The failure was rare, but it rewrote the rules mid-task.
Agentic commerce is scaling into exactly the architecture where role disobedience does the most damage. Human Security’s 2026 adoption guide reports that AI agent traffic to retail sites grew more than 6,900% across eight months of 2025, yet only 2.2% of those agents interacted with shopping carts, checkout, or payment funnels. Separately, 87% of all pages those agents browsed were product pages. Shopify reports AI-driven orders grew 15-fold in 2025. The discovery layer works. The transaction layer, where role discipline matters most, barely exists.
A multi-agent shopping pipeline assigns distinct roles along the chain: one agent searches, one compares, one manages the cart, one authorizes payment. Each role mirrors a step in how humans buy and gives its agent a scoped context with bounded tools. Role disobedience collapses that separation: the search agent starts adding items without the comparison agent weighing in, or the payment agent reopens product discovery mid-checkout. From the shopper's perspective, the screen starts doing things that no longer make sense. Every role boundary in a purchasing pipeline maps to a trust boundary in the customer's mental model.
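The separation can be made concrete in code rather than left to prompt wording. A minimal sketch, with hypothetical role and tool names (nothing here comes from a real agent framework), of what a hard role boundary looks like:

```python
from dataclasses import dataclass

# Illustrative roles for a purchasing pipeline; tool names are assumptions.
@dataclass(frozen=True)
class Role:
    name: str
    allowed_tools: frozenset

SEARCH  = Role("search",  frozenset({"query_catalog"}))
COMPARE = Role("compare", frozenset({"fetch_specs", "rank_options"}))
CART    = Role("cart",    frozenset({"add_item", "remove_item"}))
PAYMENT = Role("payment", frozenset({"authorize_charge"}))

def invoke(role: Role, tool: str) -> str:
    """Reject any tool call outside the role's scope -- the boundary
    that role disobedience (MAST FM-1.2) silently crosses."""
    if tool not in role.allowed_tools:
        raise PermissionError(f"{role.name} agent may not call {tool}")
    return f"{role.name}:{tool}"

invoke(SEARCH, "query_catalog")   # in scope: allowed
# invoke(SEARCH, "add_item")      # out of scope: raises PermissionError
```

The point of the sketch is that a search agent adding items becomes a raised error at the tool layer, not an incoherent cart the shopper discovers later.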
OpenAI pulled its Instant Checkout feature in March 2026 after roughly 30 Shopify merchants had gone live. The retreat, noted by Forrester, confirmed what the data already showed: completing a purchase inside an answer engine is the least-adopted consumer use case. Checkout is where agent errors become irreversible. A hallucinated product recommendation wastes a click; an agent that crosses from comparison into cart management generates a chargeback and teaches the customer not to come back. A March 2026 study by Ao, Gao, and Simchi-Levi demonstrated that any delegated multi-agent network is “decision-theoretically dominated” by a centralized decision-maker with the same information — meaning distributed architectures pay a reliability tax that compounds when roles blur. The loss scales with the number of handoffs and the volume of information compressed at each stage.
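The compounding claim is easy to see with a toy calculation. Assuming, purely for illustration, that each handoff succeeds independently with the same probability (a simplification the cited paper does not make), end-to-end reliability is the per-stage figure raised to the number of handoffs:

```python
def pipeline_success(per_stage: float, handoffs: int) -> float:
    """End-to-end success if each handoff independently succeeds
    with probability per_stage (an illustrative simplification)."""
    return per_stage ** handoffs

# Four stages (search -> compare -> cart -> payment), each 98% reliable:
# the chain completes only about 92% of the time.
pipeline_success(0.98, 4)
```

A single centralized decision-maker pays that tax once; a four-stage delegated pipeline pays it at every boundary, which is the intuition behind the dominance result.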
Mastercard did not build Agent Pay and open-source a Verifiable Intent specification because agent commerce was working.
A reasonable objection is that 1.5% is a rounding error and that step repetition or verification failures are the modes retailers should watch. This is plausible, but only if each agent in the pipeline handles interchangeable, low-stakes tasks. A search agent that repeats a query wastes seconds. A checkout agent that starts browsing invalidates an authorization and potentially charges the customer for the wrong item. The MAST data comes from software-development frameworks, not purchasing pipelines, which means the 1.5% figure likely understates exposure in commerce, where role boundaries carry financial and legal weight.
The infrastructure race has centered on protocols: Mastercard’s tokenized agent credentials, Shopify’s agentic storefronts, Google’s and OpenAI’s competing commerce standards. These solve identity and payment plumbing. They do not solve role governance inside the agent pipeline itself. If retailers build purchasing flows on specialized agent architectures without enforcing which agent does what and when, they will discover what the MAST researchers already measured: the system fails not because individual agents are incompetent, but because nobody enforced the org chart.
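What enforcing the org chart might look like in practice: a minimal orchestrator sketch (stage names and transition rules are assumptions, not drawn from any cited protocol) that encodes legal handoffs as a state machine, so an out-of-turn agent is rejected rather than silently absorbed into the flow:

```python
# Legal stage transitions for a purchasing pipeline. Any handoff not
# listed here is a role violation, not a judgment call for the agent.
TRANSITIONS = {
    "search":  {"compare"},
    "compare": {"cart", "search"},   # comparison may send the shopper back
    "cart":    {"payment", "compare"},
    "payment": set(),                # terminal: no reopening discovery
}

class Orchestrator:
    """Holds the pipeline's current stage and enforces who acts next."""

    def __init__(self, start: str = "search"):
        self.stage = start

    def handoff(self, next_stage: str) -> str:
        if next_stage not in TRANSITIONS[self.stage]:
            raise RuntimeError(
                f"illegal handoff {self.stage} -> {next_stage}: "
                "role order is enforced by the orchestrator, not the agents")
        self.stage = next_stage
        return self.stage
```

Under a scheme like this, the payment agent that tries to reopen product discovery mid-checkout produces a logged error instead of a screen that no longer makes sense.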