AI Can Tell Retailers What Sold. It Still Can't Tell Them Why.
The 'proactive insight' dashboards now sold to fashion buyers promise to explain why a style sold. Causal-inference research says explanation is the one thing large models cannot recover from sales history, which turns their markdown and reorder advice into correlation dressed as cause.
Neritus Vale
The newest analytics tools sold to fashion buyers answer the question buyers most want answered, and answer it wrong. Ask one why a dress sold and it returns a fluent, specific sentence about cause. These copilots are marketed as proactive insight into what drove a sale or stalled it; the research on how large language models handle causation says cause is precisely what they cannot recover from the sales history they read.
The cleanest test of the gap asked models to do nothing but tell correlation from cause. Researchers led by Zhijing Jin built Corr2Cause, a benchmark of hundreds of thousands of problems that hand a model a set of correlations and ask which variable causes which. Every model they ran, including the best system they tested, GPT-4, stayed in the neighbourhood of chance guessing. The benchmark removes memorised fact and leaves only the inference, and the inference is where the models failed.
The failure is structural, not a matter of model size or training budget. Judea Pearl’s hierarchy of causation sorts every causal question onto one of three rungs: seeing, doing, and imagining. A dashboard reads what happened, which is rung one, association. A buyer who asks what to reorder or mark down is asking what will happen if she acts, which is rung two, intervention. The foundational result of the field is that no volume of rung-one data answers a rung-two question without an added causal assumption the data itself can never supply. A model trained to predict the next word over observational text lives on the first rung, and cannot climb by reading harder.
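The rung-one limit can be made concrete with a toy calculation. The sketch below is illustrative only: the two "worlds," the variable names, and every probability in it are assumptions invented for the example, not figures from any study cited here. In World A the markdown itself moves units; in World B warm weather drives both the timing of the markdown and the sales, and the markdown does nothing. Both worlds produce the same sales history, and they disagree completely about what happens when the buyer acts.

```python
from itertools import product
from math import isclose

# All numbers below are made up for illustration.
P_HIGH_SALES_WHEN_ON = 0.8   # chance of a strong week when the driver is "on"
P_HIGH_SALES_WHEN_OFF = 0.2  # chance of a strong week when the driver is "off"


def joint_world_a():
    """World A: markdown -> sales. Returns P(markdown, high_sales)."""
    joint = {}
    for markdown, sales in product([0, 1], repeat=2):
        p_markdown = 0.5
        p_high = P_HIGH_SALES_WHEN_ON if markdown else P_HIGH_SALES_WHEN_OFF
        joint[(markdown, sales)] = p_markdown * (p_high if sales else 1 - p_high)
    return joint


def joint_world_b():
    """World B: warm weekend -> markdown and warm weekend -> sales.

    The markdown has no effect of its own; it merely lands on warm weekends.
    Returns P(markdown, high_sales) with the weather summed out, because the
    dashboard never recorded it.
    """
    joint = {key: 0.0 for key in product([0, 1], repeat=2)}
    for warm in [0, 1]:
        p_warm = 0.5
        markdown = warm  # cuts happen to coincide with warm weekends
        p_high = P_HIGH_SALES_WHEN_ON if warm else P_HIGH_SALES_WHEN_OFF
        for sales in [0, 1]:
            joint[(markdown, sales)] += p_warm * (p_high if sales else 1 - p_high)
    return joint


def seen_rate(joint):
    """Rung one: P(high sales | markdown observed)."""
    return joint[(1, 1)] / (joint[(1, 0)] + joint[(1, 1)])


# Rung two: P(high sales | buyer forces a markdown), read off each world's mechanism.
forced_rate_a = P_HIGH_SALES_WHEN_ON  # the markdown is the driver
forced_rate_b = 0.5 * P_HIGH_SALES_WHEN_ON + 0.5 * P_HIGH_SALES_WHEN_OFF  # weather is

world_a, world_b = joint_world_a(), joint_world_b()
same_history = all(isclose(world_a[k], world_b[k]) for k in world_a)

print("identical sales history:", same_history)
print("seen rate, A vs B:", seen_rate(world_a), seen_rate(world_b))
print("forced rate, A vs B:", forced_rate_a, forced_rate_b)
```

Under these assumptions the observed rate of strong weeks after a markdown is the same in both worlds, while the rate after a deliberate markdown is 0.8 in one and 0.5 in the other. Nothing in the shared history says which world the shop is in; that is the added causal assumption the data cannot supply.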

In a buying office the error takes a familiar shape. A navy midi climbs the week it is marked down, and the copilot reports the markdown as the reason units moved. It did not see that the cut landed on the first warm weekend of the season, or that a competitor two doors down had sold out of its nearest equivalent. Price fell and units rose together, so the machine named the price, because co-movement is all it has. The buyer then marks down the next style on the same logic, or reorders the discounted assortment, and chases a lift the weather and a rival’s empty shelf produced.
What retailers are buying, then, is not insight but the grammar of insight: an answer shaped like a reason, generated by a system that has no access to reasons.
The vendors have already conceded the point, in documentation no buyer opens. Tableau’s Explain Data offers a one-click account of why any number on a chart looks the way it does, and ships as a standard feature of the platform. Its own help pages state that its explanations “are not causal explanations,” and warn the user not to assume causality, since correlation is not causation. The disclaimer sits in the manual; the feature sits in the meeting. The distance between those two places is exactly where the wrong reorder is decided.
The strongest objection is that the models have already passed the causal exam. In a widely cited 2023 study, Amit Sharma and colleagues found GPT-4 and GPT-3.5 settling pairwise cause-and-effect questions at about 97 percent accuracy, a result that held on datasets built after the models’ training cut-off and so cannot be dismissed as memorisation. If a buyer’s question were one whose answer is written down somewhere in the world’s text, the model would likely find it. A buyer’s question never is. “Would cutting the navy midi next week lift units, or would the warm weekend do it anyway” is a fact about one shop’s coming Tuesday, recorded in no corpus. The 97 percent measures recall of causal knowledge that already exists; Corr2Cause’s near-chance result measures discovery of structure that does not, and buyers live in the second number.
One 2025 study did the honest thing and fed the models the actual numbers behind the labels. Researchers testing data-driven causal discovery found that the variable names alone let a model beat classical statistical methods by up to 0.41 in F1 score; adding the observed data gained at most 0.11 more. The ranking of those two numbers is the whole story: the labels did the causal work, and the data the dashboard actually holds barely moved the result. A buyer’s labels are generic, “price” and “units” and “SKU,” so the part the model leans on carries no answer about her shop. If vendors keep selling rung-two answers built on rung-one machinery, the cost will not arrive as an error message; it will arrive as a season of markdowns timed to weather no one measured, and as reorders that looked data-driven when they were not.
The repair is cheap and old, and it is called the experiment. A dashboard cannot tell a buyer why a style sold, but it can tell her what to test, and a single holdout store or a staggered markdown will settle the causal question the copilot only pretends to answer. If retailers treat the machine’s “why” as a hypothesis to run rather than an instruction to follow, the same tools that mislead them become the cheapest way to design the test that does not. The technology that cannot find cause is still worth paying for; it is worth one rung less than the price on the invoice.
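What the experiment buys can be seen in a small simulation. Everything in this sketch is assumed for illustration: a true markdown lift of 5 units, a warm-weekend lift of 20, and a buyer who, left to habit, cuts prices mostly on warm weekends. The names `simulate` and `lift` are hypothetical helpers, not part of any vendor's tool. The same difference-in-means is computed twice, once on habit-driven history and once on store-weeks where a coin flip decided the markdown.

```python
import random
from statistics import mean

# Assumed ground truth for the simulation; no real retailer's numbers.
TRUE_MARKDOWN_LIFT = 5.0   # extra units a markdown really causes
WARM_WEEKEND_LIFT = 20.0   # extra units a warm weekend causes
BASELINE_UNITS = 50.0


def simulate(n_store_weeks, randomized, rng):
    """Generate (markdown, units) pairs. Weather is deliberately not returned,
    mirroring a dashboard that never measured it."""
    rows = []
    for _ in range(n_store_weeks):
        warm = rng.random() < 0.5
        if randomized:
            # Holdout design: a coin flip, independent of the weather.
            markdown = rng.random() < 0.5
        else:
            # Habit: markdowns mostly land on warm weekends.
            markdown = rng.random() < (0.9 if warm else 0.1)
        units = (
            BASELINE_UNITS
            + TRUE_MARKDOWN_LIFT * markdown
            + WARM_WEEKEND_LIFT * warm
            + rng.gauss(0, 5)  # week-to-week noise
        )
        rows.append((markdown, units))
    return rows


def lift(rows):
    """Difference in mean units between marked-down and full-price store-weeks."""
    marked_down = [units for markdown, units in rows if markdown]
    full_price = [units for markdown, units in rows if not markdown]
    return mean(marked_down) - mean(full_price)


rng = random.Random(0)  # fixed seed so the run is repeatable
history_lift = lift(simulate(20_000, randomized=False, rng=rng))
experiment_lift = lift(simulate(20_000, randomized=True, rng=rng))

print(f"lift read from sales history: {history_lift:.1f} units")
print(f"lift from randomized holdout: {experiment_lift:.1f} units")
print(f"true lift built into the simulation: {TRUE_MARKDOWN_LIFT:.1f} units")
```

In this setup the history credits the markdown with most of the weather's work, while the randomized comparison lands close to the lift that was built in. The arithmetic is identical in both cases; only the way the markdown was assigned differs, which is the whole of what the holdout store contributes.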