Alibaba Gave 6,000 Agents an AI Copilot. The Best Ones Got Worse.
A field experiment across 5,940 Taobao agents found that generative AI compressed the customer service skill curve, lifting low performers while degrading top-performer quality through increased multitasking.
Neritus Vale
Alibaba ran a randomized field experiment on Taobao, giving roughly half of 5,940 customer service agents a generative AI copilot and leaving the rest without one. Across all agents, the tool cut issue identification time by 32% and raised customer satisfaction ratings by 5.3%. Low performers captured most of the improvement, while top performers saw service quality fall. Across 2.56 million chats, the skill curve compressed.
The copilot diagnosed each customer’s order issue in real time and proposed a solution as draft text that agents could send, edit, or ignore. Low-performing agents used it heavily and resolved issues faster. Their own typing volume and message count increased, which co-author Lauren Xiaoyuan Lu described as unexpected: she had expected the tool to reduce agent effort. Instead, weaker agents treated the AI’s output as a starting draft and built on it. The tool gave them access to diagnostic patterns they had not yet learned on the job.
Top-performing agents responded to the copilot by multitasking. Given an assistant that handled initial diagnosis, they redirected attention to other concurrent chat sessions. Their shift-away time (a measure of how long agents diverted attention from one active chat to another) increased, slowing responses to individual customers, raising abandonment rates, and triggering retrials. Customer retrial rates rose for the agents who had previously needed the fewest attempts. The AI’s recommendations were calibrated to the median case; for agents already operating well above that line, following them was a step down.
A separate study published in the Quarterly Journal of Economics by Brynjolfsson, Li, and Raymond found a similar pattern in a customer support operation. Across 5,172 agents, AI assistance lifted average productivity by 14%, as measured by issues resolved per hour. Novice and low-skilled workers drove the result, improving by 34%. The experience curve compressed so sharply that agents with two months of AI-assisted tenure matched untreated agents with more than six months on the job. The AI, the authors argued, disseminated the best practices of top performers downward, giving novices access to accumulated institutional knowledge. Two independent studies, on different platforms, produced the same distributional signature.
If generative AI compresses the performance distribution rather than shifting it, the hiring premium on experienced customer service agents drops while the cost of onboarding inexperienced ones falls.
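The difference between compressing a distribution and shifting it is easier to see with numbers. The sketch below is purely illustrative: the five quality scores, the `shift` and `compress` helpers, and the pull-toward-80 parameter are all hypothetical and come from neither study. It shows how two interventions can produce the same average gain while only one of them narrows the gap between the weakest and strongest agents and lowers the top score.

```python
from statistics import mean, pstdev

# Hypothetical quality scores for five agents, weakest to strongest.
# Illustrative numbers only; they are not taken from either study.
baseline = [50.0, 60.0, 70.0, 80.0, 90.0]


def shift(scores, gain):
    """Every agent improves by the same amount: the mean rises, the spread is unchanged."""
    return [s + gain for s in scores]


def compress(scores, pull, target):
    """Every agent moves a fraction `pull` of the way toward `target`.

    Agents below the target improve; agents above it get worse.
    """
    return [s + pull * (target - s) for s in scores]


shifted = shift(baseline, 5.0)
compressed = compress(baseline, 0.5, 80.0)

for name, scores in [("baseline", baseline), ("shifted", shifted), ("compressed", compressed)]:
    print(f"{name:<11} mean={mean(scores):.1f} spread={pstdev(scores):.1f} "
          f"bottom={min(scores):.1f} top={max(scores):.1f}")
```

In this toy setup both interventions raise the mean by the same five points, so an average alone cannot tell them apart. Only the spread and the top score reveal that the second one pulled the best agent down.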
This conclusion holds only if the quality gap between AI-calibrated median service and genuine expert service is small enough that customers accept it. A Qualtrics 2026 study found that nearly one in five consumers who used AI-assisted customer service saw no benefit, a failure rate almost four times higher than for other AI applications. Gartner predicts that half of companies that cut customer service staff because of AI will rehire by 2027. The skill floor rises quickly, but the interactions that generate outsized revenue (complex returns, loyalty-threatening complaints, high-value styling requests) still need the ceiling.
Alibaba has already moved past the experiment phase. Its Dianxiaomi customer service tool has served 300 million customers across Taobao and Tmall and raised transaction conversion rates by 30%, as reported by the South China Morning Post. Five million merchants adopted Alibaba’s AI tools over the past year, generating an estimated 100 billion yuan in cost savings. In March 2026, Alibaba announced agentic AI services for millions of merchants: a 24/7 “digital workforce” that handles customer queries, distributes vouchers, and adjusts pricing in real time. Fashion retail is moving the same way. ASOS signed with Sierra, the $10 billion AI customer service startup, in late 2025 for a similar overhaul. But returns, sizing disputes, and styling advice carry more ambiguity than the refund-or-replace queries that dominated the Taobao experiment.
If fashion brands read skill-curve compression as a license to replace experienced agents with AI-assisted novices, they will hit the limit the Qualtrics data already describes. The sharper reading of the Alibaba experiment points somewhere else. Low performers improved because the copilot gave them information they lacked. The mechanism reversed for top agents: the copilot freed up attention, they spent it on additional concurrent chats, and quality dropped. Deploy the tool where the gap is largest. Keep it away from the agents who were already better than the model.