Only Two of Seven Chatbot Quality Scores Predicted a Sale
An arXiv study testing chatbot evaluation rubrics against verified sales data found that only two of seven quality dimensions predicted conversion, and the composite score most dashboards report underperformed both.
Neritus Vale
A study on arXiv testing whether chatbot quality scores predict sales found that most do not. Researchers evaluated 60 conversations from a major Chinese platform against verified payment records, scoring each across seven quality dimensions with an LLM judge. Only two of the seven showed a statistically significant association with whether the customer paid. The composite score that gates most deployment decisions underperformed both. This is among the first studies to validate those scores against what they are supposed to predict.
The platform sold matchmaking subscriptions priced between ¥469 and ¥1,688 per year through WeChat Work conversations between human agents and prospective customers. The stratified sample of 60 conversations was drawn from the 59,316 collected between March and July 2025. Each was scored on seven dimensions: Need Elicitation, Emotional Empathy, Pacing Strategy, Objection Handling, Contextual Memory, Product Accuracy, and Brand Consistency. The outcome variable was a verified payment record, not a satisfaction survey or a proxy like session length.
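For readers who want to picture the unit of analysis, it amounts to a record like the one below. This is an illustrative sketch, not the paper's schema; every field name here is an assumption.

```python
from dataclasses import dataclass

@dataclass
class ScoredConversation:
    """One evaluated conversation. Field names are illustrative, not the paper's."""
    conversation_id: str
    n_messages: int            # conversation length, used later as a control
    need_elicitation: float    # each dimension scored 0-1 by an LLM judge
    emotional_empathy: float
    pacing_strategy: float
    objection_handling: float
    contextual_memory: float
    product_accuracy: float
    brand_consistency: float
    converted: bool            # verified payment record, the ground-truth outcome
```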
Need Elicitation (ρ=0.368) and Pacing Strategy (ρ=0.354) were the only dimensions that survived Bonferroni correction for multiple comparisons. Emotional Empathy, Product Accuracy, Objection Handling, and Brand Consistency showed no significant link to whether someone bought. Contextual Memory registered ρ=0.018, statistically indistinguishable from zero. In a logistic regression controlling for conversation length, Pacing Strategy’s effect strengthened to an odds ratio of 3.18. Contextual Memory reversed direction entirely, becoming a negative predictor with an odds ratio of 0.15 (p=0.005). The dimension most chatbot vendors highlight as a differentiator had no positive relationship with the outcome they are hired to produce.
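The per-dimension test is straightforward to replicate. The sketch below assumes the scores and outcomes sit in a pandas DataFrame with columns matching the record above; it runs Spearman correlations against a Bonferroni-corrected threshold and a logistic regression controlling for conversation length. This is illustrative code, not the authors' analysis scripts.

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr
import statsmodels.api as sm

DIMENSIONS = ["need_elicitation", "emotional_empathy", "pacing_strategy",
              "objection_handling", "contextual_memory", "product_accuracy",
              "brand_consistency"]

def validate_dimensions(df: pd.DataFrame) -> None:
    """Per-dimension Spearman rho vs. conversion, Bonferroni-corrected."""
    alpha = 0.05 / len(DIMENSIONS)  # Bonferroni: divide alpha by number of tests
    for dim in DIMENSIONS:
        rho, p = spearmanr(df[dim], df["converted"])
        flag = "significant" if p < alpha else "n.s."
        print(f"{dim}: rho={rho:.3f}, p={p:.4f} ({flag})")

def pacing_odds_ratio(df: pd.DataFrame) -> float:
    """Logistic regression of conversion on pacing, controlling for message
    count; the exponentiated coefficient is the odds ratio (the paper
    reports OR=3.18 for Pacing Strategy)."""
    X = sm.add_constant(df[["pacing_strategy", "n_messages"]])
    model = sm.Logit(df["converted"].astype(int), X).fit(disp=0)
    return float(np.exp(model.params["pacing_strategy"]))
```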
The behavioral data suggests why pacing predicted conversion while memory and empathy did not. The researchers tracked each conversation through a six-stage Trust Ladder, from initial skepticism to purchase readiness. Reaching the trust stage where price discussion became viable required an average of 36 messages. Agents who converted customers did not follow a fixed sequence; 37% of successful conversations showed repeated hesitation, with the agent retreating when the customer balked before advancing again. Pacing, in this reading, is not a measure of conversational polish. It measures whether the agent reads resistance and adjusts.
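One way to operationalize that reading is to count retreat-then-readvance cycles over the Trust Ladder. A minimal sketch, assuming a six-stage ladder with hypothetical stage labels (the paper's exact labels may differ):

```python
# Illustrative six-stage Trust Ladder; stage labels are assumptions.
TRUST_LADDER = ["skepticism", "curiosity", "engagement", "disclosure",
                "trust", "purchase_readiness"]
STAGE_INDEX = {s: i for i, s in enumerate(TRUST_LADDER)}

def hesitation_loops(stage_sequence: list[str]) -> int:
    """Count retreat-then-readvance cycles: the agent drops back down the
    ladder when the customer balks, then climbs again. The paper reports
    such loops in 37% of successful conversations."""
    loops, retreated = 0, False
    for prev, curr in zip(stage_sequence, stage_sequence[1:]):
        if STAGE_INDEX[curr] < STAGE_INDEX[prev]:
            retreated = True            # customer balked, agent backed off
        elif retreated and STAGE_INDEX[curr] > STAGE_INDEX[prev]:
            loops += 1                  # re-advanced after retreating
            retreated = False
    return loops
```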
The problem is not scoring accuracy; LLM judges achieve roughly 80% agreement with human evaluators, matching human-to-human consistency. The problem is how scores are weighted and combined. Equal-weighted composites — the default in most commercial evaluation tools — produced a correlation with conversion of ρ=0.230, below the threshold for statistical significance. Reweighting based on observed conversion data improved it to ρ=0.351, but the optimal scheme required giving Pacing Strategy 40% of total weight and zeroing out Contextual Memory entirely. The gap between a standard rubric and a conversion-validated one is the gap between measuring quality and measuring whether quality predicts anything.
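The reweighting itself is trivial arithmetic; what matters is where the weights come from. A sketch, with the paper's reported optimum hard-coded for Pacing Strategy (40%) and Contextual Memory (0%), and the split of the remaining 60% assumed for illustration:

```python
import numpy as np
from scipy.stats import spearmanr

# Equal weights: the default composite most dashboards report
# (rho = 0.230 against conversion in the study).
equal_weights = np.full(7, 1 / 7)

# A conversion-informed scheme consistent with the paper's reported optimum.
# Only the 0.40 and 0.00 entries come from the paper; the rest are assumed.
tuned_weights = np.array([
    0.20,  # need_elicitation
    0.10,  # emotional_empathy
    0.40,  # pacing_strategy   (paper: 40% of total weight)
    0.10,  # objection_handling
    0.00,  # contextual_memory (paper: zeroed out entirely)
    0.10,  # product_accuracy
    0.10,  # brand_consistency
])

def composite_vs_conversion(scores: np.ndarray, converted: np.ndarray,
                            weights: np.ndarray) -> float:
    """Spearman rho between a weighted composite and the payment outcome.
    scores: (n_conversations, 7) dimension matrix; converted: binary vector."""
    composite = scores @ weights
    rho, _ = spearmanr(composite, converted)
    return rho
```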
In the study’s pilot phase, AI agents scored higher on the quality rubric than human agents while converting zero of ten customers.
The authors attributed this result partly to an agent-type confound: all three pilot conversions came from human agents, making the comparison unsuitable for causal claims. This is plausible as a methodological defense, but only if you set aside the separate behavioral analysis of 130 AI-agent conversations. In that analysis, AI agents reached the closing stage of the sales funnel 72% of the time yet reached the trust threshold zero percent of the time. They cycled through sales stages 3.4 times faster than human agents, executing the correct sequence without building the relationship the sequence was designed to build. The rubric captured every step of the funnel but had no dimension for trust.
The study examined a matchmaking service, not a fashion retailer, but the conversational mechanics are structurally comparable: a considered purchase negotiated through extended dialogue where trust determines whether the customer commits. Fashion brands deploying conversational AI on WhatsApp, WeChat, and Instagram DM evaluate their chatbots with rubric-based quality scores covering empathy, accuracy, and brand voice, without testing whether those scores predict purchase. A chatbot that scores high on emotional empathy and low on pacing may look better on the quality dashboard while converting fewer customers than one with the reverse profile. If the variation across dimensions that this paper documents holds beyond one platform, most brands are optimizing the wrong signals.
The researchers propose a three-layer evaluation architecture: safety gates first, quality scoring with conversion-informed weights second, direct business outcome validation third. Most commercial frameworks stop at layer two. If retailers want their conversational AI scores to mean anything, they need to run the test this study ran: connect the rubric to a transaction record, test each dimension separately, and discover which of their scores are not predicting the sale.
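In code, that architecture reduces to a short pipeline. A hedged sketch with placeholder gate logic, not the authors' implementation:

```python
import numpy as np
from scipy.stats import spearmanr

def evaluate(scores: np.ndarray, safety_flags: np.ndarray,
             weights: np.ndarray, converted: np.ndarray) -> dict:
    """Three-layer evaluation: (1) hard safety gate, (2) composite quality
    score with conversion-informed weights, (3) validation of the composite
    against verified transaction outcomes. All logic is illustrative."""
    # Layer 1: safety gate -- any flagged conversation fails outright.
    passed = ~safety_flags
    # Layer 2: quality scoring with conversion-informed weights.
    composite = scores @ weights
    # Layer 3: does the composite actually predict the business outcome?
    rho, p = spearmanr(composite[passed], converted[passed])
    return {"pass_rate": passed.mean(), "rho_vs_conversion": rho, "p_value": p}
```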