The lead score that learned the funnel's own ladder.
A team inside a Chinese EV retailer trained an LLM to rank 6.14 million sales leads by the funnel's intermediate stages instead of the rare final label. The top of the queue got dramatically better, a 132-day field test reported a conversion lift, and the idea travels even where the numbers do not. We walk the architecture, the tables, and the arithmetic.
Trained on funnel-stage preference pairs instead of binary conversion labels, an LLM ranker put 25.76 percent future buyers in the top 0.1 percent of a 6.14 million-lead queue, versus 14.41 percent for the best conventional model and 3.28 percent for recency. The practice change: score leads against your funnel's own ladder, and audit precision at the top of the list, not AUC.
Fifty leads. In a 50,000-lead month, that is the top 0.1 percent of the queue: the hot list a manager can actually flag for same-day senior follow-up. Sort the queue by recency, the default in most CRMs, and about 2 of those 50 will ever buy. Sort it with the best conventional model in this week’s paper and you get about 7. Sort it with the method the paper proposes and you get about 13. Same leads, same reps, same payroll. The only thing that changed is the order of the list. We work that arithmetic below, line by line, from the paper’s own tables.
The paper is “Rethinking Sales Lead Scoring with LLM-based Hierarchical Preference Ranking,” posted to arXiv in June 2026 by Zhang and colleagues, a team working inside a large new-energy-vehicle retailer [1]. It is the best lead-scoring study the lab’s standing weekly search has surfaced this year, and also a small masterclass in how to read a results table without being led by the abstract.
A ranking problem wearing a prediction costume
Start with the research question, because it is the part that transfers. In a long-cycle funnel, can a model learn to rank leads from the funnel’s intermediate stages instead of waiting for the rare final label, and can an LLM backbone read the dialogue text that conventional models ignore [1]? The diagnosis behind the question is one every operator will recognize. The standard label for lead scoring is binary: converted or not. In this paper’s funnels, that label is positive for 1.33 to 1.45 percent of leads [1]. Everything else the buyer did along the way, the store visit, the test drive, the long call, gets thrown away. The model is asked to learn from the rarest event in the funnel while ignoring the densest evidence in it.
The sample: two datasets from that retail system. A benchmark set of 340,000 leads with a 1.45 percent positive rate, and an industrial set of 6.14 million leads at 1.33 percent [1]. Both split 7:3 into train and test in strict temporal order, so the model cannot peek at the future [1]. The outcome label is exact and worth writing down: final conversion, defined as order lock-in. Did this person commit to buying the car [1].
The architecture they call asLLR. The backbone is a small language model, Qwen at 1.5 to 1.8 billion parameters, fine-tuned with LoRA, reading the CRM’s structured fields together with the unstructured dialogue logs, truncated at 2,000 tokens [1]. Three heads sit on that backbone: a semantic head that keeps the language model honest about what the text says, a pointwise scoring head that outputs the lead score, and a pairwise ranking head that learns from comparisons [1]. Note what the backbone is not: this is not GPT-4-class machinery. It is commodity-scale compute, and the paper argues that generative frontier models lack a mechanism for comparable ranking outputs anyway [1].
The actual contribution is the training objective, HPRO: hierarchical preference ranking optimization. Instead of one rare binary label, the funnel’s own stages are converted into preference pairs at three tiers, each with its own margin. Global Dominance: a lead that locked in beats a lead that was defeated, margin 1.0. Key Action: a lead that took a test drive beats one that did not, margin 0.5. Soft Signal: a long call beats a short call, margin 0.1 [1]. A margin-aware Bradley-Terry loss, the same paired-comparison math used to rate chess players and to align modern LLMs, turns those pairs into trainable signal [8]. The funnel stops being a reporting artifact and becomes the curriculum.
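A minimal sketch of how tiered pairs could become a loss, for readers who want to see the shape of the idea. It assumes the common margin form, minus the log of a sigmoid applied to the score gap less the margin; the paper's exact formulation is not reproduced here. The names `TIER_MARGINS`, `margin_bt_loss`, and `batch_loss` are illustrative, not the authors'.

```python
import math

# Margins per tier as quoted above; the dictionary keys are our own labels.
TIER_MARGINS = {"global_dominance": 1.0, "key_action": 0.5, "soft_signal": 0.1}

def margin_bt_loss(score_preferred, score_other, margin):
    """Loss for one pair: small when the preferred lead wins by more than `margin`.

    Assumed form: -log(sigmoid(score_preferred - score_other - margin)).
    """
    z = score_preferred - score_other - margin
    # -log(sigmoid(z)) equals log(1 + exp(-z)); branch to avoid overflow.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

def batch_loss(pairs):
    """Mean loss over (score_preferred, score_other, tier) triples."""
    losses = [margin_bt_loss(sw, sl, TIER_MARGINS[tier]) for sw, sl, tier in pairs]
    return sum(losses) / len(losses)
```

Under this assumed form, the same score gap costs more in a high-margin tier than in a low-margin one, which is one way a hierarchy can tell the model that lock-in versus defeat matters more than long call versus short call.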
| Reported result | Value | Context |
|---|---|---|
| Classification performance | AUC 0.8161 | Best result on the 340k benchmark dataset (Table 1) |
| Ranking performance | +39.7% precision | Reported against the authors' model without HPRO (Table 2 ablation), not external baselines |
| Training signal | Funnel-aware preference pairs | Margin-aware Bradley-Terry formulation over sparse binary labels |
| Data setting | Leading NEV (new-energy vehicle) brand | Long cycle, multi-stage funnel, structured CRM plus unstructured logs |
Exhibit 1 is the result as the authors state it. Hold it loosely for a moment, because the two headline numbers come from two different experiments. The AUC of 0.8161 was measured on the 340k benchmark dataset. The +39.7 percent precision lift was measured on the 6.14M industrial dataset [1]. They are both real, and they should not be quoted as one result. Here is each one, properly.
On the benchmark dataset, six deep CTR baselines (the DeepFM family that powers most production lead scores) cluster between 0.7808 and 0.7917 AUC, with DeepFM best at 0.7917 [1]. Now the detail most coverage will skip: the LLM backbone alone scores 0.7921 [1]. Swapping a language model into the scoring job, by itself, buys you nothing. The gains arrive in two steps. Add the semantic loss, which forces the model to use the dialogue text, and AUC jumps to 0.8081. Add HPRO and it reaches 0.8161, a 3.0 percent gain over the base model (0.8161 divided by 0.7921 is 1.0303) [1]. The introduction’s more modest claim of 2.3 percent over baselines is measured against DeepFM after it was enhanced with the authors’ own text embeddings (0.8161 / 0.7976 = 1.0232, derived). Against plain DeepFM the gain is 3.1 percent (0.8161 / 0.7917 = 1.0308, derived). One more generous finding hides in the same table: injecting the LLM’s text embeddings into the baselines lifts them by 0.007 AUC on average, reported for four of the six baselines [1]. The text helps everyone, not just their model.
Precision at k is the rep-queue metric
The industrial table is where the paper earns its keep, and it uses a metric worth explaining because it is the one that matches how revenue teams actually work. P@K% is the share of the top K percent of the ranked queue that eventually converts [1]. A rep queue is exactly that: a cut from the top of a ranked list. Nobody works the whole list. P@0.1% answers the only question the morning standup asks: of the 50 leads we flagged, how many were real?
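The metric is simple enough to compute on your own closed leads. A minimal sketch, assuming you can export one score and one converted flag per lead; the function name and the choice to round the cut size are ours, not the paper's.

```python
def precision_at_top_percent(scores, converted, k_percent):
    """Share of the top k_percent of leads, ranked by score, that converted.

    scores: list of numbers, higher means work this lead sooner.
    converted: list of 0/1 outcomes aligned with scores.
    """
    if len(scores) != len(converted):
        raise ValueError("scores and converted must be the same length")
    n = len(scores)
    # Size of the cut; at least one lead so tiny queues still return a value.
    cut = max(1, round(n * k_percent / 100))
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    top = ranked[:cut]
    return sum(converted[i] for i in top) / cut
```

Called with `k_percent=0.1` on a 50,000-lead month, the cut is the 50-lead hot list; with `k_percent=1` it is the top 500.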
| Ranking method | AUC | P@0.1% | P@1% | [email protected]% |
|---|---|---|---|---|
| Funnel + recency heuristic | 0.6332 | 3.28% | 1.50% | 9.23% |
| Funnel + CTR (DeepFM, two-stage) | 0.6898 | 7.21% | 1.90% | 18.76% |
| Funnel + CTR (DeepFM, direct) | 0.7382 | 14.41% | 10.14% | 21.85% |
| asLLR without HPRO | 0.7491 | 18.44% | 11.56% | 23.94% |
| asLLR with HPRO (full) | 0.7583 | 25.76% | 13.33% | 25.18% |
| Relative lift vs asLLR without HPRO | +1.2% | +39.7% | +15.3% | +5.2% |
Walk the rows from the bottom of the capability ladder up. The recency heuristic, which is what an unsorted shared queue silently implements, puts 3.28 percent future buyers in the top 0.1 percent. A two-stage funnel-plus-CTR pipeline reaches 7.21 percent. The strongest external baseline, DeepFM trained directly on the task, reaches 14.41 percent. The authors’ LLM without the HPRO objective reaches 18.44 percent. With HPRO, 25.76 percent [1].
The comparator matters in both directions. The paper’s Table 2 caption is explicit that the relative lift row is computed against asLLR without HPRO [1], which is the conservative comparator, to the authors’ credit. But it also means the +39.7 percent answers “what does the ranking objective add” rather than “what do I gain over what I run today.” Against the strongest external baseline the lift is larger: 25.76 / 14.41 = 1.79, a 78.8 percent relative gain (derived). Against the recency heuristic it is 25.76 / 3.28 = 7.9x (derived).
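The same headline can be restated against each comparator in a few lines. A sketch using the top 0.1 percent precision values quoted from the industrial table; the dictionary keys and the `relative_lift` helper are our own naming.

```python
# Precision in the top 0.1 percent of the queue, in percent, per ranking method.
P_TOP = {
    "recency": 3.28,
    "deepfm_two_stage": 7.21,
    "deepfm_direct": 14.41,
    "asllr_no_hpro": 18.44,
    "asllr_full": 25.76,
}

def relative_lift(new, old):
    """Relative gain of `new` over `old`, in percent."""
    return (new / old - 1) * 100

full = P_TOP["asllr_full"]
for name, value in P_TOP.items():
    if name != "asllr_full":
        # Ratio and relative gain of the full model against this comparator.
        print(f"{name}: {full / value:.2f}x, {relative_lift(full, value):+.1f}%")
```

Swapping in your own current score as one more row is the quickest way to see which comparator a vendor number is implicitly using.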
Then they ran it live. For 132 days, all sales operations in one anonymized Chinese province were split into control and experimental groups of sales specialists, balanced on 7, 14, and 30-day lock-in counts [1]. Result: a 9.5 percent relative uplift in lead conversion, two-sided t-test, p below 0.001 [1]. The test reports no headcounts and no lead counts, so that p-value cannot be checked from outside the company [1].
The numbers, worked honestly
Two worked examples, then the traps. First, the hot list from the opening paragraph.
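A sketch of the hot-list arithmetic, assuming the 50,000-lead month from the opening and the paper's top 0.1 percent precision figures from the industrial table. The variable and function names are illustrative.

```python
MONTHLY_LEADS = 50_000
HOT_LIST_SIZE = round(MONTHLY_LEADS * 0.001)  # top 0.1 percent of the queue

# Precision in the top 0.1 percent, as fractions, from the industrial table.
PRECISION = {
    "recency": 0.0328,
    "best CTR baseline": 0.1441,
    "asLLR with HPRO": 0.2576,
}

def expected_buyers(list_size, precision):
    """Expected number of eventual buyers in a list of this size."""
    return list_size * precision

for method, p in PRECISION.items():
    buyers = expected_buyers(HOT_LIST_SIZE, p)
    print(f"{method}: {buyers:.1f} expected buyers of {HOT_LIST_SIZE}")
```

The counts are expectations at the paper's precision levels, not a forecast for any other funnel.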
Second, the field result, translated to a volume an operator can feel.
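A sketch of the field-result arithmetic, assuming an illustrative volume of 1,000 leads, the industrial base rate of 1.33 percent, and the reported 9.5 percent relative uplift. The function name is ours.

```python
def uplift_scenario(leads, base_rate, relative_uplift=0.095):
    """Lock-ins before and after a relative uplift, plus the difference."""
    before = leads * base_rate
    after = before * (1 + relative_uplift)
    return before, after, after - before

before, after, extra = uplift_scenario(leads=1_000, base_rate=0.0133)
print(f"before {before:.1f}, after {after:.1f}, extra {extra:.1f}")
```

Change `base_rate` to your own funnel's rate before quoting any deal count; the uplift is applied as a multiplier on the rate, never as added percentage points.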
The fragile assumption in both examples is the base rate. The paper’s funnels convert at 1.33 to 1.45 percent. Yours will not match, and the extra-deals figure per 1,000 leads scales linearly with your rate: halve the industrial rate to 0.665 percent and the gain is 1,000 x 0.00665 x 0.095 = 0.6 extra deals; at a 5 percent rate, warm B2B inbound territory, it is 1,000 x 0.05 x 0.095 = 4.8 (derived). The relative uplift is the portable claim, if it transfers at all. The absolute deal count is entirely a function of your own funnel.
| Base conversion rate | Lock-ins per 1,000 leads, before | Lock-ins per 1,000 leads, after (+9.5%) | Extra deals per 1,000 leads |
|---|---|---|---|
| 0.67% (half the paper's rate) | 6.7 | 7.3 | +0.6 |
| 1.33% (the paper's industrial rate) | 13.3 | 14.6 | +1.3 |
| 2.66% (double the paper's rate) | 26.6 | 29.1 | +2.5 |
| 5.00% (warm inbound territory) | 50.0 | 54.8 | +4.8 |
Three traps, in plain words, before anyone puts these numbers in a slide.
AUC sounds bigger than it is. An AUC of 0.8161 means the model usually ranks a future buyer above a non-buyer when you compare pairs. It does not mean the model is right 81 percent of the time. At a 1.33 percent base rate, even the best top 1 percent queue here converts at 13.33 percent [1], which means about 87 percent of the priority list still never buys (100 - 13.33, derived). Precision at the top of the list is the metric that matches payroll. AUC is the metric that makes papers look good.
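The pairwise reading of AUC can be made literal. A minimal sketch that counts, over every buyer and non-buyer pair, how often the buyer carries the higher score; ties count as half. The function name is ours, and the brute-force loop is meant for a small audit sample, not a full queue.

```python
def pairwise_auc(scores, converted):
    """AUC as the share of (buyer, non-buyer) pairs where the buyer scores higher."""
    pos = [s for s, y in zip(scores, converted) if y == 1]
    neg = [s for s, y in zip(scores, converted) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one buyer and one non-buyer")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # buyer ranked above non-buyer
            elif p == n:
                wins += 0.5      # tie counts as half a win
    return wins / (len(pos) * len(neg))
```

Because non-buyers vastly outnumber buyers at these base rates, a model can win most pairs and still fill the top of the list mostly with non-buyers, which is why the audit belongs at the top slice.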
Percent of a percent. The 9.5 percent uplift is relative. It moves conversion from about 1.33 percent to about 1.46 percent (1.33 x 1.095 = 1.456, derived). Any reading that adds 9.5 points to the base rate is wrong by roughly 7x.
The precision levels are not portable constants. If your base rate is different, all the P@K figures move, and not necessarily in proportion. The defensible takeaway is the ratio: roughly 1.8x the buyers of a tuned CTR model at the top of the list (25.76 / 14.41 = 1.79, derived). The buyer counts are illustration.
Why a funnel ladder carries signal
Say this plainly first: HPRO is an engineering paper. It demonstrates the lift. It does not test, or even propose, a behavioral mechanism for why funnel stages carry the signal it exploits. What follows is the lab’s interpretive scaffolding, sourced from primary literature and labeled by strength. Four mechanisms, and they stack rather than compete.
Small commitments predict big ones. The compliance literature has shown since the 1960s that people who take a small voluntary step toward a decision become more likely to take the larger step later, an effect tied to self-consistency [3] [4]. A test drive is not noise on the way to a purchase. It is a behavioral commitment that changes the odds of the purchase. HPRO’s tiers (lock-in beats test drive beats long call beats silence) work in part because each tier marks a deeper commitment. The escalation effect is established; its role as the carrier of this paper’s signal is only plausible, since the authors never decompose which stage contributes what.
Behavior is the strongest intent measure. Intentions predict actions with useful but variable accuracy, and the link is strongest when the intent measure is close to the behavior in time, specificity, and cost [5]. Funnel stages are intent measured by what people do, not what they say, which is the strongest form of the signal. There is even a rhyme with the pairwise framing: purchase intentions forecast sales better when collected comparatively rather than one at a time [6]. Established as a principle; plausible as applied to a Chinese NEV funnel’s stage labels.
Comparison is an easier judgment than absolute prediction. This one is about the instrument, not the buyer. Judging which of two things is greater is psychometrically easier and more reliable than assigning each an absolute value, a measurement result that is nearly a century old [7] and that Bradley-Terry formalized for paired comparisons [8]. The learning-theory twin is reward shaping: when the terminal outcome is rare, graded intermediate signals that point toward the goal speed and stabilize learning without changing what counts as success [9]. HPRO’s stage pairs are exactly that: frequent small lessons standing in for a label that arrives 1.33 percent of the time. The ablation supports this reading (25.76 versus 18.44 percent at the top slice [1]) but does not isolate pair density from the hierarchy’s ordering, so this mechanism’s role is plausible, not shown.
The words in the record carry intent that fields cannot. Decades of computerized text analysis show that word use tracks attention, emotion, and social dynamics [10], and field data shows the affective content of consumer text predicts conversion behavior beyond structured variables [11]. The paper’s own ladder is consistent: semantic loss alone moved AUC from 0.7921 to 0.8081 [1]. If language did not encode intent, that jump should not exist. We label this one plausible and hedge it: the strongest conversion evidence in the literature is about reviews influencing other buyers, not a buyer’s own call language predicting that buyer’s own purchase.
Why a revenue team should care
If those four mechanisms are why this works, the asset is already on your books. The deeper the commitment a stage marks, the more predictive it is, and any funnel that records intermediate actions (meetings held, replies sent, calls logged) already owns the training signal this paper monetizes. Comparison being easier than absolute prediction means even a hand-built tier sort captures part of the lift before any model is trained. That is why the Monday moves below start with a spreadsheet, not a procurement cycle.
What a revenue team does with this on Monday
One move per maturity level.
- Manual. Stop working the queue newest-first. Recency found 1.6 expected buyers per 50-lead hot list; the ranker found 12.9 (worked above). You cannot deploy the model on Monday, but you can deploy its idea by hand: sort today’s queue into three tiers by observed action (took a meeting or demo, engaged a call or reply, no action) and work the tiers top down. That is the paper’s Global Dominance, Key Action, Soft Signal hierarchy [1] done with a spreadsheet.
- Assisted. You have a CRM score. Audit it the way this paper audits baselines. Pull last quarter’s leads, rank by the score, and compute what share of your top 1 percent actually closed. The paper’s best CTR baseline hit 10.14 percent at P@1% on a 1.33 percent base rate [1]. If your score does not beat a simple action-tier sort, it is decoration.
- Orchestrated. Routing and sequencing already run on a score. Scope a ranker pilot that trains on funnel-stage preference pairs (won vs lost, meeting vs no meeting, long call vs short call) instead of binary labels. The label change, not model size, drove the lift here, and the backbone was a 1.5 to 1.8 billion parameter model with LoRA [1]: a commodity-compute project, not a frontier-model project.
- Autonomous. Agents already act on your scores, so a ranking error becomes an automated action error. Re-run the comparator audit before trusting any vendor headline: this paper’s +39.7 percent is against its own ablation, +78.8 percent against the external baseline, 7.9x against recency (arithmetic above). Demand vendor lift numbers against your current stack, then hold out a control group for at least a quarter, the way the paper held one province for 132 days [1].
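The Manual move above can be sketched in a few lines for anyone who prefers a script to a spreadsheet. Everything here is hypothetical scaffolding: the `Lead` fields, the three tier codes, and the within-tier tiebreak on most recent touch are our assumptions, not part of the paper.

```python
from dataclasses import dataclass

@dataclass
class Lead:
    lead_id: str
    took_meeting: bool      # a meeting or demo was held
    engaged: bool           # answered a call or sent a reply
    days_since_touch: int   # smaller means more recent

def action_tier(lead):
    """0 = meeting or demo, 1 = engaged call or reply, 2 = no action."""
    if lead.took_meeting:
        return 0
    if lead.engaged:
        return 1
    return 2

def tier_sorted(queue):
    """Work order: deepest observed action first, most recent touch first within a tier."""
    return sorted(queue, key=lambda lead: (action_tier(lead), lead.days_since_touch))
```

Recency survives only as a tiebreak inside each tier. That choice is a judgment call on our part; the point is that observed action, not arrival time, decides which tier a lead lands in.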
Where it lands, industry by industry
The authors state the transfer rule themselves, and it is the right one.
That is the test: does your funnel emit visible intermediate actions on the way to a rare final outcome? Five archetypes, parameterized rather than named, with the arithmetic carried over from the worked examples and the sensitivity table. Only the first is evidence; the rest are translation at stated base rates.
What this paper does not prove
The honest perimeter. This is an unreviewed preprint on proprietary data from the authors’ own employer, with no public code or dataset to rerun [1]. The field evidence is one province of one brand selling one product for 132 days, and the metric is lock-in conversions, not revenue: no dollar figure appears anywhere in the paper [1]. The A/B test omits the sample sizes behind its p-value [1]. Nothing here tests B2B, committee buying, or outbound prospecting, and nothing here replaces a salesperson: HPRO ranks leads for human follow-up [1]. A ranking model also optimizes the order of the list it is given. It does nothing for a list built on a dead ICP, which is a Market problem no scoring system can fix.
The wider context cuts both ways. The same weekly pull that surfaced this paper surfaced a benchmark for whether LLMs can realistically progress a sale at all, built from 1,805 curated multi-turn scenarios with a trained customer simulator [2]. Its much-shared result, top models scoring 6.74 against a human group’s 6.33, is Chinese-language only, judged by an LLM and a BERT classifier against simulated customers, with a human baseline the authors themselves describe as entry-level to intermediate [2]. The research community is specifying the AI SDR stack in public, component by component, and the honest parts of these papers are more useful than their abstracts.
What would change our read: an independent replication on a B2B funnel, or a public release of code and data. The standing search will catch either.