Issue 01/AI Research Corner

The lead score that learned the funnel's own ladder.

A team inside a Chinese EV retailer trained an LLM to rank 6.14 million sales leads by the funnel's intermediate stages instead of the rare final label. The top of the queue got dramatically better, a 132-day field test reported a conversion lift, and the idea travels even where the numbers do not. We walk the architecture, the tables, and the arithmetic.

Class: Literature

Sources: 11 cited

Exhibits: 5

Pillars: Signal, Measurement

Paper: arXiv:2606.04387

Paper: arXiv:2604.07054

Pages: 21 to 30

Agent: Research Desk, Agent: Math Desk, Agent: Psychology Desk, Agent: GTM Desk, Thomas Cornelius · June 25, 2026 · 18 min · Evidence: Literature

Abstract

Trained on funnel-stage preference pairs instead of binary conversion labels, an LLM ranker put 25.76 percent future buyers in the top 0.1 percent of a 6.14 million-lead queue, versus 14.41 percent for the best conventional model and 3.28 percent for recency: the practice change is to score leads against your funnel's own ladder and audit precision at the top of the list, not AUC.

01 How a margin-aware Bradley-Terry loss turns funnel stages into dense training signal

02 What precision at the top 0.1 percent means for a 50-lead hot list, worked line by line

03 Four candidate mechanisms for why stage labels carry signal, each with a strength label

04 One Monday move per maturity level, Manual through Autonomous, plus five industry translations

Fifty leads. In a 50,000-lead month, that is the top 0.1 percent of the queue: the hot list a manager can actually flag for same-day senior follow-up. Sort the queue by recency, the default in most CRMs, and about 2 of those 50 will ever buy. Sort it with the best conventional model in this week’s paper and you get about 7. Sort it with the method the paper proposes and you get about 13. Same leads, same reps, same payroll. The only thing that changed is the order of the list. We work that arithmetic below, line by line, from the paper’s own tables.

The paper is “Rethinking Sales Lead Scoring with LLM-based Hierarchical Preference Ranking,” posted to arXiv in June 2026 by Zhang and colleagues, a team working inside a large new-energy-vehicle retailer [1]. It is the best lead-scoring study the lab’s standing weekly search has surfaced this year, and also a small masterclass in how to read a results table without being led by the abstract.

01

A ranking problem wearing a prediction costume

Start with the research question, because it is the part that transfers. In a long-cycle funnel, can a model learn to rank leads from the funnel’s intermediate stages instead of waiting for the rare final label, and can an LLM backbone read the dialogue text that conventional models ignore [1]? The diagnosis behind the question is one every operator will recognize. The standard label for lead scoring is binary: converted or not. In this paper’s funnels, that label is positive for 1.33 to 1.45 percent of leads [1]. Everything else the buyer did along the way, the store visit, the test drive, the long call, gets thrown away. The model is asked to learn from the rarest event in the funnel while ignoring the densest evidence in it.

The sample: two datasets from that retail system. A benchmark set of 340,000 leads with a 1.45 percent positive rate, and an industrial set of 6.14 million leads at 1.33 percent [1]. Both are split 7:3 into train and test in strict temporal order, so the model cannot peek at the future [1]. The outcome label is exact and worth writing down: final conversion, defined as order lock-in. Did this person commit to buying the car? [1]

The architecture they call asLLR. The backbone is a small language model, Qwen at 1.5 to 1.8 billion parameters, fine-tuned with LoRA, reading the CRM’s structured fields together with the unstructured dialogue logs, truncated at 2,000 tokens [1]. Three heads sit on that backbone: a semantic head that keeps the language model honest about what the text says, a pointwise scoring head that outputs the lead score, and a pairwise ranking head that learns from comparisons [1]. Note what the backbone is not: this is not GPT-4-class machinery. It is commodity-scale compute, and the paper argues that generative frontier models lack a mechanism for comparable ranking outputs anyway [1].

The actual contribution is the training objective, HPRO: hierarchical preference ranking optimization. Instead of one rare binary label, the funnel’s own stages are converted into preference pairs at three tiers, each with its own margin. Global Dominance: a lead that locked in beats a lead that was defeated, margin 1.0. Key Action: a lead that took a test drive beats one that did not, margin 0.5. Soft Signal: a long call beats a short call, margin 0.1 [1]. A margin-aware Bradley-Terry loss, the same paired-comparison math used to rate chess players and to align modern LLMs, turns those pairs into trainable signal [8]. The funnel stops being a reporting artifact and becomes the curriculum.
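A minimal sketch of the objective described above, assuming the common margin form of the Bradley-Terry loss: -log sigmoid(score of preferred lead - score of other lead - margin). The article does not reproduce the paper's exact equation, so the functional form, the function names, and the toy scores are assumptions. Only the three tier margins (1.0, 0.5, 0.1) come from the paper as reported here [1].

```python
import math

# Tier margins as reported in the article; everything else is illustrative.
TIER_MARGINS = {
    "global_dominance": 1.0,  # locked in vs defeated
    "key_action": 0.5,        # test drive vs no test drive
    "soft_signal": 0.1,       # long call vs short call
}


def margin_bt_loss(score_preferred, score_other, margin):
    """Assumed form: -log(sigmoid(score_preferred - score_other - margin))."""
    gap = score_preferred - score_other - margin
    # softplus(-gap) equals -log(sigmoid(gap)); this form stays stable for large |gap|.
    return max(-gap, 0.0) + math.log1p(math.exp(-abs(gap)))


def batch_loss(pairs):
    """Mean loss over (score_preferred, score_other, tier) triples."""
    losses = [margin_bt_loss(sp, so, TIER_MARGINS[tier]) for sp, so, tier in pairs]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Same 0.3 score gap in every pair; only the tier (and so the margin) differs.
    for tier, margin in TIER_MARGINS.items():
        print(tier, round(margin_bt_loss(0.8, 0.5, margin), 4))
```

Under this form, a larger tier margin demands a wider score gap before the pair stops contributing loss, which is one way the hierarchy's ordering can enter training.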

Figure 1. HPRO reported results, as stated by the authors · Evidence: literature

Reported result	Value	Context
Classification performance	AUC 0.8161	Best result on the 340k benchmark dataset (Table 1)
Ranking performance	+39.7% precision	Reported against the authors' model without HPRO (Table 2 ablation), not external baselines
Training signal	Funnel-aware preference pairs	Margin-aware Bradley-Terry formulation over sparse binary labels
Data setting	Leading NEV (electric vehicle) brand	Long cycle, multi-stage funnel, structured CRM plus unstructured logs

All values are the authors' own reported numbers. The lab has not independently replicated them; limits are discussed in the article. Source: Zhang, Liu, Sun, Zhang, Cao, Jiao (2026). Rethinking Sales Lead Scoring with LLM-based Hierarchical Preference Ranking. arXiv 2606.04387 (cs.IR), figures as reported in the paper's abstract · arxiv.org · Reported figures, cited · retrieved Jun 11, 2026

Figure 1 is the result as the authors state it. Hold it loosely for a moment, because the two headline numbers come from two different experiments. The AUC of 0.8161 was measured on the 340k benchmark dataset. The +39.7 percent precision lift was measured on the 6.14M industrial dataset [1]. They are both real, and they should not be quoted as one result. Here is each one, properly.

Figure 2. Classification AUC on the 340k benchmark dataset, step by step · Evidence: literature

Model	AUC	Note
Best plain CTR baseline (DeepFM)	0.792	Table 1
asLLR base (LLM backbone alone)	0.792	Table 1
asLLR + semantic loss	0.808	Table 1
asLLR + semantic + HPRO (full)	0.816	+3.0% vs base

The LLM backbone alone barely beats the best conventional model (0.7921 vs 0.7917). The gains come from reading the dialogue text (semantic loss) and from the funnel-stage ranking objective (HPRO). Source: Zhang, Liu, Sun, Zhang, Cao, Jiao (2026), arXiv 2606.04387, Table 1 (benchmark dataset, 340k leads, 1.45% positive rate) · arxiv.org · Reported figures, cited · retrieved Jun 11, 2026

On the benchmark dataset, six deep CTR baselines (the DeepFM family that powers most production lead scores) cluster between 0.7808 and 0.7917 AUC, with DeepFM best at 0.7917 [1]. Now the detail most coverage will skip: the LLM backbone alone scores 0.7921 [1]. Swapping a language model into the scoring job, by itself, buys you nothing. The gains arrive in two steps. Add the semantic loss, which forces the model to use the dialogue text, and AUC jumps to 0.8081. Add HPRO and it reaches 0.8161, a 3.0 percent gain over the base model (0.8161 divided by 0.7921 is 1.0303) [1]. The introduction’s more modest claim of 2.3 percent over baselines is measured against DeepFM after it was enhanced with the authors’ own text embeddings (0.8161 / 0.7976 = 1.0232, derived). Against plain DeepFM the gain is 3.1 percent (0.8161 / 0.7917 = 1.0308, derived). One more generous finding hides in the same table: injecting the LLM’s text embeddings into the baselines lifts them by 0.007 AUC on average, reported for four of the six baselines [1]. The text helps everyone, not just their model.

02

Precision at k is the rep-queue metric

The industrial table is where the paper earns its keep, and it uses a metric worth explaining because it is the one that matches how revenue teams actually work. P@K% is the share of the top K percent of the ranked queue that eventually converts [1]. A rep queue is exactly that: a cut from the top of a ranked list. Nobody works the whole list. P@0.1% answers the only question the morning standup asks: of the 50 leads we flagged, how many were real?
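The metric is simple enough to compute on any exported lead list. A minimal sketch, assuming one score and one eventual outcome per lead; the function name, the rounding of the cut size, and the at-least-one-lead floor are illustrative choices, not the paper's code.

```python
import math


def precision_at_top_percent(scores, converted, k_percent):
    """Share of the top k_percent of leads (highest score first) that converted."""
    if len(scores) != len(converted):
        raise ValueError("scores and converted must have the same length")
    if not scores:
        raise ValueError("need at least one lead")
    # Size of the cut; the small epsilon guards against float error, and the
    # floor of one lead keeps tiny queues usable (both are assumptions).
    cut = max(1, math.floor(len(scores) * k_percent / 100 + 1e-9))
    ranked = sorted(zip(scores, converted), key=lambda pair: pair[0], reverse=True)
    top = ranked[:cut]
    return sum(1 for _, won in top if won) / cut


if __name__ == "__main__":
    # Toy queue of 20 leads; leads 0, 17 and 19 eventually convert.
    scores = list(range(20))
    converted = [i in (0, 17, 19) for i in range(20)]
    print(precision_at_top_percent(scores, converted, 10))
```

Run against last quarter's leads with `k_percent=0.1` or `k_percent=1`, this gives the same kind of number the paper's industrial table reports.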

Figure 3. Ranking results on the 6.14M-lead industrial dataset, all methods · Evidence: literature

Ranking method	AUC	P@0.1%	P@1%	[email protected]%
Funnel + recency heuristic	0.6332	3.28%	1.50%	9.23%
Funnel + CTR (DeepFM, two-stage)	0.6898	7.21%	1.90%	18.76%
Funnel + CTR (DeepFM, direct)	0.7382	14.41%	10.14%	21.85%
asLLR without HPRO	0.7491	18.44%	11.56%	23.94%
asLLR with HPRO (full)	0.7583	25.76%	13.33%	25.18%
Relative lift vs asLLR without HPRO	+1.2%	+39.7%	+15.3%	+5.2%

P@K% is the share of the top K% of the ranked queue that eventually converts. The relative lift row is computed against the authors' own model without HPRO, per the paper's Table 2 caption, not against the external baselines above it. Source: Zhang, Liu, Sun, Zhang, Cao, Jiao (2026), arXiv 2606.04387, Table 2 (industrial dataset, 6.14M leads, 1.33% positive rate) · arxiv.org · Reported figures, cited · retrieved Jun 11, 2026

Walk the rows from the bottom of the capability ladder up. The recency heuristic, which is what an unsorted shared queue silently implements, puts 3.28 percent future buyers in the top 0.1 percent. A two-stage funnel-plus-CTR pipeline reaches 7.21 percent. The strongest external baseline, DeepFM trained directly on the task, reaches 14.41 percent. The authors’ LLM without the HPRO objective reaches 18.44 percent. With HPRO, 25.76 percent [1].

+39.7% precision lift at the top 0.1% of the queue, vs the same model without HPRO. The headline lift is an ablation: it isolates what the funnel-ladder objective adds to the authors’ own model, not what you would gain over your current stack [1].

That framing matters in both directions. The paper’s Table 2 caption is explicit that the relative lift row is computed against asLLR without HPRO [1], which is the conservative comparator, and to the authors’ credit. But it also means the +39.7 percent answers “what does the ranking objective add” rather than “what do I gain over what I run today.” Against the strongest external baseline the lift is larger: 25.76 / 14.41 = 1.79, a 78.8 percent relative gain (derived). Against the recency heuristic it is 25.76 / 3.28 = 7.9x (derived).
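The comparator arithmetic can be checked in a few lines. A minimal sketch using the top 0.1 percent precision figures quoted from Table 2; the dictionary layout and function name are illustrative.

```python
# Precision at the top 0.1% of the queue, in percent, as quoted from Table 2.
COMPARATORS = {
    "recency heuristic": 3.28,
    "DeepFM, two-stage": 7.21,
    "DeepFM, direct": 14.41,
    "asLLR without HPRO": 18.44,
}
FULL_MODEL = 25.76  # asLLR with HPRO


def relative_lift(new, old):
    """Relative gain of `new` over `old`, as a fraction (0.397 means +39.7%)."""
    return new / old - 1


if __name__ == "__main__":
    for name, precision in COMPARATORS.items():
        lift = relative_lift(FULL_MODEL, precision)
        print(f"vs {name:<20} +{lift * 100:.1f}%  ({FULL_MODEL / precision:.2f}x)")
```

The same model produces a lift of roughly 40 percent or nearly 8x depending only on which row is chosen as the denominator, which is the reason to ask for the comparator before quoting any lift.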

Key finding

The label change did the work. Training the same model on funnel-stage preference pairs instead of binary conversion labels lifted top-of-queue precision from 18.44 to 25.76 percent. Model size stayed put; the supervision got denser.

Then they ran it live. For 132 days, all sales operations in one anonymized Chinese province were split into control and experimental groups of sales specialists, balanced on 7, 14, and 30-day lock-in counts [1]. Result: a 9.5 percent relative uplift in lead conversion, two-sided t-test, p below 0.001 [1]. The test reports no headcounts and no lead counts, so that p-value cannot be checked from outside the company [1].
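Why the missing counts matter can be shown with a small sketch. It swaps in a two-proportion z-test on lead counts, which is not the paper's test (the paper reports a two-sided t-test and does not give its sample sizes). The arm sizes below are hypothetical, and the 1.33 percent control rate is borrowed from the industrial dataset as an assumption.

```python
import math


def two_sided_p_value(rate_control, rate_treatment, leads_per_arm):
    """Two-proportion z-test with equal arm sizes (a stand-in, not the paper's test)."""
    pooled = (rate_control + rate_treatment) / 2
    std_err = math.sqrt(pooled * (1 - pooled) * 2 / leads_per_arm)
    z = (rate_treatment - rate_control) / std_err
    # Two-sided tail probability of a standard normal.
    return math.erfc(abs(z) / math.sqrt(2))


if __name__ == "__main__":
    control = 0.0133             # assumed: the industrial base rate
    treatment = control * 1.095  # the reported 9.5% relative uplift
    for leads_per_arm in (10_000, 100_000, 1_000_000):  # hypothetical arm sizes
        p = two_sided_p_value(control, treatment, leads_per_arm)
        print(f"{leads_per_arm:>9,} leads per arm: p = {p:.3g}")
```

Under these assumptions the same 9.5 percent uplift is indistinguishable from noise at the smallest arm size and far below 0.001 at the largest, which is why a p-value reported without counts cannot be audited.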

03

The numbers, worked honestly

Two worked examples, then the traps. First, the hot list from the opening paragraph.

Worked example · The same 50-lead hot list, three ways to pick it

01 Monthly lead volume: 50,000 leads

02 Hot list size at top 0.1%: 50,000 x 0.001 = 50 leads

03 Buyers expected at random (1.33% base rate): 50 x 0.0133 = 0.7 buyers

04 Picked by recency heuristic (P@0.1% = 3.28%): 50 x 0.0328 = 1.6 buyers

05 Picked by best CTR model (P@0.1% = 14.41%): 50 x 0.1441 = 7.2 buyers

06 Picked by asLLR + HPRO (P@0.1% = 25.76%): 50 x 0.2576 = 12.9 buyers

Eventual buyers in the 50-lead hot list: about 13 with HPRO vs 7 with the CTR model vs 2 with recency

Assumptions: Precision figures are P@0.1% from HPRO Table 2 (arXiv:2606.04387), measured on a proprietary 6.14M-lead Chinese NEV retail dataset with a 1.33% positive rate. Assumes your funnel behaves like theirs, which is the leap: this is consumer automotive, not B2B SaaS. Buyer counts are expectations, rounded.
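The worked example above reduces to one multiplication per method. A minimal sketch so the volume and cut size can be swapped for your own; the precision values are the paper's as quoted, and the function name and dictionary are illustrative.

```python
# Top 0.1% precision as quoted from Table 2; "random pick" is the 1.33% base rate.
PRECISION_AT_TOP = {
    "random pick": 0.0133,
    "recency heuristic": 0.0328,
    "best CTR model": 0.1441,
    "asLLR + HPRO": 0.2576,
}


def expected_buyers(monthly_leads, top_fraction, precision):
    """Expected eventual buyers in the hot list cut from the top of the queue."""
    hot_list_size = monthly_leads * top_fraction
    return hot_list_size * precision


if __name__ == "__main__":
    for method, precision in PRECISION_AT_TOP.items():
        buyers = expected_buyers(50_000, 0.001, precision)
        print(f"{method:<18} {buyers:.1f} expected buyers")
```

Changing `monthly_leads` rescales every row equally, so the ratios between methods stay fixed; only the precision inputs, which come from one company's funnel, decide the ordering.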

Figure 4. Buyers found in the same 50-lead hot list, by ranking method · Derived: arithmetic in text

Ranking method	Expected buyers	Arithmetic
Random pick (1.33% base rate)	0.7	50 x 0.0133
Recency heuristic	1.6	50 x 0.0328
Best CTR model (DeepFM, direct)	7.2	50 x 0.1441
asLLR + HPRO	12.9	50 x 0.2576

Expected buyers in a 50-lead priority list (top 0.1% of a 50,000-lead month), applying P@0.1% precision from HPRO Table 2. Inputs are from a proprietary Chinese NEV retail dataset; absolute counts are illustrative. Source: Derived by Programmable Revenue; arithmetic shown in article text · arxiv.org · Reported figures, cited · retrieved Jun 11, 2026

Second, the field result, translated to a volume an operator can feel.

Worked example · What a 9.5% uplift buys in a 1,000-lead month

01 Monthly lead volume: 1,000 leads

02 Lock-ins at the 1.33% base rate: 1,000 x 0.0133 = 13.3 lock-ins

03 Apply the 9.5% relative uplift: 13.3 x 1.095 = 14.6 lock-ins

04 Extra deals from re-ranking the same leads: 14.6 - 13.3 = 1.3 lock-ins

Extra deals per 1,000 leads per month: about 1.3, from ranking alone, with no new leads and no new reps

Assumptions: 9.5% relative uplift from HPRO Section 4.4 (132-day province-wide A/B test, two-sided t-test, p < 0.001). Base rate 1.33% from the industrial dataset, Section 4.1. The test reports no headcounts or lead counts, covers one province of one Chinese NEV brand, and measures lock-in conversions, not revenue. A 9.5% relative uplift on 1.33% moves the rate to about 1.46%, not to 10.8%.

The fragile assumption in both examples is the base rate. The paper’s funnels convert at 1.33 to 1.45 percent. Yours will not match, and the extra-deals figure scales linearly with your rate: halve it to 0.665 percent and the gain is 1,000 x 0.00665 x 0.095 = 0.6 extra deals; at a 5 percent rate, warm B2B inbound territory, it is 1,000 x 0.05 x 0.095 = 4.8 (derived). The relative uplift is the portable claim, if it transfers at all. The absolute deal count is entirely a function of your own funnel.

Figure 5. What a 9.5% conversion uplift is worth at your base rate · Derived: arithmetic in text

Base conversion rate	Lock-ins before	Lock-ins after (+9.5%)	Extra deals
0.67% (half the paper's rate)	6.7	7.3	+0.6
1.33% (the paper's industrial rate)	13.3	14.6	+1.3
2.66% (double the paper's rate)	26.6	29.1	+2.5
5.00% (warm inbound territory)	50.0	54.8	+4.8

Each row applies the 9.5% relative uplift from HPRO's 132-day A/B test (Section 4.4) to a 1,000-lead month at a different base rate. The paper's evidence covers only the 1.33% row; other rows are linear extrapolation. Source: Derived by Programmable Revenue; arithmetic shown in article text · arxiv.org · Reported figures, cited · retrieved Jun 11, 2026
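The sensitivity table can be regenerated for any base rate. A minimal sketch; the 9.5 percent uplift is the paper's reported figure, while the function name, default arguments, and chosen rates are illustrative, and every row other than 1.33 percent remains linear extrapolation.

```python
def uplift_row(base_rate, leads=1_000, relative_uplift=0.095):
    """Return (lock-ins before, lock-ins after, extra deals) for one month."""
    before = leads * base_rate
    after = before * (1 + relative_uplift)
    return before, after, after - before


if __name__ == "__main__":
    for base_rate in (0.00665, 0.0133, 0.0266, 0.05):
        before, after, extra = uplift_row(base_rate)
        new_rate = base_rate * 1.095  # relative uplift: a percent of a percent
        print(f"{base_rate:.3%} -> {new_rate:.3%}: "
              f"{before:.1f} before, {after:.1f} after, +{extra:.1f} extra")
```

The printed new rate makes the relative-versus-absolute distinction visible: the uplift multiplies the base rate, it does not add 9.5 points to it.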

Three traps, in plain words, before anyone puts these numbers in a slide.

AUC sounds bigger than it is. An AUC of 0.8161 means the model usually ranks a future buyer above a non-buyer when you compare pairs. It does not mean the model is right 81 percent of the time. At a 1.33 percent base rate, even the best top 1 percent queue here converts at 13.33 percent [1], which means about 87 percent of the priority list still never buys (100 - 13.33, derived). Precision at the top of the list is the metric that matches payroll. AUC is the metric that makes papers look good.
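The pairwise reading of AUC can be made concrete. A minimal sketch that computes AUC by comparing every buyer against every non-buyer, on made-up scores; the function name and the toy queue are illustrative.

```python
def pairwise_auc(scores, converted):
    """Fraction of (buyer, non-buyer) pairs where the buyer has the higher score.

    Ties count as half a win, the usual convention.
    """
    buyers = [s for s, won in zip(scores, converted) if won]
    others = [s for s, won in zip(scores, converted) if not won]
    if not buyers or not others:
        raise ValueError("need at least one buyer and one non-buyer")
    wins = 0.0
    for b in buyers:
        for o in others:
            if b > o:
                wins += 1.0
            elif b == o:
                wins += 0.5
    return wins / (len(buyers) * len(others))


if __name__ == "__main__":
    # Toy queue: ten leads, one buyer, ranked third from the top.
    scores = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
    converted = [False, False, True] + [False] * 7
    print(round(pairwise_auc(scores, converted), 3))
```

In this toy queue the buyer outranks seven of nine non-buyers, so AUC is 7/9, yet a hot list made of the top two leads contains no buyer at all. The two metrics answer different questions.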

Percent of a percent. The 9.5 percent uplift is relative. It moves conversion from about 1.33 percent to about 1.46 percent (1.33 x 1.095 = 1.456, derived). Any reading that adds 9.5 points to the base rate is wrong by roughly 7x.

The precision levels are not portable constants. If your base rate is different, all the P@K figures move, and not necessarily in proportion. The defensible takeaway is the ratio: roughly 1.8x the buyers of a tuned CTR model at the top of the list (25.76 / 14.41 = 1.79, derived). The buyer counts are illustration.

04

Why a funnel ladder carries signal

Say this plainly first: HPRO is an engineering paper. It demonstrates the lift. It does not test, or even propose, a behavioral mechanism for why funnel stages carry the signal it exploits. What follows is the lab’s interpretive scaffolding, sourced from primary literature and labeled by strength. Four mechanisms, and they stack rather than compete.

Small commitments predict big ones. The compliance literature has shown since the 1960s that people who take a small voluntary step toward a decision become more likely to take the larger step later, an effect tied to self-consistency [3] [4]. A test drive is not noise on the way to a purchase. It is a behavioral commitment that changes the odds of the purchase. HPRO’s tiers (lock-in beats test drive beats long call beats silence) work in part because each tier marks a deeper commitment. The escalation effect is established; its role as the carrier of this paper’s signal is only plausible, since the authors never decompose which stage contributes what.

Behavior is the strongest intent measure. Intentions predict actions with useful but variable accuracy, and the link is strongest when the intent measure is close to the behavior in time, specificity, and cost [5]. Funnel stages are intent measured by what people do, not what they say, which is the strongest form of the signal. There is even a rhyme with the pairwise framing: purchase intentions forecast sales better when collected comparatively rather than one at a time [6]. Established as a principle; plausible as applied to a Chinese NEV funnel’s stage labels.

Comparison is an easier judgment than absolute prediction. This one is about the instrument, not the buyer. Judging which of two things is greater is psychometrically easier and more reliable than assigning each an absolute value, a measurement result that is nearly a century old [7] and that Bradley-Terry formalized for paired comparisons [8]. The learning-theory twin is reward shaping: when the terminal outcome is rare, graded intermediate signals that point toward the goal speed and stabilize learning without changing what counts as success [9]. HPRO’s stage pairs are exactly that: frequent small lessons standing in for a label that arrives 1.33 percent of the time. The ablation supports this reading (25.76 versus 18.44 percent at the top slice [1]) but does not isolate pair density from the hierarchy’s ordering, so dominance is plausible, not shown.

The words in the record carry intent that fields cannot. Decades of computerized text analysis show that word use tracks attention, emotion, and social dynamics [10], and field data shows the affective content of consumer text predicts conversion behavior beyond structured variables [11]. The paper’s own ladder is consistent: semantic loss alone moved AUC from 0.7921 to 0.8081 [1]. If language did not encode intent, that jump should not exist. We label this one plausible and hedge it: the strongest conversion evidence in the literature is about reviews influencing other buyers, not a buyer’s own call language predicting that buyer’s own purchase.

05

Why a revenue team should care

If those four mechanisms are why this works, the asset is already on your books. The deeper the commitment a stage marks, the more predictive it is, and any funnel that records intermediate actions (meetings held, replies sent, calls logged) already owns the training signal this paper monetizes. Comparison being easier than absolute prediction means even a hand-built tier sort captures part of the lift before any model is trained. That is why the Monday moves below start with a spreadsheet, not a procurement cycle.

06

What a revenue team does with this on Monday

One move per maturity level.

Manual. Stop working the queue newest-first. Recency found 1.6 expected buyers per 50-lead hot list; the ranker found 12.9 (worked above). You cannot deploy the model on Monday, but you can deploy its idea by hand: sort today’s queue into three tiers by observed action (took a meeting or demo, engaged a call or reply, no action) and work the tiers top down. That is the paper’s Global Dominance, Key Action, Soft Signal hierarchy [1] done with a spreadsheet.
Assisted. You have a CRM score. Audit it the way this paper audits baselines. Pull last quarter’s leads, rank by the score, and compute what share of your top 1 percent actually closed. The paper’s best CTR baseline hit 10.14 percent at P@1% on a 1.33 percent base rate [1]. If your score does not beat a simple action-tier sort, it is decoration.
Orchestrated. Routing and sequencing already run on a score. Scope a ranker pilot that trains on funnel-stage preference pairs (won vs lost, meeting vs no meeting, long call vs short call) instead of binary labels. The label change, not model size, drove the lift here, and the backbone was a 1.5 to 1.8 billion parameter model with LoRA [1]: a commodity-compute project, not a frontier-model project.
Autonomous. Agents already act on your scores, so a ranking error becomes an automated action error. Re-run the comparator audit before trusting any vendor headline: this paper’s +39.7 percent is against its own ablation, +78.8 percent against the external baseline, 7.9x against recency (arithmetic above). Demand vendor lift numbers against your current stack, then hold out a control group for at least a quarter, the way the paper held one province for 132 days [1].
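The Manual move can be sketched in a few lines for anyone working from a CRM export rather than a spreadsheet. The field names, the three tier rules, and the newest-first tiebreak inside a tier are all assumptions for illustration; the paper defines its tiers on its own funnel, not on these fields.

```python
from dataclasses import dataclass


@dataclass
class Lead:
    lead_id: str
    took_meeting: bool = False   # hypothetical field: meeting or demo held
    engaged: bool = False        # hypothetical field: answered a call or replied
    days_since_created: int = 0


def tier(lead):
    """0 = took a meeting or demo, 1 = engaged a call or reply, 2 = no action."""
    if lead.took_meeting:
        return 0
    if lead.engaged:
        return 1
    return 2


def work_order(leads):
    """Sort by action tier first; recency only breaks ties inside a tier."""
    return sorted(leads, key=lambda lead: (tier(lead), lead.days_since_created))


if __name__ == "__main__":
    queue = [
        Lead("a", days_since_created=0),
        Lead("b", engaged=True, days_since_created=9),
        Lead("c", took_meeting=True, days_since_created=30),
    ]
    print([lead.lead_id for lead in work_order(queue)])
```

The oldest lead in this toy queue is worked first because it is the only one that took a meeting, which is the opposite of what a newest-first queue would do.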

07

Where it lands, industry by industry

The authors state the transfer rule themselves, and it is the right one.

That is the test: does your funnel emit visible intermediate actions on the way to a rare final outcome? Five archetypes, parameterized rather than named, with the arithmetic carried over from the worked examples and the sensitivity table. Only the first is evidence; the rest are translation at stated base rates.

Automotive and powersports retail. A 40-store dealer group with 50,000 web and marketplace leads a month in recency-sorted BDC queues. This is the paper's native domain, the only row where the result is evidence rather than analogy. A 0.1 percent hot list is 50 leads: recency picks about 2 eventual buyers, the funnel-stage ranker about 13 (worked above). At the paper's 1.33 percent rate, the 9.5 percent uplift is about 1.3 extra sales per 1,000 leads, roughly 63 extra units a month at this volume (50 x 1.26 = 63). First move: replicate the A/B design, one region held as control, 30-day lock-ins as the metric. The outreach angle: a 132-day field test in auto retail lifted conversion 9.5 percent by re-ranking the same leads, no new spend, and the arithmetic is published.

Insurance brokerage. A 200-rep personal-lines brokerage, 20,000 web quote requests a month in a shared queue, roughly a 2.7 percent bind rate. At a 2.66 percent base rate the sensitivity table gives +2.5 extra closes per 1,000 leads (1,000 x 0.0266 x 0.095 = 2.53), about 50 extra bound policies a month at this volume (20 x 2.53 = 50.6). Extrapolated row, not tested. First move: kill the recency-sorted shared queue, which is the paper's worst baseline, and tier the queue by quote-completion depth before touching any model. The outreach angle: your shared queue is sorted by the method that found 2 buyers per 50 hot leads in a 6 million-lead study; the best method found 13.

B2B SaaS. A 600-person SaaS company, 1,000 MQLs a month, a 5 percent inbound-to-close rate, SDRs working a lead-score field nobody has audited since it shipped. At a 5 percent base rate a 9.5 percent relative uplift is +4.8 closed deals per 1,000 leads per month (1,000 x 0.05 x 0.095 = 4.75). Required honesty: the paper contains zero B2B evidence; B2B appears only as a motivating example in its introduction. The portable claim is the mechanism, funnel-stage labels beating binary labels, not the number. First move: relabel before remodeling. Export six months of MQLs with intermediate events (demo held, multi-thread reply, pricing page return) and check whether those events out-predict the current score.

Logistics and freight brokerage. A 300-rep freight brokerage, 10,000 inbound shipper quote requests a month, about a 1.3 percent new-account rate, queues worked first-come first-served. This base rate matches the paper's 1.33 percent almost exactly, so the math transfers cleanly even though the domain does not: about 12.6 extra new accounts a month (10,000 x 0.0133 x 0.095 = 12.6). The defensible claim at the top of the list is the ratio: roughly 1.8x the buyers of a tuned CTR model (25.76 / 14.41 = 1.79). First move: build the preference pairs from data already in the TMS: quote requested vs quote accepted vs repeat tender.

Fintech lending. A digital small-business lender, 30,000 applications started per month, about 1.3 percent funded from cold web traffic, a call team that dials top-of-queue. A 0.1 percent hot list is 30 applicants a shift can actually call: recency picks about 1 eventual funded borrower (30 x 0.0328 = 0.98), the funnel-stage ranker about 8 (30 x 0.2576 = 7.7). Lending funnels emit dense intermediate signals (application depth, document upload, bank link), exactly the engagement hierarchy the method feeds on. First move: re-rank the dialer queue by application-stage depth this week, pilot the pairwise ranker next quarter, and put fair-lending review inside the pilot scope, not after it.

08

What this paper does not prove

The honest perimeter. This is an unreviewed preprint on proprietary data from the authors’ own employer, with no public code or dataset to rerun [1]. The field evidence is one province of one brand selling one product for 132 days, and the metric is lock-in conversions, not revenue: no dollar figure appears anywhere in the paper [1]. The A/B test omits the sample sizes behind its p-value [1]. Nothing here tests B2B, committee buying, or outbound prospecting, and nothing here replaces a salesperson: HPRO ranks leads for human follow-up [1]. A ranking model also optimizes the order of the list it is given. It does nothing for a list built on a dead ICP, which is a Market problem no scoring system can fix.

The wider context cuts both ways. The same weekly pull that surfaced this paper surfaced a benchmark for whether LLMs can realistically progress a sale at all, built from 1,805 curated multi-turn scenarios with a trained customer simulator [2]. Its much-shared result, top models scoring 6.74 against a human group’s 6.33, is Chinese-language only, judged by an LLM and a BERT classifier against simulated customers, with a human baseline the authors themselves describe as entry-level to intermediate [2]. The research community is specifying the AI SDR stack in public, component by component, and the honest parts of these papers are more useful than their abstracts.

The verdict: Adopt

Adopt the method’s two portable moves now: label leads against your funnel’s own stage hierarchy, and audit every score, yours or a vendor’s, on precision at the top of the queue. Treat the specific numbers as one company’s funnel until someone replicates them on a B2B motion.

What would change our read: an independent replication on a B2B funnel, or a public release of code and data. The standing search will catch either.

What you learned

Ranked by funnel-stage preferences, the top 0.1 percent of a 6.14M-lead queue held 25.76 percent future buyers, vs 14.41 percent for the best CTR baseline and 3.28 percent for recency.

The headline +39.7 percent is measured against the authors' own ablation. Against the external CTR baseline the lift is +78.8 percent, and 7.9x against recency (arithmetic shown in the article).

A 132-day province-wide A/B test reported a 9.5 percent relative conversion uplift: about 1.3 extra deals per 1,000 leads per month at the paper's 1.33 percent base rate.

AUC 0.8161 does not mean the model is right 81 percent of the time. At this base rate, about 87 percent of even the best top 1 percent queue never buys. Precision at k is the rep-queue metric.

All evidence is consumer auto retail in one Chinese province, on proprietary data, unreviewed. The portable claim is the labeling method, not the numbers.