What the directories are signalling: the research community is industrializing the SDR stack.
The lab's first standing weekly pull of the academic directories returned 45 GTM-relevant papers from the last 90 days. The pattern in them is not subtle.
A fixed weekly query of five research directories returned 45 GTM-relevant papers in one 90-day window, and together they specify an industrialized SDR stack: hold the standing search, because vendor pitches tend to arrive about a year behind the papers.
Every Sunday night the lab runs the same queries against the same five research directories: arXiv, OpenAlex, Semantic Scholar, the Hugging Face daily papers feed, and the NBER new-papers stream [5]. The queries never change. That is the point. A fixed instrument read on a schedule turns “what is everyone talking about” into a measurement. This week the instrument returned its first reading, and the reading has a shape. The research community is not asking whether AI can help sales. It is writing the parts list for an industrialized SDR stack, one subsystem at a time.
The count of 45 is taken after deduplication across directories. The census is the lab’s own measurement of where serious research effort is going, not a popularity ranking.
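Deduplication across directories can be sketched as follows. This is a minimal illustration, not the lab's pipeline: the `Paper` record, the title-normalization rule, and the sample titles are all assumptions made for the example, and a production version would likely also match on identifiers such as DOIs or arXiv IDs.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Paper:
    """Hypothetical record shape; real directory responses carry more fields."""
    title: str
    source: str  # directory the record came from, e.g. "arxiv"


def normalize_title(title: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", title.lower()).split())


def dedupe(papers):
    """Keep the first record per normalized title and note every source that had it."""
    unique = {}
    for paper in papers:
        key = normalize_title(paper.title)
        if key not in unique:
            unique[key] = {"title": paper.title, "sources": [paper.source]}
        elif paper.source not in unique[key]["sources"]:
            unique[key]["sources"].append(paper.source)
    return list(unique.values())


# Placeholder titles, invented for illustration only.
pull = [
    Paper("Example Paper: A Study", "arxiv"),
    Paper("example paper - a study", "openalex"),
    Paper("Another Example", "semantic_scholar"),
]
survivors = dedupe(pull)
print(len(survivors))
```

Matching on normalized titles alone is a deliberate simplification: it is easy to audit by eye, at the cost of missing retitled versions of the same paper.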
The pull
The directory census is below. The more interesting census is by topic.
Forty-five papers survived dedupe [5]. Read the titles and four clusters dominate. Each cluster is a subsystem of the same machine, and each one earns a one-sentence claim and a one-sentence move.
- Lead scoring is being rebuilt on LLMs. The strongest candidate of the week proposes hierarchical preference ranking over CRM logs for long-cycle B2B funnels [1]. So what: if your funnel is long-cycle and your CRM logs intermediate events, this research line trains on your exact data shape. The move: audit your current score’s labels now, before the pitches built on this work arrive.
- Selling skill is being benchmarked. A bilingual benchmark of realistic multi-turn selling, with a trained customer simulator and outcome-level scoring, tests whether models can actually progress a deal [2]. Role-inversion rate, in the paper’s definition, is the percentage of generated turns in which a model drops its assigned role and adopts the salesperson persona, pitching products instead of playing its part [2]; outcome scoring is its counterpart, grading the conversation on deal progression rather than fluency. So what: the due-diligence vocabulary for AI SDR vendors is now public. The move: open every demo with role-inversion rate and outcome scoring; the question sorts vendors in one meeting. Nobody builds a benchmark for a capability nobody intends to deploy.
- The AI seller is shipping in fast-cycle markets first. A socially intelligent virtual host for live commerce [3] is an AI seller with a target, deployed where deals close in minutes. So what: live commerce is the proving ground, and what survives a minutes-long cycle gets repackaged for longer cycles next. The move: watch the adjacent market. It is the cheapest forward indicator a B2B operator can hold.
- Attribution is becoming a modeling target of its own. Treatment-gated uplift modeling for B2B sales [4] asks not “who will convert” but “who converts because we acted.” So what: that is the budget owner’s question, and it is now a modeling target. The move: start logging treatment events now. Any team reporting raw conversion against no counterfactual should expect to argue against this framing within a budget cycle.
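The role-inversion rate named in the selling-skill cluster is a simple proportion, and a short sketch makes the vendor question concrete. The per-turn labels below are made up, and how a turn gets flagged as a role inversion (human annotation, a classifier, or something else) is an assumption left open here; only the "percentage of generated turns" definition comes from the text above.

```python
def role_inversion_rate(turn_flags):
    """Percentage of generated turns flagged as role inversions.

    `turn_flags` holds one boolean per generated turn: True when the model
    dropped its assigned role and started pitching like a salesperson.
    """
    if not turn_flags:
        raise ValueError("need at least one generated turn")
    inverted = sum(1 for flagged in turn_flags if flagged)
    return 100.0 * inverted / len(turn_flags)


# Illustrative labels for eight generated turns (not real benchmark data).
flags = [False, False, True, False, False, False, True, False]
rate = role_inversion_rate(flags)
print(f"{rate:.1f}%")
```

A vendor that can produce this number has, at minimum, logged and labeled its model's turns, which is itself useful to learn in a demo.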
| Cluster | What the paper attempts | The operator question |
|---|---|---|
| Lead scoring | Hierarchical preference ranking over CRM logs for long-cycle B2B funnels | What labels does your score train on? |
| Selling skill | Outcome-scored benchmark of realistic multi-turn selling with a trained customer simulator | What is your role-inversion rate, and how do you score outcomes? |
| The AI seller | A socially intelligent virtual host selling in live commerce, where deals close in minutes | What survived the minutes-long cycle? |
| Attribution | Treatment-gated uplift modeling: who converts because we acted, not who will convert | What is your counterfactual? |
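The gap between "who will convert" and "who converts because we acted" shows up even in the simplest possible estimate. The sketch below is a naive two-group comparison, not the treatment-gated model of [4]; the `Account` record and its data are hypothetical, and the estimate is only meaningful when treated and untreated accounts are otherwise comparable (for example, when outreach was assigned at random).

```python
from dataclasses import dataclass


@dataclass
class Account:
    """Hypothetical CRM row: was outreach sent, and did the account convert?"""
    treated: bool
    converted: bool


def conversion_rate(accounts):
    """Share of accounts in the group that converted."""
    if not accounts:
        raise ValueError("group is empty; no rate to compute")
    return sum(1 for a in accounts if a.converted) / len(accounts)


def naive_uplift(accounts):
    """Treated conversion rate minus untreated conversion rate."""
    treated = [a for a in accounts if a.treated]
    control = [a for a in accounts if not a.treated]
    return conversion_rate(treated) - conversion_rate(control)


# Made-up log: 2 of 4 treated accounts convert, 1 of 4 untreated accounts convert.
log = (
    [Account(True, True)] * 2 + [Account(True, False)] * 2
    + [Account(False, True)] * 1 + [Account(False, False)] * 3
)
print(f"raw treated conversion: {conversion_rate([a for a in log if a.treated]):.2f}")
print(f"naive uplift: {naive_uplift(log):.2f}")
```

Neither number can be computed without a logged treatment flag, which is why the move above is to start logging treatment events.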
Why the directories beat the feed
None of these papers trended this week [5]. That is not a flaw in the reading. It is the reason the reading exists, and two documented mechanisms from the psychology literature explain why feed-driven scanning misleads an operator.
Availability (established). People judge how frequent or important something is by how easily examples come to mind [6]. A feed is an availability machine: it makes the viral effortless to recall, so the operator who reads only feeds systematically overweights what is entertaining and underweights what is merely important. The papers above are merely important.
Social influence dynamics (established). When popularity signals are visible, popularity feeds on itself, and which items win becomes far less predictable from underlying quality [7]. A ranked feed therefore measures the social cascade, not the supply of serious work underneath it. A fixed query set read on a schedule measures the supply.
To be plain about the evidence: this census tests neither mechanism. It is a sourcing instrument, not an experiment. The two mechanisms are the reason to trust a fixed instrument over a ranked feed, and both are established in the primary literature.
What holding the search costs
The practice change is modest and concrete. Someone on your team holds the standing search. Nobody reads 45 papers. The arithmetic of the load:
| Quantity | Value | How |
|---|---|---|
| Unique papers, 90-day window | 45 | Reported in the W24 pull, after dedupe |
| Weeks in the window | 12.9 | 90 / 7 (derived) |
| Surviving papers per week | 3.5 | 45 / 12.9 (derived) |
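The derived rows can be reproduced in a few lines. The variable names are illustrative; the inputs are the 45 papers and the 90-day window reported above.

```python
unique_papers = 45   # reported count after dedupe
window_days = 90     # length of the pull window

weeks_in_window = window_days / 7                   # 12.857..., shown as 12.9
papers_per_week = unique_papers / weeks_in_window   # shown as 3.5

print(f"weeks in window: {weeks_in_window:.1f}")
print(f"papers per week: {papers_per_week:.1f}")
```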
Three and a half papers a week is one coffee, and the return is timing: the operator who holds the search reads the vocabulary of next year’s vendor pitches before the vendors do the pitching. If a vendor demos an “AI SDR” next quarter, the operator who has read the benchmark paper [2] knows the first question to ask: show me your role-inversion rate and your outcome scoring, not your demo script.
Monday morning, by maturity level
- Manual. Assign the standing search. One person in RevOps runs the fixed query set weekly and sends a one-paragraph note to the GTM leads each Monday.
- Assisted. Wire the fixed queries into directory alerts so the holder curates instead of searches. The deliverable stays the same Monday note.
- Orchestrated. Route the weekly note into your vendor-evaluation checklist, and open every AI SDR demo with role-inversion rate and outcome scoring [2].
- Autonomous. If agents already run parts of your outreach, score them on the outcome-level criteria the benchmark cluster describes [2], and export clean CRM stage-event history now, so you are pilot-ready when the scoring research ships as product [1].
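For the Assisted level, one way to keep the query set truly fixed is to hold it in code and generate the request URLs from it. The sketch below covers arXiv only, using its public query API; the query strings are hypothetical (the lab's actual terms are not given here), and the other four directories have their own interfaces that are not shown.

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"

# Hypothetical query set: a tuple, so it cannot be edited in place between runs.
FIXED_QUERIES = (
    'all:"lead scoring"',
    'all:"uplift modeling"',
)


def arxiv_query_url(search_query, max_results=50):
    """Build an arXiv API URL that returns the newest submissions first."""
    params = {
        "search_query": search_query,
        "start": 0,
        "max_results": max_results,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
    }
    return f"{ARXIV_API}?{urlencode(params)}"


weekly_urls = [arxiv_query_url(query) for query in FIXED_QUERIES]
for url in weekly_urls:
    print(url)
```

The function only builds URLs and makes no network call, so the same query set yields the same requests every week; fetching, parsing the Atom response, and filtering to the 90-day window are left to whatever scheduler the team already uses.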
Where this lands
The honest caveats
A sourcing census is not a popularity ranking, and it is not a results paper. All four cluster papers are arXiv preprints and none has passed peer review, so the four cluster claims describe what each paper attempts, not what it achieves, and no performance numbers print here. Our query set has a recall bias toward the terms we chose, the 90-day window favors arXiv’s posting cadence, and one directory was rate-limited this run, as noted in the figure [5]. “A year before it ships as product” is the lab’s editorial claim about directories versus feeds, not a measured lag. In a week when the reading is thin, this section will say so or sit out. That is the standing deal.