What the directories are signalling: the research community is industrializing the SDR stack.

Every Sunday night the lab runs the same queries against the same five research directories: arXiv, OpenAlex, Semantic Scholar, the Hugging Face daily papers feed, and the NBER new-papers stream [5]. The queries never change. That is the point. A fixed instrument read on a schedule turns “what is everyone talking about” into a measurement. This week the instrument returned its first reading, and the reading has a shape. The research community is not asking whether AI can help sales. It is writing the parts list for an industrialized SDR stack, one subsystem at a time.

45 unique GTM-relevant papers, one 90-day window, five directories

The count after deduplication across directories. The census is the lab’s own measurement of where serious research effort is going, not a popularity ranking.

015 min left

The pull

The directory census is below. The more interesting census is by topic.

Figure 1. Candidates returned by the standing weekly query set, by directory Evidence: field

OpenAlex 18 papers

arXiv 12 papers

NBER RSS 8 papers

HF daily papers 6 papers engagement-filtered

Semantic Scholar 1 papers rate-limited this run

45 unique candidates after dedupe. The pull is the lab's own measurement of where GTM-relevant research is appearing; it is a sourcing census, not a popularity ranking. Source: Programmable Revenue newsroom candidates pull, week 2026-W24 (scripts/newsroom/fetch-candidates.mjs), standing query set over a 90-day window, deduped by identifier · Tenbound original measurement · retrieved Jun 11, 2026

Forty-five papers survived dedupe [5]. Read the titles and four clusters dominate. Each cluster is a subsystem of the same machine, and each one earns a one-sentence claim and a one-sentence move.

Lead scoring is being rebuilt on LLMs. The strongest candidate of the week proposes hierarchical preference ranking over CRM logs for long-cycle B2B funnels [1]. So what: if your funnel is long-cycle and your CRM logs intermediate events, this research line trains on your exact data shape. The move: audit your current score’s labels now, before the pitches built on this work arrive.
Selling skill is being benchmarked. A bilingual benchmark of realistic multi-turn selling, with a trained customer simulator and outcome-level scoring, tests whether models can actually progress a deal [2]. Role-inversion rate, in the paper’s definition, is the percentage of generated turns in which a model drops its assigned role and adopts the salesperson persona, pitching products instead of playing its part [2]; outcome scoring is its counterpart, grading the conversation on deal progression rather than fluency. So what: the due-diligence vocabulary for AI SDR vendors is now public. The move: open every demo with role-inversion rate and outcome scoring; the question sorts vendors in one meeting. Nobody builds a benchmark for a capability nobody intends to deploy.
The AI seller is shipping in fast-cycle markets first. A socially intelligent virtual host for live commerce [3] is an AI seller with a target, deployed where deals close in minutes. So what: live commerce is the proving ground, and what survives a minutes-long cycle gets repackaged for longer cycles next. The move: watch the adjacent market. It is the cheapest forward indicator a B2B operator can hold.
Attribution is becoming a modeling target of its own. Treatment-gated uplift modeling for B2B sales [4] asks not “who will convert” but “who converts because we acted.” So what: that is the budget owner’s question, and it is now a modeling target. The move: start logging treatment events now. Any team reporting raw conversion against no counterfactual should expect to argue against this framing within a budget cycle.

Figure 2. The four clusters in the week 2026-W24 pull, and the operator question each one earns Evidence: field

Cluster	What the paper attempts	The operator question
Lead scoring	Hierarchical preference ranking over CRM logs for long-cycle B2B funnels	What labels does your score train on?
Selling skill	Outcome-scored benchmark of realistic multi-turn selling with a trained customer simulator	Show me role-inversion rate and outcome scoring
The AI seller	A socially intelligent virtual host selling in live commerce, where deals close in minutes	What survived the minutes-long cycle?
Attribution	Treatment-gated uplift modeling: who converts because we acted, not who will convert	What is your counterfactual?

A qualitative census of the four dominant clusters. No performance numbers appear because the census reports research effort, not results. Source: Programmable Revenue newsroom candidates pull, week 2026-W24 (scripts/newsroom/fetch-candidates.mjs); cluster characterizations describe what each paper attempts, as read in the article text, not measured results · Tenbound original measurement · retrieved Jun 11, 2026

Key finding

Four subsystems of an SDR stack (scoring, skill evaluation, the agent itself, attribution) got rigorous public treatment inside one 90-day window, and none of it trended on social.

023 min left

Why the directories beat the feed

None of these papers trended this week [5]. That is not a flaw in the reading. It is the reason the reading exists, and two documented mechanisms from the psychology literature explain why feed-driven scanning misleads an operator.

Availability (established). People judge how frequent or important something is by how easily examples come to mind [6]. A feed is an availability machine: it makes the viral effortless to recall, so the operator who reads only feeds systematically overweights what is entertaining and underweights what is merely important. The papers above are merely important.

Social influence dynamics (established). When popularity signals are visible, popularity feeds on itself, and which items win becomes far less predictable from underlying quality [7]. A ranked feed therefore measures the social cascade, not the supply of serious work underneath it. A fixed query set read on a schedule measures the supply.

To be plain about the evidence: this census tests neither mechanism. It is a sourcing instrument, not an experiment. The two mechanisms are the reason to trust a fixed instrument over a ranked feed, and both are established in the primary literature.

032 min left

What holding the search costs

The practice change is modest and concrete. Someone on your team holds the standing search. Nobody reads 45 papers. The arithmetic of the load:

Worked example · The weekly reading load, derived from the pull

01 Weeks in the 90-day window 90 / 7 = 12.9 weeks

02 Surviving papers per week (derived) 45 / 12.9 = 3.5 papers

Weekly load for the search holder about 3.5 surviving papers, roughly 3 worth a note

Assumptions: Window length and dedupe count as reported in the week 2026-W24 pull. The cadence of surfacing the three that matter is this section's editorial standard, not a measured rate.

Figure 3. The weekly reading load, derived from the week 2026-W24 pull Derived: arithmetic in text

Quantity	Value	How
Unique papers, 90-day window	45	Reported in the W24 pull, after dedupe
Weeks in the window	12.9	90 / 7 (derived)
Surviving papers per week	3.5	45 / 12.9 (derived)

The cost of holding the standing search: about 3.5 surviving papers a week, not 45. The cadence of surfacing the few that matter is the section's editorial standard, not a measured rate. Source: Derived by Programmable Revenue; arithmetic shown in article text · Tenbound original measurement · retrieved Jun 11, 2026

Three and a half papers a week is one coffee, and the return is timing: the operator who holds the search reads the vocabulary of next year’s vendor pitches before the vendors do the pitching. If a vendor demos an “AI SDR” next quarter, the operator who has read the benchmark paper [2] knows the first question to ask: show me your role-inversion rate and your outcome scoring, not your demo script.

041 min left

Monday morning, by maturity level

Manual. Assign the standing search. One person in RevOps holds the fixed query set weekly and sends a one-paragraph note to the GTM leads each Monday.
Assisted. Wire the fixed queries into directory alerts so the holder curates instead of searches. The deliverable stays the same Monday note.
Orchestrated. Route the weekly note into your vendor-evaluation checklist, and open every AI SDR demo with role-inversion rate and outcome scoring [2].
Autonomous. If agents already run parts of your outreach, score them on the outcome-level criteria the benchmark cluster describes [2], and export clean CRM stage-event history now, so you are pilot-ready when the scoring research ships as product [1].

051 min left

Where this lands

B2B SaaS A 400-person SaaS company with 8 SDRs, an unaudited CRM lead score, and at least one AI SDR pitch expected next quarter: three of the four clusters point here. The scoring research targets its data shape, the benchmark arms its next vendor evaluation, and uplift attribution reframes how marketing-sourced pipeline gets defended. First move: one person in RevOps holds the standing search, about 3.5 surviving papers a week, and writes the Monday note. Vendor screen: open the demo with role-inversion rate and outcome scoring, not the demo script. Angle: your next vendor pitch is being written in the directories now; we publish the three papers that matter each week.

Outsourced SDR agencies A 300-seat provider whose clients are asking whether the seats should be agents: the benchmark cluster is the direct hit, and agencies sit on both sides of it. They will be asked the benchmark question and should ask it of every tool they resell. First move: run the agency's own scripts and every resold AI tool against outcome-level criteria before a client asks, and put the answer in the sales deck. Vendor screen, applied inward: an agency that cannot state its own role-inversion rate loses the evaluation it could have won. Angle: the agencies that can answer 'show me your role-inversion rate' before being asked win those evaluations.

E-commerce and live retail A consumer brand running 20 live shopping events a month with human hosts, evaluating AI host pilots: this market sits at the front of the deployment curve, so the census is present tense here, not a forward indicator. First move: pilot with a benchmark mindset, outcome-level scoring per event (deal progression, not engagement vanity), with matched human-hosted events as the comparison. Vendor screen: ask the host vendor how it scores outcomes, and reject engagement-only reporting. Angle: the papers and the evaluation method are both public at once, which is rare and exploitable; we track what is moving from lab to product in your category.

Commercial insurance brokerage A 150-rep commercial-lines brokerage with multi-month placement cycles and a CRM full of stage events (submission, quote, proposal, bind): the scoring cluster's stated target is exactly this data shape, and the uplift question is the principal's question about every marketing dollar. First move: inventory the stage events the CRM already logs and export clean stage-event history, so the firm is pilot-ready before the demos start. Vendor screen: ask which funnel-stage labels the model was trained on, and compare them to the firm's own event inventory. Angle: the papers are public about a year before the products, on our read of the directories; we brief your team on what they actually claim.

061 min left

The honest caveats

A sourcing census is not a popularity ranking, and it is not a results paper. The four cluster claims describe what each paper attempts, not what it achieves, and no performance numbers print here. All four cluster papers are arXiv preprints, and none has passed peer review, which is exactly why this census reports what they attempt rather than what they achieve. Our query set has a recall bias toward the terms we chose, the 90-day window favors arXiv’s posting cadence, and one directory was rate-limited this run, stated in the figure [5]. “A year before it ships as product” is the lab’s editorial claim about directories versus feeds, not a measured lag. The week a reading is thin, this section will say so or sit out. That is the standing deal.