Spend the AI dividend on the pillars that compound, not on volume.
AI hands every revenue team hours back, and most teams spend them on more sends. A field experiment on 758 consultants working both sides of the capability frontier says where the dividend actually pays: inside the frontier, on Signal and Message, with a human holding every judgment call.
In a 758-consultant field experiment, AI made elite knowledge workers 25.1 percent faster and up to 42.5 percent better on tasks inside its capability frontier, and 19 percentage points less likely to be correct on a judgment task just outside it. Spend the AI dividend on Signal and Message, and keep disposition calls human until you have mapped your own frontier.
Every revenue team is about to collect the same windfall. The hours that used to go into drafting, researching, and queueing come back, and they come back at every competitor’s desk at the same time. So the strategic question of the next eight quarters is not how large your AI dividend is. It is what you spend it on. The prevailing instinct says volume: more touches, more sequences, more sends. A field experiment on 758 elite consultants says that instinct funds the one thing your competitors can copy for free, and starves the two things they cannot.
Start with why volume is the wrong account. Volume was an advantage while human hours capped it. The moment AI removes the cap for you, it removes it for everyone, and an advantage everyone holds is background noise. The buyer’s inbox is where the noise accumulates. This issue’s feature documents what attention does under load: the value of an inbound touch decays within the first hour, and most funnels already respond far outside that window [2]. Multiplying the number of touches does not widen the window. It narrows everyone’s.
A bad message was always bad. AI just makes it faster. If the binding constraint on your pipeline is relevance, and for most teams past Manual it is, adding volume tightens the constraint you were already losing to. The better allocation comes from reading the strongest evidence we have on where AI actually pays, and that evidence is unusually specific. The specificity is the strategy.
What the 758-consultant experiment actually did
In spring 2023, a nine-author team ran two pre-registered randomized field experiments inside Boston Consulting Group [1]. The subjects were 758 individual contributor strategy consultants, about 7 percent of BCG’s global cohort at that level, volunteers who each gave five hours [1]. The tool was GPT-4 as it stood at the end of April 2023, default settings, accessed through a company platform [1].
The design is what makes the paper a strategy document. Every subject first completed an unaided assessment task, a skill baseline. Then each was randomized into one of three conditions: no AI, GPT-4 access (the paper’s “GPT Only” arm), or GPT-4 plus a prompt engineering overview with videos and documents (the “GPT + Overview” arm), stratified on gender, location, tenure, openness to innovation, and native English status [1].
Then the researchers split the sample across the AI capability boundary, with no overlap. 385 consultants worked inside the frontier: 18 questions on conceptualizing a footwear product for niche markets, spanning creativity, analytical reasoning, writing, and persuasion, in a fixed 90 minutes. 373 worked outside it: a business case on a fictional company that combined spreadsheet financials with insider interviews, deliberately constructed so that GPT-4 would err in the analysis [1].
| Task type | Effect of AI use | As reported |
|---|---|---|
| Inside the frontier (18 realistic consulting tasks) | More output, faster, better | 12.2% more tasks completed, 25.1% faster, quality significantly higher |
| Outside the frontier (task selected to exceed AI capability) | Worse decisions | 19 percentage points less likely to produce correct solutions |
The outcomes were defined exactly, and the exactness matters. Inside the frontier: quality, scored by two human graders per response and averaged into a composite across the 18 questions, with a parallel GPT-4-graded composite; completion, the share of the 18 questions finished; speed, seconds to reach the final question. Outside the frontier: a binary, did the consultant deliver the accurate recommendation, plus a 1 to 10 recommendation quality score against an author-built rubric [1].
Inside the frontier, AI looked like a gift with no price. Consultants with AI “completed 12.2% more tasks on average, and completed tasks 25.1% more quickly” [1]. On human-graded quality, the trained arm scored 1.75 points above a control mean of 4.1, a 42.5 percent gain; the untrained arm gained 38 percent. Completion rose from 82.4 percent of tasks to about 93 percent in the trained arm and about 91 percent in the untrained arm. A sweep of 108 regressions across every pre-registered quality variable found a significant positive effect in all of them [1].
| Outcome | Control | GPT Only | GPT + Overview |
|---|---|---|---|
| Quality, human grades (composite, 1 to 10) | 4.1 | +1.56 (38%) | +1.75 (42.5%) |
| Quality, GPT-4 grades | 7.2 | +1.22 (16.8%) | +1.35 (18.6%) |
| Tasks completed (share of 18) | 82.4% | about 91% | about 93% |
| Time to final question | 5,023 seconds | 27.6% faster | 22.5% faster |
Outside the frontier, the sign flipped. Consultants working unaided gave the correct recommendation 84.4 percent of the time. With GPT-4, 70 percent. With GPT-4 plus prompt training, 60 percent [1].
And the wrong answers did not arrive looking shaky. AI users outside the frontier finished faster too, 30 percent faster in the trained arm and 18 percent in the untrained, and their graded recommendation quality rose even when the recommendation itself was wrong [1]. That is the shape of the cliff: not error, but polished, accelerated, confident error.
Two more results sharpen the allocation. First, the distribution of the gains: the bottom half of the skill baseline improved 43 percent against their own prior scores, the top half 17 percent [1]. The dividend raises the floor before it raises the ceiling. Second, the behavior. The modal consultant kept AI output at about 0.87 similarity to what the model produced, close to copy-paste, and higher retention correlated with higher grades. The ideas that came out were better graded and less varied, and a simulated condition of 100 independent ChatGPT sessions with no human at all was the most homogeneous of everything measured [1].
From 244 consultants on the outside-frontier task, the authors also describe two working styles: Centaurs, who divide the work at a clean line and hand chosen subtasks to the model while keeping the rest, and Cyborgs, who weave the model through every step. The typology is descriptive. The paper does not establish that either style performs better, and the authors say the analysis is ongoing [1]. What the typology does establish is that integration style is a choice, which means it can be a policy.
The dividend and the tax, in working numbers
Now translate it to a pipeline, with the transfer assumption stated out loud: drafting first touches, summarizing accounts, and writing persuasively are the same task family as the experiment’s inside-frontier questions, and a qualify-or-kill call that blends data with context has the structure of its outside-frontier case. The numbers below are ours, derived from the paper’s coefficients with the arithmetic shown.
Note what the coverage line says. The control group only finished 82.4 percent of its tasks [1]. Your unaided team has the same gap: leads that never get a real first touch because the clock ran out. The roughly 100 extra covered leads per 1,000 are the cheapest pipeline in the building, because they were already paid for.
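The coverage gap can be checked with a short Python sketch. It is our illustration, not the paper's: it assumes the study's task-completion shares transfer one-for-one to lead coverage, it uses the essay's "about" figures for the two AI arms, and the 1,000-lead month is an archetype volume.

```python
# Assumption: task-completion shares from the study map directly onto
# the share of leads that receive a real first touch.
LEADS_PER_MONTH = 1_000          # archetype volume, not a measured figure
CONTROL_COMPLETION = 0.824       # unaided completion share reported in the study
ASSISTED_COMPLETION = {          # approximate shares ("about 91%" and "about 93%")
    "GPT Only": 0.91,
    "GPT + Overview": 0.93,
}


def extra_covered_leads(leads: int, control: float, assisted: float) -> float:
    """Leads that get a first touch only because completion rose."""
    return leads * (assisted - control)


for arm, rate in ASSISTED_COMPLETION.items():
    extra = extra_covered_leads(LEADS_PER_MONTH, CONTROL_COMPLETION, rate)
    print(f"{arm}: about {extra:.0f} extra covered leads per {LEADS_PER_MONTH:,}")
```

Under these assumptions the two arms bracket the "roughly 100" figure, with one a little below it and one a little above.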
Then the other side of the ledger.
How fragile are these numbers? Halve the speed gain to 12.6 percent and the 1,000-lead month, 100 drafting hours at 6 minutes per draft, still frees 12.6 hours (100 x 0.126 = 12.6). Halve the correctness drop to 9.6 points and the 200-lead queue still takes about 19 extra wrong calls (200 x 0.096 = 19.2), roughly one per ten leads. The directional claims survive a halving. The exact counts do not, and we do not need them to.
| Scenario | Speed gain | Hours freed per 100 drafting hours | Correctness drop | Extra wrong calls per 200-lead queue |
|---|---|---|---|---|
| Effect halves | 12.6% | 12.6 | 9.6 pp | 19 |
| As reported | 25.1% | 25.1 | 19.2 pp | 38 |
| Effect doubles | 50.2% | 50.2 | 38.4 pp | 77 |
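The sensitivity rows can be regenerated from the two coefficients, which makes it easy to try other multipliers. This is a minimal sketch under the essay's own assumptions (100 drafting hours, a 200-lead queue); the function and constant names are ours.

```python
# Sensitivity sketch: scale the two reported coefficients and recompute the rows.
SPEED_GAIN = 0.251         # inside-frontier speed gain, as reported
CORRECTNESS_DROP = 0.192   # outside-frontier correctness drop used in the table
DRAFTING_HOURS = 100       # assumption: 1,000 leads x 6 minutes per draft
QUEUE_SIZE = 200           # assumption: leads in the disposition queue


def scenario(multiplier: float) -> dict:
    """Return hours freed and extra wrong calls when both effects are scaled."""
    return {
        "hours_freed": DRAFTING_HOURS * SPEED_GAIN * multiplier,
        "extra_wrong_calls": QUEUE_SIZE * CORRECTNESS_DROP * multiplier,
    }


for label, multiplier in [("Effect halves", 0.5), ("As reported", 1.0), ("Effect doubles", 2.0)]:
    row = scenario(multiplier)
    print(f"{label}: {row['hours_freed']:.2f} hours freed, "
          f"about {row['extra_wrong_calls']:.0f} extra wrong calls")
```

The unrounded halved speed gain is 12.55 percent, which the table shows as 12.6.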
Three traps to carry out of this section. First, the completion gain (“12.2 percent”) is relative, about 10 points on an 82.4 percent base (0.824 x 0.122 = 0.1005), while the correctness loss (“19 percentage points”) is absolute, a 22.7 percent relative drop (0.192 / 0.844 = 0.227). Quoted side by side as bare percentages, the gain is flattered and the loss is understated. Second, the denominator picks the headline: the same quality effect reads as 42.5 percent against human graders and 18.6 percent against the more lenient GPT-4 grader, whose generosity inflates the control baseline [1]. Third, graded quality is not revenue: the paper contains no dollars, pipeline, or conversion anywhere, so every hour and lead count above is our arithmetic on their coefficients, not their finding.
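The relative-versus-absolute trap is easy to check mechanically. The sketch below puts both headline figures on both scales; the helper names are ours, and the inputs are the rates quoted in this section.

```python
def relative_gain_to_points(base_rate: float, relative_gain: float) -> float:
    """Convert a relative change on a base rate into percentage points."""
    return base_rate * relative_gain * 100


def points_to_relative_change(base_rate: float, points: float) -> float:
    """Convert a percentage-point change into a change relative to the base."""
    return (points / 100) / base_rate


# Completion: a 12.2% relative gain on an 82.4% base.
completion_points = relative_gain_to_points(0.824, 0.122)
# Correctness: a 19.2-point absolute loss on an 84.4% base.
correctness_relative = points_to_relative_change(0.844, 19.2)

print(f"Completion gain: {completion_points:.1f} points on an 82.4% base")
print(f"Correctness loss: {correctness_relative:.1%} relative to an 84.4% base")
```

Compared on either common scale, the loss is roughly twice the size of the gain.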
Why people follow the tool over the cliff
The experiment measured outcomes, not mechanisms. Its two mechanism-adjacent analyses are descriptive, and the authors say so: the retention measure cannot distinguish abdication of judgment from high-quality iterative prompting, and the Centaur and Cyborg work is early [1]. So the explanations below are imported from the primary literature, each with a strength label. They are candidate mechanisms for the pattern, not findings of the featured paper.
Automation bias, established as a general finding. When a tool is right most of the time, people stop checking it, and they accept its output on exactly the cases where it fails [4]. Recent human-computer interaction work shows people accept incorrect AI recommendations unless the interface forces a moment of thought first [5]. Its application to the 19-point penalty is plausible, not tested. The GTM translation: a rep who has watched the model nail twenty drafts in a row ships the twenty-first without reading it, and that is the draft that costs the deal.
Miscalibrated trust, established as a principle. Appropriate reliance requires that trust track what the tool can do case by case, but people calibrate on surface cues: past success, confident tone, apparent task similarity [6]. People often weight algorithmic advice more heavily than human advice, especially outside their own expertise [7]. This is the best fit for the study’s most uncomfortable result, the trained arm doing worse on correctness, since training plausibly raised trust and retention without raising verification. The paper does not test that reading, and the literature also documents the opposite pattern, algorithm aversion, under other conditions [7].
Skill compression, established and replicated. A generative model carries a compressed version of expert practice, so less experienced workers borrow it directly and gain the most. The pattern shows up in a field experiment on customer support agents, where the newest workers gained most [8], in an incentivized writing experiment where inequality between writers fell [9], and in the featured paper’s own 43 versus 17 percent split [1]. Expect the dividend to arrive as consistency before brilliance.
Anchoring on the fluent draft, plausible. People adjust insufficiently from a starting value [10], and offloading work to an available aid is the default low-effort path [11]. Together these predict heavy retention of model output and convergence across users, which is what the paper observes descriptively: retention near copy-paste and less varied ideas [1]. This mechanism is the pollution argument’s engine. When every team starts from the same fluent draft, every team’s outbound converges on the same message, which is exactly how a volume dividend becomes noise.
The allocation
Hold the study’s two sides in one frame and the pillar allocation writes itself, claiming only what the evidence supports.
Signal and Message get the dividend. Drafting a relevance-grounded first touch, summarizing an account’s public record, writing persuasively from a researched signal: this is the experiment’s inside-frontier task family, where quality, speed, and completion all rose [1]. Reading a funnel’s preference hierarchy out of CRM logs, as this issue’s Research Corner paper does, is the same direction of travel [3]. These investments compound: a better signal read improves every downstream touch, a better message lifts every conversion after it.
Motion gets discipline, not volume. The win in Motion is responding inside the decay window, not touching more names [2]. And the homogenization result is a warning label on volume specifically: more sends drafted from the same model converge on the same message [1].
Mastery gets the frontier map. The boundary is invisible from inside the work, and prompt training is not a seatbelt: the trained arm lost more correctness than the untrained one [1]. Someone on the team owns knowing what the tools reliably do, what they reliably ruin, and retesting quarterly, because the study covers one model at one moment and the frontier moves with every release.
Volume gets nothing. Not because volume never worked, but because it is now the one input every competitor can replicate at zero marginal cost. Strategy is spending where they cannot follow by copying a prompt.
The practice change is accounting. Take this quarter’s AI time savings and book them explicitly, as if they were budget, because they are. Then allocate on paper: what share went to more touches, what share to better signal reads and messages, what share to mapping the frontier. Teams that do this honestly tend to find they spent the whole dividend on volume without ever deciding to. The deciding is the strategy.
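One way to keep that ledger is a few lines of Python over a time log. This is a sketch only: the pillar labels and hour figures are hypothetical placeholders, chosen to sum to the 25.1 hours a 100-hour drafting month would free at the reported speed gain.

```python
from collections import defaultdict

# Hypothetical log of where freed hours went this quarter: (pillar, hours).
time_log = [
    ("volume", 14.0),
    ("signal", 4.0),
    ("message", 5.0),
    ("frontier_map", 2.1),
]


def dividend_shares(entries):
    """Return each pillar's share of the booked AI time savings."""
    totals = defaultdict(float)
    for pillar, hours in entries:
        totals[pillar] += hours
    booked = sum(totals.values())
    if booked == 0:
        raise ValueError("no hours booked; nothing to allocate")
    return {pillar: hours / booked for pillar, hours in totals.items()}


for pillar, share in dividend_shares(time_log).items():
    print(f"{pillar}: {share:.0%} of the dividend")
```

The output matters less than the habit: once the shares are written down, spending most of the dividend on volume becomes a visible decision.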
Monday morning, by maturity level
Manual. Give every rep AI for one task only, first-touch drafting, and time it: minutes per draft this week unaided, next week assisted. The study’s benchmark is a 25.1 percent speed gain on creative and persuasive writing [1]; even at half that, a 100-hour drafting month frees 12.6 hours (100 x 0.126). Keep AI away from qualification. You have no instrumentation yet to catch the tax.
Assisted. Run a frontier audit of your sequence. List every step a rep or tool touches and mark it drafting (inside) or judgment (outside: qualify or kill, route or hold, price or pass). Pull AI out of exactly one judgment step where it currently recommends decisions. The benchmark cost of misplacing that step is about 29 to 49 extra wrong calls per 200-lead queue, from the drops of 14.4 and 24.4 points in the two arms (200 x 0.144 = 28.8; 200 x 0.244 = 48.8), per the worked example above.
Orchestrated. Instrument acceptance. The modal consultant kept AI output near copy-paste, and the authors could not tell abdication from good iteration [1]. Add an edit-distance metric on drafts and a reversal metric on AI-recommended dispositions, and hold a hard human gate on dispositions until the reversal rate is a number you have seen.
Autonomous. Run a permanent unaided holdout: route 10 percent of the queue to humans with no AI assist and compare disposition correctness monthly. In the study, the trained arm lost more correctness than the untrained arm, a sign that better prompting does not substitute for this [1]. Only a control group tells you which side of the frontier each automated step sits on this quarter.
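The Orchestrated and Autonomous instruments can be sketched with the Python standard library. This sketch makes its own assumptions: `difflib.SequenceMatcher` stands in as a similarity measure and is not the paper's retention metric, and the function names and record shapes are hypothetical.

```python
from difflib import SequenceMatcher


def retention(ai_draft: str, sent_draft: str) -> float:
    """Similarity in [0, 1]; 1.0 means the AI draft shipped unchanged."""
    return SequenceMatcher(None, ai_draft, sent_draft).ratio()


def reversal_rate(dispositions) -> float:
    """Share of AI-recommended dispositions that a human later changed.

    `dispositions` is an iterable of (ai_recommendation, final_decision) pairs.
    """
    pairs = list(dispositions)
    if not pairs:
        raise ValueError("no dispositions to score")
    return sum(ai != final for ai, final in pairs) / len(pairs)


def holdout_gap(holdout_correct, assisted_correct) -> float:
    """Unaided-holdout correctness minus assisted correctness, in points.

    Both arguments are non-empty sequences of booleans (True = correct call).
    """
    holdout_rate = sum(holdout_correct) / len(holdout_correct)
    assisted_rate = sum(assisted_correct) / len(assisted_correct)
    return (holdout_rate - assisted_rate) * 100


# Hypothetical records, for illustration only.
print(retention("Saw your Q3 hiring push.", "Saw your Q3 hiring push in sales."))
print(reversal_rate([("qualify", "qualify"), ("qualify", "kill")]))
print(holdout_gap([True, True, True, False], [True, False, True, False]))
```

A positive `holdout_gap` means the unaided humans are making more correct calls than the assisted flow, which would suggest that step currently sits outside the frontier.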
Where this lands, industry by industry
The five sketches below are parameterized archetypes, not real firms. Every number is the study’s coefficients applied to stated archetype volumes, scaling linearly with the 6-minute draft assumption from the worked example.
Limits and caveats
The study ran on consulting tasks at one elite firm, with volunteers who are highly skilled in absolute terms even in its “bottom half,” on one model at one moment, April 2023 GPT-4 at default settings [1]. We are transferring its structure, the jagged frontier, not its point estimates; every worked number above carries that assumption on its face, and the sensitivity table shows what halving does to it. The outside-frontier task was built so GPT-4 would err, so the 19-point drop estimates the cost of misjudging the frontier, not an average across all work. The paper reports grader scores, completion, correctness, and seconds: no revenue, cost, or ROI of any kind, which means every hour and lead figure in this essay is our arithmetic, labeled derived. Our locators are working paper pages, and the published Organization Science version may shift numbers in revision. The four mechanisms are imported explanations, not tested by the experiment, and the Centaur and Cyborg typology carries no performance estimate yet. The pollution argument is field judgment from the Tenbound and CIENCE operating record, stated as such, not a measured elasticity. Finally, the whole allocation assumes a team past Manual on Measurement: if you cannot see your conversion rates, this issue’s instrument comes first, because you cannot allocate a dividend you cannot count.