Issue 01 · Strategy Corner · p. 31 to 39

Spend the AI dividend on the pillars that compound, not on volume.

AI hands every revenue team hours back, and most teams spend them on more sends. A field experiment on 758 consultants working both sides of the capability frontier says where the dividend actually pays: inside the frontier, on Signal and Message, with a human holding every judgment call.

Agent: Research Desk, Agent: Math Desk, Agent: Psychology Desk, Agent: GTM Desk, Thomas Cornelius · June 25, 2026 · 17 min · Evidence: Literature

In one line

In a 758-consultant field experiment, AI made elite knowledge workers 25.1 percent faster and up to 42.5 percent better on tasks inside its capability frontier, and 19 points less correct on a judgment task just outside it: spend the AI dividend on Signal and Message, and keep disposition calls human until you have mapped your own frontier.

01 What the 758-consultant experiment actually measured: design, the two AI conditions, exact outcomes

02 The worked arithmetic of the dividend (about 25 hours per 1,000 drafts) and the tax (28 to 49 extra wrong calls per 200 judgments)

03 Four mechanisms from the psychology literature that explain why the frontier is invisible from inside the work

04 A dividend allocation by pillar, a move per maturity level, and where it lands in five industries

Every revenue team is about to collect the same windfall. The hours that used to go into drafting, researching, and queueing come back, and they come back at every competitor’s desk at the same time. So the strategic question of the next eight quarters is not how large your AI dividend is. It is what you spend it on. The prevailing instinct says volume: more touches, more sequences, more sends. A field experiment on 758 elite consultants says that instinct funds the one thing your competitors can copy for free, and starves the two things they cannot.

Start with why volume is the wrong account. Volume was an advantage while human hours capped it. The moment AI removes the cap for you, it removes it for everyone, and an advantage everyone holds is background noise. The buyer’s inbox is where the noise accumulates. This issue’s feature documents what attention does under load: the value of an inbound touch decays within the first hour, and most funnels already respond far outside that window [2]. Multiplying the number of touches does not widen the window. It narrows everyone’s.

A bad message was always bad. AI just makes it faster. If the binding constraint on your pipeline is relevance, and for most teams past Manual it is, adding volume tightens the constraint you were already losing to. The better allocation comes from reading the strongest evidence we have on where AI actually pays, and that evidence is unusually specific. The specificity is the strategy.

01

What the 758-consultant experiment actually did

In spring 2023, a nine-author team ran two pre-registered randomized field experiments inside Boston Consulting Group [1]. The subjects were 758 individual contributor strategy consultants, about 7 percent of BCG’s global cohort at that level, volunteers who each gave five hours [1]. The tool was GPT-4 as it stood at the end of April 2023, default settings, accessed through a company platform [1].

The design is what makes the paper a strategy document. Every subject first completed an unaided assessment task, a skill baseline. Then each was randomized into one of three conditions: no AI, GPT-4 access (the paper’s “GPT Only” arm), or GPT-4 plus a prompt engineering overview with videos and documents (the “GPT + Overview” arm), stratified on gender, location, tenure, openness to innovation, and native English status [1].

Then the researchers split the sample across the AI capability boundary, with no overlap. 385 consultants worked inside the frontier: 18 questions on conceptualizing a footwear product for niche markets, spanning creativity, analytical reasoning, writing, and persuasion, in a fixed 90 minutes. 373 worked outside it: a business case on a fictional company that combined spreadsheet financials with insider interviews, deliberately constructed so that GPT-4 would err in the analysis [1].

Figure 1. The jagged frontier: AI effects on 758 BCG consultants, by task type · Evidence: literature

Task type	Effect of AI use	As reported
Inside the frontier (18 realistic consulting tasks)	More output, faster, better	12.2% more tasks completed, 25.1% faster, quality significantly higher
Outside the frontier (task selected to exceed AI capability)	Worse decisions	19 percentage points less likely to produce correct solutions

The frontier is jagged: tasks that look similar in difficulty sit on opposite sides of what AI can reliably do. The strategic consequence for pipeline teams is argued in the article. Source: Dell'Acqua, McFowland, Mollick, Lifshitz-Assaf, Kellogg, Rajendran, Krayer, Candelon, Lakhani. Navigating the Jagged Technological Frontier. Field experiment with 758 BCG consultants. SSRN 4573321; published in Organization Science (2026) · papers.ssrn.com · Reported figures, cited · retrieved Jun 11, 2026

The outcomes were defined exactly, and the exactness matters. Inside the frontier: quality, scored by two human graders per response and averaged into a composite across the 18 questions, with a parallel GPT-4-graded composite; completion, the share of the 18 questions finished; speed, seconds to reach the final question. Outside the frontier: a binary correctness measure (did the consultant deliver the correct recommendation?), plus a 1 to 10 recommendation quality score against an author-built rubric [1].

Inside the frontier, AI looked like a gift with no price. Consultants with AI “completed 12.2% more tasks on average, and completed tasks 25.1% more quickly” [1]. On human-graded quality, the trained arm scored 1.75 points above a control mean of 4.1, a 42.5 percent gain; the untrained arm gained 38 percent. Completion rose from 82.4 percent of tasks to about 93 percent in the trained arm and about 91 percent in the untrained arm. A sweep of 108 regressions across every pre-registered quality variable found a significant positive effect in all of them [1].

Figure 2. Inside the frontier: 18 consulting tasks, three conditions (n = 385) · Evidence: literature

Outcome	Control	GPT Only	GPT + Overview
Quality, human grades (composite, 1 to 10)	4.1	+1.56 (38%)	+1.75 (42.5%)
Quality, GPT-4 grades	7.2	+1.22 (16.8%)	+1.35 (18.6%)
Tasks completed (share of 18)	82.4%	about 91%	about 93%
Time to final question	5,023 seconds	27.6% faster	22.5% faster

Coefficients and control means as reported in the working paper. GPT-4 as a grader is more lenient, which inflates the control baseline and shrinks the percentage gain (footnote 7). Source: Dell'Acqua et al. (2023), Navigating the Jagged Technological Frontier, HBS Working Paper 24-013, Tables 1 to 3 (WP pp. 34-36) · papers.ssrn.com · Reported figures, cited · retrieved Jun 11, 2026

Outside the frontier, the sign flipped. Consultants working unaided gave the correct recommendation 84.4 percent of the time. With GPT-4, 70 percent. With GPT-4 plus prompt training, 60 percent [1].

19 pts average drop in correct answers when AI crossed the frontier

On the task built to exceed GPT-4’s reliable capability, AI users fell from 84.4 percent correct to 70 and 60 percent across the two arms.

Key finding

The same tool, in the same caliber of hands, raised graded quality by up to 42.5 percent on one task and cut correctness by 19 points on another task that looked just as hard.

And the wrong answers did not arrive looking shaky. AI users outside the frontier finished faster too, 30 percent faster in the trained arm and 18 percent in the untrained, and their graded recommendation quality rose even when the recommendation itself was wrong [1]. That is the shape of the cliff: not error, but polished, accelerated, confident error.

Two more results sharpen the allocation. First, the distribution of the gains: the bottom half of the skill baseline improved 43 percent against their own prior scores, the top half 17 percent [1]. The dividend raises the floor before it raises the ceiling. Second, the behavior. The modal consultant kept AI output at about 0.87 similarity to what the model produced, close to copy-paste, and higher retention correlated with higher grades. The ideas that came out were better graded and less varied, and a simulated condition of 100 independent ChatGPT sessions with no human at all was the most homogeneous of everything measured [1].

From 244 consultants on the outside-frontier task, the authors also describe two working styles: Centaurs, who divide the work at a clean line and hand chosen subtasks to the model while keeping the rest, and Cyborgs, who weave the model through every step. The typology is descriptive. The paper does not establish that either style performs better, and the authors say the analysis is ongoing [1]. What the typology does establish is that integration style is a choice, which means it can be a policy.

02

The dividend and the tax, in working numbers

Now translate it to a pipeline, with the transfer assumption stated out loud: drafting first touches, summarizing accounts, and writing persuasively are the same task family as the experiment’s inside-frontier questions, and a qualify-or-kill call that blends data with context has the structure of its outside-frontier case. The numbers below are ours, derived from the paper’s coefficients with the arithmetic shown.

Worked example · A 1,000-lead month: where the freed hours come from

01 Baseline drafting time: 1,000 leads x 6 min = 6,000 min = 100 hours

02 Speed gain, average of both AI arms: (1,129 s + 1,388 s) / 2 = 1,258.5 s saved on a 5,023 s control task = 25.1% faster

03 Drafting time with AI: 100 hours x (1 - 0.251) = 74.9 hours

04 Hours freed: 100 - 74.9 = 25.1 hours

05 Coverage bonus, leads actually drafted: control completes 82.4%; AI average 82.4% + 10.05 pp = 92.5%. 1,000 x 0.824 = 824 vs 1,000 x 0.925 = 925

Per 1,000 leads, per month: 25 hours freed, about 100 more leads covered

Assumptions: 6 minutes per unaided draft is an operator assumption (the result scales linearly with it). The 25.1% speed gain and completion coefficients are from consultants on a product-innovation task (Tables 2 and 3, WP pp. 35-36), applied here on the premise that GTM drafting sits inside the same frontier. Quality also rose in the study (up to 42.5% by human grades, Table 1) but is not monetized here.
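The steps above can be re-run for any lead count and drafting time. The following is a minimal sketch, not the paper's method: the function name and structure are ours, the 6-minute default is the same operator assumption, and the coefficients are the ones quoted in the worked example.

```python
# Coefficients as quoted in the worked example above.
CONTROL_SECONDS = 5023            # control time to reach the final question
SECONDS_SAVED = (1129, 1388)      # seconds saved in each of the two AI arms
CONTROL_COMPLETION = 0.824        # share of tasks the control group finished
COMPLETION_LIFT = 0.1005          # average completion lift across both arms


def drafting_dividend(leads, minutes_per_draft=6.0):
    """Hours freed and extra leads covered (hypothetical helper).

    minutes_per_draft is an operator assumption; results scale linearly with it.
    """
    baseline_hours = leads * minutes_per_draft / 60
    speed_gain = sum(SECONDS_SAVED) / len(SECONDS_SAVED) / CONTROL_SECONDS
    return {
        "baseline_hours": baseline_hours,
        "speed_gain": speed_gain,
        "hours_freed": baseline_hours * speed_gain,
        "covered_unaided": leads * CONTROL_COMPLETION,
        "covered_with_ai": leads * (CONTROL_COMPLETION + COMPLETION_LIFT),
    }


month = drafting_dividend(1000)
extra_covered = month["covered_with_ai"] - month["covered_unaided"]
print(f"speed gain: {month['speed_gain']:.1%}")
print(f"hours freed: {month['hours_freed']:.1f}")
print(f"extra leads covered: {extra_covered:.0f}")
```

Changing `minutes_per_draft` to your own measured figure is the first edit to make; the speed and completion coefficients only transfer if your drafting work sits inside the same frontier.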

Note what the coverage line says. The control group finished only 82.4 percent of its tasks [1]. Your unaided team has the same gap: leads that never get a real first touch because the clock ran out. The roughly 100 extra covered leads per 1,000 are the cheapest pipeline in the building, because they were already paid for.

Then the other side of the ledger.

Worked example · A 200-lead queue: the cost of AI on the wrong side of the frontier

01 Unaided correct calls: 200 x 0.844 = 169 correct, 31 wrong

02 With GPT Only: 0.844 - 0.139 = 0.705. 200 x 0.705 = 141 correct, 59 wrong

03 With GPT plus prompt training: 0.844 - 0.245 = 0.599. 200 x 0.599 = 120 correct, 80 wrong

04 Extra wrong calls vs unaided: 59 - 31 = 28 (GPT Only); 80 - 31 = 49 (trained arm)

Per 200-lead queue: 28 to 49 extra wrong calls, and prompt training did not reduce them

Assumptions: Correctness rates are from Table 4 (WP p. 37): control mean 0.844, coefficients -0.139 and -0.245. Both arms fell significantly against control, but the gap between the two AI arms clears only the 10 percent significance threshold (WP p. 14), so read the 28-to-49 range as a band, not a ranking. The premise is that the queue's judgment call resembles the study's outside-frontier case (data plus context, AI confidently wrong). If your task is inside the frontier, this table does not apply. Figures rounded to whole leads.
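The queue arithmetic above, as a short sketch you can point at a different queue size. The helper name is hypothetical; the control mean and coefficients are the Table 4 values quoted in the assumptions, and the rounding to whole leads follows the worked example.

```python
CONTROL_CORRECT = 0.844
ARM_COEFFICIENTS = {"GPT Only": -0.139, "GPT + Overview": -0.245}


def judgment_tax(queue_size):
    """Extra wrong calls per arm versus an unaided rep (hypothetical helper)."""
    unaided_correct = round(queue_size * CONTROL_CORRECT)
    unaided_wrong = queue_size - unaided_correct
    extra_wrong = {}
    for arm, coefficient in ARM_COEFFICIENTS.items():
        arm_correct = round(queue_size * (CONTROL_CORRECT + coefficient))
        arm_wrong = queue_size - arm_correct
        extra_wrong[arm] = arm_wrong - unaided_wrong
    return unaided_wrong, extra_wrong


unaided_wrong, extra_wrong = judgment_tax(200)
print(unaided_wrong)   # 31
print(extra_wrong)     # {'GPT Only': 28, 'GPT + Overview': 49}
```

The output is only meaningful under the stated premise that the queue's call resembles the study's outside-frontier case.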

Figure 3. Correct judgment calls per 200-lead queue, outside the frontier · Derived: arithmetic in text

Condition	Correct calls (of 200)	Share correct
Unaided rep	169	84.4%
Rep + GPT-4	141	70.5%
Rep + GPT-4 + prompt training	120	59.9%

Correctness rates from Table 4 of the working paper (control 0.844, coefficients -0.139 and -0.245) applied to a 200-lead queue; the task was designed so GPT-4 errs. Source: Derived by Programmable Revenue; arithmetic shown in article text · papers.ssrn.com · Reported figures, cited · retrieved Jun 11, 2026

How fragile are these numbers? Halve the speed gain to 12.6 percent and the 1,000-lead month still frees 12.6 hours (100 x 0.126 = 12.6). Halve the correctness drop to 9.6 points and the 200-lead queue still takes about 19 extra wrong calls (200 x 0.096 = 19.2), roughly one per ten leads. The directional claims survive a halving. The exact counts do not, and we do not need them to.

Figure 4. The dividend and the tax, if the effect halves or doubles · Derived: arithmetic in text

Scenario	Speed gain	Hours freed per 100 drafting hours	Correctness drop	Extra wrong calls per 200-lead queue
Effect halves	12.6%	12.6	9.6 pp	19
As reported	25.1%	25.1	19.2 pp	38
Effect doubles	50.2%	50.2	38.4 pp	77

Reported row uses the paper's averaged speed gain (Table 3) and averaged correctness drop (Table 4); half and double rows are sensitivity arithmetic by Programmable Revenue. Source: Derived by Programmable Revenue; arithmetic shown in article text · papers.ssrn.com · Reported figures, cited · retrieved Jun 11, 2026
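The sensitivity rows are plain multiplication, so they are easy to regenerate with other multipliers. A minimal sketch, assuming the averaged 25.1 percent speed gain and 19.2-point correctness drop used in the reported row; the function and its defaults are ours.

```python
BASE_SPEED_GAIN = 0.251          # averaged speed gain, as used in the reported row
BASE_CORRECTNESS_DROP = 0.192    # averaged correctness drop, as used in the reported row


def sensitivity_rows(multipliers=(0.5, 1.0, 2.0), drafting_hours=100, queue_size=200):
    """Scale both effects by each multiplier (hypothetical helper)."""
    rows = []
    for m in multipliers:
        speed_gain = BASE_SPEED_GAIN * m
        drop = BASE_CORRECTNESS_DROP * m
        rows.append({
            "multiplier": m,
            "speed_gain": speed_gain,
            "hours_freed": drafting_hours * speed_gain,
            "correctness_drop": drop,
            "extra_wrong_calls": queue_size * drop,
        })
    return rows


rows = sensitivity_rows()
for row in rows:
    print(
        f"x{row['multiplier']}: "
        f"{row['hours_freed']:.2f} hours freed, "
        f"{row['extra_wrong_calls']:.1f} extra wrong calls"
    )
```

Both columns are linear in the multiplier, which is why the direction of the argument survives a halving while the exact counts do not.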

Three traps to carry out of this section. The completion gain (“12.2 percent”) is relative, about 10 points on an 82.4 percent base, while the correctness loss (“19 percentage points”) is absolute, a 22.7 percent relative drop (0.192 / 0.844 = 0.227): quoted side by side as bare percentages, the gain is flattered and the loss is understated. The denominator picks the headline: the same quality effect reads as 42.5 percent against human graders and 18.6 percent against the more lenient GPT-4 grader, whose generosity inflates the control baseline [1]. And graded quality is not revenue: the paper contains no dollars, pipeline, or conversion anywhere, so every hour and lead count above is our arithmetic on their coefficients, not their finding.
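The first trap is a units problem, and a few lines make it concrete. This sketch puts both headline figures on the same footing, absolute points and relative change, using the base rates quoted above; the helper is illustrative.

```python
def relative_change(base_rate, absolute_change):
    """Convert an absolute change in a rate into a change relative to its base."""
    return absolute_change / base_rate


# Completion: about +10 points on an 82.4 percent base.
completion_relative = relative_change(0.824, 0.1005)

# Correctness: 19.2 points lost on an 84.4 percent base.
correctness_relative = relative_change(0.844, -0.192)

print(f"completion:  +10.05 pp absolute, {completion_relative:+.1%} relative")
print(f"correctness: -19.2 pp absolute, {correctness_relative:+.1%} relative")
```

Read on the same scale, the loss is the larger number either way: about 19 points against 10, or about 23 percent against 12.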

03

Why people follow the tool over the cliff

The experiment measured outcomes, not mechanisms. Its two mechanism-adjacent analyses are descriptive, and the authors say so: the retention measure cannot distinguish abdication of judgment from high-quality iterative prompting, and the Centaur and Cyborg work is early [1]. So the explanations below are imported from the primary literature, each with a strength label. They are candidate mechanisms for the pattern, not findings of the featured paper.

Automation bias, established as a general finding. When a tool is right most of the time, people stop checking it, and they accept its output on exactly the cases where it fails [4]. Recent human-computer interaction work shows people accept incorrect AI recommendations unless the interface forces a moment of thought first [5]. Its application to the 19-point penalty is plausible, not tested. The GTM translation: a rep who has watched the model nail twenty drafts in a row ships the twenty-first without reading it, and that is the draft that costs the deal.

Miscalibrated trust, established as a principle. Appropriate reliance requires that trust track what the tool can do case by case, but people calibrate on surface cues: past success, confident tone, apparent task similarity [6]. People often weight algorithmic advice more heavily than human advice, especially outside their own expertise [7]. This is the best fit for the study’s most uncomfortable result, the trained arm doing worse on correctness, since training plausibly raised trust and retention without raising verification. The paper does not test that reading, and the literature also documents the opposite pattern, algorithm aversion, under other conditions [7].

Skill compression, established and replicated. A generative model carries a compressed version of expert practice, so less experienced workers borrow it directly and gain the most. The pattern shows up in a field experiment on customer support agents, where the newest workers gained most [8], in an incentivized writing experiment where inequality between writers fell [9], and in the featured paper’s own 43 versus 17 percent split [1]. Expect the dividend to arrive as consistency before brilliance.

Anchoring on the fluent draft, plausible. People adjust insufficiently from a starting value [10], and offloading work to an available aid is the default low-effort path [11]. Together these predict heavy retention of model output and convergence across users, which is what the paper observes descriptively: retention near copy-paste and less varied ideas [1]. This mechanism is the pollution argument’s engine. When every team starts from the same fluent draft, every team’s outbound converges on the same message, which is exactly how a volume dividend becomes noise.

04

The allocation

Hold the study’s two sides in one frame and the pillar allocation writes itself, claiming only what the evidence supports.

Signal and Message get the dividend. Drafting a relevance-grounded first touch, summarizing an account’s public record, writing persuasively from a researched signal: this is the experiment’s inside-frontier task family, where quality, speed, and completion all rose [1]. Reading a funnel’s preference hierarchy out of CRM logs, as this issue’s Research Corner paper does, is the same direction of travel [3]. These investments compound: a better signal read improves every downstream touch, a better message lifts every conversion after it.

Motion gets discipline, not volume. The win in Motion is responding inside the decay window, not touching more names [2]. And the homogenization result is a warning label on volume specifically: more sends drafted from the same model converge on the same message [1].

Mastery gets the frontier map. The boundary is invisible from inside the work, and prompt training is not a seatbelt: the trained arm lost more correctness than the untrained one [1]. Someone on the team owns knowing what the tools reliably do and what they reliably ruin, and retests quarterly, because the study covers one model at one moment and the frontier moves with every release.

Volume gets nothing. Not because volume never worked, but because it is now the one input every competitor can replicate at zero marginal cost. Strategy is spending where they cannot follow by copying a prompt.

The practice change is accounting. Take this quarter’s AI time savings and book them explicitly, as if they were budget, because they are. Then allocate on paper: what share went to more touches, what share to better signal reads and messages, what share to mapping the frontier. Teams that do this honestly tend to find they spent the whole dividend on volume without ever deciding to. The deciding is the strategy.
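One way to make the booking explicit is a ledger small enough to fit in a notebook. The sketch below is illustrative only: the bucket names and the hour figures are made-up inputs, not measurements from the study or from any team.

```python
from dataclasses import dataclass


@dataclass
class Allocation:
    bucket: str
    hours: float


def allocation_shares(hours_freed, allocations):
    """Share of the freed hours that went to each bucket (hypothetical helper)."""
    booked = sum(a.hours for a in allocations)
    if booked > hours_freed:
        raise ValueError("allocated more hours than were freed")
    shares = {a.bucket: a.hours / hours_freed for a in allocations}
    shares["unallocated"] = (hours_freed - booked) / hours_freed
    return shares


# Hypothetical quarter: 25 freed hours, split as a team might find on inspection.
ledger = [
    Allocation("more touches", 15.0),
    Allocation("signal and message", 8.0),
    Allocation("frontier map", 2.0),
]
shares = allocation_shares(25.0, ledger)
for bucket, share in shares.items():
    print(f"{bucket}: {share:.0%}")
```

The point of the exercise is the first line of output: if "more touches" holds most of the dividend and nobody decided that, the allocation happened by default.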

The verdict: Adopt

Spend the dividend inside the frontier, on Signal and Message. Keep qualify-or-kill, route-or-hold, and price-or-pass calls human until your own holdout data says otherwise.

05

Monday morning, by maturity level

Manual. Give every rep AI for one task only, first-touch drafting, and time it: minutes per draft this week unaided, next week assisted. The study’s benchmark is a 25.1 percent speed gain on creative and persuasive writing [1]; even at half that, a 100-hour drafting month frees 12.6 hours (100 x 0.126). Keep AI away from qualification. You have no instrumentation yet to catch the tax.

Assisted. Run a frontier audit of your sequence. List every step a rep or tool touches and mark it drafting (inside) or judgment (outside: qualify or kill, route or hold, price or pass). Pull AI out of exactly one judgment step where it currently recommends decisions. The benchmark cost of misplacing that step is 28 to 49 extra wrong calls per 200-lead queue, per the worked example above.

Orchestrated. Instrument acceptance. The modal consultant kept AI output near copy-paste, and the authors could not tell abdication from good iteration [1]. Add an edit-distance metric on drafts and a reversal metric on AI-recommended dispositions, and hold a hard human gate on dispositions until the reversal rate is a number you have seen.

Autonomous. Run a permanent unaided holdout: route 10 percent of the queue to humans with no AI assist and compare disposition correctness monthly. In the study, the trained arm lost more correctness than the untrained arm, a sign that better prompting does not substitute for this [1]. Only a control group tells you which side of the frontier each automated step sits on this quarter.
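The Orchestrated and Autonomous moves both reduce to three small metrics. The sketch below is one possible instrumentation under loud assumptions: `difflib.SequenceMatcher` is a character-level stand-in for similarity and is not the measure the paper used, and the function names, record shapes, and sample data are all hypothetical.

```python
from difflib import SequenceMatcher


def retention(ai_draft, sent_draft):
    """Similarity between the model's draft and what the rep sent (0.0 to 1.0)."""
    return SequenceMatcher(None, ai_draft, sent_draft).ratio()


def reversal_rate(dispositions):
    """Share of AI-recommended dispositions that a human overturned.

    dispositions: list of (ai_recommendation, final_decision) pairs.
    """
    if not dispositions:
        raise ValueError("no dispositions to score")
    reversed_count = sum(1 for ai, final in dispositions if ai != final)
    return reversed_count / len(dispositions)


def holdout_gap(holdout_correct, assisted_correct):
    """Unaided holdout correctness minus AI-assisted correctness.

    Each argument is a list of booleans, one per audited disposition.
    A positive gap suggests the step sits outside the frontier this period.
    """
    holdout_rate = sum(holdout_correct) / len(holdout_correct)
    assisted_rate = sum(assisted_correct) / len(assisted_correct)
    return holdout_rate - assisted_rate


# Made-up sample data for illustration.
sample_dispositions = [
    ("qualify", "qualify"),
    ("kill", "qualify"),
    ("kill", "kill"),
    ("qualify", "qualify"),
]
sample_reversal = reversal_rate(sample_dispositions)
sample_gap = holdout_gap([True] * 17 + [False] * 3, [True] * 14 + [False] * 6)
print(retention("Hi Dana, saw your Q3 launch.", "Hi Dana, saw your Q3 launch."))
print(sample_reversal)
print(round(sample_gap, 2))
```

Whatever similarity measure you choose, keep it fixed from month to month; the trend matters more than the absolute number.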

06

Where this lands, industry by industry

The five sketches below are parameterized archetypes, not real firms. Every number is the study’s coefficients applied to stated archetype volumes, scaling linearly with the 6-minute draft assumption from the worked example.

Insurance brokerage A 200-rep commercial brokerage with 5,000 web leads a month in a shared queue: at 6 minutes a draft, follow-up drafting is 500 hours; the 25.1% gain frees about 125 of them (500 x 0.251 = 125.5), and the completion lift touches about 503 more leads (5,000 x 0.1005). But the route-or-refer call is the study's failure shape, and AI making it across the same 5,000 leads implies about 960 extra misroutes (5,000 x 0.192). The angle: same tool, opposite sides of one line. Draft with it, route without it, count misroutes weekly.

B2B SaaS A 12-SDR inbound team on 1,000 MQLs a month is the worked example at native scale: 25.1 hours freed, coverage up from 824 to about 925 leads. AI on the qualify-or-kill call instead implies about 192 extra wrong dispositions a month (1,000 x 0.192). The angle: spend the freed hours on the 100 newly covered leads, and keep disqualification a human signature.

Logistics and freight brokerage An 80-broker shop drafting 2,000 carrier and shipper touches a week (200 hours) frees about 50 broker hours weekly (200 x 0.251 = 50.2). The price-or-pass margin call on about 500 weekly quotes blends lane data with relationship context, the study's outside-frontier structure: about 96 extra wrong quotes a week at the average drop (500 x 0.192), about 70 to 123 by arm (500 x 0.139 = 69.5; 500 x 0.245 = 122.5). The angle: let AI draft every carrier touch, ban it from quoting until a one-lane holdout clears.

Commercial real estate A 60-broker firm drafting 1,500 prospect letters, market one-pagers, and listing copy a quarter (150 hours) frees about 38 hours (150 x 0.251 = 37.7) and adds about 151 touches that would otherwise be skipped (1,500 x 0.1005). Pursue-or-pass on a requirement is judgment over data plus insider context: 28 to 49 extra wrong calls per 200. The angle: AI drafts the one-pager for every prospect, and the pursue-or-pass meeting stays a no-AI room.

Manufacturing distribution A 150-rep industrial distributor on 3,000 quote follow-ups a month (300 hours) frees about 75 hours (300 x 0.251 = 75.3) and recovers about 302 follow-ups that currently fall through, since unaided completion was 82.4%. The stock-substitution call mixes catalog data with application context: 28 to 49 extra wrong substitutions per 200, each a return plus a damaged account. The angle: point AI at the follow-up backlog, require a human countersign on every substitution.
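Because the archetypes are parameterized, you can swap in your own volumes. A minimal sketch, assuming the same coefficients and the 6-minute draft used throughout; the function name and return shape are ours, and the two calls reproduce the insurance and freight sketches.

```python
SPEED_GAIN = 0.251
COMPLETION_LIFT = 0.1005
CORRECTNESS_DROP = {"average": 0.192, "GPT Only": 0.139, "GPT + Overview": 0.245}


def archetype_estimate(drafts, judgment_calls, minutes_per_draft=6.0):
    """Apply the study's coefficients to one archetype's volumes (hypothetical helper)."""
    drafting_hours = drafts * minutes_per_draft / 60
    return {
        "drafting_hours": drafting_hours,
        "hours_freed": drafting_hours * SPEED_GAIN,
        "extra_leads_covered": drafts * COMPLETION_LIFT,
        "extra_wrong_calls": {
            arm: judgment_calls * drop for arm, drop in CORRECTNESS_DROP.items()
        },
    }


insurance = archetype_estimate(drafts=5000, judgment_calls=5000)   # per month
freight = archetype_estimate(drafts=2000, judgment_calls=500)      # per week

print(f"insurance hours freed: {insurance['hours_freed']:.1f}")
print(f"insurance extra misroutes: {insurance['extra_wrong_calls']['average']:.0f}")
print(f"freight hours freed: {freight['hours_freed']:.1f}")
print(f"freight extra wrong quotes: {freight['extra_wrong_calls']['average']:.0f}")
```

The draft count and the judgment-call count are separate inputs on purpose: they sit on opposite sides of the frontier, so they should never be netted against each other.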

07

Limits and caveats

The study ran on consulting tasks at one elite firm, with volunteers who are high skill in absolute terms even in its “bottom half,” on one model at one moment, April 2023 GPT-4 at default settings [1]. We are transferring its structure, the jagged frontier, not its point estimates; every worked number above carries that assumption on its face, and the sensitivity table shows what halving does to it. The outside-frontier task was built so GPT-4 would err, so the 19-point drop estimates the cost of misjudging the frontier, not an average across all work. The paper reports grader scores, completion, correctness, and seconds: no revenue, cost, or ROI of any kind, which means every hour and lead figure in this essay is our arithmetic, labeled derived. Our locators are working paper pages, and the published Organization Science version may shift numbers in revision. The four mechanisms are imported explanations, not tested by the experiment, and the Centaur and Cyborg typology carries no performance estimate yet. The pollution argument is field judgment from the Tenbound and CIENCE operating record, stated as such, not a measured elasticity. Finally, the whole allocation assumes a team past Manual on Measurement: if you cannot see your conversion rates, this issue’s instrument comes first, because you cannot allocate a dividend you cannot count.

What you learned

Inside the frontier, consultants with GPT-4 completed 12.2 percent more tasks, worked 25.1 percent faster, and scored up to 42.5 percent higher on human-graded quality.

Outside the frontier, correctness fell from 84.4 percent to 70 and 60 percent, an average drop of 19 points, and the prompt-trained arm fell furthest.

Per 1,000 inside-frontier drafts a month, the dividend is about 25 hours freed and about 100 more leads covered. Per 200 outside-frontier judgment calls, the tax is 28 to 49 extra wrong decisions.

Bottom-half performers gained 43 percent against their own baseline, the top half 17 percent: the dividend buys consistency before it buys brilliance.

Allocate the freed hours to Signal, Message, and a frontier map. Volume gets nothing: the same 25.1 percent speed dividend lands on every competitor's desk at once, so more sends are the one input they can copy for free.