The right baseline for diagnostic AI is the decisions people would have made without it, not physician performance. Most published research ignores that counterfactual, so we don't know whether AI actually improves care or merely matches clinical benchmarks under lab conditions.
The Daily Letter Desk
Written with LLMs · Edited by humans
Apr 19·3 sources
Research asks whether AI can match doctors. The urgent question is whether AI changes patients' choices and outcomes compared with the information they'd have had without it — and the literature rarely measures that.
What happened
Ethan Mollick flagged a neglected problem in AI-and-diagnosis research: evidence is uneven. Some studies suggest models can give strong diagnostic answers, but “We only have spotty information about this very important topic.” He highlighted a mismatch between experiments and reality: controlled comparisons focus on clinician-level accuracy using older models, not on how modern systems change what laypeople would have done. As Mollick put it, “Most of the published research uses old models & compares to doctors. How do new models compare to the info people would have gotten without AI?” He warned against declaring complex judgment forever off-limits to AI and argued the field should keep testing rather than setting limits a priori.
“I am not convinced that we should be comfortable calling ‘problem solving’ or ‘judgement’ or whatever as skills that are impossible for AI to do well.”
Policy, procurement, and clinical workflows hinge on the counterfactual. If a chatbot gives a slightly better differential than an internet search but patients ignore it, call fewer doctors, or misunderstand its advice, the accuracy gain counts for little and may even cause harm. Research framed as “AI vs. doctor” answers the rhetorical question of parity, not the operational question of impact. Vendors and journals favor clinician-comparison studies because they are easier to run and more flattering, and that bias produces precise lab gains that often fail to translate into better triage, fewer missed diagnoses, or safer care pathways. Regulators and researchers should demand randomized, real-world comparisons against the alternatives people actually use: web search, symptom checkers, waiting to see, or calling a nurse line. Until studies measure those counterfactuals, claims that AI improves diagnosis are provisional at best and misleading at worst.
“We only have spotty information about this very important topic.”
Mollick also cautions against declaring judgment and problem solving forever beyond AI’s reach, and notes that existing evidence sometimes supports AI’s diagnostic strength. Both points cut the same way: keep testing new models, but test them against the right baselines.
What to watch
Who is the real comparator in field trials — search engines, nurse lines, or no-intervention? Do modern models change patient behavior or only lab metrics? Which outcomes (triage accuracy, time-to-care, downstream harms) should regulators require in deployed studies?