The Most Uncomfortable Study in AI
In December 2025, METR (Model Evaluation and Threat Research) published what may be the most methodologically rigorous study ever conducted on AI coding tool productivity. A randomized controlled trial with experienced open-source developers working on real tasks in repositories they maintained.
The result: developers using AI tools were 19% slower than those working without them.
The perception gap is extraordinary. Before the study, participants predicted a 24% speedup. After using the tools, they believed they had been 20% faster. The actual measurement showed a 19% slowdown. That is a roughly 40-percentage-point gap between perceived and actual performance.
The Perception Bias Problem
This finding is consistent across the broader literature. An Uplevel study at Microsoft found no statistically significant improvement in pull request cycle time despite developers reporting feeling more productive. GitClear’s analysis found code churn increased as AI tool adoption grew.
The mechanism is well-understood in psychology. When a tool generates code quickly, the moment of generation feels productive. The subsequent time debugging and integrating doesn’t register as “AI tool overhead” — it registers as normal development work.
Any AI tool ROI measurement based on developer surveys is systematically and significantly unreliable, with a consistent upward bias of 20-40 points.
When Throughput Metrics Mislead
Faros AI’s 2025 study of 10,000+ developers showed headline numbers that looked great: +21% tasks completed, +98% PRs merged. But code review time increased by 91%.
This is Amdahl’s Law applied to software development. AI tools accelerated code writing (one station on the assembly line), but the pile-up at code review absorbed the gains. The 98% increase in merged PRs may simply reflect the same work split into more granular units.
The Macro-Level Evidence Gap
An NBER working paper surveying nearly 6,000 CEOs found minimal measurable impact on firm-level productivity. Daron Acemoglu at MIT estimated AI would add only 0.53-0.66 percentage points to GDP growth over the next decade.
The Historical Parallel
When electric motors replaced steam engines in factories circa 1900, initial productivity gains were minimal — sometimes negative. Factories simply replaced the central steam engine with a central electric motor. It took 20-30 years for manufacturers to realize electricity enabled fundamentally different factory designs.
BCG’s 2025 research confirmed this for AI: organizations that “layer AI on existing workflows” see the smallest effects. Organizations that redesign processes around AI see significantly larger gains — but these redesigns take years.
We are in the substitution phase. The transformation phase has not yet arrived at scale.
What the Data Actually Supports
- AI tools provide genuine assistance for specific tasks — code generation, boilerplate, documentation
- Task-level gains do not translate to process-level gains without addressing downstream bottlenecks
- Self-reported data is unreliable — 20-40 point upward bias consistently
- Experience level mediates outcomes — senior developers use tools more effectively
- We are early — historical parallels suggest 10-20 years for macro productivity gains
What This Means for Your Organization
- Replace self-reported measurement with objective measurement. Track cycle time, defect rates, deployment frequency — not satisfaction surveys.
- Measure the full process, not individual steps. A 50% reduction in code writing time means nothing if review time doubles.
- Invest in the bottlenecks AI tools create. Automated code review, AI-assisted testing, deployment validation.
- Set realistic expectations. Frame AI tool adoption as a multi-year process redesign initiative, not a quarterly productivity boost.
Sources
- METR, “Measuring the Impact of Early AI Assistance on Open-Source Development,” 2025
- Faros AI Developer Productivity Report, 2025
- BCG, “From Potential to Profit with GenAI,” 2025
- Paul David, “The Dynamo and the Computer,” 1990