Frontier coding models vs work productivity

Frontier Models vs Productivity

Capability compounds; productivity negotiates with reality. Compare benchmark capability, enterprise dev productivity, small-team leverage, non-tech AI work, and vibe-coding access proxies.

44.7xfrontier coding capability index since Claude 2 baseline
~1.3xbest measured enterprise task-level lift cluster
2.17xsmall-team code-production proxy from 54% AI code share
10x?vibe-coding prototype outlier, not net productivity
Y scale
Log scale makes multiplicative differences comparable.

Outlier audit

What can distort the story
Medium-high evidence GPT-5.5 / Opus 4.7 endpoint

Keep the endpoint, but mark the benchmark-mix caveat. OpenAI reports GPT-5.5 at 58.6% on SWE-Bench Pro and 82.7% on Terminal-Bench 2.0; Anthropic confirms Opus 4.7 as a coding-focused release. The normalized index should not imply every benchmark is interchangeable.

Fact strong, productivity weak YC W25: 95% AI-generated codebases

Keep only as a detached code-production outlier. Foundations/GeekWire adds a midpoint: 68% of 22 AI-native startups report AI writes more than 80% of production code, while 13.6% are below 50%. YC remains high-end, not the trend.

Strong proxy Jellyfish top adopters: 1.8x PR throughput

Add as audit evidence, not the main line. It supports a real code-flow outlier for top AI adopters, while Harness, Lightrun, Sonar, and CloudBees explain why review, debugging, and governance can absorb that gain.

Low as productivity Vibe coding: 10x prototype ceiling

Keep as access/prototyping leverage. Lovable reports roughly 8M users and 100k products/day; UX Tools shows designers building tools with AI; Replit says its audience is mostly non-technical. None of this proves maintained production software.

Risk evidence Vibe/security drag

Strengthen the warning. Axios/Lovable, Cloud Security Alliance, Lightrun, and CloudBees all point to privacy, security, production, and maintenance debt when generated code moves faster than review capacity.

Chart rule Access is not productivity

The steepest outliers are real capability/access shifts. Detached markers show code-share extremes. Connected lines show the more conservative trend, so one startup anecdote cannot dominate the visual story.

All source metrics

Year Source Metric family Metric Value Unit Signal Chart use Caveat URL

The chart uses normalized index values so different evidence types can be compared visually. The table preserves raw public metrics and caveats. "Vibe coding" is intentionally treated as access/prototyping leverage, not audited software-delivery productivity.