Keep the endpoint, but mark the benchmark-mix caveat. OpenAI reports GPT-5.5 at 58.6% on SWE-Bench Pro and 82.7% on Terminal-Bench 2.0; Anthropic confirms Opus 4.7 as a coding-focused release. The normalized index should not imply every benchmark is interchangeable.
Frontier Models vs Productivity
Capability compounds; productivity negotiates with reality. Compare benchmark capability, enterprise dev productivity, small-team leverage, non-tech AI work, and vibe-coding access proxies.
Outlier audit
What can distort the story
Keep only as a detached code-production outlier. Foundations/GeekWire adds a midpoint: 68% of 22 AI-native startups report that AI writes more than 80% of production code, while 13.6% are below 50%. YC remains high-end, not the trend.
Add as audit evidence, not the main line. It supports a real code-flow outlier for top AI adopters, while Harness, Lightrun, Sonar, and CloudBees explain why review, debugging, and governance can absorb that gain.
Keep as access/prototyping leverage. Lovable reports roughly 8M users and 100k products/day; UX Tools shows designers building tools with AI; Replit says its audience is mostly non-technical. None of this proves maintained production software.
Strengthen the warning. Axios/Lovable, Cloud Security Alliance, Lightrun, and CloudBees all point to privacy, security, production, and maintenance debt when generated code moves faster than review capacity.
The steepest outliers are real capability/access shifts. Detached markers show code-share extremes. Connected lines show the more conservative trend, so one startup anecdote cannot dominate the visual story.
All source metrics
| Year | Source | Metric family | Metric | Value | Unit | Signal | Chart use | Caveat | URL |
|---|---|---|---|---|---|---|---|---|---|
The chart uses normalized index values so different evidence types can be compared visually. The table preserves raw public metrics and caveats. "Vibe coding" is intentionally treated as access/prototyping leverage, not audited software-delivery productivity.
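The page does not say how the index is computed, so the following is only a minimal sketch of one plausible scheme: each metric family is indexed to its earliest connected value (set to 100), and detached outliers are scaled against that same baseline without ever defining it. The `MetricPoint` type, the `normalize` function, the family names, and all numbers are hypothetical placeholders, not source metrics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricPoint:
    year: int
    family: str             # e.g. a "Metric family" cell from the table
    value: float            # raw value in the family's own unit
    detached: bool = False  # True for outliers drawn as detached markers


def normalize(points, base=100.0):
    """Return (point, index) pairs.

    Assumed scheme, not the chart's documented method: each family is
    indexed to its earliest connected (non-detached) value, so outliers
    can never set the baseline for the trend line.
    """
    baselines = {}
    for p in sorted(points, key=lambda p: p.year):
        if not p.detached and p.family not in baselines:
            baselines[p.family] = p.value

    indexed = []
    for p in points:
        b = baselines.get(p.family)
        # No baseline (or a zero baseline) means the point cannot be indexed.
        index = base * p.value / b if b else None
        indexed.append((p, index))
    return indexed


# Placeholder numbers for illustration only; they are not source metrics.
sample = [
    MetricPoint(2024, "family_a", 20.0),
    MetricPoint(2025, "family_a", 30.0),
    MetricPoint(2025, "family_a", 90.0, detached=True),
    MetricPoint(2025, "family_b", 5.0, detached=True),
]

for point, index in normalize(sample):
    kind = "detached marker" if point.detached else "connected line"
    print(point.year, point.family, index, kind)
```

Under this assumption, a family with only detached points gets no index at all, which mirrors the caveat that benchmark scores, survey shares, and user counts are comparable only in shape, not in unit.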