Claude Haiku 4.5 Wrote 62% More Code But Scored 16% Lower Than Sonnet 4.5
Testing Anthropic's newly released Claude Haiku 4.5 on a WebSocket refactoring task revealed a surprising paradox: it produced the most code but delivered some of the lowest quality. Here's what that tells us about the "more is better" assumption in AI-generated code.
Anthropic just released Claude Haiku 4.5, so I immediately ran it through a complex TypeScript task: refactoring a fragile WebSocket client to add exponential backoff, connection state management, and message queuing.
The results were unexpected. Haiku 4.5 wrote 13,666 tokens—the most of all 8 models tested. You'd think more code equals better solution, right?
Haiku scored 74.4/100. Sonnet 4.5 wrote 8,425 tokens (38% less) and scored 89.0/100.
Haiku produced 62% more code but delivered 16% lower quality.
The Test
I ran the same WebSocket refactoring task across 8 current AI models: GPT-5, OpenAI o3, Claude Opus 4.1, Claude Sonnet 4.5, Claude Haiku 4.5, Grok 4, Gemini 2.5 Pro, and GLM 4.6.
Each model received identical inputs: the original fragile code and a task description requesting robust reconnection logic, exponential backoff, state management, and message queuing. All outputs were evaluated by a dynamic judge (GPT-5, the current #1 model on the platform) across 5 criteria: code quality, completeness, correctness, performance, and security.
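For context, here is roughly the shape of solution the task calls for. This is a minimal sketch of my own (not any model's output), assuming the standard browser-style WebSocket API; the class and field names are illustrative.

```typescript
// Minimal sketch: reconnection with exponential backoff, a small state machine,
// and an outbound message queue. Assumes the standard browser-style WebSocket global.

type ConnectionState = "connecting" | "open" | "reconnecting" | "closed";

class ReconnectingClient {
  private ws: WebSocket | null = null;
  private state: ConnectionState = "closed";
  private attempt = 0;
  private queue: string[] = []; // messages buffered while disconnected

  constructor(private url: string, private maxDelayMs = 30_000) {}

  connect(): void {
    this.state = this.attempt === 0 ? "connecting" : "reconnecting";
    this.ws = new WebSocket(this.url);

    this.ws.onopen = () => {
      this.state = "open";
      this.attempt = 0;
      // Flush anything queued while we were offline.
      while (this.queue.length > 0) this.ws!.send(this.queue.shift()!);
    };

    this.ws.onclose = () => {
      if (this.state === "closed") return; // closed on purpose, don't reconnect
      // Exponential backoff with jitter, capped at maxDelayMs.
      const delay = Math.min(this.maxDelayMs, 2 ** this.attempt * 1_000);
      this.attempt++;
      setTimeout(() => this.connect(), delay + delay * Math.random() * 0.2);
    };
  }

  send(data: string): void {
    if (this.state === "open" && this.ws?.readyState === WebSocket.OPEN) {
      this.ws.send(data);
    } else {
      this.queue.push(data); // queue instead of silently dropping
    }
  }

  close(): void {
    this.state = "closed";
    this.ws?.close();
  }
}
```

Usage would be `new ReconnectingClient(url).connect()` followed by `send()` calls, with queued messages flushed automatically on reconnect.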
Full Rankings (All 8 Models)
| Rank | Model | Score | Tokens | Cost | Speed |
|---|---|---|---|---|---|
| 1 | GPT-5 | 93.4 | 7,919 | $0.080 | 77s |
| 2 | Claude Sonnet 4.5 | 89.0 | 8,425 | $0.128 | 98s |
| 2 | o3 | 89.0 | 5,191 | $0.043 | 38s |
| 4 | Gemini 2.5 Pro | 86.6 | 2,621 | $0.027 | 36s |
| 5 | GLM 4.6 | 84.4 | 3,334 | $0.006 | 51s |
| 6 | Claude Opus 4.1 | 81.6 | 6,052 | $0.464 | 112s |
| 7 | Claude Haiku 4.5 | 74.4 | 13,666 | $0.069 | 73s |
| 8 | Grok 4 | 70.0 | 888 | $0.024 | 101s |
Haiku 4.5 wrote the most code (13,666 tokens) but ranked 7th out of 8 models.
Why Did Haiku Score So Low?
Despite producing the most code, Haiku 4.5's quality suffered across multiple dimensions. Here's the breakdown from the dynamic judge (GPT-5):
Haiku 4.5 Score Breakdown:
- Code Quality: 60/100 ("Overly verbose with probable duplicate methods and mixed concerns")
- Correctness: 65/100 ("Duplicated send definitions and heavy boilerplate that risks errors")
- Performance: 75/100 ("Heavy metrics and layers may add unnecessary overhead")
- Security: 82/100 ("Parses safely but complexity can hide edge cases")
- Completeness: 90/100 ("Covers backoff, queueing, heartbeats, metrics extensively")

Average: 74.4/100
Haiku tried to do everything—covering all features comprehensively (90/100 on completeness)—but introduced bugs and code quality issues in the process. The extra code wasn't adding value; it was adding complexity.
The Over-Engineering Problem
This is a textbook case of over-engineering. Haiku optimized for thoroughness (adding metrics, extensive logging, multiple abstraction layers) but sacrificed three things (the duplicated-send pattern the judge flagged is sketched after this list):
- Code quality: 60/100 vs Sonnet's 88/100
- Correctness: 65/100 vs Sonnet's 90/100 (duplicate methods, boilerplate errors)
- Maintainability: Verbose, complex code with mixed concerns
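The "duplicated send definitions" and "mixed concerns" the judge describes typically look like the pattern below. This is an illustrative sketch of the antipattern, not Haiku's actual output; the class and method names are hypothetical.

```typescript
// Illustrative antipattern only (not Haiku's actual output): two near-identical
// send paths, each interleaving transport, queuing, metrics, and logging concerns.
class OverEngineeredClient {
  private ws: WebSocket | null = null;
  private queue: string[] = [];
  private messagesSent = 0;

  sendMessage(data: string): void {
    this.messagesSent++;
    console.debug("sending", { size: data.length });
    if (this.ws?.readyState === WebSocket.OPEN) this.ws.send(data);
    else this.queue.push(data);
  }

  sendJson(payload: unknown): void {
    // Duplicates sendMessage instead of delegating to it: a second place for bugs
    // to hide, and a second place every future change has to be applied.
    const data = JSON.stringify(payload);
    this.messagesSent++;
    console.debug("sending json", { size: data.length });
    if (this.ws?.readyState === WebSocket.OPEN) this.ws.send(data);
    else this.queue.push(data);
  }
}
```

A single send path that other helpers delegate to removes both the duplication and most of the mixed concerns, which is the clean, duplicate-free structure Sonnet is credited with below.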
Compare to Sonnet 4.5:
- Wrote 8,425 tokens (38% less code)
- Scored 89.0/100 (20% higher quality)
- Clean structure, no duplicates, production-ready
- Code quality: 88/100, Correctness: 90/100
For production WebSocket code, would you rather deploy Haiku's 13,666 tokens with duplicate methods and potential bugs? Or Sonnet's 8,425 tokens with cleaner structure and higher correctness?
When Is More Code Actually Worse?
This result challenges a common assumption in AI-generated code: that comprehensive solutions are inherently better. Sometimes "thorough" is just another word for "bloated."
The trade-off:
More code gives you:
- ✅ Comprehensive feature coverage
- ✅ Extensive error handling
- ✅ Detailed logging and metrics
But it risks:
- ❌ Code duplication
- ❌ Mixed concerns
- ❌ Harder to maintain
- ❌ More surface area for bugs
Code Efficiency: Tokens per Quality Point
Dividing each model's output tokens by its judge score gives tokens per quality point, which reveals which models write efficient code (lower is leaner):
| Model | Tokens/Point | Efficiency |
|---|---|---|
| Gemini 2.5 Pro | 30 | Most efficient |
| GLM 4.6 | 40 | Very efficient |
| o3 | 58 | Good |
| GPT-5 | 85 | Acceptable |
| Claude Sonnet 4.5 | 95 | Fair |
| Claude Haiku 4.5 | 184 | Least efficient |
Haiku produced 3.2x more code per quality point than o3, and 1.9x more than Sonnet.
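For transparency, the efficiency figures are just output tokens divided by judge score, using the numbers from the rankings table; a quick sketch of the arithmetic:

```typescript
// Tokens per quality point = output tokens / judge score (lower means leaner code).
// Values are copied from the rankings table above.
const results = [
  { model: "Gemini 2.5 Pro", tokens: 2_621, score: 86.6 },
  { model: "o3", tokens: 5_191, score: 89.0 },
  { model: "Claude Sonnet 4.5", tokens: 8_425, score: 89.0 },
  { model: "Claude Haiku 4.5", tokens: 13_666, score: 74.4 },
];

for (const r of results) {
  console.log(`${r.model}: ${Math.round(r.tokens / r.score)} tokens/point`);
}
// Gemini 2.5 Pro: 30, o3: 58, Claude Sonnet 4.5: 95, Claude Haiku 4.5: 184
```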
⚠️ Important Context (n=1 evaluation)
This analysis is based on a single evaluation. While the patterns are clear and the judge's reasoning is detailed, this is directional data, not a definitive benchmark.
I'm running 10+ more Haiku 4.5 tests across different task types (security, refactoring, optimization, architecture) to see if this over-engineering pattern holds.
Full evaluation data: View complete results with all model outputs and scoring details →
Test It On Your Code
Want to see how Haiku 4.5, Sonnet, and 6 other models handle your specific use case?
Compare 8 Models on Your Task →
Free to use • See side-by-side outputs • Vote on winner