
Model Comparison · 8 min read

Claude Haiku 4.5 Wrote 62% More Code But Scored 16% Lower Than Sonnet 4.5

Testing Anthropic's newly released Claude Haiku 4.5 on a WebSocket refactoring task revealed a surprising paradox: it produced the most code but delivered some of the lowest quality. Here's what that tells us about the "more is better" assumption in AI-generated code.

Anthropic just released Claude Haiku 4.5, so I immediately ran it through a complex TypeScript task: refactoring a fragile WebSocket client to add exponential backoff, connection state management, and message queuing.
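For context, this is roughly the target shape the task describes — a minimal sketch of my own, not any model's actual output: a client that tracks connection state, retries with capped exponential backoff plus jitter, and buffers messages while disconnected.

```typescript
// Illustrative sketch only; names and structure are mine, not from the eval.
type ConnectionState = "connecting" | "open" | "closed" | "reconnecting";

class ReconnectingClient {
  private ws: WebSocket | null = null;
  private state: ConnectionState = "closed";
  private attempts = 0;
  private readonly queue: string[] = [];

  constructor(private readonly url: string, private readonly maxDelayMs = 30_000) {}

  connect(): void {
    this.state = this.attempts === 0 ? "connecting" : "reconnecting";
    this.ws = new WebSocket(this.url);

    this.ws.onopen = () => {
      this.state = "open";
      this.attempts = 0;
      // Flush anything queued while we were offline.
      while (this.queue.length > 0) {
        this.ws!.send(this.queue.shift()!);
      }
    };

    this.ws.onclose = () => {
      this.state = "closed";
      // Exponential backoff with jitter, capped at maxDelayMs.
      const delay =
        Math.min(this.maxDelayMs, 1_000 * 2 ** this.attempts) * (0.5 + Math.random() / 2);
      this.attempts += 1;
      setTimeout(() => this.connect(), delay);
    };
  }

  send(message: string): void {
    if (this.state === "open" && this.ws) {
      this.ws.send(message);
    } else {
      this.queue.push(message); // Buffer until the next successful reconnect.
    }
  }
}

// Usage: const client = new ReconnectingClient("wss://example.com/feed"); client.connect();
```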

The results were unexpected. Haiku 4.5 wrote 13,666 tokens—the most of all 8 models tested. You'd think more code equals better solution, right?

Haiku scored 74.4/100. Sonnet 4.5 wrote 8,425 tokens (38% less) and scored 89.0/100.

Haiku produced 62% more code but delivered 16% lower quality.

The Test

I ran the same WebSocket refactoring task across 8 current AI models: GPT-5, OpenAI o3, Claude Opus 4.1, Claude Sonnet 4.5, Claude Haiku 4.5, Grok 4, Gemini 2.5 Pro, and GLM 4.6.

Each model received identical inputs: the original fragile code and a task description requesting robust reconnection logic, exponential backoff, state management, and message queuing. All outputs were evaluated by a dynamic judge (GPT-5, the current #1 model on the platform) across 5 criteria: code quality, completeness, correctness, performance, and security.

Full Rankings (All 8 Models)

| Rank | Model | Score | Tokens | Cost | Speed |
|------|-------|-------|--------|------|-------|
| 1 | GPT-5 | 93.4 | 7,919 | $0.080 | 77s |
| 2 | Claude Sonnet 4.5 | 89.0 | 8,425 | $0.128 | 98s |
| 2 | o3 | 89.0 | 5,191 | $0.043 | 38s |
| 4 | Gemini 2.5 Pro | 86.6 | 2,621 | $0.027 | 36s |
| 5 | GLM 4.6 | 84.4 | 3,334 | $0.006 | 51s |
| 6 | Claude Opus 4.1 | 81.6 | 6,052 | $0.464 | 112s |
| 7 | Claude Haiku 4.5 | 74.4 | 13,666 | $0.069 | 73s |
| 8 | Grok 4 | 70.0 | 888 | $0.024 | 101s |

Haiku 4.5 wrote the most code (13,666 tokens) but ranked 7th out of 8 models.

Why Did Haiku Score So Low?

Despite producing the most code, Haiku 4.5's quality suffered across multiple dimensions. Here's the breakdown from the dynamic judge (GPT-5):

Haiku 4.5 Score Breakdown:

  • Code Quality (60/100): "Overly verbose with probable duplicate methods and mixed concerns"
  • Correctness (65/100): "Duplicated send definitions and heavy boilerplate that risks errors"
  • Performance (75/100): "Heavy metrics and layers may add unnecessary overhead"
  • Security (82/100): "Parses safely but complexity can hide edge cases"
  • Completeness (90/100): "Covers backoff, queueing, heartbeats, metrics extensively"

Average: 74.4/100
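The overall score is the unweighted mean of the five criteria: (60 + 65 + 75 + 82 + 90) / 5 = 74.4.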

Haiku tried to do everything—covering all features comprehensively (90/100 on completeness)—but introduced bugs and code quality issues in the process. The extra code wasn't adding value; it was adding complexity.

The Over-Engineering Problem

This is a textbook case of over-engineering. Haiku optimized for thoroughness (adding metrics, extensive logging, multiple abstraction layers) but sacrificed:

  • Code quality: 60/100 vs Sonnet's 88/100
  • Correctness: 65/100 vs Sonnet's 90/100 (duplicate methods, boilerplate errors)
  • Maintainability: Verbose, complex code with mixed concerns

Compare to Sonnet 4.5:

  • Wrote 8,425 tokens (38% less code)
  • Scored 89.0/100 (20% higher quality)
  • Clean structure, no duplicates, production-ready
  • Code quality: 88/100, Correctness: 90/100

For production WebSocket code, would you rather deploy Haiku's 13,666 tokens with duplicate methods and potential bugs? Or Sonnet's 8,425 tokens with cleaner structure and higher correctness?
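To make the judge's "duplicated send definitions" comment concrete, here's a hypothetical reconstruction of the smell (not Haiku's actual code), next to the single-code-path shape the higher-scoring outputs favored:

```typescript
// Hypothetical illustration of duplicated send paths: two near-identical
// methods whose queueing and state checks can silently drift apart.
class VerboseClient {
  private ws: WebSocket | null = null; // connection setup elided
  private queue: string[] = [];

  sendMessage(msg: string): void {
    if (this.ws?.readyState === WebSocket.OPEN) this.ws.send(msg);
    else this.queue.push(msg);
  }

  // A second, overlapping method: callers must know which one to use,
  // and any fix to the queueing logic now has to be applied twice.
  sendJson(payload: unknown): void {
    const msg = JSON.stringify(payload);
    if (this.ws && this.ws.readyState === WebSocket.OPEN) this.ws.send(msg);
    else this.queue.push(msg);
  }
}

// The leaner shape: one send() that every caller funnels through, so the
// state check and the queue live in exactly one place.
class LeanClient {
  private ws: WebSocket | null = null; // connection setup elided
  private queue: string[] = [];

  send(msg: string): void {
    if (this.ws?.readyState === WebSocket.OPEN) this.ws.send(msg);
    else this.queue.push(msg);
  }

  sendJson(payload: unknown): void {
    this.send(JSON.stringify(payload));
  }
}
```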

When Is More Code Actually Worse?

This result challenges a common assumption in AI-generated code: that comprehensive solutions are inherently better. Sometimes "thorough" is just another word for "bloated."

The trade-off:

More code gives you:

  • ✅ Comprehensive feature coverage
  • ✅ Extensive error handling
  • ✅ Detailed logging and metrics

But risks:

  • ❌ Code duplication
  • ❌ Mixed concerns
  • ❌ Harder to maintain
  • ❌ More surface area for bugs

Code Efficiency: Tokens per Quality Point

Looking at tokens per quality point reveals which models write efficient code:

| Model | Tokens/Point | Efficiency |
|-------|--------------|------------|
| Gemini 2.5 Pro | 30 | Most efficient |
| GLM 4.6 | 40 | Very efficient |
| o3 | 58 | Good |
| GPT-5 | 85 | Acceptable |
| Claude Sonnet 4.5 | 95 | Fair |
| Claude Haiku 4.5 | 184 | Least efficient |

Haiku produces 3.2x more code per quality point than o3, and 1.9x more than Sonnet.
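The efficiency column is simple arithmetic over the rankings table: output tokens divided by the judge's score.

```typescript
// Tokens per quality point, using the scores and token counts from the rankings table above.
const results = [
  { model: "GPT-5", tokens: 7_919, score: 93.4 },
  { model: "Claude Sonnet 4.5", tokens: 8_425, score: 89.0 },
  { model: "o3", tokens: 5_191, score: 89.0 },
  { model: "Gemini 2.5 Pro", tokens: 2_621, score: 86.6 },
  { model: "GLM 4.6", tokens: 3_334, score: 84.4 },
  { model: "Claude Haiku 4.5", tokens: 13_666, score: 74.4 },
];

for (const { model, tokens, score } of results) {
  console.log(`${model}: ${Math.round(tokens / score)} tokens per quality point`);
}
// Haiku 4.5 lands at ~184 tokens/point vs ~58 for o3 and ~95 for Sonnet 4.5.
```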

⚠️ Important Context (n=1 evaluation)

This analysis is based on a single evaluation. While the patterns are clear and the judge's reasoning is detailed, this is directional data, not a definitive benchmark.

I'm running 10+ more Haiku 4.5 tests across different task types (security, refactoring, optimization, architecture) to see if this over-engineering pattern holds.

Full evaluation data: View complete results with all model outputs and scoring details →

Test It On Your Code

Want to see how Haiku 4.5, Sonnet, and 6 other models handle your specific use case?

Compare 8 Models on Your Task →

Free to use • See side-by-side outputs • Vote on winner
