10 AI Models, One WebSocket Task: What Code Volume Actually Tells Us
We tested 10 AI models on the same refactoring task. Output sizes varied 15x (888 to 13,666 tokens), but quality didn't follow a simple pattern. Here's what code volume actually reveals about AI performance.
We gave 10 AI models the same challenge: refactor a fragile WebSocket client to add exponential backoff, connection state management, message queuing, and proper cleanup. Each model received identical inputs—the original buggy code and detailed requirements.
The results revealed something unexpected about the relationship between code volume and quality.
Output sizes ranged from 888 tokens (Grok 4) to 13,666 tokens (Claude Haiku 4.5)—a 15.4x difference.
The highest-scoring model (Claude Haiku 4.5, 95.5/100) wrote the MOST code—13,666 tokens. The smallest output (Grok 4) scored just 81.7/100. Sometimes more code really is better code.
How We Tested
All 10 models were tested: GPT-5, OpenAI o3, Claude Opus 4.1, Claude Sonnet 4.5, Claude Haiku 4.5, Grok 4, Grok Code Fast 1, Gemini 2.5 Pro, GLM 4.6, and MiniMax M2.
Each model received identical inputs: the original fragile code and a task description requesting robust reconnection logic, exponential backoff, state management, and message queuing. All outputs were evaluated by 3 independent AI judges (GPT-5, Claude Sonnet 4.5, Gemini 2.5 Pro) across 5 criteria: code quality, completeness, correctness, performance, and security.
How scoring works: Each model got 3 independent scores per criterion from different judges. We take the middle score (median) to avoid bias from any single judge. This is standard practice for AI-judged evaluations.
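To make the aggregation concrete, here's a minimal sketch of median-of-three scoring in TypeScript. The type names and the final averaging step are our own illustration, not the actual evaluation harness:

```typescript
// Illustrative sketch of median-of-3 judge aggregation (not the real harness).
type Criterion = 'quality' | 'completeness' | 'correctness' | 'performance' | 'security';

// Three judge scores per criterion, e.g. { quality: [94, 96, 95], ... }
type JudgeScores = Record<Criterion, [number, number, number]>;

function median3(scores: [number, number, number]): number {
  // Sort a copy and take the middle value, so one outlier judge can't skew the result
  return [...scores].sort((a, b) => a - b)[1];
}

function ensembleScore(perCriterion: JudgeScores): number {
  const criteria = Object.keys(perCriterion) as Criterion[];
  const medians = criteria.map((c) => median3(perCriterion[c]));
  // Final score: average the per-criterion medians
  return medians.reduce((sum, m) => sum + m, 0) / medians.length;
}
```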
The Task: Fixing a Fragile WebSocket Client
Here's the exact challenge all models received:
Task Description:
"This WebSocket client has fragile connection handling that fails silently, doesn't recover from network interruptions, and can spam the server with rapid reconnection attempts. Implement robust reconnection logic with exponential backoff, connection state management, message queuing during disconnections, and proper cleanup. The solution should handle network failures gracefully and maintain a stable connection in production environments."
The problems: No reconnection logic, messages lost during disconnections, no exponential backoff, no connection state tracking, no message queue. A production nightmare. (See Appendix for full code)
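To make one requirement concrete before the rankings, here's a minimal sketch of connection state tracking. The state names and the `ConnectionTracker` class are hypothetical illustrations; each model was free to design its own version:

```typescript
// Hypothetical sketch of the connection-state tracking the task asks for.
enum ConnectionState {
  Disconnected = 'disconnected',
  Connecting = 'connecting',
  Connected = 'connected',
  Reconnecting = 'reconnecting',
  Closed = 'closed', // deliberate shutdown: no further reconnects
}

class ConnectionTracker {
  private state: ConnectionState = ConnectionState.Disconnected;

  transition(next: ConnectionState): void {
    // Once deliberately closed, stay closed (prevents zombie reconnect loops)
    if (this.state === ConnectionState.Closed) return;
    this.state = next;
  }

  get canSend(): boolean {
    return this.state === ConnectionState.Connected;
  }

  get shouldQueue(): boolean {
    // Queue outgoing messages while we expect to come back online
    return this.state === ConnectionState.Connecting
      || this.state === ConnectionState.Reconnecting;
  }
}
```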
Full Rankings (All 10 Models)
| Rank | Model | Score | Tokens | Cost |
|---|---|---|---|---|
| 1 | Claude Haiku 4.5 | 95.5 | 13,666 | $0.069 |
| 2 | Claude Opus 4.1 | 94.7 | 6,052 | $0.464 |
| 3 | GPT-5 | 94.3 | 7,919 | $0.080 |
| 4 | Claude Sonnet 4.5 | 92.9 | 8,425 | $0.128 |
| 5 | GLM 4.6 | 91.6 | 3,334 | $0.006 |
| 6 | MiniMax M2 | 91.5 | 4,856 | $0.006 |
| 7 | OpenAI o3 | 91.2 | 5,191 | $0.043 |
| 8 | Gemini 2.5 Pro | 90.5 | 2,621 | $0.027 |
| 9 | Grok Code Fast 1 | 87.9 | 1,754 | $0.003 |
| 10 | Grok 4 | 81.7 | 888 | $0.024 |
Judges: 3-judge ensemble (GPT-5, Claude Sonnet 4.5, Gemini 2.5 Pro) with median aggregation across 5 criteria. Scores shown are ensemble averages.
Three Patterns Emerged: 14 Points Separate #1 from #10
The 10 models fell into three distinct tiers based on their approach to code volume and quality scores:
Top Tier (94-96): Verbose Excellence
Models: Claude Haiku 4.5 (95.5), Claude Opus 4.1 (94.7), GPT-5 (94.3)
These models wrote substantial code (6,052-13,666 tokens) and prioritized completeness—comprehensive error handling, metrics, logging, and testing infrastructure. Haiku excelled at completeness (98.3/100 ensemble average), with comprehensive metrics, event systems, and logging. The extra code wasn't bloat—it was production-ready engineering. Sometimes verbose really is better.
Example: Haiku implemented bounded message queues with overflow handling, exponential backoff with jitter (preventing thundering herd), and connection lifecycle events for monitoring. Grok 4 omitted all three—its "minimalist" approach was incomplete, not elegant.
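If you haven't implemented backoff with jitter, here's a minimal sketch of the pattern Haiku is credited with above. The constants, function names, and "full jitter" variant are our illustrative choices, not Haiku's exact code:

```typescript
// Sketch of exponential backoff with full jitter. Constants are illustrative.
function backoffDelayMs(attempt: number, baseMs = 500, capMs = 30_000): number {
  // Exponential growth: 500ms, 1s, 2s, 4s, ... capped at 30s
  const expDelay = Math.min(capMs, baseMs * 2 ** attempt);
  // Full jitter: a random delay in [0, expDelay) spreads out clients that all
  // disconnected at the same moment, avoiding the thundering herd
  return Math.random() * expDelay;
}

async function reconnectWithBackoff(
  connect: () => Promise<void>,
  maxAttempts = 10,
): Promise<void> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      await connect();
      return; // success; the attempt counter resets on the next disconnect
    } catch {
      await new Promise((resolve) => setTimeout(resolve, backoffDelayMs(attempt)));
    }
  }
  throw new Error('Reconnect failed after maximum attempts');
}
```

Without the jitter line, every client that dropped at the same instant would retry on the same schedule and hammer the server in synchronized waves.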
Mid Tier (91-93): Balanced Competence
Models: Claude Sonnet 4.5 (92.9), GLM 4.6 (91.6), MiniMax M2 (91.5), o3 (91.2)
These models wrote moderate amounts of code (3,334-8,425 tokens) and scored in the 91-93 range: a balanced approach that prioritizes correctness and code quality without excessive verbosity. All four are production-viable solutions.
Lower Tier (81-91): Minimal Incompleteness
Models: Gemini 2.5 Pro (90.5), Grok Code Fast 1 (87.9), Grok 4 (81.7)
Grok 4 wrote the least code by far (888 tokens) and scored lowest (81.7/100), missing critical features like jitter, bounded queues, and proper cleanup. The minimalist approach sacrificed completeness (77.7/100 ensemble average) for brevity. Brevity without completeness isn't better code—it's incomplete code.
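For contrast, here's a minimal sketch of the bounded message queue Grok 4 omitted. The capacity and the drop-oldest overflow policy are illustrative choices, not requirements from the task:

```typescript
// Sketch of a bounded send queue with drop-oldest overflow handling.
class BoundedQueue<T> {
  private items: T[] = [];

  constructor(private readonly capacity = 100) {}

  // Returns the dropped item (if any) so the caller can log or count losses
  enqueue(item: T): T | undefined {
    let dropped: T | undefined;
    if (this.items.length >= this.capacity) {
      // Drop the oldest message so memory stays bounded during long outages
      dropped = this.items.shift();
    }
    this.items.push(item);
    return dropped;
  }

  // Flush queued messages once the socket is open again
  drain(send: (item: T) => void): void {
    while (this.items.length > 0) {
      send(this.items.shift()!);
    }
  }
}
```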
The Code Volume Paradox
Code volume alone doesn't tell you whether code is good, but in this test it tracked quality more closely than the "concise is better" intuition predicts: the four largest outputs took the top four spots, and the two smallest finished last.
The Spectrum:
- 888 tokens (Grok 4): 81.7/100. Too minimal, missing features.
- 5,191 tokens (o3): 91.2/100. Efficient and correct, but not the most complete.
- 13,666 tokens (Haiku): 95.5/100. Verbose but thorough, and the HIGHEST quality.
The lesson: For production code, more code often IS better code. Haiku's verbosity bought comprehensive error handling, metrics, logging, and testing: features that matter in real systems. The "perfect balance" narrative, the idea that mid-sized outputs hit a sweet spot between bloat and omission, was wrong here. Thoroughness wins.
Code Efficiency: Tokens per Quality Point
Dividing tokens by score shows how much code each model needed per quality point. Think of this as "cost per unit of quality": for example, Claude Haiku 4.5 spent 13,666 / 95.5 ≈ 143 tokens per point, while Grok 4 spent just 888 / 81.7 ≈ 11. Lower numbers mean more efficient:
| Rank | Model | Tokens/Point | Quality Score |
|---|---|---|---|
| 1 | Grok 4 | 11 | 81.7 |
| 2 | Grok Code Fast 1 | 20 | 87.9 |
| 3 | Gemini 2.5 Pro | 29 | 90.5 |
| 4 | GLM 4.6 | 36 | 91.6 |
| 5 | MiniMax M2 | 53 | 91.5 |
| 6 | o3 | 57 | 91.2 |
| 7 | Claude Opus 4.1 | 64 | 94.7 |
| 8 | GPT-5 | 84 | 94.3 |
| 9 | Claude Sonnet 4.5 | 91 | 92.9 |
| 10 | Claude Haiku 4.5 | 143 | 95.5 |
The efficiency paradox resolved: Haiku is the least efficient (143 tokens/point) but achieved the HIGHEST quality (95.5/100). Meanwhile, Grok 4 is the most efficient (11 tokens/point) but scored lowest overall (81.7/100).
This confirms that efficiency ≠ quality. For production systems, verbose thoroughness (Haiku, Opus) beat minimal elegance (Grok 4): the extra tokens bought exactly the error handling, metrics, logging, and testing that separate a demo from a production system.
ℹ️ Transparency & Full Data
Want to see the raw scores and judge reasoning? All data is public and verifiable.
View full evaluation: See all 3 judges' scores, detailed reasoning, and complete model outputs →
Test All 10 Models On Your Code
Want to see how Haiku, GPT-5, Opus, and 7 other models handle your specific task?
Compare 10 Models on Your Task →
Free to use • See side-by-side outputs • Vote on winner
Appendix: The Original Fragile Code
Here's the complete WebSocket client implementation that all 10 models were asked to refactor:
What's wrong with this code?
This 70-line client has 6 critical flaws: no reconnection logic, no error recovery, no message queuing (messages lost during disconnections), no connection state tracking, no exponential backoff (can spam server), and no cleanup on disconnect.
```typescript
// websocket-client.ts - Fragile WebSocket implementation
class ChatClient {
  private ws: WebSocket | null = null;
  private url: string;
  private isConnected: boolean = false;

  constructor(url: string) {
    this.url = url;
  }

  // Basic connect - no error handling
  connect() {
    this.ws = new WebSocket(this.url);

    this.ws.onopen = () => {
      console.log('Connected');
      this.isConnected = true;
    };

    this.ws.onmessage = (event) => {
      console.log('Message:', event.data);
      // Process message (no error handling)
      const data = JSON.parse(event.data);
      this.handleMessage(data);
    };

    this.ws.onerror = (error) => {
      console.error('Error:', error);
      // What now? No recovery logic
    };

    this.ws.onclose = () => {
      console.log('Disconnected');
      this.isConnected = false;
      // Just log it, no reconnection
    };
  }

  // Send message - doesn't check connection state
  send(message: any) {
    if (this.ws) {
      this.ws.send(JSON.stringify(message));
    }
    // If no connection, message is silently lost!
  }

  // Handle incoming messages
  private handleMessage(data: any) {
    switch (data.type) {
      case 'chat':
        console.log(`${data.user}: ${data.message}`);
        break;
      case 'notification':
        console.log(`Notification: ${data.text}`);
        break;
      default:
        console.log('Unknown message type');
    }
  }

  // Disconnect - but can't reconnect after calling this
  disconnect() {
    if (this.ws) {
      this.ws.close();
      this.ws = null;
    }
  }
}
```