10 AI Models, One WebSocket Task: What Code Volume Actually Tells Us
We tested 10 AI models on the same refactoring task. Output sizes varied 15x (888 to 13,666 tokens), but quality didn't follow a simple pattern. Here's what code volume actually reveals about AI performance.
We gave 10 AI models the same challenge: refactor a fragile WebSocket client to add exponential backoff, connection state management, message queuing, and proper cleanup. Each model received identical inputs—the original buggy code and detailed requirements.
The results revealed something unexpected about the relationship between code volume and quality.
Output sizes ranged from 888 tokens (Grok 4) to 13,666 tokens (Claude Haiku 4.5)—a 15.4x difference.
The highest-scoring model (Claude Haiku 4.5, 95.5/100) wrote the MOST code—13,666 tokens. The smallest output (Grok 4) scored just 81.7/100. Sometimes more code really is better code.
How We Tested
All 10 models were tested: GPT-5, OpenAI o3, Claude Opus 4.1, Claude Sonnet 4.5, Claude Haiku 4.5, Grok 4, Grok Code Fast 1, Gemini 2.5 Pro, GLM 4.6, and MiniMax M2.
Each model received identical inputs: the original fragile code and a task description requesting robust reconnection logic, exponential backoff, state management, and message queuing. All outputs were evaluated by 3 independent AI judges (GPT-5, Claude Sonnet 4.5, Gemini 2.5 Pro) across 5 criteria: code quality, completeness, correctness, performance, and security.
How scoring works: Each model got 3 independent scores per criterion from different judges. We take the middle score (median) to avoid bias from any single judge. This is standard practice for AI-judged evaluations.
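To make the aggregation concrete, here's a minimal sketch of median-of-three scoring in TypeScript. The type names and the final averaging step are our own illustration, not the actual evaluation harness:

```typescript
// Illustrative sketch of median-of-3 judge aggregation (not the real harness).
type Criterion = 'quality' | 'completeness' | 'correctness' | 'performance' | 'security';

// Three judge scores per criterion, e.g. { quality: [94, 96, 95], ... }
type JudgeScores = Record<Criterion, [number, number, number]>;

function median3(scores: [number, number, number]): number {
  // Sort a copy and take the middle value, so one outlier judge can't skew the result
  return [...scores].sort((a, b) => a - b)[1];
}

function ensembleScore(perCriterion: JudgeScores): number {
  const criteria = Object.keys(perCriterion) as Criterion[];
  const medians = criteria.map((c) => median3(perCriterion[c]));
  // Final score: average the per-criterion medians
  return medians.reduce((sum, m) => sum + m, 0) / medians.length;
}
```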
The Task: Fixing a Fragile WebSocket Client
Here's the exact challenge all models received:
Task Description:
"This WebSocket client has fragile connection handling that fails silently, doesn't recover from network interruptions, and can spam the server with rapid reconnection attempts. Implement robust reconnection logic with exponential backoff, connection state management, message queuing during disconnections, and proper cleanup. The solution should handle network failures gracefully and maintain a stable connection in production environments."
The problems: No reconnection logic, messages lost during disconnections, no exponential backoff, no connection state tracking, no message queue. A production nightmare. (See Appendix for full code)
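To make one requirement concrete before the rankings, here's a minimal sketch of connection state tracking. The state names and the `ConnectionTracker` class are hypothetical illustrations; each model was free to design its own version:

```typescript
// Hypothetical sketch of the connection-state tracking the task asks for.
enum ConnectionState {
  Disconnected = 'disconnected',
  Connecting = 'connecting',
  Connected = 'connected',
  Reconnecting = 'reconnecting',
  Closed = 'closed', // deliberate shutdown: no further reconnects
}

class ConnectionTracker {
  private state: ConnectionState = ConnectionState.Disconnected;

  transition(next: ConnectionState): void {
    // Once deliberately closed, stay closed (prevents zombie reconnect loops)
    if (this.state === ConnectionState.Closed) return;
    this.state = next;
  }

  get canSend(): boolean {
    return this.state === ConnectionState.Connected;
  }

  get shouldQueue(): boolean {
    // Queue outgoing messages while we expect to come back online
    return this.state === ConnectionState.Connecting
      || this.state === ConnectionState.Reconnecting;
  }
}
```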
Full Rankings (All 10 Models)
| Rank | Model | Score | Tokens | Cost |
|---|---|---|---|---|
| 1 | Claude Haiku 4.5 | 95.5 | 13,666 | $0.069 |
| 2 | Claude Opus 4.1 | 94.7 | 6,052 | $0.464 |
| 3 | GPT-5 | 94.3 | 7,919 | $0.080 |
| 4 | Claude Sonnet 4.5 | 92.9 | 8,425 | $0.128 |
| 5 | GLM 4.6 | 91.6 | 3,334 | $0.006 |
| 6 | MiniMax M2 | 91.5 | 4,856 | $0.006 |
| 7 | OpenAI o3 | 91.2 | 5,191 | $0.043 |
| 8 | Gemini 2.5 Pro | 90.5 | 2,621 | $0.027 |
| 9 | Grok Code Fast 1 | 87.9 | 1,754 | $0.003 |
| 10 | Grok 4 | 81.7 | 888 | $0.024 |
Judges: 3-judge ensemble (GPT-5, Claude Sonnet 4.5, Gemini 2.5 Pro) with median aggregation across 5 criteria. Scores shown are ensemble averages.
Three Patterns Emerged: 14 Points Separate #1 from #10
The 10 models fell into three distinct tiers based on their approach to code volume and quality scores:
Top Tier (94-96): Verbose Excellence
Models: Claude Haiku 4.5 (95.5), Claude Opus 4.1 (94.7), GPT-5 (94.3)
These models wrote substantial code (6,052-13,666 tokens) and prioritized completeness—comprehensive error handling, metrics, logging, and testing infrastructure. Haiku excelled at completeness (98.3/100 ensemble average), with comprehensive metrics, event systems, and logging. The extra code wasn't bloat—it was production-ready engineering. Sometimes verbose really is better.
Example: Haiku implemented bounded message queues with overflow handling, exponential backoff with jitter (preventing thundering herd), and connection lifecycle events for monitoring. Grok 4 omitted all three—its "minimalist" approach was incomplete, not elegant.
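If you haven't implemented backoff with jitter, here's a minimal sketch of the pattern Haiku is credited with above. The constants, function names, and "full jitter" variant are our illustrative choices, not Haiku's exact code:

```typescript
// Sketch of exponential backoff with full jitter. Constants are illustrative.
function backoffDelayMs(attempt: number, baseMs = 500, capMs = 30_000): number {
  // Exponential growth: 500ms, 1s, 2s, 4s, ... capped at 30s
  const expDelay = Math.min(capMs, baseMs * 2 ** attempt);
  // Full jitter: a random delay in [0, expDelay) spreads out clients that all
  // disconnected at the same moment, avoiding the thundering herd
  return Math.random() * expDelay;
}

async function reconnectWithBackoff(
  connect: () => Promise<void>,
  maxAttempts = 10,
): Promise<void> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      await connect();
      return; // success; the attempt counter resets on the next disconnect
    } catch {
      await new Promise((resolve) => setTimeout(resolve, backoffDelayMs(attempt)));
    }
  }
  throw new Error('Reconnect failed after maximum attempts');
}
```

Without the jitter line, every client that dropped at the same instant would retry on the same schedule and hammer the server in synchronized waves.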
Mid Tier (91-93): Balanced Competence
Models: Claude Sonnet 4.5 (92.9), GLM 4.6 (91.6), MiniMax M2 (91.5), o3 (91.2)
These models wrote moderate amounts of code (3,334-8,425 tokens) and scored in the 91-93 range: a balanced approach that prioritizes correctness and code quality without excessive verbosity. All four are production-viable solutions.
Lower Tier (81-91): Minimal Incompleteness
Models: Gemini 2.5 Pro (90.5), Grok Code Fast 1 (87.9), Grok 4 (81.7)
Grok 4 wrote the least code by far (888 tokens) and scored lowest (81.7/100), missing critical features like jitter, bounded queues, and proper cleanup. The minimalist approach sacrificed completeness (77.7/100 ensemble average) for brevity. Brevity without completeness isn't better code—it's incomplete code.
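For contrast, here's a minimal sketch of the bounded message queue Grok 4 omitted. The capacity and the drop-oldest overflow policy are illustrative choices, not requirements from the task:

```typescript
// Sketch of a bounded send queue with drop-oldest overflow handling.
class BoundedQueue<T> {
  private items: T[] = [];

  constructor(private readonly capacity = 100) {}

  // Returns the dropped item (if any) so the caller can log or count losses
  enqueue(item: T): T | undefined {
    let dropped: T | undefined;
    if (this.items.length >= this.capacity) {
      // Drop the oldest message so memory stays bounded during long outages
      dropped = this.items.shift();
    }
    this.items.push(item);
    return dropped;
  }

  // Flush queued messages once the socket is open again
  drain(send: (item: T) => void): void {
    while (this.items.length > 0) {
      send(this.items.shift()!);
    }
  }
}
```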
The Code Volume Paradox
Code volume alone doesn't tell you whether code is good, but in this test it tracked quality more closely than the "concise is better" intuition predicts: the four largest outputs took the top four spots, and the two smallest finished last.
The Spectrum:
- 888 tokens (Grok 4): 81.7/100. Too minimal, missing features.
- 5,191 tokens (o3): 91.2/100. Efficient and correct, but not the most complete.
- 13,666 tokens (Haiku): 95.5/100. Verbose but thorough, and the HIGHEST quality.
The lesson: For production code, more code often IS better code. Haiku's verbosity bought comprehensive error handling, metrics, logging, and testing: features that matter in real systems. The "perfect balance" narrative, the idea that mid-sized outputs hit a sweet spot between bloat and omission, was wrong here. Thoroughness wins.
Code Efficiency: Tokens per Quality Point
Dividing tokens by score shows how much code each model needed per quality point. Think of this as "cost per unit of quality": for example, Claude Haiku 4.5 spent 13,666 / 95.5 ≈ 143 tokens per point, while Grok 4 spent just 888 / 81.7 ≈ 11. Lower numbers mean more efficient:
| Rank | Model | Tokens/Point | Quality Score |
|---|---|---|---|
| 1 | Grok 4 | 11 | 81.7 |
| 2 | Grok Code Fast 1 | 20 | 87.9 |
| 3 | Gemini 2.5 Pro | 29 | 90.5 |
| 4 | GLM 4.6 | 36 | 91.6 |
| 5 | MiniMax M2 | 53 | 91.5 |
| 6 | o3 | 57 | 91.2 |
| 7 | Claude Opus 4.1 | 64 | 94.7 |
| 8 | GPT-5 | 84 | 94.3 |
| 9 | Claude Sonnet 4.5 | 91 | 92.9 |
| 10 | Claude Haiku 4.5 | 143 | 95.5 |
The efficiency paradox resolved: Haiku is the least efficient (143 tokens/point) but achieved the HIGHEST quality (95.5/100). Meanwhile, Grok 4 is the most efficient (11 tokens/point) but scored lowest overall (81.7/100).
This confirms that efficiency ≠ quality. For production systems, verbose thoroughness (Haiku, Opus) beat minimal elegance (Grok 4): the extra tokens bought exactly the error handling, metrics, logging, and testing that separate a demo from a production system.
ℹ️ Transparency & Full Data
Want to see the raw scores and judge reasoning? All data is public and verifiable.
View full evaluation: See all 3 judges' scores, detailed reasoning, and complete model outputs →
Test All 10 Models On Your Code
Want to see how Haiku, GPT-5, Opus, and 7 other models handle your specific task?
Compare 10 Models on Your Task →
Free to use • See side-by-side outputs • Vote on winner
Appendix: The Original Fragile Code
Here's the complete WebSocket client implementation that all 10 models were asked to refactor:
What's wrong with this code?
This 70-line client has 6 critical flaws: no reconnection logic, no error recovery, no message queuing (messages lost during disconnections), no connection state tracking, no exponential backoff (can spam server), and no cleanup on disconnect.
```typescript
// websocket-client.ts - Fragile WebSocket implementation
class ChatClient {
  private ws: WebSocket | null = null;
  private url: string;
  private isConnected: boolean = false;

  constructor(url: string) {
    this.url = url;
  }

  // Basic connect - no error handling
  connect() {
    this.ws = new WebSocket(this.url);

    this.ws.onopen = () => {
      console.log('Connected');
      this.isConnected = true;
    };

    this.ws.onmessage = (event) => {
      console.log('Message:', event.data);
      // Process message (no error handling)
      const data = JSON.parse(event.data);
      this.handleMessage(data);
    };

    this.ws.onerror = (error) => {
      console.error('Error:', error);
      // What now? No recovery logic
    };

    this.ws.onclose = () => {
      console.log('Disconnected');
      this.isConnected = false;
      // Just log it, no reconnection
    };
  }

  // Send message - doesn't check connection state
  send(message: any) {
    if (this.ws) {
      this.ws.send(JSON.stringify(message));
    }
    // If no connection, message is silently lost!
  }

  // Handle incoming messages
  private handleMessage(data: any) {
    switch (data.type) {
      case 'chat':
        console.log(`${data.user}: ${data.message}`);
        break;
      case 'notification':
        console.log(`Notification: ${data.text}`);
        break;
      default:
        console.log('Unknown message type');
    }
  }

  // Disconnect - but can't reconnect after calling this
  disconnect() {
    if (this.ws) {
      this.ws.close();
      this.ws = null;
    }
  }
}
```