

10 AI Models, One WebSocket Task: What Code Volume Actually Tells Us

We tested 10 AI models on the same refactoring task. Output sizes varied 15x (888 to 13,666 tokens), and this time the most verbose model took the top score. Here's what code volume actually reveals about AI performance.

We gave 10 AI models the same challenge: refactor a fragile WebSocket client to add exponential backoff, connection state management, message queuing, and proper cleanup. Each model received identical inputs—the original buggy code and detailed requirements.

The results revealed something unexpected about the relationship between code volume and quality.

Output sizes ranged from 888 tokens (Grok 4) to 13,666 tokens (Claude Haiku 4.5)—a 15.4x difference.

The highest-scoring model (Claude Haiku 4.5, 95.5/100) wrote the MOST code—13,666 tokens. The smallest output (Grok 4) scored just 81.7/100. Sometimes more code really is better code.

How We Tested

The lineup: GPT-5, OpenAI o3, Claude Opus 4.1, Claude Sonnet 4.5, Claude Haiku 4.5, Grok 4, Grok Code Fast 1, Gemini 2.5 Pro, GLM 4.6, and MiniMax M2.

Each model received identical inputs: the original fragile code and a task description requesting robust reconnection logic, exponential backoff, state management, and message queuing. All outputs were evaluated by 3 independent AI judges (GPT-5, Claude Sonnet 4.5, Gemini 2.5 Pro) across 5 criteria: code quality, completeness, correctness, performance, and security.

How scoring works: Each model got 3 independent scores per criterion from different judges. We take the middle score (median) to avoid bias from any single judge. This is standard practice for AI-judged evaluations.
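
For illustration, here's a minimal sketch of that aggregation step in TypeScript. The criteria match the five above; the judge scores are hypothetical placeholders, not the actual evaluation data.

// median-aggregation.ts - illustrative sketch of 3-judge median scoring
type Criterion = 'codeQuality' | 'completeness' | 'correctness' | 'performance' | 'security';

// Median of exactly three numbers: sort a copy, take the middle value.
function median3(scores: [number, number, number]): number {
  return [...scores].sort((a, b) => a - b)[1];
}

// Hypothetical per-criterion scores from the three judges for one model.
const judgeScores: Record<Criterion, [number, number, number]> = {
  codeQuality: [94, 97, 96],
  completeness: [98, 99, 98],
  correctness: [93, 96, 95],
  performance: [92, 95, 94],
  security: [90, 94, 93],
};

// Median per criterion, then average across the five criteria.
const medians = Object.values(judgeScores).map(median3);
const overall = medians.reduce((sum, m) => sum + m, 0) / medians.length;
console.log(overall.toFixed(1)); // "95.2" for these placeholder numbers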

The Task: Fixing a Fragile WebSocket Client

Here's the exact challenge all models received:

Task Description:

"This WebSocket client has fragile connection handling that fails silently, doesn't recover from network interruptions, and can spam the server with rapid reconnection attempts. Implement robust reconnection logic with exponential backoff, connection state management, message queuing during disconnections, and proper cleanup. The solution should handle network failures gracefully and maintain a stable connection in production environments."

The problems: No reconnection logic, messages lost during disconnections, no exponential backoff, no connection state tracking, no message queue. A production nightmare. (See Appendix for full code)
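
To make "connection state management" concrete, here's one common way to model it. This is a sketch only; the state names are illustrative, not taken from any model's output.

// connection-state.ts - illustrative state model for the requested refactor
type ConnectionState =
  | 'disconnected'   // initial state, or after a clean close
  | 'connecting'     // first connection attempt in flight
  | 'connected'      // socket open, safe to send
  | 'reconnecting'   // waiting out a backoff delay before retrying
  | 'closed';        // terminal: the user called disconnect()

class ConnectionStateMachine {
  private state: ConnectionState = 'disconnected';

  transition(next: ConnectionState): void {
    console.log(`connection state: ${this.state} -> ${next}`);
    this.state = next;
  }

  // Messages should only go straight to the socket in this state;
  // otherwise they belong in the queue.
  get canSend(): boolean {
    return this.state === 'connected';
  }
}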

Full Rankings (All 10 Models)

Rank  Model              Score  Tokens  Cost
1     Claude Haiku 4.5   95.5   13,666  $0.069
2     Claude Opus 4.1    94.7   6,052   $0.464
3     GPT-5              94.3   7,919   $0.080
4     Claude Sonnet 4.5  92.9   8,425   $0.128
5     GLM 4.6            91.6   3,334   $0.006
6     MiniMax M2         91.5   4,856   $0.006
7     OpenAI o3          91.2   5,191   $0.043
8     Gemini 2.5 Pro     90.5   2,621   $0.027
9     Grok Code Fast 1   87.9   1,754   $0.003
10    Grok 4             81.7   888     $0.024

Judges: 3-judge ensemble (GPT-5, Claude Sonnet 4.5, Gemini 2.5 Pro) with median aggregation across 5 criteria. Scores shown are ensemble averages.

Three Patterns Emerged: 14 Points Separate #1 from #10

The 10 models fell into three distinct tiers, defined by how much code they wrote and how they scored:

Top Tier (94-96): Verbose Excellence

Models: Claude Haiku 4.5 (95.5), Claude Opus 4.1 (94.7), GPT-5 (94.3)

These models wrote substantial code (6,052-13,666 tokens) and prioritized completeness: comprehensive error handling, metrics, event systems, logging, and testing infrastructure. Haiku led the field on completeness with a 98.3/100 ensemble average. The extra code wasn't bloat—it was production-ready engineering. Sometimes verbose really is better.

Example: Haiku implemented bounded message queues with overflow handling, exponential backoff with jitter (preventing thundering herd), and connection lifecycle events for monitoring. Grok 4 omitted all three—its "minimalist" approach was incomplete, not elegant.
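
For reference, exponential backoff with full jitter takes only a few lines. This is a sketch of the general technique, not Haiku's actual output:

// backoff.ts - exponential backoff with full jitter (illustrative)
function backoffDelay(attempt: number, baseMs = 500, maxMs = 30_000): number {
  // Exponential growth, capped: 500ms, 1s, 2s, 4s, ... up to 30s.
  const cap = Math.min(maxMs, baseMs * 2 ** attempt);
  // Full jitter: pick a random delay in [0, cap) so a fleet of clients
  // that disconnected together doesn't reconnect in lockstep.
  return Math.random() * cap;
}

// Usage sketch: schedule the next reconnection attempt.
// setTimeout(() => client.connect(), backoffDelay(retryCount));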

Mid Tier (91-93): Balanced Competence

Models: Claude Sonnet 4.5 (92.9), GLM 4.6 (91.6), MiniMax M2 (91.5), o3 (91.2)

These models wrote moderate amounts of code (3,334-8,425 tokens) and scored in the 91-93 range, balancing correctness and code quality without excessive verbosity. All four produced production-viable solutions.

Lower Tier (82-91): Minimal Incompleteness

Models: Gemini 2.5 Pro (90.5), Grok Code Fast 1 (87.9), Grok 4 (81.7)

Grok 4 wrote the least code by far (888 tokens) and scored lowest (81.7/100), missing critical features like jitter, bounded queues, and proper cleanup. The minimalist approach sacrificed completeness (77.7/100 ensemble average) for brevity. Brevity without completeness isn't better code—it's incomplete code.
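
A bounded queue, one of the features Grok 4 skipped, costs little to implement. Here's an illustrative drop-oldest variant; the class name and overflow policy are assumptions, not drawn from any model's output:

// bounded-queue.ts - message queue with drop-oldest overflow (illustrative)
class BoundedQueue<T> {
  private items: T[] = [];

  constructor(private readonly maxSize: number) {}

  enqueue(item: T): void {
    if (this.items.length >= this.maxSize) {
      this.items.shift(); // overflow: drop the oldest message, keep the newest
    }
    this.items.push(item);
  }

  // On reconnect, flush everything that accumulated while offline.
  drain(send: (item: T) => void): void {
    while (this.items.length > 0) {
      send(this.items.shift()!);
    }
  }
}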

The Code Volume Paradox

Going in, we expected code volume to be a poor predictor of quality. In this test, the data said otherwise: the longer, more complete solutions scored higher.

The Spectrum:

  • 888 tokens (Grok 4): 81.7/100 — Too minimal, missing features
  • 5,191 tokens (o3): 91.2/100 — Efficient and correct, but not the most complete
  • 13,666 tokens (Haiku): 95.5/100 — Verbose but thorough, HIGHEST quality

The lesson: For production code, more code often IS better code. Haiku's verbosity bought comprehensive error handling, metrics, logging, and testing—features that matter in real systems. The "perfect balance" narrative (that the best output sits in the middle of the size range) didn't hold here. Thoroughness wins.

Code Efficiency: Tokens per Quality Point

Dividing tokens by score reveals which models write the most efficient code. Think of this as "cost per unit of quality"—how much code does each model need to achieve each quality point? Lower numbers mean more efficient:
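
For example, Claude Haiku 4.5 spent 13,666 ÷ 95.5 ≈ 143 tokens per quality point, while Grok 4 spent just 888 ÷ 81.7 ≈ 11.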

Rank  Model              Tokens/Point  Quality Score
1     Grok 4             11            81.7
2     Grok Code Fast 1   20            87.9
3     Gemini 2.5 Pro     29            90.5
4     GLM 4.6            36            91.6
5     MiniMax M2         53            91.5
6     OpenAI o3          57            91.2
7     Claude Opus 4.1    64            94.7
8     GPT-5              84            94.3
9     Claude Sonnet 4.5  91            92.9
10    Claude Haiku 4.5   143           95.5

The efficiency paradox resolved: Haiku is the least efficient (143 tokens/point) but achieved the HIGHEST quality (95.5/100). Meanwhile, Grok 4 is the most efficient (11 tokens/point) but scored lowest overall (81.7/100).

This shows that efficiency ≠ quality. For production systems, verbose thoroughness (Haiku, Opus) beat minimal elegance (Grok 4): the extra tokens bought the error handling, metrics, logging, and testing that real systems demand.

ℹ️ Transparency & Full Data

Want to see the raw scores and judge reasoning? All data is public and verifiable.

View full evaluation: See all 3 judges' scores, detailed reasoning, and complete model outputs →


Appendix: The Original Fragile Code

Here's the complete WebSocket client implementation that all 10 models were asked to refactor:

What's wrong with this code?

This 70-line client has 6 critical flaws: no reconnection logic, no error recovery, no message queuing (messages lost during disconnections), no connection state tracking, no exponential backoff (can spam server), and no cleanup on disconnect.

// websocket-client.ts - Fragile WebSocket implementation
class ChatClient {
  private ws: WebSocket | null = null;
  private url: string;
  private isConnected: boolean = false;

  constructor(url: string) {
    this.url = url;
  }

  // Basic connect - no error handling
  connect() {
    this.ws = new WebSocket(this.url);

    this.ws.onopen = () => {
      console.log('Connected');
      this.isConnected = true;
    };

    this.ws.onmessage = (event) => {
      console.log('Message:', event.data);
      // Process message (no error handling)
      const data = JSON.parse(event.data);
      this.handleMessage(data);
    };

    this.ws.onerror = (error) => {
      console.error('Error:', error);
      // What now? No recovery logic
    };

    this.ws.onclose = () => {
      console.log('Disconnected');
      this.isConnected = false;
      // Just log it, no reconnection
    };
  }

  // Send message - doesn't check connection state
  send(message: any) {
    if (this.ws) {
      this.ws.send(JSON.stringify(message));
    }
    // If no connection, message is silently lost!
  }

  // Handle incoming messages
  private handleMessage(data: any) {
    switch (data.type) {
      case 'chat':
        console.log(`${data.user}: ${data.message}`);
        break;
      case 'notification':
        console.log(`Notification: ${data.text}`);
        break;
      default:
        console.log('Unknown message type');
    }
  }

  // Disconnect - but can't reconnect after calling this
  disconnect() {
    if (this.ws) {
      this.ws.close();
      this.ws = null;
    }
  }
}
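
For contrast, here's roughly what a disconnect() with proper cleanup looks like: detach the handlers so the stale socket can't fire callbacks, then leave the client in a state where connect() works again. An illustrative sketch, not any model's actual output.

// cleanup-sketch.ts - illustrative fix for the cleanup flaw above
class CleanableChatClient {
  private ws: WebSocket | null = null;
  private isConnected = false;

  disconnect() {
    if (this.ws) {
      // Detach handlers first so no callbacks fire on the stale socket
      // (in particular, onclose must not trigger reconnection logic here).
      this.ws.onopen = null;
      this.ws.onmessage = null;
      this.ws.onerror = null;
      this.ws.onclose = null;
      this.ws.close();
      this.ws = null;
    }
    this.isConnected = false;
    // The client is now back in a clean state; connect() can be called again.
  }
}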