10 AI Models, One WebSocket Task: What Code Volume Actually Tells Us
We tested 10 AI models on the same refactoring task. Output sizes varied by roughly 15x (888 to 13,666 tokens), but quality didn't track size in any simple way. Haiku 4.5 wrote the most code and scored highest, challenging the "less is more" narrative. Here's what code volume actually reveals about AI performance.