The 90x Carbon Gap
We launched CodeLens.AI five days ago with carbon tracking built in. After analyzing 28 real developer tasks (173 AI executions, each task run across 7-8 models), one finding stands out:
- Simple JavaScript bug fix: 392 output tokens
- Complex Python debugging: 11,910 output tokens
But here's what surprised us even more: when we looked at averages across all tasks, the gap between providers was still massive—and it had less to do with model efficiency than you'd think.
Why Google Beats Everyone (It's Not the Model)
We expected GPT-5 to be the most carbon-intensive (it's the most verbose). What we didn't expect: grid carbon intensity would matter more than model efficiency.
Model | Avg Carbon | vs. Gemini | Grid Source | Sample Size |
---|---|---|---|---|
🥇 Gemini 2.5 Pro | 0.65g CO2 | — | GCP (75%+ renewable) | 34 |
Claude Opus 4.1 | 0.89g CO2 | 1.4x | GCP (carbon-neutral) | 29 |
Grok 4 | 1.06g CO2 | 1.6x | Unknown (US avg) | 34 |
Claude Sonnet 4.5 | 1.10g CO2 | 1.7x | GCP (carbon-neutral) | 31 |
OpenAI o3 | 1.20g CO2 | 1.8x | AWS US East | 34 |
GPT-5 | 2.31g CO2 | 3.6x | AWS US East | 32 |
GLM 4.6 (Zhipu AI) | 3.09g CO2 | 4.8x | China (coal grid) | 5 |
The takeaway: Google's 75%+ renewable energy infrastructure gives Gemini a structural advantage that has nothing to do with the model itself. AWS US East (where OpenAI runs) draws on a far more fossil-fuel-heavy grid, and it shows: GPT-5 produces 3.6x more CO2 per query than Gemini on average.
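To see why grid mix dominates, here's a back-of-envelope sketch: hold per-query energy fixed and vary only the grid's carbon intensity. Every number below is an illustrative assumption, not a measured value:

```python
# Illustrative only: identical model energy, different grid intensity.
WH_PER_QUERY = 3.0  # assumed per-query energy, held constant for both grids

GRID_G_CO2_PER_KWH = {          # assumed blended grid intensities
    "GCP (75%+ renewable)": 100,
    "AWS US East": 400,
}

for grid, intensity in GRID_G_CO2_PER_KWH.items():
    grams = WH_PER_QUERY / 1000 * intensity  # Wh -> kWh, then g CO2
    print(f"{grid}: {grams:.2f}g CO2 per query")
```

With these assumed intensities, the exact same query produces 4x more CO2 on the dirtier grid before model efficiency enters the picture at all.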
Output Length Matters More Than You Think
The second-biggest factor? How much the model outputs. We bucketed the executions by output length and found a clear pattern:
Output Length | Executions | Avg Carbon | Range | vs. Smallest |
---|---|---|---|---|
0-1K tokens | 24 | 0.39g CO2 | 0.07-1.51g | — |
1K-3K tokens | 68 | 0.54g CO2 | 0.26-1.06g | 1.4x |
3K-5K tokens | 54 | 1.15g CO2 | 0.48-5.37g | 2.9x |
5K-10K tokens | 38 | 2.07g CO2 | 0.77-5.08g | 5.3x |
10K+ tokens | 15 | 4.05g CO2 | 1.94-6.64g | 10.4x |
10x carbon increase from shortest to longest outputs. This explains why GPT-5 ranks worst on average—it outputs nearly 8K tokens per query, compared to Gemini's 2.9K tokens.
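The bucketing itself is straightforward; here's a minimal pandas sketch with a handful of stand-in rows (the values are illustrative, not our dataset):

```python
import pandas as pd

# Stand-in executions; real data has one row per model run.
df = pd.DataFrame({
    "output_tokens": [392, 2900, 4100, 7800, 11910],
    "carbon_g": [0.12, 0.54, 1.15, 2.31, 4.05],
})

# Same bucket boundaries as the table above.
bins = [0, 1_000, 3_000, 5_000, 10_000, float("inf")]
labels = ["0-1K", "1K-3K", "3K-5K", "5K-10K", "10K+"]
df["bucket"] = pd.cut(df["output_tokens"], bins=bins, labels=labels)

# Count, average, and range of carbon per bucket.
print(df.groupby("bucket", observed=True)["carbon_g"]
        .agg(["count", "mean", "min", "max"]))
```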
Cost vs Carbon: The Tradeoff Nobody Talks About
Here's where it gets interesting: the greenest model is also the cheapest (tied with o3).
Model | Avg Cost | Avg Carbon | Cost Premium vs Gemini | Carbon Premium vs Gemini |
---|---|---|---|---|
🥇 Gemini 2.5 Pro | $0.031 | 0.65g | — | — |
Grok 4 | $0.050 | 1.06g | 1.6x ($0.019 more) | 1.6x |
Claude Sonnet 4.5 | $0.072 | 1.10g | 2.3x ($0.041 more) | 1.7x |
Claude Opus 4.1 | $0.269 | 0.89g | 8.7x ($0.238 more) | 1.4x |
OpenAI o3 | $0.031 | 1.20g | 1.0x (same as Gemini) | 1.8x |
GPT-5 | $0.081 | 2.31g | 2.6x ($0.050 more) | 3.6x |
GPT-5 costs 2.6x more per query than Gemini and produces 3.6x more CO2. There's no tradeoff here—Gemini wins on both dimensions.
Real-World Impact: Should You Care?
Individual queries have tiny footprints (under 3 grams for most models). But at scale, the differences become meaningful.
Annual Carbon Projections (100 queries/day)
At 100 queries per day (36,500 per year), the per-model averages above scale to roughly 24 kg CO2 per year for Gemini 2.5 Pro versus 84 kg for GPT-5, a gap of about 60 kg per developer per year.
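The projection is simple arithmetic over the per-query averages from the provider table; a minimal sketch:

```python
QUERIES_PER_YEAR = 100 * 365  # 100 queries/day

# Per-query averages from the provider table above (grams CO2).
avg_carbon_g = {
    "Gemini 2.5 Pro": 0.65,
    "Claude Opus 4.1": 0.89,
    "Grok 4": 1.06,
    "Claude Sonnet 4.5": 1.10,
    "OpenAI o3": 1.20,
    "GPT-5": 2.31,
}

for model, grams in avg_carbon_g.items():
    kg_per_year = grams * QUERIES_PER_YEAR / 1000  # g -> kg
    print(f"{model}: {kg_per_year:.1f} kg CO2/year")
```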
What Developers Are Actually Using AI For
Our 28 evaluations came from 12 real developers. Here are the most common task types they submitted:
Task Type | Language | Count | Avg Carbon | Range |
---|---|---|---|---|
Security Analysis | JavaScript | 4 | 1.06g CO2 | 0.22-4.16g |
Bug Fixing | JavaScript | 3 | 0.46g CO2 | 0.07-1.87g |
Security Analysis | Python | 3 | 0.93g CO2 | 0.37-2.29g |
Optimization | Python | 2 | 2.27g CO2 | 0.61-5.25g |
Optimization | TypeScript | 2 | 0.80g CO2 | 0.42-2.06g |
Feature Implementation | Webix | 2 | 1.05g CO2 | 0.31-2.87g |
Insight: Simple JavaScript bug fixes averaged 0.46g CO2, while Python optimization tasks averaged 2.27g—nearly 5x higher. Task complexity matters as much as model choice.
Methodology: How We Calculate Carbon
We use Epoch AI's 2025 research as our baseline for energy consumption, then multiply by provider-specific carbon intensities. Our calculations are directional (±30-50% accuracy), not precise measurements.
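In code, the estimate reduces to energy-per-token times output tokens times grid intensity. The sketch below is a minimal illustration of that shape; both constants (WH_PER_1K_OUTPUT_TOKENS and the grid intensities) are assumptions for illustration, not the exact values in our pipeline:

```python
# Assumed energy baseline per 1K output tokens (GPT-4o-derived, illustrative).
WH_PER_1K_OUTPUT_TOKENS = 0.75

# Assumed provider grid intensities in g CO2 per kWh (illustrative).
GRID_INTENSITY_G_PER_KWH = {
    "gcp": 100,
    "aws_us_east": 400,
    "us_average": 370,
}

def estimate_carbon_g(output_tokens: int, grid: str) -> float:
    """Directional estimate (+/-30-50%), not a precise measurement."""
    kwh = WH_PER_1K_OUTPUT_TOKENS * output_tokens / 1000 / 1000  # Wh -> kWh
    return kwh * GRID_INTENSITY_G_PER_KWH[grid]

# Example: an ~8K-token GPT-5-style response on AWS US East.
print(f"{estimate_carbon_g(8_000, 'aws_us_east'):.2f}g CO2")
```

With these assumed constants, an 8K-token response on a fossil-heavy grid lands around 2.4g CO2, in the same ballpark as the GPT-5 average in our table.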
Limitations (We're Being Honest Here)
Big Caveats
- Tiny dataset: Only 28 evaluations from 12 developers (5 days post-launch, growing daily)
- GPT-4o baseline for all models: Real model efficiency varies (we don't have per-model energy data)
- Unknown data center locations: Using provider averages (actual locations may differ)
- o3 reasoning tokens: Hidden reasoning likely underestimates true carbon (we track visible output only)
- ±30-50% accuracy: Directional insights, not precise measurements
We're not claiming to be carbon accounting experts. We're sharing what we've learned from real developer tasks in the hope that imperfect data beats no data.
See the Carbon Footprint of Your Code Task
Submit your code challenge and get instant carbon tracking across all 8 models. It's free, processing takes about 3 minutes, and you'll see exactly which model produces the least CO2 for your specific use case.
About CodeLens.AI: We're building the world's most accurate benchmark of AI model performance on real developer tasks. Community-driven, transparent methodology, and carbon tracking built in. Launched October 8, 2025.