How We Evaluate AI Models
Transparent methodology. No black boxes. Every claim backed by code.
The 3-Judge Ensemble
Every model output is scored by 3 frontier AI judges from different providers. Final score = median of all three. No model judges itself.
Why 3 judges?
- No single-vendor bias (Anthropic, OpenAI, Google all represented)
- Median score filters outliers
- Disagreement between judges is informative
Why static selection?
- Consistent, reproducible results
- No circular dependency (no "best model judges others")
- The judge panel stays current because frontier models are themselves updated regularly
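A minimal sketch of the aggregation step, assuming each judge returns a 0-100 score for a given output; the judge identifiers and the `scoreWithJudge` callback are illustrative placeholders, not the platform's actual API:

```typescript
// Minimal sketch of the 3-judge ensemble: score with each judge, take the median.
type JudgeScore = { judge: string; score: number }; // score is 0-100

// Illustrative judge identifiers (one per provider), not real model names.
const JUDGE_PANEL = ["anthropic-judge", "openai-judge", "google-judge"];

function medianOfThree(scores: number[]): number {
  // With exactly three values, the median is the middle element after sorting.
  const sorted = [...scores].sort((a, b) => a - b);
  return sorted[1];
}

async function ensembleScore(
  output: string,
  scoreWithJudge: (judge: string, output: string) => Promise<number>,
): Promise<{ final: number; perJudge: JudgeScore[] }> {
  // Score the same output with all three judges in parallel.
  const perJudge = await Promise.all(
    JUDGE_PANEL.map(async (judge) => ({
      judge,
      score: await scoreWithJudge(judge, output),
    })),
  );
  // Final score = median of the three, so one outlier judge can't move the result.
  return { final: medianOfThree(perJudge.map((s) => s.score)), perJudge };
}
```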
5-Criteria Weighted Scoring
Each judge scores outputs 0-100 based on 5 weighted criteria. Weights reflect real-world priorities: working code matters most, security is critical.
| Criterion | Weight | What It Measures |
|---|---|---|
| Correctness | 35% | Does it solve the problem correctly? |
| Security | 20% | Is it safe? No SQL injection, XSS, command injection, exposed secrets. |
| Code Quality | 20% | Is it clean, readable, well-structured? |
| Efficiency | 15% | Is it performant and optimized? |
| Completeness | 10% | Does it handle edge cases? |
Scoring Guide
Why these weights?
- Correctness (35%): Broken code is worthless, regardless of other qualities
- Security (20%): AI-generated vulnerabilities are a real problem (SQL injection, XSS, etc.)
- Code Quality (20%): Maintainability matters for production code
- Efficiency (15%): Performance matters but less than correctness and safety
- Completeness (10%): Edge case handling is important but secondary
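A sketch of how the weighted total could be computed from per-criterion scores; the weights come from the table above, while the criterion keys and function name are assumed for illustration:

```typescript
// Sketch of the 5-criteria weighted score. Weights mirror the table above.
const WEIGHTS = {
  correctness: 0.35,
  security: 0.2,
  codeQuality: 0.2,
  efficiency: 0.15,
  completeness: 0.1,
} as const;

type CriteriaScores = Record<keyof typeof WEIGHTS, number>; // each 0-100

function weightedScore(criteria: CriteriaScores): number {
  // Weighted sum; with weights summing to 1.0 the result stays on the 0-100 scale.
  return Object.entries(WEIGHTS).reduce(
    (total, [criterion, weight]) =>
      total + criteria[criterion as keyof typeof WEIGHTS] * weight,
    0,
  );
}

// Example: a correct but unpolished solution.
// weightedScore({ correctness: 90, security: 80, codeQuality: 60, efficiency: 70, completeness: 50 })
// = 90*0.35 + 80*0.2 + 60*0.2 + 70*0.15 + 50*0.1 = 75
```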
Blind Pairwise Voting
Beyond AI scores, the community votes on outputs in blind head-to-head matchups. Model names are hidden until after you vote.
1. Two outputs are shown side-by-side: the same prompt answered by different models, labeled "Model A" and "Model B".
2. You vote for the better solution, or mark it a tie if both are equally good.
3. Model names are revealed after your vote, along with ELO changes and AI scores for both models.
Why blind voting?
- Eliminates brand bias (can't pick "Claude" just because you like Anthropic)
- Forces evaluation of actual output quality
- Random ordering prevents position bias (the stronger output isn't always sitting in the "Model A" slot)
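A minimal sketch of the blind presentation, assuming each matchup pairs two stored outputs; the types and the coin-flip ordering are illustrative, not the production code:

```typescript
// Illustrative blind-pair construction: A/B slots are assigned at random,
// and model names are withheld until after the vote.
type ModelOutput = { model: string; code: string };

function makeBlindPair(first: ModelOutput, second: ModelOutput) {
  // Coin flip decides which output becomes "Model A" vs "Model B".
  const [a, b] = Math.random() < 0.5 ? [first, second] : [second, first];
  return {
    shown: { A: a.code, B: b.code },    // what the voter sees
    hidden: { A: a.model, B: b.model }, // revealed only after the vote is cast
  };
}
```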
ELO Ranking System
Community votes update model ELO ratings using a chess-style rating system. Win against a stronger opponent = bigger rating gain.
How ELO works
- K-factor = 32: Standard chess K-factor for faster convergence
- Base rating = 1200: All models start here, seeded by AI score
- Expected score formula: 1 / (1 + 10^((opponent_elo - your_elo) / 400))
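A sketch of the rating update implied by these numbers; the function names are assumptions, but the expected-score formula and K = 32 are exactly as listed above:

```typescript
// ELO update sketch using the formula above, with K = 32.
const K = 32;

function expectedScore(ratingA: number, ratingB: number): number {
  // Probability that A beats B under the ELO model.
  return 1 / (1 + Math.pow(10, (ratingB - ratingA) / 400));
}

function updateRatings(
  ratingA: number,
  ratingB: number,
  result: 1 | 0.5 | 0, // 1 = A wins, 0.5 = tie, 0 = B wins
): { a: number; b: number } {
  const expectedA = expectedScore(ratingA, ratingB);
  return {
    a: ratingA + K * (result - expectedA),
    b: ratingB + K * ((1 - result) - (1 - expectedA)),
  };
}

// Example: a 1200-rated model beating a 1300-rated one gains more than it
// would against an equal opponent: expectedScore(1200, 1300) ≈ 0.36,
// so the winner gains ≈ 32 * 0.64 ≈ 20.5 points.
```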
Pair selection (70/30 bracket matching)
- 70% bracket matches: Similar ELO models compete (fair matchups)
- 30% random: Any model can face any other (ensures coverage)
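A sketch of this selection policy, assuming a ±100 ELO bracket window (the exact window isn't specified on this page) and illustrative helper names:

```typescript
// Sketch of the 70/30 pair selection policy.
type RatedModel = { id: string; elo: number };

function selectOpponent(model: RatedModel, pool: RatedModel[]): RatedModel {
  const others = pool.filter((m) => m.id !== model.id);
  if (Math.random() < 0.7) {
    // 70%: bracket match - prefer opponents within an assumed +/-100 ELO window.
    const bracket = others.filter((m) => Math.abs(m.elo - model.elo) <= 100);
    if (bracket.length > 0) {
      return bracket[Math.floor(Math.random() * bracket.length)];
    }
  }
  // 30% (or an empty bracket): pick any opponent, so every pairing stays possible.
  return others[Math.floor(Math.random() * others.length)];
}
```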
Anti-gaming measures
- Blind voting (can't target specific models)
- Authentication required (easy to ban abusers)
- One vote per pair per user (DB constraint)
- Rate limiting (100 votes per day per user)
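An illustrative guard combining these checks; in the real system the one-vote rule is a database constraint and the rate limit lives in the API layer, so the in-memory maps below are only stand-ins:

```typescript
// Illustrative vote guard: auth check, one vote per pair per user, daily rate limit.
const DAILY_VOTE_LIMIT = 100;

const votesByUserToday = new Map<string, number>(); // userId -> vote count (reset daily in a real system)
const votedPairs = new Set<string>();               // "userId:pairId" keys already voted on

function canRecordVote(userId: string | null, pairId: string): boolean {
  if (!userId) return false; // authentication required

  const pairKey = `${userId}:${pairId}`;
  if (votedPairs.has(pairKey)) return false; // one vote per pair per user

  const count = votesByUserToday.get(userId) ?? 0;
  if (count >= DAILY_VOTE_LIMIT) return false; // rate limit: 100 votes per day

  votedPairs.add(pairKey);
  votesByUserToday.set(userId, count + 1);
  return true;
}
```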
Transparency Measures
Clickable data points
Every evaluation links to full results. Inspect all model outputs, AI scores, and community votes.
All results public
No hidden data or cherry-picked results. What you see is what's in the database.
Real-time updates
The leaderboard calculates rankings on demand from the live database. No cached or manipulated stats.
Open methodology
This page documents exactly how scoring works. No secret sauce or proprietary algorithms.
What This Measures
Understanding what CodeLens evaluates—and what it doesn't—helps you interpret the results correctly.
✓ What We Measure
- Code quality and readability
- Best practices adherence
- Solution approach and structure
- Maintainability signals
- Security awareness (assessed statically, without running the code)
✗ What We Don't Measure
- Execution correctness (we don't run code)
- Performance benchmarks
- Test pass rates
- Runtime behavior
- Measured memory usage or runtime efficiency (the Efficiency criterion is judged by inspection only)
Limitations & Tradeoffs
Every benchmark has tradeoffs. Here's what you should know about ours.
LLM-as-judge bias
AI judges have correlated preferences—model families may favor similar outputs. Using 3 different providers (Anthropic, OpenAI, Google) reduces but doesn't eliminate this bias.
Small sample variance
Rankings with few evaluations have high variance. Results stabilize as more prompts are tested. Look for ⚠️ warnings on low-sample models.
Not execution testing
We measure code quality signals (readability, best practices, structure)—not whether code actually runs. A high-scoring solution might still have runtime bugs.
Non-deterministic scoring
Same code could score slightly differently on retry due to LLM variance. We use median of 3 judges to reduce this, but small fluctuations are expected.
Directional signal, not ground truth
Use this as guidance for model selection, not absolute truth. Rankings indicate relative strengths on community-submitted coding tasks.
How to use this benchmark
- Use it alongside other benchmarks (HumanEval, SWE-Bench, LiveCodeBench)
- Check sample sizes—models with ⚠️ need more evaluations
- Submit your own tasks to test models on your specific use cases
- Vote on outputs to improve community rankings
Help Build Better Benchmarks
Submit your coding tasks and vote on results to grow the dataset.