How We Evaluate AI Models

Transparent methodology. No black boxes. Every claim backed by code.

  • 300+ models via OpenRouter
  • 3 frontier AI judges
  • 5 scoring criteria
  • Pairwise blind voting (A vs B)

The 3-Judge Ensemble

Every model output is scored by 3 frontier AI judges from different providers. Final score = median of all three. No model judges itself.

  • Claude Opus 4.5 (Anthropic)
  • GPT-5.2 (OpenAI)
  • Gemini 3 Pro (Google)
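
To make the aggregation concrete, here is a minimal sketch of how the three judge scores could be combined for one output. The judge keys and score values are illustrative, not CodeLens internals.

    from statistics import median

    # Hypothetical per-judge scores (0-100) for a single model output.
    judge_scores = {
        "claude-opus-4.5": 84,
        "gpt-5.2": 78,
        "gemini-3-pro": 91,
    }

    # Final score = median of the three judges, so a single outlier
    # judge cannot drag the result up or down on its own.
    final_score = median(judge_scores.values())
    print(final_score)  # 84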

Why 3 judges?

  • No single-vendor bias (Anthropic, OpenAI, Google all represented)
  • Median score filters outliers
  • Disagreement between judges is informative

Why static selection?

  • Consistent, reproducible results
  • No circular dependency (no "best model judges others")
  • Frontier models are updated regularly anyway

5-Criteria Weighted Scoring

Each judge scores outputs from 0 to 100 based on 5 weighted criteria. The weights reflect real-world priorities: working code matters most, and security is critical.

Criterion       Weight   What It Measures
Correctness     35%      Does it solve the problem correctly?
Security        20%      Is it safe? No SQL injection, XSS, command injection, exposed secrets.
Code Quality    20%      Is it clean, readable, well-structured?
Efficiency      15%      Is it performant and optimized?
Completeness    10%      Does it handle edge cases?
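
As an illustration of how the weights combine, the sketch below computes one judge's overall score from the five criterion scores. The weights are those in the table above; the field names and example scores are hypothetical.

    # Criterion weights from the table above (they sum to 1.0).
    WEIGHTS = {
        "correctness": 0.35,
        "security": 0.20,
        "code_quality": 0.20,
        "efficiency": 0.15,
        "completeness": 0.10,
    }

    def overall_score(criterion_scores: dict[str, float]) -> float:
        """Weighted average of per-criterion scores (each 0-100)."""
        return sum(WEIGHTS[name] * criterion_scores[name] for name in WEIGHTS)

    # Example: strong on correctness and quality, weaker on completeness.
    print(overall_score({
        "correctness": 90,
        "security": 80,
        "code_quality": 85,
        "efficiency": 70,
        "completeness": 60,
    }))  # 81.0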

Scoring Guide

90-100: Excellent - correct, secure, clean, efficient, handles edge cases
70-89: Good - mostly correct, no major security issues, minor problems
50-69: Acceptable - works but has notable issues
30-49: Poor - significant issues, security vulnerabilities, or incomplete
0-29: Failing - doesn't solve the task or has critical security flaws
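
The bands are inclusive ranges over the 0-100 scale; a small helper like the following (hypothetical, not part of CodeLens) makes the boundaries explicit.

    def band(score: int) -> str:
        """Map a 0-100 judge score to its rubric band."""
        if score >= 90:
            return "Excellent"
        if score >= 70:
            return "Good"
        if score >= 50:
            return "Acceptable"
        if score >= 30:
            return "Poor"
        return "Failing"

    print(band(84))  # Good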

Why these weights?

  • Correctness (35%): Broken code is worthless, regardless of other qualities
  • Security (20%): AI-generated vulnerabilities are a real problem (SQL injection, XSS, etc.)
  • Code Quality (20%): Maintainability matters for production code
  • Efficiency (15%): Performance matters but less than correctness and safety
  • Completeness (10%): Edge case handling is important but secondary

Blind Pairwise Voting

Beyond AI scores, the community votes on outputs in blind head-to-head matchups. Model names are hidden until after you vote.

  1. Two outputs are shown side-by-side: the same prompt answered by two different models, labeled "Model A" and "Model B".
  2. You vote for the better solution, or mark the matchup as a tie if both are equally good.
  3. Model names are revealed after your vote, along with the ELO changes and AI scores for both models.
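
As a rough sketch of how a blind matchup could be assembled, assuming hypothetical field names: the display order is shuffled, and the model names live in a separate answer key that is only revealed after the vote is recorded.

    import random

    def build_blind_pair(a: dict, b: dict) -> tuple[dict, dict]:
        """Split a matchup into a voter-facing payload (code only) and a
        server-side answer key (model names), revealed only after the vote."""
        pair = [a, b]
        random.shuffle(pair)  # random ordering prevents position bias
        voter_payload = {"Model A": pair[0]["code"], "Model B": pair[1]["code"]}
        answer_key = {"Model A": pair[0]["model"], "Model B": pair[1]["model"]}
        return voter_payload, answer_key

    payload, answer_key = build_blind_pair(
        {"model": "model-x", "code": "def solve(): ..."},
        {"model": "model-y", "code": "def solve(): ..."},
    )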

Why blind voting?

  • Eliminates brand bias (can't pick "Claude" just because you like Anthropic)
  • Forces evaluation of actual output quality
  • Random ordering prevents position bias ("Model A" isn't always better)

ELO Ranking System

Community votes update model ELO ratings using a chess-style rating system. Win against a stronger opponent = bigger rating gain.

How ELO works

  • K-factor = 32: Standard chess K-factor for faster convergence
  • Base rating = 1200: All models start here, seeded by AI score
  • Expected score formula: 1 / (1 + 10^((opponent_elo - your_elo) / 400))
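
A worked sketch of the update rule, using the K-factor and expected-score formula above; the function and variable names are illustrative.

    K = 32  # chess-style K-factor

    def expected(rating: float, opponent: float) -> float:
        """Win probability implied by the rating gap."""
        return 1 / (1 + 10 ** ((opponent - rating) / 400))

    def update(winner: float, loser: float) -> tuple[float, float]:
        """Apply one decisive result; an upset moves ratings further."""
        gain = K * (1 - expected(winner, loser))
        return winner + gain, loser - gain

    # A 1200-rated model beating a 1300-rated model gains about 20.5 points.
    print(update(1200, 1300))  # about (1220.5, 1279.5)

In standard ELO, a tie is scored as 0.5 for each side instead of 1 and 0.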

Pair selection (70/30 bracket matching)

  • 70% bracket matches: Similar ELO models compete (fair matchups)
  • 30% random: Any model can face any other (ensures coverage)
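
The 70/30 split could be implemented as simply as the sketch below. The 100-point bracket width is an assumption for illustration, not a documented value.

    import random

    def pick_opponent(model: dict, pool: list[dict], bracket: float = 100) -> dict:
        """70% of the time draw a similarly rated opponent, 30% fully at random."""
        others = [m for m in pool if m["name"] != model["name"]]
        if random.random() < 0.7:
            nearby = [m for m in others if abs(m["elo"] - model["elo"]) <= bracket]
            if nearby:  # fall back to the full pool if the bracket is empty
                return random.choice(nearby)
        return random.choice(others)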

Anti-gaming measures

  • Blind voting (can't target specific models)
  • Authentication required (easy to ban abusers)
  • One vote per pair per user (DB constraint)
  • Rate limiting (100 votes per day per user)
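
The real checks live in authentication and database constraints; purely to illustrate the two per-user rules, here is an in-memory sketch with hypothetical names.

    DAILY_LIMIT = 100

    votes_today: dict[str, int] = {}          # user_id -> votes cast today
    seen_pairs: set[tuple[str, str]] = set()  # (user_id, pair_id) already voted on

    def can_vote(user_id: str, pair_id: str) -> bool:
        """Allow a vote only if the user is under the daily cap and has not
        already voted on this pair (mirrors the one-vote-per-pair constraint)."""
        if votes_today.get(user_id, 0) >= DAILY_LIMIT:
            return False
        if (user_id, pair_id) in seen_pairs:
            return False
        votes_today[user_id] = votes_today.get(user_id, 0) + 1
        seen_pairs.add((user_id, pair_id))
        return True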

Transparency Measures

Clickable data points

Every evaluation links to full results. Inspect all model outputs, AI scores, and community votes.

All results public

No hidden data or cherry-picked results. What you see is what's in the database.

Real-time updates

The leaderboard calculates rankings on demand from the live database. No cached or manipulated stats.

Open methodology

This page documents exactly how scoring works. No secret sauce or proprietary algorithms.

What This Measures

Understanding what CodeLens evaluates—and what it doesn't—helps you interpret the results correctly.

✓ What We Measure

  • Code quality and readability
  • Best practices adherence
  • Solution approach and structure
  • Maintainability signals
  • Security awareness (static analysis)

✗ What We Don't Measure

  • Execution correctness (we don't run code)
  • Performance benchmarks
  • Test pass rates
  • Runtime behavior
  • Memory usage or measured runtime efficiency

Limitations & Tradeoffs

Every benchmark has tradeoffs. Here's what you should know about ours.

LLM-as-judge bias

AI judges have correlated preferences—model families may favor similar outputs. Using 3 different providers (Anthropic, OpenAI, Google) reduces but doesn't eliminate this bias.

Small sample variance

Rankings with few evaluations have high variance. Results stabilize as more prompts are tested. Look for ⚠️ warnings on low-sample models.

Not execution testing

We measure code quality signals (readability, best practices, structure)—not whether code actually runs. A high-scoring solution might still have runtime bugs.

Non-deterministic scoring

Same code could score slightly differently on retry due to LLM variance. We use median of 3 judges to reduce this, but small fluctuations are expected.

Directional signal, not ground truth

Use this as guidance for model selection, not absolute truth. Rankings indicate relative strengths on community-submitted coding tasks.

How to use this benchmark

  1. Use it alongside other benchmarks (HumanEval, SWE-Bench, LiveCodeBench)
  2. Check sample sizes—models with ⚠️ need more evaluations
  3. Submit your own tasks to test models on your specific use cases
  4. Vote on outputs to improve community rankings

Help Build Better Benchmarks

Submit your coding tasks and vote on results to grow the dataset.