AI Model Leaderboard

Community-driven rankings based on real developer evaluations

How Rankings Work

Judge Score (Primary Metric)

Each model output is scored by 3 frontier AI judges (Claude Opus 4.5, GPT-5.2, Gemini 3 Pro). We take the median score to reduce individual judge bias. Scores are based on 5 weighted criteria: Correctness (35%), Security (20%), Code Quality (20%), Efficiency (15%), and Completeness (10%).
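
The sketch below shows one way such a score could be computed, assuming each judge returns per-criterion scores on a 0-100 scale: weight each judge's criteria, then take the median of the three weighted totals. The weights come from the description above; the function names, scale, and example numbers are illustrative, not the actual scoring code.

    from statistics import median

    # Criterion weights described above (they sum to 1.0).
    WEIGHTS = {
        "correctness": 0.35,
        "security": 0.20,
        "code_quality": 0.20,
        "efficiency": 0.15,
        "completeness": 0.10,
    }

    def weighted_total(criterion_scores: dict[str, float]) -> float:
        """Collapse one judge's per-criterion scores (assumed 0-100) into a single number."""
        return sum(WEIGHTS[name] * criterion_scores[name] for name in WEIGHTS)

    def judge_score(per_judge_scores: list[dict[str, float]]) -> float:
        """Median of the judges' weighted totals."""
        return median(weighted_total(scores) for scores in per_judge_scores)

    # Three hypothetical judges scoring the same model output.
    judges = [
        {"correctness": 90, "security": 80, "code_quality": 85, "efficiency": 70, "completeness": 95},
        {"correctness": 85, "security": 75, "code_quality": 80, "efficiency": 75, "completeness": 90},
        {"correctness": 95, "security": 85, "code_quality": 90, "efficiency": 65, "completeness": 85},
    ]
    print(judge_score(judges))  # 84.5 (median of 84.5, 81.0, 86.5)

Taking the median rather than the mean means a single unusually harsh or generous judge cannot pull the final score far in either direction, which is the point of using three judges.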

Community Vote

Developers vote on model outputs in blind head-to-head comparisons. Win rate shows the percentage of matchups where this model was chosen. Every vote requires a comment explaining the reasoning.
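
A minimal sketch of the win-rate figure, assuming each blind matchup is counted as a win or a loss for the model shown; the function name and numbers are illustrative:

    def win_rate(wins: int, matchups: int) -> float:
        """Percentage of blind head-to-head matchups in which this model's output was chosen."""
        return 100.0 * wins / matchups if matchups else 0.0

    print(win_rate(132, 200))  # 66.0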

Response Time

The mean time each model takes to generate a response. Faster isn't always better—some models trade speed for quality—but this helps you understand performance trade-offs.
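
For concreteness, a tiny sketch of the metric, assuming per-evaluation latencies are recorded in seconds (the unit and numbers are assumptions):

    from statistics import mean

    def mean_response_time(latencies_seconds: list[float]) -> float:
        """Mean time the model took to generate a response across its evaluations."""
        return mean(latencies_seconds)

    print(round(mean_response_time([12.4, 9.8, 21.1]), 1))  # 14.4 seconds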

Cost per Evaluation

Average API cost per evaluation for each model, based on actual token usage. Pricing data is sourced from OpenRouter. This helps you balance quality against budget.
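
A hedged sketch of how such a cost could be derived from token usage, assuming per-million-token input and output prices in the style OpenRouter lists; the prices and token counts below are made up for illustration:

    def evaluation_cost(prompt_tokens: int, completion_tokens: int,
                        input_price_per_mtok: float, output_price_per_mtok: float) -> float:
        """API cost in USD for one evaluation, from actual token usage and per-million-token prices."""
        return (prompt_tokens * input_price_per_mtok
                + completion_tokens * output_price_per_mtok) / 1_000_000

    # Hypothetical prices: $3 per 1M input tokens, $15 per 1M output tokens.
    print(evaluation_cost(1_200, 800, 3.0, 15.0))  # 0.0156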

Aggregation

Rankings aggregate all public evaluations. The same prompt submitted by different users contributes to the same benchmark entry. This builds a community-driven dataset that reflects real-world usage rather than synthetic benchmarks.
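
One plausible grouping scheme, sketched below: hash the normalized prompt text so that identical prompts submitted by different users fall into the same benchmark bucket. The key derivation and field names are assumptions for illustration, not the actual pipeline:

    from collections import defaultdict
    from hashlib import sha256

    def benchmark_key(prompt: str) -> str:
        """Identical prompts map to the same key, regardless of who submitted them."""
        return sha256(prompt.strip().encode("utf-8")).hexdigest()[:12]

    def aggregate(evaluations: list[dict]) -> dict[str, list[dict]]:
        """Group public evaluations into per-prompt benchmark buckets."""
        buckets: defaultdict[str, list[dict]] = defaultdict(list)
        for evaluation in evaluations:
            buckets[benchmark_key(evaluation["prompt"])].append(evaluation)
        return dict(buckets)

    evals = [
        {"user": "alice", "prompt": "Write a rate limiter in Go", "judge_score": 84.5},
        {"user": "bob", "prompt": "Write a rate limiter in Go", "judge_score": 79.0},
    ]
    print({key: len(group) for key, group in aggregate(evals).items()})  # one bucket with 2 evaluations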

Why This Matters

Traditional benchmarks like HumanEval and SWE-Bench use synthetic tasks that don't reflect real-world usage. CodeLens rankings are built on actual code challenges from real developers, making it a more accurate benchmark for choosing which AI model to use for your specific needs.

Read our full methodology

Contribute to the Leaderboard

Run evaluations on your own prompts and help build the most accurate AI model benchmark. Pro users' results contribute to these public rankings.