AI Model Leaderboard
Community-driven rankings based on real developer evaluations
Monthly Top Models
Get a monthly digest of the top-performing AI models, ranking changes, and new challengers. Data-driven insights from real developer evaluations, not marketing hype.
How Rankings Work
Judge Score (Primary Metric)
Each model output is scored by 3 frontier AI judges (Claude Opus 4.5, GPT-5.2, Gemini 3 Pro). We take the median score to reduce individual judge bias. Scores are based on 5 weighted criteria: Correctness (35%), Security (20%), Code Quality (20%), Efficiency (15%), and Completeness (10%).
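As a concrete illustration of the scoring formula above, here is a minimal Python sketch that applies the stated weights and takes the median across the three judges. The function names, data shapes, and 0-100 score scale are assumptions for illustration, not CodeLens's actual implementation.

```python
from statistics import median

# Weights from the rubric above; criterion scores are assumed to be on a 0-100 scale.
WEIGHTS = {
    "correctness": 0.35,
    "security": 0.20,
    "code_quality": 0.20,
    "efficiency": 0.15,
    "completeness": 0.10,
}

def judge_score(criteria: dict[str, float]) -> float:
    """Weighted score a single judge assigns to one model output."""
    return sum(WEIGHTS[name] * criteria[name] for name in WEIGHTS)

def final_score(per_judge_criteria: list[dict[str, float]]) -> float:
    """Median of the judges' weighted scores, reducing individual judge bias."""
    return median(judge_score(c) for c in per_judge_criteria)

# Example: three judges scoring the same model output.
judges = [
    {"correctness": 90, "security": 80, "code_quality": 85, "efficiency": 70, "completeness": 95},
    {"correctness": 85, "security": 75, "code_quality": 80, "efficiency": 75, "completeness": 90},
    {"correctness": 95, "security": 85, "code_quality": 90, "efficiency": 65, "completeness": 95},
]
print(final_score(judges))  # median of the three weighted totals
```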
Community Vote
Developers vote on model outputs in blind head-to-head comparisons. Win rate shows the percentage of matchups where this model was chosen. Every vote requires a comment explaining the reasoning.
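A rough sketch of how a win rate like this could be computed from vote records. The vote structure (winner, loser, and required comment fields) is a hypothetical shape chosen for illustration, not the actual data model.

```python
def win_rate(model: str, votes: list[dict]) -> float:
    """Percentage of head-to-head matchups involving `model` in which it was chosen.

    Votes without a comment are skipped, mirroring the rule that every vote
    must explain its reasoning.
    """
    matchups = [v for v in votes
                if v.get("comment") and model in (v["winner"], v["loser"])]
    if not matchups:
        return 0.0
    wins = sum(1 for v in matchups if v["winner"] == model)
    return 100.0 * wins / len(matchups)

votes = [
    {"winner": "model-a", "loser": "model-b", "comment": "Handled the edge case correctly."},
    {"winner": "model-b", "loser": "model-a", "comment": "Cleaner error handling."},
    {"winner": "model-a", "loser": "model-c", "comment": "Faster and equally correct."},
]
print(win_rate("model-a", votes))  # ~66.7: chosen in 2 of its 3 matchups
```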
Response Time
The mean time each model takes to generate a response. Faster isn't always better—some models trade speed for quality—but this helps you understand performance trade-offs.
Cost per Evaluation
Average API cost for each model based on actual token usage. Pricing data sourced from OpenRouter. Helps you balance quality against budget.
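To make the arithmetic concrete, here is an illustrative sketch of deriving per-evaluation cost from token counts. The model name and per-million-token prices below are made up; in practice the rates would come from OpenRouter's pricing data.

```python
# Illustrative per-million-token prices in USD; real rates come from OpenRouter.
PRICE_PER_MTOK = {"example/model": {"prompt": 3.00, "completion": 15.00}}

def evaluation_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """API cost of a single evaluation run, based on actual token usage."""
    p = PRICE_PER_MTOK[model]
    return (prompt_tokens * p["prompt"] + completion_tokens * p["completion"]) / 1_000_000

def average_cost(runs: list[tuple[str, int, int]]) -> float:
    """Mean cost across all of a model's evaluation runs."""
    return sum(evaluation_cost(*r) for r in runs) / len(runs)

runs = [("example/model", 1_200, 800), ("example/model", 900, 1_500)]
print(f"${average_cost(runs):.4f} per evaluation")  # $0.0204
```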
Aggregation
Rankings aggregate all public evaluations. The same prompt submitted by different users contributes to the same benchmark entry. This builds a community-driven dataset that reflects real-world usage rather than synthetic benchmarks.
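One way to picture this aggregation: key each public evaluation by a normalized form of its prompt, so identical prompts from different users land in the same bucket. The normalization and data layout here are assumptions for illustration, not the actual CodeLens pipeline.

```python
import hashlib
from collections import defaultdict

def prompt_key(prompt: str) -> str:
    """Stable key so identical prompts from different users map to the same bucket."""
    normalized = " ".join(prompt.split()).lower()
    return hashlib.sha256(normalized.encode()).hexdigest()[:16]

# benchmark[prompt_key][model] -> judge scores from all public evaluations
benchmark: dict[str, dict[str, list[float]]] = defaultdict(lambda: defaultdict(list))

def add_public_evaluation(prompt: str, model: str, score: float) -> None:
    benchmark[prompt_key(prompt)][model].append(score)

# Two users submit the same prompt; both evaluations feed the same benchmark entry.
add_public_evaluation("Write a rate limiter in Go", "model-a", 87.5)
add_public_evaluation("write a rate limiter in Go ", "model-a", 91.0)
print(benchmark[prompt_key("Write a rate limiter in Go")]["model-a"])  # [87.5, 91.0]
```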
Why This Matters
Traditional benchmarks like HumanEval and SWE-Bench use synthetic tasks that don't reflect real-world usage. CodeLens rankings are built on actual code challenges from real developers, making it a more accurate benchmark for choosing which AI model to use for your specific needs.
Read our full methodology
Contribute to the Leaderboard
Run evaluations on your own prompts and help build the most accurate AI model benchmark. Pro users' results contribute to these public rankings.