AI Model Leaderboard
Community-driven rankings based on real developer evaluations and votes
Help Build the Benchmark
Vote on completed evaluations to make the leaderboard more accurate. Your votes matter!
How Rankings Work
⭐ Average Score (Primary Metric)
The mean of all AI judge scores across evaluations, scored out of 100 on quality, correctness, security, and performance. This metric is more reliable than win rate during beta because it gives a granular assessment even while vote counts are low. Example: a model averaging 87.5/100 across 50 evaluations ranks higher than one averaging 85.0/100.
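For illustration, the primary metric reduces to a simple mean over judge scores. The sketch below is hypothetical TypeScript, not CodeLens.AI's actual code; the `ScoredEvaluation` shape and its field names are assumptions made for the example.

```typescript
// Hypothetical shape for a scored evaluation; field names are illustrative only.
interface ScoredEvaluation {
  modelId: string;
  judgeScore: number; // AI judge score out of 100
}

// Mean judge score for one model across all of its evaluations.
function averageScore(evaluations: ScoredEvaluation[], modelId: string): number {
  const scores = evaluations
    .filter((e) => e.modelId === modelId)
    .map((e) => e.judgeScore);
  if (scores.length === 0) return 0;
  return scores.reduce((sum, s) => sum + s, 0) / scores.length;
}
```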
🗳️ Win Rate (Supplementary)
The percentage of evaluations in which developers selected this model as the winner, shown below each model name alongside the evaluation count. As more votes accumulate, win rate will carry increasing weight in the rankings.
⚡ Average Response Time
The mean time, in seconds, each model takes to generate a response. Faster isn't always better (some models trade speed for quality), but it helps you weigh the performance trade-offs. Shown below each model name alongside win rate.
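The two supplementary stats shown below each model name work the same way. A minimal sketch, again with made-up record shapes: win rate is wins divided by the number of voted evaluations the model appeared in, and response time is a plain mean of recorded latencies.

```typescript
// Hypothetical vote record; field names are illustrative only.
interface EvaluationVote {
  modelIds: string[];    // models compared in this evaluation
  winnerModelId: string; // the model the developer picked
}

// Percentage of voted evaluations (that included this model) it won.
function winRate(votes: EvaluationVote[], modelId: string): number {
  const relevant = votes.filter((v) => v.modelIds.includes(modelId));
  if (relevant.length === 0) return 0;
  const wins = relevant.filter((v) => v.winnerModelId === modelId).length;
  return (wins / relevant.length) * 100;
}

// Mean generation time in seconds, one entry per response the model produced.
function averageResponseTime(latenciesSec: number[]): number {
  if (latenciesSec.length === 0) return 0;
  return latenciesSec.reduce((sum, t) => sum + t, 0) / latenciesSec.length;
}
```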
🎯 Task-Specific Rankings
Filter by task type (refactoring, security, architecture, code review) to see which model excels in specific areas. A model might dominate security audits but rank lower in refactoring tasks.
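Conceptually, a task filter just restricts the pool of evaluations before the same averaging step. A rough sketch, assuming each evaluation carries a task-type label (the `taskType` field and its values are assumptions for the example):

```typescript
// Hypothetical evaluation with a task label; field names are illustrative only.
interface TaskEvaluation {
  modelId: string;
  taskType: "refactoring" | "security" | "architecture" | "code-review";
  judgeScore: number;
}

// Average score per model, restricted to a single task type, best first.
function rankingForTask(
  evaluations: TaskEvaluation[],
  taskType: TaskEvaluation["taskType"],
): Array<{ modelId: string; avgScore: number }> {
  const totals = new Map<string, { sum: number; count: number }>();
  for (const e of evaluations) {
    if (e.taskType !== taskType) continue;
    const t = totals.get(e.modelId) ?? { sum: 0, count: 0 };
    totals.set(e.modelId, { sum: t.sum + e.judgeScore, count: t.count + 1 });
  }
  return [...totals.entries()]
    .map(([modelId, t]) => ({ modelId, avgScore: t.sum / t.count }))
    .sort((a, b) => b.avgScore - a.avgScore);
}
```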
💬 Required Comments
Every vote requires a comment explaining why you picked that model. This qualitative data helps us understand not just which models win, but why they win.
🔄 Dynamic Updates
Rankings update in real time as new votes come in. The leaderboard reflects the collective wisdom of the developer community, not vendor marketing claims or synthetic benchmarks.
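One common way to keep an average fresh on every new score, without rescanning the full history, is a running-average update; this is a general technique sketched here for illustration, not a claim about CodeLens.AI's internals.

```typescript
// Fold one new judge score into a model's running average.
// `count` is how many scores the current average already includes.
function updateRunningAverage(
  currentAvg: number,
  count: number,
  newScore: number,
): { avg: number; count: number } {
  const nextCount = count + 1;
  // Incremental mean: avg' = avg + (x - avg) / (n + 1)
  return { avg: currentAvg + (newScore - currentAvg) / nextCount, count: nextCount };
}
```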
Why This Matters
Traditional benchmarks like HumanEval and SWE-Bench score models on fixed task sets that may not reflect the code you work with day to day. CodeLens.AI is built on actual code challenges submitted by real developers, so its rankings tell you which AI model best fits YOUR specific needs.
Contribute to the benchmark →