
6 AI Models vs. 3 Advanced Security Vulnerabilities: All Passed, But Here's the Catch


A security researcher submitted three advanced vulnerability examples to our AI benchmarking platform. Not textbook examples—real exploits: prototype pollution that bypasses authorization, an agentic AI supply-chain attack combining prompt injection with cloud API abuse, and OS command injection in ImageMagick.

We ran each through 6 top AI models: GPT-5, OpenAI o3, Claude Opus 4.1, Claude Sonnet 4.5, Grok 4, and Gemini 2.5 Pro.

The result? All six models caught all three vulnerabilities. 100% detection rate.

But here's the catch: the quality of their fixes varied by up to 18 percentage points. And when the security researcher voted on which model performed best, they disagreed with our AI judge entirely.

Here's what we learned about which AI models you should trust for security code reviews.

⚠️ Early Data Disclaimer (n=3 evaluations)

This case study analyzes 3 security evaluations from one external researcher. Results are directional and not statistically significant. We're building a larger benchmark dataset and actively seeking more security professionals to submit challenges.

Why publish early data? Even with limited sample size, these findings reveal important patterns about AI model behavior on cutting-edge vulnerabilities. We believe in transparency and iterative improvement.

The Three Vulnerabilities

Vulnerability #1: Prototype Pollution Privilege Escalation

What it is: A Node.js API with a deepMerge function that recursively merges user input into a config object. No hasOwnProperty checks or __proto__ filtering. Authorization relies on req.user.isAdmin property.
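For context, here's a minimal reconstruction of the vulnerable pattern (our sketch, not the researcher's exact code; assumes an Express app with a shared config object):

const express = require('express')
const app = express()
app.use(express.json())

const config = { theme: 'dark' } // shared, mutable config object

// Vulnerable: no hasOwnProperty check, no __proto__ filtering
function deepMerge(target, source) {
  for (const key in source) {
    if (typeof source[key] === 'object' && source[key] !== null) {
      // Reading target['__proto__'] returns Object.prototype, so this
      // recursion merges attacker-supplied keys into EVERY object
      target[key] = deepMerge(target[key] || {}, source[key])
    } else {
      target[key] = source[key]
    }
  }
  return target
}

app.post('/admin/config', (req, res) => {
  deepMerge(config, req.body) // {"__proto__":{"isAdmin":true}} pollutes Object.prototype
  res.json(config)
})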

The exploit:

POST /admin/config
{
  "__proto__": {
    "isAdmin": true
  }
}

Result: All objects inherit isAdmin: true, instant admin access.

Why it matters: Affects popular npm packages (lodash, hoek, minimist). Real CVEs: CVE-2019-10744, CVE-2020-28477.

Vulnerability #2: Agentic AI Supply-Chain Attack (2025 Cutting-Edge)

What it is: An LLM agent microservice with three attack vectors:

  1. Indirect prompt injection via poisoned web pages
  2. Over-privileged Azure management API token with full tenant access
  3. Unsafe WASM execution with filesystem mounts (from:'/', to:'/')

The exploit path:

  1. Attacker hosts malicious webpage with hidden instructions
  2. LLM agent fetches page, extracts instructions
  3. Agent invokes Azure API tool to escalate privileges
  4. WASM runtime executes arbitrary code with host filesystem access
  5. Cross-tenant cloud compromise
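The higher-scoring fixes drew a trust boundary between fetched web content and tool execution. A minimal sketch of that gating idea (the tool names and provenance mechanism are our own, hypothetical):

// Hypothetical policy gate: tool names and provenance tracking are illustrative
const HIGH_RISK_TOOLS = new Set(['azure_management', 'wasm_exec'])

function gateToolCall(call, provenance) {
  // provenance lists every source that influenced this request, e.g. ['user', 'web_fetch']
  const taintedByWeb = provenance.includes('web_fetch')
  if (taintedByWeb && HIGH_RISK_TOOLS.has(call.tool)) {
    throw new Error(`blocked ${call.tool}: requested by untrusted web content`)
  }
  return call
}

// gateToolCall({ tool: 'azure_management' }, ['user'])      // allowed
// gateToolCall({ tool: 'azure_management' }, ['web_fetch']) // throws

Typical companion mitigations: scope the cloud token to the minimum role the agent actually needs, and mount only a dedicated scratch directory in the WASM runtime instead of from:'/', to:'/'.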

Why it matters: OWASP Top 10 for LLMs #1 risk (prompt injection). Real incidents: ChatGPT plugins, Microsoft Copilot, GitHub Copilot Chat. No existing AI benchmark tests this attack vector.

Vulnerability #3: OS Command Injection (ImageMagick)

What it is: An Express API that shells out to ImageMagick via child_process.exec(). User-controlled font, size, and text parameters injected directly into command string. No input sanitization or escaping.

The exploit:

POST /render
{
  "text": "hello",
  "font": "Arial; rm -rf /",
  "size": "12"
}

Resulting command:

convert -font "Arial; rm -rf /" -pointsize 12 label:"hello" /tmp/out.png
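The standard remediation is to avoid the shell entirely and validate inputs. A sketch of that approach (our illustration, not any particular model's output):

const express = require('express')
const { execFile } = require('node:child_process')

const app = express()
app.use(express.json())

app.post('/render', (req, res) => {
  const { text = '', font = 'Arial', size = '12' } = req.body

  // Allowlist instead of escaping: reject anything outside known-safe shapes
  if (!/^[A-Za-z0-9 _-]{1,64}$/.test(font)) return res.status(400).send('bad font')
  const pointsize = Number.parseInt(size, 10)
  if (!Number.isInteger(pointsize) || pointsize < 4 || pointsize > 128) {
    return res.status(400).send('bad size')
  }
  // ImageMagick treats a leading '@' as "read from file": block it too
  if (typeof text !== 'string' || text.startsWith('@')) {
    return res.status(400).send('bad text')
  }

  // execFile passes argv directly to the binary: no shell, so ';' and '|' are inert
  execFile('convert',
    ['-font', font, '-pointsize', String(pointsize), `label:${text}`, '/tmp/out.png'],
    (err) => (err ? res.status(500).send('render failed') : res.sendFile('/tmp/out.png')))
})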

Why it matters: ImageTragick (CVE-2016-3714) variants still common in 2025. Classic attack that every model should catch.

The Results: 100% Detection, But Quality Varied

✅ All Models Passed (But Not Equally)

Every model caught every vulnerability, but GPT-5 scored 13.5% higher than Grok 4.

| Rank | Model | Avg Score | Cost | Detection | Key Strength |
|------|-------|-----------|------|-----------|--------------|
| 1 | GPT-5 | 95.4/100 | $2.18 | 3/3 ✅ | Best overall, comprehensive |
| 2 | OpenAI o3 | 92.7/100 | $0.97 | 3/3 ✅ | Pragmatic, user's choice |
| 3 | Gemini 2.5 Pro | 89.2/100 | $0.09 | 3/3 ✅ | Cheapest |
| 4 | Claude Sonnet 4.5 | 88.2/100 | $0.19 | 3/3 ✅ | ⭐ Best value (92% quality @ 9% cost) |
| 5 | Claude Opus 4.1 | 87.7/100 | $0.88 | 3/3 ✅ | Thorough but over-engineered |
| 6 | Grok 4 | 84.1/100 | $0.14 | 3/3 ✅ | Slowest, simplest fixes |

What "Quality" Means in Security

All models identified the vulnerabilities. The score differences came from:

  • Completeness of fix – Did they address all attack vectors?
  • Defense-in-depth – Did they suggest multiple mitigation layers?
  • Code quality – Is the fix production-ready or just a patch?
  • Explanation depth – Did they explain why the fix works?

Example: Prototype Pollution Fixes

GPT-5 (96.4/100) suggested four mitigation strategies:

  1. Use Object.create(null) for config objects
  2. Add hasOwnProperty checks in deepMerge
  3. Explicitly block __proto__, constructor, prototype keys
  4. Use Object.freeze() on authorization logic

Grok 4 (85/100) suggested one:

  1. Add key filtering in deepMerge (but incomplete – missed some edge cases)

Both "caught it" – but one fix is production-ready, the other has gaps.

📝 Code Example: GPT-5's Defense-in-Depth Approach

Here's how GPT-5 (96.4/100) fixed the prototype pollution vulnerability with a multi-layered approach:

// Helper: create null-prototype object
function obj(data) {
  return Object.assign(Object.create(null), data)
}

// Safe deepMerge with key filtering
function safeDeepMerge(target, source) {
  const dangerousKeys = ['__proto__', 'constructor', 'prototype']

  for (const key in source) {
    // Block dangerous keys
    if (dangerousKeys.includes(key)) continue

    // Only merge own properties
    if (!Object.hasOwnProperty.call(source, key)) continue

    // Recursively merge objects safely
    if (typeof source[key] === 'object' && source[key] !== null) {
      target[key] = safeDeepMerge(target[key] || {}, source[key])
    } else {
      target[key] = source[key]
    }
  }
  return target
}

// Create users with null prototypes
app.use((req, res, next) => {
  req.user = obj({ isAdmin: false, username: 'guest' })
  next()
})

// Require own property check for authorization
function isAdmin(user) {
  return Object.hasOwnProperty.call(user, 'isAdmin')
    && user.isAdmin === true
}

Why this approach scored 96.4/100:

  • Null-prototype objects – Prevents inheritance attacks
  • Key filtering – Blocks __proto__, constructor, prototype
  • Own-property checks – Validates isAdmin is directly set, not inherited
  • Helper function – Consistent null-prototype creation across app

Compare this to Grok 4's simpler approach (85/100), which only added basic key filtering but missed null-prototype objects and own-property validation—leaving edge cases unprotected.
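To make that gap concrete, here is the kind of partial filter that leaves a bypass open (our illustration of the failure mode, not Grok 4's actual output):

// Illustrative partial fix: blocks the obvious key but leaves a bypass open
function merge(target, source) {
  for (const key in source) {
    if (key === '__proto__') continue
    // Gap: {"constructor": {"prototype": {"isAdmin": true}}} still reaches
    // Object.prototype via target.constructor.prototype
    if (typeof source[key] === 'object' && source[key] !== null) {
      target[key] = merge(target[key] || {}, source[key])
    } else {
      target[key] = source[key]
    }
  }
  return target
}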

Cost Analysis: GPT-5 Costs 49% of Budget

💰 Total Cost: $4.46 for 3 Evaluations × 6 Models

GPT-5 alone cost $2.18 (48.87% of the budget) – nearly as much as all other models combined!

| Model | Total Cost | % of Budget | Avg Score | Value Rating |
|-------|------------|-------------|-----------|--------------|
| GPT-5 | $2.18 | 48.87% | 95.4 | Premium |
| OpenAI o3 | $0.97 | 21.76% | 92.7 | Good |
| Claude Opus 4.1 | $0.88 | 19.79% | 87.7 | Fair |
| Claude Sonnet 4.5 | $0.19 | 4.35% | 88.2 | ⭐ Best Value |
| Grok 4 | $0.14 | 3.23% | 84.1 | Budget |
| Gemini 2.5 Pro | $0.09 | 2.00% | 89.2 | ⭐ Cheapest |

  • Most expensive evaluation: the agentic AI attack, at $1.93 – GPT-5 generated 22,711 characters analyzing the multi-layer attack
  • Cheapest evaluation: prototype pollution, at $0.88 – a classic vulnerability requiring less reasoning
  • Average: $1.49 per evaluation, or $0.25 per model execution

💡 Budget Recommendation

If cost matters: Use Claude Sonnet 4.5 or Gemini 2.5 Pro for 90%+ of GPT-5's quality at 4-9% of its cost.

If quality matters: Use GPT-5 for mission-critical security audits, or OpenAI o3 as middle ground (97% of GPT-5's quality at 44% of cost).

The Plot Twist: Human Disagreed with AI Judge

🤔 What Happened

On the ImageMagick command injection vulnerability:

  • AI judge's choice: GPT-5 – 95.8/100, ranked #1 by the AI judge
  • User's choice ✅: OpenAI o3 – 90.4/100, ranked #4 by the AI judge

User's comment:

"is better i think because"

Note: The comment was incomplete, but the user's choice reveals a key insight—human security experts prioritize different factors than AI judges. They likely valued o3's pragmatism (simpler, deployable fixes), clarity (easier to understand for teams), and production-readiness over GPT-5's more comprehensive but complex approach.

Why This Matters

AI Judges Optimize For:
  • Completeness (all criteria addressed?)
  • Thoroughness (how detailed?)
  • Code quality (style, structure)
Human Experts Value:
  • Pragmatism – Is this actually deployable?
  • Simplicity – Fewer moving parts
  • Clarity – Can my team maintain this?

Possible reasons the researcher chose o3 over GPT-5:

  1. Simpler fix – o3's solution may have been more straightforward
  2. Better explanation – o3 might have explained the "why" more clearly
  3. Production-ready – Less over-engineering than GPT-5
  4. Personal experience – They've used o3 before and trust its outputs

What This Teaches Us

Community voting ≠ AI judging. AI judges are objective but may miss human intuition. Security experts weigh different factors than AI rubrics.

This is why CodeLens combines both:

  • AI judge provides instant, consistent scoring
  • Human votes validate and correct AI blind spots

Real-world lesson: Don't blindly trust AI scores. Get human review on critical security decisions. Best approach: Use AI to triage, humans to validate.

Performance by Vulnerability Type

📊 Classic vs. Cutting-Edge Vulnerabilities

Pattern discovered: All models excel at classic vulnerabilities (prototype pollution, command injection). But newer attacks (agentic AI) create wider performance gaps.

Prototype Pollution (2019 Vulnerability, Well-Known)

| Model | Score | Detection | Key Insight |
|-------|-------|-----------|-------------|
| GPT-5 | 96.4 | ✅ | 4 mitigation strategies, production-ready |
| OpenAI o3 | 95.2 | ✅ | Clean helpers, null-prototype containers |
| Claude Sonnet 4.5 | 91.0 | ✅ | Multi-layer defense with validation |
| Gemini 2.5 Pro | 90.0 | ✅ | Simple fix, some edge cases missed |
| Claude Opus 4.1 | 86.0 | ✅ | Overengineered but comprehensive |
| Grok 4 | 85.0 | ✅ | Partial mitigation, incomplete filtering |

Insight: All models caught it, but GPT-5's fix was 13% better than Grok 4's.

Agentic AI Supply-Chain Attack (2025 Cutting-Edge)

| Model | Score | Detection | Key Insight |
|-------|-------|-----------|-------------|
| GPT-5 | 94.0 | ✅ | Defense-in-depth with scoped tokens |
| OpenAI o3 | 92.4 | ✅ | Trust boundaries + policy gating |
| Gemini 2.5 Pro | 87.4 | ✅ | Comprehensive but complex |
| Claude Opus 4.1 | 83.8 | ✅ | TypeScript + complex classes |
| Grok 4 | 83.2 | ✅ | Brittle token decode |
| Claude Sonnet 4.5 | 82.0 | ✅ | Over-engineered, lowest score |

Insight: Claude Sonnet 4.5 scored 12 points lower on the advanced attack vs. classic vulnerabilities.

🎯 Pattern: Advanced Attacks Favor Frontier Models

Classic vulnerabilities (prototype pollution, command injection): scores clustered between 85 and 96/100.

Advanced attack (agentic AI): scores spread from 82 to 94/100, and the rankings reshuffled – Claude Sonnet 4.5 dropped from 91.0 to 82.0.

Conclusion: For well-known vulnerabilities (OWASP Top 10), any model works. For cutting-edge attacks (LLM security, supply-chain), use GPT-5 or o3. Budget models excel at classics but struggle with novelty.

Methodology & Transparency

How We Scored These Evaluations

1. AI Judge Selection (Dynamic)

For each evaluation, we select the current #1 model (by average score across all completed evaluations) to judge new submissions. This ensures the highest-performing model evaluates its competitors, and a backup rule guards against self-judging.

For these evaluations:

  • Judge Model: GPT-5 (was #1 at time of evaluation)
  • Backup Rule: If top model is in evaluation, use #2 model
  • Fallback: GPT-5 if no prior benchmark data exists
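In code, the selection rule looks roughly like this (a sketch with our own names, not the platform's actual implementation):

// Sketch of the judge-selection rule; identifiers are ours
function selectJudge(leaderboard, competitors) {
  if (leaderboard.length === 0) return 'gpt-5'             // fallback: no prior benchmark data
  const [first, second] = leaderboard                      // sorted by average score, best first
  if (competitors.includes(first) && second) return second // backup rule: avoid self-judging
  return first                                             // default: current #1 judges
}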

2. Scoring Rubric (5 Criteria, 0-100)

Each model's output is scored on 5 security-focused criteria:

  • Correctness (20 pts) – Does the fix actually prevent the vulnerability?
  • Completeness (20 pts) – Are all attack vectors addressed?
  • Code Quality (20 pts) – Is the code production-ready?
  • Security (20 pts) – Does it follow security best practices?
  • Performance (20 pts) – Does the fix introduce performance issues?

Final score = sum of the 5 criteria (20 points each, 100 total). All scores are visible in the raw evaluation data (see the transparency section below).
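As a sketch of the arithmetic (made-up criterion scores, not an actual breakdown from the dataset):

// Hypothetical criterion scores for one model output (not real data)
const criteria = { correctness: 20, completeness: 19, codeQuality: 19, security: 20, performance: 18 }
const finalScore = Object.values(criteria).reduce((sum, pts) => sum + pts, 0) // 96 out of 100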

3. Bias Mitigation

  • No self-judging: AI judge cannot evaluate its own output
  • Human votes override AI scores: Community voting is the ultimate arbiter
  • Blind evaluation: AI judge doesn't know which model generated which output
  • Open data: All model outputs and scores are publicly viewable

4. Why This Approach Works

Self-validating benchmark: The current best model judges new submissions, creating a competitive "survival of the fittest" dynamic. As models improve, the judging bar rises automatically.

Human validation loop: AI scores provide instant feedback, but human security experts have the final say. This case study is a perfect example—human voted for o3, AI judge chose GPT-5.

🔓 Full Transparency: Raw Data Available

Every evaluation on CodeLens.AI is publicly accessible. The evaluation pages for this case study include the original vulnerable code, the task description, all 6 model outputs, AI judge scores broken down by criterion, and the voting results.

Key Takeaways & Recommendations

1. Detection ≠ Quality

All models caught all vulnerabilities (100% detection rate), but the quality of the fixes varied by 8-18 points depending on the vulnerability.

Lesson: Don't just ask "Did AI catch it?" Ask "Is the fix production-ready?"

2. Cost vs. Quality Tradeoff is Real

GPT-5: Best quality (95.4) but 49% of budget. Claude Sonnet: 92% of quality at 9% of cost.

Lesson: Define your quality threshold, then optimize for cost.

3. Human Experts ≠ AI Judges

AI judge chose GPT-5 (95.8 score). Security researcher chose o3 (90.4 score, ranked #4).

Lesson: Get human validation on critical security decisions.

4. Advanced Attacks Favor Frontier Models

Classic vulnerabilities: All models 85-96/100. Cutting-edge (agentic AI): 82-94/100 (12-point spread).

Lesson: Use GPT-5/o3 for novel threats, budget models for OWASP Top 10.

5. Model Choice Depends on Use Case

Not "which model is best?" but "best for what?" Different models excel at different domains.

Lesson: Match the model to the mission.

📋 Recommendation Matrix

For Mission-Critical Production Code → GPT-5

Cost: $0.73/eval avg, 95.4 quality

Use when: Financial systems, healthcare, authentication

Why: Most comprehensive fixes, defense-in-depth

For Everyday Security Audits → Claude Sonnet 4.5

Cost: $0.06/eval avg, 88.2 quality

Use when: Regular code reviews, PR automation

Why: 92% of GPT-5's quality at 9% of cost

For Budget-Constrained Teams → Gemini 2.5 Pro

Cost: $0.03/eval avg, 89.2 quality

Use when: Startups, open source, high-volume scanning

Why: Cheapest option, surprisingly strong performance

For Pragmatic Fixes → OpenAI o3

Cost: $0.32/eval avg, 92.7 quality

Use when: You want simple, deployable solutions

Why: Security expert's choice, good balance

Try It Yourself

Want to see which AI models catch vulnerabilities in your codebase?

Submit to CodeLens:

  1. Paste your vulnerable code (50-500 lines)
  2. Describe the security issue you're testing
  3. Get instant comparison across 6 top models
  4. Vote on which model's fix you'd actually deploy

No credit card required

Conclusion

The security researcher who submitted these vulnerabilities taught us something important: detection is table stakes, but quality is what matters.

Every AI model caught every vulnerability. That's impressive—a few years ago, this would have been impossible.

But the spread in fix quality (84-95/100) shows that not all AI security reviews are created equal. GPT-5 delivered the most comprehensive solutions. Claude Sonnet 4.5 offered 92% of the quality at 9% of the cost. And OpenAI o3 provided the pragmatic fixes that a real security engineer preferred over the AI judge's top pick.

The takeaway? Match the model to the mission. Use frontier models for novel threats and mission-critical code. Use budget models for everyday OWASP Top 10 scans. And always get human validation on the fixes you actually deploy.

Because in security, good enough isn't good enough.