EVMbench Shows AI Can Exploit 70% of Critical Bugs—Should We Trust AI Auditors?

The smart contract security landscape just shifted dramatically. Last month, OpenAI and Paradigm released EVMbench—an open benchmark evaluating AI agents’ ability to detect, patch, and exploit vulnerabilities in smart contracts. The results are both impressive and concerning.

The Numbers That Matter

GPT-5.3-Codex can now exploit over 70% of critical, fund-draining bugs from Code4rena competitions. When this project started, that number was below 20%. The benchmark dataset includes 120 curated vulnerabilities across 40 real audits, covering the full spectrum from reentrancy attacks to complex business logic flaws.

But here’s where it gets interesting: while AI excels at exploitation (70%+), performance drops significantly on detection and patching tasks. Why? Because exploitation is pattern matching—find the vulnerable code path and execute it. Detection and patching require understanding context, business requirements, and subtle edge cases that AI still struggles with.

Hybrid Approach: The Real Winner

The most compelling finding isn’t about AI alone. Hybrid AI+human audits catch 95%+ of vulnerabilities, compared to 60-70% for manual-only or 70-85% for AI-only approaches. And they do it at 40-60% lower cost with much faster turnaround.

This makes sense when you understand the division of labor. AI excels at:

  • Pattern-based vulnerabilities: Reentrancy, access control bugs, integer overflows
  • Scale and speed: Analyzing 50,000+ contracts monthly
  • Comprehensive coverage: Following every code path exhaustively

Humans excel at:

  • Business logic validation: Does the code do what it’s supposed to do?
  • Economic attack vectors: Game theory, oracle manipulation, governance exploits
  • Novel patterns: Attack vectors not present in training data
  • Context: Understanding how contracts interact across complex DeFi protocols
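To make the "pattern-based" point concrete, here's a toy sketch of the kind of structural check AI tools automate. This is not how a real auditor works (real tools build an AST and track control flow); the regex and function names are my own invention:

```python
import re

# Toy heuristic: flag functions where an external call (.call{value:...})
# appears before a write to a balances[] mapping. Real tools do far more,
# but this shows why structural patterns are easy to automate.
REENTRANCY_HINT = re.compile(
    r"\.call\{value:[^}]*\}\([^)]*\);(?P<rest>.*?)balances\[",
    re.DOTALL,
)

def flag_reentrancy_hint(solidity_src: str) -> bool:
    """Return True if an external call precedes a balances[] update."""
    return REENTRANCY_HINT.search(solidity_src) is not None

vulnerable = """
function withdraw(uint amount) external {
    require(balances[msg.sender] >= amount);
    (bool ok, ) = msg.sender.call{value: amount}("");
    require(ok);
    balances[msg.sender] -= amount;   // state update AFTER the call
}
"""

safe = """
function withdraw(uint amount) external {
    require(balances[msg.sender] >= amount);
    balances[msg.sender] -= amount;   // state update BEFORE the call
    (bool ok, ) = msg.sender.call{value: amount}("");
    require(ok);
}
"""

print(flag_reentrancy_hint(vulnerable))  # True
print(flag_reentrancy_hint(safe))        # False
```

A human can beat this scanner trivially; the point is that AI-grade pattern matching scales this idea to thousands of patterns across 50,000+ contracts.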

My Experience as an Auditor :memo:

I’ve been testing AI audit tools for the past three months, and the results align with EVMbench’s findings. AI is amazing at catching the obvious stuff—reentrancy guards, access modifiers, unchecked math. But it completely misses business logic bugs.

Last week, I audited a lending protocol. The AI tool gave it a clean bill of health. My human review found a critical flaw where the liquidation logic could be manipulated to drain the protocol under specific market conditions. The code was technically correct, but the economic model was broken.

Where AI Falls Short

Training Data Bias: AI models are trained on historical exploits. They’re excellent at finding vulnerabilities similar to past attacks but may miss entirely novel attack vectors. The next major exploit will likely come from a pattern the AI has never seen.

Multi-Contract Complexity: AI struggles with cross-contract interactions—precisely where the most catastrophic vulnerabilities hide.

The .8 Billion Question

From 2024-2025, .8 billion was lost to smart contract exploits. Could AI have prevented these? Some, absolutely. But analyzing past exploits shows that roughly 40% were business logic failures that current AI tools wouldn’t catch.

The Path Forward

I’m not arguing against AI in security—it’s a game-changer for efficiency. But we need realistic expectations:

  1. Use AI for pre-screening: Let AI catch the low-hanging fruit (saves 30-40% of audit time)
  2. Human review for critical logic: Business rules, economic models, governance mechanisms
  3. AI verification of fixes: After patching, run AI again to ensure no regressions

My recommendation: For any protocol holding user funds, use hybrid AI+human audits. The cost savings from AI make human review more affordable, not obsolete.

Question for developers: Have you tried AI audit tools on your contracts? What did they catch vs miss? Let’s build collective knowledge here. :magnifying_glass_tilted_left:

This is an excellent summary, Sarah. Your lending protocol example perfectly illustrates the gap between code correctness and economic security. :locked:

The Academic Perspective

I’ve been researching this intersection of AI and security from a formal verification standpoint. What EVMbench reveals is fundamentally a limitation in how we train AI models. They’re pattern matchers, not reasoning systems.

Consider reentrancy: it has a clear structural pattern (external call → state change). An AI trained on 1,000 reentrancy examples will catch the 1,001st. But a novel economic exploit—like the liquidation manipulation you found—has no clear pattern to match against. It requires understanding:

  • Protocol economics and incentive structures
  • Market dynamics and edge cases
  • Game-theoretic attack strategies
  • Cross-protocol composability risks
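The reentrancy pattern above is mechanical enough to demonstrate in a few lines. This Python toy simulates the Solidity semantics (invented names, not real EVM behavior): the "external call" is a callback that fires before the balance update, so the callee can re-enter while its old balance still passes the check:

```python
class Vault:
    """Toy vault with the vulnerable call-before-update ordering."""
    def __init__(self):
        self.balances = {}
        self.ether = 0

    def deposit(self, who, amount):
        self.balances[who] = self.balances.get(who, 0) + amount
        self.ether += amount

    def withdraw(self, who, amount, on_receive):
        assert self.balances.get(who, 0) >= amount, "insufficient balance"
        self.ether -= amount
        on_receive()                   # external call happens FIRST (the bug)
        self.balances[who] -= amount   # state is updated only afterwards

vault = Vault()
vault.deposit("victim", 100)
vault.deposit("attacker", 10)

stolen = [0]

def reenter():
    # Re-enter while the attacker's balance check still passes.
    if stolen[0] < 20:  # cap recursion depth for the demo
        stolen[0] += 10
        vault.withdraw("attacker", 10, reenter)

vault.withdraw("attacker", 10, reenter)
# Attacker deposited 10 but pulled out 30; the vault is short 20 ether.
print(110 - vault.ether)  # 30
```

A model that has seen a thousand variants of this will catch variant 1,001. The liquidation exploit Sarah described has no such template.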

Current AI systems struggle with multi-contract interactions precisely because the attack surface is combinatorial. Three independently secure contracts might create a vulnerability when combined. AI tools analyze contracts in isolation and miss these emergent risks.

The Adversarial Angle :warning:

Here’s what keeps me up at night: adversarial attacks on AI auditors. If everyone starts using the same AI audit tools, attackers will learn to write vulnerable code that passes AI scrutiny but contains subtle exploits.

Think about it: EVMbench is public. Code4rena exploits are public. AI training data is based on known patterns. A sophisticated attacker can:

  1. Study what patterns AI models flag as suspicious
  2. Obfuscate vulnerabilities to avoid those patterns
  3. Write exploitable code that looks “clean” to pattern-matching AI
  4. Wait for projects to deploy with AI-only audits

This isn’t theoretical—we see similar attacks in adversarial ML research. The solution is hybrid approaches where human auditors provide unpredictable, creative analysis that attackers can’t game.

The Training Data Dilemma

Purpose-built AI security agents detect 92% of known vulnerability types. That sounds great until you notice what the denominator hides: the missed 8% skews toward unusual patterns, and genuinely novel attacks, the ones that cause the biggest losses, aren’t “known types” at all.

We need continuous adversarial training where:

  • Human auditors feed AI new vulnerability patterns
  • AI learns from production exploits in real-time
  • Benchmark datasets expand beyond historical bugs

But this is always reactive. The most dangerous exploits are the ones we haven’t imagined yet.

Hybrid Workflow Framework

Building on your workflow recommendation, here’s what I’m advocating for protocols holding >M TVL:

Phase 1 - AI Pre-Screen (24-48 hours, ~K)

  • Run multiple AI tools (not just one)
  • Compare results across tools
  • Generate coverage map of analyzed code paths
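The cross-tool comparison in Phase 1 can be as simple as majority voting over normalized findings. A sketch, with made-up tool names and findings (real pipelines would also normalize severity and deduplicate near-matches):

```python
from collections import Counter

# Hypothetical findings from three AI audit tools, keyed by (file, issue id).
# Agreement across tools raises confidence; single-tool findings go to humans.
tool_reports = {
    "tool_a": {("Vault.sol", "reentrancy"), ("Vault.sol", "unchecked-call")},
    "tool_b": {("Vault.sol", "reentrancy"), ("Oracle.sol", "stale-price")},
    "tool_c": {("Vault.sol", "reentrancy"), ("Vault.sol", "unchecked-call")},
}

votes = Counter(f for findings in tool_reports.values() for f in findings)

consensus = sorted(f for f, n in votes.items() if n >= 2)     # likely real
needs_triage = sorted(f for f, n in votes.items() if n == 1)  # human review

print(consensus)     # [('Vault.sol', 'reentrancy'), ('Vault.sol', 'unchecked-call')]
print(needs_triage)  # [('Oracle.sol', 'stale-price')]
```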

Phase 2 - Human Scoping (1 week, ~K)

  • Security expert reviews AI findings
  • Identifies high-risk areas AI might have missed
  • Scopes deep-dive audit focus areas

Phase 3 - Deep Human Audit (2-4 weeks, ~K)

  • Manual review of business logic
  • Economic attack modeling
  • Cross-protocol interaction analysis
  • Novel attack vector brainstorming

Phase 4 - AI Fix Verification (24 hours, ~)

  • Verify patches don’t introduce regressions
  • Ensure fixes actually address root causes
  • Continuous monitoring post-deployment

Total cost: ~$51K vs $80-100K for pure human audit
Total time: 4-6 weeks vs 6-8 weeks
Coverage: 95%+ vs 70% (human-only) or 85% (AI-only)

The economics make sense, but we need industry standards to prevent projects from cutting corners.

Question: Should we push for insurance requirements that mandate hybrid audits for DeFi protocols? Make it a risk-based regulatory framework rather than self-regulation?

As someone building a DeFi protocol, this conversation is both exciting and terrifying. :sweat_smile:

The Economics Are Compelling… Maybe Too Compelling

Sophia’s workflow breakdown shows hybrid audits at ~$51K vs $80-100K for traditional human audits. That 40-50% cost reduction is huge for protocols like mine operating on tight budgets.

But here’s my concern: the race to the bottom.

I’ve already seen projects in our ecosystem advertising “AI-audited” as if it’s equivalent to a Trail of Bits audit. They’re using free or cheap AI tools ($1K), calling it good enough, and launching with millions in TVL. When something goes wrong, they’ll say “but we had an audit!”

The problem isn’t the technology—it’s the incentives. If Protocol A spends $50K on a hybrid audit and Protocol B spends $1K on an AI-only audit, and both market themselves as “audited,” which one do investors choose? Most can’t tell the difference.

The False Confidence Problem

Sophia mentioned this, but I want to emphasize it from a founder perspective. When you run an AI audit and it comes back clean, there’s a psychological tendency to think “great, we’re secure.”

I’ve experienced this myself. We ran our contracts through an AI auditor, got a green checkmark, and our team breathed a sigh of relief. Then our human auditor found three critical issues the AI missed:

  1. Oracle manipulation in our price feed aggregation
  2. Flash loan attack vector in our liquidation mechanism
  3. Governance takeover possibility if someone accumulated enough voting power

None of these are “code bugs” that AI excels at finding. They’re economic attack vectors that require understanding our protocol’s game theory.

The scary part? If we had just trusted the AI audit and launched, we’d probably be the next “DeFi hack of the month” headline.

The Market Segmentation Question

Maybe the answer isn’t “hybrid audits for everyone” but market segmentation:

Tier 1 - Low-Value / Experimental (<$1M TVL)

  • AI-only audit acceptable
  • Clear disclosure: “AI-audited, use at your own risk”
  • Serves as testing ground for new protocols

Tier 2 - Medium-Value ($1M-$10M TVL)

  • Hybrid AI+human audit required
  • Insurance coverage available
  • Monthly AI monitoring post-launch

Tier 3 - High-Value (>$10M TVL)

  • Full human audit + AI pre-screening
  • Multiple audit firms
  • Formal verification for critical components
  • Mandatory insurance and safety modules

This way, builders have affordable options that match their risk level, and users can make informed decisions.

The Insurance Angle

Sophia asked about insurance requirements—I think this is the key. If DeFi insurance providers (Nexus Mutual, InsurAce) start requiring hybrid audits for coverage, the market will follow. Right now, many projects skip insurance entirely because audits are expensive AND insurance is expensive.

But if AI cuts audit costs in half, suddenly insured protocols become financially viable. That’s a win for everyone.

Question for the community: Would you deposit in a protocol that disclosed “AI-only audit” if it clearly showed lower TVL cap and higher reward APY to compensate for risk? Or does AI audit = automatic red flag?

This thread is fascinating and honestly a bit overwhelming for someone like me who’s still learning security! :sweat_smile:

AI Tools Democratized Security for My Side Project

I’ve been building a small lending protocol as a learning project (definitely <$100K TVL if it ever launches). I could never afford a $50K+ audit, so AI tools have been a game-changer.

I used a free AI auditor and it found six issues I would never have caught on my own, including:

  • Missing access control on an admin function
  • Potential reentrancy in my withdrawal logic
  • Integer overflow risk (I wasn’t using SafeMath consistently)
  • Unchecked return values from external calls

These are probably “easy” bugs for experienced auditors, but for me, they were invisible. The AI tool basically taught me what to look for.

But I’m Also Scared I’m Overconfident

Reading Sophia and Diana’s comments, I realize the AI gave me false confidence. It found code bugs but probably didn’t check if my liquidation math actually makes economic sense or if someone could game my oracle setup.

The AI audit report was really technical and I didn’t understand half of it. It said things like “potential frontrunning vector in function X” but didn’t explain what that means in practice or how bad it would be.

Sarah, you mentioned this issue—AI audit reports need better UX for developers like me. I’d honestly pay for a service that:

  1. Runs AI audit
  2. Translates findings into plain English
  3. Explains severity and real-world impact
  4. Suggests specific fixes with code examples

Right now, I’m worried I have vulnerabilities the AI found but I don’t understand, AND vulnerabilities the AI missed that I definitely don’t understand.

The Learning Perspective

One positive thing: going through the AI audit report forced me to learn about security patterns. I had to research what reentrancy means, why access control matters, how to prevent integer overflows.

So even if the AI isn’t perfect, it’s educational for new developers. It’s like having a security mentor who points out your mistakes (even if they don’t catch everything).

My Ask for This Community

For those of you who are security experts—how do we educate developers to use AI tools responsibly?

I see tons of tutorials on “how to write a smart contract” but very few on “how to interpret an AI security audit” or “what to do when AI flags a potential vulnerability.”

If there were community resources like:

  • Video walkthroughs of common AI audit findings
  • Checklist of things AI can’t catch (so you know what to manually review)
  • Guidelines on when you actually need a human auditor

That would help people like me use AI as a learning tool without getting overconfident.

Diana’s tier system makes a lot of sense—I’m definitely in the “Tier 1: Experimental” category and should be transparent about only having AI audit. But I want to learn enough to eventually build something Tier 2 worthy with proper hybrid audit.

Thanks for this discussion—it’s making me rethink my entire security approach! :folded_hands:

As someone who analyzes on-chain data all day, I wanted to bring some numbers into this discussion. Let me break down what the exploit data actually tells us about AI’s potential impact.

Following the Money: What Actually Causes Exploits?

I analyzed the $905M in smart contract losses from 2025 (from the OWASP Smart Contract Top 10 dataset) and categorized them by vulnerability type:

Technical Code Bugs (30% of losses)

  • Reentrancy: $89M (10%)
  • Access control: $71M (8%)
  • Integer overflow: $54M (6%)
  • Other code bugs: $54M (6%)

Business Logic Flaws (40% of losses)

  • Oracle manipulation: $163M (18%)
  • Economic exploits: $127M (14%)
  • Flash loan attacks: $72M (8%)

Governance & Upgradeability (20% of losses)

  • Proxy/upgrade exploits: $108M (12%)
  • Governance attacks: $72M (8%)

Cross-Protocol Interactions (10% of losses)

  • Bridge hacks: $54M (6%)
  • Composability bugs: $36M (4%)

What This Means for AI Auditors

EVMbench tests AI on the 30% category—technical code bugs. And AI is crushing it there, with a 70%+ success rate on exploitation. That’s amazing!

But here’s the problem: 70% of $272M is only $190M in prevented losses. The other $633M in losses comes from categories where AI currently performs poorly.
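For transparency, here's the back-of-envelope arithmetic behind those figures, using only the numbers from the breakdown above (all values in $M):

```python
# Back-of-envelope math from the 2025 loss breakdown above (figures in $M).
total_losses = 905       # total 2025 smart contract losses
technical_share = 0.30   # technical code bugs: the category EVMbench covers
ai_success = 0.70        # AI success rate on that category

technical = total_losses * technical_share  # ~271.5, i.e. "$272M"
prevented = technical * ai_success          # ~190,   i.e. "$190M"
remaining = total_losses - technical        # ~633.5, i.e. "$633M"

print(technical, prevented, remaining)
```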

The biggest single loss category—oracle manipulation at $163M—requires understanding:

  • How price feeds work across multiple protocols
  • Market manipulation strategies
  • Time-weighted average price (TWAP) vulnerabilities
  • Economic incentives for attackers

AI tools analyze contract code, but oracle manipulation is about economic game theory, not code correctness.
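To illustrate the TWAP point with a toy model (all numbers invented): a time-weighted average barely reacts to a one-block spot spike, which is why protocols use it, but an attacker who can sustain the manipulation for much of the window still moves it:

```python
# Toy TWAP: average of per-block price observations over a window.
def twap(observations):
    return sum(observations) / len(observations)

honest = [100.0] * 30               # 30 blocks of honest $100 price
spiked = [100.0] * 29 + [500.0]     # spot pumped 5x for a single block
held   = [100.0] * 15 + [500.0] * 15  # pump sustained for half the window

print(twap(honest))  # 100.0
print(twap(spiked))  # ~113.3: spot moved 5x, TWAP moved ~13%
print(twap(held))    # 300.0: sustained manipulation defeats the TWAP
```

Whether the sustained attack is profitable depends on capital costs and incentives across protocols, exactly the economic reasoning that code-focused AI tools don't model.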

The Pareto Principle Problem

AI is solving the wrong 30% of the problem really well.

Don’t get me wrong—preventing $190M in losses is huge! And as Emma mentioned, AI is great for catching basic bugs that new developers miss.

But if we’re being honest about AI’s limitations, we need to acknowledge that business logic and economic security are where the big money is lost, and AI isn’t solving that (yet).

Data-Driven Recommendations

Based on this analysis, here’s what I think makes sense:

1. Use AI for Technical Pre-Screening
AI catching 70% of technical bugs for $1K is incredible ROI. Every protocol should run AI audits to catch low-hanging fruit.

2. Allocate Human Audit Time to High-Value Areas
Since AI handles technical bugs, human auditors can focus on:

  • Economic attack modeling (40% of losses)
  • Governance security (20% of losses)
  • Cross-protocol risks (10% of losses)

3. Build Better Benchmarks
EVMbench is great but limited. We need benchmarks for:

  • Oracle manipulation scenarios
  • Flash loan attack compositions
  • Governance exploit patterns
  • Multi-protocol interaction bugs

4. Hybrid Workflow Is Optimal
Diana’s tiered system makes economic sense, but even “low-value” protocols should run AI pre-screening because it’s so cheap.

Open Data Initiative Proposal

I want to build an open-source database of AI audit accuracy that tracks:

  • Which AI tools catch which vulnerability types
  • False positive/negative rates by vulnerability category
  • Cost-effectiveness metrics (cost per bug found)
  • Comparison across different AI auditors
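As a sketch of what the per-tool metrics could look like (ground truth and findings below are placeholder data, not real audit results):

```python
# Per-tool accuracy metrics the proposed database could track.
def audit_metrics(ground_truth: set, reported: set, audit_cost: float) -> dict:
    true_pos = reported & ground_truth
    false_neg = ground_truth - reported
    return {
        "precision": len(true_pos) / len(reported) if reported else 0.0,
        "recall": len(true_pos) / len(ground_truth) if ground_truth else 1.0,
        "false_negatives": sorted(false_neg),
        "cost_per_bug_found": audit_cost / len(true_pos) if true_pos else float("inf"),
    }

# Placeholder data: what the audit actually missed matters most.
ground_truth = {"reentrancy", "oracle-manipulation", "access-control"}
tool_found = {"reentrancy", "access-control", "gas-griefing"}  # one false positive

m = audit_metrics(ground_truth, tool_found, audit_cost=1000.0)
print(m["precision"], m["recall"])  # 0.666..., 0.666...
print(m["false_negatives"])         # ['oracle-manipulation']
print(m["cost_per_bug_found"])      # 500.0
```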

This would help developers choose tools and understand limitations. Would anyone be interested in contributing audit data (anonymized) or helping build this?

Also happy to create dashboards showing which vulnerability categories are being addressed vs neglected by current AI tools.

The key insight from the data: AI is a force multiplier for human auditors, not a replacement. Use it to handle the 30% it’s good at, so humans can focus on the 70% where they add unique value.