
Formal Verification of Smart Contracts and AI-Assisted Auditing

39 min read
Dora Noda
Software Engineer

Formal Verification in Smart Contract Auditing

Formal verification refers to the use of mathematical and logic-based techniques to prove the correctness and security of smart contracts. In practice, this encompasses a spectrum of methodologies: from property-based fuzz testing and symbolic execution, to rigorous theorem proving and model checking. The goal is to ensure a contract meets its specifications and contains no exploitable bugs across all possible inputs and states. Given the high stakes (billions of dollars are locked in DeFi protocols), formal methods have become increasingly important for Ethereum and other blockchain platforms.

Traditional Approaches: Early formal methods for Ethereum included symbolic execution tools like Oyente and Mythril, and static analyzers like Slither and Securify. Symbolic execution explores program paths with symbolic inputs to detect issues (e.g. reentrancy, integer overflow), while static analysis uses rule-based pattern matching. These tools have had real successes but also clear limitations: for example, Oyente raised many false alarms even on simple contracts, and Slither’s pattern-based detectors can still produce false positives. Moreover, a 2023 study found that over 80% of exploitable contract bugs (especially complex “business logic” bugs) were missed by current tools, underscoring the need for more robust verification techniques.

The Promise and Challenge of Full Verification: In theory, formal verification can prove the absence of bugs by exhaustively checking invariants for all states. Tools like the Certora Prover or the Ethereum Foundation’s KEVM framework aim to mathematically verify smart contracts against a formal specification. For example, Certora’s “automated mathematical auditor” uses a specification language (CVL) to prove or refute user-defined rules. In practice, however, fully proving properties on real-world contracts is often unattainable or very labor-intensive. Code may need to be rewritten into simplified forms for verification, custom specs must be written, loops and complex arithmetic might require manual bounds or abstractions, and SMT solvers frequently time out on complex logic. As Trail of Bits engineers noted, “proving the absence of bugs is typically unattainable” on non-trivial codebases, and achieving it often requires heavy user intervention and expertise. Because of this, formal verification tools have traditionally been used sparingly for critical pieces of code (e.g. verifying a token’s invariant or a consensus algorithm), rather than entire contracts end-to-end.

Foundry’s Fuzz Testing and Invariant Testing

In recent years, property-based testing has emerged as a practical alternative to full formal proofs. Foundry, a popular Ethereum development framework, has built-in support for fuzz testing and invariant testing – techniques that greatly enhance test coverage and can be seen as lightweight formal verification. Foundry’s fuzz testing automatically generates large numbers of inputs to try to violate specified properties, and invariant testing extends this to sequences of state-changing operations:

  • Fuzz Testing: Instead of writing unit tests for specific inputs, the developer specifies properties or invariants that should hold for any input. Foundry then generates hundreds or thousands of random inputs to test the function and checks that the property always holds. This helps catch edge cases that a developer might not manually think to test. For example, a fuzz test might assert that a function’s return value is always non-negative or that a certain post-condition is true regardless of inputs. Foundry’s engine uses intelligent heuristics – it analyzes function signatures and introduces edge-case values (0, max uint, etc.) – to hit corner cases likely to break the property. If an assertion fails, Foundry reports a counterexample input that violates the property.

  • Invariant Testing: Foundry’s invariant testing (also called stateful fuzzing) goes further by exercising multiple function calls and state transitions in sequence. The developer writes invariant functions that should hold true throughout the contract’s lifetime (e.g. total assets = sum of user balances). Foundry then randomly generates sequences of calls (with random inputs) to simulate many possible usage scenarios, periodically checking that the invariant conditions remain true. This can uncover complex bugs that manifest only after a particular sequence of operations. Essentially, invariant testing explores the contract’s state space more thoroughly, ensuring that no sequence of valid transactions can violate the stated properties.
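
To make invariant testing concrete, here is a minimal sketch of what such a test can look like in Foundry, assuming a hypothetical Vault contract with a no-argument constructor that exposes totalAssets() and a sumOfUserBalances() view helper (all names are illustrative, not from any specific codebase):

    // SPDX-License-Identifier: MIT
    pragma solidity ^0.8.19;

    import "forge-std/Test.sol";
    import {Vault} from "../src/Vault.sol"; // hypothetical contract under test

    contract VaultInvariantTest is Test {
        Vault internal vault;

        function setUp() public {
            vault = new Vault();
            // Tell the fuzzer which contract's functions it may call in random order.
            targetContract(address(vault));
        }

        // Checked after every randomized call sequence the fuzzer generates.
        function invariant_totalAssetsMatchBalances() public view {
            assertEq(vault.totalAssets(), vault.sumOfUserBalances());
        }
    }

Running forge test then executes many randomized call sequences against the vault and reports the failing sequence whenever the accounting invariant is broken.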

Why Foundry Matters: Foundry has made these advanced testing techniques accessible. The fuzzing and invariant features are native to the developer workflow – no special harness or external tool is needed, and tests are written in Solidity alongside unit tests. Thanks to a Rust-based engine, Foundry can execute thousands of tests quickly (parallelizing them) and provide detailed failure traces for any invariant violation. Developers report that Foundry’s fuzzer is easy to use and highly performant, requiring only minor configuration (e.g. setting the number of iterations or adding assumptions to constrain inputs). A simple example from Foundry’s documentation is a fuzz test for a divide(a,b) function, which uses vm.assume(b != 0) to avoid trivial invalid inputs and then asserts mathematical post-conditions like result * b <= a. By running such a test with thousands of random (a,b) pairs, Foundry might quickly discover edge cases (like overflow boundaries) that would be hard to find by manual reasoning.
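
That divide example reads roughly as follows in test code – a minimal sketch, with divide written inline as a hypothetical helper so the test is self-contained:

    // SPDX-License-Identifier: MIT
    pragma solidity ^0.8.19;

    import "forge-std/Test.sol";

    contract DivideFuzzTest is Test {
        // Stand-in for the divide(a, b) function under test.
        function divide(uint256 a, uint256 b) internal pure returns (uint256) {
            return a / b;
        }

        // Foundry fuzzes every parameter of a test function automatically.
        function testFuzz_Divide(uint256 a, uint256 b) public {
            vm.assume(b != 0);            // discard the trivially invalid input
            uint256 result = divide(a, b);
            assertLe(result * b, a);      // post-condition: result * b <= a
            assertLe(result, a);          // the quotient can never exceed the dividend
        }
    }

Each forge test run calls testFuzz_Divide with hundreds of random (a, b) pairs (the run count is configurable) and prints a counterexample pair if any assertion fails.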

Comparisons: Foundry’s approach builds on prior work in the community. Trail of Bits’ Echidna fuzzer was an earlier property-based testing tool for Ethereum. Echidna similarly generates random transactions to find violations of invariants expressed as Solidity functions returning a boolean. It’s known for “intelligent” input generation (incorporating coverage-guided fuzzing) and has been used to find many bugs. In fact, security researchers note that Echidna’s engine is extremely effective – “the Trail of Bits Echidna is the best fuzzer out there due to its intelligent random number selection” – though Foundry’s integrated workflow makes writing tests simpler for developers. In practice, Foundry’s fuzz testing is often regarded as the new “bare minimum” for secure Solidity development, complementing traditional unit tests. It cannot prove the absence of bugs (since it’s randomized and not exhaustive), but it greatly increases confidence by covering a vast range of inputs and state combinations.
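
For comparison, an Echidna property is written as a Solidity function whose name starts with echidna_, takes no arguments, and returns a bool. A minimal sketch follows, assuming a hypothetical Token contract with a no-argument constructor and public balanceOf/totalSupply view functions (all illustrative names):

    // SPDX-License-Identifier: MIT
    pragma solidity ^0.8.19;

    import {Token} from "./Token.sol"; // hypothetical token under test

    // Echidna deploys this harness, calls the inherited public functions with
    // random inputs, and flags any echidna_-prefixed property that returns false.
    contract TokenEchidnaTest is Token {
        address internal constant USER = address(0x10000);

        // Property: no single account balance may ever exceed the total supply.
        function echidna_balance_never_exceeds_supply() public view returns (bool) {
            return balanceOf(USER) <= totalSupply();
        }
    }

Echidna’s coverage-guided engine then searches for a transaction sequence that makes the property return false and reports that sequence as a reproducible counterexample.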

Beyond Fuzzing: Formal Proofs and Advanced Tools

While fuzzing and invariants catch many issues, there are cases where stronger formal methods are used. Model checking and theorem proving involve specifying desired properties in a formal logic and using automated provers to check them against the contract code. Certora Prover (recently open-sourced) is a prominent tool in this category. It allows developers to write rules in a domain-specific language (CVL) and then automatically checks the contract for violations of those rules. Certora has been used to verify critical invariants in protocols like MakerDAO and Compound; for instance, it identified the “DAI debt invariant” bug in MakerDAO (a subtle accounting inconsistency) that went unnoticed for four years. Notably, Certora’s engine now supports multiple platforms (EVM, Solana’s VM, and eWASM), and by open-sourcing it in 2025, it made industrial-grade formal verification freely available on Ethereum, Solana, and Stellar. This move recognizes that formal proofs should not be a luxury for only well-funded teams.

Other formal tools include runtime verification approaches (e.g. the KEVM semantics of Ethereum, or the Move Prover for Move-based chains). The Move Prover in particular is built into the Move language (used by Aptos and Sui blockchains). It allows writing formal specifications alongside the code and can automatically prove certain properties with a user experience similar to a linter or type-checker. This tight integration lowers the barrier for developers on those platforms to use formal verification as part of development.

Summary: Today’s smart contract auditing blends these methodologies. Fuzzing and invariant testing (exemplified by Foundry and Echidna) are widely adopted for their ease of use and effectiveness in catching common bugs. Symbolic execution and static analyzers still serve to quickly scan for known vulnerability patterns (with tools like Slither often integrated into CI pipelines). Meanwhile, formal verification tools are maturing and expanding across chains, but they are typically reserved for specific critical properties or used by specialized auditing firms due to their complexity. In practice, many audits combine these approaches: e.g. using fuzzers to find runtime errors, static tools to flag obvious mistakes, and formal spec checks for key invariants like “no token balance exceeds total supply”.

AI-Assisted Auditing with Large Language Models

The advent of large language models (LLMs) like OpenAI’s GPT-3/4 (ChatGPT) and Codex has introduced a new paradigm for smart contract analysis. These models, trained on vast amounts of code and natural language, can reason about program behavior, explain code, and even detect certain vulnerabilities by pattern recognition and “common sense” knowledge. The question is: can AI augment or even automate smart contract auditing?

LLM-Based Vulnerability Analysis Tools

Several research efforts and prototype tools have emerged that apply LLMs to smart contract auditing:

  • OpenAI Codex / ChatGPT (general models): Early experiments simply involved prompting GPT-3 or Codex with contract code and asking for potential bugs. Developers found that ChatGPT could identify some well-known vulnerability patterns and even suggest fixes. However, responses were hit-or-miss and not reliably comprehensive. A recent academic evaluation noted that naive prompting of ChatGPT for vulnerability detection “did not yield significantly better outcomes compared to random prediction” on a benchmark dataset – essentially, if the model is not guided properly, it may hallucinate issues that aren’t there or miss real vulnerabilities. This highlighted the need for carefully engineered prompts or fine-tuning to get useful results.

  • AuditGPT (2024): This is an academic tool that leveraged ChatGPT (GPT-3.5/4) specifically to check ERC standard compliance in Ethereum contracts. The researchers observed that many ERC20/ERC721 token contracts violate subtle rules of the standard (which can lead to security or compatibility issues). AuditGPT works by breaking down the audit into small tasks and designing specialized prompts for each rule type. The result was impressive: in tests on real contracts, AuditGPT “successfully pinpoints 418 ERC rule violations and only reports 18 false positives”, demonstrating high accuracy. In fact, the paper claims AuditGPT outperformed a professional auditing service in finding ERC compliance bugs, at lower cost. This suggests that for well-defined, narrow domains (like enforcing a list of standard rules), an LLM with good prompts can be remarkably effective and precise.

  • LLM-SmartAudit (2024): Another research project, LLM-SmartAudit, takes a broader approach by using a multi-agent conversation system to audit Solidity code. It sets up multiple specialized GPT-3.5/GPT-4 agents (e.g. one “Auditor” focusing on correctness, one “Attacker” thinking of how to exploit) that talk to each other to analyze a contract. In their evaluation, LLM-SmartAudit was able to detect a wide range of vulnerabilities. Notably, in a comparative test against conventional tools, the GPT-3.5 based system achieved the highest overall recall (74%), outperforming all the traditional static and symbolic analyzers they tested (the next best was Mythril at 54% recall). It was also able to find all 10 vulnerability types in their test suite, whereas each traditional tool struggled with some categories. Moreover, by switching to GPT-4 and focusing the analysis (what they call Targeted Analysis mode), the system could detect complex logical flaws that tools like Slither and Mythril completely missed. For instance, on a set of real-world DeFi contracts, the LLM-based approach found hundreds of logic bugs whereas the static tools “demonstrated no effectiveness in detecting” those issues. These results showcase the potential of LLMs to catch subtle bugs that are beyond the pattern-matching scope of traditional analyzers.

  • Prometheus (PromFuzz) (2023): A hybrid approach is to use LLMs to guide other techniques. One example is PromFuzz, which uses a GPT-based “auditor agent” to identify suspect areas in the code, then automatically generates invariant checkers and feeds them into a fuzzing engine. Essentially, the LLM does a first-pass analysis (both from a benign and attacker perspective) to focus the fuzz testing on likely vulnerabilities. In evaluations, this approach achieved very high bug-finding rates – e.g. over 86% recall with zero false positives in certain benchmark sets – significantly outperforming standalone fuzzers or previous tools. This is a promising direction: using AI to orchestrate and enhance classical techniques like fuzzing, combining the strengths of both.

  • Other AI Tools: The community has seen various other AI-assisted auditing concepts. For example, Trail of Bits’ “Toucan” project integrated OpenAI Codex into their audit workflow (more on its findings below). Some security startups are also advertising AI auditors (e.g. “ChainGPT Audit” services), typically wrapping GPT-4 with custom prompts to review contracts. Open-source enthusiasts have created ChatGPT-based audit bots on forums that will take a Solidity snippet and output potential issues. While many of these are experimental, the general trend is clear: AI is being explored at every level of the security review process, from automated code explanation and documentation generation to vulnerability detection and even suggesting fixes.

Capabilities and Limitations of LLM Auditors

LLMs have demonstrated notable capabilities in smart contract auditing:

  • Broad Knowledge: An LLM like GPT-4 has been trained on vast amounts of code and vulnerability write-ups. It “knows” about common security pitfalls (reentrancy, integer overflow, etc.) and even some historical exploits. This allows it to recognize patterns that might indicate a bug, and to recall best practices from documentation. For example, it might remember that ERC-20 transferFrom should check allowances, and flag the absence of such a check as a violation (see the sketch after this list). Unlike a static tool that only flags known patterns, an LLM can apply reasoning to novel code and infer problems that weren’t explicitly coded into it by a tool developer.

  • Natural Explanations: AI auditors can provide human-readable explanations of potential issues. This is extremely valuable for developer experience. Traditional tools often output cryptic warnings that require expertise to interpret, whereas a GPT-based tool can generate a paragraph explaining the bug in plain English and even suggest a remediation. AuditGPT, for instance, not only flagged ERC rule violations but described why the code violated the rule and what the implications were. This helps in onboarding less-experienced developers to security concepts.

  • Flexibility: With prompt engineering, LLMs can be directed to focus on different aspects or follow custom security policies. They are not limited by a fixed set of rules – if you can describe a property or pattern in words, the LLM can attempt to check it. This makes them attractive for auditing new protocols that might have unique invariants or logic (where writing a custom static analysis from scratch would be infeasible).
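
As a concrete illustration of the transferFrom point above, the following deliberately broken fragment (hypothetical code, written only for illustration) shows the kind of standard violation a well-prompted LLM can call out in plain English:

    // SPDX-License-Identifier: MIT
    pragma solidity ^0.8.19;

    // Deliberately simplified and vulnerable; for illustration only.
    contract NaiveToken {
        mapping(address => uint256) public balanceOf;
        mapping(address => mapping(address => uint256)) public allowance;

        // BUG: the spender's allowance is never checked or decreased, so any
        // caller can move any holder's tokens – exactly the kind of ERC-20
        // rule violation described above.
        function transferFrom(address from, address to, uint256 amount) external returns (bool) {
            balanceOf[from] -= amount;
            balanceOf[to] += amount;
            return true;
        }
    }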

However, significant challenges and reliability issues have been observed:

  • Reasoning Limitations: Current LLMs sometimes struggle with the complex logical reasoning required for security analysis. Trail of Bits reported that “the models are not able to reason well about certain higher-level concepts, such as ownership of contracts, re-entrancy, and inter-function relationships”. For example, GPT-3.5/4 might understand what reentrancy is in theory (and even explain it), but it may fail to recognize a reentrancy vulnerability if doing so requires understanding a cross-function sequence of calls and state changes (a minimal example of such a pattern follows this list). The model might also miss vulnerabilities that involve multi-contract interactions or time-dependent logic, because those go beyond the scope of a single code snippet analysis.

  • False Positives and Hallucinations: A major concern is that LLMs can produce confident-sounding yet incorrect conclusions. In auditing, a “hallucination” might be flagging a vulnerability that isn’t actually there, or misinterpreting the code. Trail of Bits’ experiment with Codex (GPT) found that as they scaled to larger real-world contracts, “the false positive and hallucination rates [became] too high,” to the point that it would slow down auditors with too many spurious reports. This unpredictability is problematic – a tool that cries wolf too often will not be trusted by developers. AuditGPT’s success in keeping false positives low (only 18 out of hundreds of findings) is encouraging, but that was in a constrained domain. In general-purpose use, careful prompt design and maybe human review loops are needed to filter AI findings.

  • Context Limitations: LLMs have a context window limitation, meaning they can only “see” a certain amount of code at once. Complex contracts often span multiple files and thousands of lines. If an AI can’t ingest the whole codebase, it might miss important connections. Techniques like code slicing (feeding the contract in chunks) are used, but they risk losing the broader picture. The LLM-SmartAudit team noted that with GPT-3.5’s 4k token limit, they could not analyze some large real-world contracts until they switched to GPT-4 with a larger context. Even then, dividing analysis into parts can cause it to overlook bugs that manifest only when considering the system as a whole. This makes AI analysis of entire protocols (with multiple interacting contracts) a real challenge as of now.

  • Integration and Tooling: There is a lack of robust developer tooling around AI auditors. Running an LLM analysis is not as straightforward as running a linter. It involves API calls to a model, handling prompt engineering, rate limits, and parsing the AI’s responses. “The software ecosystem around integrating LLMs with traditional software is too crude and everything is cumbersome”, as one auditing team put it. There are virtually no off-the-shelf frameworks for continuously deploying an AI auditor in CI pipelines while managing its uncertainties. This is slowly improving (some projects are exploring CI bots that use GPT-4 for code review), but it’s early. Moreover, debugging why an AI gave a certain result is difficult – unlike deterministic tools, you can’t easily trace the logic that led the model to flag or miss something.

  • Cost and Performance: Using large models like GPT-4 is expensive and can be slow. If you want to incorporate an AI audit into a CI/CD pipeline, it might add several minutes per contract and incur significant API costs for large code. Fine-tuned models or open-source LLMs could mitigate cost, but they tend to be less capable than GPT-4. There’s ongoing research into smaller, specialized models for code security, but at present the top results have come from the big proprietary models.
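
To illustrate the cross-function reentrancy limitation mentioned above, here is a minimal, deliberately vulnerable sketch (hypothetical code, not taken from any audited project):

    // SPDX-License-Identifier: MIT
    pragma solidity ^0.8.19;

    // transferTo() looks harmless in isolation; the problem only appears when it
    // is read together with the stale state left by withdraw()'s external call.
    contract SharedStateWallet {
        mapping(address => uint256) public balances;

        function deposit() external payable {
            balances[msg.sender] += msg.value;
        }

        // Sends ether before zeroing the balance, handing control to the caller
        // while its recorded balance is still non-zero.
        function withdraw() external {
            uint256 amount = balances[msg.sender];
            (bool ok, ) = msg.sender.call{value: amount}("");
            require(ok, "send failed");
            balances[msg.sender] = 0;
        }

        // A re-entering attacker can call this during the external call above
        // and spend the same funds twice.
        function transferTo(address to, uint256 amount) external {
            require(balances[msg.sender] >= amount, "insufficient balance");
            balances[msg.sender] -= amount;
            balances[to] += amount;
        }
    }

Spotting this requires relating the external call in withdraw() to the state read in transferTo(), which is precisely the kind of inter-function reasoning that snippet-at-a-time prompting tends to miss.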

In summary, LLM-assisted auditing is promising but not a silver bullet. We are seeing hybrid approaches where AI helps generate tests or analyze specific aspects, rather than doing the entire audit end-to-end. For instance, an AI might suggest potential invariants or risky areas, which a human or another tool then investigates. As one security engineer remarked, the technology is not yet ready to replace human auditors for critical tasks, given the reasoning gaps and integration hurdles. However, it can already be a useful assistant – “something imperfect may be much better than nothing” in cases where traditional tools fall short.

Accuracy and Reliability of Different Toolchains

It is instructive to compare the accuracy, coverage, and reliability of the various auditing approaches discussed. Below is a summary of findings from research and industry evaluations:

  • Static Analysis Tools: Static analyzers like Slither are valued for quick feedback and ease of use. They typically have high precision but moderate recall – meaning that most of the issues they report are true problems (few false positives by design), but they only detect certain classes of vulnerabilities. A 2024 benchmarking study (LLM-SmartAudit’s RQ1) found Slither caught about 46% of the known vulnerabilities in a test suite. Mythril (symbolic execution) did slightly better at 54% recall. Each tool had strengths in particular bug types (e.g. Slither is very good at spotting arithmetic issues or usage of tx.origin, while symbolic tools excel at finding simple reentrancy scenarios), but none had comprehensive coverage. False positives for mature tools like Slither are relatively low – its developers advertise “minimal false alarms and speedy execution (under 1s per contract)”, making it suitable for CI use. Nonetheless, static tools can misfire if code uses complex patterns; they might flag edge cases that are actually handled or miss deep logic bugs that don’t match any known anti-pattern.

  • Fuzzing and Property Testing: Fuzzers like Foundry’s fuzz/invariant tests or Trail of Bits’ Echidna have proven very effective at finding runtime errors and invariant violations. These tools tend to have near zero false positives in the sense that if a bug is reported (an assertion failed), it’s a real counterexample execution. The trade-off is in false negatives – if a bug doesn’t manifest within the tested input space or number of runs, it can slip by. Coverage depends on how well the fuzzer explores the state space. With enough time and good heuristics, fuzzers have found many high-severity bugs that static analysis missed. For example, Echidna was able to quickly reproduce the MakerDAO and Compound bugs that took formal verification efforts to find. However, fuzzing is not guaranteed to find every logic mistake. As contracts get more complex, fuzzers might require writing additional invariants or using smarter search strategies to hit deeper states. In terms of metrics, fuzzing doesn’t have a single “recall” number, but anecdotal evidence shows that important invariants can usually be broken by a fuzzer if they are breakable. The reliability is high for what it finds (no manual triage needed for false reports), but one must remember a passed fuzz test is not a proof of correctness – just an increase in confidence.

  • Formal Verification Tools: When applicable, formal verification provides the highest assurance – a successful proof means 100% of states satisfy the property. In terms of accuracy, it’s effectively perfect (sound and complete) for the properties it can prove. The biggest issue here is not the accuracy of results but the difficulty of use and narrow scope. Formal tools can also have false negatives in practice: they might simply be unable to prove a true property due to complexity (yielding no result or a timeout, which isn’t a false positive, but it means we fail to verify something that is actually safe). They can also have specification errors, where the tool “proves” something that wasn’t the property you intended (user error). In real audits, formal methods have caught some critical bugs (Certora’s successes include catching a subtle SushiSwap bug and a PRBMath library issue before deployment). But their track record is limited by how rarely they are comprehensively applied – as Trail of Bits noted, it was “difficult to find public issues discovered solely through formal verification, in contrast with the many bugs found by fuzzing”. So, while formal verification is extremely reliable when it succeeds, its impact on overall toolchain coverage is constrained by the effort and expertise required.

  • LLM-Based Analysis: The accuracy of AI auditors is currently a moving target, as new techniques (and newer models) are rapidly pushing the envelope. We can glean a few data points:

    • The AuditGPT system, focused on ERC rules, achieved very high precision (≈96%, i.e. 418 true findings against only 18 false positives) and found hundreds of issues that static tools or humans overlooked. But this was in a narrow domain with structured prompts. We should not generalize that ChatGPT will be 96% accurate on arbitrary vulnerability hunting – outside of a controlled setup, its performance is lower.
    • In broader vulnerability detection, LLM-SmartAudit (GPT-3.5) showed ~74% recall on a benchmark with moderate false positive rate, which is better than any single traditional tool. When upgraded to GPT-4 with specialized prompting (TA mode), it significantly improved – for example, on a set of 1,400 real-world vulnerabilities, the GPT-4 agent found ~48% of the specific issues and ~47% of the complex logic issues, whereas Slither/Mythril/Conkas each found ~0% (none) of those particular complex issues. This demonstrates that LLMs can dramatically expand coverage to types of bugs that static analysis completely misses. On the flip side, the LLM did not find over half of the issues (so it also has substantial false negatives), and it’s not clear how many false positives were among the ones it reported – the study focused on recall more than precision.
    • Trail of Bits’ Codex/GPT-4 “Toucan” experiment is illustrative of reliability problems. Initially, on small examples, Codex could identify known vulnerabilities (ownership issues, reentrancy) with careful prompting. But as soon as they tried scaling up, they encountered inconsistent and incorrect results. They reported “the number of failures was high even in average-sized code” and difficult to predict. Ultimately, they concluded that GPT-4 (as of early 2023) was only an incremental improvement and still “missing key features” like reasoning about cross-function flows. The outcome was that the AI did not materially reduce false positives from their static tools, nor did it reliably speed up their audit workflow. In other words, the current reliability of a general LLM as an auditor was deemed insufficient by professional auditors in that trial.

To sum up, each toolchain has different strengths:

  • Static tools: Reliable for quick detection of known issues; low noise, but limited bug types (medium recall ~40–50% on benchmarks).
  • Fuzz/invariant testing: Very high precision (almost no false alerts) and strong at finding functional and state-dependent bugs; coverage can be broad but not guaranteed (no simple metric, depends on time and invariant quality). Often finds the same bugs formal proofs would if given enough guidance.
  • Formal verification: Gold standard for absolute correctness on specific properties; essentially 100% recall/precision for those properties if successfully applied. But practically limited to narrow problems and requires significant effort (not a one-button solution for full audits yet).
  • AI (LLM) analysis: Potentially high coverage – can find bugs across categories including those missed by other tools – but variable precision. With specialized setups, it can be both precise and broad (as AuditGPT showed for ERC checks). Without careful constraints, it may cast a wide net and require human vetting of results. The “average” accuracy of ChatGPT on vulnerability detection is not spectacular (close to guessing, in one study), but the best-case engineered systems using LLMs are pushing performance beyond traditional tools. There’s active research to make AI outputs more trustworthy (e.g. by having multiple agents cross-verify, or combining LLM with static reasoning to double-check AI conclusions).

It’s worth noting that combining approaches yields the best results. For example, one might run Slither (to catch low-hanging fruit with no noise), then use Foundry/Echidna to fuzz deeper behaviors, and perhaps use an LLM-based tool to scan for any patterns or invariants not considered. Each will cover different blind spots of the others.

Real-World Adoption Challenges and Limitations

In practice, adopting formal verification or AI tools in a development workflow comes with pragmatic challenges. Some key issues include:

  • Developer Experience and Expertise: Traditional formal verification has a steep learning curve. Writing formal specs (in CVL, Coq, Move’s specification language, etc.) is more like writing mathematical proofs than writing code. Many developers lack this background, and formal methods experts are in short supply. By contrast, fuzzing with Foundry or writing invariants in Solidity is much more accessible – it feels like writing additional tests. This is a big reason fuzz testing has seen faster uptake than formal proofs in the Ethereum community. The Trail of Bits team explicitly noted that fuzzing “produces similar results but requires significantly less skill and time” than formal methods in many cases. Thus, even though formal verification can catch different bugs, many teams opt for the easier approach that gets good enough results with lower effort.

  • Integration into Development Workflow: For a tool to be widely adopted, it needs to fit into CI/CD pipelines and everyday coding. Tools like Slither shine here – it “easily integrates into CI/CD setups, streamlining automation and aiding developers.” A developer can add Slither or Mythril to a GitHub Action and have it fail the build if issues are found. Foundry’s fuzz tests can be run as part of forge test every time. Invariant tests can even be run continuously in the cloud with tools like CloudExec, and any failure can automatically be converted into a unit test using fuzz-utils. These integrations mean developers get quick feedback and can iterate. In contrast, something like the Certora Prover historically was run as a separate process (or by an external auditing team) and might take hours to produce a result – not something you’d run on every commit. AI-based tools face integration hurdles too: calling an API and handling its output deterministically in CI is non-trivial. There’s also the matter of security and privacy – many organizations are uneasy about sending proprietary contract code to a third-party AI service for analysis. Self-hosted LLM solutions are not yet as powerful as the big cloud APIs, so this is a sticking point for CI adoption of AI auditors.

  • False Positives and Noise: A tool that reports many false positives or low-severity findings can reduce developer trust. Static analyzers have struggled with this in the past – e.g., early versions of some tools would flood users with warnings, many of which were irrelevant. The balance between signal and noise is crucial. Slither is praised for minimal false alarms, whereas a tool like Securify (in its research form) often produced many warnings that required manual filtering. LLMs, as discussed, can generate noise if not properly targeted. This is why AI suggestions are, for now, usually taken as advisory, not absolute. Teams are more likely to adopt a noisy tool if it’s run by a separate security team or in an audit context, but for day-to-day use, developers prefer tools that give clear, actionable results with low noise. The ideal is to “fail the build” only on definite bugs, not on hypotheticals. Achieving that reliability is an ongoing challenge, especially for AI tools.

  • Scalability and Performance: Formal verification can be computationally intensive. As noted, solvers might time out on complex contracts. This makes it hard to scale to large systems. If verifying one property takes hours, you won’t be checking dozens of properties on each code change. Fuzz testing also faces scalability limits – exploring a huge state space or a contract with many methods combinatorially can be slow (though in practice fuzzers can run in the background or overnight to deepen their search). AI models have fixed context sizes and increasing a model’s capacity is expensive. While GPT-4’s 128k-token context is a boon, feeding it hundreds of kilobytes of contract code is costly and still not enough for very large projects (imagine a complex protocol with many contracts – you might exceed that). There’s also multi-chain scaling: if your project involves interactions between contracts on different chains (Ethereum ↔ another chain), verifying or analyzing that cross-chain logic is even more complex and likely beyond current tooling.

  • Human Oversight and Verification: At the end of the day, most teams still rely on expert human auditors for final sign-off. Automated tools are aids, not replacements. One limitation of all these tools is that they operate within the bounds of known properties or patterns. They might miss a totally novel type of bug or a subtle economic flaw in a DeFi protocol’s design. Human auditors use intuition and experience to consider how an attacker might approach the system, sometimes in ways no tool is explicitly programmed to do. There have been cases where contracts passed all automated checks but a human later found a vulnerability in the business logic or a creative attack vector. Thus, a challenge is avoiding a false sense of security – developers must interpret the output of formal tools and AI correctly and not assume “no issues found” means the code is 100% safe.

  • Maintaining Specifications and Tests: For formal verification in particular, one practical issue is spec drift. The spec (invariants, rules, etc.) might become outdated as the code evolves. Ensuring the code and its formal spec remain in sync is a non-trivial management task. If developers are not vigilant, the proofs might “pass” simply because they’re proving something no longer relevant to the code’s actual requirements. Similarly, invariant tests must be updated as the system’s expected behavior changes. This requires a culture of invariant-driven development that not all teams have (though there is a push to encourage it).

In summary, the main limitations are people and process, rather than the raw capability of the technology. Formal and AI-assisted methods can greatly improve security, but they must be deployed in a way that fits developers’ workflows and skill sets. Encouragingly, trends like invariant-driven development (writing down critical invariants from day one) are gaining traction, and tooling is slowly improving to make advanced analyses more plug-and-play. The open-sourcing of major tools (e.g. Certora Prover) and the integration of fuzzing into popular frameworks (Foundry) are lowering barriers. Over time, we expect the gap between what “an average developer” can do and what “a PhD researcher” can do will narrow, in terms of using these powerful verification techniques.

Open-Source vs Commercial Auditing Tools

The landscape of smart contract security tools includes both open-source projects and commercial services. Each has its role, and often they complement each other:

  • Open-Source Tools: The majority of widely-used auditing tools in the Ethereum ecosystem are open-source. This includes Slither (static analyzer), Mythril (symbolic execution), Echidna (fuzzer), Manticore (symbolic/concolic execution), Securify (analyzer from ETH Zurich), Solhint/Ethlint (linters), and of course Foundry itself. Open-source tools are favored for a few reasons: (1) Transparency – developers can see how the tool works and trust that nothing malicious or hidden is occurring (important in an open ecosystem). (2) Community Contribution – rules and features get added by a broad community (Slither, for example, has many community-contributed detectors). (3) Cost – they are free to use, which is important for the many small projects/startups in Web3. (4) Integration – open tools can usually be run locally or in CI without legal hurdles, and often they can be customized or scripted for project-specific needs.

    Open-source tools have rapidly evolved. For instance, Slither’s support for new Solidity versions and patterns is continuously updated by Trail of Bits. Mythril, developed by ConsenSys, can analyze not just Ethereum but any EVM-compatible chain by working on bytecode – showing how open tools can be repurposed across chains easily. The downside of open tools is that they often come with “use at your own risk” – no official support or guarantees. They might not scale or have the polish of a commercial product’s UI. But in blockchain, even many companies use open-source as their core tools internally (sometimes with slight custom modifications).

  • Commercial Services and Tools: A few companies have offered security analysis as a product. Examples include MythX (a cloud-based scanning API by ConsenSys Diligence), Certora (which offered its prover as a service before open-sourcing it), CertiK and SlowMist (firms that have internal scanners and AI that they use in audits or offer via dashboards), and some newer entrants like Securify from ChainSecurity (the company was acquired and its tool used internally) or SolidityScan (a cloud scanner that outputs an audit report). Commercial tools often aim to provide a more user-friendly experience or integrated service. For example, MythX integrated with IDE extensions and CI plugins so that developers could send their contracts to MythX and get back a detailed report, including a vulnerability score and details to fix issues. These services typically run a combination of analysis techniques under the hood (pattern matching, symbolic execution, etc.) tuned to minimize false positives.

    The value proposition of commercial tools is usually convenience and support. They may maintain a continuously updated knowledge base of vulnerabilities and provide customer support in interpreting results. They might also handle heavy computation in the cloud (so you don’t need to run a solver locally). However, cost is a factor – many projects opt not to pay for these services, given the availability of free alternatives. Additionally, in the spirit of decentralization, some developers are hesitant to rely on closed-source services for security (the “verify, don’t trust” ethos). The open-sourcing of the Certora Prover in 2025 is a notable event: it turned what was a high-end commercial tool into a community resource. This move is expected to accelerate adoption, as now anyone can attempt to formally verify their contracts without a paid license, and the code openness will allow researchers to improve the tool or adapt it to other chains.

  • Human Audit Services: It’s worth mentioning that beyond software tools, many audits are done by professional firms (Trail of Bits, OpenZeppelin, CertiK, PeckShield, etc.). These firms use a mix of the above tools (mostly open-source) and proprietary scripts. The outputs of these efforts are often kept private or only summarized in audit reports. There isn’t exactly an “open vs commercial” dichotomy here, since even commercial audit firms rely heavily on open-source tools. The differentiation is more in expertise and manual effort. That said, some firms are developing proprietary AI-assisted auditing platforms to give themselves an edge (for example, there were reports of “ChainGPT” being used for internal audits, or CertiK developing an AI called Skynet for on-chain monitoring). Those are not public tools per se, so their accuracy and methods are not widely documented.

In practice, a common pattern is open-source first, commercial optional. Teams will use open tools during development and testing (because they can integrate them easily and run as often as needed). Then, they might use a commercial service or two as an additional check before deployment – for instance, running a MythX scan to get a “second opinion” or hiring a firm that uses advanced tools like Certora to formally verify a critical component. With the lines blurring (Certora open source, MythX results often overlapping with open tools), the distinction is less about capability and more about support and convenience.

One notable cross-over is Certora’s multi-chain support – by supporting Solana and Stellar formally, they address platforms where fewer open alternatives exist (e.g. Ethereum has many open tools, Solana had almost none until recently). As security tools expand to new ecosystems, we may see more commercial offerings fill gaps, at least until open-source catches up.

Finally, it’s worth noting that open vs commercial is not adversarial here; they often learn from each other. For example, techniques pioneered in academic/commercial tools (like abstract interpretation used in Securify) eventually find their way into open tools, and conversely, data from open tool usage can guide commercial tool improvements. The end result sought by both sides is better security for the entire ecosystem.

Cross-Chain Applicability

While Ethereum has been the focal point for most of these tools (given its dominance in smart contract activity), the concepts of formal verification and AI-assisted auditing are applicable to other blockchain platforms as well. Here’s how they translate:

  • EVM-Compatible Chains: Blockchains like BSC, Polygon, Avalanche C-Chain, etc., which use the Ethereum Virtual Machine, can directly use all the same tools. A fuzz test or static analysis doesn’t care if your contract will deploy on Ethereum mainnet or on Polygon – the bytecode and source language (Solidity/Vyper) are the same. Indeed, Mythril can analyze contracts on any EVM chain by pulling bytecode from an on-chain address (it just needs an RPC endpoint), and Slither works on the same Solidity source regardless of where it will be deployed. Many DeFi projects on these chains run Slither and Echidna just as an Ethereum project would. The audits of protocols on BSC or Avalanche typically use the identical toolkit as Ethereum audits. So, cross-chain in the EVM context mostly means reusing Ethereum’s toolchain.

  • Solana: Solana’s smart contracts are written in Rust (usually via the Anchor framework) and run on the BPF virtual machine. This is a very different environment from Ethereum, so Ethereum-specific tools don’t work out of the box. However, the same principles apply. For instance, one can do fuzz testing on Solana programs using Rust’s native fuzzing libraries or tools like cargo-fuzz. Formal verification on Solana had been nearly non-existent until recently. The collaboration between Certora and Solana engineers yielded an in-house verification tool for Solana programs that can prove Rust/Anchor contracts against specifications. As noted, Certora extended their prover to Solana’s VM, meaning developers can write rules about Solana program behavior and check them just like they would for Solidity. This is significant because Solana’s move-fast approach meant many contracts were launched without the kind of rigorous testing seen in Ethereum; formal tools could improve that. AI auditing for Solana is also plausible – an LLM that understands Rust could be prompted to check a Solana program for vulnerabilities (like missing ownership checks or arithmetic errors). It might require fine-tuning on Solana-specific patterns, but given Rust’s popularity, GPT-4 is fairly proficient at reading Rust code. We may soon see “ChatGPT for Anchor” style tools emerging as well.

  • Polkadot and Substrate-based Chains: Polkadot’s smart contracts can be written in Rust (compiled to WebAssembly) using the ink! framework, or you can run an EVM pallet (like Moonbeam does) which again allows Solidity. In the ink!/Wasm case, the verification tools are still nascent. One could attempt to formally verify a Wasm contract’s properties using general Wasm verification frameworks (for example, Microsoft’s Project Verona or others have looked at Wasm safety). Certora’s open-source prover also mentions support for Stellar’s WASM smart contracts, which are similar in concept to any Wasm-based chain. So it’s likely applicable to Polkadot’s Wasm contracts too. Fuzz testing ink! contracts can be done by writing Rust tests (property tests in Rust can serve a similar role). AI auditing of ink! contracts would entail analyzing Rust code as well, which again an LLM can do with the right context (though it might not know about the specific ink! macros or traits without some hints).

    Additionally, Polkadot is exploring the Move language for parallel smart contract development (as hinted in some community forums). If Move becomes used on Polkadot parachains, the Move Prover comes with it, bringing formal verification capabilities by design. Move’s emphasis on safety (resource oriented programming) and built-in prover shows a cross-chain propagation of formal methods from the start.

  • Other Blockchains: Platforms like Tezos (Michelson smart contracts) and Algorand (TEAL programs) each have had formal verification efforts. Tezos, for example, has a tool called Mi-Cho-Coq that provides a formal semantics of Michelson and allows proving properties. These are more on the academic side but show that any blockchain with a well-defined contract semantics can be subjected to formal verification. AI auditing could, in principle, be applied to any programming language – it’s a matter of training or prompting the LLM appropriately. For less common languages, an LLM might need some fine-tuning to be effective, as it may not have been pretrained on enough examples.

  • Cross-Chain Interactions: A newer challenge is verifying interactions across chains (like bridges or inter-chain messaging). Formal verification here might involve modeling multiple chains’ state and the communication protocol. This is very complex and currently beyond most tools, though specific bridge protocols have been manually analyzed for security. AI might help in reviewing cross-chain code (for instance, reviewing a Solidity contract that interacts with an IBC protocol on Cosmos), but there’s no out-of-the-box solution yet.

In essence, Ethereum’s tooling has paved the way for other chains. Many of the open-source tools are being adapted: e.g., there are efforts to create a Slither-like static analyzer for Solana (Rust), and the concepts of invariant testing are language-agnostic (property-based testing exists in many languages). The open-sourcing of powerful engines (like Certora’s for multiple VMs) is particularly promising for cross-chain security – the same platform could be used to verify a Solidity contract, a Solana program, and a Move contract, provided each has a proper specification written. This encourages a more uniform security posture across the industry.

It’s also worth noting that AI-assisted auditing will benefit all chains, since the model can be taught about vulnerabilities in any context. As long as the AI is provided with the right information (language specs, known bug patterns in that ecosystem), it can apply similar reasoning. For example, ChatGPT could be asked to audit a Solidity contract or a Move module with the appropriate prompt, and it will do both – it might even catch something like “this Move module might violate resource conservation” if it has that concept. The limitation is whether the AI was exposed to enough training data for that chain. Ethereum, being most popular, has likely biased the models to be best at Solidity. As other chains grow, future LLMs or fine-tuned derivatives could cover those as well.

Conclusion

Formal verification of smart contracts and AI-assisted auditing is a rapidly evolving field. We now have a rich toolkit: from deterministic static analyzers and fuzzing frameworks that improve code reliability, to cutting-edge AI that can reason about code in human-like ways. Formal verification, once a niche academic pursuit, is becoming more practical through better tools and integration. AI, despite its current limitations, has shown glimpses of game-changing capabilities in automating security analysis.

There is no one-size-fits-all solution yet – real-world auditing combines multiple techniques to achieve defense in depth. Foundry’s fuzz and invariant testing are already raising the bar for what is expected before deployment (catching many errors that would slip past basic tests). AI-assisted auditing, when used carefully, can act as a force multiplier for auditors, highlighting issues and verifying compliance at a scale and speed that manual review alone cannot match. However, human expertise remains crucial to drive these tools, interpret their findings, and define the right properties to check.

Moving forward, we can expect greater convergence of these approaches. For example, AI might help write formal specifications or suggest invariants (“AI, what security properties should hold for this DeFi contract?”). Fuzzing tools might incorporate AI to guide input generation more intelligently (as PromFuzz does). Formal verification engines might use machine learning to decide which lemmas or heuristics to apply. All of this will contribute to more secure smart contracts across not just Ethereum, but all blockchain platforms. The ultimate vision is a future where critical smart contracts can be deployed with high confidence in their correctness – a goal that will likely be achieved by the synergistic use of formal methods and AI assistance, rather than either alone.

Sources:

  • Foundry fuzzing and invariant testing overview
  • Trail of Bits on fuzzing vs formal verification
  • Trail of Bits on formal verification limitations
  • Patrick Collins on fuzz/invariant vs formal methods
  • Trail of Bits on invariants in audits
  • Medium (BuildBear) on Slither and Mythril usage
  • AuditGPT results (ERC compliance)
  • Trail of Bits on LLM (Codex/GPT-4) limitations
  • LLM-SmartAudit performance vs traditional tools
  • “Detection Made Easy” study on ChatGPT accuracy
  • PromFuzz (LLM+fuzz) performance
  • Certora open-source announcement (multi-chain)
  • Move Prover description (Aptos)
  • Static analysis background (Smart contract security literature)