Skip to main content

3 posts tagged with "cryptographic proofs"

Cryptographic proof systems

View all tags

Gensyn's Judge: How Bitwise-Exact Reproducibility Is Ending the Era of Opaque AI APIs

· 18 min read
Dora Noda
Software Engineer

Every time you query ChatGPT, Claude, or Gemini, you're trusting an invisible black box. The model version? Unknown. The exact weights? Proprietary. Whether the output was generated by the model you think you're using, or a silently updated variant? Impossible to verify. For casual users asking about recipes or trivia, this opacity is merely annoying. For high-stakes AI decision-making—financial trading algorithms, medical diagnoses, legal contract analysis—it's a fundamental crisis of trust.

Gensyn's Judge, launched in late 2025 and entering production in 2026, offers a radical alternative: cryptographically verifiable AI evaluation where every inference is reproducible down to the bit. Instead of trusting OpenAI or Anthropic to serve the correct model, Judge enables anyone to verify that a specific, pre-agreed AI model executed deterministically against real-world inputs—with cryptographic proofs ensuring the results can't be faked.

The technical breakthrough is Verde, Gensyn's verification system that eliminates floating-point nondeterminism—the bane of AI reproducibility. By enforcing bitwise-exact computation across devices, Verde ensures that running the same model on an NVIDIA A100 in London and an AMD MI250 in Tokyo yields identical results, provable on-chain. This unlocks verifiable AI for decentralized finance, autonomous agents, and any application where transparency isn't optional—it's existential.

The Opaque API Problem: Trust Without Verification

The AI industry runs on APIs. Developers integrate OpenAI's GPT-4, Anthropic's Claude, or Google's Gemini via REST endpoints, sending prompts and receiving responses. But these APIs are fundamentally opaque:

Version uncertainty: When you call gpt-4, which exact version am I getting? GPT-4-0314? GPT-4-0613? A silently updated variant? Providers frequently deploy patches without public announcements, changing model behavior overnight.

No audit trail: API responses include no cryptographic proof of which model generated them. If OpenAI serves a censored or biased variant for specific geographies or customers, users have no way to detect it.

Silent degradation: Providers can "lobotomize" models to reduce costs—downgrading inference quality while maintaining the same API contract. Users report GPT-4 becoming "dumber" over time, but without transparent versioning, such claims remain anecdotal.

Nondeterministic outputs: Even querying the same model twice with identical inputs can yield different results due to temperature settings, batching, or hardware-level floating-point rounding errors. This makes auditing impossible—how do you verify correctness when outputs aren't reproducible?

For casual applications, these issues are inconveniences. For high-stakes decision-making, they're blockers. Consider:

Algorithmic trading: A hedge fund deploys an AI agent managing $50 million in DeFi positions. The agent relies on GPT-4 to analyze market sentiment from X posts. If the model silently updates mid-trading session, sentiment scores shift unpredictably—triggering unintended liquidations. The fund has no proof the model misbehaved; OpenAI's logs aren't publicly auditable.

Medical diagnostics: A hospital uses an AI model to recommend cancer treatments. Regulations require doctors to document decision-making processes. But if the AI model version can't be verified, the audit trail is incomplete. A malpractice lawsuit could hinge on proving which model generated the recommendation—impossible with opaque APIs.

DAO governance: A decentralized organization uses an AI agent to vote on treasury proposals. Community members demand proof the agent used the approved model—not a tampered variant that favors specific outcomes. Without cryptographic verification, the vote lacks legitimacy.

This is the trust gap Gensyn targets: as AI becomes embedded in critical decision-making, the inability to verify model authenticity and behavior becomes a "fundamental blocker to deploying agentic AI in high-stakes environments."

Judge: The Verifiable AI Evaluation Protocol

Judge solves the opacity problem by executing pre-agreed, deterministic AI models against real-world inputs and committing results to a blockchain where anyone can challenge them. Here's how the protocol works:

1. Model commitment: Participants agree on an AI model—its architecture, weights, and inference configuration. This model is hashed and committed on-chain. The hash serves as a cryptographic fingerprint: any deviation from the agreed model produces a different hash.

2. Deterministic execution: Judge runs the model using Gensyn's Reproducible Runtime, which guarantees bitwise-exact reproducibility across devices. This eliminates floating-point nondeterminism—a critical innovation we'll explore shortly.

3. Public commitment: After inference, Judge posts the output (or a hash of it) on-chain. This creates a permanent, auditable record of what the model produced for a given input.

4. Challenge period: Anyone can challenge the result by re-executing the model independently. If their output differs, they submit a fraud proof. Verde's refereed delegation mechanism pinpoints the exact operator in the computational graph where results diverge.

5. Slashing for fraud: If a challenger proves Judge produced incorrect results, the original executor is penalized (slashing staked tokens). This aligns economic incentives: executors maximize profit by running models correctly.

Judge transforms AI evaluation from "trust the API provider" to "verify the cryptographic proof." The model's behavior is public, auditable, and enforceable—no longer hidden behind proprietary endpoints.

Verde: Eliminating Floating-Point Nondeterminism

The core technical challenge in verifiable AI is determinism. Neural networks perform billions of floating-point operations during inference. On modern GPUs, these operations aren't perfectly reproducible:

Non-associativity: Floating-point addition isn't associative. (a + b) + c might yield a different result than a + (b + c) due to rounding errors. GPUs parallelize sums across thousands of cores, and the order in which partial sums accumulate varies by hardware and driver version.

Kernel scheduling variability: GPU kernels (like matrix multiplication or attention) can execute in different orders depending on workload, driver optimizations, or hardware architecture. Even running the same model on the same GPU twice can yield different results if kernel scheduling differs.

Batch-size dependency: Research has found that LLM inference is system-level nondeterministic because output depends on batch size. Many kernels (matmul, RMSNorm, attention) change numerical output based on how many samples are processed together—an inference with batch size 1 produces different values than the same input in a batch of 8.

These issues make standard AI models unsuitable for blockchain verification. If two validators re-run the same inference and get slightly different outputs, who's correct? Without determinism, consensus is impossible.

Verde solves this with RepOps (Reproducible Operators)—a library that eliminates hardware nondeterminism by controlling the order of floating-point operations on all devices. Here's how it works:

Canonical reduction orders: RepOps enforces a deterministic order for summing partial results in operations like matrix multiplication. Instead of letting the GPU scheduler decide, RepOps explicitly specifies: "sum column 0, then column 1, then column 2..." across all hardware. This ensures (a + b) + c is always computed in the same sequence.

Custom CUDA kernels: Gensyn developed optimized kernels that prioritize reproducibility over raw speed. RepOps matrix multiplications incur less than 30% overhead compared to standard cuBLAS—a reasonable trade-off for determinism.

Driver and version pinning: Verde uses version-pinned GPU drivers and canonical configurations, ensuring that the same model executing on different hardware produces identical bitwise outputs. A model running on an NVIDIA A100 in one datacenter matches the output from an AMD MI250 in another, bit for bit.

This is the breakthrough enabling Judge's verification: bitwise-exact reproducibility means validators can independently confirm results without trusting executors. If the hash matches, the inference is correct—mathematically provable.

Refereed Delegation: Efficient Verification Without Full Recomputation

Even with deterministic execution, verifying AI inference naively is expensive. A 70-billion-parameter model generating 1,000 tokens might require 10 GPU-hours. If validators must re-run every inference to verify correctness, verification cost equals execution cost—defeating the purpose of decentralization.

Verde's refereed delegation mechanism makes verification exponentially cheaper:

Multiple untrusted executors: Instead of one executor, Judge assigns tasks to multiple independent providers. Each runs the same inference and submits results.

Disagreement triggers investigation: If all executors agree, the result is accepted—no further verification needed. If outputs differ, Verde initiates a challenge game.

Binary search over computation graph: Verde doesn't re-run the entire inference. Instead, it performs binary search over the model's computational graph to find the first operator where results diverge. This pinpoints the exact layer (e.g., "attention layer 47, head 8") causing the discrepancy.

Minimal referee computation: A referee (which can be a smart contract or validator with limited compute) checks only the disputed operator—not the entire forward pass. For a 70B-parameter model with 80 layers, this reduces verification to checking ~7 layers (log₂ 80) in the worst case.

This approach is over 1,350% more efficient than naive replication (where every validator re-runs everything). Gensyn combines cryptographic proofs, game theory, and optimized processes to guarantee correct execution without redundant computation.

The result: Judge can verify AI workloads at scale, enabling decentralized inference networks where thousands of untrusted nodes contribute compute—and dishonest executors are caught and penalized.

High-Stakes AI Decision-Making: Why Transparency Matters

Judge's target market isn't casual chatbots—it's applications where verifiability isn't a nice-to-have, but a regulatory or economic requirement. Here are scenarios where opaque APIs fail catastrophically:

Decentralized finance (DeFi): Autonomous trading agents manage billions in assets. If an agent uses an AI model to decide when to rebalance portfolios, users need proof the model wasn't tampered with. Judge enables on-chain verification: the agent commits to a specific model hash, executes trades based on its outputs, and anyone can challenge the decision logic. This transparency prevents rug pulls where malicious agents claim "the AI told me to liquidate" without evidence.

Regulatory compliance: Financial institutions deploying AI for credit scoring, fraud detection, or anti-money laundering (AML) face audits. Regulators demand explanations: "Why did the model flag this transaction?" Opaque APIs provide no audit trail. Judge creates an immutable record of model version, inputs, and outputs—satisfying compliance requirements.

Algorithmic governance: Decentralized autonomous organizations (DAOs) use AI agents to propose or vote on governance decisions. Community members must verify the agent used the approved model—not a hacked variant. With Judge, the DAO encodes the model hash in its smart contract, and every decision includes a cryptographic proof of correctness.

Medical and legal AI: Healthcare and legal systems require accountability. A doctor diagnosing cancer with AI assistance needs to document the exact model version used. A lawyer drafting contracts with AI must prove the output came from a vetted, unbiased model. Judge's on-chain audit trail provides this evidence.

Prediction markets and oracles: Projects like Polymarket use AI to resolve bet outcomes (e.g., "Will this event happen?"). If resolution depends on an AI model analyzing news articles, participants need proof the model wasn't manipulated. Judge verifies the oracle's AI inference, preventing disputes.

In each case, the common thread is trust without transparency is insufficient. As VeritasChain notes, AI systems need "cryptographic flight recorders"—immutable logs proving what happened when disputes arise.

The Zero-Knowledge Proof Alternative: Comparing Verde and ZKML

Judge isn't the only approach to verifiable AI. Zero-Knowledge Machine Learning (ZKML) achieves similar goals using zk-SNARKs: cryptographic proofs that a computation was performed correctly without revealing inputs or weights.

How does Verde compare to ZKML?

Verification cost: ZKML requires ~1,000× more computation than the original inference to generate proofs (research estimates). A 70B-parameter model needing 10 GPU-hours for inference might require 10,000 GPU-hours to prove. Verde's refereed delegation is logarithmic: checking ~7 layers instead of 80 is a 10× reduction, not 1,000×.

Prover complexity: ZKML demands specialized hardware (like custom ASICs for zk-SNARK circuits) to generate proofs efficiently. Verde works on commodity GPUs—any miner with a gaming PC can participate.

Privacy trade-offs: ZKML's strength is privacy—proofs reveal nothing about inputs or model weights. Verde's deterministic execution is transparent: inputs and outputs are public (though weights can be encrypted). For high-stakes decision-making, transparency is often desirable. A DAO voting on treasury allocation wants public audit trails, not hidden proofs.

Proving scope: ZKML is practically limited to inference—proving training is infeasible at current computational costs. Verde supports both inference and training verification (Gensyn's broader protocol verifies distributed training).

Real-world adoption: ZKML projects like Modulus Labs have achieved breakthroughs (verifying 18M-parameter models on-chain), but remain limited to smaller models. Verde's deterministic runtime handles 70B+ parameter models in production.

ZKML excels where privacy is paramount—like verifying biometric authentication (Worldcoin) without exposing iris scans. Verde excels where transparency is the goal—proving a specific public model executed correctly. Both approaches are complementary, not competing.

The Gensyn Ecosystem: From Judge to Decentralized Training

Judge is one component of Gensyn's broader vision: a decentralized network for machine learning compute. The protocol includes:

Execution layer: Consistent ML execution across heterogeneous hardware (consumer GPUs, enterprise clusters, edge devices). Gensyn standardizes inference and training workloads, ensuring compatibility.

Verification layer (Verde): Trustless verification using refereed delegation. Dishonest executors are detected and penalized.

Peer-to-peer communication: Workload distribution across devices without centralized coordination. Miners receive tasks, execute them, and submit proofs directly to the blockchain.

Decentralized coordination: Smart contracts on an Ethereum rollup identify participants, allocate tasks, and process payments permissionlessly.

Gensyn's Public Testnet launched in March 2025, with mainnet planned for 2026. The $AI token public sale occurred in December 2025, establishing economic incentives for miners and validators.

Judge fits into this ecosystem as the evaluation layer: while Gensyn's core protocol handles training and inference, Judge ensures those outputs are verifiable. This creates a flywheel:

Developers train models on Gensyn's decentralized network (cheaper than AWS due to underutilized consumer GPUs contributing compute).

Models are deployed with Judge guaranteeing evaluation integrity. Applications consume inference via Gensyn's APIs, but unlike OpenAI, every output includes a cryptographic proof.

Validators earn fees by checking proofs and catching fraud, aligning economic incentives with network security.

Trust scales as more applications adopt verifiable AI, reducing reliance on centralized providers.

The endgame: AI training and inference that's provably correct, decentralized, and accessible to anyone—not just Big Tech.

Challenges and Open Questions

Judge's approach is groundbreaking, but several challenges remain:

Performance overhead: RepOps' 30% slowdown is acceptable for verification, but if every inference must run deterministically, latency-sensitive applications (real-time trading, autonomous vehicles) might prefer faster, non-verifiable alternatives. Gensyn's roadmap likely includes optimizing RepOps further—but there's a fundamental trade-off between speed and determinism.

Driver version fragmentation: Verde assumes version-pinned drivers, but GPU manufacturers release updates constantly. If some miners use CUDA 12.4 and others use 12.5, bitwise reproducibility breaks. Gensyn must enforce strict version management—complicating miner onboarding.

Model weight secrecy: Judge's transparency is a feature for public models but a bug for proprietary ones. If a hedge fund trains a valuable trading model, deploying it on Judge exposes weights to competitors (via the on-chain commitment). ZKML-based alternatives might be preferred for secret models—suggesting Judge targets open or semi-open AI applications.

Dispute resolution latency: If a challenger claims fraud, resolving the dispute via binary search requires multiple on-chain transactions (each round narrows the search space). High-frequency applications can't wait hours for finality. Gensyn might introduce optimistic verification (assume correctness unless challenged within a window) to reduce latency.

Sybil resistance in refereed delegation: If multiple executors must agree, what prevents a single entity from controlling all executors via Sybil identities? Gensyn likely uses stake-weighted selection (high-reputation validators are chosen preferentially) plus slashing to deter collusion—but the economic thresholds must be carefully calibrated.

These aren't showstoppers—they're engineering challenges. The core innovation (deterministic AI + cryptographic verification) is sound. Execution details will mature as the testnet transitions to mainnet.

The Road to Verifiable AI: Adoption Pathways and Market Fit

Judge's success depends on adoption. Which applications will deploy verifiable AI first?

DeFi protocols with autonomous agents: Aave, Compound, or Uniswap DAOs could integrate Judge-verified agents for treasury management. The community votes to approve a model hash, and all agent decisions include proofs. This transparency builds trust—critical for DeFi's legitimacy.

Prediction markets and oracles: Platforms like Polymarket or Chainlink could use Judge to resolve bets or deliver price feeds. AI models analyzing sentiment, news, or on-chain activity would produce verifiable outputs—eliminating disputes over oracle manipulation.

Decentralized identity and KYC: Projects requiring AI-based identity verification (age estimation from selfies, document authenticity checks) benefit from Judge's audit trail. Regulators accept cryptographic proofs of compliance without trusting centralized identity providers.

Content moderation for social media: Decentralized social networks (Farcaster, Lens Protocol) could deploy Judge-verified AI moderators. Community members verify the moderation model isn't biased or censored—ensuring platform neutrality.

AI-as-a-Service platforms: Developers building AI applications can offer "verifiable inference" as a premium feature. Users pay extra for proofs, differentiating services from opaque alternatives.

The commonality: applications where trust is expensive (due to regulation, decentralization, or high stakes) and verification cost is acceptable (compared to the value of certainty).

Judge won't replace OpenAI for consumer chatbots—users don't care if GPT-4 is verifiable when asking for recipe ideas. But for financial algorithms, medical tools, and governance systems, verifiable AI is the future.

Verifiability as the New Standard

Gensyn's Judge represents a paradigm shift: AI evaluation is moving from "trust the provider" to "verify the proof." The technical foundation—bitwise-exact reproducibility via Verde, efficient verification through refereed delegation, and on-chain audit trails—makes this transition practical, not just aspirational.

The implications ripple far beyond Gensyn. If verifiable AI becomes standard, centralized providers lose their moats. OpenAI's value proposition isn't just GPT-4's capabilities—it's the convenience of not managing infrastructure. But if Gensyn proves decentralized AI can match centralized performance with added verifiability, developers have no reason to lock into proprietary APIs.

The race is on. ZKML projects (Modulus Labs, Worldcoin's biometric system) are betting on zero-knowledge proofs. Deterministic runtimes (Gensyn's Verde, EigenAI) are betting on reproducibility. Optimistic approaches (blockchain AI oracles) are betting on fraud proofs. Each path has trade-offs—but the destination is the same: AI systems where outputs are provable, not just plausible.

For high-stakes decision-making, this isn't optional. Regulators won't accept "trust us" from AI providers in finance, healthcare, or legal applications. DAOs won't delegate treasury management to black-box agents. And as autonomous AI systems grow more powerful, the public will demand transparency.

Judge is the first production-ready system delivering on this promise. The testnet is live. The cryptographic foundations are solid. The market—$27 billion in AI agent crypto, billions in DeFi assets managed by algorithms, and regulatory pressure mounting—is ready.

The era of opaque AI APIs is ending. The age of verifiable intelligence is beginning. And Gensyn's Judge is lighting the way.


Sources:

Nillion's Blacklight Goes Live: How ERC-8004 is Building the Trust Layer for Autonomous AI Agents

· 12 min read
Dora Noda
Software Engineer

On February 2, 2026, the AI agent economy took a critical step forward. Nillion launched Blacklight, a verification layer implementing the ERC-8004 standard to solve one of blockchain's most pressing questions: how do you trust an AI agent you've never met?

The answer isn't a simple reputation score or a centralized registry. It's a five-step verification process backed by cryptographic proofs, programmable audits, and a network of community-operated nodes. As autonomous agents increasingly execute trades, manage treasuries, and coordinate cross-chain activities, Blacklight represents the infrastructure enabling trustless AI coordination at scale.

The Trust Problem AI Agents Can't Solve Alone

The numbers tell the story. AI agents now contribute 30% of Polymarket's trading volume, handle DeFi yield strategies across multiple protocols, and autonomously execute complex workflows. But there's a fundamental bottleneck: how do agents verify each other's trustworthiness without pre-existing relationships?

Traditional systems rely on centralized authorities issuing credentials. Web3's promise is different—trustless verification through cryptography and consensus. Yet until ERC-8004, there was no standardized way for agents to prove their authenticity, track their behavior, or validate their decision-making logic on-chain.

This isn't just a theoretical problem. As Davide Crapis explains, "ERC-8004 enables decentralized AI agent interactions, establishes trustless commerce, and enhances reputation systems on Ethereum." Without it, agent-to-agent commerce remains confined to walled gardens or requires manual oversight—defeating the purpose of autonomy.

ERC-8004: The Three-Registry Trust Infrastructure

The ERC-8004 standard, which went live on Ethereum mainnet on January 29, 2026, establishes a modular trust layer through three on-chain registries:

Identity Registry: Uses ERC-721 to provide portable agent identifiers. Each agent receives a non-fungible token representing its unique on-chain identity, enabling cross-platform recognition and preventing identity spoofing.

Reputation Registry: Collects standardized feedback and ratings. Unlike centralized review systems, feedback is recorded on-chain with cryptographic signatures, creating an immutable audit trail. Anyone can crawl this history and build custom reputation algorithms.

Validation Registry: Supports cryptographic and economic verification of agent work. This is where programmable audits happen—validators can re-execute computations, verify zero-knowledge proofs, or leverage Trusted Execution Environments (TEEs) to confirm an agent acted correctly.

The brilliance of ERC-8004 is its unopinionated design. As the technical specification notes, the standard supports various validation techniques: "stake-secured re-execution of tasks (inspired by systems like EigenLayer), verification of zero-knowledge machine learning (zkML) proofs, and attestations from Trusted Execution Environments."

This flexibility matters. A DeFi arbitrage agent might use zkML proofs to verify its trading logic without revealing alpha. A supply chain agent might use TEE attestations to prove it accessed real-world data correctly. A cross-chain bridge agent might rely on crypto-economic validation with slashing to ensure honest execution.

Blacklight's Five-Step Verification Process

Nillion's implementation of ERC-8004 on Blacklight adds a crucial layer: community-operated verification nodes. Here's how the process works:

1. Agent Registration: An agent registers its identity in the Identity Registry, receiving an ERC-721 NFT. This creates a unique on-chain identifier tied to the agent's public key.

2. Verification Request Initiation: When an agent performs an action requiring validation (e.g., executing a trade, transferring funds, or updating state), it submits a verification request to Blacklight.

3. Committee Assignment: Blacklight's protocol randomly assigns a committee of verification nodes to audit the request. These nodes are operated by community members who stake 70,000 NIL tokens, aligning incentives for network integrity.

4. Node Checks: Committee members re-execute the computation or validate cryptographic proofs. If validators detect incorrect behavior, they can slash the agent's stake (in systems using crypto-economic validation) or flag the identity in the Reputation Registry.

5. On-Chain Reporting: Results are posted on-chain. The Validation Registry records whether the agent's work was verified, creating permanent proof of execution. The Reputation Registry updates accordingly.

This process happens asynchronously and non-blocking, meaning agents don't wait for verification to complete routine tasks—but high-stakes actions (large transfers, cross-chain operations) can require upfront validation.

Programmable Audits: Beyond Binary Trust

Blacklight's most ambitious feature is "programmable verification"—the ability to audit how an agent makes decisions, not just what it does.

Consider a DeFi agent managing a treasury. Traditional audits verify that funds moved correctly. Programmable audits verify:

  • Decision-making logic consistency: Did the agent follow its stated investment strategy, or did it deviate?
  • Multi-step workflow execution: If the agent was supposed to rebalance portfolios across three chains, did it complete all steps?
  • Security constraints: Did the agent respect gas limits, slippage tolerances, and exposure caps?

This is possible because ERC-8004's Validation Registry supports arbitrary proof systems. An agent can commit to a decision-making algorithm on-chain (e.g., a hash of its neural network weights or a zk-SNARK circuit representing its logic), then prove each action conforms to that algorithm without revealing proprietary details.

Nillion's roadmap explicitly targets these use cases: "Nillion plans to expand Blacklight's capabilities to 'programmable verification,' enabling decentralized audits of complex behaviors such as agent decision-making logic consistency, multi-step workflow execution, and security constraints."

This shifts verification from reactive (catching errors after the fact) to proactive (enforcing correct behavior by design).

Blind Computation: Privacy Meets Verification

Nillion's underlying technology—Nil Message Compute (NMC)—adds a privacy dimension to agent verification. Unlike traditional blockchains where all data is public, Nillion's "blind computation" enables operations on encrypted data without decryption.

Here's why this matters for agents: an AI agent might need to verify its trading strategy without revealing alpha to competitors. Or prove it accessed confidential medical records correctly without exposing patient data. Or demonstrate compliance with regulatory constraints without disclosing proprietary business logic.

Nillion's NMC achieves this through multi-party computation (MPC), where nodes collaboratively generate "blinding factors"—correlated randomness used to encrypt data. As DAIC Capital explains, "Nodes generate the key network resource needed to process data—a type of correlated randomness referred to as a blinding factor—with each node storing its share of the blinding factor securely, distributing trust across the network in a quantum-safe way."

This architecture is quantum-resistant by design. Even if a quantum computer breaks today's elliptic curve cryptography, distributed blinding factors remain secure because no single node possesses enough information to decrypt data.

For AI agents, this means verification doesn't require sacrificing confidentiality. An agent can prove it executed a task correctly while keeping its methods, data sources, and decision-making logic private.

The $4.3 Billion Agent Economy Infrastructure Play

Blacklight's launch comes as the blockchain-AI sector enters hypergrowth. The market is projected to grow from $680 million (2025) to $4.3 billion (2034) at a 22.9% CAGR, while the broader confidential computing market reaches $350 billion by 2032.

But Nillion isn't just betting on market expansion—it's positioning itself as critical infrastructure. The agent economy's bottleneck isn't compute or storage; it's trust at scale. As KuCoin's 2026 outlook notes, three key trends are reshaping AI identity and value flow:

Agent-Wrapping-Agent systems: Agents coordinating with other agents to execute complex multi-step tasks. This requires standardized identity and verification—exactly what ERC-8004 provides.

KYA (Know Your Agent): Financial infrastructure demanding agent credentials. Regulators won't approve autonomous agents managing funds without proof of correct behavior. Blacklight's programmable audits directly address this.

Nano-payments: Agents need to settle micropayments efficiently. The x402 payment protocol, which processed over 20 million transactions in January 2026, complements ERC-8004 by handling settlement while Blacklight handles trust.

Together, these standards reached production readiness within weeks of each other—a coordination breakthrough signaling infrastructure maturation.

Ethereum's Agent-First Future

ERC-8004's adoption extends far beyond Nillion. As of early 2026, multiple projects have integrated the standard:

  • Oasis Network: Implementing ERC-8004 for confidential computing with TEE-based validation
  • The Graph: Supporting ERC-8004 and x402 to enable verifiable agent interactions in decentralized indexing
  • MetaMask: Exploring agent wallets with built-in ERC-8004 identity
  • Coinbase: Integrating ERC-8004 for institutional agent custody solutions

This rapid adoption reflects a broader shift in Ethereum's roadmap. Vitalik Buterin has repeatedly emphasized that blockchain's role is becoming "just the plumbing" for AI agents—not the consumer-facing layer, but the trust infrastructure enabling autonomous coordination.

Nillion's Blacklight accelerates this vision by making verification programmable, privacy-preserving, and decentralized. Instead of relying on centralized oracles or human reviewers, agents can prove their correctness cryptographically.

What Comes Next: Mainnet Integration and Ecosystem Expansion

Nillion's 2026 roadmap prioritizes Ethereum compatibility and sustainable decentralization. The Ethereum bridge went live in February 2026, followed by native smart contracts for staking and private computation.

Community members staking 70,000 NIL tokens can operate Blacklight verification nodes, earning rewards while maintaining network integrity. This design mirrors Ethereum's validator economics but adds a verification-specific role.

The next milestones include:

  • Expanded zkML support: Integrating with projects like Modulus Labs to verify AI inference on-chain
  • Cross-chain verification: Enabling Blacklight to verify agents operating across Ethereum, Cosmos, and Solana
  • Institutional partnerships: Collaborations with Coinbase and Alibaba Cloud for enterprise agent deployment
  • Regulatory compliance tools: Building KYA frameworks for financial services adoption

Perhaps most importantly, Nillion is developing nilGPT—a fully private AI chatbot demonstrating how blind computation enables confidential agent interactions. This isn't just a demo; it's a blueprint for agents handling sensitive data in healthcare, finance, and government.

The Trustless Coordination Endgame

Blacklight's launch marks a pivot point for the agent economy. Before ERC-8004, agents operated in silos—trusted within their own ecosystems but unable to coordinate across platforms without human intermediaries. After ERC-8004, agents can verify each other's identity, audit each other's behavior, and settle payments autonomously.

This unlocks entirely new categories of applications:

  • Decentralized hedge funds: Agents managing portfolios across chains, with verifiable investment strategies and transparent performance audits
  • Autonomous supply chains: Agents coordinating logistics, payments, and compliance without centralized oversight
  • AI-powered DAOs: Organizations governed by agents that vote, propose, and execute based on cryptographically verified decision-making logic
  • Cross-protocol liquidity management: Agents rebalancing assets across DeFi protocols with programmable risk constraints

The common thread? All require trustless coordination—the ability for agents to work together without pre-existing relationships or centralized trust anchors.

Nillion's Blacklight provides exactly that. By combining ERC-8004's identity and reputation infrastructure with programmable verification and blind computation, it creates a trust layer scalable enough for the trillion-agent economy on the horizon.

As blockchain becomes the plumbing for AI agents and global finance, the question isn't whether we need verification infrastructure—it's who builds it, and whether it's decentralized or controlled by a few gatekeepers. Blacklight's community-operated nodes and open standard make the case for the former.

The age of autonomous on-chain actors is here. The infrastructure is live. The only question left is what gets built on top.


Sources:

Verifiable On-Chain AI with zkML and Cryptographic Proofs

· 36 min read
Dora Noda
Software Engineer

Introduction: The Need for Verifiable AI on Blockchain

As AI systems grow in influence, ensuring their outputs are trustworthy becomes critical. Traditional methods rely on institutional assurances (essentially “just trust us”), which offer no cryptographic guarantees. This is especially problematic in decentralized contexts like blockchains, where a smart contract or user must trust an AI-derived result without being able to re-run a heavy model on-chain. Zero-knowledge Machine Learning (zkML) addresses this by allowing cryptographic verification of ML computations. In essence, zkML enables a prover to generate a succinct proof that “the output $Y$ came from running model $M$ on input $X$”without revealing $X$ or the internal details of $M$. These zero-knowledge proofs (ZKPs) can be verified by anyone (or any contract) efficiently, shifting AI trust from “policy to proof”.

On-chain verifiability of AI means a blockchain can incorporate advanced computations (like neural network inferences) by verifying a proof of correct execution instead of performing the compute itself. This has broad implications: smart contracts can make decisions based on AI predictions, decentralized autonomous agents can prove they followed their algorithms, and cross-chain or off-chain compute services can provide verifiable outputs rather than unverifiable oracles. Ultimately, zkML offers a path to trustless and privacy-preserving AI – for example, proving an AI model’s decisions are correct and authorized without exposing private data or proprietary model weights. This is key for applications ranging from secure healthcare analytics to blockchain gaming and DeFi oracles.

How zkML Works: Compressing ML Inference into Succinct Proofs

At a high level, zkML combines cryptographic proof systems with ML inference so that a complex model evaluation can be “compressed” into a small proof. Internally, the ML model (e.g. a neural network) is represented as a circuit or program consisting of many arithmetic operations (matrix multiplications, activation functions, etc.). Rather than revealing all intermediate values, a prover performs the full computation off-chain and then uses a zero-knowledge proof protocol to attest that every step was done correctly. The verifier, given only the proof and some public data (like the final output and an identifier for the model), can be cryptographically convinced of the correctness without re-executing the model.

To achieve this, zkML frameworks typically transform the model computation into a format amenable to ZKPs:

  • Circuit Compilation: In SNARK-based approaches, the computation graph of the model is compiled into an arithmetic circuit or set of polynomial constraints. Each layer of the neural network (convolutions, matrix multiplies, nonlinear activations) becomes a sub-circuit with constraints ensuring the outputs are correct given the inputs. Because neural nets involve non-linear operations (ReLUs, Sigmoids, etc.) not naturally suited to polynomials, techniques like lookup tables are used to handle these efficiently. For example, a ReLU (output = max(0, input)) can be enforced by a custom constraint or lookup that verifies output equals input if input≥0 else zero. The end result is a set of cryptographic constraints that the prover must satisfy, which implicitly proves the model ran correctly.
  • Execution Trace & Virtual Machines: An alternative is to treat the model inference as a program trace, as done in zkVM approaches. For instance, the JOLT zkVM targets the RISC-V instruction set; one can compile the ML model (or the code that computes it) to RISC-V and then prove each CPU instruction executed properly. JOLT introduces a “lookup singularity” technique, replacing expensive arithmetic constraints with fast table lookups for each valid CPU operation. Every operation (add, multiply, bitwise op, etc.) is checked via a lookup in a giant table of pre-computed valid outcomes, using a specialized argument (Lasso/SHOUT) to keep this efficient. This drastically reduces the prover workload: even complex 64-bit operations become a single table lookup in the proof instead of many arithmetic constraints.
  • Interactive Protocols (GKR Sum-Check): A third approach uses interactive proofs like GKR (Goldwasser–Kalai–Rotblum) to verify a layered computation. Here the model’s computation is viewed as a layered arithmetic circuit (each neural network layer is one layer of the circuit graph). The prover runs the model normally but then engages in a sum-check protocol to prove that each layer’s outputs are correct given its inputs. In Lagrange’s approach (DeepProve, detailed next), the prover and verifier perform an interactive polynomial protocol (made non-interactive via Fiat-Shamir) that checks consistency of each layer’s computations without re-doing them. This sum-check method avoids generating a monolithic static circuit; instead it verifies the consistency of computations in a step-by-step manner with minimal cryptographic operations (mostly hashing or polynomial evaluations).

Regardless of approach, the outcome is a succinct proof (typically a few kilobytes to a few tens of kilobytes) that attests to the correctness of the entire inference. The proof is zero-knowledge, meaning any secret inputs (private data or model parameters) can be kept hidden – they influence the proof but are not revealed to verifiers. Only the intended public outputs or assertions are revealed. This allows scenarios like “prove that model $M$ when applied to patient data $X$ yields diagnosis $Y$, without revealing $X$ or the model’s weights.”

Enabling on-chain verification: Once a proof is generated, it can be posted to a blockchain. Smart contracts can include verification logic to check the proof, often using precompiled cryptographic primitives. For example, Ethereum has precompiles for BLS12-381 pairing operations used in many zk-SNARK verifiers, making on-chain verification of SNARK proofs efficient. STARKs (hash-based proofs) are larger, but can still be verified on-chain with careful optimization or possibly with some trust assumptions (StarkWare’s L2, for instance, verifies STARK proofs on Ethereum by an on-chain verifier contract, albeit with higher gas cost than SNARKs). The key is that the chain does not need to execute the ML model – it only runs a verification which is much cheaper than the original compute. In summary, zkML compresses expensive AI inference into a small proof that blockchains (or any verifier) can check in milliseconds to seconds.

Lagrange DeepProve: Architecture and Performance of a zkML Breakthrough

DeepProve by Lagrange Labs is a state-of-the-art zkML inference framework focusing on speed and scalability. Launched in 2025, DeepProve introduced a new proving system that is dramatically faster than prior solutions like Ezkl. Its design centers on the GKR interactive proof protocol with sum-check and specialized optimizations for neural network circuits. Here’s how DeepProve works and achieves its performance:

  • One-Time Preprocessing: Developers start with a trained neural network (currently supported types include multilayer perceptrons and popular CNN architectures). The model is exported to ONNX format, a standard graph representation. DeepProve’s tool then parses the ONNX model and quantizes it (converts weights to fixed-point/integer form) for efficient field arithmetic. In this phase, it also generates the proving and verification keys for the cryptographic protocol. This setup is done once per model and does not need to be repeated per inference. DeepProve emphasizes ease of integration: “Export your model to ONNX → one-time setup → generate proofs → verify anywhere”.

  • Proving (Inference + Proof Generation): After setup, a prover (which could be run by a user, a service, or Lagrange’s decentralized prover network) takes a new input $X$ and runs the model $M$ on it, obtaining output $Y$. During this execution, DeepProve records an execution trace of each layer’s computations. Instead of translating every multiplication into a static circuit upfront (as SNARK approaches do), DeepProve uses the linear-time GKR protocol to verify each layer on the fly. For each network layer, the prover commits to the layer’s inputs and outputs (e.g., via cryptographic hashes or polynomial commitments) and then engages in a sum-check argument to prove that the outputs indeed result from the inputs as per the layer’s function. The sum-check protocol iteratively convinces the verifier of the correctness of a sum of evaluations of a polynomial that encodes the layer’s computation, without revealing the actual values. Non-linear operations (like ReLU, softmax) are handled efficiently through lookup arguments in DeepProve – if an activation’s output was computed, DeepProve can prove that each output corresponds to a valid input-output pair from a precomputed table for that function. Layer by layer, proofs are generated and then aggregated into one succinct proof covering the whole model’s forward pass. The heavy lifting of cryptography is minimized – DeepProve’s prover mostly performs normal numeric computations (the actual inference) plus some light cryptographic commitments, rather than solving a giant system of constraints.

  • Verification: The verifier uses the final succinct proof along with a few public values – typically the model’s committed identifier (a cryptographic commitment to $M$’s weights), the input $X$ (if not private), and the claimed output $Y$ – to check correctness. Verification in DeepProve’s system involves verifying the sum-check protocol’s transcript and the final polynomial or hash commitments. This is more involved than verifying a classic SNARK (which might be a few pairings), but it’s vastly cheaper than re-running the model. In Lagrange’s benchmarks, verifying a DeepProve proof for a medium CNN takes on the order of 0.5 seconds in software. That is ~0.5s to confirm, for example, that a convolutional network with hundreds of thousands of parameters ran correctly – over 500× faster than naively re-computing that CNN on a GPU for verification. (In fact, DeepProve measured up to 521× faster verification for CNNs and 671× for MLPs compared to re-execution.) The proof size is small enough to transmit on-chain (tens of KB), and verification could be performed in a smart contract if needed, although 0.5s of computation might require careful gas optimization or layer-2 execution.

Architecture and Tooling: DeepProve is implemented in Rust and provides a toolkit (the zkml library) for developers. It natively supports ONNX model graphs, making it compatible with models from PyTorch or TensorFlow (after exporting). The proving process currently targets models up to a few million parameters (tests include a 4M-parameter dense network). DeepProve leverages a combination of cryptographic components: a multilinear polynomial commitment (to commit to layer outputs), the sum-check protocol for verifying computations, and lookup arguments for non-linear ops. Notably, Lagrange’s open-source repository acknowledges it builds on prior work (the sum-check and GKR implementation from Scroll’s Ceno project), indicating an intersection of zkML with zero-knowledge rollup research.

To achieve real-time scalability, Lagrange pairs DeepProve with its Prover Network – a decentralized network of specialized ZK provers. Heavy proof generation can be offloaded to this network: when an application needs an inference proved, it sends the job to Lagrange’s network, where many operators (staked on EigenLayer for security) compute proofs and return the result. This network economically incentivizes reliable proof generation (malicious or failed jobs get the operator slashed). By distributing work across provers (and potentially leveraging GPUs or ASICs), the Lagrange Prover Network hides the complexity and cost from end-users. The result is a fast, scalable, and decentralized zkML service: “verifiable AI inferences fast and affordable”.

Performance Milestones: DeepProve’s claims are backed by benchmarks against the prior state-of-the-art, Ezkl. For a CNN with ~264k parameters (CIFAR-10 scale model), DeepProve’s proving time was ~1.24 seconds versus ~196 seconds for Ezkl – about 158× faster. For a larger dense network with 4 million parameters, DeepProve proved an inference in ~2.3 seconds vs ~126.8 seconds for Ezkl (~54× faster). Verification times also dropped: DeepProve verified the 264k CNN proof in ~0.6s, whereas verifying the Ezkl proof (Halo2-based) took over 5 minutes on CPU in that test. The speedups come from DeepProve’s near-linear complexity: its prover scales roughly O(n) with the number of operations, whereas circuit-based SNARK provers often have superlinear overhead (FFT and polynomial commitments scaling). In fact, DeepProve’s prover throughput can be within an order of magnitude of plain inference runtime – recent GKR systems can be <10× slower than raw execution for large matrix multiplications, an impressive achievement in ZK. This makes real-time or on-demand proofs more feasible, paving the way for verifiable AI in interactive applications.

Use Cases: Lagrange is already collaborating with Web3 and AI projects to apply zkML. Example use cases include: verifiable NFT traits (proving an AI-generated evolution of a game character or collectible is computed by the authorized model), provenance of AI content (proving an image or text was generated by a specific model, to combat deepfakes), DeFi risk models (proving a model’s output that assesses financial risk without revealing proprietary data), and private AI inference in healthcare or finance (where a hospital can get AI predictions with a proof, ensuring correctness without exposing patient data). By making AI outputs verifiable and privacy-preserving, DeepProve opens the door to “AI you can trust” in decentralized systems – moving from an era of “blind trust in black-box models” to one of “objective guarantees”.

SNARK-Based zkML: Ezkl and the Halo2 Approach

The traditional approach to zkML uses zk-SNARKs (Succinct Non-interactive Arguments of Knowledge) to prove neural network inference. Ezkl (by ZKonduit/Modulus Labs) is a leading example of this approach. It builds on the Halo2 proving system (a PLONK-style SNARK with polynomial commitments over BLS12-381). Ezkl provides a tooling chain where a developer can take a PyTorch or TensorFlow model, export it to ONNX, and have Ezkl compile it into a custom arithmetic circuit automatically.

How it works: Each layer of the neural network is converted into constraints:

  • Linear layers (dense or convolution) become collections of multiplication-add constraints that enforce the dot-products between inputs, weights, and outputs.
  • Non-linear layers (like ReLU, sigmoid, etc.) are handled via lookups or piecewise constraints because such functions are not polynomial. For instance, a ReLU can be implemented by a boolean selector $b$ with constraints ensuring $y = x \cdot b$ and $0 \le b \le 1$ and $b=1$ if $x>0$ (one way to do it), or more efficiently by a lookup table mapping $x \mapsto \max(0,x)$ for a range of $x$ values. Halo2’s lookup arguments allow mapping 16-bit (or smaller) chunks of values, so large domains (like all 32-bit values) are usually “chunked” into several smaller lookups. This chunking increases the number of constraints.
  • Big integer ops or divisions (if any) are similarly broken into small pieces. The result is a large set of R1CS/PLONK constraints tailored to the specific model architecture.

Ezkl then uses Halo2 to generate a proof that these constraints hold given the secret inputs (model weights, private inputs) and public outputs. Tooling and integration: One advantage of the SNARK approach is that it leverages well-known primitives. Halo2 is already used in Ethereum rollups (e.g. Zcash, zkEVMs), so it’s battle-tested and has an on-chain verifier readily available. Ezkl’s proofs use BLS12-381 curve, which Ethereum can verify via precompiles, making it straightforward to verify an Ezkl proof in a smart contract. The team has also provided user-friendly APIs; for example, data scientists can work with their models in Python and use Ezkl’s CLI to produce proofs, without deep knowledge of circuits.

Strengths: Ezkl’s approach benefits from the generality and ecosystem of SNARKs. It supports reasonably complex models and has already seen “practical integrations (from DeFi risk models to gaming AI)”, proving real-world ML tasks. Because it operates at the level of the model’s computation graph, it can apply ML-specific optimizations: e.g. pruning insignificant weights or quantizing parameters to reduce circuit size. It also means model confidentiality is natural – the weights can be treated as private witness data, so the verifier only sees that some valid model produced the output, or at best a commitment to the model. The verification of SNARK proofs is extremely fast (typically a few milliseconds or less on-chain), and proof sizes are small (a few kilobytes), which is ideal for blockchain usage.

Weaknesses: Performance is the Achilles’ heel. Circuit-based proving imposes large overheads, especially as models grow. It’s noted that historically, SNARK circuits could be a million times more work for the prover than just running the model itself. Halo2 and Ezkl optimize this, but still, operations like large matrix multiplications generate tons of constraints. If a model has millions of parameters, the prover must handle correspondingly millions of constraints, performing heavy FFTs and multiexponentiation in the process. This leads to high proving times (often minutes or hours for non-trivial models) and high memory usage. For example, proving even a relatively small CNN (e.g. a few hundred thousand parameters) can take tens of minutes with Ezkl on a single machine. The team behind DeepProve cited that Ezkl took hours for certain model proofs that DeepProve can do in minutes. Large models might not even fit in memory or require splitting into multiple proofs (which then need recursive aggregation). While Halo2 is “moderately optimized”, any need to “chunk” lookups or handle wide-bit operations translates to extra overhead. In summary, scalability is limited – Ezkl works well for small-to-medium models (and indeed outperformed some earlier alternatives like naive Stark-based VMs in benchmarks), but struggles as model size grows beyond a point.

Despite these challenges, Ezkl and similar SNARK-based zkML libraries are important stepping stones. They proved that verified ML inference is possible on-chain and have active usage. Notably, projects like Modulus Labs demonstrated verifying an 18-million-parameter model on-chain using SNARKs (with heavy optimization). The cost was non-trivial, but it shows the trajectory. Moreover, the Mina Protocol has its own zkML toolkit that uses SNARKs to allow smart contracts on Mina (which are Snark-based) to verify ML model execution. This indicates a growing multi-platform support for SNARK-based zkML.

STARK-Based Approaches: Transparent and Programmable ZK for ML

zk-STARKs (Scalable Transparent ARguments of Knowledge) offer another route to zkML. STARKs use hash-based cryptography (like FRI for polynomial commitments) and avoid any trusted setup. They often operate by simulating a CPU or VM and proving the execution trace is correct. In context of ML, one can either build a custom STARK for the neural network or use a general-purpose STARK VM to run the model code.

General STARK VMs (RISC Zero, Cairo): A straightforward approach is to write inference code and run it in a STARK VM. For example, Risc0 provides a RISC-V environment where any code (e.g., C++ or Rust implementation of a neural network) can be executed and proven via a STARK. Similarly, StarkWare’s Cairo language can express arbitrary computations (like an LSTM or CNN inference) which are then proved by the StarkNet STARK prover. The advantage is flexibility – you don’t need to design custom circuits for each model. However, early benchmarks showed that naive STARK VMs were slower compared to optimized SNARK circuits for ML. In one test, a Halo2-based proof (Ezkl) was about 3× faster than a STARK-based approach on Cairo, and even 66× faster than a RISC-V STARK VM on a certain benchmark in 2024. This gap is due to the overhead of simulating every low-level instruction in a STARK and the larger constants in STARK proofs (hashing is fast but you need a lot of it; STARK proof sizes are bigger, etc.). However, STARK VMs are improving and have the benefit of transparent setup (no trusted setup) and post-quantum security. As STARK-friendly hardware and protocols advance, proving speeds will improve.

DeepProve’s approach vs STARK: Interestingly, DeepProve’s use of GKR and sum-check yields a proof more akin to a STARK in spirit – it’s an interactive, hash-based proof with no need for a structured reference string. The trade-off is that its proofs are larger and verification is heavier than a SNARK. Yet, DeepProve shows that careful protocol design (specialized to ML’s layered structure) can vastly outperform both generic STARK VMs and SNARK circuits in proving time. We can consider DeepProve as a bespoke STARK-style zkML prover (though they use the term zkSNARK for succinctness, it doesn’t have a traditional SNARK’s small constant-size verification, since 0.5s verify is bigger than typical SNARK verify). Traditional STARK proofs (like StarkNet’s) often involve tens of thousands of field operations to verify, whereas SNARK verifies in maybe a few dozen. Thus, one trade-off is evident: SNARKs yield smaller proofs and faster verifiers, while STARKs (or GKR) offer easier scaling and no trusted setup at the cost of proof size and verify speed.

Emerging improvements: The JOLT zkVM (discussed earlier under JOLTx) is actually outputting SNARKs (using PLONKish commitments) but it embodies ideas that could be applied in STARK context too (Lasso lookups could theoretically be used with FRI commitments). StarkWare and others are researching ways to speed up proving of common operations (like using custom gates or hints in Cairo for big int ops, etc.). There’s also Circomlib-ML by Privacy&Scaling Explorations (PSE), which provides Circom templates for CNN layers, etc. – that’s SNARK-oriented, but conceptually similar templates could be made for STARK languages.

In practice, non-Ethereum ecosystems leveraging STARKs include StarkNet (which could allow on-chain verification of ML if someone writes a verifier, though cost is high) and Risc0’s Bonsai service (which is an off-chain proving service that emits STARK proofs which can be verified on various chains). As of 2025, most zkML demos on blockchain have favored SNARKs (due to verifier efficiency), but STARK approaches remain attractive for their transparency and potential in high-security or quantum-resistant settings. For example, a decentralized compute network might use STARKs to let anyone verify work without a trusted setup, useful for longevity. Also, some specialized ML tasks might exploit STARK-friendly structures: e.g. computations heavily using XOR/bit operations could be faster in STARKs (since those are cheap in boolean algebra and hashing) than in SNARK field arithmetic.

Summary of SNARK vs STARK for ML:

  • Performance: SNARKs (like Halo2) have huge proving overhead per gate but benefit from powerful optimizations and small constants for verify; STARKs (generic) have larger constant overhead but scale more linearly and avoid expensive crypto like pairings. DeepProve shows that customizing the approach (sum-check) yields near-linear proving time (fast) but with a STARK-like proof. JOLT shows that even a general VM can be made faster with heavy use of lookups. Empirically, for models up to millions of operations: a well-optimized SNARK (Ezkl) can handle it but might take tens of minutes, whereas DeepProve (GKR) can do it in seconds. STARK VMs in 2024 were likely in between or worse than SNARKs unless specialized (Risc0 was slower in tests, Cairo was slower without custom hints).
  • Verification: SNARK proofs verify quickest (milliseconds, and minimal data on-chain ~ a few hundred bytes to a few KB). STARK proofs are larger (dozens of KB) and take longer (tens of ms to seconds) to verify due to many hashing steps. In blockchain terms, a SNARK verify might cost e.g. ~200k gas, whereas a STARK verify could cost millions of gas – often too high for L1, acceptable on L2 or with succinct verification schemes.
  • Setup and Security: SNARKs like Groth16 require a trusted setup per circuit (unfriendly for arbitrary models), but universal SNARKs (PLONK, Halo2) have a one-time setup that can be reused for any circuit up to certain size. STARKs need no setup and use only hash assumptions (plus classical polynomial complexity assumptions), and are post-quantum secure. This makes STARKs appealing for longevity – proofs remain secure even if quantum computers emerge, whereas current SNARKs (BLS12-381 based) would be broken by quantum attacks.

We will consolidate these differences in a comparison table shortly.

FHE for ML (FHE-o-ML): Private Computation vs. Verifiable Computation

Fully Homomorphic Encryption (FHE) is a cryptographic technique that allows computations to be performed directly on encrypted data. In the context of ML, FHE can enable a form of privacy-preserving inference: for example, a client can send encrypted input to a model host, the host runs the neural network on the ciphertext without decrypting it, and sends back an encrypted result which the client can decrypt. This ensures data confidentiality – the model owner learns nothing about the input (and potentially the client learns only the output, not the model’s internals if they only get output). However, FHE by itself does not produce a proof of correctness in the same way ZKPs do. The client must trust that the model owner actually performed the computation honestly (the ciphertext could have been manipulated). Usually, if the client has the model or expects a certain distribution of outputs, blatant cheating can be detected, but subtle errors or use of a wrong model version would not be evident just from the encrypted output.

Trade-offs in performance: FHE is notoriously heavy in computation. Running deep learning inference under FHE incurs orders-of-magnitude slowdown. Early experiments (e.g., CryptoNets in 2016) took tens of seconds to evaluate a tiny CNN on encrypted data. By 2024, improvements like CKKS (for approximate arithmetic) and better libraries (Microsoft SEAL, Zama’s Concrete) have reduced this overhead, but it remains large. For example, a user reported that using Zama’s Concrete-ML to run a CIFAR-10 classifier took 25–30 minutes per inference on their hardware. After optimizations, Zama’s team achieved ~40 seconds for that inference on a 192-core server. Even 40s is extremely slow compared to a plaintext inference (which might be 0.01s), showing a ~$10^3$–$10^4\times$ overhead. Larger models or higher precision increase the cost further. Additionally, FHE operations consume a lot of memory and require occasional bootstrapping (a noise-reduction step) which is computationally expensive. In summary, scalability is a major issue – state-of-the-art FHE might handle a small CNN or simple logistic regression, but scaling to large CNNs or Transformers is beyond current practical limits.

Privacy advantages: FHE’s big appeal is data privacy. The input can remain completely encrypted throughout the process. This means an untrusted server can compute on a client’s private data without learning anything about it. Conversely, if the model is sensitive (proprietary), one could envisage encrypting the model parameters and having the client perform FHE inference on their side – but this is less common because if the client has to do the heavy FHE compute, it negates the idea of offloading to a powerful server. Typically, the model is public or held by server in the clear, and the data is encrypted by the client’s key. Model privacy in that scenario is not provided by default (the server knows the model; the client learns outputs but not weights). There are more exotic setups (like secure two-party computation or multi-key FHE) where both model and data can be kept private from each other, but those incur even more complexity. In contrast, zkML via ZKPs can ensure model privacy and data privacy at once – the prover can have both the model and data as secret witness, only revealing what’s needed to the verifier.

No on-chain verification needed (and none possible): With FHE, the result comes out encrypted to the client. The client then decrypts it to obtain the actual prediction. If we want to use that result on-chain, the client (or whoever holds the decryption key) would have to publish the plaintext result and convince others it’s correct. But at that point, trust is back in the loop – unless combined with a ZKP. In principle, one could combine FHE and ZKP: e.g., use FHE to keep data private during compute, and then generate a ZK-proof that the plaintext result corresponds to a correct computation. However, combining them means you pay the performance penalty of FHE and ZKP – extremely impractical with today’s tech. So, in practice FHE-of-ML and zkML serve different use cases:

  • FHE-of-ML: Ideal when the goal is confidentiality between two parties (client and server). For instance, a cloud service can host an ML model and users can query it with their sensitive data without revealing the data to the cloud (and if the model is sensitive, perhaps deploy it via FHE-friendly encodings). This is great for privacy-preserving ML services (medical predictions, etc.). The user still has to trust the service to faithfully run the model (since no proof), but at least any data leakage is prevented. Some projects like Zama are even exploring an “FHE-enabled EVM (fhEVM)” where smart contracts could operate on encrypted inputs, but verifying those computations on-chain would require the contract to somehow enforce correct computation – an open challenge likely requiring ZK proofs or specialized secure hardware.
  • zkML (ZKPs): Ideal when the goal is verifiability and public auditability. If you want anyone (or any contract) to be sure that “Model $M$ was evaluated correctly on $X$ and produced $Y$”, ZKPs are the solution. They also provide privacy as a bonus (you can hide $X$ or $Y$ or $M$ if needed by treating them as private inputs to the proof), but their primary feature is the proof of correct execution.

A complementary relationship: It’s worth noting that ZKPs protect the verifier (they learn nothing about secrets, only that the computation was correctly done), whereas FHE protects the prover’s data from the computing party. In some scenarios, these could be combined – for example, a network of untrusted nodes could use FHE to compute on users’ private data and then provide ZK proofs to the users (or blockchain) that the computations were done according to the protocol. This would cover both privacy and correctness, but the performance cost is enormous with today’s algorithms. More feasible in the near term are hybrids like Trusted Execution Environments (TEE) plus ZKP or Functional Encryption plus ZKP – these are beyond our scope, but they aim to provide something similar (TEEs keep data/model secret during compute, then a ZKP can attest the TEE did the right thing).

In summary, FHE-of-ML prioritizes confidentiality of inputs/outputs, while zkML prioritizes verifiable correctness (with possible privacy). Table 1 below contrasts the key properties:

ApproachProver Performance (Inference & Proof)Proof Size & VerificationPrivacy FeaturesTrusted Setup?Post-Quantum?
zk-SNARK (Halo2, Groth16, PLONK, etc)Heavy prover overhead (up to 10^6× normal runtime without optimizations; in practice 10^3–10^5×). Optimized for specific model/circuit; proving time in minutes for medium models, hours for large. Recent zkML SNARKs (DeepProve with GKR) vastly improve this (near-linear overhead, e.g. seconds instead of minutes for million-param models).Very small proofs (often < 100 KB, sometimes ~a few KB). Verification is fast: a few pairings or polynomial evals (typically < 50 ms on-chain). DeepProve’s GKR-based proofs are larger (tens–hundreds KB) and verify in ~0.5 s (still much faster than re-running the model).Data confidentiality: Yes – inputs can be private in proof (not revealed). Model privacy: Yes – prover can commit to model weights and not reveal them. Output hiding: Optional – proof can be of a statement without revealing output (e.g. “output has property P”). However, if the output itself is needed on-chain, it typically becomes public. Overall, SNARKs offer full zero-knowledge flexibility (hide whichever parts you want).Depends on scheme. Groth16/EZKL require a trusted setup per circuit; PLONK/Halo2 use a universal setup (one time). DeepProve’s sum-check GKR is transparent (no setup) – a bonus of that design.Classical SNARKs (BLS12-381 curves) are not PQ-safe (vulnerable to quantum attacks on elliptic curve discrete log). Some newer SNARKs use PQ-safe commitments, but Halo2/PLONK as used in Ezkl are not PQ-safe. GKR (DeepProve) uses hash commitments (e.g. Poseidon/Merkle) which are conjectured PQ-safe (relying on hash preimage resistance).
zk-STARK (FRI, hash-based proof)Prover overhead is high but more linear scaling. Typically 10^2–10^4× slower than native for large tasks, with room to parallelize. General STARK VMs (Risc0, Cairo) saw slower performance vs SNARK for ML in 2024 (e.g. 3×–66× slower than Halo2 in some cases). Specialized STARKs (or GKR) can approach linear overhead and outperform SNARKs for large circuits.Proofs are larger: often tens of KB (growing with circuit size/log(n)). Verifier must do multiple hash and FFT checks – verification time ~O(n^ε) for small ε (e.g. ~50 ms to 500 ms depending on proof size). On-chain, this is costlier (StarkWare’s L1 verifier can take millions of gas per proof). Some STARKs support recursive proofs to compress size, at cost of prover time.Data & Model privacy: A STARK can be made zero-knowledge by randomizing trace data (adding blinding to polynomial evaluations), so it can hide private inputs similarly to SNARK. Many STARK implementations focus on integrity, but zk-STARK variants do allow privacy. So yes, they can hide inputs/models like SNARKs. Output hiding: likewise possible in theory (prover doesn’t declare the output as public), but rarely used since usually the output is what we want to reveal/verify.No trusted setup. Transparency is a hallmark of STARKs – only require common random string (which Fiat-Shamir can derive). This makes them attractive for open-ended use (any model, any time, no per-model ceremony).Yes, STARKs rely on hash and information-theoretic security assumptions (like random oracle and difficulty of certain codeword decoding in FRI). These are believed to be secure against quantum adversaries. STARK proofs are thus PQ-resistant, an advantage for future-proofing verifiable AI.
FHE for ML (Fully Homomorphic Encryption applied to inference)Prover = party doing computation on encrypted data. The computation time is extremely high: 10^3–10^5× slower than plaintext inference is common. High-end hardware (many-core servers, FPGA, etc.) can mitigate this. Some optimizations (low-precision inference, leveled FHE parameters) can reduce overhead but there is a fundamental performance hit. FHE is currently practical for small models or simple linear models; deep networks remain challenging beyond toy sizes.No proof generated. The result is an encrypted output. Verification in the sense of checking correctness is not provided by FHE alone – one trusts the computing party to not cheat. (If combined with secure hardware, one might get an attestation; otherwise, a malicious server could return an incorrect encrypted result that the client would decrypt to wrong output without knowing the difference).Data confidentiality: Yes – the input is encrypted, so the computing party learns nothing about it. Model privacy: If the model owner is doing the compute on encrypted input, the model is in plaintext on their side (not protected). If roles are reversed (client holds model encrypted and server computes), model could be kept encrypted, but this scenario is less common. There are techniques like secure two-party ML that combine FHE/MPC to protect both, but these go beyond plain FHE. Output hiding: By default, the output of the computation is encrypted (only decryptable by the party with the secret key, usually the input owner). So the output is hidden from the computing server. If we want the output public, the client can decrypt and reveal it.No setup needed. Each user generates their own key pair for encryption. Trust relies on keys remaining secret.The security of FHE schemes (e.g. BFV, CKKS, TFHE) is based on lattice problems (Learning With Errors), which are believed to be resistant to quantum attacks (at least no efficient quantum algorithm is known). So FHE is generally considered post-quantum secure.

Table 1: Comparison of zk-SNARK, zk-STARK, and FHE approaches for machine learning inference (performance and privacy trade-offs).

Use Cases and Implications for Web3 Applications

The convergence of AI and blockchain via zkML unlocks powerful new application patterns in Web3:

  • Decentralized Autonomous Agents & On-Chain Decision-Making: Smart contracts or DAOs can incorporate AI-driven decisions with guarantees of correctness. For example, imagine a DAO that uses a neural network to analyze market conditions before executing trades. With zkML, the DAO’s smart contract can require a zkSNARK proof that the authorized ML model (with a known hash commitment) was run on the latest data and produced the recommended action, before the action is accepted. This prevents malicious actors from injecting a fake prediction – the chain verifies the AI’s computation. Over time, one could even have fully on-chain autonomous agents (contracts that query off-chain AI or contain simplified models) making decisions in DeFi or games, with all their moves proven correct and policy-compliant via zk proofs. This raises the trust in autonomous agents, since their “thinking” is transparent and verifiable rather than a black-box.

  • Verifiable Compute Markets: Projects like Lagrange are effectively creating verifiable computation marketplaces – developers can outsource heavy ML inference to a network of provers and get back a proof with the result. This is analogous to decentralized cloud computing, but with built-in trust: you don’t need to trust the server, only the proof. It’s a paradigm shift for oracles and off-chain computation. Protocols like Ethereum’s upcoming DSC (decentralized sequencing layer) or oracle networks could use this to provide data feeds or analytic feeds with cryptographic guarantees. For instance, an oracle could supply “the result of model X on input Y” and anyone can verify the attached proof on-chain, rather than trusting the oracle’s word. This could enable verifiable AI-as-a-service on blockchain: any contract can request a computation (like “score these credit risks with my private model”) and accept the answer only with a valid proof. Projects such as Gensyn are exploring decentralized training and inference marketplaces using these verification techniques.

  • NFTs and Gaming – Provenance and Evolution: In blockchain games or NFT collectibles, zkML can prove traits or game moves were generated by legitimate AI models. For example, a game might allow an AI to evolve an NFT pet’s attributes. Without ZK, a clever user might modify the AI or the outcome to get a superior pet. With zkML, the game can require a proof that “pet’s new stats were computed by the official evolution model on the pet’s old stats”, preventing cheating. Similarly for generative art NFTs: an artist could release a generative model as a commitment; later, when minting NFTs, prove each image was produced by that model given some seed, guaranteeing authenticity (and even doing so without revealing the exact model to the public, preserving the artist’s IP). This provenance verification ensures authenticity in a manner akin to verifiable randomness – except here it’s verifiable creativity.

  • Privacy-Preserving AI in Sensitive Domains: zkML allows confirmation of outcomes without exposing inputs. In healthcare, a patient’s data could be run through an AI diagnostic model by a cloud provider; the hospital receives a diagnosis and a proof that the model (which could be privately held by a pharmaceutical company) was run correctly on the patient data. The patient data remains private (only an encrypted or committed form was used in the proof), and the model weights remain proprietary – yet the result is trusted. Regulators or insurance could also verify that only approved models were used. In finance, a company could prove to an auditor or regulator that its risk model was applied to its internal data and produced certain metrics without revealing the underlying sensitive financial data. This enables compliance and oversight with cryptographic assurances rather than manual trust.

  • Cross-Chain and Off-Chain Interoperability: Because zero-knowledge proofs are fundamentally portable, zkML can facilitate cross-chain AI results. One chain might have an AI-intensive application running off-chain; it can post a proof of the result to a different blockchain, which will trustlessly accept it. For instance, consider a multi-chain DAO using an AI to aggregate sentiment across social media (off-chain data). The AI analysis (complex NLP on large data) is done off-chain by a service that then posts a proof to a small blockchain (or multiple chains) that “analysis was done correctly and output sentiment score = 0.85”. All chains can verify and use that result in their governance logic, without each needing to rerun the analysis. This kind of interoperable verifiable compute is what Lagrange’s network aims to support, by serving multiple rollups or L1s simultaneously. It removes the need for trusted bridges or oracle assumptions when moving results between chains.

  • AI Alignment and Governance: On a more forward-looking note, zkML has been highlighted as a tool for AI governance and safety. Lagrange’s vision statements, for example, argue that as AI systems become more powerful (even superintelligent), cryptographic verification will be essential to ensure they follow agreed rules. By requiring AI models to produce proofs of their reasoning or constraints, humans retain a degree of control – “you cannot trust what you cannot verify”. While this is speculative and involves social as much as technical aspects, the technology could enforce that an AI agent running autonomously still proves it is using an approved model and hasn’t been tampered with. Decentralized AI networks might use on-chain proofs to verify contributions (e.g., a network of nodes collaboratively training a model can prove each update was computed faithfully). Thus zkML could play a role in ensuring AI systems remain accountable to human-defined protocols even in decentralized or uncontrolled environments.

In conclusion, zkML and verifiable on-chain AI represent a convergence of advanced cryptography and machine learning that stands to enhance trust, transparency, and privacy in AI applications. By comparing the major approaches – zk-SNARKs, zk-STARKs, and FHE – we see a spectrum of trade-offs between performance and privacy, each suitable for different scenarios. SNARK-based frameworks like Ezkl and innovations like Lagrange’s DeepProve have made it feasible to prove substantial neural network inferences with practical effort, opening the door to real-world deployments of verifiable AI. STARK-based and VM-based approaches promise greater flexibility and post-quantum security, which will become important as the field matures. FHE, while not a solution for verifiability, addresses the complementary need of confidential ML computation, and in combination with ZKPs or in specific private contexts it can empower users to leverage AI without sacrificing data privacy.

The implications for Web3 are significant: we can foresee smart contracts reacting to AI predictions, knowing they are correct; markets for compute where results are trustlessly sold; digital identities (like Worldcoin’s proof-of-personhood via iris AI) protected by zkML to confirm someone is human without leaking their biometric image; and generally a new class of “provable intelligence” that enriches blockchain applications. Many challenges remain – performance for very large models, developer ergonomics, and the need for specialized hardware – but the trajectory is clear. As one report noted, “today’s ZKPs can support small models, but moderate to large models break the paradigm”; however, rapid advances (50×–150× speedups with DeepProve over prior art) are pushing that boundary outward. With ongoing research (e.g., on hardware acceleration and distributed proving), we can expect progressively larger and more complex AI models to become provable. zkML might soon evolve from niche demos to an essential component of trusted AI infrastructure, ensuring that as AI becomes ubiquitous, it does so in a way that is auditable, decentralized, and aligned with user privacy and security.