
Covenant-72B: The Largest Collaboratively Trained AI Model in Crypto History

· 9 min read
Dora Noda
Software Engineer

What if the next frontier AI model wasn't trained in a billion-dollar data center owned by a single corporation — but by dozens of anonymous contributors scattered across the globe, coordinated by a blockchain, communicating over ordinary internet connections?

That's exactly what just happened. Templar's Covenant-72B, a 72.7-billion-parameter large language model pre-trained entirely on Bittensor's Subnet 3, has become the largest collaboratively trained AI model in crypto history — and one of the first to achieve competitive performance with centralized baselines while allowing fully permissionless participation. No whitelists. No corporate gatekeepers. Just GPUs, compressed gradients, and a token-incentive mechanism that kept everyone honest.

Anthropic co-founder Jack Clark highlighted the achievement in his influential Import AI newsletter, noting that decentralized training compute is growing at 20x per year, four times faster than the 5x annual growth of centralized frontier training.

Here's why this matters far beyond the Bittensor ecosystem.

The $1 Billion Problem Covenant-72B Addresses

Training a frontier LLM in 2026 is an exercise in concentrated capital. Anthropic's CEO has stated that single training runs are approaching $1 billion in cost. OpenAI, Google DeepMind, and xAI compete for finite supplies of NVIDIA H100 and B200 GPUs, locking them into multi-year cloud contracts worth billions. The result: only five or six organizations on Earth can afford to train models at the frontier.

This concentration creates real risks. A single company's alignment choices, data curation decisions, and commercial incentives shape the AI systems that billions of people use. If frontier model training remains exclusively centralized, the "who decides" question in AI governance narrows to a handful of boardrooms.

Covenant-72B doesn't solve this overnight. But it provides the first credible proof that a different path exists at meaningful scale.

Inside Covenant-72B: The Technical Architecture

Model Specifications

Covenant-72B uses a LLaMA-style architecture with 80 transformer layers, a hidden width of 8,192, and grouped-query attention with 64 query heads and 8 key-value heads. It uses RoPE positional embeddings and the Gemma 3 SentencePiece tokenizer with a 262,208-token vocabulary.
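
As a rough illustration of that grouped-query attention shape, here is a NumPy sketch (a generic GQA example, not Templar's code) in which each key-value head is shared by eight query heads:

```python
import numpy as np

# Covenant-72B attention shape from the spec above; the grouping
# logic is a generic GQA sketch, not Templar's implementation.
d_model, n_q_heads, n_kv_heads = 8192, 64, 8
head_dim = d_model // n_q_heads          # 128
group_size = n_q_heads // n_kv_heads     # 8 query heads per KV head

seq = 16  # toy sequence length
rng = np.random.default_rng(0)
q = rng.standard_normal((n_q_heads, seq, head_dim))
k = rng.standard_normal((n_kv_heads, seq, head_dim))
v = rng.standard_normal((n_kv_heads, seq, head_dim))

# Query head h attends against shared KV head h // group_size.
out = np.empty_like(q)
for h in range(n_q_heads):
    kv = h // group_size
    scores = q[h] @ k[kv].T / np.sqrt(head_dim)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    out[h] = weights @ v[kv]

print(out.shape)  # (64, 16, 128)
```

The payoff is memory: the KV cache stores 8 heads instead of 64, an 8x reduction, which matters for both training activation memory and inference.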

The model was trained on approximately 1.1 trillion tokens — 1.09 trillion from DCLM web text during the main phase, plus 14.2 billion tokens during an annealing phase on curated high-quality data (27% instruction, 20% synthetic web, 15% code, 13% math, 25% replay). A supervised fine-tuning stage added another 14.8 billion tokens to produce a chat-capable variant.

SparseLoCo: The Communication Breakthrough

The core innovation enabling decentralized training at this scale is SparseLoCo, a communication-efficient optimizer that achieves a Pareto-optimal tradeoff between model performance and bandwidth consumption.

Here's the problem it solves: in centralized training, GPUs in the same data center exchange gradients over high-speed interconnects (NVLink, InfiniBand) with hundreds of gigabits per second of bandwidth. Distributed training over commodity internet has orders of magnitude less bandwidth. Naively synchronizing gradients would make training impossibly slow.

SparseLoCo uses chunk-wise Top-k sparsification with 2-bit quantization to compress pseudo-gradients by more than 146x. Each peer runs 30 inner optimization steps locally using AdamW, then communicates only the most significant gradient updates in heavily compressed form. The result: each training round requires roughly 20 minutes of compute but only 70 seconds of communication — achieving 94.5% compute utilization.
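
The mechanics can be sketched in a few lines of NumPy. The chunk size, k, and the 2-bit codebook below are illustrative assumptions, not SparseLoCo's published settings, but they show the pipeline: per-chunk top-k selection, coarse quantization of the survivors, and a large drop in bits on the wire.

```python
import numpy as np

def compress(pseudo_grad, chunk=4096, k=32):
    """Keep the k largest-magnitude entries per chunk, quantized to 2 bits."""
    g = pseudo_grad.reshape(-1, chunk)
    idx = np.argsort(-np.abs(g), axis=1)[:, :k]          # top-k per chunk
    vals = np.take_along_axis(g, idx, axis=1)
    scale = np.abs(vals).max(axis=1, keepdims=True) + 1e-12
    # Toy 2-bit codebook: sign plus one magnitude bit (levels ±0.5, ±1.0).
    q = np.clip(np.round(np.abs(vals) / scale * 2), 1, 2)
    codes = (np.sign(vals) * q / 2).astype(np.float32)   # dequantized here
    return idx, codes, scale                             # (real systems ship packed bits)

def decompress(idx, codes, scale, chunk=4096):
    g = np.zeros((idx.shape[0], chunk), dtype=np.float32)
    np.put_along_axis(g, idx, codes * scale, axis=1)
    return g.reshape(-1)

grad = np.random.default_rng(1).standard_normal(1 << 20).astype(np.float32)
idx, codes, scale = compress(grad)
restored = decompress(idx, codes, scale)

# Wire cost vs. dense fp32: 12-bit in-chunk indices (log2(4096)) + 2-bit values.
bits_sparse = idx.size * 12 + codes.size * 2
ratio = grad.size * 32 / bits_sparse
print(round(ratio))  # ≈ 293x with these toy settings
```

The utilization arithmetic follows directly: 20 minutes (1,200 s) of compute against 70 s of communication per round gives 1,200 / 1,270 ≈ 94.5%.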

For comparison, the previous largest decentralized training effort, Prime Intellect's INTELLECT-1 (a 10B parameter model), required 8.3 minutes of communication overhead per round. Covenant-72B trained a model 7x larger with 7x less communication time.

Gauntlet: Keeping Anonymous Participants Honest

Permissionless participation creates an obvious problem: how do you prevent freeloaders or adversarial actors from submitting garbage gradients and collecting rewards?

Gauntlet is the answer — a blockchain-compatible reward mechanism that validates each peer's contribution through multiple checks:

  • LossScore evaluation: Peers are assessed on whether their gradient updates actually improve model loss on held-out data batches.
  • Liveness and synchronization checks: Ensuring peers are actually training and staying current with the global model state.
  • Duplicate detection: Comparing loss improvement on assigned versus random data to catch peers copying others' work.
  • Norm-based scaling: Contributions are normalized relative to the median, preventing any single peer from dominating updates.
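
A minimal sketch of how a validator might combine a LossScore with median-based normalization follows; the function names, the clipping cap, and the numbers in the toy round are illustrative assumptions, not Gauntlet's actual implementation.

```python
import numpy as np

def loss_score(loss_before: float, loss_after: float) -> float:
    """Positive when a peer's update improved loss on held-out data."""
    return loss_before - loss_after

def normalize_scores(scores):
    """Norm-based scaling sketch: weight peers relative to the median
    score so no single peer can dominate the aggregated update."""
    scores = np.asarray(scores, dtype=float)
    med = np.median(scores)
    if med <= 0:
        return np.zeros_like(scores)           # round rejected
    return np.clip(scores / med, 0.0, 2.0)     # cap outsized contributions

# Toy round: four honest peers and one submitting garbage gradients
# (its update makes held-out loss slightly worse).
improvements = [0.031, 0.028, 0.035, 0.030, -0.002]
weights = normalize_scores(improvements)
print(weights)  # last peer gets weight 0; the rest cluster near 1
```

Clipping against the median rather than the mean is the key design choice in this sketch: a single adversary submitting an enormous (or negative) score shifts the mean but barely moves the median.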

This is what makes Covenant-72B fundamentally different from Prime Intellect's INTELLECT-1 or Psyche's Consilience-40B: those projects required whitelisted participants. Covenant-72B was open to anyone with the hardware.

The Numbers: How Does It Compare?

Benchmark Performance

On zero-shot evaluations, Covenant-72B performs competitively with centralized models trained at similar scale:

| Benchmark | Covenant-72B | K2 (65B, centralized) | LLaMA-2-70B (centralized) |
| --- | --- | --- | --- |
| ARC-Challenge | 56.8% | 53.8% | 57.4% |
| MMLU | 67.1% | 65.5% | 65.6% |
| HellaSwag | 80.6% | 82.9% | 84.3% |
| WinoGrande | 75.9% | 76.4% | 80.4% |
| PIQA | 81.6% | 82.5% | 82.6% |

Covenant-72B outperforms both baselines on MMLU (the broad knowledge benchmark) and ARC-Challenge (scientific reasoning), while trailing modestly on HellaSwag and WinoGrande. The researchers attribute these gaps to differences in data mixture and training recipes rather than infrastructure limitations.

The chat-tuned variant shows particular strength in instruction following (IFEval: 64.7%) and mathematical reasoning (MATH: 26.3%), outperforming K2-Chat on both metrics.

Scale of Participation

  • Average contributing peers per round: 16.9 (capped at 20 replicas)
  • Average active peers per step: 24.4
  • Unique participants: more than 70 over the full training run
  • Hardware per peer: 8x NVIDIA B200 GPUs
  • Total training rounds: ~6,190

Why Anthropic's Co-Founder Is Paying Attention

Jack Clark's analysis in Import AI highlighted a striking asymmetry: decentralized training compute is currently about 1,000x smaller than frontier centralized training. But it's growing at 20x per year, while centralized training grows at 5x per year.

If those growth rates hold, the gap closes within a few years. Clark noted that decentralized training is "technically feasible and may support broader collective development of more powerful models."

This matters because it challenges the implicit assumption in AI governance discussions — that training frontier models will always require the resources of nation-states or trillion-dollar corporations. If a blockchain-coordinated network of anonymous GPU owners can train competitive 72B models today, what happens when the same approach scales to 200B or 400B parameters?
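
Those growth figures imply a concrete timeline. Treating the numbers above as constants, a quick back-of-envelope calculation:

```python
import math

# With decentralized compute growing ~20x/yr and centralized ~5x/yr,
# the relative gap shrinks by 4x per year. Starting from the ~1,000x
# deficit cited above, the crossover point is:
gap = 1000
relative_growth = 20 / 5          # 4x per year
years = math.log(gap) / math.log(relative_growth)
print(round(years, 1))            # ≈ 5 years if both trends hold
```

This is the arithmetic behind "the gap closes within a few years"; of course, neither growth rate is guaranteed to hold as absolute scales increase.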

The Covenant AI Ecosystem

Templar's success has spawned a broader ecosystem called Covenant AI, built on three interconnected platforms:

  • Templar (Subnet 3): Decentralized pre-training — the engine behind Covenant-72B
  • Basilica: Decentralized compute rental — making GPU resources accessible to the network
  • Grail: Decentralized post-training — reinforcement learning from human feedback (RLHF) and alignment

This three-layer stack mirrors the full pipeline of modern AI development, from raw pre-training through fine-tuning to alignment. If all three layers can operate at scale without centralized coordination, it would represent a complete alternative to the vertically integrated approach of labs like OpenAI and Anthropic.

The Competitive Landscape in Decentralized AI Training

Covenant-72B didn't emerge in a vacuum. Several projects are competing to prove decentralized training's viability:

| Project | Parameters | Tokens | Permissionless? | Status |
| --- | --- | --- | --- | --- |
| Covenant-72B (Bittensor) | 72.7B | 1.1T | Yes | Completed |
| Consilience-40B (Psyche) | 40B | — | No (whitelisted) | Completed |
| INTELLECT-1 (Prime Intellect) | 10B | — | No (whitelisted) | Completed |
| INTELLECT-3 (Prime Intellect) | 106B MoE | — | Claimed decentralized | Trained on centralized 512-GPU cluster |
| Gensyn | Protocol layer | — | N/A | $50.6M raised, protocol in development |

The contrast with Prime Intellect is particularly striking. INTELLECT-3, a 106B Mixture-of-Experts model scoring 90.8% on AIME 2024, was marketed as a decentralized AI project — but was actually trained on a centralized 512-GPU cluster. Covenant-72B's fully permissionless, blockchain-verified approach stands in sharp contrast.

Limitations and Honest Challenges

Covenant-72B is a milestone, not a finish line. Several limitations deserve acknowledgment:

Scale gap remains large. At roughly 9 × 10^17 FLOP/s, Covenant-72B's training compute is approximately 1,000x smaller than frontier centralized runs. Matching GPT-4-class models requires closing that gap substantially.

Participation was capped. The 20-replica cap and the requirement of 8x B200 GPUs per peer limit participation to well-resourced contributors. This isn't "train AI on your laptop" — it's decentralized among entities with serious hardware.

Cost redistribution, not reduction. Decentralized training doesn't inherently cost less than centralized training. It changes the financing model — distributing costs across many participants via token incentives rather than concentrating them in a single organization's balance sheet.

Quality gaps in some benchmarks. The model trails centralized baselines on HellaSwag and WinoGrande, suggesting that data curation and training recipe optimization remain areas where centralized labs hold an edge — for now.

What This Means for the Future of AI

Covenant-72B represents a phase transition in the decentralized AI narrative. Prior to this, "decentralized AI training" was either theoretical, limited to small models, or required trusted participants. Now there's a published arXiv paper, open model weights on Hugging Face, and benchmark results showing competitive performance — all from a fully permissionless network coordinated by a blockchain.

The implications cascade across multiple domains:

AI governance: If training can be decentralized, the "regulate the data centers" approach to AI safety becomes insufficient. Policymakers will need frameworks that account for distributed training.

Open-source AI: Covenant-72B's weights are publicly available, adding a 72B-class model to the open-source ecosystem that wasn't funded by any single corporation.

Token economics: Bittensor's TAO token, which incentivized the entire training run, demonstrates a concrete use case for crypto tokens beyond speculation — funding AI research through market-driven incentive mechanisms.

Competitive dynamics: If decentralized training continues scaling at 20x/year, centralized labs face pressure not just from each other but from open, permissionless networks that can't be acquired, regulated as a single entity, or shut down.

The question is no longer whether decentralized AI training works. It's how fast it can close the gap with centralized frontier labs — and what happens to the AI industry's power structure when it does.


BlockEden.xyz provides enterprise-grade blockchain API infrastructure powering the decentralized networks that make projects like Bittensor possible. Explore our API marketplace to build on the infrastructure layer of the decentralized AI revolution.