Solana's Outage History Exposed: A Data-Driven Analysis of Why Firedancer Solves the 2022 Problems but Not the 2024 Ones

A Data-Driven Look at Every Major Solana Outage

I have compiled a comprehensive analysis of every major Solana network disruption since 2022, because understanding the failure modes of the past is the best way to evaluate whether Firedancer actually solves the right problems. The data tells a story that is more nuanced than either the critics or the defenders want to admit.

The Outage Timeline

2022: The Year of Growing Pains

January 2022 – Consensus Stall (Congestion-Induced)
The network experienced severe congestion between January 6 and 12, leading to degraded performance and partial outages. Under heavy load, vote transactions (critical for consensus) were being crowded out by regular transactions. Without confirmed votes, consensus stalled and block production halted. This was fundamentally a resource prioritization bug – the client did not differentiate between consensus-critical and user-submitted transactions under extreme load.
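The fix that eventually shipped was to give votes a protected lane so user traffic cannot starve consensus. A minimal sketch of the idea, with hypothetical names and an invented 25% reserve (this is illustrative, not Agave code):

```python
from collections import deque

# Hypothetical sketch: reserve a slice of per-slot capacity for consensus
# votes so user traffic cannot crowd them out. Names and the 25% reserve
# are invented for illustration, not taken from the Agave codebase.
class PacketScheduler:
    def __init__(self, capacity_per_slot: int, vote_reserve: float = 0.25):
        self.capacity = capacity_per_slot
        self.vote_budget = int(capacity_per_slot * vote_reserve)
        self.votes = deque()
        self.user_txs = deque()

    def submit(self, tx: dict) -> None:
        (self.votes if tx["is_vote"] else self.user_txs).append(tx)

    def drain_slot(self) -> list:
        """Fill one slot: votes first up to their reserve, then user
        transactions, then any remaining votes if space is left."""
        batch = []
        while self.votes and len(batch) < self.vote_budget:
            batch.append(self.votes.popleft())
        while self.user_txs and len(batch) < self.capacity:
            batch.append(self.user_txs.popleft())
        while self.votes and len(batch) < self.capacity:
            batch.append(self.votes.popleft())
        return batch
```

Even with a thousand user transactions queued ahead of them, votes still land because the reserve is drained first – which is exactly the property the January 2022 client lacked.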

April 2022 – NFT Bot Surge (6M RPS)
This was the most dramatic failure. Some nodes reported receiving six million requests per second, generating over 100 Gbps of traffic per node. The trigger was bots competing for NFT mints through Metaplex Candy Machine. Validators ran out of memory and crashed sequentially, stalling consensus. This was simultaneously a networking failure (inability to handle traffic spikes), a resource management failure (no memory limits), and an economic design failure (no cost to spam).
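The "no memory limits" part of that failure is the easiest to illustrate: an ingest path that buffers inbound packets without bound eventually gets OOM-killed under a flood, while one that sheds load survives. A toy bounded queue (names and limits invented):

```python
from collections import deque

# Illustrative only: a bounded ingest queue that drops excess packets
# under flood conditions instead of buffering until the process runs
# out of memory and crashes.
class BoundedIngestQueue:
    def __init__(self, max_packets: int):
        self.buf = deque()
        self.max_packets = max_packets
        self.dropped = 0

    def push(self, packet: bytes) -> bool:
        if len(self.buf) >= self.max_packets:
            self.dropped += 1   # shed load; memory stays bounded
            return False
        self.buf.append(packet)
        return True
```

The trade-off is that shedding load is itself a form of throttling – a point the discussion of Agave's fixes returns to below.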

Late 2022 – Duplicate Block Production
A backup “hot-spare” validator began producing duplicate blocks at the same height. A bug in fork selection logic prevented other validators from correctly resolving the fork, halting consensus. This was a classic distributed systems bug – the kind that only manifests in specific operational configurations that are difficult to test for.

2023: Stability Improves, But…

February 2023 – Oversized Block Propagation
A malfunctioning validator broadcast an unusually large block that overwhelmed Solana’s Turbine block propagation protocol. This cascaded into a network-wide outage. The fix required protocol-level changes to how blocks are shredded and distributed. This one is particularly relevant to the Firedancer discussion because Firedancer’s Turbine implementation is independent – it could have potentially avoided this specific failure mode.

2024: The Last Straw?

February 2024 – LoadedPrograms Bug (5-Hour Outage)
A bug in the LoadedPrograms cache caused validators to crash. The roughly five-hour outage required a coordinated validator restart. This is the most concerning outage from a client diversity perspective because it was a consensus-critical execution bug – exactly the kind of failure that an independent client should catch. But because Frankendancer shares Agave’s execution runtime, it would have been affected too.

What Pattern Do These Outages Reveal?

Looking at the data, Solana’s outages fall into three categories:

Category                    | Outages             | Would Firedancer Help?
Networking/traffic handling | Jan 2022, Apr 2022  | Yes – Firedancer’s kernel bypass and custom QUIC stack directly address these
Consensus/fork selection    | Late 2022, Feb 2023 | Partially – an independent implementation might diverge on edge cases
Execution runtime bugs      | Feb 2024            | No (Frankendancer) / Yes (full Firedancer)

This is the critical insight: Firedancer solves the right problems for 2022-era Solana, but the network’s failure modes have evolved. The networking-layer outages have been largely addressed through Agave improvements independent of Firedancer. The remaining risks are in the execution and consensus layers – precisely the components that Frankendancer shares with Agave.

The Reliability Numbers in Context

It is worth noting that Solana’s reliability has improved dramatically. Since the February 2024 outage, the network has maintained continuous uptime through record-breaking transaction volumes. The 2024-2025 period saw:

  • Daily transaction counts regularly exceeding 50 million
  • Sustained peak throughput above 4,000 TPS
  • Zero major outages for over 12 months
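Those two throughput figures are consistent with each other, and the gap between them is informative: 50 million transactions per day averages out to under 600 TPS, so a 4,000 TPS peak implies heavily bursty load. A quick sanity check:

```python
# Sanity-check the headline numbers: average TPS implied by a
# 50M-transaction day versus the reported 4,000 TPS peak.
SECONDS_PER_DAY = 24 * 60 * 60          # 86,400

daily_txs = 50_000_000
avg_tps = daily_txs / SECONDS_PER_DAY   # ~578.7 TPS on average
peak_tps = 4_000
burst_ratio = peak_tps / avg_tps        # ~6.9x peak-to-average

print(f"average: {avg_tps:.0f} TPS, peak/avg ratio: {burst_ratio:.1f}x")
```

A roughly 7x peak-to-average ratio is precisely the traffic shape that stresses admission control rather than steady-state capacity.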

This improvement came primarily from Agave client fixes, not from Firedancer adoption. Vote transaction prioritization, memory management improvements, and Turbine protocol hardening were all Agave-side changes.

Why a Second Client Is Still Critical

Despite the improved reliability, a second independent client remains essential for three reasons:

  1. Systematic blind spots: Every codebase has bugs that its developers cannot see because they share the same mental model. An independent team implementing the same protocol from scratch will make different assumptions and catch different edge cases.

  2. Operational resilience: If a zero-day vulnerability is discovered in Agave, validators need an alternative they can switch to immediately. Without Firedancer, the only response to a critical Agave vulnerability is to halt the network entirely while a patch is developed.

  3. Performance competition: Firedancer’s existence has already motivated Agave performance improvements. The competitive pressure between two client teams produces better software for the entire ecosystem.

The Bottom Line

Solana’s outage history strongly supports the need for client diversity. But the specific kind of diversity matters enormously. Frankendancer provides meaningful networking diversity that would have prevented 2-3 of the historical outages. Full Firedancer would provide comprehensive diversity that addresses all categories.

The current 21% Frankendancer stake is progress, but it is not the finish line. The ecosystem needs to push toward full Firedancer adoption and the 33% threshold as quickly as operational safety allows.


Lisa Rodriguez is an L2 scaling engineer and former infrastructure lead at Polygon and Optimism.

Lisa, this is exactly the kind of rigorous, evidence-based analysis that the Solana client diversity debate needs. Your categorization table is excellent.

I want to push back on one claim though: you say Firedancer “solves the right problems for 2022-era Solana.” I think that understates the networking improvements’ ongoing relevance.

The reason Solana has not had a networking-induced outage since 2022 is partly because transaction volumes have been managed through fee markets and priority fees. But Solana’s ambition is to scale to millions of daily active users and significantly higher throughput. At those levels, the networking bottlenecks that caused the 2022 outages will resurface – and Firedancer’s kernel-bypass architecture provides a fundamental, not incremental, solution.

Consider the April 2022 outage: 6 million RPS and 100 Gbps per node. Agave’s fix was essentially to add traffic throttling – a valid solution, but one that caps throughput. Firedancer’s approach is to handle the traffic directly at the hardware level without throttling. These are philosophically different approaches: Agave says “reject excess traffic,” Firedancer says “process all the traffic.”
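The "reject excess traffic" stance can be pictured as a token bucket at the admission edge: requests beyond a configured refill rate are dropped before they consume downstream resources. A toy version with made-up rates:

```python
# Toy token-bucket admission control: the "reject excess traffic" stance.
# The rate and burst values are illustrative, not Agave's actual limits.
class TokenBucket:
    def __init__(self, rate_per_sec: float, burst: float):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = burst
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False   # excess traffic is dropped, capping throughput
```

A kernel-bypass design aims instead to raise the hardware ceiling high enough that an admission gate like this is rarely the binding constraint.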

I also want to note something about your outage timeline. You describe these as “outages” but Solana’s architecture means they were actually halts – the network stopped producing blocks entirely. This is qualitatively different from Ethereum, where individual client bugs cause degraded performance (missed attestations, inactivity leaks) but the network continues operating. Solana’s tight coupling between networking, consensus, and execution means any component failure tends to cascade into a full halt.

This architectural difference is actually an argument for faster Firedancer adoption, not slower. In a system where any component failure causes a full halt, having independent implementations of every component is proportionally more important.

Your point about full Firedancer vs. Frankendancer addressing different failure categories is the most important insight in this thread. The community needs to understand that the current migration is Phase 1, not the endgame.

Lisa’s categorization is useful, but I think the “Would Firedancer Help?” column needs more nuance for the consensus/fork selection category.

The late 2022 duplicate block outage and the February 2023 oversized block outage are both cases where a second client with an independent consensus implementation would have behaved differently. In the duplicate block scenario, Firedancer’s fork choice rule implementation – being written from scratch – might not have had the same bug that prevented validators from building on the correct chain. In the oversized block scenario, Firedancer’s independent Turbine implementation would process block propagation differently, potentially avoiding the cascade.

The key word there is “might.” And that uncertainty is exactly the problem. Without formal verification of both clients’ fork choice implementations, we cannot guarantee that independent implementations will diverge in the right direction during a failure. It is entirely possible that two independently implemented clients hit the same logical bug because the protocol specification (such as it is) is ambiguous on edge cases.

This is why I keep emphasizing the formal specification gap. Ethereum’s consensus-specs repository is executable – you can literally run the spec and compare client behavior against it. Solana’s “specification” is the Agave source code. When Firedancer encounters an edge case not covered by explicit documentation, the team’s approach is to match Agave’s behavior – which means replicating bugs alongside correct behavior.
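What an executable spec buys you is a mechanical check: run the reference and every client over the same inputs and diff the outputs, instead of defining correctness as "whatever Agave does." A toy differential test (the fee function is an invented stand-in, not real Solana logic):

```python
# Toy differential test: compare two independent "client" implementations
# against an executable reference spec. The fee-rounding function is an
# invented stand-in for any consensus-critical computation.
def spec_fee(compute_units: int) -> int:
    """Executable reference: fee = ceil(compute_units / 1000)."""
    return -(-compute_units // 1000)

def client_a_fee(compute_units: int) -> int:
    return (compute_units + 999) // 1000   # matches the spec

def client_b_fee(compute_units: int) -> int:
    return compute_units // 1000           # off-by-one on non-multiples

def differential_test(inputs):
    mismatches = []
    for cu in inputs:
        expected = spec_fee(cu)
        for name, impl in [("A", client_a_fee), ("B", client_b_fee)]:
            if impl(cu) != expected:
                mismatches.append((name, cu, impl(cu), expected))
    return mismatches

# Edge cases (zero, exact multiples, off-by-one values) expose client B.
bugs = differential_test([0, 1, 999, 1000, 1001])
```

Without the reference function, the only available definition of "correct" would be one of the two implementations – which is exactly Solana's current situation.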

On the reliability improvement point: the 12+ months of uptime is genuinely impressive, but sample size matters. Solana had long periods of stability between outages before too. The February 2024 outage came after months of clean operation. Absence of failure is not proof of resilience; it is just absence of the specific trigger conditions.

The true test of Firedancer’s value will come during the next stress event – whether that is a memecoin mania, a DeFi liquidation cascade, or a targeted attack. Until then, we are operating on engineering analysis and hope, not empirical evidence.

I want to focus on something Lisa mentioned but did not elaborate on: the coordinated validator restart that resolved the February 2024 outage.

From a security perspective, coordinated restarts are a governance mechanism that operates entirely outside the protocol’s formal security model. When the network halts and validators coordinate via Discord and Telegram to restart with a patched binary, they are essentially performing a manual consensus override. This works when the validator set is cooperative and aligned, but it is a centralization risk that undermines the trustless guarantees blockchain systems are supposed to provide.

Client diversity reduces the probability of needing coordinated restarts, but it does not eliminate it. If a bug affects the protocol specification itself (as opposed to a single implementation), all compliant clients would reproduce it. The only defense in that scenario is still coordinated human intervention.

What concerns me specifically about Solana’s outage pattern is the recovery mechanism. In every major outage, recovery required:

  1. A core team identifying the bug
  2. Developing and testing a patch
  3. Distributing the patch to validators
  4. Coordinating a simultaneous restart

Each step introduces latency and centralization. Step 1 depends on a small number of engineers who understand the codebase deeply. Step 3 assumes validators trust the patch without independent verification. Step 4 requires a communication channel (typically Discord) that is itself a single point of failure.

Firedancer improves this by providing an alternative execution path. If an Agave-specific bug is identified, Firedancer validators can continue operating while Agave validators patch and restart. But this only works if Firedancer has sufficient stake (33%+) and if the bug is truly Agave-specific – not a shared component bug that affects Frankendancer too.
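The stake thresholds in that paragraph fall out of BFT supermajority arithmetic. A sketch, using the greater-than-two-thirds finality rule and the stake shares quoted in this thread:

```python
from fractions import Fraction

# Tower BFT finality needs a supermajority: strictly more than 2/3 of stake.
# Two consequences for client diversity (stake shares from the thread):
#  - Liveness: if every validator running one client crashes, the rest can
#    keep finalizing only if the surviving stake still exceeds 2/3.
#  - Safety: a single client can finalize blocks by itself only if it
#    holds more than 2/3 of stake.
SUPERMAJORITY = Fraction(2, 3)

def survives_crash_of(client_stake: Fraction) -> bool:
    return (1 - client_stake) > SUPERMAJORITY

def can_finalize_alone(client_stake: Fraction) -> bool:
    return client_stake > SUPERMAJORITY

# Today (~21% Frankendancer / ~79% Agave):
assert survives_crash_of(Fraction(21, 100))      # Frankendancer-only crash: OK
assert not survives_crash_of(Fraction(79, 100))  # Agave-only crash: full halt
assert can_finalize_alone(Fraction(79, 100))     # an Agave bug can finalize alone

# At the 33%+ target (e.g. 34% / 66%), neither client holds a supermajority:
assert not can_finalize_alone(Fraction(66, 100))
assert not can_finalize_alone(Fraction(34, 100))
```

Note what the last two lines do and do not claim: past 33%, no single client can finalize a bad chain on its own (a safety gain), but a crash of the majority client still halts the network – liveness during an Agave-specific bug additionally requires that the bug really is Agave-only.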

The outage history analysis is valuable precisely because it shows that the failure modes are evolving faster than the mitigation strategies. Each outage was caused by a different mechanism. This suggests that future outages will also be novel, which means the defense must be architectural (independent implementations) rather than tactical (fixing specific bugs).