Real-Time Streaming vs Batch Indexing: How Goldsky Is Rethinking Blockchain Data Pipelines

The indexing wars conversation has mostly focused on who can run subgraphs faster and cheaper. But Goldsky is doing something more interesting: they are arguing that the subgraph model itself is the wrong abstraction for many use cases. And after spending three weeks integrating their streaming platform, I think they might be right.

The Subgraph Model: A Quick Recap

The Graph popularized the subgraph as the standard unit of blockchain data indexing. You write a subgraph (a manifest, a GraphQL schema, and mapping handlers) that defines:

  • Which smart contracts to watch
  • Which events to index
  • How to transform and store the data
  • A GraphQL schema for querying

This batch indexing model works well for historical queries: “show me all DEX swaps in the last 24 hours” or “what is the total value locked in this lending protocol?” You write your subgraph, deploy it, wait for it to sync, and then query it via GraphQL.

But the model has fundamental limitations for real-time use cases:

1. Polling latency. Subgraphs are updated in batches as new blocks are processed. Your application polls the GraphQL endpoint to check for updates. The gap between an on-chain event occurring and your application receiving the data is typically 5-15 seconds even when the indexer is fully caught up, and can be much longer if it is behind.

2. No push mechanism. The subgraph model is inherently pull-based. Your application has to ask “is there new data?” repeatedly. This wastes resources and creates unnecessary latency.

3. Query-time computation. Complex aggregations are computed at query time, which means every dashboard refresh re-executes expensive calculations. If you have 1,000 users refreshing their portfolio value simultaneously, that is 1,000 identical computation passes.

4. The sync gap problem. When a new subgraph is deployed or an existing one needs to re-index, there can be hours or even days of sync time. During this period, your application either serves stale data or fails entirely.
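
To put rough numbers on the first point, here is a back-of-the-envelope sketch of how the polling interval stacks on top of indexing lag. The function and parameter names are mine, not part of any SDK, and it assumes events arrive uniformly at random between polls.

```javascript
// Rough model of end-to-end detection delay for a polled subgraph.
// Assumption: an event becomes queryable `indexerLagSec` after it lands
// on-chain, and the client polls on a fixed interval.
function pollingDelay({ indexerLagSec, pollIntervalSec }) {
  return {
    bestCaseSec: indexerLagSec,                      // data lands just before a poll
    averageSec: indexerLagSec + pollIntervalSec / 2, // uniform arrival between polls
    worstCaseSec: indexerLagSec + pollIntervalSec,   // data lands just after a poll
  };
}

// Example: 5 s of indexing lag, polled every 5 s.
console.log(pollingDelay({ indexerLagSec: 5, pollIntervalSec: 5 }));
// bestCaseSec 5, averageSec 7.5, worstCaseSec 10
```

Shortening the polling interval only shrinks the second term; the indexing lag is a floor that polling harder cannot remove.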

Goldsky’s Streaming Paradigm

Goldsky’s approach flips the model. Instead of indexing data into a proprietary store and exposing it via GraphQL, they stream blockchain events directly into your infrastructure.

Mirror: Your Database, Their Pipeline

The Mirror product continuously replicates on-chain data into your own database – PostgreSQL, ClickHouse, BigQuery, or others. Your application reads from its own database, not an external API.

This seemingly simple change has profound implications:

  • No network round trip to an external API on reads, because the data is already in your own database.
  • Full SQL flexibility. You are not constrained by GraphQL schema design. Any SQL query works.
  • Pre-computed aggregations. You can create materialized views, triggers, and computed columns using standard database tools.
  • No external dependency for reads. Even if Goldsky’s pipeline goes down temporarily, your application continues serving data from your database. The pipeline catches up when connectivity is restored.
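
The pre-computed aggregation point is easiest to see in miniature. The sketch below is plain JavaScript standing in for what a materialized view or trigger would do inside the database: the aggregate is updated once per incoming event, so a read is a lookup rather than a recomputation. The class name and the event fields (`pool`, `amountUsd`) are hypothetical.

```javascript
// Keeps a running per-pool volume total, updated on write.
class PoolVolume {
  constructor() {
    this.totals = new Map(); // pool address -> cumulative USD volume
  }

  // Called once per mirrored swap event (the "write" path).
  record(swap) {
    const current = this.totals.get(swap.pool) ?? 0;
    this.totals.set(swap.pool, current + swap.amountUsd);
  }

  // Called on every dashboard refresh (the "read" path): a map lookup, no re-scan.
  volumeFor(pool) {
    return this.totals.get(pool) ?? 0;
  }
}

const volume = new PoolVolume();
volume.record({ pool: "0xpoolA", amountUsd: 1200 });
volume.record({ pool: "0xpoolA", amountUsd: 300 });
volume.record({ pool: "0xpoolB", amountUsd: 50 });
console.log(volume.volumeFor("0xpoolA")); // 1500
```

With 1,000 users refreshing at once, that is 1,000 lookups of an already-computed number instead of 1,000 aggregation passes.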

Webhooks: Event-Driven Architecture

Goldsky’s webhook system pushes notifications to your backend when specific on-chain events occur. Instead of polling for new data, your application receives callbacks.

For example, instead of:

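// Polling every 5 seconds
setInterval(async () => {
  try {
    const result = await graphqlClient.query(GET_RECENT_SWAPS);
    updateUI(result.data);
  } catch (err) {
    console.error("Polling failed:", err); // do not let a failed poll go unhandled
  }
}, 5000);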

You get:

// Webhook handler - called shortly after the event occurs
// Assumes an Express app with JSON body parsing enabled: app.use(express.json())
app.post("/webhook/swap", (req, res) => {
  const swapEvent = req.body;
  notifyUser(swapEvent);
  updateAnalytics(swapEvent);
  res.status(200).send("ok");
});

This is the same event-driven architecture that powers modern Web2 applications (Stripe webhooks, Twilio callbacks, etc.), applied to blockchain data.

When Streaming Beats Batch Indexing

After three weeks of integration, here are the use cases where Goldsky’s streaming approach clearly wins:

1. Real-time notifications. “Alert me when my liquidation threshold is approaching.” With subgraphs, you poll every N seconds and hope you catch the event in time. With webhooks, you get notified within seconds of the on-chain event.

2. Live dashboards. Trading interfaces, portfolio trackers, and DeFi dashboards that need to reflect current state. Mirror keeps your database in sync continuously, so every page load shows current data.

3. Backend event processing. Automated actions triggered by on-chain events – rebalancing strategies, arbitrage execution, governance vote notifications. The webhook model eliminates the polling overhead entirely.

4. Analytics and reporting. Having blockchain data in your own PostgreSQL or ClickHouse instance means you can use standard BI tools (Metabase, Grafana, Looker) to create dashboards without any custom integration.

When Batch Indexing Still Makes Sense

Goldsky’s streaming is not universally better. Batch indexing with subgraphs still wins for:

1. Simple, infrequent queries. If you just need to look up a token balance or check an NFT owner occasionally, spinning up a streaming pipeline is overkill.

2. Historical deep dives. Complex queries against months of historical data are better served by a fully indexed and optimized GraphQL endpoint.

3. Prototyping and development. Subgraphs are faster to set up for quick experiments. Mirror requires database infrastructure and pipeline configuration.

The Cost Question

Goldsky’s pricing is based on data volume streamed and pipeline complexity, which can get expensive for high-throughput chains. A full Ethereum mainnet mirror with all token transfers and DeFi events could cost significantly more than an equivalent Ormi subgraph.

However, the total cost of ownership calculation is different because you are running your own database. If you already have PostgreSQL or ClickHouse infrastructure, the marginal cost of adding Goldsky’s Mirror is lower than standing up a separate indexing service.

My Assessment

Goldsky is not trying to win the “faster subgraph” race. They are arguing that the subgraph abstraction is inadequate for modern application needs, and I think they are making a compelling case.

The future of blockchain data infrastructure is probably not a single paradigm. It is a combination of batch indexing (for historical queries), real-time streaming (for live data), and push-based events (for automated workflows). Teams that adopt a multi-paradigm approach will build better products.

Has anyone else experimented with Goldsky’s streaming products? I would love to hear about other teams’ experiences, especially around reliability and cost at scale.

Bob, this is an excellent technical analysis. As someone building yield optimization infrastructure, real-time data is not a nice-to-have – it is literally the difference between profitable and unprofitable strategies.

Let me share a concrete example from our yield aggregator:

The liquidation monitoring problem.

Our protocol monitors lending positions across Aave, Compound, and several smaller lending markets. When a position approaches its liquidation threshold, we need to notify users AND potentially execute automated de-leveraging transactions.

With our old subgraph-based approach:

  • We polled The Graph every 10 seconds for position health factors
  • By the time we detected a position at risk, processed the alert, and executed a transaction, 15-25 seconds had elapsed
  • In a fast-moving market crash, that is an eternity. We had users get liquidated while our system was still processing the alert from two polling cycles ago.

After switching to Goldsky’s webhooks:

  • We receive callbacks within 2-3 seconds of a relevant on-chain price update
  • Our system processes the webhook and can initiate a protective transaction within 5 seconds total
  • The improvement in response time has measurably reduced liquidation losses for our users

The Mirror product for our analytics.

We also use Goldsky Mirror to stream all DEX swap events into our ClickHouse cluster. This lets us run real-time yield calculations and historical backtesting against the same dataset using standard SQL. Before, we maintained two separate data pipelines: a subgraph for real-time queries and a custom ETL job for analytics. Now it is one pipeline feeding one database.

The cost trade-off:

Goldsky is more expensive than our old hosted Graph setup (which was free). We are paying roughly $800/month for our streaming pipelines. But the value we extract from real-time data far exceeds that cost. A single prevented liquidation can save a user thousands of dollars.

One concern I have:

The webhook delivery guarantees are not as robust as I would like. We have experienced occasional missed webhooks during periods of high chain activity. Goldsky acknowledges this and recommends implementing your own catch-up mechanism by periodically polling your Mirror database for any events your webhook handler might have missed. It works, but it adds complexity.
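
For anyone wondering what that catch-up looks like, here is a minimal sketch of the idea, not our production code. All names are hypothetical: `loadRecentRows` stands for a query against the Mirror-fed table over a recent window, and each row is assumed to carry a unique `id` (for example, transaction hash plus log index).

```javascript
// Identify events present in the mirrored table that the webhook handler
// never processed.
//   mirroredRows - rows read from the Mirror-fed table for a recent window
//   processedIds - a Set of event ids the webhook handler already handled
function findMissedEvents(mirroredRows, processedIds) {
  return mirroredRows.filter((row) => !processedIds.has(row.id));
}

// Run on a timer. `loadRecentRows` and `handleEvent` are caller-supplied.
async function catchUp(loadRecentRows, processedIds, handleEvent) {
  const rows = await loadRecentRows();
  const missed = findMissedEvents(rows, processedIds);
  for (const row of missed) {
    await handleEvent(row);   // same code path the webhook would have used
    processedIds.add(row.id); // mark as handled so it is not replayed
  }
  return missed.length;
}
```

The webhook handler has to add to the same `processedIds` set (or table) for this to work, which is where most of the extra complexity comes from.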

Overall, I agree with your assessment. For DeFi protocols where real-time data has direct financial implications, the streaming paradigm is clearly superior to polling subgraphs.

Great post, Bob. I want to dig into the data engineering implications of Goldsky’s Mirror product because this is exactly the kind of architecture I deal with daily.

From a data pipeline perspective, Mirror is solving the right problem.

In traditional data engineering, we distinguish between:

  • ETL (Extract, Transform, Load): Pull data from a source, transform it, load it into your warehouse. This is the subgraph model.
  • CDC (Change Data Capture): Stream changes from a source into your systems as they happen. This is Mirror’s model.

CDC has been the standard pattern in enterprise data engineering for decades. Tools like Debezium, AWS DMS, and Fivetran all implement this pattern. Goldsky is essentially bringing CDC to blockchain data, and it is about time.

The database choice matters enormously.

When you use Mirror, you need to decide where to stream the data. This decision has huge implications:

  • PostgreSQL: Great for transactional queries and application backends. Good enough for moderate analytics. But it struggles with time-series aggregations at scale (millions of rows per day).

  • ClickHouse: Excellent for analytics and aggregations over large datasets. Column-oriented storage is perfect for “sum all swaps in the last 30 days” type queries. But it is not great for point lookups like “get this specific transaction.”

  • BigQuery: Scalable analytics without managing infrastructure. But the per-query pricing can get expensive for high-frequency queries, and the latency is too high for application backends.

We ended up running a dual-sink setup: Mirror streams into both PostgreSQL (for our application backend) and ClickHouse (for analytics and reporting). This adds complexity but gives us the best of both worlds.

Performance numbers from our deployment:

  • Mirror lag (time from on-chain event to database write): typically 3-8 seconds on Ethereum mainnet, 1-3 seconds on L2s
  • Our PostgreSQL handles about 2,000 reads per second against the mirrored data with sub-5ms response times
  • Analytics queries that previously took 30+ seconds against The Graph now complete in under 2 seconds on ClickHouse

The backfill challenge:

One area where Mirror is less polished is historical backfilling. When you first set up a pipeline, it needs to process all historical blocks. For contracts that have been active since early Ethereum, this backfill can take days. During this period, your database has incomplete data, which your application needs to handle gracefully.

Subgraphs have the same problem (initial sync time), but at least with subgraphs, you can query the partially synced data through The Graph’s API. With Mirror, your database either has the data or it does not.
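
One way to handle this gracefully is to gate reads on backfill progress. A minimal sketch, under two assumptions that may not hold for every pipeline: the backfill proceeds in block order (so everything at or below the last mirrored block is present), and you can get the current chain head from an RPC node. The function and field names are illustrative.

```javascript
// Decide whether mirrored data is complete enough to serve.
function backfillStatus({ startBlock, lastMirroredBlock, chainHeadBlock, maxLagBlocks = 5 }) {
  const total = chainHeadBlock - startBlock;
  const done = Math.max(0, lastMirroredBlock - startBlock);
  const progress = total <= 0 ? 1 : Math.min(1, done / total);
  const lagBlocks = Math.max(0, chainHeadBlock - lastMirroredBlock);
  // "ready" means the mirror is within maxLagBlocks of the chain head.
  return { progress, lagBlocks, ready: lagBlocks <= maxLagBlocks };
}

const backfill = backfillStatus({
  startBlock: 10_000_000,
  lastMirroredBlock: 15_000_000,
  chainHeadBlock: 20_000_000,
});
console.log(backfill); // progress 0.5, lagBlocks 5000000, ready false
```

While `ready` is false, the application can show a "still syncing" state or fall back to another source instead of silently serving partial results.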

My recommendation:

If you have a data engineering team and existing database infrastructure, Mirror is a no-brainer. It fits naturally into modern data stack architectures. If you are a small team without database expertise, stick with subgraphs (via Ormi or The Graph) – the managed experience is simpler to operate.

I want to raise a security dimension that I have not seen discussed in the streaming vs. batch indexing debate.

Data integrity in streaming pipelines.

When you use a subgraph on The Graph’s decentralized network, there is a verification mechanism. Multiple indexers process the same data, and the network can detect and penalize dishonest indexers through dispute resolution. Your query results have a degree of trustless verification.

With Goldsky’s streaming pipeline, you are trusting a single provider to deliver accurate data to your database. There is no independent verification layer. If Goldsky’s pipeline has a bug that drops events, misorders transactions, or delivers corrupted data, your application will silently serve incorrect information.

Why this matters for security-critical applications:

Consider a DeFi protocol that uses mirrored data to make automated decisions – rebalancing, liquidations, or governance vote counting. If the streamed data is incorrect or delayed, the automated system could make wrong decisions with real financial consequences.

I have seen three categories of data integrity issues in streaming pipelines:

  1. Dropped events. The pipeline misses an event due to network issues or processing errors. Diana already mentioned this with webhook delivery. For a security-critical system, a single dropped liquidation event could mean millions in losses.

  2. Reorg handling. When a chain reorganization occurs, previously confirmed events become invalid. A streaming pipeline needs to handle reorgs by either re-streaming corrected data or notifying the consumer that previous data was invalidated. The subgraph model handles this more gracefully because the indexer maintains state and can re-process affected blocks.

  3. Ordering guarantees. In complex DeFi interactions, the order of events within a block matters. If two transactions interact with the same pool in the same block, the order of processing determines the correct state. Streaming pipelines need to preserve this ordering, which is harder than it sounds in distributed systems.
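
To make the last two categories concrete, here is a small consumer-side sketch. It is illustrative only and not how any particular provider implements it. The ordering comparator relies on the standard `blockNumber`, `transactionIndex`, and `logIndex` fields of EVM logs; the `ReorgTracker` class is a hypothetical helper that assumes block headers arrive without gaps.

```javascript
// Canonical ordering for EVM logs: block first, then position in the block.
// logIndex is already block-wide; transactionIndex is kept for readability.
function compareEvents(a, b) {
  return (
    a.blockNumber - b.blockNumber ||
    a.transactionIndex - b.transactionIndex ||
    a.logIndex - b.logIndex
  );
}

// Tracks recently seen headers and reports which blocks were orphaned
// when a new header does not extend the current tip.
class ReorgTracker {
  constructor() {
    this.chain = []; // [{ number, hash, parentHash }], oldest first
  }

  // Returns the numbers of blocks rolled back (empty array if no reorg).
  accept(header) {
    const orphaned = [];
    while (this.chain.length > 0) {
      const tip = this.chain[this.chain.length - 1];
      if (tip.hash === header.parentHash) break; // header extends this block
      orphaned.push(this.chain.pop().number);    // tip is no longer canonical
    }
    this.chain.push(header);
    return orphaned;
  }
}

const tracker = new ReorgTracker();
tracker.accept({ number: 100, hash: "0xa", parentHash: "0x9" });
tracker.accept({ number: 101, hash: "0xb", parentHash: "0xa" });
// A competing block 101 arrives whose parent is block 100.
const rolledBack = tracker.accept({ number: 101, hash: "0xc", parentHash: "0xa" });
console.log(rolledBack); // [ 101 ]
```

The consumer would then delete or invalidate any stored events whose block number appears in `rolledBack` before applying the events from the replacement block.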

My recommendations for teams using streaming infrastructure:

  1. Implement data validation checksums. Periodically compare your mirrored data against a known-good source (even if that source is slower) to detect discrepancies.

  2. Build reconciliation processes. Run daily or hourly reconciliation jobs that verify your streamed data against direct RPC calls for critical data points.

  3. Design for eventual consistency. Accept that your streamed data might be temporarily incorrect and build your application logic to handle this gracefully.

  4. Do not use streamed data for security-critical automated decisions without independent verification. A webhook telling you a position is at risk should trigger a verification step (direct RPC call to the smart contract) before executing a liquidation.
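
The fourth recommendation can be sketched in a few lines. This is a minimal illustration under stated assumptions: `readHealthFactor` and `executeProtection` are hypothetical functions injected by the caller, with the former standing in for a direct RPC read of the lending contract, and `threshold` is whatever health factor your protocol treats as at risk.

```javascript
// Treat a webhook as a hint, not a fact: re-read the value from the chain
// before doing anything irreversible.
async function verifyThenAct(event, { readHealthFactor, executeProtection, threshold }) {
  const onChain = await readHealthFactor(event.account);
  if (onChain >= threshold) {
    // The streamed signal was stale or wrong; take no action.
    return { acted: false, onChain };
  }
  await executeProtection(event.account);
  return { acted: true, onChain };
}
```

The extra RPC call costs some latency, but it means a bad or replayed webhook cannot trigger a transaction on its own.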

The streaming paradigm is powerful, and I understand its appeal for performance. Still, trust but verify. Every data pipeline can fail, and in DeFi, data failures cost real money.