Substreams: Massively Faster Indexing Performance for Subgraphs

The Graph ecosystem has grown substantially over the past year, with five Core Developer teams now working full-time to enhance The Graph’s indexing and query capabilities for the world. StreamingFast, the first additional team to join as a Core Dev after Edge & Node, brings both an incredible pool of talent and powerful technology to further the protocol. One of the most exciting innovations is coming to fruition soon: substreams.

StreamingFast (formerly dfuse) was founded in 2018 and provided high-performance, cross-chain centralized indexing services. Interactions with the Edge & Node team convinced StreamingFast that decentralization is the most effective and scalable way to build for the future. Subsequently, StreamingFast accepted a grant from The Graph Foundation and, in June 2021, joined as a Graph Core Developer team to work full-time on The Graph ecosystem. This decentralized version of M&A was the first of its kind (but not the last).

In joining as a Graph Core Developer team, StreamingFast brought Firehose, a high-performance method of ingesting data from blockchains, and started integrating it into The Graph. At that time, an extremely complex subgraph could take weeks to sync, creating friction for developers building on The Graph. StreamingFast created a prototype called Sparkle, which helped decrease sync time on that subgraph from weeks to around six hours. Now, StreamingFast has evolved Sparkle’s capabilities and created substreams that can scale across all subgraphs on all chains.

How Substreams Work

RPC-based subgraphs have a linear indexing model for processing blockchain data (i.e. they process events one at a time, in order). They do so by polling Ethereum clients through API calls. Firehose technology replaces those polling API calls with a push-based stream that delivers data to the indexing node faster. This helps increase the speed of syncing and indexing.
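The difference between the two models can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `CHAIN`, `PollingClient`, and `block_stream` are hypothetical names invented here, the chain is an in-memory dictionary, and a generator stands in for a real network stream. It is not the Firehose API.

```python
from typing import Dict, Iterator, List

# Hypothetical in-memory "chain": block number -> list of event names.
CHAIN: Dict[int, List[str]] = {
    n: [f"event-{n}-{i}" for i in range(2)] for n in range(1, 6)
}


class PollingClient:
    """Pull model: the indexer issues a separate request for every block."""

    def __init__(self) -> None:
        self.requests = 0

    def get_block(self, number: int) -> List[str]:
        self.requests += 1  # one round trip per block
        return CHAIN[number]


def index_by_polling(client: PollingClient, start: int, end: int) -> List[str]:
    """Collect events by asking for each block in turn."""
    events: List[str] = []
    for number in range(start, end + 1):
        events.extend(client.get_block(number))
    return events


def block_stream(start: int, end: int) -> Iterator[List[str]]:
    """Push model stand-in: subscribe once, then blocks arrive in order."""
    for number in range(start, end + 1):
        yield CHAIN[number]


def index_by_stream(start: int, end: int) -> List[str]:
    """Collect events from a single subscription."""
    events: List[str] = []
    for block in block_stream(start, end):
        events.extend(block)
    return events
```

Both functions produce the same events in the same order. The polling version pays one request per block, while the streaming version opens a single subscription for the whole range.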

Substreams take things even further by enabling massively parallelized streaming data. Substreams can be combined and aggregated in powerful new ways to feed data into subgraphs or end-user applications in a fraction of the time. With substream parallelization, some subgraphs could sync more than 100x faster.
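As a rough sketch of the parallelization idea, the block range can be split into segments that are processed independently and then merged. Everything below is an assumption made for illustration: the `BLOCKS` data, the segment size, and the function names are hypothetical, and a thread pool is used only to show the structure (it is not how substreams schedule work, and threads will not speed up CPU-bound Python code).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

# Hypothetical block data: block number -> list of (pair, volume) trades.
BLOCKS: Dict[int, List[Tuple[str, int]]] = {
    n: [("ETH/USDC", n), ("WBTC/ETH", 2 * n)] for n in range(1, 101)
}

Segment = Tuple[int, int]


def split_range(start: int, end: int, size: int) -> List[Segment]:
    """Split [start, end] into inclusive segments of at most `size` blocks."""
    return [(lo, min(lo + size - 1, end)) for lo in range(start, end + 1, size)]


def volume_for_segment(segment: Segment) -> Dict[str, int]:
    """Sum traded volume per pair for one segment, independent of the others."""
    lo, hi = segment
    totals: Dict[str, int] = {}
    for number in range(lo, hi + 1):
        for pair, volume in BLOCKS[number]:
            totals[pair] = totals.get(pair, 0) + volume
    return totals


def merge(partials: List[Dict[str, int]]) -> Dict[str, int]:
    """Combine per-segment totals into totals for the whole range."""
    totals: Dict[str, int] = {}
    for partial in partials:
        for pair, volume in partial.items():
            totals[pair] = totals.get(pair, 0) + volume
    return totals


def linear_volume(start: int, end: int) -> Dict[str, int]:
    """Baseline: process the whole range as one segment, in order."""
    return volume_for_segment((start, end))


def parallel_volume(start: int, end: int, size: int = 25, workers: int = 4) -> Dict[str, int]:
    """Process segments concurrently, then merge the partial results."""
    segments = split_range(start, end, size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(volume_for_segment, segments))
    return merge(partials)
```

The approach works here because summing volume is associative: each segment can be computed without knowing the result of the previous one, and the merge step restores the full answer.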

With substreams, the data pipeline can be broken down into four stages:

  • Extract (via Firehose)
  • Transform (via Substreams and Subgraphs)
  • Load (to the postgres database)
  • Query (serving queries to users)
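The four stages above can be sketched as a small Python pipeline. This is a toy under stated assumptions: the block data and function names are hypothetical, and an in-memory SQLite database stands in for the postgres database so the example runs with the standard library alone.

```python
import sqlite3
from typing import Dict, Iterable, Iterator, List, Tuple


def extract() -> Iterator[Dict]:
    """Extract: stand-in for a block stream (hypothetical transfer data)."""
    yield {"number": 1, "transfers": [("alice", "bob", 10)]}
    yield {"number": 2, "transfers": [("bob", "carol", 4), ("alice", "carol", 1)]}


def transform(blocks: Iterable[Dict]) -> List[Tuple[str, int]]:
    """Transform: reduce raw transfers to a net balance change per account."""
    balances: Dict[str, int] = {}
    for block in blocks:
        for sender, receiver, amount in block["transfers"]:
            balances[sender] = balances.get(sender, 0) - amount
            balances[receiver] = balances.get(receiver, 0) + amount
    return sorted(balances.items())


def load(rows: List[Tuple[str, int]], db: sqlite3.Connection) -> None:
    """Load: write the transformed rows to the database."""
    db.execute("CREATE TABLE balance (account TEXT PRIMARY KEY, amount INTEGER)")
    db.executemany("INSERT INTO balance VALUES (?, ?)", rows)
    db.commit()


def query(db: sqlite3.Connection, account: str) -> int:
    """Query: serve a read from the loaded data (0 for unknown accounts)."""
    row = db.execute(
        "SELECT amount FROM balance WHERE account = ?", (account,)
    ).fetchone()
    return row[0] if row else 0


def run_pipeline() -> sqlite3.Connection:
    """Wire the stages together: Extract -> Transform -> Load."""
    db = sqlite3.connect(":memory:")
    load(transform(extract()), db)
    return db
```

Keeping the stages separate means the Transform step can be replaced or parallelized without touching how data is extracted, stored, or served.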

The first transformation, via substreams, enables lighter-weight parallelized computation and composability that many subgraphs can benefit from.

To illustrate: large DEXes need to find pairs for any given trade. A substream model lets small individual modules work simultaneously on pairs, reserve extractors, prices, volume aggregation, and other key metrics. A developer who bases their work on existing substreams can take the DEX prices and create a module that averages all DEX prices across an ecosystem.
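The composition described in the DEX example can be sketched as plain functions, where one module consumes the output of another. The swap data, the module names `dex_prices` and `average_prices`, and the "last traded price" rule are all assumptions made for this illustration. Actual substream modules are defined differently.

```python
from statistics import mean
from typing import Dict, List, Tuple

# Hypothetical swap events per DEX: (pair, amount_in, amount_out).
Swap = Tuple[str, float, float]
SWAPS: Dict[str, List[Swap]] = {
    "dex_a": [("ETH/USDC", 1.0, 2000.0), ("ETH/USDC", 2.0, 4040.0)],
    "dex_b": [("ETH/USDC", 1.0, 1990.0)],
}


def dex_prices(swaps: List[Swap]) -> Dict[str, float]:
    """Existing module (assumed): last traded price per pair on one DEX."""
    prices: Dict[str, float] = {}
    for pair, amount_in, amount_out in swaps:
        prices[pair] = amount_out / amount_in  # later swaps overwrite earlier ones
    return prices


def average_prices(per_dex: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """New module: average each pair's price across every DEX that lists it."""
    collected: Dict[str, List[float]] = {}
    for prices in per_dex.values():
        for pair, price in prices.items():
            collected.setdefault(pair, []).append(price)
    return {pair: mean(values) for pair, values in collected.items()}


def ecosystem_average() -> Dict[str, float]:
    """Compose the modules: per-DEX prices feed the averaging module."""
    per_dex = {name: dex_prices(swaps) for name, swaps in SWAPS.items()}
    return average_prices(per_dex)
```

The averaging module never touches raw swaps. It depends only on the output of `dex_prices`, which is what lets a developer build on someone else's module without reimplementing it.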

Substream modules don’t go through PostgreSQL. Developers can reuse and adapt existing modules, so end users benefit from composability without paying a performance penalty for indexing.

After the Extract and Transform stages, substreams can be composed in many different ways, and a further module can populate a subgraph with the results, all before the Load stage.

As opposed to linear historical data processing, substream data can be processed in parallel and cached. This allows for the fastest possible insertion into the postgres database, going from days or weeks to mere hours.
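To make the caching point concrete, here is a minimal sketch in which the output of each processed segment is stored and reused on later runs. The `SegmentCache` class, the segment size, and the per-block workload are hypothetical and chosen only so the example is easy to check.

```python
from typing import Dict, List, Tuple

Segment = Tuple[int, int]


def count_events(segment: Segment) -> int:
    """Hypothetical workload: block n carries n % 3 events."""
    lo, hi = segment
    return sum(number % 3 for number in range(lo, hi + 1))


class SegmentCache:
    """Stores the result of each processed segment so it is computed once."""

    def __init__(self) -> None:
        self.store: Dict[Segment, int] = {}
        self.computed = 0  # how many segments were actually processed

    def get_or_compute(self, segment: Segment) -> int:
        if segment not in self.store:
            self.computed += 1
            self.store[segment] = count_events(segment)
        return self.store[segment]


def total_events(cache: SegmentCache, start: int, end: int, size: int = 10) -> int:
    """Total events over [start, end], reusing cached segments where possible."""
    segments: List[Segment] = [
        (lo, min(lo + size - 1, end)) for lo in range(start, end + 1, size)
    ]
    return sum(cache.get_or_compute(segment) for segment in segments)
```

Running `total_events` over blocks 1 to 30 and then over 1 to 40 with the same cache processes only one new segment on the second run, since the first three are already stored.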

This all benefits developers. Developers need to build subgraphs and should be able to iterate on those subgraphs as fast as possible, maximizing their productivity. They will be able to iterate on existing modules, reuse the most efficient processes (such as in the DEX example), and improve incrementally without needing to rebuild a subgraph from scratch. They will be able to observe data and add to their database as required. The speed and data composability of subgraphs and substreams, pulling data through Firehose, will make The Graph the fastest and most efficient way to get data from blockchains.

This is the power of open-source data composability via The Graph: a hivemind of developers building composable data across a global ecosystem. Centralized services cannot compete.

Current Stage in the Process

An initial implementation of substreams has been built and is being tested. The core devs are working with a small group of developers to improve the software. Keep an eye out for announcements on availability for developers.

Thanks to all the core dev teams that have worked on this (special shoutout to StreamingFast!). We can’t wait for developers to experience the radically faster indexing performance enabled by substreams.

About StreamingFast

StreamingFast is a web3 builder and investor. As a core developer on The Graph, it excels at building massively scalable open-source software for processing and indexing blockchain data. Founded by a team of serial tech entrepreneurs, the company has deep expertise in large-scale data science. Its core innovation, the Firehose, is a files-based and streaming-first approach to processing blockchain data that enables high-performance indexing on high-throughput chains.

You can follow StreamingFast on Twitter and on Discord.

About The Graph

The Graph is the source of data and information for the decentralized internet. As the original decentralized data marketplace that introduced and standardized subgraphs, The Graph has become web3’s method of indexing and accessing blockchain data. Since its launch in 2018, tens of thousands of developers have built subgraphs for dapps across 40+ blockchains, including Ethereum, Arbitrum, Optimism, Base, Polygon, Celo, Fantom, Gnosis, and Avalanche.

As demand for data in web3 continues to grow, The Graph enters a New Era with a more expansive vision including new data services and query languages, ensuring the decentralized protocol can serve any use case - now and into the future.

Discover more about how The Graph is shaping the future of decentralized physical infrastructure networks (DePIN) and stay connected with the community. Follow The Graph on X, LinkedIn, Instagram, Facebook, Reddit, and Medium. Join the community on The Graph’s Telegram, and join technical discussions on The Graph’s Discord.

The Graph Foundation oversees The Graph Network. The Graph Foundation is overseen by the Technical Council. Edge & Node, StreamingFast, Semiotic Labs, The Guild, Messari, GraphOps, Pinax and Geo are eight of the many organizations within The Graph ecosystem.

Graph Protocol
The Graph Foundation
June 2, 2022
