How The Graph Powers Dapps with Subgraphs, Firehose, & Substreams

Prefer to hear this blog aloud? Listen to the contents of this blog via The Graph Podcast.

The Graph has grown and evolved immensely over the past five years. The team at StreamingFast, one of the core dev teams contributing to the protocol, has appreciated the opportunity to work with others to facilitate a more decentralized and efficient data indexing and querying ecosystem for the internet of the future.

With the recent release of the New Era roadmap, The Graph is entering an exciting new chapter. As we eagerly anticipate the many new features and technologies that will bolster The Graph ecosystem, we thought it would be a great time to highlight some of the core technologies that currently power The Graph and how they will continue to evolve in the New Era, namely subgraphs, Firehose, and Substreams.

For readers who might be seeing these terms for the first time, this blog serves as a comprehensive reintroduction. In addition to defining each technology individually, this post explains how these combined technologies provide a trifecta of value for users of The Graph. Lastly, we’ll also highlight how these core technologies will flourish in the next chapter of The Graph!

TLDR:

Subgraphs offer a rapid deployment of a Postgres database, filled with the indexed data you need. Query it using a ready-to-use GraphQL query layer.

Firehose unpacks every transaction in a block into its smallest units and stores them in simple flat files. This makes the data parallelizable, enabling rapid indexing speeds.

Substreams uses the Firehose’s flat files to index data at lightning-fast speeds. Substreams are highly composable, allowing developers to build atop each other’s work.

Subgraphs

Definition – An open API that extracts data from a blockchain, processes it, and stores it so that it can be easily queried via GraphQL.

Subgraphs were the original data service introduced by The Graph, and have since become the industry standard when it comes to indexing blockchain data. Since launching in 2018, The Graph has supported more than 85,000 subgraphs on over 40 chains.

Subgraphs are custom APIs that can be written and queried by anyone. They define a data schema that developers wish to index, making that data easily queryable. These are written in conjunction with a subgraph manifest, which refers to the subgraph’s GraphQL schema, its data sources, and metadata.
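To make the idea concrete, here is a minimal sketch in Python of what a subgraph does conceptually: a handler turns raw contract events into entities, and a query reads them back. This is an illustration only. Real subgraph mappings are written in AssemblyScript and served by Graph Node over GraphQL; the `ToySubgraph` class, the `Transfer` entity, and the event fields below are all hypothetical names invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Transfer:
    """Entity shape, standing in for a type in the subgraph's GraphQL schema."""
    id: str
    sender: str
    receiver: str
    amount: int


@dataclass
class ToySubgraph:
    """In-memory stand-in for the indexed store (hypothetical, not Graph Node)."""
    store: dict = field(default_factory=dict)

    def handle_transfer(self, event: dict) -> None:
        # Mapping step: turn one raw contract event into one stored entity.
        entity = Transfer(
            id=f"{event['tx_hash']}-{event['log_index']}",
            sender=event["from"],
            receiver=event["to"],
            amount=event["value"],
        )
        self.store[entity.id] = entity

    def transfers_from(self, sender: str) -> list:
        # Stand-in for a filtered query over the stored entities.
        return [t for t in self.store.values() if t.sender == sender]


# Made-up events, as a data source might emit them.
events = [
    {"tx_hash": "0xaa", "log_index": 0, "from": "alice", "to": "bob", "value": 5},
    {"tx_hash": "0xab", "log_index": 1, "from": "bob", "to": "carol", "value": 2},
    {"tx_hash": "0xac", "log_index": 0, "from": "alice", "to": "carol", "value": 7},
]

subgraph = ToySubgraph()
for event in events:
    subgraph.handle_transfer(event)

alice_total = sum(t.amount for t in subgraph.transfers_from("alice"))
```

The split mirrors the pieces described above: the entity class plays the role of the schema, the event list plays the role of a data source, and the handler plays the role of the mapping.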

In simple terms, you can think of a subgraph as a specialized librarian for blockchain data. Just as a librarian organizes and catalogs books so people can easily and quickly find what they're looking for, a subgraph organizes blockchain data into a structured format. It's a custom-made tool, created by anyone who wants to make certain pieces of blockchain data easy to find and use. When developers write a subgraph, they're essentially creating a unique 'catalog' that specifies what data they want to be easily accessible, how it should be arranged, and where it comes from, making it much simpler for others to search and utilize this information effectively.

Subgraphs are primarily used by web3 developers to rapidly serve neatly organized blockchain data into the front end of a decentralized application (dapp). With subgraphs, developers can display blockchain data such as DeFi transactions, NFT provenance, liquidity pool information and more – often in charts and graphs. Subgraphs also enable devs to build rich features in their dapps, such as implementing a search bar, filter features on charts, pagination, tagging, and more.

A subgraph can be published to The Graph Network to be served permissionlessly by any Indexer, or if a developer is running their own local instance of a graph-node, they can deploy it locally.

Subgraphs deployed to The Graph Network can be queried by anyone.

Firehose

Definition – Firehose is a tool for extracting data from Firehose-enabled blockchain nodes and storing it in flat files.

Created by StreamingFast, the Firehose provides previously unattainable capabilities and speeds for indexing blockchain data by using a files-based and streaming-first approach. How does it work? Rich data is extracted from Firehose-enabled, instrumented blockchain nodes and saved to simple flat files, providing capture and processing speeds previously thought to be impossible. Firehose is primarily written in the Go programming language and takes full advantage of parallel computing. Firehose also consumes, processes, and streams blockchain data to consumers. It was designed to replace JSON-RPC systems, which are often brittle, slow to respond, and inconsistent.
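The reason flat files help with parallelism can be shown with a small sketch. The Python below writes a handful of made-up blocks to numbered files and then processes each file in a separate worker. It is an illustration under stated assumptions, not how Firehose is implemented: Firehose itself is written in Go, and the JSON-lines format, the file naming, and the `write_flat_files` and `count_transactions` helpers are hypothetical choices made for this example.

```python
import json
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def write_flat_files(blocks, directory, blocks_per_file=2):
    """Write blocks to numbered JSON-lines files, a few blocks per file."""
    paths = []
    for start in range(0, len(blocks), blocks_per_file):
        path = Path(directory) / f"{blocks[start]['number']:010d}.jsonl"
        with path.open("w") as f:
            for block in blocks[start:start + blocks_per_file]:
                f.write(json.dumps(block) + "\n")
        paths.append(path)
    return paths


def count_transactions(path):
    """Process one file without needing any other file or a live node."""
    with path.open() as f:
        return sum(len(json.loads(line)["transactions"]) for line in f)


# Six made-up blocks holding 1, 2, 0, 1, 2, 0 transactions respectively.
blocks = [
    {"number": n, "transactions": [f"tx-{n}-{i}" for i in range(n % 3)]}
    for n in range(1, 7)
]

with tempfile.TemporaryDirectory() as tmp:
    files = write_flat_files(blocks, tmp)
    # Each file is an independent unit of work, so files can be fanned out.
    with ThreadPoolExecutor(max_workers=3) as pool:
        total_transactions = sum(pool.map(count_transactions, files))
```

Because no worker depends on another worker's result, adding more workers (or machines) shortens the job, which is the property the flat-file design is meant to unlock.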

Let’s return to the librarian analogy we used to help explain subgraphs. If the subgraph is a specialized librarian cataloging blockchain data and making it easily accessible, Firehose is an advanced book delivery system in the library. Imagine a high-tech conveyor belt system zipping books directly to the librarian's desk at incredible speeds. This is what Firehose does with blockchain data: it efficiently extracts and delivers the data to the subgraph, much like delivering books quickly and reliably to the librarian. This enables the librarian (subgraph) to catalog the information faster and more accurately, making it readily available for anyone who needs it.

Bringing it back to web3, let’s review some of the key technical points that the Firehose was designed to address:

Deliver a files-based and streaming-first approach to processing blockchain data
Consume, process, and stream blockchain data to consumers of nodes running Firehose-enabled, instrumented blockchain client software
Low-latency processing of real-time blockchain data in the form of binary data streams
A Firehose cursor points to a specific position in the stream of events emitted by the database and the blockchain itself. This cursor contains information that is required to reconstruct an equivalent forked or canonical instance, ensuring that it is extremely simple to recover from any amount of downtime
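The cursor idea in the last point can be sketched as follows. This is a simplified Python model, assuming a cursor that holds only a block number and hash and a fork that is at most one block deep; real Firehose cursors are opaque values that carry more information. The `Cursor` class, the `stream` function, and the `"new"`/`"undo"` step names are hypothetical and chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cursor:
    """Position in the stream: the last block the consumer processed."""
    number: int
    block_hash: str


def stream(chain, cursor=None):
    """Yield (step, block, cursor) tuples, resuming after `cursor` if given.

    `chain` is the current canonical chain as a list of (number, hash) pairs.
    Simplifying assumption: a fork is at most one block deep.
    """
    start = 0
    if cursor is not None:
        canonical = dict(chain)
        if canonical.get(cursor.number) == cursor.block_hash:
            # The last block seen is still canonical: continue after it.
            start = cursor.number
        else:
            # The last block seen was forked out: ask the consumer to undo it.
            parent_number = cursor.number - 1
            parent = Cursor(parent_number, canonical.get(parent_number, ""))
            yield ("undo", (cursor.number, cursor.block_hash), parent)
            start = parent_number
    for number, block_hash in chain:
        if number > start:
            yield ("new", (number, block_hash), Cursor(number, block_hash))


# First run: consume three blocks, keep the last cursor, then "go down".
chain_before = [(1, "a"), (2, "b"), (3, "c")]
first_run = list(stream(chain_before))
last_cursor = first_run[-1][2]

# While offline, block 3 was replaced and block 4 was produced.
chain_after = [(1, "a"), (2, "b"), (3, "c2"), (4, "d")]
resumed = [(step, block) for step, block, _ in stream(chain_after, last_cursor)]
```

On resume, the consumer first receives an undo for the block that is no longer canonical and then the new blocks, so it never has to work out for itself what changed during the downtime.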

In The Graph’s New Era, Firehose is featured in a few milestones under the World of Data Services objective, playing a crucial role as a new data service ensuring a faster and more modular flow of data. Additionally, the introduction of Verifiable Firehose presents an innovative solution for accessing historical Ethereum data, potentially addressing the challenges posed by EIP-4444.

Substreams

Definition – A data-processing tool designed for efficiently transforming large-scale streaming data. They can easily adapt to different tasks (composable) and handle multiple data processing activities at the same time (parallelizable). Substreams specialize in quickly converting data from basic file formats into more usable forms.

Substreams, also created by StreamingFast, enables developers to write Rust modules to compose data. Substreams provides extremely high-performance indexing by virtue of parallelization, in a streaming-first fashion, meaning that Substreams will push the data to you as soon as it is available, rather than waiting for you to continuously request it.

Substreams modules sort, sift, temporarily store, and transform blockchain data from block objects and smart contracts, for use in data sinks such as databases or subgraphs. These modules are composable, breaking down the entire dataset into smaller streams of data that can be reused within other Substreams, allowing developers to build atop the work of others. A public registry of Substreams packages can be found at substreams.dev, where developers can access Substreams built by other community developers. They can modify them, utilize them as inputs for their own Substreams, and add their own.
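Composability is easier to see in code. The sketch below models the idea in Python: one module extracts data from a block, a second module consumes the first module's output rather than the raw block, and a third accumulates state across blocks. Real Substreams modules are written in Rust and exchange Protobuf messages, so this shows the shape of the idea only; the function names and block fields are hypothetical.

```python
def map_transfers(block):
    """Map module: extract transfer records from one block."""
    return [tx for tx in block["transactions"] if tx["kind"] == "transfer"]


def map_large_transfers(transfers, threshold=100):
    """Map module that consumes another module's output, not the raw block."""
    return [t for t in transfers if t["amount"] >= threshold]


def store_volume(store, transfers):
    """Store module: accumulate a running total per sender across blocks."""
    for t in transfers:
        store[t["sender"]] = store.get(t["sender"], 0) + t["amount"]
    return store


# Two made-up blocks.
blocks = [
    {"number": 1, "transactions": [
        {"kind": "transfer", "sender": "alice", "amount": 150},
        {"kind": "swap", "sender": "bob", "amount": 10},
    ]},
    {"number": 2, "transactions": [
        {"kind": "transfer", "sender": "alice", "amount": 50},
        {"kind": "transfer", "sender": "bob", "amount": 200},
    ]},
]

volume = {}
large = []
for block in blocks:
    transfers = map_transfers(block)               # reusable building block
    large.extend(map_large_transfers(transfers))   # built on top of it
    store_volume(volume, transfers)                # also built on top of it
```

Both downstream modules reuse `map_transfers` without re-reading the block, which is the sense in which one developer's module can become an input to someone else's.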

Returning, once again, to our librarian analogy, where the subgraph is the specialized librarian organizing data, and Firehose is the advanced delivery system bringing in books at high speed, Substreams act like a team of expert assistants. These assistants pre-sort, categorize, and even summarize the books before they reach our librarian's desk. Imagine them working diligently alongside the conveyor belt, picking out relevant books, arranging them into neat, thematic stacks, and even bookmarking important pages. This means that by the time the books (blockchain data) get to our librarian (the subgraph), they are not just delivered quickly but are also pre-organized and easier to catalog. Substreams can enhance the library's efficiency, turning certain types of tasks that once took months into hours.

“In the New Era, new tooling for Substreams will be created to empower developers and enhance the developer experience, while also being integrated into the network as a new data service.”

Combining the Power of Subgraphs, Firehose, and Substreams

Now that we’ve reintroduced subgraphs, Firehose, and Substreams, we can use the example of processing large amounts of data to help illustrate the use cases where each one is needed and where they support each other.

As you may already know, subgraphs are a powerful tool for developers, offering easy access to serverless architecture coupled with robust tooling and documentation. They have become integral to many popular and widely-used dapps, providing a convenient way for developers to quickly access and query blockchain data. Subgraphs excel in managing standard data loads, making them the default industry standard for accessing blockchain data.

Additionally, for handling large-scale data processing, incorporating Firehose can significantly enhance the performance and efficiency of your subgraph. Firehose is adept at processing large amounts of data quickly and efficiently, having the ability to break down every transaction on the blockchain into smaller, more manageable components stored in flat files. These files allow for rapid traversal and retrieval of data, catering to the needs and demands associated with processing large data requests.

Despite the efficiencies of Firehose, it still requires developers to individually pull data to their own servers, compute, and then transform it to fit their needs. This leads to lots of duplicated data flowing around the world. Substreams offer an elegant solution to this challenge. They enable the request of pre-computed data that can be processed in parallel across large chunks of historical data, and then seamlessly keep up with live data as it is streamed to you. By having all the raw and cached pre-computed data live within the Substreams cluster, the developer ecosystem can benefit from a shared data intelligence layer, which previously was held (and built) by each developer individually.

Differences between Substreams and Subgraphs

With such similar names, the differences between subgraphs and Substreams may not be clear to some, especially given the recent launch of Substreams-powered subgraphs. In this section, we’ll make the distinction clearer.

Perhaps the best way to understand the distinction between Substreams and subgraphs is to understand the different use cases served by each. Subgraphs serve the vast majority of blockchain data retrieval needs, thanks to how easy and rapid it is to use them to index smart contract events and start building with that data using a GraphQL API. Substreams was created in response to different use cases, particularly those around analytics and big data that require parallelized data processing instead of linear processing. Some other key points for which Substreams was created include:

Deliver a streaming-first approach to consuming and transforming blockchain data
Highly parallelizable, yet simple model to consume and transform blockchain data
Composable system where you can depend on building blocks offered by other developers in the community
Offer a rich block model

While they share similar ideas around transforming and processing blockchain data, and they are both part of The Graph ecosystem, they are best viewed as independent technologies. One cannot take a subgraph's code and run it on the Substreams engine, or vice versa. The two are incompatible, though below we’ll learn how developers can feed their subgraph using Substreams as a data source.

For simplicity, here is a list of key differences between subgraphs and Substreams:

Substreams are currently written in Rust. Subgraphs are written in AssemblyScript
Substreams are "stateless" requests through gRPC. Subgraphs are persistent deployments
Substreams offer the chain-specific full block. Subgraphs require you to define triggers that will invoke your code
Substreams are consumed through a gRPC connection where you control the actual output message. Subgraphs are consumed via GraphQL queries
Substreams have no long-term storage or database, only transient storage. Subgraphs store data persistently in a Postgres database
Substreams can be consumed in real time with a fork-aware model. Subgraphs can only be consumed through GraphQL and polling
Substreams rely on Protobuf models. Subgraphs rely on GraphQL schemas

Deciding between subgraphs and Substreams

When asking yourself if you should write a subgraph or a Substreams, here are a couple of handy tips to help you decide.

Subgraphs effectively eliminate the hassle and costs of setting up, operating, and managing a database. They come equipped with a ready-to-use GraphQL query layer, streamlining developer workflows. This means you can begin consuming data almost immediately after getting started!

Substreams enable the parallelization of data ingestion, leveraging the power of composable modules created by other developers. This means you’ll be able to customize the database type and storage format, giving you greater control over the output format and the data you consume.

Subgraphs and Substreams each offer distinct advantages tailored to different use cases. It’s important for developers to consider the specific needs of their project when choosing between these two technologies. While Substreams offer flexibility in data storage and live-streaming capabilities, ideal for complex analytics and real-time data requirements, subgraphs simplify the initial setup and data management process, making them a great choice for developers seeking an easy-to-use, GraphQL-based interface.

How Substreams, Firehose, and Subgraphs Interact on The Graph

When considering subgraphs, Firehose, and Substreams independently, it becomes clear how each of them has revolutionized data access and management. Let’s now take a look at how combining these technology stacks has enabled The Graph to push the frontiers of what’s possible and establish a new standard in the world of blockchain data.

Firehose feeds Subgraphs

Indexers must run the Graph Node software, which sources the blockchain to deterministically update a data store that can be queried via a GraphQL endpoint. A subgraph is how the developer refers to the schema and the mappings for transforming the data synced from the blockchain. Originally, Indexers would feed their Graph Node through either an RPC or an Archive Node. Firehose was then added as an alternative data source for Graph Node, allowing Indexers to utilize this highly performant file-based data store.

Firehose feeds Substreams

Substreams offers all the benefits of Firehose, including low-cost caching and archiving of blockchain data, high throughput processing, and cursor-based reorg handling. It is independent of the underlying blockchain protocol and works solely on data extracted from nodes using Firehose.

Substreams-powered Subgraphs

Substreams-powered subgraphs introduce a new data source for subgraphs, bringing the indexing speed and additional data of Substreams to subgraph developers. This data source must specify the indexed network, the Substreams package (spkg) as a relative file location, and the module within that Substreams package which produces subgraph-compatible entity changes.
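The three required pieces of that data source can be modeled as a small record with a sanity check. The Python below is a sketch for illustration and does not reproduce the real subgraph manifest format, which is a YAML file with its own field names. The `SubstreamsDataSource` class, its field names, and the example values (`mainnet`, `./my-package.spkg`, `graph_out`) are assumptions made for this example.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass
class SubstreamsDataSource:
    """Hypothetical model of the three things the data source must specify."""
    network: str       # the indexed network
    package_file: str  # relative location of the Substreams package (.spkg)
    module_name: str   # module that emits subgraph-compatible entity changes

    def problems(self) -> list:
        """Return a list of human-readable issues; empty means it looks fine."""
        issues = []
        if not self.network:
            issues.append("network is required")
        path = PurePosixPath(self.package_file)
        if path.suffix != ".spkg":
            issues.append("package must be an .spkg file")
        if path.is_absolute():
            issues.append("package location must be relative")
        if not self.module_name:
            issues.append("module name is required")
        return issues


good = SubstreamsDataSource("mainnet", "./my-package.spkg", "graph_out")
bad = SubstreamsDataSource("", "/tmp/my-package.txt", "")
```

Calling `good.problems()` returns an empty list, while `bad.problems()` reports one issue per rule it breaks.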

Developers who build using Substreams-powered subgraphs have been able to reduce some subgraph syncing times by more than 100x, while also improving overall performance, providing a new and fresh form of data agility. For Indexers, running Firehose and serving Substreams saves time and resources by horizontally scaling and increasing efficiency, reducing processing and wait time.

The benefits of using Substreams-powered subgraphs are already demonstrating value in real-world applications! One of the most striking examples of these benefits is the dramatic improvement in sync times. Consider the Uniswap-v3 subgraph, which traditionally took 2 months to sync. With a Substreams-powered subgraph, this process was completed in just 20 hours. That's a staggering 72x speed improvement, reducing sync time from more than 1,440 hours to a mere 20. This kind of efficiency can revolutionize the development process, as well as how users interact with data on The Graph Network.
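The 72x figure follows directly from the numbers quoted above, assuming a month is rounded to 30 days:

```python
HOURS_PER_DAY = 24
DAYS_PER_MONTH = 30  # assumption: "2 months" is rounded to 60 days

baseline_hours = 2 * DAYS_PER_MONTH * HOURS_PER_DAY  # traditional sync time
substreams_hours = 20                                # Substreams-powered sync time

speedup = baseline_hours / substreams_hours
print(f"{baseline_hours} hours / {substreams_hours} hours = {speedup:.0f}x")
```

With these inputs the baseline works out to 1,440 hours and the ratio to 72.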

Summary

In summary, we hope this post was an effective reintroduction of subgraphs, Firehose, and Substreams. As we enter a New Era for The Graph, we believe it’s important to understand these core technologies, how they relate to each other, and then be ready to see them expand and evolve in the New Era.

When it comes to subgraphs, Firehose, and Substreams, each of them plays a pivotal role in optimizing access to and management of blockchain data. From the foundational impact of subgraphs to the innovative capabilities of Firehose and Substreams, these technologies collectively represent a significant leap forward in how The Graph will help support the decentralized internet.

Here is a summary list of the essential elements to remember, offering a clear understanding of how these technologies are interconnected:

Subgraphs are the industry standard when it comes to creating and organizing open APIs to consume indexed blockchain data
Subgraphs effectively eliminate the hassle and costs of setting up, operating, and managing a database or indexing infrastructure
Firehose-enabled nodes extract all data from each transaction and store all the information in flat files to speed up blockchain data retrieval; flat files introduce the ability to easily parallelize data indexing jobs that run atop the information found within
Substreams breaks down each data stream into smaller composable modules that can be easily reused and piped into one another
Substreams enables developers to persist data to their storage of choice, or send it directly to another data pipeline
Substreams uses Firehose-created flat files in order to pipe in all the data to be transformed and stored in a data sink
Substreams-powered subgraphs allow developers to easily query indexed data via a GraphQL API endpoint, as defined by the subgraph creator

What should you do next?

Now that you’re well versed in the differences between subgraphs, Firehose and Substreams, why not jump in and give them a try?

Whether it’s writing and publishing your first subgraph, sending your first query to The Graph Network, or simply perusing the subgraph registry, there’s lots to dive into. If you want to try running a Firehose node, take a look at which protocols are currently supported. And lastly, you can see which Substreams packages are available to try, or look under the hood at Substreams-powered subgraphs.

Category: Graph Protocol
Author: StreamingFast
Published: December 12, 2023
