Best Practices in Subgraph Development: Avoiding Large Arrays
Welcome to the second edition of the Best Practices in Subgraph Development series. If you're new to The Graph and want to understand the best practices for building and using subgraphs, you're in the right spot. This blog is aimed at guiding developers in crafting efficient and well-structured subgraph definitions.
The purpose of this series is to share insights and recommendations based on patterns that have emerged since Graph Node was open-sourced in 2018. In the first post in this series, we covered improving indexing performance by reducing eth_calls.
In this second post, we'll concentrate on enhancing the performance of data storage defined within the subgraph schema and handlers. However, before we dive into the solution, it makes sense to explain the problem and why managing the way data entries are stored in subgraphs is important.
Schema Definition
The schema serves as the guide for Graph Node, informing it about the data types to be indexed, their relationships, and the format to provide when responding to queries.
Mapping of Indexed Data
Data is collected through a mapping that aligns with the schema specified in the subgraph. This establishes a connection between the original blockchain data and the organized data structure.
Storage
The main store for graph-node is PostgreSQL. This is where subgraph data is stored, along with metadata about subgraphs and subgraph-agnostic network data such as the block cache and the eth_call cache.
The processed and mapped data is stored in a way that optimizes query performance. Graph Node creates indexes on the data, enabling rapid retrieval of specific information. These indexes work similarly to how a book's index helps you quickly find specific topics or pages.
Another important feature of Graph Node is that when an existing entity is updated, both versions are stored: the old version is kept alongside the new one. This concept is important to understand, and we will highlight why later in this blog post.
Querying (GraphQL)
Once the data is indexed, developers and users can compose GraphQL queries to request specific information. The GraphQL queries are designed to mirror the structure of the subgraph's schema. Graph Node efficiently processes these queries, retrieves the required data from the database, and returns the results to the query issuer through a GraphQL interface.
Given the previous explanation of how Graph Node works, we can safely say that these definitions are extremely programmable in the sense that they can be authored and customized by developers to specify how blockchain data is indexed, organized, and queried. This empowers developers to tailor the behavior of subgraphs to their specific needs, and makes The Graph powerful; however, there are some things to look out for when it comes to properly optimizing the data within Graph Node. Let’s take a look at the issue with storing large arrays and show how to optimize them using some features available in the schema configuration.
Large Arrays
To illustrate this issue, we will begin with a simple smart contract. In this example, we assume there is a createUser function along with an addTransaction function, and we track contract calls to these functions with the following events:
As subgraph developers, we usually want to build our schema around these events. Let's say that we want to keep track of all the transactions that a particular user executed. Without any knowledge of how arrays are handled within a subgraph, we might design a schema similar to this one, where we define a field with an array type. In the example below, we do this with the field named transactions.
Then we might update this entity field using code similar to the following, where we load the existing transactions, push the new one onto the array, and then save the data.
This will work, but it is important to note that when we access an array of an entity we are actually getting a copy of that data. Therefore, if we update the data and save the entity, we are simply making a copy of the array, while the original is left unchanged. This is not a problem for small arrays with fewer than 50 or so entries. However, if it contains a larger amount of data and changes frequently it will bloat the database.
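The copy behavior described above can be illustrated with a minimal TypeScript sketch. The UserCreated class here is a hypothetical stand-in, not the class that graph-cli generates; it only mimics an accessor that hands out a copy of the stored array, which is why the modified array has to be assigned back before saving.

```typescript
// Hypothetical stand-in for a generated entity class (an assumption for
// illustration). Reading the array returns a copy, as described above.
class UserCreated {
  private stored: string[] = [];

  getTransactions(): string[] {
    return this.stored.slice(); // hand out a copy, never the original
  }

  setTransactions(value: string[]): void {
    this.stored = value.slice();
  }
}

const entity = new UserCreated();

// Pushing onto the returned array only changes a temporary copy.
entity.getTransactions().push("0xaaa");
const afterDirectPush = entity.getTransactions().length;

// Load the copy, modify it, then assign it back to the entity.
const txs = entity.getTransactions();
txs.push("0xaaa");
entity.setTransactions(txs);
const afterReassign = entity.getTransactions().length;

console.log(afterDirectPush, afterReassign); // 0 1
```

In this sketch the first push is lost, while the load, push, and assign sequence persists the new element.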
The database bloats because of a powerful capability in Graph Node known as time-travel queries. Originally implemented so that Graph Node can handle chain reorgs and ensure data accuracy by tracking state at a certain block number and block hash, this feature also lets users query the subgraph at any specific point in time, giving access to rich historical data. To achieve this, Graph Node keeps track of all the changes to all the entities of any given subgraph.
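The change tracking described above can be pictured with a small in-memory TypeScript sketch. It is an illustration under simplifying assumptions, not graph-node's actual storage code: the VersionedRow type, the save and loadAt helpers, and the way block ranges are represented are all hypothetical names chosen for the example.

```typescript
// One stored version of an entity, valid for blocks in [fromBlock, toBlock).
// toBlock === null means the version is still current.
interface VersionedRow {
  id: string;
  transactions: string[];
  fromBlock: number;
  toBlock: number | null;
}

const rows: VersionedRow[] = [];

// Saving never overwrites: it closes the current version and appends a new one.
function save(id: string, transactions: string[], block: number): void {
  for (const row of rows) {
    if (row.id === id && row.toBlock === null) {
      row.toBlock = block;
    }
  }
  rows.push({ id: id, transactions: transactions.slice(), fromBlock: block, toBlock: null });
}

// Return the version of an entity that was valid at the given block.
function loadAt(id: string, block: number): VersionedRow | null {
  for (const row of rows) {
    const stillValid = row.toBlock === null || block < row.toBlock;
    if (row.id === id && row.fromBlock <= block && stillValid) {
      return row;
    }
  }
  return null;
}

save("user-1", ["0xaaa"], 100);
save("user-1", ["0xaaa", "0xbbb"], 105);
save("user-1", ["0xaaa", "0xbbb", "0xccc"], 110);

// Three updates leave three rows behind; block 107 sees the second version.
const atBlock107 = loadAt("user-1", 107);
console.log(rows.length, atBlock107 ? atBlock107.transactions.length : null);
```

Every save keeps the earlier rows intact, which is what makes historical queries possible and also what makes a growing array expensive to store.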
Here is an example of what a GraphQL data response would look like with the userCreated schema that tracks transactions.
Query:
Output:
Looking at the response from GraphQL you might think everything is fine, but to really understand the problem, we should dig deeper and see how the data is stored in PostgreSQL. The database screenshot below shows the tables using a popular PostgreSQL tool called pgAdmin.
This can be done with the following steps: select Databases -> graph-node -> schemas -> sgd1 (where the number reflects the iteration of your subgraph) -> Tables -> user_created.
Then simply right-click the table, select View/Edit Data, and select All Rows.
If we examine the table called transactions, we can immediately see the issue with this approach:
After only a few transactions, our changes are duplicated into a new row and the existing row remains unchanged. This will obviously become an issue as more and more entries are added to our database.
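To get a rough feel for how quickly this grows, the following TypeScript sketch counts stored array elements under the assumption that every update writes a full new copy of the array while all older versions are kept. The function names are hypothetical, and the sketch ignores per-row overhead and any other storage details.

```typescript
// Array-on-entity pattern: update k stores a new row holding k elements,
// and all earlier rows are kept.
function elementsStoredWithArray(updates: number): number {
  let total = 0;
  for (let k = 1; k <= updates; k++) {
    total += k;
  }
  return total;
}

// One-row-per-transaction pattern: each update stores a single new row.
function rowsStoredWithSeparateEntities(updates: number): number {
  return updates;
}

console.log(elementsStoredWithArray(1000));        // 500500
console.log(rowsStoredWithSeparateEntities(1000)); // 1000
```

Under these assumptions the array pattern grows quadratically with the number of updates, while storing one row per transaction grows linearly.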
Overview of loadRelated and derivedFrom
If we look at the feature @derivedFrom, we will see that it gives us the ability to perform reverse lookups on entities. Enabling this feature creates a virtual field on the entity that may be queried but cannot be set manually through the mappings API. Rather, it is derived from the relationship defined on the other entity.
Additionally, as of graph-node v0.31.0, @graphprotocol/graph-ts v0.31.0, and @graphprotocol/graph-cli v0.51.0, the loadRelated method is available. This enables loading derived entity fields from within an event handler. It proves useful when you want to read in values while processing events on your subgraph and make determinations on how the data is stored. This is a very powerful feature and makes data processing easy. When we combine loadRelated with @derivedFrom within our schema, we can solve our problem.
Let’s look at a quick example of how a derivedFrom configuration and loadRelated work in practice. Let’s say that we have the following schema with two entities, where one of the fields within EntityA, named entityb, is derived from a field called link.
Notice we also reference EntityA as the value to be derived from the key link within EntityB. Below, we have an example handler that saves some values to our entities.
After updating our values within the handler, we would end up with the following data stored in our database. It is important to note that each entity is stored in its own table; the magic of loadRelated happens when querying the data.
To showcase what this looks like when loading the data into a handler, we can look at the following code as an example:
The value of b in the previous code is determined by our @derivedFrom(field: "link"), meaning it is a loadRelated lookup that is generated when calling the .load() method. In this example it will load every EntityB for which the field link equals entityA.id, which is 1 in this example. Therefore, the SQL query generated by Graph Node for entityA.entityb.load would be in the following syntax. More detail on how graph-node handles SQL generation is available in the graph-node repository.
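Conceptually, loading a derived field is a reverse lookup: find every EntityB whose link equals the id of the EntityA at hand. The TypeScript sketch below imitates that filter over an in-memory array. It is an illustration only, not graph-node's generated SQL or the graph-ts API; the loadDerived helper and the sample rows are made up for the example.

```typescript
interface EntityA {
  id: string;
}

interface EntityB {
  id: string;
  link: string; // id of the EntityA this row points to
}

// Sample rows standing in for the EntityB table (hypothetical data).
const entityBTable: EntityB[] = [
  { id: "b1", link: "1" },
  { id: "b2", link: "1" },
  { id: "b3", link: "2" },
];

// Reverse lookup: EntityA stores nothing about its EntityB rows; they are
// found by filtering on the link field of EntityB.
function loadDerived(parent: EntityA, table: EntityB[]): EntityB[] {
  return table.filter((row) => row.link === parent.id);
}

const entityA: EntityA = { id: "1" };
const b = loadDerived(entityA, entityBTable);

// Prints the ids of the rows linked to entityA: b1 and b2.
console.log(b.map((row) => row.id));
```

Because the relationship lives only on EntityB, adding another linked row never requires rewriting EntityA.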
The Solution
Knowing what we now know about how loadRelated works and how derivedFrom is configured, we can go back and fix our code to store the data more efficiently. First, we will update our previous entity to use the derivedFrom annotation with a new field called user. This is also defined as an array type so that we can represent one-to-many relationships.
Now we can change our handler to update the relationship between the entities. In this example, the event transaction hash is converted to a hex string using .toHexString() only to make it human-readable, so that we can examine the data within the tables.
In our configuration, we defined a new foreign key within our entity called user. We therefore save entity.user as event.params.userAddress for clarity; however, we could also have used the userAddress key already available and saved the relationship using entity.userAddress. After storing the data with our new schema and handler, we can open up the database again and see how much more organized the data structure has become. We can see below that we no longer have the compounding array data and the transactions are stored in a much cleaner way.
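The shape of such a handler can be sketched in TypeScript with an in-memory array standing in for Graph Node's store. The Transaction type, the handleAddTransaction function, and the event fields below are assumptions made for illustration; a real mapping would be written in AssemblyScript against the classes generated from the schema.

```typescript
interface Transaction {
  id: string;   // transaction hash as a hex string
  user: string; // foreign key pointing at the user entity
}

// Hypothetical, simplified event shape used only in this sketch.
interface AddTransactionEvent {
  transactionHash: string;
  userAddress: string;
}

const transactionTable: Transaction[] = [];

// Each event becomes one small row; the user entity is never rewritten.
function handleAddTransaction(event: AddTransactionEvent): void {
  transactionTable.push({ id: event.transactionHash, user: event.userAddress });
}

// What a derived transactions field would resolve to for a given user.
function transactionsOf(userAddress: string): string[] {
  return transactionTable
    .filter((tx) => tx.user === userAddress)
    .map((tx) => tx.id);
}

handleAddTransaction({ transactionHash: "0x01", userAddress: "0xalice" });
handleAddTransaction({ transactionHash: "0x02", userAddress: "0xalice" });
handleAddTransaction({ transactionHash: "0x03", userAddress: "0xbob" });

// Three events produce exactly three rows, two of them linked to 0xalice.
console.log(transactionTable.length, transactionsOf("0xalice"));
```

The design choice is that the relationship is written once, on the transaction row, so the amount of data stored per event stays constant no matter how many transactions a user accumulates.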
Additionally, we can query GraphQL to verify that the data is still correlated. As you can see below, we can get the transaction ids for each user in one simple, easy-to-consume GraphQL query.
Query:
Output:
Summary
The Graph is a powerful data indexing protocol that enables intricate storage of data through customizable schema configuration and handler logic. This combination makes it a flexible tool for shaping data structures. Consequently, when crafting your subgraphs, it's important to carefully assess how Indexers across the network will store this data.
Leveraging the @derivedFrom feature within entities can significantly enhance storage performance and give your subgraph a clean and optimized data storage architecture. If you want to dive deeper into @derivedFrom, see The Graph's documentation.
Thanks to the SQL query generation handled by graph-node, the complicated logic is taken care of behind the scenes: graph-node builds the proper SQL query based on the schema definition and delivers the data to you via GraphQL.
Thank you for reading the second edition of the Best Practices in Subgraph Development series. Make sure to follow The Graph and join the community conversation for more information on how to sharpen and improve your subgraph development efforts!