Best Practices in Subgraph Development: Avoiding Large Arrays

Welcome to the second edition of the Best Practices in Subgraph Development series. If you're new to The Graph and want to understand the best practices for making and using subgraphs, you're in the right spot. This blog is aimed at guiding developers on crafting efficient and well-structured subgraph definitions.

The purpose of this series is to share insights and recommendations based on patterns that have emerged since Graph Node was open-sourced in 2018. In the previous blog in this series, we covered improving indexing performance by reducing eth_calls.

In this second post, we'll concentrate on enhancing the performance of data storage defined within the subgraph schema and handlers. However, before we dive into the solution, it makes sense to explain the problem and why managing the way data entries are stored in subgraphs is important.

Schema Definition

The schema serves as the guide for Graph Node, informing it about the data types to be indexed, their relationships, and the format to provide when responding to queries.

Mapping of Indexed Data

Data is collected through a mapping process that aligns with a specified schema in the subgraph. This establishes a connection between the original blockchain data and the organized data structure.

Storage

The main store for graph-node is PostgreSQL: this is where subgraph data is stored, along with metadata about subgraphs and subgraph-agnostic network data such as the block cache and the eth_call cache.

The processed and mapped data is stored in a way that optimizes query performance. Graph Node creates indexes on the data, enabling rapid retrieval of specific information. These indexes work similarly to how a book's index helps you quickly find specific topics or pages.

Another important feature of Graph Node is that when an existing entity is updated, The Graph stores both versions, keeping the old version and the new one as well. This concept is important to understand, and we will highlight why it is important later in this blog post.
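To make this concrete, here is a simplified sketch of how two versions of the same entity might sit side by side in the store after a single update. The column layout, in particular the block-range bookkeeping, is illustrative only and not the exact internal schema:

```
 id | username | block_range
----+----------+--------------
 u1 | alice    | [100, 250)    <- old version, valid for blocks 100..249
 u1 | alice_v2 | [250, )       <- current version, valid from block 250 on
```

Every update closes out the old row and inserts a new one, so the table accumulates one row per version of each entity.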

Querying (GraphQL)

Once the data is indexed, developers and users can compose GraphQL queries to request specific information. These queries mirror the structure of the subgraph's schema. Graph Node efficiently processes them, retrieves the required data from the database, and returns the results to the query issuer through a GraphQL interface.

Given the previous explanation of how Graph Node works, we can safely say that these definitions are highly programmable: they can be authored and customized by developers to specify how blockchain data is indexed, organized, and queried. This empowers developers to tailor the behavior of subgraphs to their specific needs, and it is part of what makes The Graph powerful. However, there are some things to look out for when it comes to properly optimizing the data within Graph Node. Let's take a look at the issue with storing large arrays and show how to optimize them using some features available in the schema configuration.

Large Arrays

To illustrate this issue, we will begin with a simple smart contract. In this example, we assume there is a createUser function along with an addTransaction function, and we track contract calls to these functions with the following events:

event UserCreated(
    address indexed userAddress,
    string username
);

event TransactionAdded(
    address indexed userAddress,
    uint256 transactionIndex
);

As a subgraph developer, we usually want to build our schema around these events. Let's say that we want to keep track of all the transactions that a particular user executed. Without any knowledge of how arrays are handled within a subgraph configuration, we might design a schema similar to the one below, where a field is defined as an array. In this example, we do this with the transactions field on the UserCreated entity.

type TransactionAdded @entity {
    id: Bytes!
    userAddress: Bytes!
    transactionIndex: BigInt!
    blockNumber: BigInt!
    blockTimestamp: BigInt!
    transactionHash: Bytes!
}

type UserCreated @entity {
    id: Bytes!
    userAddress: Bytes!
    username: String!
    blockNumber: BigInt!
    blockTimestamp: BigInt!
    transactionHash: Bytes!
    transactions: [String!]! # field defined as an array
}

Then we might update this entity field with code like the following, where we load the existing transactions, push the new one onto the array, and then save the entity.

let val = user.transactions;
val.push(event.transaction.hash.toHexString()); // add the transaction hash to the array
user.transactions = val;
user.save();

This will work, but it is important to note that when we access an array on an entity, we are actually getting a copy of that data. Therefore, when we update the data and save the entity, we are writing a new copy of the array while the original is left unchanged. This is not a problem for small arrays with fewer than 50 or so entries. However, if the array holds a larger amount of data and changes frequently, it will bloat the database.
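To see how quickly this compounds, remember that every save() persists a fresh copy of the entire array as a new row version. A short back-of-the-envelope sketch (plain TypeScript, not subgraph code) counts the total number of array elements the store ends up holding after n appends:

```typescript
// Each append re-saves the whole array, so the i-th saved version holds i elements.
// Total elements persisted after n appends: 1 + 2 + ... + n = n * (n + 1) / 2.
function storedElements(n: number): number {
  let total = 0;
  for (let len = 1; len <= n; len++) {
    total += len; // the i-th row version stores a copy of i elements
  }
  return total;
}

console.log(storedElements(10)); // 55 elements stored for just 10 transactions
console.log(storedElements(1000)); // 500500 elements for 1000 transactions
```

Storage grows quadratically with the number of updates, which is why an append-heavy array field bloats the database even though each individual save looks cheap.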

This behavior exists because of a powerful capability in Graph Node known as "time-travel queries." Originally implemented so that Graph Node could handle chain reorgs and ensure data accuracy by tracking state at a certain block number and block hash, this feature also lets users query the subgraph at any specific point in time, giving access to rich historical data. To achieve this, Graph Node keeps track of all changes to all entities for any given subgraph.

Here is an example of what a GraphQL response would look like with the UserCreated entity that tracks transactions.

Query:

query MyQuery {
    userCreateds {
        transactions
        userAddress
        username
    }
}

Output:

{
    "data": {
        "userCreateds": [
            {
                "transactions": [
                    "0xeed478e63f2f9648997983f0fe4a5c662dbbfaee958f37c7a0c4c98d58724fa1",
                    "0xecc8739fa98564902b7be0a44f048f4386c7cf30280d51f61440c040cf132725",
                    "0x82b598e665b3c7586ee1e48b20bf96fcaceef31f53b2e07f5d5023a55a9e69a2",
                    "0x4d7686e45275eaf71db4c0ba11e162b35e82ae521501ac691f1516a16084a775"
                ],
                "userAddress": "0x328cf2bb352ad3bab62a7a22233b9b1ff2f79f69",
                "username": "alice"
            },
            {
                "transactions": [
                    "0xc53afb441cbbb4f03c89137c5713b230cd9015c84d3116a62debfa3c3cd3d445"
                ],
                "userAddress": "0xd418aa30f9cbd8107f1842aef83dfc6947321114",
                "username": "bob"
            }
        ]
    }
}

Looking at the response from GraphQL you might think everything is fine, but in order to really understand the problem, we should dig deeper to see how the data is stored in PostgreSQL. The database screenshot below shows the tables using a popular PostgreSQL tool called pgAdmin.

In pgAdmin, navigate to: Databases -> graph-node -> Schemas -> sgd1 (where the number identifies your subgraph deployment) -> Tables -> user_created.

Then simply right-click the table, select View/Edit Data, and choose All Rows.
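If you prefer the command line to pgAdmin, the same table can be inspected with psql. The connection parameters below are assumptions; substitute the host, user, and database of your own graph-node deployment:

```
# Connect to the graph-node database (adjust credentials for your setup)
psql -h localhost -U graph-node -d graph-node

# Then, at the psql prompt:
#   \dt sgd1.*                        -- list this subgraph's tables
#   SELECT * FROM sgd1.user_created;  -- view all stored row versions
```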

If we examine the transactions column of the user_created table, we can immediately see the issue with this approach:

After only a few transactions, each change duplicates the entire array into a new row while the existing row remains unchanged. This will obviously become an issue as more and more entries are added to our database.

The @derivedFrom feature gives us the ability to perform reverse lookups on entities. Enabling it creates a virtual field on the entity that may be queried but cannot be set manually through the mappings API. Rather, it is derived from the relationship defined on the other entity.

Additionally, as of graph-node v0.31.0, @graphprotocol/graph-ts v0.31.0, and @graphprotocol/graph-cli v0.51.0, the loadRelated method is available. It enables loading derived entity fields from within an event handler, which is useful when you need to read related values while processing events and decide how the data should be stored. This is a very powerful feature and makes data processing easy. When we combine loadRelated with @derivedFrom in our schema, we can solve our problem.

Let's look at a quick example of how a @derivedFrom configuration and loadRelated work in practice. Suppose we have the following schema with two entities, where the entityb field on EntityA is derived from a field called link on EntityB.

type EntityA @entity {
    id: ID!
    value: Int!
    entityb: [EntityB!] @derivedFrom(field: "link")
}

type EntityB @entity {
    id: ID!
    value: Int!
    link: [EntityA!]
}

Notice that the link field on EntityB references EntityA, and the entityb field on EntityA is derived from it. Below is an example handler that saves some values to our entities.

// Handler for an event that creates EntityA
export function handleEventA(): void {
    let entityA = new EntityA("a1");
    entityA.value = 1;
    entityA.save();
}

// Handler for an event that creates EntityB
export function handleEventB(): void {
    let entityB = new EntityB("b1");
    entityB.value = 1;
    entityB.link = ["a1", "a2"]; // notice we store the ID values of the EntityA records here
    entityB.save();
}

After updating our values within the handlers, we would end up with the following data stored in our database. It is important to note that each entity type is stored in its own table; the relationship is only resolved when the derived field is loaded or queried.

EntityA
+-----+-------+
| id  | value |
+-----+-------+
| a1  | 1     |
| a2  | ...   |
| ... | ...   |
+-----+-------+

EntityB
+-----+-------+----------+
| id  | value | link     |
+-----+-------+----------+
| b1  | 1     | [a1, a2] |
| ... | ...   | ...      |
+-----+-------+----------+

To showcase what this looks like when loading the data into a handler, we can look at the following code as an example:

let entityA = EntityA.load("a1")!;
let b = entityA.entityb.load();

The value of b in the previous code is determined by our @derivedFrom(field: "link") configuration: it is the related load, generated when calling the .load() method. In this example, it loads every EntityB whose link field contains entityA.id, which is "a1" here. The SQL query generated by Graph Node for entityA.entityb.load() would therefore look roughly like the following. If you want to learn more about how SQL generation is handled by graph-node, you can read more in the documentation.

SELECT * FROM entity_b WHERE link = 'a1'; -- simplified illustration

The Solution

Knowing what we now know about how loadRelated works and how @derivedFrom is configured, we can go back and fix our code to store the data more efficiently. First, we update our previous schema: TransactionAdded gains a new field called user, and the transactions field on UserCreated is now derived from it. The derived field is still an array type, so we can represent the one-to-many relationship.

type TransactionAdded @entity {
    id: Bytes!
    userAddress: Bytes!
    transactionIndex: BigInt!
    blockNumber: BigInt!
    blockTimestamp: BigInt!
    transactionHash: String!
    user: UserCreated!
}

type UserCreated @entity {
    id: Bytes!
    userAddress: Bytes!
    username: String!
    blockNumber: BigInt!
    blockTimestamp: BigInt!
    transactionHash: Bytes!
    transactions: [TransactionAdded!]! @derivedFrom(field: "user")
}

Now we can change our handler to update the relationship between the entities. In this example, the event transaction hash is converted to a hex string using .toHexString() only to make it human-readable, so that we can examine the data within the tables.

export function handleTransactionAdded(event: TransactionAddedEvent): void {
    let entity = new TransactionAdded(
        event.transaction.hash.concatI32(event.logIndex.toI32())
    );
    entity.userAddress = event.params.userAddress;
    entity.transactionIndex = event.params.transactionIndex;
    entity.blockNumber = event.block.number;
    entity.blockTimestamp = event.block.timestamp;
    entity.transactionHash = event.transaction.hash.toHexString(); // saved as a hex string for readability
    entity.user = event.params.userAddress; // save the relationship
    entity.save();
}

In our configuration, we defined a new foreign key within our entity called user, so we save entity.user as event.params.userAddress for clarity; however, we could also have used the userAddress key already available and saved the relationship using entity.userAddress. After successfully storing the data with our new schema and handler, we can open up the database again and see how much more organized the data structure becomes. We no longer have the compounding array data, and the transactions are stored in a much cleaner way.

Additionally, we can query GraphQL to verify the data is still correlated. As you can see below, we can get the transaction ids for each user in one simple, easy-to-consume GraphQL query.

Query:

query MyQuery {
    userCreateds {
        userAddress
        username
        transactions {
            id
        }
    }
}

Output:

{
    "data": {
        "userCreateds": [
            {
                "transactions": [
                    {
                        "id": "0x48aa5a7cb0192a25602911e23bec97b865e81be5d4252da660ab7b9ccc74083a00000000"
                    }
                ],
                "userAddress": "0x0498a25019ad52a979dd81ab959fe59b2f6d71db",
                "username": "bob"
            },
            {
                "transactions": [
                    {
                        "id": "0x474e5631a3f8230212e06ecd28882c4f2954bd0ad108cd689d2d1ffef705385b00000000"
                    },
                    {
                        "id": "0xa5f45a612c54e5323ef8880b35847d89ade94afb76f5bd71a46b2624037d696900000000"
                    },
                    {
                        "id": "0xe585a9242d1f2abde0a487119f392df210abe9ba10d889e86a7af21e259e818500000000"
                    }
                ],
                "userAddress": "0x328cf2bb352ad3bab62a7a22233b9b1ff2f79f69",
                "username": "alice"
            }
        ]
    }
}

Summary

The Graph is an exceptionally powerful data indexing protocol, enabling sophisticated storage of data through customizable schema configuration and handler logic. This flexibility makes it a potent tool for shaping data structures. Consequently, when crafting your subgraphs, it's important to carefully assess how Indexers across the network will store your data.

Leveraging the @derivedFrom feature within entities can significantly enhance storage performance and give your subgraph a clean, optimized data storage architecture. If you want to dive deeper into @derivedFrom, you can do so here in the docs.

Thanks to the SQL query generation handled by graph-node, complicated logic runs behind the scenes to build the proper SQL query from the schema definition and deliver the data to you via GraphQL.

Thank you for reading the second edition of the Best Practices in Subgraph Development series. Make sure to follow The Graph on Twitter or join the conversation on Discord for more information on how to sharpen and improve your subgraph development efforts!

About The Graph

The Graph is the source of data and information for the decentralized internet. As the original decentralized data marketplace that introduced and standardized subgraphs, The Graph has become web3’s method of indexing and accessing blockchain data. Since its launch in 2018, tens of thousands of developers have built subgraphs for dapps across 70+ blockchains - including Ethereum, Solana, Arbitrum, Optimism, Base, Polygon, Celo, Fantom, Gnosis, and Avalanche.

As demand for data in web3 continues to grow, The Graph enters a New Era with a more expansive vision including new data services and query languages, ensuring the decentralized protocol can serve any use case - now and into the future.

Discover more about how The Graph is shaping the future of decentralized physical infrastructure networks (DePIN) and stay connected with the community. Follow The Graph on X, LinkedIn, Instagram, Facebook, Reddit, Farcaster and Medium. Join the community on The Graph’s Telegram, join technical discussions on The Graph’s Discord.

The Graph Foundation oversees The Graph Network. The Graph Foundation is overseen by the Technical Council. Edge & Node, StreamingFast, Semiotic Labs, Messari, GraphOps, Pinax and Geo are some of the many organizations within The Graph ecosystem.


Published December 19, 2023

Kevin Jones