File Data Sources Tutorial: Using Subgraphs to Index Off-Chain Data

TLDR: By combining IPFS and Arweave files with on-chain information, subgraphs are able to provide new data solutions to dapps. Dive into this blog post to explore both the problem-solving techniques a subgraph developer may encounter and the process of building a File Data Sources subgraph that indexes Lens Protocol posts.

The completed repo of this tutorial’s File Data Sources subgraph.

The published File Data Sources subgraph live on The Graph Network.


Technology And Terminology in This Tutorial

File Data Sources - A subgraph feature designed to index off-chain files and their contents. Currently, File Data Sources can index off-chain files stored on Arweave and IPFS.

Arweave - A decentralized storage network that focuses on providing permanent, unchangeable data storage. In contrast to other storage solutions, it offers a pay-once model, ensuring that data is stored forever without recurring fees.

IPFS (InterPlanetary File System) - A peer-to-peer network that enables the decentralized sharing and storing of files, aiming to make the web faster, safer, and more open. Unlike Arweave's permanent storage model, IPFS allows for more dynamic data updating and retrieval but doesn't inherently guarantee perpetual storage of data.

Lens Protocol - A decentralized social graph designed to give users control over their content and connections in a web3-native space. Lens Protocol stores post content on both Arweave and IPFS which is considerably more cost-effective than storing this data on-chain.

ABI - An ABI (Application Binary Interface) is a standardized method for interacting with smart contracts in a blockchain, defining how to call functions, how data is structured, and how results are interpreted. Subgraphs use ABIs to decode and index smart contract events and function calls, enabling the querying of blockchain data in an efficient and structured manner.

Data Source Templates - Data Source Templates are used for indexing dynamic data sources. There are two ways Data Source Templates are used in subgraphs: to index off-chain files and to programmatically add new contracts to index at runtime, as with proxy smart contracts or factories.

Proxy pattern smart contracts - This smart contract architecture consists of a single implementation smart contract and possibly many proxy smart contracts. Learn more about proxy pattern smart contracts by watching a video tutorial by Patrick Collins and reading documentation from OpenZeppelin.

A File Data Sources Use-Case

In the past, a subgraph's indexing of off-chain data was linear: if the subgraph was triggered to index an off-chain file, on-chain indexing would pause until the file was retrieved. Now, File Data Sources allows on-chain and off-chain data to be indexed in parallel, improving sync speed and reliability. Due to these improvements, now is a great time to learn to build with File Data Sources!

Let’s start with a use case. Wouldn’t it be powerful to gather all of Lens Protocol’s posts, including the usernames of those who posted?

From this subgraph, we could perform numerous analyses such as trend analysis, influencer identification, content recommendation, and more using libraries such as Playgrounds or even create no-code dashboards from DappLooker.

Confirming Data Architecture and our Subgraph’s Specs

Before we start building, we need to confirm that there are both on-chain and off-chain data sources to trigger the indexing of our File Data Sources subgraph.

Most NFTs, including some Lens Protocol V1 posts, store their content and metadata in JSON files off-chain while storing their state on-chain. After confirming that Lens uses both IPFS and Arweave and has its proxy and implementation smart contracts on Polygon, we are ready to move onto creating our subgraph’s specs.

Here are the two key operations our subgraph will perform to properly index off-chain data:

  1. Gather file IDs from an On-chain Event
    • We will gather off-chain file IDs from the PostCreated event. This event is defined on the LensHub Implementation Smart Contract and emits from the LensHub Proxy Smart Contract.
      • We will trigger File Data Sources indexing with these file IDs.
  2. Index off-chain files from Arweave and IPFS with File Data Sources
    • We will be gathering the post’s content from the JSON file stored on Arweave or IPFS and storing the file’s metadata in a PostContent entity in our subgraph.

Off-chain File IDs That Trigger File Data Sources

Arweave

  • As of version 0.33.0, Graph Node can fetch files stored on Arweave based on their transaction ID from an Arweave gateway (example file). Arweave supports transactions uploaded via Irys (formerly Bundlr), and Graph Node can also fetch files based on Irys manifests.

IPFS

  • Graph Node supports v0 and v1 content identifiers (CIDs), and CIDs with directories such as Qm.../metadata.json.
    • For example, if the URI emitted from the chain looks like this: https://ipfs.infura.io/ipfs/QmR7baNsHXNXEThcZNSw1SpRu1ZvKjaCnakEemT94Ur9Pn, we need to extract this file ID to trigger IPFS File Data Sources: QmR7baNsHXNXEThcZNSw1SpRu1ZvKjaCnakEemT94Ur9Pn
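To make that extraction concrete, here is a small helper in plain TypeScript. This is a hypothetical illustration, not subgraph code — inside a mapping we use AssemblyScript's string methods instead:

```typescript
// Hypothetical helper: pull an IPFS file ID out of a gateway URL.
// Everything after "/ipfs/" is the CID, possibly followed by a
// directory path (e.g. ".../metadata.json"), which Graph Node also accepts.
function extractIpfsFileId(uri: string): string | null {
  const marker = "/ipfs/";
  const index = uri.indexOf(marker);
  if (index === -1) return null;
  return uri.substring(index + marker.length);
}

console.log(
  extractIpfsFileId(
    "https://ipfs.infura.io/ipfs/QmR7baNsHXNXEThcZNSw1SpRu1ZvKjaCnakEemT94Ur9Pn"
  )
);
// → QmR7baNsHXNXEThcZNSw1SpRu1ZvKjaCnakEemT94Ur9Pn
```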

Now that we’ve confirmed our data’s architecture and understand our subgraph’s specs, let’s work on our first subgraph spec: gathering file IDs from an on-chain event.


Gathering File IDs from On-chain Events

If we were building our own smart contract, it would be quite easy to ensure it emits an event with a file’s ID for File Data Sources. As we are not building our own smart contract, we will need to dig into Lens’ smart contracts to see if we can find an event that emits a file’s ID for our File Data Sources subgraph.

Let’s explore LensHub implementation smart contract’s events.sol to find an event that may contain a file’s ID that we can use to trigger File Data Sources off-chain indexing.

If we look at the PostCreated event, we find that it is emitted every time a post is created. We can also see it has the contentURI field; this field may contain file IDs that we need to trigger File Data Sources.

event PostCreated(
    uint256 indexed profileId,
    uint256 indexed pubId,
    string contentURI,
    address collectModule,
    bytes collectModuleReturnData,
    address referenceModule,
    bytes referenceModuleReturnData,
    uint256 timestamp
);

Let's deploy a subgraph that indexes this PostCreated event and its contentURI field to see if it contains the file’s IDs we need to trigger File Data Sources indexing. Once we confirm that this contentURI has the file’s ID of the file that we need, we can extend this subgraph further to trigger File Data Sources indexing of that file.

Spin up our Subgraph to Index the PostCreated Event

Start at www.thegraph.com/studio and create a new subgraph, selecting Polygon as its indexed blockchain since Polygon is the chain where the smart contracts are deployed.

Using graph-cli, initialize our subgraph using the graph init command listed in the bottom right of Subgraph Studio.

When the CLI asks for a smart contract address, point it to the LensHub proxy smart contract address (0xDb46d1Dc155634FbC732f92E853b10B288AD5a1d) and confirm that we want to index the events as entities. We want to target the proxy smart contract as this is the smart contract that Lens participants interact with and will emit events from both the implementation smart contract and proxy smart contract.

If the CLI asks us for the start block and ABI, we can go to https://miniscan.xyz/ and input the proxy smart contract address to gather that data.

After authorizing, deploy the subgraph!

This subgraph is now indexing the proxy smart contract, but as we will soon see, this subgraph does not yet have the capacity to index both the implementation and the proxy smart contract. We need our subgraph to capture the PostCreated event defined on the implementation smart contract that emits from the proxy smart contract.

Let’s move onto exploring this problem:

Problem #1: Indexing a Proxy Pattern Smart Contract

Proxy smart contracts emit both their events and their implementation’s events.

When setting up a subgraph in graph-cli, the CLI scaffolds event handlers and entities using the provided contract's ABI. This ABI is obtained from the block explorer where the smart contract is deployed. If the CLI is pointed to a proxy smart contract, it only gathers this information from the proxy and not from its implementation.

This means our current subgraph is unaware of the implementation smart contract’s PostCreated event! It needs the ABI, event handlers, and entities for this PostCreated event!

Potential Solution A: Provide CLI with both Proxy and Implementation Smart Contracts

As we are indexing just a singular proxy and a singular implementation smart contract, we could provide the CLI with the addresses of both the implementation and the proxy smart contracts. Consequently, the subgraph will index any event that is emitted from both the proxy and the implementation smart contract.



When the CLI asks “Add another contract?” we could say yes, and we would be able to simply index both smart contracts and see all events emitted from either of them.

Unfortunately, Potential Solution A will not work due to an edge case! In rare cases, the ABI in the block explorer does not accurately reflect the smart contract.

If we look into the smart contract code of the Lens Protocol’s implementation smart contract, we can see that the PostCreated event is in the events.sol file; however, it is nowhere to be found in the block explorer’s ABI.

This means using the CLI to gather the implementation ABI will not solve this problem, as the ABI gathered will not have the all-important PostCreated event data.

Let’s try another solution:

Potential Solution B: Manually Creating the PostCreated ABI Entry

Let’s manually build the implementation’s ABI so that our subgraph correctly captures the PostCreated event.

To get the ABI entry that properly reflects PostCreated, we will compile events.sol in Remix (any other smart contract framework works as well), then copy/paste the PostCreated event definition into our subgraph’s ABI. This should solve our problem!

Start by copying the events.sol smart contract from the LensHub implementation smart contract and place it in the contracts folder of a new Remix IDE session.

You should see a few lines of code erroring out. We can comment those lines out.

The reason we can just comment these parts out is that we don’t need the entire file to compile, just the PostCreated event.

Let’s go ahead and compile events.sol:

Once compiled, look in the artifacts folder for your PostCreated ABI in Events.json:

{
  "anonymous": false,
  "inputs": [
    { "indexed": true, "internalType": "uint256", "name": "profileId", "type": "uint256" },
    { "indexed": true, "internalType": "uint256", "name": "pubId", "type": "uint256" },
    { "indexed": false, "internalType": "string", "name": "contentURI", "type": "string" },
    { "indexed": false, "internalType": "address", "name": "collectModule", "type": "address" },
    { "indexed": false, "internalType": "bytes", "name": "collectModuleReturnData", "type": "bytes" },
    { "indexed": false, "internalType": "address", "name": "referenceModule", "type": "address" },
    { "indexed": false, "internalType": "bytes", "name": "referenceModuleReturnData", "type": "bytes" },
    { "indexed": false, "internalType": "uint256", "name": "timestamp", "type": "uint256" }
  ],
  "name": "PostCreated",
  "type": "event"
}

Here’s our ABI entry we are missing!

Take this ABI entry and copy/paste it into our subgraph’s ABI file (abis/Contract.json). Now, our subgraph’s ABI is ready, as it can now see the PostCreated event emitted from the implementation smart contract that passes through the proxy smart contract.
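If you would rather script this step than paste by hand, the edit amounts to appending one object to a JSON array. Here is a minimal plain-TypeScript sketch of that merge, assuming nothing about the real file beyond it being a JSON array of ABI entries (the sample entries are abridged for illustration):

```typescript
// Hypothetical helper: append an event entry to an ABI array,
// skipping it if an entry with the same name and type already exists.
type AbiEntry = { type: string; name?: string; [key: string]: unknown };

function addEventToAbi(abi: AbiEntry[], entry: AbiEntry): AbiEntry[] {
  const exists = abi.some(
    (e) => e.type === entry.type && e.name === entry.name
  );
  return exists ? abi : [...abi, entry];
}

// The proxy's ABI (abridged) plus the PostCreated entry we compiled in Remix.
const proxyAbi: AbiEntry[] = [{ type: "function", name: "post" }];
const postCreated: AbiEntry = { type: "event", name: "PostCreated" };

const merged = addEventToAbi(proxyAbi, postCreated);
console.log(merged.length); // → 2
```

Reading and rewriting abis/Contract.json with this helper is left out here; the duplicate check matters because re-running a merge script should not add the event twice.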

Problem #1 has been solved!

Now that we have a subgraph that can accurately see the PostCreated event passing through the proxy smart contract we’re indexing, we can move onto our next problem to solve.

As we recall, the initial intent of spinning up our subgraph was to see if we could find a file ID to trigger File Data Sources. However, as we had to manually generate our ABI, we encounter another interesting problem to solve.

Problem #2: Handling the PostCreated Event

Now that our subgraph can see the PostCreated event, we must extend our subgraph to handle the PostCreated event data. Usually the CLI scaffolds this out automatically, but as we manually had to update the ABI, we must also manually extend our subgraph. This will be a good exercise to learn how data is handled in a subgraph.

Solution: Extend our Subgraph to Index the PostCreated Event

Starting with our subgraph.yaml file, let’s add the PostCreated event handler, as well as define its PostCreated entity:

specVersion: 0.0.5
schema:
  file: ./schema.graphql
dataSources:
  - kind: ethereum
    name: Contract
    network: matic
    source:
      address: "0xDb46d1Dc155634FbC732f92E853b10B288AD5a1d"
      abi: Contract
      startBlock: 29735286
    mapping:
      kind: ethereum/events
      apiVersion: 0.0.7
      language: wasm/assemblyscript
      entities:
        - PostCreated
      abis:
        - name: Contract
          file: ./abis/Contract.json
      eventHandlers:
        - event: PostCreated(indexed uint256,indexed uint256,string,address,bytes,address,bytes,uint256)
          handler: handlePostCreated
      file: ./src/contract.ts
subgraph.yaml

Let’s move onto schema.graphql.

Here, we will include the PostCreated entity we just defined in subgraph.yaml. We won’t include all the fields on the PostCreated event as we don’t need that data. We just need to see if the contentURI field contains a file ID to trigger File Data Sources.

type PostCreated @entity(immutable: true) {
  id: Bytes!
  ownerId: BigInt!
  contentURI: String!
  timestamp: BigInt!
}
schema.graphql

With our subgraph.yaml manifest and the schema.graphql ready to accept the PostCreated on-chain event, we are ready to build the handlePostCreated handler in our mappings.ts.

Before we do so, we should run graph codegen in our terminal.


Tip: any time we alter schema.graphql, we should run graph codegen to update our types as we import those autogenerated files at the top of our mappings.ts .


Read the comments to take a dive into a step-by-step explanation of our mappings.ts logic.

// Import Bytes from graph-ts to build our entity's ID.
import { Bytes } from "@graphprotocol/graph-ts";
// Import the event helper code and rename to improve readability.
import { PostCreated as PostCreatedEvent } from "../generated/Contract/Contract";
// Import the types generated from the schema we created.
import { PostCreated } from "../generated/schema";

// Create a new entity with a unique ID. This unique ID is important,
// as it must be the same ID used in our `handlePostContent` handler built later
// in this tutorial.
export function handlePostCreated(event: PostCreatedEvent): void {
  let entity = new PostCreated(
    Bytes.fromUTF8(
      event.params.profileId.toString() + "-" + event.params.pubId.toString(),
    ),
  );
  // Send the `contentURI` emitted as an event parameter into our defined entity.
  // When this subgraph is deployed and indexing this property, we will look through
  // `contentURI` to see if we can find some IDs.
  entity.contentURI = event.params.contentURI;
  // Assign other parameters emitted from the event that might be helpful when
  // filtering.
  entity.ownerId = event.params.profileId;
  entity.timestamp = event.params.timestamp;
  entity.save();
}
mappings.ts
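As an aside, this UTF-8 ID scheme is easy to verify outside the subgraph. A minimal plain-TypeScript sketch (graph-ts’s Bytes.fromUTF8 does the equivalent inside the mapping):

```typescript
// Reproduce Bytes.fromUTF8(profileId + "-" + pubId) as a hex string.
// A profileId of 1 and pubId of 1 yield 0x312d31, because "1", "-",
// and "1" are 0x31, 0x2d, and 0x31 in UTF-8.
function postCreatedId(profileId: number, pubId: number): string {
  const bytes = new TextEncoder().encode(`${profileId}-${pubId}`);
  const hex = Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join("");
  return "0x" + hex;
}

console.log(postCreatedId(1, 1));   // → 0x312d31
console.log(postCreatedId(1, 100)); // → 0x312d313030
```

Keep these hex values in mind — they will show up as the id fields in our query results shortly.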

We’ve updated our subgraph.yaml , our schema.graphql , and our mappings.ts files to properly reflect our updated ABI.

Problem #2 has been solved!

We’ve faced two problems and solved them! Now is the time to re-deploy our subgraph.

Go back to Subgraph Studio and run the graph deploy command to re-deploy.

Once it is deployed, run these two queries in your subgraph’s Playground to see the various URIs passed through the contentURI:

Example Arweave query and response

{
  postCreateds(where: { contentURI_contains: "arweave" }, first: 1) {
    id
    contentURI
  }
}
Arweave query

{
  "data": {
    "postCreateds": [
      {
        "id": "0x312d313030",
        "contentURI": "https://arweave.net/UjlOEosUyYqZ8oWlTi879YRTg9xyd5gUtKQ2zMSHQEg"
      }
    ]
  }
}
Arweave response

Example IPFS query and response

{
  postCreateds(where: { contentURI_contains: "ipfs" }, first: 1) {
    id
    contentURI
  }
}
IPFS query

{
  "data": {
    "postCreateds": [
      {
        "id": "0x312d31",
        "contentURI": "https://ipfs.infura.io/ipfs/QmR7baNsHXNXEThcZNSw1SpRu1ZvKjaCnakEemT94Ur9Pn"
      }
    ]
  }
}
IPFS response

We have found file IDs within the URIs!

  • Arweave example file ID: s9qinED7mYvrNrYTEbRlNb60bVT754LkVpQoW7Ffi24
  • IPFS example file ID: QmTWJEzcxxcPjnB8Xj8S4EUJ7RxHGkfcDfMZyYRcQz7eYN

Let’s move on to our second subgraph’s spec: indexing these off-chain files from Arweave and IPFS with File Data Sources.


Indexing Off-chain Files from Arweave and IPFS with File Data Sources

In this section we will focus on programmatically triggering File Data Sources with the file IDs we have found, as well as handling this off-chain data.

Let’s start this process by designing two subgraph templates for Arweave and IPFS data:

File Data Sources Templates Specs

We will create two templates. One template will be for Arweave and the other will be for IPFS. Each template will:

  • Have a unique name.
  • Name the PostContent entity as the entity that will receive both Arweave and IPFS data.
  • Name the handlePostContent handler as the handler for both Arweave and IPFS data that will pass data to the PostContent entity.
  • Maintain off-chain and on-chain data separation. Read more about data separation and other File Data Sources limitations here.

Adding Two Templates to subgraph.yaml

We will add the ArweaveContent and IpfsContent templates below the proxy smart contract we are already indexing:

specVersion: 0.0.5
schema:
  file: ./schema.graphql
dataSources:
  - kind: ethereum
    name: Contract
    network: matic
    source:
      address: "0xDb46d1Dc155634FbC732f92E853b10B288AD5a1d"
      abi: Contract
      startBlock: 29735286
    mapping:
      kind: ethereum/events
      apiVersion: 0.0.7
      language: wasm/assemblyscript
      entities:
        - PostCreated
      abis:
        - name: Contract
          file: ./abis/Contract.json
      eventHandlers:
        - event: PostCreated(indexed uint256,indexed uint256,string,address,bytes,address,bytes,uint256)
          handler: handlePostCreated
      file: ./src/contract.ts
templates:
  - kind: file/arweave
    name: ArweaveContent
    mapping:
      kind: ethereum/events
      apiVersion: 0.0.7
      language: wasm/assemblyscript
      entities:
        - PostContent
      abis:
        - name: Contract
          file: ./abis/Contract.json
      handler: handlePostContent
      file: ./src/contract.ts
  - kind: file/ipfs
    name: IpfsContent
    mapping:
      kind: ethereum/events
      apiVersion: 0.0.7
      language: wasm/assemblyscript
      entities:
        - PostContent
      abis:
        - name: Contract
          file: ./abis/Contract.json
      handler: handlePostContent
      file: ./src/contract.ts
subgraph.yaml

Extending our Subgraph to Reflect our new Templates

Our templates refer to the PostContent entity and handlePostContent handler. Let’s extend our subgraph to reflect these changes. We will:

  • Create PostContent in schema.graphql
    • This is where the off-chain data will be stored.
  • Update handlePostCreated in mappings.ts
    • This is where we will pass the file IDs gathered from the on-chain event into the template using DataSourceTemplate.createWithContext().
  • Create handlePostContent in mappings.ts
    • Once File Data Sources is triggered, we need to pass that data into the PostContent entity we previously created in schema.graphql .

Let’s get to work!

Create PostContent Entity in schema.graphql

This entity will store content and the file ID that triggered File Data Source indexing from the Arweave and IPFS files.

type PostCreated @entity(immutable: true) {
  id: Bytes!
  ownerId: BigInt!
  contentURI: String!
  timestamp: BigInt!
}

type PostContent @entity(immutable: true) {
  id: Bytes!
  hash: String!
  content: String!
}

Please note: the entity’s id is specific to the entity itself and not the file ID of the off-chain file. The file ID of the off-chain file will be stored as hash.


Tip: Any time we are creating an entity that is an event-log of either on-chain or off-chain data with no alterations in the mappings.ts, it’s best to include (immutable: true) as seen in the snippet above, as this greatly improves indexing speed. See this blog post from David Lutterkort to learn more about immutable entities and their performance benefits.


Now our subgraph.yaml has the File Data Source templates ready, and our schema.graphql is ready to accept data from both on-chain and off-chain sources.

Let's move on to extending our handlers in mappings.ts .

Update handlePostCreated in mappings.ts

For the sake of this tutorial, we’ll keep it simple by gathering specific IDs from just one targeted URI structure each for Arweave and IPFS; just know that if we wanted to gather all the Arweave and IPFS files, we’d need to build out more strategies to handle all the various returned URIs.

Read the comments to get a step-by-step explanation.

// Add DataSourceContext and DataSourceTemplate to our graph-ts imports.
import {
  Bytes,
  DataSourceContext,
  DataSourceTemplate,
} from "@graphprotocol/graph-ts";

// POST_ID_KEY will be used as the key for a key-value pair passed into the
// `context` argument of createWithContext(name: string, params: string[], context: DataSourceContext).
const POST_ID_KEY = "postID";

export function handlePostCreated(event: PostCreatedEvent): void {
  let entity = new PostCreated(
    Bytes.fromUTF8(
      event.params.profileId.toString() + "-" + event.params.pubId.toString(),
    ),
  );
  entity.ownerId = event.params.profileId;
  entity.contentURI = event.params.contentURI;
  entity.timestamp = event.params.timestamp;
  entity.save();

  // EXTRACT THE ID FROM OUR ENTITY
  // Find the index of where "arweave.net/" or "/ipfs/" appears within the contentURI.
  // This is a relatively naive way of determining whether the content is from
  // Arweave or IPFS. Feel free to extend this further to capture all the various
  // ways that IDs present in Arweave and IPFS URIs.
  let arweaveIndex = entity.contentURI.indexOf("arweave.net/");
  let ipfsIndex = entity.contentURI.indexOf("/ipfs/");

  // If both indexOf calls return -1, neither string was found.
  // At that point, there's nothing else to do, so the function ends.
  if (arweaveIndex == -1 && ipfsIndex == -1) return;

  // PREPARE `CONTEXT` - PASS IN OUR ID
  // DataSourceContext holds a key-value pair that is converted into Bytes
  // to be passed into other handlers. The key was defined outside this function as
  // POST_ID_KEY, and the value is entity.id. This allows consistency between
  // handlers as the data is being indexed.
  let context = new DataSourceContext();
  context.setBytes(POST_ID_KEY, entity.id);

  // If Arweave or IPFS data is found in the URI, the file's hash is extracted
  // from contentURI. We now have the three arguments we need! Pass them into
  // createWithContext() to trigger File Data Sources indexing!
  if (arweaveIndex != -1) {
    let hash = entity.contentURI.substr(arweaveIndex + 12);
    DataSourceTemplate.createWithContext("ArweaveContent", [hash], context);
    return;
  }
  if (ipfsIndex != -1) {
    let hash = entity.contentURI.substr(ipfsIndex + 6);
    DataSourceTemplate.createWithContext("IpfsContent", [hash], context);
  }
}
mappings.ts
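Because the substring offsets (+12 for "arweave.net/", +6 for "/ipfs/") are easy to get wrong, it’s worth sanity-checking the extraction logic outside the subgraph. Here’s an illustrative plain-TypeScript re-implementation of the dispatch above, run against the two contentURI values our earlier queries returned:

```typescript
// Mirror handlePostCreated's dispatch: classify a contentURI and
// extract the file hash that would be handed to createWithContext().
function extractFileId(
  contentURI: string,
): { template: string; hash: string } | null {
  const arweaveIndex = contentURI.indexOf("arweave.net/");
  if (arweaveIndex !== -1) {
    // "arweave.net/".length === 12, so this skips past the marker.
    return { template: "ArweaveContent", hash: contentURI.substring(arweaveIndex + 12) };
  }
  const ipfsIndex = contentURI.indexOf("/ipfs/");
  if (ipfsIndex !== -1) {
    // "/ipfs/".length === 6.
    return { template: "IpfsContent", hash: contentURI.substring(ipfsIndex + 6) };
  }
  return null;
}

const ar = extractFileId("https://arweave.net/UjlOEosUyYqZ8oWlTi879YRTg9xyd5gUtKQ2zMSHQEg");
console.log(ar?.template, ar?.hash);
// → ArweaveContent UjlOEosUyYqZ8oWlTi879YRTg9xyd5gUtKQ2zMSHQEg
```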

Now that we’ve triggered our templates, they are looking for a handlePostContent handler that will handle the off-chain data they are passed.

Create handlePostContent in mappings.ts

Read the comments to get a step-by-step explanation.

// Add dataSource to our graph-ts imports to read the template's parameters,
// and import the PostContent type generated from our schema.
import { Bytes, dataSource } from "@graphprotocol/graph-ts";
import { PostContent } from "../generated/schema";

export function handlePostContent(content: Bytes): void {
  // Remember DataSourceTemplate.createWithContext()? We can access
  // everything we just passed into that method here!
  // Gather the `hash` (aka the file ID) with dataSource.stringParam().
  // Gather the `context` that has our ID encoded as Bytes with
  // dataSource.context(), then decode it.
  let hash = dataSource.stringParam();
  let context = dataSource.context();
  let id = context.getBytes(POST_ID_KEY);
  // We pass in the same ID used in the previous `handlePostCreated` handler here
  // to link the on-chain PostCreated entity with the off-chain PostContent entity.
  let post = new PostContent(id);
  post.hash = hash;
  post.content = content.toString();
  post.save();
}
mappings.ts

Go ahead and redeploy the subgraph to start indexing with File Data Sources!

Send this query through Subgraph Studio’s Playground to see whether we’re getting both Arweave and IPFS data.

{
  postCreateds(first: 100) {
    id
    contentURI
  }
  postContents(first: 100) {
    id
    hash
    content
  }
}
Query to gather on-chain postCreateds and off-chain postContents

We are now indexing with File Data Sources!

Here is our File Data Sources subgraph published on The Graph Network as well as the final repo.

From here, there are so many more things that we could do!

  • Extend the subgraph further to parse the JSON data and populate entities. See how this could be accomplished with this example.
  • What if we wanted to build a dapp with this data? We could plug this subgraph into ScaffoldEth and continue building!
  • Use python to query this subgraph using Playgrounds and perform data analysis.
  • Create no-code dashboards from DappLooker.
  • Trigger File Data Sources indexing of another off-chain file from within another off-chain handler. See this feature.
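As a taste of the first idea, here is a hedged plain-TypeScript sketch of parsing post metadata. The field names ("name", "content") are assumptions for illustration — check the actual metadata JSON your subgraph indexes — and inside a mapping you would use graph-ts’s json module rather than JSON.parse:

```typescript
// Hypothetical sketch: pull fields out of a post's metadata JSON.
// The "name" and "content" fields are assumed for illustration only.
function parsePostMetadata(raw: string): { name: string | null; content: string | null } {
  const data = JSON.parse(raw);
  return {
    name: typeof data.name === "string" ? data.name : null,
    content: typeof data.content === "string" ? data.content : null,
  };
}

const sample = '{"name": "Post by @alice", "content": "gm"}';
const parsed = parsePostMetadata(sample);
console.log(parsed.name, "/", parsed.content); // → Post by @alice / gm
```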

Thank you for joining me on this journey into indexing using File Data Sources. I’m looking forward to seeing what you build!

Marcus Rein

Developer Relations and Success

Edge & Node - Working on The Graph


Category
Graph Builders
Author
Marcus Rein
Published
February 2, 2024
