The Graph Unlocks a New Web3 Use Case: Organizing Off-chain Data

This blog was written by Craig Tutterow, Principal Data Scientist at Edge & Node, a core developer team of The Graph. Prefer to listen? You can hear this blog read aloud via The Graph Podcast.


TL;DR - The Graph can be used to organize off-chain data. Anyone can create a workflow where traditionally off-chain data is posted to IPFS and the IPFS hashes are posted on-chain. This data can then be indexed by subgraphs (open APIs) – which in turn can serve organized off-chain data to front ends in a performant way, with low cost and effort.


The Graph’s use cases are constantly growing. While The Graph is widely known for organizing on-chain data, a longtime mission of the protocol has been to provide easy access to all of the world’s public knowledge and information in a decentralized way. One way to make a broader pool of data accessible via The Graph has not yet been described at length: organizing and serving off-chain data. This post details a method of using The Graph to economically publish and serve complex, dynamic, open-source data at scale, without the overhead of building or maintaining proprietary APIs. It’s a use case that presents a unique opportunity to access and organize traditionally off-chain, centrally hosted data sources and create useful new data economies for users, developers, data scientists, researchers and more.

Web3’s Unique Value Proposition and Tradeoffs

Web3 architectures simplify back end development by enabling users to read and write to the underlying blockchain either directly or through middleware service providers. At scale, however, one tradeoff that arises is cost. On-chain transaction costs (i.e., gas) are prohibitive for more complex computation or larger scale data storage. This has limited the complexity of applications, or caused developers to depart from more open-source data development models by building proprietary off-chain, centrally-hosted APIs to augment on-chain data sources.

Web3’s unique value proposition is enabling a competitive market for software application development by leveraging cryptography to give users ownership of their public or private data. The public nature of blockchain transactions reduces user “lock-in” and switching costs, while protocol incentives that reward redundancy and consensus give users more confidence in the permanence and integrity of data. Data portability of this nature allows new types of experiences to be built at lower cost by crowdsourcing development effort in a permissionless way that is not beholden to a single company’s shareholders or management. Web3’s open data is a stark contrast to custodial web2 data silos (what technologist Jaron Lanier and economist Glen Weyl have called “siren servers”).

The Graph Network is an ideal example of how to serve blockchain data in a way that reduces dependence on proprietary APIs and can be maintained or hosted in a permissionless way. The Graph is often synonymous with organizing on-chain data, but the permissionless nature and indexing efficiency of subgraphs can be extended to serve off-chain data as well. Though off-chain data does not benefit from the same cryptographic guarantees as on-chain data, data generated off-chain nevertheless still fuels economies worldwide. Data from diverse sources could benefit from a standard querying method – a GraphQL API (aka subgraphs).

Creating Web3 Data Pipelines Using The Graph and Off-chain Data

Let’s dig into how to use The Graph to publish and serve open-source, dynamic data on blockchains and permaweb storage. With this method, you can reliably and affordably index and serve data to third-party consumers. Decoupling the publishing, indexing, and serving of data minimizes development operations expense. Plus, because the data is served by The Graph’s decentralized network, access to it remains robust and reliable without requiring any additional effort on the part of the data publisher.

The approach consists of the following components:

  • An off-chain cron job that performs complex computation and posts the results to a permaweb source indexable by The Graph. Currently this is only possible with IPFS, via the network’s support for file data sources; Arweave and Filecoin could follow once the protocol adds indexing support for file content on those networks.
  • An on-chain transaction, produced by the same cron job, that posts the file hash and any desired metadata (time, topic name, etc.), ideally via a gas-minimized DataEdge contract that performs no computation but persists the IPFS hash in transaction calldata. Developers can reuse the DataEdge contract Edge & Node has already created for this purpose, and OpenZeppelin Autotasks make submitting the transaction as easy as an API call. A sketch of such a job appears after this list.
  • Publication of a subgraph that indexes the IPFS files based on the file hashes posted on-chain.
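
To make this concrete, below is a minimal sketch of such a cron job in TypeScript. It assumes ethers v6, a local IPFS node exposing the standard HTTP API, and Node 18+; the contract address is a placeholder, not a real deployment, and a production job would add error handling, retries, and key management.

```typescript
// Minimal sketch of the off-chain cron job: compute data, pin it to IPFS,
// and post the resulting hash on-chain. Assumes ethers v6 and a local IPFS
// node; DATA_EDGE_ADDRESS is a placeholder, not a real deployment.
import { ethers } from "ethers";

const RPC_URL = "https://rpc.gnosischain.com"; // any EVM chain works
const DATA_EDGE_ADDRESS = "0x0000000000000000000000000000000000000000"; // placeholder
const PRIVATE_KEY = process.env.PRIVATE_KEY!;

// 1. Perform the complex computation off-chain (stubbed out here).
async function computeMetrics(): Promise<object> {
  return { generatedAt: Date.now(), values: [] };
}

// 2. Pin the result to IPFS via the standard HTTP API (`/api/v0/add`);
//    a hosted pinning service would look similar.
async function pinToIpfs(payload: object): Promise<string> {
  const form = new FormData();
  form.append("file", new Blob([JSON.stringify(payload)]));
  const res = await fetch("http://127.0.0.1:5001/api/v0/add", {
    method: "POST",
    body: form,
  });
  const { Hash } = await res.json(); // the CID of the pinned file
  return Hash;
}

// 3. Post the hash on-chain. In the DataEdge pattern the contract performs
//    no computation: the CID simply lands in calldata, where a subgraph
//    can later pick it up.
async function postHash(cid: string): Promise<void> {
  const provider = new ethers.JsonRpcProvider(RPC_URL);
  const wallet = new ethers.Wallet(PRIVATE_KEY, provider);
  const tx = await wallet.sendTransaction({
    to: DATA_EDGE_ADDRESS,
    data: ethers.hexlify(ethers.toUtf8Bytes(cid)), // CID as raw calldata
  });
  await tx.wait();
}

computeMetrics().then(pinToIpfs).then(postHash);
```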

Once the subgraph is published, it can be picked up and served by Indexers, allowing third-party developers and users to query the data at their own expense with no dependencies on a central entity to host and manage development operations associated with serving the data.

In other words: once data exists on-chain, anyone can use subgraphs to query traditionally siloed information and display it on a front end or spin up new use cases – all without needing to operate an expensive, failure-prone server to host that data.
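
As a rough sketch of what that looks like from the consumer side, here is a query against a hypothetical subgraph of this kind via The Graph’s gateway. The URL placeholders and the `dataPoints` entity are illustrative only; a real subgraph exposes whatever schema its developer defined.

```typescript
// Querying organized off-chain data from any client. The API key, subgraph
// ID, and `dataPoints` entity are placeholders for illustration.
const SUBGRAPH_URL =
  "https://gateway.thegraph.com/api/<API_KEY>/subgraphs/id/<SUBGRAPH_ID>";

async function latestDataPoints(): Promise<void> {
  const res = await fetch(SUBGRAPH_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      query: `{
        dataPoints(first: 10, orderBy: timestamp, orderDirection: desc) {
          id
          ipfsHash
          timestamp
        }
      }`,
    }),
  });
  const { data } = await res.json();
  console.log(data.dataPoints); // off-chain data, served by Indexers
}

latestDataPoints();
```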

For public/open-source data, this workflow outsources the substantial overhead associated with maintaining a proprietary API server by leveraging The Graph’s distributed network of Indexers. By open sourcing and indexing the data, this workflow allows third parties to compose on subgraphs and extend their utility to other use cases, while minimizing transaction and storage costs to the data publisher. In this workflow, the data publisher is only responsible for the costs of 1) running the cron job, 2) IPFS node pinning (currently subsidized by The Graph) or permaweb file storage, and 3) the on-chain transactions, which are generally low on high throughput chains.

In addition to the cost advantages for data publishers, using The Graph to index IPFS file content gives users the performance benefits associated with the geo-distribution and redundancy of The Graph’s indexing nodes. Because IPFS has no built-in protocol incentives around file availability or quality of service, it can be challenging to use as a back end for any high traffic web application. Since The Graph’s Indexers compete for (and are paid directly for) queries, they have built-in incentives to provide the quality of service needed for scaling consumer-grade web applications, and can serve as a DNS layer or gateway for indexing permaweb files in a performant way.

Furthermore, the decentralized structure of The Graph and permaweb storage encourages participation and gives third party developers more confidence by removing the single point of failure that a proprietary API provider represents. These benefits include:

  • Underlying data can be replicated and verified using the content hash (which is immutably stored on-chain), giving third party developers or IPFS node providers the ability to replicate, copy, self-host, or verify the data
  • Subgraph manifests (containing the schema and often the indexing logic) are persisted to IPFS with their hashes published on-chain, and can be indexed permissionlessly by multiple Indexers, thereby providing geo-distribution and redundancy in hosting and serving the data

An Example in The Graph Ecosystem

Edge & Node recently produced an oracle to publish and serve network cost and quality of service metrics using this method. Every five minutes, the oracle:

  1. Posts aggregated quality and cost of service data from available gateways to IPFS
  2. Posts the IPFS file hash to Gnosis chain via a DataEdge contract

These IPFS files can then be indexed in a subgraph, which protocol stakeholders can consume. For example code, one can reference this subgraph repo, built by Juan Manuel Rodriguez Defago of GraphOps, which parses data from the IPFS files and stores the values in the Indexer database. A simplified sketch of such a mapping appears below. You can read more about this particular oracle-subgraph pipeline in Edge & Node’s developer documentation.
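
For a flavor of the indexing side, here is a simplified AssemblyScript mapping following The Graph’s file data sources pattern. It is loosely modeled on, not taken from, the repo above: the `postData(bytes)` call handler, the `QoSDataPoint` entity, the template name, and the field names are all invented for this example.

```typescript
// Sketch of a subgraph mapping using file data sources. The generated
// classes (PostDataCall, QoSDataPoint) assume a hypothetical DataEdge
// function postData(bytes) and schema; real names will differ.
import {
  Bytes,
  DataSourceTemplate,
  dataSource,
  json,
} from "@graphprotocol/graph-ts";
import { PostDataCall } from "../generated/DataEdge/DataEdge";
import { QoSDataPoint } from "../generated/schema";

// Chain handler: fires on each DataEdge transaction, reads the CID from
// calldata, and spawns a file data source to fetch the file from IPFS.
export function handlePostData(call: PostDataCall): void {
  const cid = call.inputs.payload.toString();
  // "QoSDataPoint" must match a `kind: file/ipfs` template in subgraph.yaml.
  DataSourceTemplate.create("QoSDataPoint", [cid]);
}

// File handler: runs once the IPFS file is available, parses its JSON
// content, and stores it as an entity keyed by the CID.
export function handleQoSFile(content: Bytes): void {
  const cid = dataSource.stringParam(); // the CID this handler was spawned with
  const entity = new QoSDataPoint(cid);
  const data = json.fromBytes(content).toObject();
  const latency = data.get("avg_latency_ms"); // illustrative field name
  if (latency) {
    entity.avgLatencyMs = latency.toBigInt();
  }
  entity.save();
}
```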

This approach allows us to minimize devops overhead while making data available to community members such as DappLooker, empowering them to build useful tools without a dependency on the data publisher to host and maintain a proprietary API. For example, the DappLooker team built two dashboards based on the quality and cost of service subgraph: a gateways’ quality of service dashboard and an Indexers’ quality of service dashboard.

Costs

This workflow can be implemented at very low cost. Below are the costs associated with our quality of service oracle.

  • DataEdge contract transactions: ~$1-2 per month on Gnosis chain for ~10k transactions, or roughly $0.0001-0.0002 per transaction (not a typo, it’s actually $1-2)
  • IPFS node pinning: currently subsidized by The Graph. Independent SaaS options start at $20/month, with free tiers for simple/low volume usage.
  • Serving: $0 for the publisher; query fees are paid by the data consumer (with costs set by the Indexer market based on the complexity of the subgraph and the volume of queries). Prospective users can see how much current users are paying for different monthly query volumes in this spreadsheet.

Exploring New Web3 Use Cases

Dynamic data from an oracle served via The Graph could open up some interesting new possibilities for permaweb sites, including:

Blogs and content hubs with lean, cheap back ends:

Offload a blog’s content hosting to IPFS, and feed your front end with that content via a subgraph. Content publishers could host a blog using ENS by periodically posting content to IPFS and the chain using this method. The blog’s front end code would use the subgraph as its back end, removing the need for hosting, server maintenance, or DNS domain name registration. The publisher would only need a funded billing balance to pay for queries based on the site’s traffic. Depending on traffic and the frequency of updates, this could be done at a variable cost of <$1/month.
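
As a rough sketch of what such a front end could look like: the subgraph is the only “back end,” and post bodies are fetched straight from a public IPFS gateway. The endpoint placeholders and the `posts` entity are invented for illustration.

```typescript
// Hypothetical permaweb blog front end. The subgraph serves post metadata
// and IPFS hashes; the post body itself is resolved from an IPFS gateway.
const BLOG_SUBGRAPH =
  "https://gateway.thegraph.com/api/<API_KEY>/subgraphs/id/<BLOG_SUBGRAPH_ID>";

async function renderLatestPost(): Promise<void> {
  // 1. Ask the subgraph for the newest post's metadata and IPFS hash.
  const res = await fetch(BLOG_SUBGRAPH, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      query: `{
        posts(first: 1, orderBy: publishedAt, orderDirection: desc) {
          title
          ipfsHash
        }
      }`,
    }),
  });
  const { data } = await res.json();
  const post = data.posts[0];

  // 2. Resolve the post body from any public IPFS gateway using the hash.
  const body = await fetch(`https://ipfs.io/ipfs/${post.ipfsHash}`);
  document.title = post.title;
  document.body.innerText = await body.text();
}

renderLatestPost();
```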

Algorithmic choice and personalized experiences:

Prioritize content that’s curated for you or your community. Data publishers could create custom feed ranking scores by running complex machine learning recommendation workloads off-chain and making the model output available in a subgraph. Permaweb or federated social network front ends could then use that output to create personalized experiences for users based on their publicly expressed preferences and behaviors.

This partitioning of data publishers from app/front end operators allows for specialization and a division of labor between front and back end engineers in an open source community, and could be a promising path forward for decentralized social applications and protocols such as Bluesky, Lens, and Mastodon, among others.

Real-time monitoring:

Permaweb sites that incorporate real-time data from an oracle could be used to monitor a wide range of systems and processes, from supply chains to markets to weather conditions. Edge & Node currently uses this workflow to let the community monitor aggregated Indexer and subgraph quality of service for queries on the decentralized network.

It’s exciting to imagine how The Graph can be used to enable access to useful (but currently siloed) public data around the web and increase web3 participation by traditional institutions that publish data used by third parties. Whether your goal is to create dynamic sites powered by lean, decentralized infrastructure, or to incentivize crowdsourced analytics on large data sets, subgraphs can make it possible. If you are interested in using this method for publishing or hosting dynamic off-chain data in your dapps, or in collaborating to produce open source libraries that simplify the process, the data science team at Edge & Node would love to hear from and collaborate with you! Please share your ideas with the author of this blog via The Graph Forum or share your thoughts with the broader Graph community in The Graph Discord.

Acknowledgements: Thanks to Brian Berman, Michael Macaulay, Doug Antin, Zac Burns, Ricky Esclapon, Noelle Becker Moreno, and Aaron Kelly for their input and feedback on this post.

About The Graph

The Graph is the source of data and information for the decentralized internet. As the original decentralized data marketplace that introduced and standardized subgraphs, The Graph has become web3’s method of indexing and accessing blockchain data. Since its launch in 2018, tens of thousands of developers have built subgraphs for dapps across 40+ blockchains, including Ethereum, Arbitrum, Optimism, Base, Polygon, Celo, Fantom, Gnosis, and Avalanche.

As demand for data in web3 continues to grow, The Graph enters a New Era with a more expansive vision including new data services and query languages, ensuring the decentralized protocol can serve any use case – now and into the future.

Discover more about how The Graph is shaping the future of decentralized physical infrastructure networks (DePIN) and stay connected with the community. Follow The Graph on X, LinkedIn, Instagram, Facebook, Reddit, and Medium. Join the community on The Graph’s Telegram, and join technical discussions on The Graph’s Discord.

The Graph Foundation oversees The Graph Network. The Graph Foundation is overseen by the Technical Council. Edge & Node, StreamingFast, Semiotic Labs, The Guild, Messari, GraphOps, Pinax and Geo are eight of the many organizations within The Graph ecosystem.


Category: Graph Protocol
Author: Craig Tutterow
Published: May 18, 2023
