DeFi is Eating the World: Outage Post-Mortem
On the morning of June 24, at 12:00am PST, The Graph’s hosted service started experiencing significant service degradation. While users were still able to interact with smart contracts on Ethereum, several front-end applications (most notably in DeFi) were impacted. During this time, a significant number of requests were slow or failed with 500 errors. Service was fully restored at 11:10am PST. Later, at 11:35am PST today, we saw another disruption in service, which lasted until 12:20pm PST. We sincerely apologize to all the teams that were affected and appreciate the support of our community while our engineers worked to fix the issues.
We really care about helping developers build amazing products: products that load fast and are reliable. We fell short of our goal yesterday. We do not want teams to have to trust us, which is why we’re working on decentralized infrastructure. Yesterday’s incident highlights the importance of decentralization at every layer of the stack to ensure that applications keep running no matter what.
Over the past two weeks, The Graph has seen query volume grow from 25M queries/day to 45M queries/day. This is largely due to the recent growth in adoption of DeFi applications, including product launches, liquidity mining programs, and crypto price appreciation.
Despite having 50% headroom in our Google Cloud database, on June 23, our database CPU maxed out at 100%, causing requests to fail. While we always plan for growth, the resources required to support queries were greater than our projections.
In addition to increased usage, we began to receive a handful of especially complex queries that put additional strain on our database. GraphQL is extremely flexible, allowing developers to compose custom queries for their own unique needs, but that flexibility is a double-edged sword: database administrators have to make sure the database can execute all of these queries efficiently.
Our core indexing and query processing software, Graph Node, used to process complex queries as a series of smaller SQL queries. To prevent overly complex queries from being executed against the database, we wrapped the entire query execution in a timeout. A long-running GraphQL query composed of many smaller SQL queries would time out, and we would abort the execution in between SQL query invocations.
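The between-invocations timeout described above can be sketched as follows. This is a minimal illustration in Python, not Graph Node’s actual code: `run_with_deadline`, `QueryTimeout`, and the representation of each SQL query as a plain callable are all assumptions made for the example.

```python
import time
from typing import Callable, List, Sequence


class QueryTimeout(Exception):
    """Raised when a GraphQL query exceeds its time budget (hypothetical name)."""


def run_with_deadline(steps: Sequence[Callable[[], object]], budget_s: float) -> List[object]:
    """Run each SQL step in order, checking the deadline between steps."""
    deadline = time.monotonic() + budget_s
    results: List[object] = []
    for i, step in enumerate(steps):
        # The check happens only between steps: a step that has already
        # started always runs to completion.
        if time.monotonic() >= deadline:
            raise QueryTimeout(f"aborted before step {i} of {len(steps)}")
        results.append(step())
    return results


def slow_step() -> str:
    """Stand-in for one small SQL query taking about 20 ms."""
    time.sleep(0.02)
    return "rows"


if __name__ == "__main__":
    try:
        run_with_deadline([slow_step] * 10, budget_s=0.05)
    except QueryTimeout as exc:
        print(exc)
```

Note that this style of timeout bounds total time only as well as the slowest individual step allows, since nothing interrupts a step that is already running.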
To significantly speed up complex queries, we added a feature called SQL Query Combination (also called Prefetch), which combines nested queries into fewer but more complex SQL queries. A misconfiguration caused the timeout to be dropped, allowing overly complex queries to execute on our production database.
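Once nested queries are combined into a single large statement, a check between invocations no longer has a chance to fire, so the limit has to apply to the statement itself. The sketch below shows one way to do that, using Python’s built-in `sqlite3` module purely as a stand-in database. The helper `execute_with_timeout` and the deliberately expensive `EXPENSIVE_SQL` are hypothetical, and this is not a description of how the hosted service was fixed. Server databases usually provide their own per-statement limit (PostgreSQL, for example, has a `statement_timeout` setting).

```python
import sqlite3
import time

# Deliberately expensive statement standing in for one large combined query.
EXPENSIVE_SQL = """
WITH RECURSIVE counter(n) AS (
    SELECT 1
    UNION ALL
    SELECT n + 1 FROM counter WHERE n < 50000000
)
SELECT count(*) FROM counter
"""


def execute_with_timeout(conn: sqlite3.Connection, sql: str, budget_s: float):
    """Run one statement, interrupting it once the time budget is spent."""
    deadline = time.monotonic() + budget_s

    def over_budget() -> int:
        # A non-zero return value tells SQLite to abort the running statement.
        return 1 if time.monotonic() >= deadline else 0

    # Invoke the handler every 1,000 SQLite virtual machine instructions.
    conn.set_progress_handler(over_budget, 1000)
    try:
        return conn.execute(sql).fetchall()
    finally:
        # Passing None removes the handler so later statements are unaffected.
        conn.set_progress_handler(None, 1000)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    try:
        execute_with_timeout(conn, EXPENSIVE_SQL, budget_s=0.05)
    except sqlite3.OperationalError as exc:
        print("statement aborted:", exc)
```

Unlike a check between invocations, the limit here is enforced while the statement is executing, so it still holds when a whole GraphQL query becomes one SQL statement.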
Most of our engineering team is located in the US, Canada, and South America. Since the incident started at 12:00am PST, the engineers with the best knowledge of the affected areas were unavailable for hours afterwards. This slowed our response and greatly extended the time to resolution.
Need for Trust
Our goal from day one was to build The Graph as a decentralized network. This ensures that developers have a solid foundation to build on without having to trust specific teams. Unfortunately, we’re still months away from being able to launch the decentralized network and we understand that teams are depending on us today.
We take this responsibility seriously and are taking immediate steps to ensure that developers can feel comfortable depending on our hosted service for their production applications.
How We Are Improving
The Graph’s hosted service runs on a giant Kubernetes cluster with monitoring and alerts. We’ve continuously improved our devops tooling over the past 18 months.
Since yesterday, we’ve taken immediate steps to prevent expensive queries from running on the database. In addition, we’ve identified a list of improvements that we’ve already begun to implement and will be rolling out over the next few days, including:
- Introducing more sophisticated query complexity costing
- Optimizing query processing with better caching
- Setting more aggressive triggers and alerts
- Horizontal database scaling with failover infrastructure for quickly responding to unforeseen traffic spikes
- Selective rate-limiting
- Improving our pager duty process
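To illustrate the first item, query complexity costing can be as simple as estimating, before execution, how many entities a query could touch in the worst case and rejecting it above a budget. The sketch below is a hypothetical model, not the costing scheme used by the hosted service: the `Field` class, the `first` page size, the cost formula, and `MAX_COST` are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Field:
    """One selected field; `first` is the page size for list fields (1 for scalars)."""
    name: str
    first: int = 1
    children: List["Field"] = field(default_factory=list)


def estimate_cost(selection: Field) -> int:
    """Worst-case entities touched: page size times (the item itself plus its children)."""
    child_cost = sum(estimate_cost(child) for child in selection.children)
    return selection.first * (1 + child_cost)


MAX_COST = 10_000  # hypothetical budget, not a value from the hosted service


def admit(selection: Field) -> bool:
    """Reject a query before it reaches the database if its estimate is too high."""
    return estimate_cost(selection) <= MAX_COST


if __name__ == "__main__":
    cheap = Field("tokens", first=10, children=[Field("symbol")])
    nested = Field("pairs", first=1000, children=[
        Field("swaps", first=1000, children=[Field("amount")]),
    ])
    print(estimate_cost(cheap), admit(cheap))
    print(estimate_cost(nested), admit(nested))
```

Because nested page sizes multiply, the estimate grows quickly for deeply nested list fields, which is the kind of query a cost limit is meant to catch before it runs.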
We will also be looking to hire an additional engineer in Europe or Asia to give us greater geographic and time zone coverage. We are committed to taking the necessary steps to ensure that teams can rely on the hosted service.
The mission of The Graph is to enable internet applications that are entirely powered by public infrastructure.
Full-stack decentralization will enable applications that are resistant to business failures and rent-seeking and also facilitate an unprecedented level of interoperability. Users and developers will be able to know that software they invest time and money into can’t suddenly disappear.
DeFi and Web3 will allow entrepreneurs to build modern internet-native institutions and applications that scale human coordination with more power in the hands of individuals. The Graph will ensure that the crypto economy has a reliable open data layer for building novel applications on a trustworthy foundation.
Follow The Graph’s journey:
Developer Documentation: https://thegraph.com/docs/
Contact: [email protected]