A Polkadot Postmortem - 24.05.2021


By Bastian Köcher, May 27, 2021

TL;DR: On 24 May 2021, Polkadot nodes failed with an out of memory (OOM) error on block 5,202,216. This block contained an on-chain solution to the validator election, which is normally computed off-chain and only takes place on-chain if no off-chain solution is submitted. Due to the large number of nominators, the election overflowed the memory allocated in the Wasm environment.

While an update was being prepared to fix the issue, validators were asked to temporarily downgrade their node software to a previous version that includes a native (non-Wasm) version of the runtime. The native version is not constrained by the Wasm memory allocator. The network recovered after an hour and ten minutes of downtime.

Later, on block 5,203,204, several nodes failed with a “storage root mismatch” error. Investigation showed that this was due to a difference between the compiler version that built the native runtime and the one that built the on-chain Wasm runtime. The solution was to implement a feature that allows overriding the on-chain Wasm runtime with a Wasm runtime built with the correct compiler version.

The issue has since been resolved and precautions have been implemented to prevent this from happening again in the future.

The bad

On 24.05.2021, Polkadot nodes failed with an out of memory (OOM) error while trying to build block 5,202,216. The nodes themselves did not crash, but the runtime (i.e. the blockchain’s state transition function) did. Polkadot’s runtime is compiled to WebAssembly (Wasm) and is executed either by a Wasm interpreter or a Wasm compiler. However, the runtime execution environment always provides a fixed amount of memory (64MB at that time), and this wasn’t enough for this block.

This block was the last block of the penultimate session in the era, meaning that a new validator set needed to be elected for the new era that would start after the next session. The election of the validator set can be done off-chain or on-chain, but off-chain is preferred because the election algorithm is computationally heavy. However, for this session no validator submitted a solution (presumably because they ran into the same OOM while doing the election off-chain), so the election had to be done on-chain. The result was the OOM that all validators hit while trying to author this block. The fix for the OOM itself was simple: increase the default memory size of the Wasm runtime to 128MB (https://github.com/paritytech/substrate/pull/8892).
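To make the limit concrete: Wasm linear memory is measured in pages of 64 KiB, so a 64MB ceiling corresponds to 1,024 pages and a 128MB ceiling to 2,048. The following is a toy model only, not Substrate's allocator. `HeapBudget`, `OutOfMemory`, and the 100 MiB workload are hypothetical names and numbers chosen for illustration, and "MB" is assumed to mean MiB.

```rust
/// Size of one WebAssembly linear-memory page, as fixed by the Wasm spec.
const WASM_PAGE_SIZE: u64 = 64 * 1024;
const MIB: u64 = 1024 * 1024;

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
struct OutOfMemory;

/// Hypothetical stand-in for a runtime heap with a hard upper bound.
struct HeapBudget {
    limit: u64,
    used: u64,
}

impl HeapBudget {
    fn with_limit_mib(mib: u64) -> Self {
        HeapBudget { limit: mib * MIB, used: 0 }
    }

    /// Number of Wasm pages the limit corresponds to.
    fn pages(&self) -> u64 {
        self.limit / WASM_PAGE_SIZE
    }

    /// Reserve `bytes`, failing instead of growing past the limit.
    fn allocate(&mut self, bytes: u64) -> Result<(), OutOfMemory> {
        let new_used = self.used.saturating_add(bytes);
        if new_used > self.limit {
            return Err(OutOfMemory);
        }
        self.used = new_used;
        Ok(())
    }
}

fn main() {
    // Made-up workload size; the real election's memory use is not stated here.
    let workload = 100 * MIB;

    for limit in [64, 128] {
        let mut heap = HeapBudget::with_limit_mib(limit);
        let outcome = heap.allocate(workload);
        println!("{} MiB = {} pages -> {:?}", limit, heap.pages(), outcome);
    }
}
```

The native runtime sidesteps this kind of check entirely, which is why it could get past the block without the fix.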

To bring this change to all validators, a new release would need to be cut, and a large number of validators would need to update. However, there was a much easier and, most importantly, faster short-term solution. Polkadot’s runtime is compiled not only to Wasm but also to native code for better performance, and the native runtime does not put any bounds on memory usage during execution. But the native runtime only matches the on-chain runtime when the running node is from the same release as the on-chain runtime. The on-chain runtime at this point matched the v0.8.30 release, which was released on 08.04.2021. Since then, there had already been three new releases, meaning most validators were already running the latest node release (v0.9.x).

So, in an effort to overcome the problematic block as fast as possible, all validator operators were asked to downgrade their validators to v0.8.30 and to run them with the `--execution native` flag to force running with the native runtime. Overall, it took about 1 hour and 10 minutes to go from detecting the issue, through coming up with a short-term solution and announcing it to validators, to having new blocks built and the network fully recovered.

After the network was back, we started preparing the 0.9.3 release to distribute the increased Wasm memory limit so we could support using the Wasm runtime again. As part of this, we took a node and checked that syncing the problematic block now worked with Wasm under the increased memory ceiling. The problematic block did indeed import, but we encountered a storage root mismatch while trying to import block 5,203,204.

The ugly

A storage root mismatch means that importing a block doesn’t lead to the same storage root advertised by the block author. In general, in a blockchain the same input should always lead to the same output. However, in this case the network was still running and building blocks, which could only mean that there was a non-determinism between the native and the Wasm runtime, because we had instructed all validators to run with the native runtime.
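The check behind this error can be sketched as follows. This is a simplified model under stated assumptions: `State`, `Header`, `storage_root`, and `import_block` are hypothetical names, and the standard library's `DefaultHasher` stands in for the Merkle trie root that a real node computes. The point is only that an importer re-executes the block and compares its own root with the one in the header.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

/// Toy state: storage key -> stored list of values.
type State = BTreeMap<String, Vec<u32>>;

/// Stand-in for the real storage root (a Merkle trie root in practice).
fn storage_root(state: &State) -> u64 {
    let mut hasher = DefaultHasher::new();
    state.hash(&mut hasher);
    hasher.finish()
}

/// Toy header carrying the root advertised by the block author.
struct Header {
    number: u32,
    state_root: u64,
}

/// Re-execute the block with the local runtime and compare roots.
fn import_block(header: &Header, execute: impl Fn() -> State) -> Result<(), String> {
    let computed = storage_root(&execute());
    if computed == header.state_root {
        Ok(())
    } else {
        Err(format!("storage root mismatch at block {}", header.number))
    }
}

/// Two hypothetical runtimes that store the same items in a different order.
fn runtime_a() -> State {
    BTreeMap::from([("validators".to_string(), vec![1, 2, 3])])
}

fn runtime_b() -> State {
    BTreeMap::from([("validators".to_string(), vec![1, 3, 2])])
}

fn main() {
    // The author built the block with runtime A and advertised its root.
    let header = Header { number: 1, state_root: storage_root(&runtime_a()) };

    println!("import with runtime A: {:?}", import_block(&header, runtime_a));
    println!("import with runtime B: {:?}", import_block(&header, runtime_b));
}
```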

So we started to investigate the mismatch between the native and Wasm runtimes. We tried to sync the chain locally first with the same release and the native runtime. However, this also led to the same storage root mismatch. This was even more alarming, because the same code compiled for the same architecture should always produce the same results. When we compile the Wasm runtime we do this using the so-called `no-std` environment, which involves using different code paths. So, it is “easier” to introduce some mismatch, but compiling the native runtime twice should result in code that is doing the same thing both times.

This brought us to the assumption that the Rust compiler may have been generating faulty code that resulted in the mismatch we had seen. Due to some extreme luck (otherwise our endeavour would probably have taken a bit longer), someone at Parity still had a binary of this release lying around that wasn’t the same as the one attached to the release on GitHub. This binary was able to sync the chain with the native runtime without any problems. The only difference between this binary and the one we built before was the Rust compiler version that had been used. So we thought maybe something had changed between the latest compiler version and the version that we used to build the node back then. And yes, after downgrading the Rust compiler and re-building the release branch, the node now managed to sync successfully.

The good

After verifying that the native runtime compiled with the old Rust compiler could sync the chain, we also tried compiling the Wasm runtime with this Rust compiler. There is a special flag for the Polkadot node that allows us to override the on-chain Wasm runtime with a local version, and we used this to verify that syncing worked. So the question became, why did we have this mismatch between the native and Wasm runtimes of the 0.8.30 release? You need to know that we use the Rust nightly compiler to compile the Wasm runtime (the nightly is required because not everything we use in the Wasm build is yet in the stable Rust compiler). The compiler versions used for the node and the Wasm runtime are part of the release announcement.

So something must have changed between the 1.51.0 stable Rust compiler (released on 23.03.2021 and used to build the native runtime) and the Rust nightly compiler from 07.04.2021 that was used to build the Wasm runtime. Because a stable release branches off from the nightly channel several weeks before its release date, the range to search starts earlier than 23.03.2021. After some time bisecting the Rust toolchains in that range, we found the nightly from 05.03.2021 to be the first one that broke our determinism. So we only needed to check the commits that got merged between 04.03.2021 and 05.03.2021 and found the problematic commit.
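The bisection itself is an ordinary binary search for the first failing toolchain. The sketch below is illustrative and assumes that every nightly after the first bad one is also bad. `first_bad` is a hypothetical helper, and the closure is a stand-in for the expensive real test (build the node with that toolchain, then try to sync the chain).

```rust
/// Return the index of the first candidate for which `is_bad` is true,
/// assuming all good candidates come before all bad ones.
fn first_bad<T>(candidates: &[T], is_bad: impl Fn(&T) -> bool) -> Option<usize> {
    let (mut lo, mut hi) = (0, candidates.len());
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        if is_bad(&candidates[mid]) {
            // The first bad candidate is at `mid` or earlier.
            hi = mid;
        } else {
            // Everything up to and including `mid` is good.
            lo = mid + 1;
        }
    }
    if lo < candidates.len() { Some(lo) } else { None }
}

fn main() {
    // A short, illustrative window of nightly toolchains in date order.
    let nightlies = [
        "nightly-2021-03-01",
        "nightly-2021-03-02",
        "nightly-2021-03-03",
        "nightly-2021-03-04",
        "nightly-2021-03-05",
        "nightly-2021-03-06",
        "nightly-2021-03-07",
        "nightly-2021-03-08",
    ];

    // Stand-in predicate: in reality this means "build with this toolchain
    // and check whether syncing hits the storage root mismatch".
    let sync_fails = |toolchain: &&str| *toolchain >= "nightly-2021-03-05";

    match first_bad(&nightlies, sync_fails) {
        Some(i) => println!("first bad toolchain: {}", nightlies[i]),
        None => println!("no bad toolchain in range"),
    }
}
```

Each probe halves the remaining range, which matters when a single probe means a full rebuild and a partial chain sync.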

Compiling the Rust compiler without this commit and using the self-built compiler to compile our node showed that the native runtime produced the correct data and we could sync the chain. The commit changed the `binary_search_by` function in a way that it could return a different index when there are multiple matches. As we use this function in the runtime, it can lead to a slightly different ordering of the data that is stored in the state, which leads to a different storage root.
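A small example of why this matters: for a sorted slice with duplicates, the standard library only promises that `binary_search_by` returns the index of some matching element, not which one. Code that needs a result that is stable across compiler versions can pin the choice down itself, for instance with `partition_point`. This is one possible approach shown for illustration, not necessarily what the runtime does, and `first_match` is a hypothetical helper.

```rust
/// Index of the first element equal to `target` in a sorted slice, if any.
/// Unlike `binary_search_by`, the result is fully specified for duplicates.
fn first_match(sorted: &[u32], target: u32) -> Option<usize> {
    // `partition_point` returns the index of the first element for which
    // the predicate is false, i.e. the first element >= target.
    let idx = sorted.partition_point(|&x| x < target);
    if sorted.get(idx) == Some(&target) {
        Some(idx)
    } else {
        None
    }
}

fn main() {
    let stakes = [10, 20, 20, 20, 30];

    // Any of the indices 1, 2 or 3 is a valid answer here; which one is
    // returned is an implementation detail of the standard library.
    match stakes.binary_search_by(|probe| probe.cmp(&20)) {
        Ok(i) => println!("binary_search_by matched index {}", i),
        Err(i) => println!("not found, insertion point {}", i),
    }

    // The leftmost match does not depend on that implementation detail.
    println!("first match: {:?}", first_match(&stakes, 20));
}
```

If the returned index is used as an insertion position, two compilers that pick different matches will store equal elements in a different order, which is enough to change the storage root.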

So this meant that we now had blocks built by the native runtime that could not be synced with the Wasm runtime, and we could not change the on-chain Wasm runtime to fix this, because you cannot rewrite the history of the blockchain without forking. We came up with a pull request that introduces `code_substitute` to the chain specification. The chain specification is mainly used to store the genesis and some other information about the chain. This new field `code_substitute` is a map that uses a block hash as key and maps to a Wasm runtime code blob. It instructs the node to overwrite the on-chain Wasm runtime with the given one from every block after the one specified in the chain specification until the spec version of the runtime doesn’t match anymore.
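The selection rule described above can be sketched as follows. This is not the node's actual implementation: `RuntimeCode`, `select_runtime`, the separate hash-to-number lookup, and all values are hypothetical, chosen only to show the two conditions (the block comes after the one named in the chain specification, and the spec version still matches).

```rust
use std::collections::HashMap;

/// Hypothetical pairing of a Wasm blob with the spec version it reports.
struct RuntimeCode {
    spec_version: u32,
    wasm: Vec<u8>,
}

/// Pick the runtime to execute a block with.
///
/// `substitutes` models the `code_substitute` map (block hash -> runtime),
/// and `block_numbers` models the node resolving a block hash to its number.
fn select_runtime<'a>(
    substitutes: &'a HashMap<String, RuntimeCode>,
    block_numbers: &HashMap<String, u32>,
    block_number: u32,
    on_chain: &'a RuntimeCode,
) -> &'a RuntimeCode {
    for (hash, substitute) in substitutes {
        if let Some(&start) = block_numbers.get(hash) {
            // Applies to every block after the specified one, but only
            // while the on-chain spec version still matches.
            if block_number > start && substitute.spec_version == on_chain.spec_version {
                return substitute;
            }
        }
    }
    on_chain
}

fn main() {
    // All values below are made up for illustration.
    let mut substitutes = HashMap::new();
    substitutes.insert(
        "0xaaaa".to_string(),
        RuntimeCode { spec_version: 30, wasm: vec![0xFF] },
    );
    let mut block_numbers = HashMap::new();
    block_numbers.insert("0xaaaa".to_string(), 100);

    let on_chain = RuntimeCode { spec_version: 30, wasm: vec![0x00] };
    let upgraded = RuntimeCode { spec_version: 31, wasm: vec![0x01] };

    for (number, code) in [(100, &on_chain), (101, &on_chain), (500, &upgraded)] {
        let chosen = select_runtime(&substitutes, &block_numbers, number, code);
        println!("block {} runs blob {:?}", number, chosen.wasm);
    }
}
```

Tying the substitute to the spec version means it stops applying automatically once a runtime upgrade lands on-chain.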

We also created a pull request that uses the `code_substitute` with the correct values to enable the nodes to sync again using Wasm. Anyone can rebuild the runtime using `srtool` to make sure that what’s being built is the code from v0.8.30 and that they get the same Wasm blob.

With the 0.9.3 release the node contains all the required fixes to make the chain work as expected.

In future we will improve the current situation even more:
