Resolving Desyncs in Production – A Post-Mortem, Lessons & Remediations
Around 20:25 UTC on April 17th, Vertex experienced a desync, causing on-chain withdrawals to be delayed for multiple hours while the root cause was being investigated and a fix implemented.
The immediate cause of the desync was that the total USDC balance was off by $0.000000000000000013 (yes, that many zeros!), resulting in circuit breakers getting triggered on the DEX.
The team spent the subsequent 20 hours determining the root cause and working to fully restore service.
Vertex’s Architecture – Technical Performance & Nuances
To understand what’s known as a “desync” and why such a small difference can have such an outsized impact, we need a brief explainer on Vertex’s architecture.
Vertex is different from other orderbook exchanges in that the protocol’s source of truth is actually decentralized: every trade ultimately settles on Ethereum.
The hard part is being both decentralized and fast.
To accomplish this, Vertex runs a “fast-forward” version of its Solidity smart contracts inside the matching engine, which allows Vertex to simulate transactions ahead of settlement, so users know the results of their trades well before they clear on-chain.
Implementing this feature requires various technical nuances that necessarily deviate from what actually runs on-chain.
From a performance perspective, this works well. Vertex’s fastest automated traders see latencies as low as 800 microseconds, which is a bit less than the amount of time it takes for light to travel from Tokyo to Nagoya.
High-speed traders implicitly trust that the optimistic state from the Vertex matching engine matches what the on-chain state will be when trades settle a few minutes later. Upon settlement, they can independently verify the optimistic state from the matching engine is equivalent to the state on-chain.
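As a rough illustration of that trust-but-verify flow, here is a minimal sketch (not Vertex’s actual client code – the account, amounts, and helper names are hypothetical): a trader tracks the optimistic balances reported by the engine and compares them against on-chain state once the batch settles.

```python
from dataclasses import dataclass, field

@dataclass
class OptimisticLedger:
    """In-memory view of balances, built from the matching engine's responses."""
    balances: dict = field(default_factory=dict)

    def apply_engine_fill(self, account: str, usdc_delta: int) -> None:
        # Apply the balance change the engine reports immediately after a fill.
        self.balances[account] = self.balances.get(account, 0) + usdc_delta

    def verify_against_chain(self, account: str, onchain_balance: int) -> bool:
        # After settlement, check that the optimistic view matches the chain.
        return self.balances.get(account, 0) == onchain_balance

# Apply fills as the engine confirms them, then verify a few minutes later
# once the batch has cleared on-chain (balance fetched via any RPC provider).
ledger = OptimisticLedger()
ledger.apply_engine_fill("0xtrader", 1_000_000)      # +1 USDC in 6-decimal units
assert ledger.verify_against_chain("0xtrader", 1_000_000)
```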
Desyncs Happen
Desyncs can happen, and the team built tooling like checksums and global sync to detect a desync quickly and mitigate its effects.
Specific to the issue on April 17th, some buggy logic was triggered that caused the matching engine state to differ from what the on-chain state would be when trades cleared, resulting in a desync over a minuscule USDC balance difference of $0.000000000000000013.
While a difference that small may not feel like much, the results can quickly magnify throughout the system. For example, if a user tries to withdraw the maximum amount of USDC, it may be the difference between a healthy and an unhealthy account. The difference in total deposits also results in a different utilization ratio and a different interest rate, causing the balances of all other users to become marginally off as well.
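To make that propagation concrete, here is a toy model (for illustration only – the rate curve and numbers are our assumptions, not Vertex’s actual interest math) showing how a 1.3e-17 difference in total deposits shifts the utilization ratio and therefore the interest rate every account accrues:

```python
from decimal import Decimal, getcontext

getcontext().prec = 50  # enough precision to see the dust-sized shift

def utilization(total_borrows: Decimal, total_deposits: Decimal) -> Decimal:
    return total_borrows / total_deposits

def borrow_rate(util: Decimal) -> Decimal:
    # Toy linear rate model: 2% base rate plus 18% * utilization.
    return Decimal("0.02") + Decimal("0.18") * util

borrows  = Decimal(40_000_000)
deposits = Decimal(100_000_000)
dust     = Decimal("0.000000000000000013")  # the 1.3e-17 USDC discrepancy

rate_a = borrow_rate(utilization(borrows, deposits))
rate_b = borrow_rate(utilization(borrows, deposits + dust))

# The rates differ by a tiny but non-zero amount, so every interest-bearing
# balance in the system starts drifting apart between the two states.
print(rate_a - rate_b)
```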
Because Vertex computes a checksum every 10 minutes to assert that the engine and on-chain states are the same, alerts were triggered by 20:29 UTC, and the relevant engineers were woken up and debugging the issue by 20:38 UTC.
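The checksum itself can be as simple as hashing a canonical serialization of the state. Here is a minimal sketch under that assumption (not the production implementation, and the account and amounts are made up):

```python
import hashlib
import json

def state_checksum(balances: dict) -> str:
    # Canonicalize (sorted keys, no whitespace) so identical states always
    # serialize to identical bytes, then hash those bytes.
    canonical = json.dumps(balances, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

engine_state  = {"0xtrader": 1_000_000_000_000_000_000}
onchain_state = {"0xtrader": 1_000_000_000_000_000_013}  # off by 13 units at 18 decimals

# Any single-unit difference produces a completely different digest,
# which is what fires the alert.
assert state_checksum(engine_state) != state_checksum(onchain_state)
```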
Recovery in Production – Solving the Issue
Upon alerts being triggered, engineers were quickly split into two groups:
- One group to figure out a strategy to reconcile the on-chain and matching engine state.
- Another group to figure out the root cause to ensure a desync doesn’t happen again.
The first group initially only knew that the total USDC deposit was off by a small number, and set about figuring out what was responsible for that difference by inspecting the state of the whole system. We used several methods to do this, including:
- Engine Save: This naively serializes the state of the matching engine and stores it in S3.
- Global Sync: This takes the state of a set of smart contracts on some EVM-compatible blockchain and loads the raw key-value store behind it. The tricky thing about Geth is that while it allows querying a value given a certain key, it doesn’t expose listing out the keys. As a result, this process involves syncing a Geth node, reverse-engineering its storage structure, and walking the Merkle Patricia Trie for the chain data directly.
By 23:48 UTC, the outputs from the two methods were available, and we could start comparing the matching engine state against what was on-chain. Global sync alone took over two hours to execute, because the Arbitrum mainnet has a very large state and our method was not optimized to reduce the number of disk seeks.
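With both snapshots flattened to key/value maps, the comparison itself is straightforward. The sketch below assumes that structure and is not the actual tooling:

```python
def diff_states(engine: dict, onchain: dict) -> dict:
    """Return every storage key whose value differs between the two snapshots."""
    mismatches = {}
    for key in engine.keys() | onchain.keys():
        if engine.get(key) != onchain.get(key):
            mismatches[key] = (engine.get(key), onchain.get(key))
    return mismatches
```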
An additional complication is that Geth does not store the raw slot number as the key in its key-value store. Instead, it stores the Keccak-256 hash of the slot number as the key. So going from “some key in the key-value store is off” to “an interpretable value, like this user’s balance, is off” requires an additional search.
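A sketch of that search, assuming the standard Solidity storage-slot rules and the eth_utils library (the contract layout, base slot, and addresses below are hypothetical): precompute the hashed slot for every candidate quantity we can name, then look the mismatching trie key up in the resulting reverse index.

```python
from eth_utils import keccak

def mapping_slot(key: bytes, base_slot: int) -> bytes:
    # Solidity stores mapping[key] at keccak256(pad32(key) ++ pad32(base_slot)).
    return keccak(key.rjust(32, b"\x00") + base_slot.to_bytes(32, "big"))

def trie_key(slot: bytes) -> bytes:
    # Geth's storage trie is keyed by keccak256(slot), not by the slot itself.
    return keccak(slot)

# Hypothetical layout: a balances mapping declared at slot 3, keyed by address.
BALANCES_BASE_SLOT = 3
known_accounts = ["0x" + "11" * 20, "0x" + "22" * 20]

reverse_index = {
    trie_key(mapping_slot(bytes.fromhex(addr[2:]), BALANCES_BASE_SLOT)): addr
    for addr in known_accounts
}

def interpret(mismatching_trie_key: bytes) -> str:
    # Map an opaque "this key is off" finding back to a human-readable account.
    return reverse_index.get(mismatching_trie_key, "unknown storage location")
```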
Around 00:38 UTC on April 18th, we identified the specific account that had a slightly incorrect balance and came up with a plan to manually deposit a small amount of USDC to fix the discrepancy.
This action required a contract migration, which we were able to implement and execute by 02:51 UTC. Withdrawals were restored, but we still did not have a fix for the root cause, and the desync occurred again at around 04:30 UTC.
In parallel, we were able to start reproducing the issue locally at around 01:59 UTC, allowing us to make significant progress in diagnosing it. Iteration times were slow because the reproduction involved replaying historical transactions on a checkpointed Vertex mainnet state and running a checksum after each transaction – and with mainnet processing tens of thousands of transactions per second, that’s a lot of expensive checksums to run.
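Conceptually, the replay harness looks something like the sketch below (an assumed shape, not the actual tooling): apply each historical transaction to the checkpointed state and checksum after every step, stopping at the first point where the recomputed state diverges from the expected one.

```python
def find_first_divergence(checkpoint, transactions, expected_checksums,
                          apply_tx, checksum):
    """Replay transactions from a checkpoint; return the index and transaction
    after which the recomputed checksum first disagrees with the expected one."""
    state = checkpoint
    for i, tx in enumerate(transactions):
        state = apply_tx(state, tx)          # re-run the engine's state transition
        if checksum(state) != expected_checksums[i]:
            return i, tx                     # first transaction that desyncs
    return None
```

Checksumming the full state after every transaction is exactly the expensive part described above.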
By 09:20 UTC, we had an explanation for the root cause, and by 10:42 UTC, a fix was implemented and began undergoing testing. The issue discovered as the root cause was subtle and would only arise in rare circumstances.
More specifically, when a reduce-only taker order for a spot position matches with more than one maker order, and the amount of the spot traded and the cumulative interest index of the spot token are just the right values, there can be a rounding error where the traded amount in the matching engine is different from the amount on-chain by around 1e-17.
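The sketch below is our illustrative model of the flavor of this bug, not Vertex’s matching code: when per-fill amounts are converted through an interest index and rounded down individually, the sum can come out a base unit or more away from converting the aggregate amount once – exactly the kind of ~1e-17 dust described above, appearing only for “just the right values”.

```python
import random

ONE = 10**18  # 18-decimal fixed point

def to_normalized(raw_amount: int, index: int) -> int:
    # Convert a raw amount into interest-index-normalized units, rounding down.
    return raw_amount * ONE // index

def find_divergence(trials: int = 1_000_000):
    rng = random.Random(0)
    for _ in range(trials):
        index  = ONE + rng.randrange(1, 1_000)   # index slightly above 1.0
        fill_a = rng.randrange(1, 1_000 * ONE)   # taker matched against
        fill_b = rng.randrange(1, 1_000 * ONE)   # two separate maker orders
        per_fill  = to_normalized(fill_a, index) + to_normalized(fill_b, index)
        aggregate = to_normalized(fill_a + fill_b, index)
        if per_fill != aggregate:
            return index, fill_a, fill_b, aggregate - per_fill
    return None

# Prints the first combination where the two rounding orders disagree.
print(find_divergence())
```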
Once identified, the remaining steps were to test and deploy the fix, before repeating the earlier process of fixing the state discrepancy.
At 11:14 UTC, the fix was deployed to production. Unfortunately, the process of fixing the state discrepancy was much more complicated this time around, because the discrepancy had impacted the interest rate and, as a result, had spread to additional accounts.
At 12:14 UTC, we were able to determine the storage keys that were off and repeated the earlier process of interpreting them into things like balances and states in our contracts. By 14:00 UTC, we had interpreted all of the incorrect keys and constructed the diffs we would need to apply to the contract state.
By 15:00 UTC, we had started correcting the affected keys using manual deposits. Finally, after reviewing everything that happened and doing some final checks, on-chain withdrawals resumed at 15:44 UTC, with the backlogged withdrawal queue fully clearing by 16:33 UTC.
Lessons & Remediation
Overall, we were prepared for an issue like this eventually arising. Our tooling kept trading running with practically no downtime while the issue was being investigated and fixed – only on-chain withdrawals were delayed.
Despite the available tooling like checksums and global sync to mitigate a potential desync, the resolution process took longer than expected. For example, global sync taking over two hours and the slow replay of production data are both areas to improve on moving forward.
For remediation, we’re working on optimizing our internal tooling and gearing more of our test suite towards replaying trading that has actually happened in production – recovering production state with the tooling we’ve built – in addition to our existing integration and unit tests.
As of right now, global sync is down to a more acceptable 30 minutes.
Finally, this issue demonstrated the need to actively communicate errors that impact users to the community in real time. The engineering team is very small and was heads-down throughout the process, so many community members were left in the dark as some of their withdrawals displayed as pending on the Vertex app for multiple hours.
In the future, the engineering team will be more proactive in sending updates about issues that directly impact users while they’re actively being investigated, which we think will go a long way. It will also enable community moderators to be more responsive during such incidents.
We hope this post-mortem is a step in the right direction.