20.11.2023

Post-Mortem Analysis: Bridge Automatic Withdrawal Service Disruption

On November 14, 2023, we experienced an unexpected interruption in our Bridge Automatic Withdrawal Service. This post-mortem explains what happened.

Date of Incident: Tuesday, November 14, 2023

Duration: 24h

Introduction

On the aforementioned date, we experienced an unexpected interruption in our Automatic Withdrawal Service for the StarkGate Bridge. This service simplifies bridging funds between Starknet and Ethereum. It enhances the user experience by eliminating the need for an additional L1 claim transaction once the bridge transfer is finalized on Ethereum mainnet. Although no user funds were at risk and normal service resumed within a day, this report aims to dissect the incident, identify its causes, and outline measures to prevent future occurrences.

Background

The automatic withdrawal service simplifies the user experience by eliminating the need for a claim action on Ethereum Mainnet. It allows users to transfer funds from Starknet to Ethereum seamlessly, in one click, without incurring an additional fee. The amount paid to the SpaceShard Relayer covers the L1 gas fee for the claim. Our system, through an indexer, monitors these transactions to confirm the successful transfer of funds to our address. Upon confirmation of the transaction on L1, our relayer executes the claim and transfers the funds to the user's address on their behalf.
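The flow above can be illustrated with a minimal Python sketch. It is not the production relayer: the `WithdrawalRequest` type, its field names, and the `select_claimable` helper are all hypothetical, and the sketch assumes the indexer has already produced a list of observed requests and a set of withdrawals finalized on L1.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class WithdrawalRequest:
    """One withdrawal seen by the indexer (all fields are hypothetical)."""
    withdrawal_id: str     # illustrative identifier for the withdrawal
    l1_recipient: str      # user's Ethereum address
    amount_wei: int        # amount being bridged
    relayer_fee_wei: int   # amount paid to the relayer to cover L1 gas


def select_claimable(
    requests: Iterable[WithdrawalRequest],
    finalized_ids: Set[str],
    required_fee_wei: int,
) -> List[WithdrawalRequest]:
    """Return requests the relayer could claim on the user's behalf.

    A request qualifies only if the relayer payment was observed in full
    and the withdrawal is already finalized on L1.
    """
    return [
        r for r in requests
        if r.withdrawal_id in finalized_ids
        and r.relayer_fee_wei >= required_fee_wei
    ]
```

If the indexer stops delivering data, the `requests` input stays empty and nothing is claimed automatically, which matches the behavior described in this report: withdrawals are not lost, but they are no longer claimed on the user's behalf.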

Incident Overview

The service interruption was first detected at block 395985, when our system stopped receiving data from the indexer. This failure prevented us from recognizing and processing transactions that include the additional fee for the Automatic Withdrawal Service, so we paused the service as a precaution.

Root Cause Analysis

Our investigation revealed that the core issue came from a recent change implemented by Apibara, our third-party indexer. Apibara introduced an option to exclude transaction receipts in order to manage the extensive data generated by ETH events. This new feature, however, required an update to the configuration file. While the TypeScript SDK of Apibara was updated to reflect this change, the Python SDK, which our system relies on, was not, and we were not informed of the change. As a result, our system stopped receiving data from the indexer.

Resolution Steps

Upon identifying the problem, we promptly reached out to the Apibara team. They swiftly helped us fix the issue, and we restored our service to normal functionality.

Future Prevention Measures

To prevent similar incidents in the future, we have implemented several measures:

  • Enhanced communication: We have improved our communication protocol with Apibara so that we receive prior notification of any updates or changes.
  • Regular system reviews: Our team will conduct frequent reviews of our systems and dependencies, ensuring compatibility with external updates.
  • Alert system improvements: We are upgrading our internal alert systems to detect similar issues more promptly.
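As an illustration of the kind of check the alerting measure refers to, here is a minimal Python sketch of a stall detector for an indexer feed. It is a sketch under stated assumptions, not our actual alerting code: the `IndexerWatchdog` class, its method names, and the 60-second threshold in the usage example are hypothetical, and it does not call any Apibara API.

```python
import time
from typing import Callable, Optional


class IndexerWatchdog:
    """Flags a stall when no new block arrives within a time budget.

    Hypothetical helper: the caller reports each block number it
    receives from the indexer and polls is_stalled() periodically.
    """

    def __init__(
        self,
        max_silence_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_silence_seconds = max_silence_seconds
        self._clock = clock
        self.last_block: Optional[int] = None
        self._last_progress_at = clock()

    def record_block(self, block_number: int) -> None:
        # Only a strictly newer block counts as progress; repeated
        # delivery of the same block does not reset the timer.
        if self.last_block is None or block_number > self.last_block:
            self.last_block = block_number
            self._last_progress_at = self._clock()

    def is_stalled(self) -> bool:
        silence = self._clock() - self._last_progress_at
        return silence > self.max_silence_seconds
```

A usage example: create `watchdog = IndexerWatchdog(max_silence_seconds=60)`, call `watchdog.record_block(n)` for every block the indexer delivers, and raise an alert whenever `watchdog.is_stalled()` returns `True`. The right threshold depends on how often blocks are expected to arrive.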

Conclusion

We are sorry for all the inconvenience caused by this interruption and are committed to ensuring the reliability and efficiency of our services. The measures outlined above are a testament to our dedication to continuous improvement and the security of our users' assets.
