Official Post-Mortem: Database Instability (June 19 to 21)

Between June 19 and June 21, Ventrata experienced four brief periods of service disruption, about 22 minutes in total, caused by automated restarts of our primary database instance. The downtime windows, in UTC, were June 19 from 14:00 to 14:10 and from 19:23 to 19:27, June 20 from 11:21 to 11:23, and June 21 from 17:51 to 17:57. All systems recovered automatically within minutes of each incident. The underlying database migration has since been completed successfully, and permanent architectural safeguards have been deployed to prevent this failure mode from recurring.

The instability originated with a planned schema migration and database backfill, started on June 19 across 14 of our largest tables, to introduce a new supplier_id column. Ventrata uses a Google Cloud SQL High Availability topology consisting of a primary instance that handles all write operations and a pool of three read replicas that handles 99 percent of all read traffic. A critical consumer of this pool is our webhook engine, which processes over 100 requests per second to power client integrations, guest communications, and dynamic pricing tools.

To guarantee data accuracy, the webhook engine included a legacy fallback mechanism. If replication lag meant that the read replicas had not yet caught up to the moment a webhook was enqueued, the engine would bypass the replicas and query the primary write instance directly, so that it always read the most up-to-date data. When the heavy backfill operation caused replication lag to increase, as a backfill of this size naturally does, this fallback logic unexpectedly triggered at scale. The primary instance was suddenly flooded with thousands of read queries per second, which exhausted its resources and triggered the automated database restarts.
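
For illustration, the fallback rule described above can be sketched as follows. This is a minimal sketch and not our production code: the names WebhookJob, choose_read_target_legacy, and replica_replayed_up_to are hypothetical, and the sketch assumes the rule compared each webhook's enqueue time with the commit time of the newest change the replica had applied.

```python
from dataclasses import dataclass


@dataclass
class WebhookJob:
    job_id: int
    enqueued_at: float  # Unix time (seconds) when the job entered the queue


def choose_read_target_legacy(job: WebhookJob, replica_replayed_up_to: float) -> str:
    """Legacy rule (sketch): read from the primary if the replica is behind the job.

    replica_replayed_up_to is a hypothetical value: the commit time of the
    newest change the replica has applied.
    """
    if replica_replayed_up_to < job.enqueued_at:
        return "primary"  # replica may not yet hold the data that triggered the job
    return "replica"


def count_primary_reads(jobs: list[WebhookJob], replica_replayed_up_to: float) -> int:
    """Count how many jobs the legacy rule would send to the primary."""
    return sum(
        1
        for job in jobs
        if choose_read_target_legacy(job, replica_replayed_up_to) == "primary"
    )


# One second of traffic at 100 jobs per second, starting at t = 1000.0.
jobs = [WebhookJob(job_id=i, enqueued_at=1000.0 + i / 100) for i in range(100)]

healthy = count_primary_reads(jobs, replica_replayed_up_to=1001.0)  # replica caught up
lagging = count_primary_reads(jobs, replica_replayed_up_to=970.0)   # replica 30 s behind
print(healthy, lagging)
```

With a caught-up replica none of the jobs touch the primary. Once the replica falls behind, every job in the window is rerouted at once, which is the pattern that overloaded the primary instance.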

Isolating this root cause was difficult because initial telemetry pointed strongly to the data backfill operation itself as the culprit. In response, our engineering team aggressively scaled up our database infrastructure, increasing IOPS and storage capacity orders of magnitude beyond what the backfill required. When these upgrades did not alleviate the spikes, replication lag was the only remaining anomaly, and investigating it led us to the webhook fallback logic. That logic has since been completely refactored. Moving forward, if replication lag exceeds the safety threshold, webhook jobs will remain in the processing queue until the read replicas catch up rather than failing over to the primary instance.
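
The refactored behaviour can be sketched in the same way. Again this is an illustrative sketch under stated assumptions, not the deployed implementation: drain_queue and MAX_SAFE_LAG_SECONDS are hypothetical names, and the 5-second threshold is a placeholder rather than the value we actually use.

```python
from collections import deque
from dataclasses import dataclass

MAX_SAFE_LAG_SECONDS = 5.0  # hypothetical threshold, not a value from the incident


@dataclass
class WebhookJob:
    job_id: int


def drain_queue(queue: deque, replica_lag_seconds: float) -> list:
    """Return the jobs that may be processed now against a read replica.

    If lag is above the threshold, nothing is dequeued and nothing is sent
    to the primary; the jobs simply wait for a later call.
    """
    if replica_lag_seconds > MAX_SAFE_LAG_SECONDS:
        return []
    ready = list(queue)
    queue.clear()
    return ready


queue = deque(WebhookJob(job_id=i) for i in range(3))

held = drain_queue(queue, replica_lag_seconds=30.0)      # backfill in progress
waiting = len(queue)                                     # jobs still queued
released = drain_queue(queue, replica_lag_seconds=0.2)   # replicas caught up
print(len(held), waiting, len(released), len(queue))
```

The trade-off in this design is that webhooks are delivered later during a lag spike, in exchange for the primary instance never absorbing read traffic that belongs on the replicas.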

This incident was an isolated, self-inflicted issue caused by an architectural edge case during a major structural migration, rather than an indication of platform capacity limitations or organic scaling issues. During normal operations, the Ventrata platform remains exceptionally stable and fully resourced. With the backfill complete and the webhook engine permanently hardened against replication anomalies, we have total confidence in our platform's performance heading into the peak summer season. We sincerely apologize for the disruption this caused and appreciate your patience as we continue to build a more robust platform.
