Downtime

Increased number of HTTP 503 errors

Jun 20 at 03:50pm BST

Resolved
Jun 20 at 04:07pm BST

Incident resolved

Created
Jun 20 at 03:50pm BST

Description

A failure and significant delays in our service discovery system (Google Cloud's Cloud DNS) caused our load balancers to mark the primary webserver pool as unhealthy, so the majority of requests failed with HTTP 503 errors.
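For illustration only, here is a minimal sketch of the failure mode, assuming a health checker that resolves each backend hostname through DNS before probing it. The hostnames, port, and timeout below are hypothetical, not our actual configuration.

```python
import socket

# Hypothetical primary pool, registered in the service discovery (DNS) layer.
PRIMARY_POOL = ["web-1.internal.example.com", "web-2.internal.example.com"]

def backend_is_healthy(hostname: str, port: int = 80, timeout: float = 2.0) -> bool:
    """Resolve the backend through DNS, then probe it with a TCP connect."""
    try:
        # When the service discovery (DNS) layer fails or responds very slowly,
        # this resolution step raises or stalls, and the backend gets marked
        # unhealthy even though the webserver itself is still running.
        address = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)[0][4][0]
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:  # covers DNS errors (socket.gaierror) and connect failures
        return False

def pool_status(pool: list[str]) -> int:
    """503 once no backend in the pool passes its health check."""
    return 200 if any(backend_is_healthy(host) for host in pool) else 503
```

When every health check in the primary pool fails at the resolution step, the load balancer has no upstream left to route to, which is why clients saw 503s rather than errors from the webservers themselves.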

Preventive actions taken

We implemented and tested automatic failover to secondary webservers running outside of the primary infrastructure. This failover does not depend on the service discovery system, allowing the secondary pool to take over even during a complete service discovery disruption.
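As a rough sketch of the approach (not our actual configuration), the fallback pool can be addressed by static IPs so that selecting it never requires a service discovery lookup. The addresses and helper names below are hypothetical.

```python
import socket

# Hypothetical pools: the primary pool is resolved through service discovery,
# while the fallback pool is addressed by static IPs, so selecting it never
# requires a DNS lookup.
PRIMARY_POOL = ["web-1.internal.example.com", "web-2.internal.example.com"]
FALLBACK_POOL = ["203.0.113.10", "203.0.113.11"]

def reachable(address: str, port: int = 80, timeout: float = 2.0) -> bool:
    """Plain TCP probe; accepts either a hostname or a literal IP."""
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        return False

def choose_pool() -> list[str]:
    """Prefer the primary pool; fail over to the statically addressed
    secondary pool when no primary backend is reachable, e.g. because
    service discovery can no longer resolve their hostnames."""
    if any(reachable(host) for host in PRIMARY_POOL):
        return PRIMARY_POOL
    return FALLBACK_POOL
```

Because the fallback addresses are static, the failover path has no dependency on the component whose failure triggered it.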

Timeline

  • 2:50pm UTC - Uptime monitoring created an incident, investigation begins
  • 2:51pm UTC - Issue identified as unhealthy primary webservers, root cause investigation begins
  • 2:54pm UTC - Application servers restarted, no signs of recovery
  • 2:59pm UTC - Application servers reverted to previous versions, no signs of recovery
  • 3:03pm UTC - Partial but slow recovery of service discovery. Unfortunately, the issue recurs and the incident continues
  • 3:04pm UTC - Manual override to fallback servers
  • 3:06pm UTC - Incident automatically resolved by uptime monitoring, situation being monitored
  • 3:07pm UTC - Incident confirmed as fully resolved (mitigated), situation being monitored
  • 3:10pm UTC onwards - Follow-up actions
    • Reintroduce the primary webserver pool once service discovery has recovered
    • Implement a long-term fallback solution that does not depend on the service discovery system that failed