Downtime

Increased number of HTTP 503 errors

Jun 20 at 03:50pm BST

Resolved
Jun 20 at 04:07pm BST

Incident resolved

Created
Jun 20 at 03:50pm BST

Description

A failure and significant delays in our service discovery system (Google Cloud's Cloud DNS) caused our load balancers to mark the primary webserver pool as unhealthy, so the majority of requests failed with HTTP 503 errors.
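For illustration only, here is a minimal sketch of the failure mode, assuming a health checker that resolves each backend hostname through DNS before probing it. The hostnames, port, and timeout below are hypothetical, not our actual configuration.

```python
import socket

# Hypothetical primary pool, registered in the service discovery (DNS) layer.
PRIMARY_POOL = ["web-1.internal.example.com", "web-2.internal.example.com"]

def backend_is_healthy(hostname: str, port: int = 80, timeout: float = 2.0) -> bool:
    """Resolve the backend through DNS, then probe it with a TCP connect."""
    try:
        # When the service discovery (DNS) layer fails or responds very slowly,
        # this resolution step raises or stalls, and the backend gets marked
        # unhealthy even though the webserver itself is still running.
        address = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)[0][4][0]
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:  # covers DNS errors (socket.gaierror) and connect failures
        return False

def pool_status(pool: list[str]) -> int:
    """503 once no backend in the pool passes its health check."""
    return 200 if any(backend_is_healthy(host) for host in pool) else 503
```

When every health check in the primary pool fails at the resolution step, the load balancer has no upstream left to route to, which is why clients saw 503s rather than errors from the webservers themselves.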

Preventive actions taken

We implemented and tested automatic failover to secondary webservers running outside of the primary infrastructure. This failover does not depend on the service discovery system, allowing the secondary pool to take over even during a complete service discovery disruption.
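As a rough sketch of the approach (not our actual configuration), the fallback pool can be addressed by static IPs so that selecting it never requires a service discovery lookup. The addresses and helper names below are hypothetical.

```python
import socket

# Hypothetical pools: the primary pool is resolved through service discovery,
# while the fallback pool is addressed by static IPs, so selecting it never
# requires a DNS lookup.
PRIMARY_POOL = ["web-1.internal.example.com", "web-2.internal.example.com"]
FALLBACK_POOL = ["203.0.113.10", "203.0.113.11"]

def reachable(address: str, port: int = 80, timeout: float = 2.0) -> bool:
    """Plain TCP probe; accepts either a hostname or a literal IP."""
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        return False

def choose_pool() -> list[str]:
    """Prefer the primary pool; fail over to the statically addressed
    secondary pool when no primary backend is reachable, e.g. because
    service discovery can no longer resolve their hostnames."""
    if any(reachable(host) for host in PRIMARY_POOL):
        return PRIMARY_POOL
    return FALLBACK_POOL
```

Because the fallback addresses are static, the failover path has no dependency on the component whose failure triggered it.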

Timeline

  • 2:50pm UTC - Uptime monitoring created an incident, investigation begins
  • 2:51pm UTC - Issue identified as unhealthy primary webservers, root cause investigation begins
  • 2:54pm UTC - Application servers restarted, no signs of recovery
  • 2:59pm UTC - Application servers reverted to previous versions, no signs of recovery
  • 3:03pm UTC - Partial but slow recovery of service discovery. Unfortunately, the issue recurs and the incident continues
  • 3:04pm UTC - Manual override to fallback servers
  • 3:06pm UTC - Incident automatically resolved by uptime monitoring, situation being monitored
  • 3:07pm UTC - Incident confirmed as fully resolved (mitigated), situation being monitored
  • 3:10pm UTC onwards - Follow-up actions
    • Reintroduce the primary webserver pool once service discovery has recovered
    • Implement a long-term fallback solution that does not depend on the service discovery system that failed