Root Cause Analysis: Elevat...

What happened

On 30 June 2026, between 5:26 PM and 5:34 PM UTC, the platform experienced
elevated latency and a number of failed requests over an approximately
eight-minute window. The issue began during planned database maintenance and
was fully resolved by 5:34 PM UTC. No customer data was lost or corrupted.

What caused it

The incident was triggered by planned maintenance on our primary database,
which briefly required locking parts of the database. At the same time, a
high-frequency background write — a small update our systems performed on
every device request to record activity — was running continuously against
the same data. When the maintenance locks met this constant stream of small
writes, requests started waiting on one another. As new requests kept
arriving, the backlog grew faster than it could clear, exhausting available
database connections and producing the latency and errors during the window.

In short: the maintenance was the trigger, but the underlying fragility was
an inefficient per-request write pattern that placed continuous load on a
single part of the database.

What we've done

We removed that per-request write from the live request path and replaced it
with an efficient batched job that updates recently-active devices once per
minute in a single operation. We also changed related metadata updates to
write only when something has actually changed, and applied the same
improvement to connection-activity tracking. Together these changes reduced
write volume by orders of magnitude, ended the continuous background database
churn, and took this work off the critical path of customer requests.

We're sorry for the disruption. The underlying inefficiency has been
resolved, and the platform is more resilient as a result.