Today at approximately 05:50 UTC, one of our shared database clusters ran into a query backlog that affected not only the website that caused it but also other websites hosted on the same cluster. Our engineering team resolved the issue about 20 minutes later by terminating the offending queries.
We are very sorry for this incident, not only because it caused website downtime for some of our customers, but also because we could have resolved it much faster. We sincerely apologise to all affected customers. Below, we outline the tooling and process weaknesses that contributed to our slow response and how we will address them.
Even though this issue was not complex from a technical standpoint, our handling of it was far from optimal. A combination of factors allowed this incident to go unnoticed for too long. Here is what we have identified so far:
- The health indicators we monitor on our databases do not include the number of long-running queries. An unusually high number of long-running queries would have been the leading indicator in this incident (a sketch of such a check follows this list).
- A secondary indicator during this incident was the number of open database connections. While this number increased significantly, it did not reach the static threshold we had set for alerting on-call staff.
- The previous point is tied to a limitation of our current monitoring solution, which can compare current KPIs to static thresholds but cannot detect a deviation from expected behaviour over time.
- Our external website monitoring did detect downtime but did not trigger alerts because none of the affected websites was assigned a 24/7 uptime SLA.
- This assumption was wrong in one case: one affected website did have such an SLA, but its alerts had not been set up due to a clerical error.
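To make the missing leading indicator more concrete, here is a minimal sketch of a long-running-query check. It assumes a PostgreSQL cluster and the psycopg2 driver; the database type, thresholds, and connection details are illustrative placeholders, not our production setup.

```python
# Minimal sketch of a long-running-query check. Assumptions: a PostgreSQL
# cluster and the psycopg2 driver; thresholds and connection details are
# illustrative placeholders, not our production configuration.
import psycopg2

LONG_RUNNING_SECONDS = 60   # hypothetical definition of "long-running"
ALERT_COUNT = 5             # hypothetical count that should page on-call staff


def long_running_queries(dsn: str) -> list[tuple[int, str]]:
    """Return (pid, query) for active queries exceeding the duration threshold."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(
            """
            SELECT pid, query
            FROM pg_stat_activity
            WHERE state = 'active'
              AND pid <> pg_backend_pid()
              AND now() - query_start > %s * interval '1 second'
            """,
            (LONG_RUNNING_SECONDS,),
        )
        return cur.fetchall()


if __name__ == "__main__":
    offenders = long_running_queries("host=db-cluster-1 dbname=shared user=monitor")
    if len(offenders) > ALERT_COUNT:
        # This is the leading indicator we were missing; on PostgreSQL,
        # offending queries can then be cancelled or terminated, e.g. with
        # SELECT pg_terminate_backend(pid).
        print(f"ALERT: {len(offenders)} long-running queries")
        for pid, query in offenders:
            print(f"  pid={pid}: {query[:80]}")
```

A check along these lines would have surfaced the backlog as it formed, rather than leaving us to rely on downstream symptoms such as the connection count.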
Now that we have identified these weaknesses, here is how we are going to resolve them:
- We are already building a much more sophisticated monitoring infrastructure. It will detect anomalies by comparing key performance indicators both against static thresholds and against their own trends over time (the sketch after this list illustrates the difference). We expect this new monitoring solution to be in place by the end of next month.
- We will extend the set of health indicators we monitor to cover current blind spots such as long-running database queries. Where possible, we will implement these checks with our current monitoring technology in the meantime.
- We are evaluating alternatives to our current use of shared database clusters to reduce the blast radius of an issue caused by a single website. This evaluation will result in new options in our managed hosting product offerings.
- We will improve our change management process to ensure that changes to service level agreements are immediately reflected in our managed hosting infrastructure.
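To illustrate the difference between the static thresholds we use today and the trend-based detection we are building, here is a minimal sketch. The window size, sensitivity, and sample values are assumptions chosen for illustration; they do not describe the actual implementation of the new monitoring solution.

```python
# Minimal sketch of a trend-based check next to a static threshold.
# Window size, sensitivity, and the static limit are illustrative assumptions.
from collections import deque
from statistics import mean, stdev


class KpiMonitor:
    def __init__(self, static_threshold: float, window: int = 60, sensitivity: float = 3.0):
        self.static_threshold = static_threshold  # fixed limit, as in our current monitoring
        self.history = deque(maxlen=window)       # recent samples, e.g. one per minute
        self.sensitivity = sensitivity            # tolerated deviation in standard deviations

    def check(self, value: float) -> list[str]:
        alerts = []
        # 1) Static threshold: what our current monitoring already does.
        if value > self.static_threshold:
            alerts.append("static threshold exceeded")
        # 2) Trend deviation: flag values far outside the recent baseline,
        #    even when they stay below the static threshold.
        if len(self.history) >= 10:
            baseline, spread = mean(self.history), stdev(self.history)
            if spread > 0 and value > baseline + self.sensitivity * spread:
                alerts.append("deviation from recent baseline")
        self.history.append(value)
        return alerts


# Hypothetical open-connection counts: the last sample stays well below a
# static limit of 500 but jumps far above the recent baseline.
monitor = KpiMonitor(static_threshold=500)
for sample in [40, 42, 41, 43, 40, 44, 42, 41, 43, 42, 180]:
    for alert in monitor.check(sample):
        print(f"connections={sample}: {alert}")
```

In this illustrative run, the connection count jumps well above its recent baseline while staying far below the static limit, which is exactly the pattern we missed during this incident.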
Once again, we offer our sincere apologies for the website downtime caused by this incident. We are determined to reduce or, ideally, eliminate the risk of issues like this recurring, and we will keep you updated on our progress via our blog and email newsletter.