Database failure

Incident Report for freistilbox

Resolved

This incident has been resolved.

Posted Dec 05, 2023 - 13:38 UTC

Monitoring

We are happy to report that we have successfully restored redundancy and thus standard operation for the database cluster db16.

We will keep monitoring the situation to catch any regression as quickly as possible.

No data was lost, except during a very short period in which traffic was not cleanly routed to a single database node; this seems to have mostly affected ephemeral data such as cache contents.

After getting some much-needed rest, we will initiate a series of follow-up tasks. The most important one is a thorough incident review in which we will analyze root causes and the mitigation timeline, and determine necessary improvements to our hosting infrastructure and operations processes. We will publish this review by the end of next week.

We sincerely apologize for the downtime this incident caused for the affected customers, and we will take every possible measure to prevent an outage like this in the future.

Posted Dec 05, 2023 - 03:10 UTC

Update

We were able to successfully restore a consistent data set on the previously broken but otherwise reliably working cluster node.

Unfortunately, when we executed the final recovery step, the switch of network traffic back to this node, the routing change on the data centre network level didn't go through cleanly and created a "split-brain" situation that destroyed the newly restored data synchronization between both nodes within seconds. The active node is still fully operational, and website operation is not impacted, but the standby node has been rendered unusable.

This forces us to immediately follow up with a task that we would rather have tackled at a later time, which is to set up data replication to a completely new cluster node. It's a silver lining that we will be able to use our standard operating procedure for this process, but it will require a few more hours of work.

Posted Dec 05, 2023 - 01:31 UTC

Update

We have successfully transferred the majority of the active data set and are about to launch the final transfer phase, which requires a database lock to ensure data consistency. We're expecting a database downtime of 10 to 15 minutes.

Posted Dec 05, 2023 - 00:14 UTC

Update

Since all of our attempts at cloning the active database server using our regular backup software were unsuccessful, we will now take a new approach. This alternative process does not rely on database stability because it operates on the filesystem level. Its downside is that its final phase will require a downtime of the database server, during which website operation will not be possible. We are shifting as much of the necessary data transfer as possible into the initial phase, during which the database server can operate normally, in order to keep the final offline phase as short as technically possible.

Posted Dec 04, 2023 - 23:22 UTC

Update

We were able to get the database server back online and are relieved to see it serving data to websites again. We are resuming our attempts to restore the full active data set on the broken node. In parallel, we are preparing last night's database backup so that we can restore it as a last resort if all other avenues fail.

Posted Dec 04, 2023 - 21:38 UTC

Update

After multiple failed attempts at completing the necessary backup, the server has become unresponsive. We are "all hands on deck" and are working with datacenter staff to get it back online.

Posted Dec 04, 2023 - 20:51 UTC

Identified

An operator error left the active node of our database cluster db16 in a broken state. As per our standard operating procedures, we performed a failover to the cluster's standby node, which successfully took over serving data to its associated websites.

Unfortunately, we need a full backup to restore the broken node, and thus redundancy in the cluster, and an instability in the newly active node keeps causing this backup to fail. We are investigating the cause of this instability as well as possible approaches to successfully complete a backup of the whole data set.

This restoration work might require us to restart the database server, which will cause a short service downtime. We apologize for the service interruption this will inevitably cause; we are doing our best to keep it to an absolute minimum.

Posted Dec 04, 2023 - 18:47 UTC

Investigating

One of our database clusters suffered a server failure. We're switching operation to the standby node and will be back with an update.

Posted Dec 04, 2023 - 16:38 UTC

This incident affected: Database clusters.