On Thursday, 2018-05-24, freistilbox operation was severely disrupted by a power failure at our data centre provider. This is the Incident Review we promised to publish once all relevant issues had been resolved.
We apologise for this outage. We take reliability seriously, and an interruption of this magnitude, along with the impact it has on our customers, is unacceptable — though, as you will read below, 90% of it was out of our control in this case.
On Thursday, 2018-05-24, at 09:05 UTC, our on-call engineer was alerted by our monitoring system that a number of servers had suddenly gone offline, and the list was exceptionally long. This pointed to at least a network outage, so we posted a short initial notice to our status page, which we then updated as the situation developed. We also contacted the data centre provider’s support team for clarification.
While we didn’t get a direct answer, the data centre provider posted a first public status update roughly 45 minutes later, explaining that three of the fourteen server rooms in their data centre park had lost power due to a spike in the power grid, followed by the failure of their internal redundancies. Our long-term customers might remember a very similar incident back in 2014, but back then, only one of those rooms failed. We weren’t exaggerating when we called this the worst power outage we have ever experienced.
Rather than give you a painstaking minute-by-minute play-by-play (which you can re-read in the status page entry for this incident anyway), I’d like to explain why this outage had the impact it did and how we addressed the situation in general.
Our infrastructure is distributed across 11 of the available data centre rooms, specifically to create wider redundancies and lessen the impact of incidents like these. Even so, we were hit rather severely. (If the term “room” sounds underwhelming in this context, please see here for a virtual tour of one of the data centres.)
To provide you with the best possible performance under ideal conditions, we run a “primary” rack in one room, where web boxes sit physically close to their database and file storage nodes. And of course, that room was one of the three that lost power. “When it rains, it pours.”
Since we run the passive nodes of our database clusters in different rooms for exactly this situation, we executed a failover to the standby nodes. This restored operation for a large part of our hosting infrastructure.
Some clusters were still running at reduced capacity because it took our data centre provider five hours to restore power. Customers on our “solo” plan remained completely offline until power returned to the affected rooms and the servers that “did not want to start on their own” were restarted manually by the data centre staff.
Even though database functionality had been restored at this point, the greater distance between standby nodes and web boxes caused degraded performance.
Only once the power was fully restored were we able to assess the damage: two primary database servers had been physically damaged by the power loss, as had two shared application servers, along with “a few” hard drives and assorted smaller issues across various other servers.
At that time, our data centre provider had roughly 3,500 support tickets to work through, which is why it took “some time” for our requests to be fulfilled (restarting servers that didn’t come up on their own, replacing damaged hard drives, checking networking on some servers, and so on). Because of that ticket backlog, we decided against placing custom server orders to have the replacement database servers put into the “primary” room from the get-go: custom orders normally take around two days to fulfil even under ideal conditions, whereas an order of “we need a server, we don’t care where” is filled automatically and makes the server available to us within minutes.
That decision allowed us to replace all damaged database nodes and application servers rather quickly, restore all affected clusters to full capacity, and start rebuilding redundancy for the affected databases.
Since these databases hold a lot of data, a full sync took more than 20 hours in some cases, depending on the database. Database redundancy was therefore only restored on the evening of day 2 of the incident, and on the morning of day 3 for the last affected database.
Optimal database performance was restored on day 5 (Monday), once our data centre provider got to work on our request to move the last new database back into our primary room.
We could perhaps have been 30-45 minutes faster in restoring things on day 1 of the incident. I say “could” because it was resolving this incident that taught us how to tackle these kinds of incidents better in the future — we fortunately had not had much opportunity to train for an incident like this, where “everything” is affected all at once. It was a stressful situation, some human error inevitably snuck in, and when handling large amounts of data, any action that has to be repeated due to a mistake can obviously cost quite some time.
We are sorry for that, but claiming that we won’t make some new mistakes next time would just be an obvious lie: no matter how much we automate day-to-day operation, in incident situations the human factor will sneak in again in some fashion.
One thing I have not yet addressed is the “downgraded network card performance” mentioned in one of the status updates. I’ll spare you the details, but a few customers were still experiencing degraded performance even though everything looked fine on our end. It turned out that a few network cards were operating at only a hundredth of their possible speed.
How and why this happened is hard to pin down. Link speeds are negotiated between the network card and the data centre switch the card is connected to, and, of course, all components are configured to always negotiate for maximum speed. In a few cases, this did not happen on startup, and the cards had to be told manually to renegotiate, reconfiguring themselves for maximum throughput.
We did not have monitoring for this in place because “this is not a thing that happens”: it simply had not occurred to us to monitor it. Hard drives die, servers lose power, someone pulls the network cord of the wrong server due to a lack of coffee. None of these things should happen, but they do, and even a data centre technician pulling the wrong cord happens more often than network equipment developing a mind of its own and deciding to keep working, just not as fast. Amid all the chaos, we did not notice the degraded performance ourselves. Only after customer feedback suggested that things were “still slow even though they really should not be” did a colleague have the idea to check the network cards. We are now implementing monitoring for this specific kind of error.
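To illustrate the kind of check involved: on Linux, the speed a network card has negotiated (in Mbit/s) is exposed in sysfs at /sys/class/net/&lt;iface&gt;/speed, so a probe only needs to compare that value against what the link should be running at. The following is a minimal sketch, not our actual monitoring code; interface names and expected speeds are invented examples.

```python
# Sketch of a NIC link-speed check (illustrative, not our production code).
# On Linux, /sys/class/net/<iface>/speed holds the negotiated speed in Mbit/s.
from pathlib import Path

# Example expectations; real values would come from per-host configuration.
EXPECTED_MBITS = {"eth0": 10000, "eth1": 1000}

def link_speed(iface, sysfs=Path("/sys/class/net")):
    """Return the negotiated speed in Mbit/s, or None if it can't be read."""
    try:
        return int((sysfs / iface / "speed").read_text().strip())
    except (OSError, ValueError):
        return None

def check_speeds(expected, read_speed=link_speed):
    """Return (iface, actual, expected) for every link below its expected speed."""
    problems = []
    for iface, want in expected.items():
        got = read_speed(iface)
        if got is None or got < want:
            problems.append((iface, got, want))
    return problems
```

In our case, the actual fix for an affected card amounted to asking it to renegotiate with the switch; on Linux, for example, `ethtool -r <iface>` restarts auto-negotiation.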
Above, I mentioned “damaged” systems. 90% of that damage was failed hard drives. Here is the dirty secret about that: hard disk drives (HDDs) are the weak spot of any server. There is a reason HDDs are lovingly called “spinning rust” throughout the industry.
An HDD is an enclosure around a stack of wafer-thin metal platters that spin at 5,000 to 10,000 rotations per minute in a sealed chamber of filtered air. At an average of 7,500 RPM, that comes down to 125 rotations per second. To spare you my personal treatise on HDDs, let’s put it like this: when someone keys your car, at least they “just” scratch the paint on the door. In this case, someone scratched through the gas tank, at 125 scratches per second.
SSD technology has improved vastly in recent years, and mature SSDs offer not only far better performance but also less potential for damage from a power outage or similar event. We are therefore in an ongoing process of replacing all servers with SSD-based ones wherever possible.
All shell boxes already run exclusively on SSDs. We are down to two application servers still running on HDDs, as is the one database server that was not damaged in some way during this incident.
Additionally, some servers running Varnish caches and some edge routers still use HDDs, which in the larger scheme of things hardly matters: they barely need their drives for anything beyond the OS itself, as everything else comes in over the network and goes straight into RAM.
Ideally, we would have replaced the HDD-based database servers before this incident, but of course, their replacement was scheduled for “soon”: performance there was adequate, and we had prioritised the servers running application and shell boxes, where a larger performance gain for our customers could be achieved.
All servers affected during the incident were subsequently replaced with SSD-only servers, and any HDD-based servers mentioned above will be replaced in the future, too.
There are a few things you can do to help us in situations like this, though none of them are out of the ordinary.
Please take this opportunity to subscribe to our status page at http://status.freistilbox.com/. Once subscribed, you will automatically be notified when an incident happens or gets updated, or when we schedule a system maintenance of some kind (and we only use this channel for that exact purpose, so there is no need to fear spam).
This keeps you informed and up to date on any developing situation, and it helps us greatly because it reduces the number of tickets asking whether something is wrong, enabling us to resolve the situation that much more quickly.
The following goes for every ticket, always, but especially in these situations, as every moment we have to spend on “figuring it out” is a moment lost to actually solving the issue.
An optimal ticket contains your ClusterID, the siteID of the affected site, a copy-and-pasted error message or even a screenshot of the error encountered, and ideally a description of what is happening versus what should be happening.
If you have a theory, or “just a feeling”, about what might be wrong, by all means include it: you know the feel of your website much better than we do. But please make clear to us which is which, fact versus theory, to avoid confusion on our end.
The above describes the situation for about 90% of our customers. Some encountered more specific errors that we were only able to address on day 2 of the incident, once the dust had settled and general operation had been restored throughout the infrastructure. We are very sorry for the additional trouble experienced there.
While this will surely be hard to hear for those customers specifically, from our perspective this incident went rather well, all things considered.
This was the most severe outage we ever experienced. No data was lost. We did not make major mistakes on our end that made things worse for our customers or prolonged the incident in general, beyond that aforementioned 30-45 minutes.
We took this “opportunity” to replace old hardware with new, which in some cases resulted in significantly improved performance for individual customers, and for all customers with regard to the upgraded database servers. And (read this as gallows humour if you must) we have now tested, verified, and improved almost all of our emergency and restoration procedures, as well as our monitoring, even for edge cases.
We will be able to do better next time — not flawlessly because, in the end, we are all human, but better still.
Thank you for your patience while waiting for this report, and please know that feedback is always welcome.
Best regards,
Simon