On Thursday, 2018-05-24, freistilbox operation was severely disrupted by a power failure at our data centre provider. This is the Incident Review we promised to publish once all relevant issues had been resolved.
We apologise for this outage. We take reliability seriously, and an interruption of this magnitude, along with the impact it has on our customers, is unacceptable — though, as you will read below, 90% of it was out of our control in this case.
On Thursday, 2018-05-24, at 09:05 UTC, our on-call engineer was alerted by our monitoring system that a number of servers had suddenly gone offline, and the list was exceptionally long. This pointed to at least a network outage, so we posted a short initial notice to our status page, which we then updated as the situation developed. We also contacted the data centre provider’s support team for clarification.
While we didn’t get a direct answer, the data centre provider posted a first public status update roughly 45 minutes later, explaining that three of the fourteen server rooms in their data centre park had lost power due to a spike in the power grid, followed by the failure of their internal redundancies. Our long-term customers might remember a very similar incident back in 2014, but back then, only one of those rooms failed. We weren’t exaggerating when we called this the worst power outage we have ever experienced.
Rather than give you a painstaking minute-by-minute play-by-play (which you can re-read in the status page entry for this incident anyway), I’d like to explain why this outage had the impact it did and how we addressed the situation in general.
Our infrastructure is distributed across 11 of the available data centre rooms, specifically to create wider redundancies and lessen the impact of incidents like these. Even so, we were hit rather severely. (If the term “room” sounds underwhelming in this context, please see here for a virtual tour of one of the data centres.)
To provide you with the best possible performance under ideal conditions, we run a “primary” rack in one room, where web boxes sit physically close to their database and file storage nodes. And of course, that room was one of the three that lost power. “When it rains, it pours.”
Since we run the passive nodes of our database clusters in different rooms for exactly this situation, we executed a failover to the standby nodes. This restored operation for a large part of our hosting infrastructure.
Some clusters were still running at reduced capacity because it took our data centre provider five hours to restore power. Customers on our “solo” plan remained completely offline until power returned to the affected rooms and the servers that “did not want to start on their own” were restarted manually by the data centre staff.
Even though database functionality had been restored at this point, the greater distance between standby nodes and web boxes caused degraded performance.
Only once the power was fully restored were we able to assess the damage: two primary database servers had been physically damaged by the power loss, as had two shared application servers, along with “a few” hard drives and assorted smaller issues across various other servers.
At that time, our data centre provider had roughly 3,500 support tickets to work through, which is why it took “some time” for our requests to be fulfilled (restarting servers that didn’t come up on their own, replacing damaged hard drives, checking networking on some servers, and so on). Because of that ticket backlog, we decided against placing custom server orders to have the replacement database servers put into the “primary” room from the get-go: custom orders normally take around two days to fulfil even under ideal conditions, whereas an order of “we need a server, we don’t care where” is filled automatically and makes the server available to us within minutes.
That decision allowed us to replace all damaged database nodes and application servers rather quickly, restore all affected clusters to full capacity, and start rebuilding redundancy for the affected databases.
Since these databases hold a lot of data, a full sync took more than 20 hours in some cases, depending on the database. Database redundancy was therefore only restored on the evening of day 2 of the incident, and on the morning of day 3 for the last affected database.
Optimal database performance was restored on day 5 (Monday), once our data centre provider got to work on our request to move the last new database back into our primary room.
We could perhaps have been 30-45 minutes faster in restoring things on day 1 of the incident. I say “could” because it was resolving this incident that taught us how to tackle these kinds of incidents better in the future — we fortunately had not had much opportunity to train for an incident like this, where “everything” is affected all at once. It was a stressful situation, some human error inevitably snuck in, and when handling large amounts of data, any action that has to be repeated due to a mistake can obviously cost quite some time.
We are sorry for that, but claiming that we won’t make some new mistakes next time would just be an obvious lie: no matter how much we automate day-to-day operation, in incident situations the human factor will sneak in again in some fashion.
One thing I have not yet addressed is the “downgraded network card performance” mentioned in one of the status updates. I’ll spare you the details, but a few customers were still experiencing degraded performance even though everything looked fine on our end. It turned out that a few network cards were operating at only a hundredth of their possible speed.
How and why this happened is hard to pin down. Link speeds are negotiated between the network card and the data centre switch the card is connected to, and, of course, all components are configured to always negotiate for maximum speed. In a few cases, this did not happen on startup, and the cards had to be told manually to renegotiate, reconfiguring themselves for maximum throughput.
We did not have monitoring for this in place because “this is not a thing that happens”: it simply had not occurred to us to monitor it. Hard drives die, servers lose power, someone pulls the network cord of the wrong server due to a lack of coffee. None of these things should happen, but they do, and even a data centre technician pulling the wrong cord happens more often than network equipment developing a mind of its own and deciding to keep working, just not as fast. Amid all the chaos, we did not notice the degraded performance ourselves. Only after customer feedback suggested that things were “still slow even though they really should not be” did a colleague have the idea to check the network cards. We are now implementing monitoring for this specific kind of error.
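To illustrate the kind of check involved: on Linux, the speed a network card has negotiated (in Mbit/s) is exposed in sysfs at /sys/class/net/&lt;iface&gt;/speed, so a probe only needs to compare that value against what the link should be running at. The following is a minimal sketch, not our actual monitoring code; interface names and expected speeds are invented examples.

```python
# Sketch of a NIC link-speed check (illustrative, not our production code).
# On Linux, /sys/class/net/<iface>/speed holds the negotiated speed in Mbit/s.
from pathlib import Path

# Example expectations; real values would come from per-host configuration.
EXPECTED_MBITS = {"eth0": 10000, "eth1": 1000}

def link_speed(iface, sysfs=Path("/sys/class/net")):
    """Return the negotiated speed in Mbit/s, or None if it can't be read."""
    try:
        return int((sysfs / iface / "speed").read_text().strip())
    except (OSError, ValueError):
        return None

def check_speeds(expected, read_speed=link_speed):
    """Return (iface, actual, expected) for every link below its expected speed."""
    problems = []
    for iface, want in expected.items():
        got = read_speed(iface)
        if got is None or got < want:
            problems.append((iface, got, want))
    return problems
```

In our case, the actual fix for an affected card amounted to asking it to renegotiate with the switch; on Linux, for example, `ethtool -r <iface>` restarts auto-negotiation.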
Above, I mentioned “damaged” systems. 90% of that damage was failed hard drives. Here is the dirty secret about that: hard disk drives (HDDs) are the weak spot of any server. There is a reason HDDs are lovingly called “spinning rust” throughout the industry.
An HDD is an enclosure around a stack of wafer-thin metal platters that spin at 5,000 to 10,000 rotations per minute in a sealed chamber of filtered air. At an average of 7,500 RPM, that comes down to 125 rotations per second. To spare you my personal treatise on HDDs, let’s put it like this: when someone keys your car, at least they “just” scratch the paint on the door. In this case, someone scratched through the gas tank, at 125 scratches per second.
SSD technology has improved vastly in recent years, and mature SSDs offer not only far better performance but also less potential for damage from a power outage or similar event. We are therefore in an ongoing process of replacing all servers with SSD-based ones wherever possible.
All shell boxes already run exclusively on SSDs. We are down to two application servers still running on HDDs, as is the one database server that was not damaged in some way during this incident.
Additionally, some servers running Varnish caches and some edge routers still use HDDs, which in the larger scheme of things hardly matters: they barely need their drives for anything beyond the OS itself, as everything else comes in over the network and goes straight into RAM.
Ideally, we would have replaced the HDD-based database servers before this incident, but of course, their replacement was scheduled for “soon”: performance there was adequate, and we had prioritised the servers running application and shell boxes, where a larger performance gain for our customers could be achieved.
All servers affected during the incident were subsequently replaced with SSD-only servers, and any HDD-based servers mentioned above will be replaced in the future, too.
There are a few things you can do to help us in situations like this, though none of them are out of the ordinary.
Please take this opportunity to subscribe to our status page at http://status.freistilbox.com/. Once subscribed, you will automatically be notified when an incident happens or gets updated, or when we schedule a system maintenance of some kind (and we only use this channel for that exact purpose, so there is no need to fear spam).
This keeps you informed and up to date on any developing situation, and it helps us greatly because it reduces the number of tickets asking whether something is wrong, enabling us to resolve the situation that much more quickly.
The following goes for every ticket, always, but especially in these situations, as every moment we have to spend on “figuring it out” is a moment lost to actually solving the issue.
An optimal ticket contains your ClusterID, the siteID of the affected site, a copy-and-pasted error message or even a screenshot of the error encountered, and ideally a description of what is happening versus what should be happening.
If you have a theory, or “just a feeling”, about what might be wrong, by all means include it: you know the feel of your website much better than we do. But please make clear to us which is which, fact versus theory, to avoid confusion on our end.
The above describes the situation for about 90% of our customers. Some encountered more specific errors that we were only able to address on day 2 of the incident, once the dust had settled and general operation had been restored throughout the infrastructure. We are very sorry for the additional trouble experienced there.
While this will surely be hard to hear for those customers specifically, from our perspective this incident went rather well, all things considered.
This was the most severe outage we ever experienced. No data was lost. We did not make major mistakes on our end that made things worse for our customers or prolonged the incident in general, beyond that aforementioned 30-45 minutes.
We took this “opportunity” to replace old hardware with new, which in some cases resulted in significantly improved performance for individual customers, and for all customers with regard to the upgraded database servers. And (read this as gallows humour if you must) we have now tested, verified, and improved almost all of our emergency and restoration procedures, as well as our monitoring, even for edge cases.
We will be able to do better next time — not flawlessly because, in the end, we are all human, but better still.
Thank you for your patience while waiting for this report, and please know that feedback is always welcome.
Best regards,
Simon