Degraded file storage performance
Incident Report for freistilbox
Resolved
After working on various performance and load improvements to ensure the reliable operation of our central network storage system, we're happy to close this incident. We've put a lot of effort into investigating the many sources of file system load and made significant improvements.

One finding was that Drupal, by default, writes its Twig cache files to the shared storage. Since these cache files become part of the application codebase, retrieving them over the network comes with a substantial performance penalty. Over the last two weeks, we built and tested a change to our configuration snippets that forces Drupal to maintain its Twig cache on our web boxes' much faster local storage. (Please make sure to use our configuration snippets; they ensure that your website makes optimal use of our managed hosting features!)

We also discovered that our metrics monitoring system, by default, collects disk space utilisation every second. This insight answered the question of why the storage system reported tens of thousands of "statfs" requests per minute, every minute, day and night. Increasing the collection interval to 10 seconds removed 90% of these requests.
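
As a rough illustration of that arithmetic (not our actual monitoring configuration), the following Python sketch computes the request rate for a given collection interval. The function names and the figure of 500 collector/mount pairs are hypothetical and chosen only to show how a one-second interval adds up to tens of thousands of requests per minute.

```python
def statfs_per_minute(interval_seconds, collectors=1):
    """Requests per minute when each collector issues one statfs call per interval."""
    return collectors * 60 / interval_seconds


def relative_reduction(old_interval, new_interval):
    """Fraction of requests removed by moving from old_interval to new_interval."""
    return 1 - old_interval / new_interval


# Hypothetical fleet size, for illustration only.
collectors = 500

before = statfs_per_minute(1, collectors)   # one call per second per collector
after = statfs_per_minute(10, collectors)   # one call every 10 seconds per collector
saved = relative_reduction(1, 10)           # share of requests removed

print(before, after, round(saved, 2))
```

The reduction depends only on the ratio of the two intervals, so it holds regardless of how many collectors are polling the storage system.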

We've also made a few more minor improvements and, as you can tell from our latest maintenance announcement, are in the process of adding more powerful hardware to our storage infrastructure.

All these learnings and measures give us complete confidence that freistilbox will continue to reliably store and deliver the many terabytes of content our customers put into our care. Thank you for your trust, and have a great weekend!
Posted Oct 15, 2021 - 13:21 UTC
Update
Here's an update on what we've done and learned so far.

Our first step was to change the storage metadata server so that the hourly metadata dump process runs on a different CPU core than the main process. This has reduced the impact of metadata persistence substantially. However, the dump still causes a clear performance hit, so we're working on further load reduction measures.
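
For readers curious what separating processes by CPU core looks like in general, here is a minimal Python sketch for Linux. It is an illustration under assumptions, not the mechanism used by our metadata server: the helper names, the process ID and the core numbers are all hypothetical. It relies on os.sched_getaffinity and os.sched_setaffinity, which are only available on some Unix platforms such as Linux.

```python
import os


def cores_for_dump(available, main_core):
    """Return every available core except the one reserved for the main process."""
    others = set(available) - {main_core}
    # On a single-core machine there is nothing to separate, so keep what exists.
    return others or set(available)


def pin_dump_process(dump_pid, main_core):
    """Restrict the (hypothetical) dump process to cores not used by the main process."""
    available = os.sched_getaffinity(0)      # cores this process is allowed to use
    target = cores_for_dump(available, main_core)
    os.sched_setaffinity(dump_pid, target)   # Linux-only system call wrapper
    return target
```

With the main process assumed to run on core 0 of a four-core machine, cores_for_dump({0, 1, 2, 3}, 0) leaves cores 1 to 3 for the dump process.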

One of these measures was to take Varnish logs off the storage system, as we mentioned before. This reduced the number of low-level write operations on the storage system by a significant amount, freeing up performance for essential website operation and further improvements.

Customer feedback led us to the discovery that by writing its Twig cache to the file area, Drupal 8 basically moves parts of its active codebase onto the shared network storage. This is causing a huge drop in website performance as well as substantial load on the storage system. We're currently testing a Drupal configuration change that will move the Twig cache to local web box storage where it belongs.

That's it for now! We're happy with our progress and will be back shortly when we have another update for you. Thanks for bearing with us!
Posted Sep 29, 2021 - 16:26 UTC
Investigating
Some of our customers have contacted us about unusually high page loading times or even short periods of website downtime. We've been able to track some of these incidents back to a performance degradation in our shared network storage system.

The freistilbox network storage system is built on a distributed cluster of file storage nodes controlled by a central file metadata server. Since we found performance bottlenecks in both the metadata server and the file storage nodes, we've decided to make this an official incident that we're going to manage according to our established incident management process.

Since the start of our investigation yesterday morning, we've identified several service components that are a potential or confirmed cause of performance issues. We've determined mitigation options, and assessed their risk of negatively impacting website performance and uptime. Finally, we've decided on a number of measures that we're going to implement first.

So far, none of these first-aid measures will require any action from your side. One of them is going to reduce write load, at least temporarily, by taking Varnish log files off the shared storage system. We're still going to store them locally on the Varnish servers, so if you need us to provide you with full web request logs, please let us know via the freistilbox dashboard.

We are taking great care that the necessary changes will not negatively impact website operation. However, experience tells us that unexpected effects might prevent us from fully achieving this goal. That's why we ask you to bear with us while we're making sure that your websites will continue to be able to grow with your success.

We'll update this incident report as soon as we gather new insight. If you have any questions or would like to share feedback, please contact us at support@freistilbox.com!
Posted Sep 28, 2021 - 14:05 UTC
This incident affected: Storage clusters.