freistilbox Status - Incident History

Degraded network performance (February 2024)

Feb 27, 09:26 UTC - Resolved
As we have now learned, the outage was caused by a router failure. We're sorry for the website downtime this caused for some of our customers, and will discuss this incident and possible steps we can take to mitigate issues like this.
https://status.hetzner.com/incident/cccb44b0-ce1a-4f4e-a796-90b50d11126f

Feb 27, 05:06 UTC - Monitoring
While we haven't heard back from the data centre yet, the network seems to have stabilized again and affected websites are back online. We apologize for the outage and will keep monitoring the situation.

Feb 27, 04:36 UTC - Update
We are continuing to investigate this issue.

Feb 27, 04:10 UTC - Investigating
Parts of our infrastructure are affected by degraded network performance. We have contacted data centre support and are working on ways to mitigate the problem.

Limited availability (December 2023)

Dec 22, 10:23 UTC - Resolved
This incident has been resolved.

Dec 22, 09:51 UTC - Monitoring
A fix has been implemented and we are monitoring the results.

Dec 22, 09:21 UTC - Identified
We switched operation to the standby database node, which restored service. We are working on a fix for the stalled node.

Dec 22, 09:14 UTC - Investigating
We have noticed limited availability issues that affect a few customers.
We are analysing the situation and are going to update this incident soon.

Database failure (December 2023)

Dec 5, 13:38 UTC - Resolved
This incident has been resolved.

Dec 5, 03:10 UTC - Monitoring
We are happy to report that we have successfully restored redundancy, and thus standard operation, for the database cluster db16.

We will keep monitoring the situation to catch any regression as quickly as possible.

There was no loss of data except for a very short period during which traffic was not cleanly routed to only one database node; however, this seems to have mostly affected ephemeral data like cache contents.

After getting some highly necessary rest, we will initiate a series of follow-up tasks, the most important one being a thorough incident review in which we analyze the root causes and the mitigation timeline, and determine necessary improvements to our hosting infrastructure and operations processes. We will publish this review by the end of next week.

We sincerely apologize for the downtime this incident caused for the affected customers, and will take any possible measure to prevent an outage like this in the future.

Dec 5, 01:31 UTC - Update
We were able to successfully restore a consistent data set on the previously broken (but now reliably working) cluster node.

Unfortunately, when we executed the final recovery step, the switch of network traffic back to this node, the routing change on the data centre network level didn't go through cleanly and created a "split-brain" situation that destroyed the newly restored data synchronization between both nodes within seconds. The active node is still fully operational, and website operation is not impacted, but the standby node has been rendered unusable.

This forces us to immediately follow up with a task that we would rather have tackled at a later time, which is to set up data replication to a completely new cluster node. It's a silver lining that we will be able to use our standard operating procedure for this process, but it will require a few more hours of work.

Dec 5, 00:14 UTC - Update
We have successfully transferred the majority of the active data set and are about to launch the final transfer phase, which requires a database lock to ensure data consistency. We're expecting a database downtime of 10 to 15 minutes.

Dec 4, 23:22 UTC - Update
Since all our many attempts at cloning the active database server using our regular backup software were unsuccessful, we will now take a new approach. This alternative process does not have to rely on database stability because it operates on the filesystem level. But it has the downside that its final phase will require a downtime of the database server during which website operation will not be possible.
We are shifting as much of the necessary data transfer as possible into the initial phase, during which the database server can operate normally, in order to keep the final offline phase as short as technically possible (a generic sketch of this approach follows at the end of this incident).

Dec 4, 21:38 UTC - Update
We were able to get the database server back online and are relieved to see it serving data to websites again. We are resuming our attempts to restore the full active data set on the broken node. In parallel, we are preparing last night's database backup so we can restore it as a last resort after we've tried all other avenues.

Dec 4, 20:51 UTC - Update
After multiple failed attempts at creating the necessary backup, the server has become unresponsive. We are "all hands on deck" and are working with data centre staff to get it back online.

Dec 4, 18:47 UTC - Identified
An operator error left the active node of our database cluster db16 in a broken state. As per our standard operating procedures, we performed a failover to the cluster's standby node, which successfully took over serving data to its associated websites.

Unfortunately, an instability in this newly active node keeps causing the failure of the full backup that we need in order to restore the broken node, and thus redundancy in the cluster. We are investigating the cause of this instability as well as possible approaches to complete a successful backup of the whole data set.

This restoration work might require us to restart the database server, which will cause a short service downtime. We apologize for the interruption this will inevitably cause; we are doing our best to keep it to an absolute minimum.

Dec 4, 16:38 UTC - Investigating
One of our database clusters suffered a server failure. We're switching operation to the standby node and will be back with an update.
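The Dec 4, 23:22 update describes a two-phase, filesystem-level transfer: a bulk copy while the database keeps serving traffic, followed by a short offline window in which only the remaining delta is copied. The incident does not name the tools involved; the sketch below is only a minimal illustration of that general pattern, assuming `rsync` and a MySQL-compatible server, with placeholder hosts, paths and service names.

```bash
#!/usr/bin/env bash
# Hypothetical sketch only: hosts, paths and the service name are placeholders,
# not the commands actually used during this incident.
set -euo pipefail

SRC="/var/lib/mysql"                       # data directory on the active node
DST="standby.example.net:/var/lib/mysql"   # node being (re)built

# Phase 1: bulk copy while the database keeps running. This moves the vast
# majority of the data, so the offline phase only has to handle the delta.
rsync -a --delete "$SRC/" "$DST/"

# Phase 2: short offline window. Stop the server so the files on disk are
# consistent, copy only what changed since phase 1, then restart.
systemctl stop mysql
rsync -a --delete "$SRC/" "$DST/"
systemctl start mysql
```

The point of the two passes is that the second `rsync` only transfers files changed since the first pass, which is what keeps the downtime in the expected 10 to 15 minute range rather than the hours a full copy would take.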
Network infrastructure maintenance (November 2023)

Nov 16, 03:58 UTC - Resolved
Maintenance work has finished successfully and connectivity has been fully restored. We apologize for the short partial outage.

Nov 16, 03:45 UTC - Investigating
Network maintenance is being done in one of our datacentres. During this maintenance window, some of our servers will be unreachable for up to 60 minutes. We're monitoring the situation and will update this report as we have new information.

Database overload on db16 (November 2023)

Nov 13, 11:01 UTC - Resolved
We have noticed limited availability issues that affect a few customers.

While analysing the situation, we identified a database that saturated all available network resources on that shared DB cluster. Together with the customer, we have already fixed the root issue and the DB is running fine again.

We apologize for the interruption!

Network infrastructure maintenance (October 2023)

Oct 26, 02:20 UTC - Resolved
Our datacenter provider has had to perform urgent maintenance work on network hardware connecting our infrastructure.

Because of this, one of our active edgerouters became unreachable. We switched operation to a standby edgerouter to restore service.

There was a downtime of about 7 minutes for affected websites.

Regular maintenance (July 2023)

Jul 27, 06:33 UTC - Completed
The scheduled maintenance has been completed.

Jul 27, 06:00 UTC - In progress
Scheduled maintenance is currently in progress. We will provide updates as necessary.

Jul 26, 11:31 UTC - Update
During this maintenance window, hardware maintenance on solr4.freistilbox.net will take place as well. The Solr service on solr4 will be unavailable for a few minutes during this window.

Jul 25, 13:15 UTC - Scheduled
This is just to notify you of work we're going to do during our maintenance window on Thursday between 6 am and 8 am UTC.

While standard operating procedures such as system restarts might cause short interruptions, we don't expect any significant impact on your website uptime. If you have any questions, don't hesitate to contact us at support@freistilbox.com; we'll be happy to help.

Infrastructure upgrades on June 15th (2023)

Jun 15, 07:59 UTC - Completed
The upgrade has been completed.

The new web and Varnish boxes have replaced the old ones.

In the next few minutes, you should be able to log in to the new shell boxes as well, as soon as DNS updates reach you.

If you notice anything unusual, please let us know immediately!

Jun 15, 07:00 UTC - In progress
Scheduled maintenance is currently in progress. We will provide updates as necessary.

Jun 1, 11:42 UTC - Scheduled
On Thursday 2023-06-15 between 7 and 9 am UTC, we are going to switch a batch of clusters to upgraded server infrastructure.
In this batch, we're going to upgrade the following clusters:

- c63
- c67
- c86
- c102
- c103
- c116
- c127
- c128
- c156
- c172
- c182
- c183
- c184
- c188
- c189
- c192
- c194
- c195
- c198
- c199
- c200
- c202
- c203
- c204
- c207

During this maintenance window, we will deactivate all cron jobs on these clusters, then switch to the new infrastructure, and finally reactivate cron jobs. We don't expect this change to affect website uptime.

We will not transfer any contents of user home directories between the old and new shell boxes. For that reason, please back up, ahead of this maintenance, all files stored in user home directories on your cluster's shell box that you will need in the future. You can either download them or move them into the cluster's shared storage.

This change will not require any DNS changes for your websites.

However, please note that the IP addresses of the individual boxes (Varnish, memcached, web application and shell) making up your cluster are going to change. Should you rely on individual IP addresses to access these services, we urge you to instead use the configuration snippets we provide and keep updated for you.

The only change that you are going to notice for sure is the changed SSH host key for the new shell box. Your SSH client will display an error message similar to this:

-------✂--------✂--------✂--------✂---------✂--------✂--------✂--------

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the RSA key sent by the remote host is
e5:72:51:dc:9e:19:37:fb:26:2c:2f:8a:09:02:e5:e4.
Please contact your system administrator.
Add correct host key in /Users/username/.ssh/known_hosts to get rid of this message.
Offending RSA key in /Users/username/known_hosts:1
remove with: ssh-keygen -f "/Users/username/.ssh/known_hosts" -R c61s.freistilbox.net
Password authentication is disabled to avoid man-in-the-middle attacks.
Keyboard-interactive authentication is disabled to avoid man-in-the-middle attacks.

-------✂--------✂--------✂--------✂---------✂--------✂--------✂--------

Simply remove the conflicting old host key. In the terminal, you can do this via `ssh-keygen -R `. The new key will be stored once you approve it at your next login.

If you are doing backups using `drush sql-dump` or `mysqldump`, please note that with the mysql-client update to version 8.0, additional options are necessary. Please refer to https://docs.freistilbox.com/faq/mysqldump_fails/ for details (see also the example after this announcement).

We will notify you of the completed maintenance under this maintenance announcement. Once we're done, please take a minute to make sure the changes did not affect your website operation.

----

If you have any questions or want to make special arrangements for your cluster upgrade, feel free to get in touch with us via the freistilbox dashboard. We're here to help!

----

Documentation links:
Shared storage: https://docs.freistilbox.com/how_it_works/filesystem/#shared-storage
Configuration snippets: https://docs.freistilbox.com/how_it_works/includes/
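For reference, this is roughly what clearing the old host key and adapting a dump command might look like. The hostname is taken from the example error message above and stands in for your own shell box; the `mysqldump` flag is a common workaround for 8.0 clients, given here as an assumption, and the linked FAQ remains authoritative for the exact options.

```bash
# Remove the outdated host key for your shell box (placeholder hostname),
# then reconnect and accept the new key when prompted:
ssh-keygen -R c61s.freistilbox.net
ssh user@c61s.freistilbox.net

# With a MySQL 8.0 client dumping from an older server, disabling column
# statistics is one commonly needed option (assumption; see the FAQ linked
# in the announcement for the options that apply on freistilbox):
mysqldump --column-statistics=0 -h db.example.net -u dbuser -p dbname > dump.sql
```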
Infrastructure upgrades on May 25th (2023)

May 25, 08:21 UTC - Completed
The upgrade has been completed.

The new web and Varnish boxes have replaced the old ones.

In the next few minutes, you should be able to log in to the new shell boxes as well, as soon as DNS updates reach you.

If you notice anything unusual, please let us know immediately!

May 25, 07:00 UTC - In progress
Scheduled maintenance is currently in progress. We will provide updates as necessary.

May 19, 09:03 UTC - Scheduled
On Thursday 2023-05-25 between 7 and 9 am UTC, we are going to switch a batch of clusters to upgraded server infrastructure.

In this batch, we're going to upgrade the following clusters:

- c53
- c68
- c74
- c77
- c80
- c95
- c96
- c99
- c100
- c109
- c119
- c126
- c129
- c136
- c150
- c151
- c158
- c168
- c179
- c181
- c190
- c191
- c209

During this maintenance window, we will deactivate all cron jobs on these clusters, then switch to the new infrastructure, and finally reactivate cron jobs. We don't expect this change to affect website uptime.

We will not transfer any contents of user home directories between the old and new shell boxes. For that reason, please back up, ahead of this maintenance, all files stored in user home directories on your cluster's shell box that you will need in the future. You can either download them or move them into the cluster's shared storage.

This change will not require any DNS changes for your websites.

However, please note that the IP addresses of the individual boxes (Varnish, memcached, web application and shell) making up your cluster are going to change. Should you rely on individual IP addresses to access these services, we urge you to instead use the configuration snippets we provide and keep updated for you.

The only change that you are going to notice for sure is the changed SSH host key for the new shell box. Your SSH client will display an error message similar to this:

-------✂--------✂--------✂--------✂---------✂--------✂--------✂--------

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the RSA key sent by the remote host is
e5:72:51:dc:9e:19:37:fb:26:2c:2f:8a:09:02:e5:e4.
Please contact your system administrator.
Add correct host key in /Users/username/.ssh/known_hosts to get rid of this message.
Offending RSA key in /Users/username/known_hosts:1
remove with: ssh-keygen -f "/Users/username/.ssh/known_hosts" -R c61s.freistilbox.net
Password authentication is disabled to avoid man-in-the-middle attacks.
Keyboard-interactive authentication is disabled to avoid man-in-the-middle attacks.

-------✂--------✂--------✂--------✂---------✂--------✂--------✂--------

Simply remove the conflicting old host key. In the terminal, you can do this via `ssh-keygen -R `. The new key will be stored once you approve it at your next login.

We will notify you of the completed maintenance under this maintenance announcement. Once we're done, please take a minute to make sure the changes did not affect your website operation.

----

If you have any questions or want to make special arrangements for your cluster upgrade, feel free to get in touch with us via the freistilbox dashboard. We're here to help!

----

Documentation links:
Shared storage: https://docs.freistilbox.com/how_it_works/filesystem/#shared-storage
Configuration snippets: https://docs.freistilbox.com/how_it_works/includes/

Linux kernel security related reboots (May 2023)

May 10, 14:06 UTC - Completed
We are very sorry about the confusion, but we were a bit early in announcing this window. No updates for the current Linux kernel security issue are available yet, so we cannot perform upgrades to deploy them.

We will announce the reboot window again as soon as we are sure the updates are available.

May 10, 12:51 UTC - Scheduled
On Thursday, May 11th, between 6 and 7 am UTC, we will perform security related reboots throughout our infrastructure.

This will affect the following freistilbox clusters:

- c58
- c69
- c124
- c145
- c146
- c196
- c201
- c208
- c210

We will reboot in a coordinated way, so freistilbox PRO and Enterprise customers should experience no downtime.

Limited availability (March 2023)

Mar 13, 03:50 UTC - Resolved
A fix has been implemented, and we keep monitoring all affected systems. We will conduct a thorough review of this incident in the morning.

Mar 13, 02:04 UTC - Investigating
We have noticed limited availability issues that affect a few customers. We are analysing the situation and are going to update this incident soon.

Limited availability (February 2023)

Feb 13, 15:05 UTC - Resolved
The downtime was caused by a short cloud infrastructure outage. All services resumed operation by themselves.

Feb 13, 10:27 UTC - Investigating
We have noticed limited availability issues that affect a few customers.
We are analysing the situation and are going to update this incident soon.

Partial network outage (August 2022)

Aug 3, 16:41 UTC - Resolved
This incident has been resolved.

Aug 3, 15:02 UTC - Monitoring
Data centre staff have resolved the issue. All websites are available again. We apologize for the downtime and will keep monitoring the situation.

Aug 3, 14:46 UTC - Investigating
Parts of our infrastructure are affected by a router outage. We have contacted data centre support and are working on ways to mitigate the problem.

Cluster c190 unavailable (March 2022)

Mar 7, 10:29 UTC - Resolved
This incident has been resolved.

Mar 7, 09:41 UTC - Monitoring
A fix has been implemented and we are monitoring the results.

Mar 7, 08:16 UTC - Identified
Due to an outage of a cloud node at our data centre, cluster c190 is currently unavailable. Data centre support has been informed and is working on fixing the issue.

Limited availability due to incoming attack (February 2022)

Feb 1, 23:20 UTC - Resolved
This incident has been resolved.

Feb 1, 23:10 UTC - Monitoring
A fix has been implemented and we are monitoring the results.

Feb 1, 22:27 UTC - Identified
The issue has been identified and a fix is being implemented.

Feb 1, 22:20 UTC - Investigating
High amounts of malicious traffic are affecting parts of our network infrastructure. We've contacted our data centre and will post details as soon as we have updates.

Limited availability (December 2021)

Dec 16, 14:14 UTC - Resolved
This incident has been resolved.

Dec 16, 14:05 UTC - Monitoring
We have noticed limited availability issues that affect a few customers. One edgerouter was unavailable. We have already restored operation by switching to a standby server.

CVE-2021-44228 - Log4j RCE 0-day mitigation (December 2021)

Dec 10, 17:43 UTC - Resolved
A zero-day exploit for a vulnerability in the popular Apache Log4j library (CVE-2021-44228, https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-44228) was made public on December 9, 2021. This exploit allows attackers to execute arbitrary code on the vulnerable system.

While this exploit does not affect our primary web hosting infrastructure, it could affect the Java-based Apache Solr service that we provide to our customers for high-performance content search. We applied a configuration change that mitigates the vulnerability (a generic example of this class of mitigation follows below).

Having neutralized the immediate threat, we will monitor the situation and take additional measures if necessary.
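The announcement does not say which configuration change was applied. For context, a widely recommended mitigation for Solr installations at the time was to disable Log4j 2 message lookups via a JVM flag; the snippet below is a generic sketch of that approach (the file path depends on how Solr is installed) and is not a description of freistilbox's actual change.

```bash
# Generic illustration of the commonly recommended CVE-2021-44228 mitigation
# for Solr: disable JNDI message lookups in Log4j 2, then restart Solr.
# /etc/default/solr.in.sh is the include file used by the standard Solr
# service installer; adjust the path for your installation.
echo 'SOLR_OPTS="$SOLR_OPTS -Dlog4j2.formatMsgNoLookups=true"' | sudo tee -a /etc/default/solr.in.sh
sudo systemctl restart solr
```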
Shared storage maintenance (October 2021)

Oct 21, 05:21 UTC - Completed
The scheduled maintenance has been completed.

Oct 21, 05:10 UTC - Verifying
Verification is currently underway for the maintenance items.

Oct 21, 05:03 UTC - In progress
Scheduled maintenance is currently in progress. We will provide updates as necessary.

Oct 14, 14:22 UTC - Update
We will be undergoing scheduled maintenance during this time.

Oct 14, 14:22 UTC - Scheduled
To finalise upgrade work on our shared storage, we need to stop the shared storage service for a short time frame.

In this window, shared storage won't be accessible for about 5 minutes.

Website requests that access shared storage during this window will see delayed delivery or none at all.

Degraded file storage performance (October 2021)

Oct 15, 13:21 UTC - Resolved
After working on various performance and load improvements to ensure the reliable operation of our central network storage system, we're happy to close this incident. We've put a lot of effort into investigating the many sources of file system load and made significant improvements.

One finding was that Drupal, by default, writes its Twig cache files to the shared storage. Since these cache files become part of the application codebase, retrieving them over the network comes with a substantial performance penalty. Over the last two weeks, we built and tested a change to our configuration snippets that forces Drupal to maintain its Twig cache on our web boxes' much faster local storage. (Please make sure to use our configuration snippets; they ensure that your website makes optimal use of our managed hosting features!)

We also discovered that our metrics monitoring system by default collects disk space utilisation every second.
This insight answered the question of why the storage system reported tens of thousands of "statfs" requests per minute, every minute, day and night. Reducing the collection frequency to 10 seconds removed 90% of these requests.

We've also made a few more minor improvements and, as you can tell from our latest maintenance announcement, are in the process of adding more powerful hardware to our storage infrastructure.

All these learnings and measures give us complete confidence that freistilbox will continue to reliably store and deliver the many terabytes of content our customers put into our care. Thank you for your trust, and have a great weekend!

Sep 29, 16:26 UTC - Update
Here's an update on what we've done and learned so far.

Our first step was to make a change to the storage metadata server that makes sure the hourly metadata dump process runs on a different CPU core than the main process. This has reduced the impact of metadata persistence substantially. However, it is still a clear performance hit, so we're working on further load reduction measures.

One of these measures was to take Varnish logs off the storage system, as we mentioned before. This reduced the number of low-level write operations on the storage system by a significant amount, freeing up performance for essential website operation and further improvements.

Customer feedback led us to the discovery that by writing its Twig cache to the file area, Drupal 8 basically moves parts of its active codebase onto the shared network storage. This causes a huge drop in website performance as well as substantial load on the storage system. We're currently testing a Drupal configuration change that will move the Twig cache to local web box storage where it belongs.

That's it for now! We're happy with our progress and will be back shortly when we have another update for you. Thanks for bearing with us!

Sep 28, 14:05 UTC - Investigating
Some of our customers have contacted us about unusually high page loading times or even short periods of website downtime. We've been able to track some of these incidents back to a performance degradation in our shared network storage system.

The freistilbox network storage system is built on a distributed cluster of file storage nodes controlled by a central file metadata server. Since we found performance bottlenecks in both the metadata server and the file storage nodes, we've decided to make this an official incident that we're going to manage according to our established incident management process.

Since the start of our investigation yesterday morning, we've identified several service components that are a potential or confirmed cause of performance issues. We've determined mitigation options and assessed their risk of negatively impacting website performance and uptime. Finally, we've decided on a number of measures that we're going to implement first.

So far, none of these first-aid measures will require any action from your side. One of them is going to reduce write load, at least temporarily, by taking Varnish log files off the shared storage system.
We're still going to store them locally on the Varnish servers, so if you need us to provide you with full web request logs, please let us know via the freistilbox dashboard.

We are taking great care that the necessary changes will not negatively impact website operation. However, experience tells us that unexpected effects might prevent us from fully achieving this goal. That's why we ask you to bear with us while we're making sure that your websites will continue to be able to grow with your success.

We'll update this incident report as soon as we gather new insights. If you have any questions or would like to share feedback, please contact us at support@freistilbox.com!

Dashboard unavailable (September 2021)

Sep 17, 08:40 UTC - Resolved
This incident has been resolved.

Sep 17, 06:14 UTC - Identified
The issue has been identified and a fix is being implemented.

Sep 17, 06:09 UTC - Investigating
The freistilbox dashboard is currently unavailable. We are investigating this issue.

The operation of your web applications is not affected by this.

Limited availability (September 2021)

Sep 6, 12:52 UTC - Resolved
This incident has been resolved.

Sep 6, 10:57 UTC - Monitoring
We have noticed limited availability issues that affect a few customers. One of our edgerouters is unreachable. We have already switched operation to the standby server, which restored service.

Limited availability of database cluster caused downtimes (July 2021)

Jul 29, 05:30 UTC - Resolved
We experienced limited availability issues that affected a few customers.

A database cluster was overloaded, which resulted in a database query backlog that blocked web requests.

We are still investigating this incident and will post a more detailed update soon.

Fastly CDN issues (June 2021)

Jun 8, 12:22 UTC - Resolved
Update from Fastly: "A fix was applied at 10:36 UTC. Customers may continue to experience decreased cache hit ratio and increased origin load as global services return."

Jun 8, 10:24 UTC - Identified
Websites using Fastly are currently impacted by issues on the CDN side.
Please refer to the Fastly status page (https://status.fastly.com/) for details.

Limited availability (June 2021)

Jun 8, 08:40 UTC - Resolved
We experienced limited availability issues, caused by network problems, that affected a few customers for about 6 minutes.

The issues have been resolved.

Infrastructure upgrades on May 18th (2021)

May 18, 06:46 UTC - Completed
The scheduled maintenance has been completed.

May 18, 06:15 UTC - Update
Web operation has been fully migrated. We are continuing work on migrating the shell boxes now.

May 18, 06:00 UTC - In progress
Scheduled maintenance is currently in progress. We will provide updates as necessary.

May 11, 12:26 UTC - Scheduled
On Tuesday 2021-05-18 between 6 and 8 am UTC, we are going to switch the next batch of clusters to our upgraded server infrastructure, as announced in our recent blog post at https://www.freistil.it/infrastructure-wide-os-upgrades/.

In this batch, we're going to upgrade the following clusters:

- c67
- c68
- c69
- c95
- c97
- c102
- c109
- c118
- c124
- c125
- c127
- c129
- c136
- c184
- c185

During this maintenance window, we will deactivate all cron jobs on these clusters, then switch to the new infrastructure, and finally reactivate cron jobs. We don't expect this change to affect website uptime.

We will not transfer any contents of user home directories between the old and new shell boxes. For that reason, please back up, ahead of this maintenance, all files stored in user home directories on your cluster's shell box that you will need in the future. You can either download them or move them into the cluster's shared storage.

This change will not require any DNS changes for your websites.

However, please note that the IP addresses of the individual boxes (Varnish, memcached, web application and shell) making up your cluster are going to change. Should you rely on individual IP addresses to access these services, we urge you to instead use the configuration snippets we provide and keep updated for you.

The only change that you are going to notice for sure is the changed SSH host key for the new shell box. Your SSH client will display an error message similar to this:

-------✂--------✂--------✂--------✂---------✂--------✂--------✂--------

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the RSA key sent by the remote host is
e5:72:51:dc:9e:19:37:fb:26:2c:2f:8a:09:02:e5:e4.
Please contact your system administrator.
Add correct host key in /Users/username/.ssh/known_hosts to get rid of this message.
Offending RSA key in /Users/username/known_hosts:1
remove with: ssh-keygen -f "/Users/username/.ssh/known_hosts" -R c61s.freistilbox.net
Password authentication is disabled to avoid man-in-the-middle attacks.
Keyboard-interactive authentication is disabled to avoid man-in-the-middle attacks.

-------✂--------✂--------✂--------✂---------✂--------✂--------✂--------

Simply remove the conflicting old host key. In the terminal, you can do this via `ssh-keygen -R `. The new key will be stored once you approve it at your next login.

We will notify you of the completed maintenance under this maintenance announcement. Once we're done, please take a minute to make sure the changes did not affect your website operation.

----

If you have any questions or want to make special arrangements for your cluster upgrade, feel free to get in touch with us via the freistilbox dashboard. We're here to help!

----

Documentation links:
Shared storage: https://docs.freistilbox.com/how_it_works/filesystem/#shared-storage
Configuration snippets: https://docs.freistilbox.com/how_it_works/includes/