12/7 Outage Post Mortem

On 12/7, we experienced a series of outages that caused extended downtime and unavailability for a large number of clients. The timeline below describes what happened.

8:24 – 8:49 AM (PST): A 25-minute outage occurred, during which incoming connections were rejected and users saw a variety of 5XX errors.

8:36 AM: Our team isolated the issue to an error in the underlying filesystem, leading to network connection failures. A reboot was performed to allow an automated filesystem integrity check. The server was fully functioning by 8:49 AM.

9:12 – 9:28 AM: We start to observe 403 errors being served to clients on the affected server. We identify this as a recurrence of the initial filesystem issue, which has caused the filesystem to enter read-only mode. A reboot and a manual check of the XFS filesystem are performed, and the server is fully functional by 9:28 AM. We begin investigating the root cause of the corruption to prevent it from happening again.

10:27 AM: We identify the root cause of the corruption as a software bug and mitigate it to prevent further corruption. However, we also determine that the existing corruption is too erratic to repair without significant impact on availability.

10:59 AM: To permanently prevent a recurrence, we decide to perform a platform migration.

12:39 – 12:57 PM: The issue recurs because of the existing corruption. The same symptoms (403 errors) are seen and mitigated with an offline repair.

1:00 PM: We start the process of planning the platform migration, working with upstream to coordinate a seamless transition.

3:30 PM: Planning and preparation for the platform migration are finalized, and we begin to coordinate with our supply chain to source a replacement backend server.

4:33 – 4:48 PM: The issue recurs because of the existing corruption. The same symptoms (403 errors) are seen and mitigated with an offline repair.

6:30 PM: We start to initialize and set up the replacement backend server.

— No further filesystem incidents were logged —
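For reference, the four filesystem-related windows listed above for 12/7 can be added up with a short script. This is only a sketch: it assumes each listed window counts in full as impacted time, which the timeline does not state explicitly (during the later windows, clients were served 403 errors rather than rejected connections). The names `WINDOWS` and `minutes_between` are illustrative.

```python
from datetime import datetime

# Impact windows on 12/7 as listed in the timeline above (all times PST).
# Assumption: each window counts in full as impacted time.
WINDOWS = [
    ("8:24 AM", "8:49 AM"),
    ("9:12 AM", "9:28 AM"),
    ("12:39 PM", "12:57 PM"),
    ("4:33 PM", "4:48 PM"),
]


def minutes_between(start: str, end: str) -> int:
    """Return the whole minutes between two clock times on the same day."""
    fmt = "%I:%M %p"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


durations = [minutes_between(start, end) for start, end in WINDOWS]
total_minutes = sum(durations)
print(durations, total_minutes)
```

Under that assumption, the windows come to 25, 16, 18, and 15 minutes, or 74 minutes in total on 12/7.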

(12/8) 9:00 AM: The backend server is initialized and ready for migration.

9:10 AM: Approval is given for the migration to proceed.

10:24 AM: The initial data migration starts, and all accounts and data are copied over.

2:00 PM: All data is confirmed to be copied and sites are working properly.

3:00 PM: A target network migration window of 12 PM PST on 12/9 is set.

(12/9) 11:55 AM: We begin an incremental data synchronization of all account files and databases.

12:40 – 1:20 PM: As sites are migrated, roughly 30% of site requests are temporarily served a “Suspended” account response to prevent change conflicts. For each impacted site, this lasts about 10 minutes.

1:55 PM: The final data synchronization is finished and the network transfer is approved.

2:06 – 2:11 PM: A brief network outage occurs on each transferred IP address, lasting ~2 minutes each, as traffic is re-routed to the new backend.

3:00 PM: Migration is declared complete and system integrity is verified.

9:00 PM: Any remaining bugs or inconsistencies from the migration are resolved and systems are fully functional. The software that originally caused the corruption has been disabled, and we’ve also made changes to make our filesystem more resilient to similar issues.

Beyond these recent outages, we also recognize that our overall stability and performance have not been up to par in recent months, especially in the context of the previous network attack issues. We’ve taken this opportunity to make several infrastructure upgrades, including significantly more powerful compute and storage resources, which should provide a more stable and performant system moving forward.

Additionally, we are actively communicating with our network providers to push for concrete resolutions and improvements to any potential issues with respect to network stability and performance.

These changes should ensure that, moving forward, issues like these do not become commonplace. We sincerely apologize to the clients impacted by this outage and recognize the serious impact on you and your businesses. Thank you for your continued support. Rest assured that system stability is our number one priority.

All impacted clients are eligible for a credit of 75-100% of this month’s cost (depending on the scope of impact). Simply open a ticket with our billing department and we’ll have this processed for you.

Additionally, we’re releasing Imunify360 free of charge to all clients as a token of appreciation for your continued patience. Imunify360 offers automated inspection and blocking of malicious requests, brute force protection, real-time scanning of uploads, and automatic halting of malicious code execution. It has already been activated for everyone, so no changes are needed on your side.

Finally, as a note of caution, PHP 7.4 is End-of-Life as of 11/30. This means the maintainers of PHP no longer support PHP 7.4 in any way, including security updates. Any security vulnerabilities found and disclosed in PHP 7.4 will not be patched, which could allow malicious actors to compromise your site.

Moving forward, PHP 8.1 will be the default PHP version for all sites. Note that the change from PHP 7.4 to PHP 8.X is significant and brings many breaking changes. WordPress sites and other applications that have not been updated in a long time will likely contain code that does not support PHP 8, which can result in site errors and other issues.

We strongly recommend upgrading the plugins, themes, and codebase of all your sites to support the latest PHP version to ensure uninterrupted and secure service.
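As a starting point for that review, the sketch below shows one rough way to look for a few functions that were removed in PHP 8.0 (`create_function`, `each`, `money_format`, `get_magic_quotes_gpc`). It is a text-matching heuristic written in Python, not a full compatibility check: the list is far from exhaustive, it can flag matches inside comments, strings, or method definitions, and the helper names `find_removed_calls` and `scan_tree` are our own illustrations. A dedicated tool such as the PHPCompatibility ruleset for PHP_CodeSniffer will be far more thorough.

```python
import re
from pathlib import Path

# A few functions removed in PHP 8.0 (deliberately not an exhaustive list).
REMOVED_IN_PHP8 = ("create_function", "each", "money_format", "get_magic_quotes_gpc")

# Match a bare call such as each(...), but skip $obj->each(...), Foo::each(...),
# and longer identifiers such as foreach(...).
CALL_RE = re.compile(
    r"(?<![\w$>:])(" + "|".join(REMOVED_IN_PHP8) + r")\s*\(", re.IGNORECASE
)


def find_removed_calls(source: str) -> list[tuple[int, str]]:
    """Return (line_number, function_name) pairs for suspicious calls."""
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        for match in CALL_RE.finditer(line):
            hits.append((number, match.group(1).lower()))
    return hits


def scan_tree(root: str) -> dict[str, list[tuple[int, str]]]:
    """Scan every .php file under root and collect the hits per file."""
    report = {}
    for path in Path(root).rglob("*.php"):
        hits = find_removed_calls(path.read_text(errors="ignore"))
        if hits:
            report[str(path)] = hits
    return report
```

Calling `scan_tree("public_html")` (the directory name here is only an example) returns a mapping from file path to the lines worth reviewing by hand. An empty result does not mean a site is PHP 8 compatible, since many breaking changes are behavioral rather than removed functions.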

As always, please let our team know if you need help with anything.
