Timeline of Outage (Times are in PST)
10:28 PM – Logs show the server halted.
10:29 PM – We were notified and began diagnosing the issue.
11:10 PM – We temporarily moved sites to a failover instance, bringing them back online.
11:39 PM – The primary instance was fully restored and operational.
11:42 PM – All sites were transferred back to the primary instance.
11:45 PM – The incident was marked as “Recovered.”
Total Outage Duration: 42 minutes
Total Incident Duration: 77 minutes
Initial Incident Response
Upon notification, we found the server completely unresponsive. No early-warning metrics had been triggered; at 10:28 PM, the server simply ceased all activity. The lack of any diagnostic signal impeded our initial response. We attempted to reboot the system, but because it was unresponsive, a clean reboot was not possible and we had to force a reboot.
On boot, XFS attempted to replay the filesystem journal to return the filesystem to a clean state. This replay hung, so our response split into two parallel priorities:
- Bringing sites back online, even temporarily.
- Restoring access to the root volume and returning the primary server to full functionality.
Because we already had a hot instance on standby, we were able to quickly restore the data to a live state on that instance and update our backend to point to it. This brought all sites back online and ended the period of unavailability.
Restoring the primary volume required significant debugging, but we isolated the cause to a storage-layer problem that blocked I/O, preventing volume writes, mounts, and repair operations from completing.
We rectified this issue, and by 11:39 PM the server was fully functional. Based on our investigation, this was the same underlying issue that caused the initial outage.
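For illustration only, below is a minimal sketch of the kind of timed, read-only probe that helps distinguish a wedged storage device from a slow-but-healthy journal replay. The device path, timeout, and commands are hypothetical and simplified, not our exact runbook.

```python
# Illustrative sketch only – hypothetical device path and timeout, generic Linux tooling.
import subprocess

DEVICE = "/dev/xvdf"   # hypothetical root-volume device
TIMEOUT_S = 30         # how long a read may block before we treat the device as hung

def probe(cmd):
    """Run a command and classify the outcome as ok / hung / error."""
    try:
        subprocess.run(cmd, check=True, capture_output=True, timeout=TIMEOUT_S)
        return "ok"
    except subprocess.TimeoutExpired:
        return "hung"    # I/O is blocking: points at the storage layer, not XFS
    except subprocess.CalledProcessError:
        return "error"   # the device answered but the command reported a failure

# 1. A direct (uncached) raw read: if this hangs, the block device itself is stalled.
raw = probe(["dd", f"if={DEVICE}", "of=/dev/null", "bs=4096", "count=1", "iflag=direct"])

# 2. xfs_repair -n is a no-modify consistency check; it can only run if I/O is working.
fs = probe(["xfs_repair", "-n", DEVICE]) if raw == "ok" else "skipped"

print(f"raw read: {raw}, xfs check: {fs}")
```

That distinction matches what we found above: the I/O block sat below the filesystem, which is why mounts and repair operations could not complete.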
Post Incident Audit
We focus our post-incident audits on the following:
- Prevention
- Detection
- Remediation
- Communication
Prevention: We’re working with our vendors to ensure this and similar incidents do not happen again, and we’ll share further updates as that work progresses.
Detection: The sudden, spontaneous nature of this outage left little room for proactive, pre-outage monitoring. However, we are developing protocols to diagnose such failures faster and streamline recovery.
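As one illustration of what such a protocol could include, here is a simple sketch of an external liveness probe that alerts after several consecutive missed health checks. The endpoint, interval, threshold, and alert hook are hypothetical placeholders, not a description of our production monitoring.

```python
# Illustrative sketch only – hypothetical endpoint, interval, and alerting hook.
import time
import urllib.request

HEALTH_URL = "https://example-host.invalid/health"  # hypothetical health endpoint
INTERVAL_S = 15                                     # poll every 15 seconds
MISS_THRESHOLD = 3                                  # alert after ~45 seconds of silence

def is_up(url):
    """Return True if the health endpoint answers with HTTP 200 within 5 seconds."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except Exception:
        return False

misses = 0
while True:
    misses = 0 if is_up(HEALTH_URL) else misses + 1
    if misses >= MISS_THRESHOLD:
        # Stand-in for a real pager/alerting integration.
        print("ALERT: host unresponsive, paging on-call")
        misses = 0
    time.sleep(INTERVAL_S)
```

Because a probe like this runs outside the affected host, it can still raise an alert when the host itself stops emitting metrics, as happened here.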
Remediation: The presence of a standby instance for failover was instrumental in mitigating the impact of this outage, and that is a policy we’ll be continuing. Based on our review, resolution could have been faster with better coordination and training, so we’ll be improving disaster-recovery (DR) training on the new platform to ensure a smoother response.
Communication: We understand there is significant room for improvement here. This outage took down our client area and billing system, preventing us from communicating with clients in an orderly manner and delaying our email notifications about the outage. We’ll be looking into high-availability solutions for our billing area to ensure communication channels remain open during future outages.
Compensation
Because of the exceptional nature of this outage, we’re providing 20% of your monthly payment as compensation. This will be automatically credited to your account in the coming days. This is the only incident of this kind we’ve had in the past year, and we don’t take its severity or its impact on your business lightly.
We’d like to apologize once again for this incident. We fully understand the trust you place in us to host your business and mission-critical websites. If there’s anything at all we can do, please reach out to us.
Sincerely,
The CynderHost Team