Incident Overview
On February 25, 2025, at 9:23 AM PST, we detected a significant disruption across our high-performance network. Approximately 40% of traffic began returning server timeouts and HTTP 502/503 errors. Many users of our high-performance platform, including our billing portal, experienced degraded service and found it difficult or impossible to access their data and services.
Timeline of Events
- 9:23 AM PST: Monitoring alerted us to widespread timeouts and errors.
- 9:25 AM PST: Our engineers began investigating immediately.
- 9:35 AM PST: Initial investigation pointed to problems with our storage cluster and filesystem. We traced the issues to a network stability improvement deployed the previous day, which unintentionally affected the filesystem and led to capacity and I/O contention.
- 9:45 AM PST: We deployed initial fixes, which improved stability somewhat but did not fully resolve the issue; errors continued intermittently.
- 10:00 AM PST: We decided to take the primary storage machine offline completely to investigate and address the root cause.
- 11:30 AM PST: We isolated the exact issue and began rolling out comprehensive fixes.
- 12:15 PM PST: All systems were fully restored and confirmed stable. All traffic returned to normal error rates.
Communication Failures
We fell significantly short in how we communicated during this incident. Our primary client portal, hosted on the affected infrastructure, went offline, leaving us without a direct way to inform clients. This caused unnecessary confusion and frustration among users who rightfully expected prompt and clear communication. We should have foreseen this risk and had a reliable backup plan in place, but unfortunately, we didn’t.
Additionally, our status page, which exists specifically for incidents like this, was affected by internal caching and deployment issues. Our attempts to publish updates failed silently due to configuration problems, leaving status updates outdated or entirely inaccessible. While we created incidents internally and updated their status, we did not adequately validate that those changes were actually published and visible to all users globally.
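To illustrate the kind of end-to-end check that was missing, here is a minimal sketch that polls the public status page from outside our own infrastructure and escalates if a freshly published update never becomes visible. The URL, incident ID, and thresholds are placeholders, not our real endpoints or configuration.

```python
# Hypothetical sketch: confirm a published status update is actually visible
# on the public status page, polled from outside our own infrastructure.
import sys
import time
import urllib.request

STATUS_PAGE_URL = "https://status.example.com/api/incidents/latest"  # placeholder
EXPECTED_INCIDENT_ID = "INC-2025-02-25-001"                           # placeholder
TIMEOUT_SECONDS = 300
POLL_INTERVAL = 30


def update_is_visible() -> bool:
    """Fetch the public status feed and check that the incident appears in it."""
    try:
        with urllib.request.urlopen(STATUS_PAGE_URL, timeout=10) as resp:
            body = resp.read().decode("utf-8", errors="replace")
        return EXPECTED_INCIDENT_ID in body
    except OSError:
        # Network errors count as "not visible" -- the whole point is to
        # catch silent publishing failures.
        return False


deadline = time.time() + TIMEOUT_SECONDS
while time.time() < deadline:
    if update_is_visible():
        print("Status update confirmed visible to the public.")
        sys.exit(0)
    time.sleep(POLL_INTERVAL)

print("ALERT: status update not visible after publish -- escalate manually.")
sys.exit(1)
```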
In hindsight, our recent decision to phase out our social media presence, particularly Twitter/X, was also a mistake. It was not part of our current incident response plans, so we posted no public updates there during the incident.
We know we let clients down here. Transparent and timely communication is crucial during incidents, and we didn’t deliver on that promise. We know this significantly impacted client trust and satisfaction. We’re very sorry, and we’re working hard to fix these mistakes and build more robust plans going forward.
Moving Forward
- Immediate Fixes: We resolved problematic configurations and fully restored the storage cluster to stable performance.
- Better Monitoring: We have enhanced our monitoring to catch issues like this earlier, before they escalate into catastrophic failures.
- Self-Recovery Tools: We have deployed tooling that automatically recovers affected services to minimize downtime (a minimal sketch of this kind of watchdog appears after this list).
- Redundant Communications: We’re separating our main communication systems from our primary infrastructure to avoid similar failures in the future.
- Social Media: We are bringing back social media channels such as Twitter/X and integrating them into our incident response strategy.
- Status Page Improvements: We are fixing the internal issues that made status page updates unreliable. We will test these systems extensively, make them redundant, and assign a dedicated point person during incidents to ensure updates are communicated promptly and accurately.
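As referenced above, the sketch below shows the general shape of the self-recovery tooling we mean: a watchdog that probes a service’s health endpoint and restarts it after repeated failures. The endpoint, service unit name, and thresholds are illustrative placeholders, not our actual configuration.

```python
# Hypothetical sketch of a self-recovery watchdog: probe a health endpoint
# and restart the corresponding systemd unit after repeated failures.
import subprocess
import time
import urllib.request

HEALTH_URL = "http://localhost:8080/healthz"   # placeholder endpoint
SERVICE_UNIT = "storage-gateway.service"       # placeholder unit name
FAILURE_THRESHOLD = 3
CHECK_INTERVAL = 15


def healthy() -> bool:
    """Return True if the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False


failures = 0
while True:
    if healthy():
        failures = 0
    else:
        failures += 1
        if failures >= FAILURE_THRESHOLD:
            # Restart the unit, then reset the counter and keep probing.
            subprocess.run(["systemctl", "restart", SERVICE_UNIT], check=False)
            failures = 0
    time.sleep(CHECK_INTERVAL)
```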
Support
We also recognize that our support response times and quality have recently fallen short of the high standards we hold ourselves to. Our load has grown significantly over the past few months; we have been actively onboarding new engineers to handle support requests and will soon have additional capacity that improves the experience for everyone.
We’re genuinely sorry for the trouble caused by this outage and even more so for the poor communication. It’s not acceptable, and we’re committed to improving significantly from this experience.
All clients are eligible for a full refund of their most recent month’s invoice. Please reach out to us and let us know how we can make things right.