
Gravity Node Kernel Failure Post-Mortem

To our clients:

At 6:30 AM UTC on Tuesday, October 20, we experienced significant downtime on our shared hosting for all clients and visitors, totaling approximately 5 hours of site inaccessibility and more than 55 hours of control panel inaccessibility. We do not take the severity of this incident lightly. Below, we disclose a timeline of events and the changes we are making, in the hope of regaining your trust and building a better platform.

Approximately one hour prior to the outage, RAM usage on the Gravity node increased exponentially, effectively rendering the server unresponsive. Per our protocols, the server was rebooted to restore access, but despite numerous attempts, memory usage climbed back to unsustainable levels as soon as all system services had started, and usable control of the server could not be regained. Attempts to boot into rescue mode also failed due to a poor implementation on the datacenter’s side. While we continued to investigate the issue, we followed our extended-downtime plan and temporarily placed all users on a separate failover server built from our most recent offsite backup. This brought most sites back online.

As mounting a separate disk image to access the main filesystem had failed, we then attempted to physically mount and boot from a new hard drive. This required formatting the newly mounted drive in order to install a temporary operating system. Despite multiple assurances from our datacenter that the default boot device had been properly switched to the newly mounted drive, this proved incorrect: once the temporary OS was installed, it became clear that the main RAID array had been overwritten instead. Post-incident audits conducted by our datacenter showed that onsite technicians had failed to properly reconfigure our RAID controller while swapping the disks. Ultimately, despite multiple recovery attempts by onsite technicians at our datacenter, our internal staff, and contracted third-party recovery technicians, the original data could not be restored.

Unfortunately, since the original data had been corrupted and wiped, we proceeded with restoring panel access and the OS from backups. This process encountered significant delays as well. Furthermore, upon restoring the offsite backups, we discovered that two customer accounts had not been backed up properly offsite. We have already reached out to these two clients to make amends.

We conducted extensive post-incident audits on our side, and we are outlining here the mistakes we made and the changes we aim to make going forward.

  1. It’s clear that our monitoring stack requires major improvement. Our current system monitors essential services and only alerts on unexpected latency or unreachability; it fails to account for abnormal resource usage, so we are only notified after a disaster has occurred, which, as in this case, is too late.
    1. We’re rewriting our monitoring agent to analyze resource usage, disk usage, and network interfaces, and to ingest kernel and server logs in real time, so that issues are detected and runaway memory usage is stopped before it can take a server down, and the proper fixes can be made so similar scenarios don’t occur in the first place. A rough sketch of the kind of check involved follows after this list.
  2. It’s clear that our backup policy also requires improvement. Our current policy only backs up to offsite locations weekly. While this protects against total data loss, up to a week of site changes can still be lost in a total server failure. Furthermore, the backups for two accounts were not intact, resulting in partial data loss for those users.
    1. We’re reworking our offsite backup plan to increase coverage: every server on our network, on every plan, will back up to offsite storage once every 24 hours, with backup and account integrity verified, and a monthly integrity check performed through a full restoration. This ensures that any potential corruption can be addressed in a timely manner and that at most 24 hours of data can be lost. We are also looking into a real-time file synchronization solution to an external server for additional redundancy. A sketch of the integrity verification step appears after this list.
  3. Our public communication also requires significant improvement. There were long delays between updates, and the updates had limited reach, resulting in numerous queries about the downtime.
    1. We will be re-examining the methods we use to alert users of outages, especially major ones; our current combination of Twitter and status page updates clearly does not work. Furthermore, we will be looking into adding an EU-based support team to expand our general support availability hours.
  4. We don’t believe in shifting blame to others; this incident is purely our fault. However, we were extremely disappointed with our datacenter provider’s level of communication and care in resolving the issue. Our internal estimates peg communication with our upstream provider as the single largest source of delay during this incident.
    1. As such, we have begun looking for new providers and will be conducting thorough testing and research to find and migrate to a reliable cloud network capable of supporting our infrastructure and helping us provide fast, reliable, and secure hosting.
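
To illustrate the direction of the monitoring change in point 1, here is a minimal sketch of a resource-usage check that alerts on high or rapidly growing RAM usage instead of waiting for a server to become unreachable. The thresholds, interval, and send_alert() stub are illustrative assumptions, not our actual agent.

    # Minimal sketch of a RAM-usage check; thresholds and send_alert() are
    # illustrative placeholders, not the production CynderHost agent.
    import time
    import psutil  # cross-platform system metrics library

    MEM_ALERT_PERCENT = 90      # alert once RAM usage crosses this level
    GROWTH_ALERT_POINTS = 15    # alert if usage jumps this much between samples
    INTERVAL_SECONDS = 60

    def send_alert(message: str) -> None:
        # Placeholder: a real agent would page on-call staff or push to a dashboard.
        print(f"[ALERT] {message}")

    def main() -> None:
        previous = psutil.virtual_memory().percent
        while True:
            time.sleep(INTERVAL_SECONDS)
            current = psutil.virtual_memory().percent
            if current >= MEM_ALERT_PERCENT:
                send_alert(f"RAM usage at {current:.1f}% (threshold {MEM_ALERT_PERCENT}%)")
            elif current - previous >= GROWTH_ALERT_POINTS:
                send_alert(f"RAM usage grew {current - previous:.1f} points in {INTERVAL_SECONDS}s")
            previous = current

    if __name__ == "__main__":
        main()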
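
Similarly, for the backup changes in point 2, the sketch below shows one way per-account backup integrity could be verified against a SHA-256 manifest before a test restoration. The paths and manifest format are assumptions for illustration only.

    # Sketch of backup integrity verification against an assumed SHA-256 manifest
    # (one "hash  archive-name" pair per line); paths are hypothetical.
    import hashlib
    from pathlib import Path

    def sha256_of(path: Path) -> str:
        # Hash the archive in 1 MB chunks to avoid loading it all into memory.
        digest = hashlib.sha256()
        with path.open("rb") as handle:
            for chunk in iter(lambda: handle.read(1024 * 1024), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def verify_backups(backup_dir: Path, manifest: Path) -> list[str]:
        # Return the archives that are missing or no longer match the manifest.
        failures = []
        for line in manifest.read_text().splitlines():
            expected_hash, name = line.split(maxsplit=1)
            archive = backup_dir / name
            if not archive.exists() or sha256_of(archive) != expected_hash:
                failures.append(name)
        return failures

    if __name__ == "__main__":
        bad = verify_backups(Path("/offsite/backups"), Path("/offsite/backups/MANIFEST.sha256"))
        for name in bad:
            print(f"Integrity check failed: {name}")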

Beyond this, we will also be implementing many internal changes to our communication, protocols, and documentation, along with periodic audits, training, and simulations to improve the execution of our disaster recovery and extended-downtime plans.

Above all else, we value transparency, honesty, and the never-ending path of improvement.

We take full responsibility for everything that has happened, and we’ve proactively refunded all impacted clients 200% of their monthly fee – please reach out if this has not appeared in your account. Our reseller customers have been contacted regarding additional compensation. Through in-depth testing, research, development, audits, and disclosure, we’re ensuring outages of this scale are not encountered again.

As always, please reach out to our team with any queries.

The CynderHost Team
