8/30 Gravity Shared Hosting Partial Network Outage Post Mortem

On 8/30 at 7:18 AM PST, a secondary network transit provider (LS Networks) that handles certain routes and transit for our data center experienced network congestion that caused significant packet loss and service disruptions for users. The issue was fully resolved at 11:05 AM PST.

We utilize a diverse range of fiber connections and network hardware for transit, and as a result only paths to certain peers and ISPs were impacted. Unfortunately, one of the affected routes was to Cloudflare, which a significant portion of our users rely on.

As a result, during this window Cloudflare was unable to reach the origin for any site proxied through it, so those sites returned 5XX errors or timed out. Connectivity to other ISPs varied, with some routes seeing packet loss in the 10-20% range while others were completely unable to connect.
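
To illustrate what this looked like in practice, the sketch below contrasts a request made through Cloudflare with one made directly to the origin. The hostname example.com and the address 203.0.113.10 are placeholders, not real values, and Python is used purely for illustration.

    # Sketch: distinguish "Cloudflare cannot reach the origin" from "the origin is down".
    import http.client

    def status_via_cloudflare(host):
        # Request the site through its normal, Cloudflare-proxied hostname.
        conn = http.client.HTTPSConnection(host, timeout=10)
        conn.request("GET", "/", headers={"Host": host})
        return conn.getresponse().status

    def status_direct_to_origin(origin_ip, host):
        # Bypass Cloudflare and talk to the origin server directly.
        conn = http.client.HTTPConnection(origin_ip, 80, timeout=10)
        conn.request("GET", "/", headers={"Host": host})
        return conn.getresponse().status

    edge = status_via_cloudflare("example.com")
    origin = status_direct_to_origin("203.0.113.10", "example.com")
    # During the incident the edge check would return a 5XX (e.g. 522/523),
    # while the direct check depended on whether your own route was affected.
    print("via Cloudflare:", edge, "| direct to origin:", origin)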

During this time, users may have experienced degraded performance as a result of the packet loss, and the impact varied heavily by route. For example, routes to AWS EU dropped off completely, while connections from Washington, D.C. saw little change in latency.
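
For a rough sense of how that kind of partial degradation can be measured, the sketch below estimates packet loss to a few endpoints with the system ping utility. The target hostnames are placeholders, and a Unix-style ping (with a "-c" count flag and a "% packet loss" summary line) is assumed.

    # Sketch: estimate packet loss to a few endpoints using the system ping.
    import re
    import subprocess

    TARGETS = ["origin.example.com", "eu.example.org", "us-east.example.net"]

    def loss_percent(host, count=20):
        out = subprocess.run(["ping", "-c", str(count), host],
                             capture_output=True, text=True).stdout
        match = re.search(r"(\d+(?:\.\d+)?)% packet loss", out)
        return float(match.group(1)) if match else 100.0

    for host in TARGETS:
        print(f"{host}: {loss_percent(host)}% loss")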

DNS continued to function as normal thanks to our geographically diverse network of Anycast DNS servers hosted across multiple networks. Subdomains not hosted on or pointed at our shared server continued to function, as did email sending/receiving for those using third-party providers.
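
To illustrate the distinction, the sketch below shows that name resolution (answered by the Anycast DNS network) can still succeed even while a TCP connection to the web server fails; example.com is a placeholder hostname.

    # Sketch: DNS resolution is independent of the web server being reachable.
    import socket

    host = "example.com"

    # Resolving the name only requires the (Anycast) DNS servers.
    addresses = sorted({info[4][0] for info in socket.getaddrinfo(host, 443)})
    print("resolves to:", addresses)

    # The TCP connection is what actually depends on routes to the origin.
    try:
        socket.create_connection((host, 443), timeout=10).close()
        print("TCP to port 443: ok")
    except OSError as exc:
        print("TCP to port 443 failed:", exc)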

Email for those who send/receive mail through our server may or may not have been impacted, depending on the network paths taken by the sending MTAs. For those who were affected, no email was lost: MTAs retry delivery for up to 72 hours before giving up.
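
As a simplified illustration of that retry behaviour (real MTAs such as Postfix or Exim follow their own schedules, and mx.example.com is a placeholder), a sending mail server's delivery loop looks roughly like this:

    # Simplified sketch of MTA-style retries: on a temporary failure the
    # message stays queued and delivery is retried until a ~72-hour queue
    # lifetime expires, rather than the mail being lost immediately.
    import smtplib
    import time

    MX_HOST = "mx.example.com"        # placeholder receiving server
    RETRY_INTERVAL = 15 * 60          # retry every 15 minutes (simplified)
    QUEUE_LIFETIME = 72 * 60 * 60     # give up (bounce) after 72 hours

    def try_deliver(sender, recipient, message):
        deadline = time.time() + QUEUE_LIFETIME
        while time.time() < deadline:
            try:
                with smtplib.SMTP(MX_HOST, 25, timeout=30) as smtp:
                    smtp.sendmail(sender, recipient, message)
                    return True               # delivered
            except (smtplib.SMTPException, OSError):
                time.sleep(RETRY_INTERVAL)    # still queued; try again later
        return False                          # bounced after queue lifetime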

The network congestion stemmed from a malicious actor hosted on an upstream network sending spoofed packets outbound, which caused the physical switches and routers along certain routes to malfunction. Our upstream is working with Cisco, which provides the redundant network gear for the data center, to isolate the cause of the issue and ensure this does not happen again.

Unfortunately, the novel nature of the attack made the root cause extremely hard to track down and isolate, which is why the affected routes saw prolonged downtime.

Ensuring that this and similar issues do not happen again is our number one priority. We understand the impact outages have on your business and apologize once again for any trouble or loss of traffic/revenue caused.

Rest assured, we will continue to investigate and audit what can be improved as part of our incident response. While status updates were posted on our status page (https://status.cynderhost.com), some users were not aware of it, so we will be working on making these updates more visible.

We’ve historically maintained extremely high uptime for our hosting services, at ~99.998% since the start of 2021. We will take every step necessary to maintain this standard of reliability.

To reflect the severity of this situation, clients who use Cloudflare and experienced a full outage will be compensated at roughly 6x our normal 8x compensation SLA, which works out to 25% of their monthly payment.
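
For a rough sense of how that figure works out, assuming the 7:18 AM to 11:05 AM outage window and a 30-day billing month:

    # Rough arithmetic behind the 25% figure (30-day month assumed).
    outage_minutes = (11 * 60 + 5) - (7 * 60 + 18)     # 227 minutes
    month_minutes = 30 * 24 * 60                        # 43,200 minutes

    downtime_fraction = outage_minutes / month_minutes  # ~0.53% of the month
    normal_sla_credit = 8 * downtime_fraction           # normal 8x SLA: ~4.2%
    enhanced_credit = 6 * normal_sla_credit             # ~6x that: ~25.2%

    print(f"{enhanced_credit:.1%} of the monthly payment")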

All users are eligible to claim a 10% credit. Please open a billing ticket in our client area for more information.

If you are a reseller, please contact us to arrange separate compensation for any end-users impacted. 
