Post Mortem (9/22): Mitigating the Unknown

On 9/22, we experienced a series of outages and periods of unexpected latency. Below, we’ve outlined a timeline of events (all times are Pacific Time):

4:34 AM: One of our hosting servers, gravity (gravity.cynderhost.com), experienced a large volumetric TCP DDoS attack, overwhelming automated defenses and resulting in network congestion

4:34 AM: Engineers are paged and start diagnosing the issue

4:35 AM: The cause of the issue (DDoS) is identified and our team works with upstreams to implement manual mitigation rules

4:36 AM: Service is restored

4:50 AM: Incident is declared resolved

7:36 AM: DDoS attack recurs with a different profile

7:37 AM: Engineers are notified of increased latency and sporadic outages, and the incident is reopened

7:38 AM: Cause (DDoS) is identified, and similar mitigation rules are put in place

7:38 AM to 8:43 AM: Availability continues to fluctuate, though at a dampened level, suggesting the initial mitigations were not sufficient. A variety of other mitigation techniques are tried while the networking team simultaneously investigates the DDoS profile. During this period, roughly 20% of connections fail.

8:43 AM: Forensic analysis of network traffic shows a blend of [REDACTED, pending resolution] traffic, suggesting a high degree of attack sophistication.

8:45 AM: Preliminary mitigation steps are drafted and implemented while additional mitigation pathways are investigated

8:46 AM to 10:31 AM: Connections mostly stabilize while attacks persist. Roughly 8% of connections fail, primarily those from clients making repeated requests (bots, crawlers, uptime monitors)

10:31 AM to 10:55 AM: A series of four brief, ~1-minute outages occurs as the attack traffic adapts, and more persistent traffic rules are implemented.

11:00 AM: Traffic stabilizes, availability and latency return to normal as attacks halt.

1:31 PM: Attack recurs with a slightly modified profile meant to circumvent existing rules, leading to a ~1-minute outage.

1:46 PM: Long-term mitigation plans are drafted after multiple all-hands discussions

2:01 PM to 3:24 PM: Attacks recur with greatly increased volume and sophistication, peaking at 3:23 PM at 5x the initial (2:01 PM) attack volume, as the implemented rules, gateways, and network paths struggle to contain and mitigate the traffic. ~24% of traffic experiences connection errors, and ~75% of connections experience significant latency (defined as RTT > 1000ms).

2:01 PM to 2:24 PM: Work on expanding filtering resources is completed to mitigate increased volume

2:04 PM to 3:24 PM: Using data from previous attacks, engineers identify signatures of the attack traffic and implement broad filters to drop malicious packets.

3:26 PM: The network stabilizes as attacks continue. Approximately 2% of valid connections are falsely dropped by the mitigation profile.

3:30 PM: The attack ends and all performance metrics return to pre-attack levels. Mitigation rules are kept in place, and <1% of traffic experiences failed connections.

4:27 PM to 4:31 PM: Attacks return with a similar profile and peak at roughly 50% of the highest volume experienced. Malicious traffic is dropped with minimal interruption and the network remains stable.

4:33 PM to 4:40 PM: In a push to overwhelm our network filtering systems, attack volume is increased to 1.5x the previous peak and sustained for ~6 minutes. Mitigation proceeds normally and the network remains stable; <1% of traffic experiences failed connections.

4:42 PM to 6:00 PM: Attacks occur sporadically, but with significantly lower volume. Primary mitigation rules are loosened while signature targets remain in place. No interruptions or latency changes are experienced during this time.

6:00 PM onward: Attacks halt.

10:15 PM: Emergency incident is declared resolved.

Total incident time: 17 hours and 41 minutes

Total time with abnormal network performance (failed connections >1% or 50th percentile latency >1000ms): 4 hours and 28 minutes

Total time with severe network impairment (failed connections >20%): 2 hours and 22 minutes

Total outage time (failed connections > 50%): 44 minutes

We’re continuing to explore avenues to ensure that this type of attack, along with other possible resource exhaustion mechanisms, can never again cause such extended impact.

The challenge with DDoS attacks lies in their variability and sophistication. As we saw here, attackers are able to draw on an extremely diverse range of protocols and attack types, combining them simultaneously to hamper mitigation efforts. These attacks actively adapt and change, making them very difficult to predict.

Mitigation at the server level is futile. An individual server is bound by its own compute and throughput limits, so past a certain point resource exhaustion is easily achieved by an attacker and impossible to mitigate locally.
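
To make that constraint concrete, here’s a rough back-of-the-envelope sketch in Python. The link capacity and attack volumes below are hypothetical placeholders, not measurements from this incident; the point is simply that once the offered load exceeds a single server’s uplink, no amount of on-server filtering can protect legitimate traffic.

    # Hypothetical numbers illustrating why volumetric DDoS attacks cannot be
    # absorbed at the individual server level. None of these figures come from
    # the incident above; they only show the arithmetic of link saturation.

    UPLINK_GBPS = 1.0          # assumed capacity of a single server's network link
    LEGIT_TRAFFIC_GBPS = 0.2   # assumed normal traffic load
    ATTACK_GBPS = 40.0         # assumed volumetric attack size

    total_offered = LEGIT_TRAFFIC_GBPS + ATTACK_GBPS

    # Even if the server could identify and drop every malicious packet for free,
    # the packets still have to arrive over the 1 Gbps link before they can be
    # dropped. Once the offered load exceeds link capacity, the upstream network
    # queues and then discards traffic indiscriminately.
    if total_offered > UPLINK_GBPS:
        delivered_fraction = UPLINK_GBPS / total_offered
        legit_delivered_mbps = LEGIT_TRAFFIC_GBPS * delivered_fraction * 1000
        print(f"Offered load: {total_offered:.1f} Gbps on a {UPLINK_GBPS:.1f} Gbps link")
        print(f"Only ~{delivered_fraction:.1%} of packets fit on the wire at all,")
        print(f"so legitimate traffic falls to ~{legit_delivered_mbps:.0f} Mbps "
              f"no matter how well the server itself filters.")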

Addressing attacks like these must therefore happen higher up in the processing pipeline, before traffic can reach the target. We’re currently exploring and testing a variety of long-term, advanced mitigation solutions that implement DPI (deep packet inspection) in conjunction with live intelligence rules. These primarily consist of either hardware-based, dedicated filtering appliances or cloud-based tunnels, each of which comes with its own tradeoffs.
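
For readers curious what "DPI in conjunction with live intelligence rules" means in principle, the Python sketch below is a minimal, hypothetical illustration only: it matches packet payloads against static byte signatures and consults a blocklist that can be refreshed from an external feed. The Packet type, the example signatures, and the refresh_blocklist helper are invented for this illustration and do not reflect our production rules or any particular vendor's product.

    # Minimal sketch of the general approach described above: combine payload
    # inspection (DPI) with a live intelligence feed of known-bad sources.
    # Signatures, IPs, and helpers here are hypothetical placeholders.

    from dataclasses import dataclass


    @dataclass
    class Packet:
        src_ip: str
        dst_port: int
        payload: bytes


    # Static DPI signatures: byte patterns previously observed in attack traffic.
    MALICIOUS_PAYLOAD_SIGNATURES = [
        b"\x00\x00\x00\x00\x00\x00\x00\x00",  # e.g. zero-padded junk floods
        b"GET /?rand=",                       # e.g. randomized HTTP floods
    ]

    # "Live intelligence" in this sketch is just a set refreshed out of band;
    # in practice it would be fed by a reputation service or upstream scrubber.
    live_blocklist: set[str] = set()


    def refresh_blocklist(new_entries: list[str]) -> None:
        """Merge freshly published bad-actor IPs into the in-memory blocklist."""
        live_blocklist.update(new_entries)


    def should_drop(pkt: Packet) -> bool:
        """Drop if the source is blocklisted or the payload matches a known signature."""
        if pkt.src_ip in live_blocklist:
            return True
        return any(sig in pkt.payload for sig in MALICIOUS_PAYLOAD_SIGNATURES)


    # Example with fabricated traffic (documentation-range IPs, not real attackers):
    refresh_blocklist(["203.0.113.7"])
    packets = [
        Packet("203.0.113.7", 443, b"hello"),                  # dropped: blocklisted source
        Packet("198.51.100.2", 80, b"GET /?rand=83adf1"),      # dropped: payload signature
        Packet("198.51.100.3", 443, b"normal TLS handshake"),  # allowed
    ]
    for p in packets:
        print(p.src_ip, "DROP" if should_drop(p) else "ALLOW")

In a real deployment, this kind of logic runs in dedicated filtering hardware or in an upstream scrubbing network rather than in application code, which is exactly the hardware-versus-cloud tradeoff we are evaluating.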

Moving forward, our short-term priority is dampening the effect of any similar attacks with the resources available. This means applying what we’ve learned from mitigating these types of attacks, as well as bringing on consulting experts to audit our existing architecture and identify potential improvements.

In the long term, we’re pursuing a concrete, dedicated mitigation solution (as outlined above). As we conduct a thorough review, we’ll update this page with details.

Ultimately, we’d like to apologize for the events that occurred here and any impact on your sites. We understand that any downtime causes significant disruption to your business, especially an incident as long as this one.

To the many clients who choose us for our track record of reliability: we apologize. This is not a normal occurrence, and rest assured we will take every possible step to learn from this incident and ensure it never happens again.

Clients impacted by this period of interruption are eligible for a full refund (100%) of this month’s service. Please open a ticket with us and we’ll get that processed for you.

Sincerely,

Welton, Ivan, and the rest of the CynderHost team
