
Work Header

Language: English

Stats:
Published: 2024-03-25
Words: 1,783
Chapters: 1/1
Comments: 427
Kudos: 4,809
Bookmarks: 214
Hits: 27,845

The AO3 July/August DDoS Attacks: Behind the Scenes

Summary:

The AO3 July/August 2023 DDoS attacks from the perspective of the OTW Systems Committee.

Work Text:

Introduction

This work provides an overview of the July & August 2023 DDoS attacks against the Archive of Our Own from the perspective of the OTW’s Systems Committee. As such, it may include some technical terms and information; we’ll do our best to explain or link to external resources to provide context as needed. We will focus on the series of events for which we have data & evidence, rather than speculation. All times & dates are in Coordinated Universal Time (UTC) and in 24-hour format unless otherwise stated.

As a reminder, the Systems Committee consists of 8 volunteers (6 at the time of the incident) who donate their free time to maintaining the OTW infrastructure. The events outlined below were fit in around our day jobs, during evenings and late nights.

Background

The OTW infrastructure consists of multiple servers & networking devices with differing roles. In order to understand how the attack affected us, we’ll need to briefly explain a few of these layers.

At the edge of our network, we have dual redundant firewalls. These are primarily responsible for restricting traffic into our network, but they are also used to load balance between our frontend servers.

The frontend servers are responsible for traffic shaping and load balancing of traffic that will be served by the application servers. The frontends also serve some static files, such as images and stylesheets.

The application servers actually generate the pages of the Archive. They talk to numerous support services, such as our database & Elasticsearch, to generate pages for everyone.
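
For illustration, here is a minimal Python sketch of the request path described above. The server names, counts, and simple round-robin balancing are hypothetical, not our actual configuration; the sketch only shows how the layers relate to each other.

# Hypothetical sketch of the layers described above; names, counts and the
# round-robin balancing are illustrative only, not our real configuration.
import itertools

FRONTENDS = ["front01", "front02"]                  # behind the firewalls
APP_SERVERS = ["app01", "app02", "app03", "app04"]  # generate Archive pages

frontend_cycle = itertools.cycle(FRONTENDS)  # firewall layer balances frontends
app_cycle = itertools.cycle(APP_SERVERS)     # frontend layer balances app servers

def handle_request(path):
    """Trace a request through firewall -> frontend -> application server."""
    frontend = next(frontend_cycle)
    # Static assets such as images and stylesheets are served by the frontend.
    if path.startswith("/images/") or path.endswith(".css"):
        return f"{frontend} served {path} directly"
    # Dynamic pages are passed on to an application server, which talks to
    # the database, Elasticsearch and other supporting services.
    return f"{frontend} proxied {path} to {next(app_cycle)}"

print(handle_request("/works/12345"))
print(handle_request("/stylesheets/site.css"))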

July 10th, 2023

The Archive was operating normally. At approximately 11:48 UTC, we began to see increased levels of traffic to the Archive, which rapidly began generating errors and overloading the CPUs on our frontend servers.

Graph showing requests per minute to AO3. A relatively flat line sits at about 80-90k rpm, then spikes to over 200k rpm before dropping to near zero.

The first Systems Committee volunteer responded shortly after 12:00 UTC. The volunteer felt that the traffic was likely malicious, but could not immediately conclude it was an attack, since some initial signals suggested the suspicious traffic could instead have been caused by a browser update or a misbehaving bot.
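
To give a sense of what that triage involves, here is a rough Python sketch that tallies requests per client IP and per user agent from a standard web server access log. The log path and format are assumptions made for the example; this is not our actual tooling.

# Rough triage sketch: count requests per client IP and per user agent.
# The log path and combined-log format are assumed for illustration only.
import re
from collections import Counter

AGENT = re.compile(r'"([^"]*)"\s*$')   # final quoted field: the user agent

ips, agents = Counter(), Counter()
with open("access.log") as log:        # hypothetical access log
    for line in log:
        if not line.strip():
            continue
        ips[line.split()[0]] += 1      # first field: client IP
        match = AGENT.search(line)
        if match:
            agents[match.group(1)] += 1

# A browser update tends to show one new agent string spread across many IPs,
# while an HTTP flood concentrates on a handful of agents, paths or IPs.
print(ips.most_common(5))
print(agents.most_common(5))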

To buy time to investigate further and prepare a fix, the volunteer placed the site in maintenance mode around 12:15 UTC, which stopped requests at the frontend layer. This relieved load on the application servers, but did not reduce the noticeable load on the frontend servers. At 13:52 UTC, the next Systems volunteer checked into our internal chat, and by 14:00 the team had more or less concluded that the outage was due to foul play, specifically an HTTP DDoS attack. Characteristics of the traffic were identified in order to attempt mitigation.

The third Systems team member checked in around 15:21 UTC and began deploying a potential fix provided by the first responding volunteer. Due to excess load on the frontend servers, this was taking much longer than expected, so traffic was stopped at our firewalls, which allowed the deployment to complete in a reasonable time.

Unfortunately, after allowing traffic back through the firewalls, we continued to see high load on our frontend servers. Around 17:16 UTC, a change was deployed to our firewalls to block traffic to the page that was being abused. This stopped the abusive traffic at the edge of our network, which reduced the load on our frontend servers and kept things mostly at bay. The site returned to more or less normal operation for about an hour.

After an hour of uptime, the attackers began targeting different pages on the site. The three team members followed up by blocking requests to those pages, which allowed the site to remain mostly available between 19:05 and 21:42 UTC. Around that time, we began to see traffic spikes of over 1 million requests per minute in our APM (application performance monitoring) tool. For context, normal peak traffic hovers around 150k requests per minute. Since our servers were doing their best to respond to all of these requests, we began to exhaust the capacity of our physical internet connection.
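
As a rough back-of-the-envelope illustration of why the connection saturated (the average response size below is an assumption made for the example, not a figure measured on our systems):

# Back-of-the-envelope estimate; the response size is an assumed value.
requests_per_minute = 1_000_000        # observed order of magnitude at peak
avg_response_bytes = 50 * 1024         # assumed ~50 KB average response
gbits_per_second = requests_per_minute * avg_response_bytes * 8 / 60 / 1e9
print(f"~{gbits_per_second:.1f} Gbit/s outbound")   # roughly 6.8 Gbit/s

Even with a much smaller average response, that volume comfortably exceeds a link in the low single-digit gigabits per second, which is the order of capacity implied by the next day’s figures (1.2 Tb/s being roughly 600 times our bandwidth).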

From 21:42 UTC to about 23:10 UTC, the site was sporadically up and down while the team tried to keep it alive. We were attempting to deploy completely new rate limiting measures on the fly, without much success. The request spikes continued, reaching a peak of 1.5 million requests per minute (including requests answered with “you’re browsing too fast” responses), which is simply what our application servers were able to process. There were undoubtedly more requests that were congested upstream and thus never logged.

Requests per minute graph showing numerous spikes near & over 1 million RPM. One spike reaches 1.5 million RPM.
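
For readers unfamiliar with rate limiting, the sketch below shows the general idea in Python as a per-client token bucket: each client has a small budget of requests that refills over time, and anything beyond it gets a “you’re browsing too fast” style response. This is only an illustration of the concept, not the measures we were actually trying to deploy on our frontends.

# Minimal token-bucket rate limiter; the rates are illustrative only and
# this is not the configuration we actually attempted to deploy.
import time
from collections import defaultdict

RATE = 5.0    # tokens refilled per second, per client (illustrative)
BURST = 20.0  # maximum bucket size (illustrative)

buckets = defaultdict(lambda: {"tokens": BURST, "stamp": time.monotonic()})

def allow(client_ip):
    """Return True if the request may proceed, False for a 429-style reply."""
    bucket = buckets[client_ip]
    now = time.monotonic()
    bucket["tokens"] = min(BURST, bucket["tokens"] + (now - bucket["stamp"]) * RATE)
    bucket["stamp"] = now
    if bucket["tokens"] >= 1.0:
        bucket["tokens"] -= 1.0
        return True
    return False   # caller answers with "you're browsing too fast"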

After 23:10 UTC, the site was more or less down as we were completely flooded with more traffic than we could physically handle.

July 11th, 2023

The team continued trying to mitigate the attack ourselves into July 11th, but each time we attempted to return to service, we were immediately overwhelmed by the traffic.

At 00:21 UTC, our datacenter informed us that the attack had exceeded 1.2 terabits per second, which is around 600 times the bandwidth capacity we had at the time. This caused temporary disruption for the whole datacenter until further upstream filtering was enabled. It was likely the result of a DNS amplification attack or similar, in addition to the HTTP flooding we were receiving, and was unknown to us until this point.
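
For context, a DNS amplification attack works by sending small queries to open DNS resolvers with the victim’s address forged as the source, so the resolvers’ much larger responses are reflected onto the victim. The quick arithmetic below, using only the figures above, shows the scale involved:

# Implied link capacity from the figures reported by our datacenter.
attack_gbps = 1.2 * 1000      # 1.2 Tbit/s expressed in Gbit/s
capacity_multiple = 600       # "around 600 times the bandwidth capacity we had"
print(f"~{attack_gbps / capacity_multiple:.0f} Gbit/s of capacity at the time")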

By 05:00 UTC, the team had spent hours attempting to handle things ourselves, and it was clear we weren’t getting far on our own. At this point, we made the decision to set up Cloudflare to get the site back online, and we worked with our datacenter to make the necessary preparations. Around 09:47 UTC, we started setting ourselves up on the Cloudflare free tier while waiting for the necessary approvals to upgrade, but further changes to the backend were needed. In the meantime, one of our volunteers was able to connect with a Cloudflare employee & fellow user of the site, who referred us to Project Galileo and supported our application internally.

Thanks to these efforts, we were officially approved at 14:04 UTC, only ~2 hours after applying, which granted us access to more advanced tooling. At 15:00 UTC, some traffic began successfully hitting our application servers via Cloudflare. This initially included some attack traffic. We worked with our Cloudflare contact to put in place some rules to further mitigate the abusive traffic, and their system began to recognize and stop more or less all of it. The Archive was once more fully accessible around 15:42 UTC.

Requests per minute to the Archive returned roughly to normal upon successfully implementing Cloudflare.

Traffic graph from the Cloudflare portal showing unmitigated & mitigated requests. Initially the majority of requests are unmitigated, but this later reverses once Cloudflare’s systems kick in & specific rules are applied. Times are US Eastern.

July 11th to August 31st

We continued to receive a series of attacks in this time frame. The majority of these attacks had no impact on the site and were mitigated by Cloudflare. However, there were a couple of notable events.

On August 26th, 2023 at 12:29 UTC, we received a large attack peaking at 10 million requests per second. The attack had no notable impact on the Archive, but was the largest attack we had recorded at the time.

On August 28th, 2023 at 20:49 UTC, we received a notification from Cloudflare that a DDoS attack of 6.95 million requests per second was detected. The numbers on these alerts are frequently lower than the actual peak of the traffic, so this was initially alarming to us.

Internal Cloudflare alert showing a 6.95 million RPS attack in progress.

We later found out that the attack had actually peaked at 65 million requests per second. For context, the largest publicly announced HTTP DDoS attack by Cloudflare at the time was a 71 million request per second attack. Additionally, we received information that the attack originated from the Mirai botnet. However, Cloudflare did its job well and we saw very little, if any, impact.

A screenshot from our, at the time, in progress Cloudflare stats dashboard. The peak reaching slightly over 65 million RPS is visible.

On August 30th, 2023 at about 22:15 UTC, we received a set of attacks that was not initially mitigated well by Cloudflare, which caused some disruption. The attack lasted until approximately 00:10 UTC on the 31st. We believe the disruption in this case was due to some long-standing issues in a piece of legacy software that was part of our stack, which we disabled at the time and later removed.

In the later hours of August 31st, we received another set of attacks, which caused brief problems. Initially the attack was not fully mitigated by Cloudflare, but we were able to put in place some caching rules that helped the situation. Some of Cloudflare’s automatic rules then kicked in, but caused some brief collateral damage: although this wasn’t an issue for long, legitimate users were shown the default Cloudflare block page, which is a little scary. We later replaced this with a custom page that is nicer and more reassuring. :)

The Cloudflare default block page, stating "Sorry, you have been blocked. You are unable to access archiveofourown.org."

Our custom block page, stating more reassuringly that the user's action was temporarily blocked and does not affect account status.

A number of smaller attacks have occurred since then; however, essentially none have had any impact on the Archive or required any major action from us.

Acknowledgements

We thank Cloudflare for the quick turnaround during the initial attack, for providing us with services under Project Galileo, and for continuing to be responsive to our needs. 🧡

We are very grateful to our datacenter for hosting us as long as they have, for their initial support during the attacks, and for their quick response in getting us items needed to enable Cloudflare. 🧡

We thank all of the OTW volunteers who were around to support us during the attacks. We also appreciate everything you do for the org. ❤️

Finally, we thank the users of the Archive for all of the love and support we received during the downtime. We also received a lot of offers to assist us in any way possible from various industry professionals. All of this was incredibly motivating in keeping us going. Thank you. ❤️