Amazon has revealed its findings on the cause of the recent AWS outage, which affected not only websites and users worldwide but also the applications used by warehouse, delivery, and Amazon Flex employees.
A wide range of Amazon services, including Prime Video, Alexa, and Ring, as well as high-profile customers such as Facebook and Disney Plus, became unavailable or slowed significantly because of an issue in an AWS US region that persisted for many hours.
The corporation has concluded its investigation into the outage, which it attributes to an unusual chain of events that began with an activity intended to strengthen its services.
AWS Service Impact
“An automated activity to scale capacity of one of the AWS services hosted in the main AWS network triggered an unexpected behavior from numerous clients inside the internal network,” the company explained.
“This resulted in a large surge of connection activity that overwhelmed the networking devices between the internal network and the main AWS network, resulting in delays for communication between these networks.”
“We have taken several actions to prevent a recurrence of this event. We immediately disabled the scaling activities that triggered this event and will not resume them until we have deployed all remediation.
“Our systems are scaled adequately so that we do not need to resume these activities in the near-term. Our networking clients have well tested request back-off behaviors that are designed to allow our systems to recover from these sorts of congestion events, but, a latent issue prevented these clients from adequately backing off during this event. This code path has been in production for many years, but the automated scaling activity triggered a previously unobserved behavior.
“We are developing a fix for this issue and expect to deploy this change over the next two weeks. We have also deployed additional network configuration that protects potentially impacted networking devices even in the face of a similar congestion event. This remediation gives us confidence that we will not see a recurrence of this issue,” AWS wrote in a blog post.
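For readers unfamiliar with the "request back-off" behavior AWS refers to, the general idea can be sketched in a few lines of Python. This is only an illustration of a common technique (capped exponential backoff with random jitter), not AWS's client code; the function name `call_with_backoff` and all of its parameters are hypothetical and chosen for this example.

```python
import random
import time


def call_with_backoff(operation, attempts=5, base=0.1, cap=10.0,
                      sleep=time.sleep, rng=random):
    """Call operation(), retrying on ConnectionError with growing, randomized waits.

    All names and defaults here are illustrative assumptions, not taken from AWS.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            # Out of retries: surface the error to the caller.
            if attempt == attempts - 1:
                raise
            # The upper bound on the wait doubles with each failure, up to `cap`.
            ceiling = min(cap, base * 2 ** attempt)
            # A random wait below the ceiling keeps many clients from
            # retrying at the same instant.
            sleep(rng.uniform(0, ceiling))
```

When clients wait longer after each failure and spread their retries out at random, a congested network gets room to recover. If that waiting does not happen, as AWS says was the case here because of a latent issue, retries can pile onto devices that are already overloaded.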
The company also acknowledged customers' frustration with the lack of information during the outage, stating, “We understand that events like this are more impactful and frustrating when information about what’s happening isn’t readily available.”
AWS concluded the blog post with an apology: “Finally, we want to apologize for the impact this event caused for our customers. While we are proud of our track record of availability, we know how critical our services are to our customers, their applications and end users, and their businesses. We know this event impacted many customers in significant ways. We will do everything we can to learn from this event and use it to improve our availability even further.”