Could AWS Crash Again And Roil The Internet? It Already Has — And Competitors Are Taking Notice
Last week's massive Amazon Web Services outage caused chaos for corporations and consumers alike. And now the causes — and fallout — are starting to come into focus.
On the morning of Dec. 7, an outage at AWS data centers in Virginia brought many of the company’s cloud services offline, taking out companies and products reliant on those services in the process.
The outage paralyzed Amazon deliveries, crashed streaming services and applications like Disney+, Venmo, Tinder and Ticketmaster that rely on AWS cloud services, and disrupted operations at companies and institutions from airlines to major universities.
In the nine hours it took for AWS to fully restore service, home devices from Nest thermostats to Roomba vacuums became inoperable, and municipal infrastructure like bike-share terminals stopped working.
Industry observers say the chaos caused by the outage provides a stark demonstration of the degree to which cloud services are integral to the operation of both large corporations and consumer products.
“The latest AWS outage is a prime example of the danger of centralized network infrastructure,” said Sean O’Brien, a visiting lecturer in cybersecurity at Yale Law School who spoke with the Associated Press.
“Though most people browsing the internet or using an app don’t know it, Amazon is baked into most of the apps and websites they use each day.”
Adding to the confusion for Amazon’s cloud customers on Tuesday was the company’s lack of communication about, or perhaps understanding of, what caused the outage.
But experts say a clearer picture has emerged in the days since, one that offers insight not just into Amazon but into the present and future of digital infrastructure.
Here’s what you need to know:
What Caused The Outage?
On Saturday, Amazon released a statement explaining the cause of the outage, which the company said occurred in US-East-1, its massive cloud region in the heart of Virginia’s Data Center Alley. While the company’s explanation is saturated with tech jargon, the main takeaway is that an error in an automated process overloaded some of the company’s internal infrastructure, congesting the links between its internal and main networks and making the problem exceedingly difficult to diagnose.
“An automated activity to scale capacity of one of the AWS services hosted in the main AWS network triggered an unexpected behavior from a large number of clients inside the internal network,” Amazon’s report says.
“This resulted in a large surge of connection activity that overwhelmed the networking devices between the internal network and the main AWS network, resulting in delays for communication between these networks.”
Amazon’s statement quelled speculation that the outage was caused by a security breach or external threat. The company’s initial response to the crisis, which attributed the outage to a surge in activity “from an unknown source,” had fanned the flames of that speculation.
The Outage Could Give A Boost To Amazon’s Hyperscale Competitors
Tuesday’s outage could be good news for competitors looking to cut into AWS’ market share.
AWS sits at the top of the cloud food chain, controlling around 33% of the global cloud infrastructure market, according to Synergy Research Group. By comparison, Microsoft and Google have market shares of 20% and 10%, respectively. At least one competing cloud provider seems to think it will benefit from the reputational damage caused by the outage, with Oracle’s Larry Ellison taking a shot at AWS in an earnings call on Thursday.
“Let me close with a note that I’m going to paraphrase from a very large telecommunications company who uses our cloud and all the other three North American clouds — Google, Amazon and Microsoft,” Ellison said, according to CNBC. “And the note basically said the one thing we’ve noticed about Oracle, Oracle’s cloud, is that it never ever goes down. We can’t say that about any of the other clouds. We think this is a critical differentiator.”
Industry insiders said that while the outage is unlikely to cause customers to flee Amazon, it may lead them to adopt multi-cloud architectures, spreading workloads across more than one provider to limit their exposure when any single provider goes down.
This could help Microsoft, Google and others make up ground.
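For readers curious what managing risk across providers can look like in practice, the idea can be sketched in a few lines of Python. This is a minimal illustration only, not a description of how any company quoted here builds its systems: the function names (`call_with_failover`, `primary_down`, `secondary_ok`) and the provider labels are hypothetical, and a real deployment would also need health checks, data replication and timeouts.

```python
from typing import Callable, List, Sequence, Tuple


class AllProvidersFailed(Exception):
    """Raised when every configured provider has failed."""


def call_with_failover(
    providers: Sequence[Tuple[str, Callable[[], str]]]
) -> Tuple[str, str]:
    """Try each provider in order and return (provider_name, result)
    from the first one that succeeds."""
    errors: List[str] = []
    for name, fetch in providers:
        try:
            return name, fetch()
        except Exception as exc:  # hypothetical: treat any error as an outage
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Hypothetical stand-ins for calls to two different cloud providers.
def primary_down() -> str:
    raise ConnectionError("primary provider unreachable")


def secondary_ok() -> str:
    return "served from secondary"


used, result = call_with_failover(
    [("primary", primary_down), ("secondary", secondary_ok)]
)
print(used, result)
```

In this sketch the request falls through to the second provider only after the first one fails, which is the simplest form of the trade-off insiders describe: more resilience in exchange for the cost and complexity of running on more than one cloud.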
Guess What? AWS’ Main Website Crashed Again Three Days Later
AWS had more than one outage last week.
Just days after Amazon was able to restore its cloud services, the main public-facing website for Amazon Web Services went offline, greeting users with an error message, Data Center Dynamics reports.
The site stayed offline for around an hour, although no other Amazon services were impacted, according to reports.
Amazon Promises Better Transparency With Future Outages
As Amazon’s cloud customers spent Tuesday scrambling to get systems and products back up and running, many were unable to get information about the outage, find out which services were operational, report problems or file service tickets.
According to AWS, the same issues behind the outage also took out its customer service portal known as the Service Health Dashboard, leaving customers in the dark. The company now says that a more robust version of the dashboard — one less likely to be impacted by an outage — will go online in 2022.
“We expect to release a new version of our Service Health Dashboard early next year that will make it easier to understand service impact and a new support system architecture that actively runs across multiple AWS regions to ensure we do not have delays in communicating with customers,” AWS said in a statement on its website.