A single mistyped command, entered during routine debugging of the billing system at Amazon Web Services’ northern Virginia data center, turned out to be the cause of a huge outage Feb. 28 that stilled an estimated 150,000 websites and business services for about half the day.
Some websites and apps became completely unavailable, while others displayed broken links and images, leaving users and companies around the globe frustrated and confused.
One wonders how much more of the Internet would have tanked had one or two other faulty commands been entered.
To be fair, AWS outages like this one are extremely rare. The company stayed transparent throughout the event, updating its status page frequently.
AWS Issues Apology to All Users
The Seattle-based web services and storage giant issued an apology March 2 to the thousands of companies and millions of people who use its services daily. It turns out that a command intended to remove a limited number of servers for one of its Simple Storage Service (S3) subsystems was entered incorrectly; it instead removed a much larger set of servers, for a period of 3.5 to 5 hours.
Once that mistake was discovered, a full system restart was required. The restart took much longer than expected because of how fast the Amazon Web Services division has grown during the past decade, the company said.
S3, whose launch in early 2006 helped start the cloud-computing revolution, is Amazon’s largest and most-used service. More than 500,000 of the company’s million-plus customers use it for cloud storage, AWS said.
In a published postmortem on the incident, AWS said that “we want to apologize for the impact this event caused for our customers. While we are proud of our long track record of availability with Amazon S3, we know how critical this service is to our customers, their applications and end users, and their businesses. We will do everything we can to learn from this event and use it to improve our availability even further.”
AWS Offers Timeline of the Event
AWS spelled out the timeline of the event this way:
“At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.
“The servers that were inadvertently removed supported two other S3 subsystems.”
The company said March 2 that it is making changes to the system to ensure that incorrect commands won’t trigger an outage of its web services in the future.
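AWS did not spell out here how those changes work. As a rough illustration only, one common safeguard for this kind of operational tool is to reject any removal request that names unknown servers, exceeds a per-command batch limit, or would drop a subsystem below a minimum capacity. The Python sketch below assumes hypothetical names (RemovalPolicy, plan_removal) and made-up limits; it is not AWS's tooling.

```python
from dataclasses import dataclass


class UnsafeRemovalError(Exception):
    """Raised when a removal request would breach a safety limit."""


@dataclass(frozen=True)
class RemovalPolicy:
    # Both limits are illustrative assumptions, not AWS values.
    min_servers: int  # capacity floor the subsystem must keep
    max_batch: int    # most servers a single command may remove


def plan_removal(active: list[str], requested: list[str],
                 policy: RemovalPolicy) -> list[str]:
    """Return the servers left after removal, or raise if the request is unsafe."""
    targets = set(requested)

    # A mistyped input often names servers that are not in this fleet at all.
    unknown = sorted(targets - set(active))
    if unknown:
        raise UnsafeRemovalError(f"unknown servers: {unknown}")

    # Cap how much capacity one command can take out.
    if len(targets) > policy.max_batch:
        raise UnsafeRemovalError(
            f"{len(targets)} servers requested, limit is {policy.max_batch}")

    # Never let the fleet fall below its capacity floor.
    remaining = [server for server in active if server not in targets]
    if len(remaining) < policy.min_servers:
        raise UnsafeRemovalError(
            f"only {len(remaining)} servers would remain, "
            f"minimum is {policy.min_servers}")
    return remaining


if __name__ == "__main__":
    fleet = [f"srv-{i:02d}" for i in range(10)]
    policy = RemovalPolicy(min_servers=6, max_batch=2)

    # A small, intended removal passes the checks.
    print(len(plan_removal(fleet, ["srv-00", "srv-01"], policy)), "servers remain")

    # An oversized request is rejected before anything is removed.
    try:
        plan_removal(fleet, fleet[:8], policy)
    except UnsafeRemovalError as err:
        print("rejected:", err)
```

The point of the design is that the check runs before any server is touched, so a bad input produces an error message rather than a loss of capacity.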
Services affected on Feb. 28 included Adobe’s services, Amazon’s Twitch, Atlassian’s Bitbucket and HipChat, Buffer, Business Insider, Carto, Chef, Citrix, Clarifai, Codecademy, Coindesk, Convo, Coursera, Cracked, Docker, Elastic, Expedia, Expensify, FanDuel, FiftyThree, Flipboard, Flippa, Giphy, GitHub, GitLab, Google-owned Fabric, Greenhouse, Heroku, Home Chef, iFixit, IFTTT, Imgur, Ionic, isitdownrightnow.com, Jamf, JSTOR, Kickstarter, Lonely Planet, Mailchimp, Mapbox, Medium, Microsoft’s HockeyApp, the MIT Technology Review, MuckRock, New Relic, News Corp, PagerDuty, Pantheon, Quora, Razer, Signal, Slack, Sprout Social, StatusPage, Travis CI, Trello, Twilio, Unbounce, the U.S. Securities and Exchange Commission (SEC), Vermont Public Radio, VSCO and Zendesk, among others.
Numerous Sites Suffered Partial Outages
Airbnb, Down Detector, Freshdesk, Pinterest, SendGrid, Snapchat’s Bitmoji and Time Inc. were working slowly in the afternoon, the company reported.
Apple reported on its system status page that it had issues with its App Stores, Apple Music, FaceTime, iCloud services, iTunes, Photos and other services, but it has not been confirmed that they were attributable to the S3 difficulties.