Resilience: Planning for When the Clouds Go Away
On November 25, AWS suffered a major service disruption in its Northern Virginia (US-East-1) region. The outage left the region out of commission for over 17 hours and took thousands of sites and services offline, including Adobe, Twilio, Autodesk, the New York City Metropolitan Transportation Authority, and The Washington Post.
When Planned Upgrades Go Wrong
In a post-mortem, Amazon described how a planned capacity increase to their front-end fleet of Kinesis servers pushed those servers past the maximum number of threads allowed by the operating system configuration. Because each front-end server maintains threads for every other server in the fleet, adding capacity raised the per-host thread count beyond that limit. As servers hit the limit, fleet information, including membership and shard-map ownership, became corrupted; the front-end servers then propagated this bad state to neighboring Kinesis servers, causing a cascade of failures.
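The arithmetic behind the failure is simple: if every server keeps roughly one thread per peer, per-host thread usage grows linearly with fleet size, and a capacity increase alone can blow past an OS limit. Below is a minimal sketch of a pre-scaling guardrail under that assumption; the function names, the baseline of 200 threads, and the use of RLIMIT_NPROC as the cap are illustrative choices, not AWS internals.

```python
# A minimal sketch of a pre-scaling guardrail, assuming a fleet design in
# which every front-end server keeps one OS thread per peer. All names and
# numbers are illustrative assumptions, not AWS internals.

import resource


def projected_threads_per_host(fleet_size: int, baseline_threads: int = 200) -> int:
    """Estimate threads per host: a fixed baseline plus one thread per peer."""
    return baseline_threads + (fleet_size - 1)


def safe_to_scale(current_size: int, added_servers: int, headroom: float = 0.8) -> bool:
    """Reject a capacity increase that would push the projected thread count
    past `headroom` of the per-user process/thread limit (RLIMIT_NPROC)."""
    soft_limit, _hard = resource.getrlimit(resource.RLIMIT_NPROC)
    if soft_limit == resource.RLIM_INFINITY:
        return True  # no explicit cap configured
    needed = projected_threads_per_host(current_size + added_servers)
    return needed <= soft_limit * headroom


if __name__ == "__main__":
    if not safe_to_scale(current_size=4000, added_servers=200):
        print("Capacity increase would exceed thread headroom; raise limits first.")
```

The point of the check is not the exact numbers but the habit: model how a shared resource scales with fleet size before you add capacity, rather than discovering the limit in production.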
Additionally, restarting the fleet appears to be a slow and rather painful process, in part because AWS relies on this neighboring-servers model to propagate bootstrap information, rather than on an authoritative metadata store. Among other things, this meant that servers had to be brought back up in small groups over a period of hours. Kinesis was not fully operational until 10:23 PM, some 17 hours after Amazon received the initial alerts.
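Bringing the fleet back in small groups is, in effect, a batched restart with a settling period between waves so that peer-propagated bootstrap state can converge. The sketch below shows that general pattern; restart_host, fleet_is_healthy, and the batch and timing parameters are hypothetical stand-ins, not anything taken from the Kinesis post-mortem.

```python
# A rough sketch of a batched (rolling) restart with a settling check between
# waves, assuming hypothetical restart_host() and fleet_is_healthy() helpers.

import time
from typing import Callable, Iterable, List


def batched(items: List[str], size: int) -> Iterable[List[str]]:
    """Yield successive fixed-size batches of hosts."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def rolling_restart(hosts: List[str],
                    restart_host: Callable[[str], None],
                    fleet_is_healthy: Callable[[], bool],
                    batch_size: int = 20,
                    settle_seconds: int = 300,
                    max_health_checks: int = 12) -> None:
    """Restart hosts in small batches, waiting for the fleet to report healthy
    (e.g. membership and shard maps converged) before starting the next wave."""
    for batch in batched(hosts, batch_size):
        for host in batch:
            restart_host(host)
        for _ in range(max_health_checks):
            time.sleep(settle_seconds)
            if fleet_is_healthy():
                break
        else:
            raise RuntimeError(f"Fleet did not stabilize after batch {batch}")
```

With thousands of hosts, small batches and multi-minute settling windows add up quickly, which is one way a recovery stretches into many hours.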
Finally, the failures with Kinesis also took out a number of peripheral services, including CloudWatch, Cognito, and the Service Health and Personal Health Dashboards used to communicate outages to customers. For reasons that aren't entirely clear, these dashboards depend on Cognito and may not be sharded across regions. Essentially, the post-mortem seems to imply that if Cognito goes down in a region, affected customers in that region will have no way of knowing.
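One practical implication is not to rely solely on the provider's status dashboard: a simple out-of-band probe of your own regional endpoints can tell you a region is degraded even if the dashboard shares that region's failure modes. The following is an illustrative monitor only; the endpoint URLs, timeout, and thresholds are placeholder assumptions.

```python
# Illustrative out-of-band health probe: check regional endpoints directly
# instead of depending on a status dashboard hosted in the affected region.
# Endpoint URLs and timeouts below are placeholder assumptions.

import urllib.request
from urllib.error import URLError

REGIONAL_ENDPOINTS = {
    "us-east-1": "https://health.example.com/us-east-1/ping",  # hypothetical
    "us-west-2": "https://health.example.com/us-west-2/ping",  # hypothetical
}


def probe(url: str, timeout: float = 3.0) -> bool:
    """Return True if the endpoint answers with an HTTP 2xx within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (URLError, OSError):
        return False


def degraded_regions() -> list:
    """List regions whose probe endpoint is unreachable or unhealthy."""
    return [region for region, url in REGIONAL_ENDPOINTS.items()
            if not probe(url)]


if __name__ == "__main__":
    for region in degraded_regions():
        print(f"ALERT: {region} endpoint unreachable; investigate independently.")
```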
Resiliency, or How to Survive the Next Outage
While Amazon posted a number of lessons learned in their post-mortem, all of which are worth reading, today we want to discuss what customers can do to limit their own risks in the Cloud: