http://searchcloudcomputing.techtarget.com/news/article/0,289142,sid201_gci1378426,00.html?track=NL-1329&ad=743755&asrc=EM_NLN_10614277&uid=1914599#
Teich also said that all of Heroku’s m2.2xlarge instances were running in a single availability zone, which was a mistake. He stressed that Heroku had failover built in already — if 21 instances had failed instead of 22, or if it had spread instances across several zones, “we wouldn’t be talking [about the outage],” he said.
Nevertheless, on Friday, January 2, every m2.2xlarge instance in that availability zone suddenly vanished, despite all other types of EC2 instances running as normal. That’s unheard of in traditional hosting. It would be like every server with a given amount of RAM suddenly shutting down, regardless of operating system, age, brand, hardware or location in the data center, with no effect on its neighbors.
“For us, there’s the stuff you plan for and then there’s the stuff you don’t even know about,” Teich said.
An event like this was an “unknown unknown” that nobody planned for because nobody imagined it. He chalked it up to the learning process and pointed out that everybody in Amazon Web Services was flying by the seat of the pants at least part of the time.
HN thread:
http://news.ycombinator.com/item?id=1046500