Love it or hate it, the cloud is here to stay. It's hard to beat the capabilities and it's nearly impossible to beat the prices. While our site is hosted traditionally, the media files for all of our shows are hosted on Microsoft Azure. As much as the cloud seems like a slam dunk, the growing popularity of cloud services does present an interesting new problem.
As more and more sites and applications come to rely on services like Microsoft Azure, Google Cloud and Amazon Web Services (AWS), each provider becomes a single point of failure for everything built on top of it. If one of the cloud companies has an issue, then hundreds or thousands of its customers, and all of their users, can be affected.
This problem was demonstrated this week when the AWS Simple Storage Service (S3) failed on February 28. Amazon was very open about the issue, keeping its status page up to date as the problem was identified, troubleshot and solved. The outage, however, took down as many as 150,000 websites and applications, with more suffering partial outages, for about half of the day.
With this kind of failure, you would expect that something serious, such as a power outage or massive storm, would be to blame. Unfortunately, you would be mistaken. As it turns out, it was an institutional failure within AWS that caused the issue. Let's examine what happened.
According to Amazon, a typo during a billing system change was to blame for the outage in their northern Virginia datacenter. The typo, however, was only a symptom of the actual problem within Amazon. Their system should not be so open to the support team that one employee can essentially shut down an entire datacenter with a single command. The system should have denied the command, knowing that there was little to no chance that it was entered on purpose.
That is a system and organizational failure, no matter how you slice it. A company that knows how many businesses and organizations rely on its services should be more respectful of those customers.
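To make the idea of a system that denies an obviously mistaken command a little more concrete, here is a minimal sketch in Python. It is not Amazon's tooling, and nothing in it comes from their systems: the `CapacityGuard` class, the `min_servers` floor and the 5 percent per-command cap are all hypothetical names and numbers chosen purely for illustration. The point is only that a removal request far larger than any plausible intent can be rejected before it does any damage.

```python
from dataclasses import dataclass


class UnsafeCommandError(Exception):
    """Raised when a removal request exceeds the configured safety limits."""


@dataclass
class CapacityGuard:
    # All fields are hypothetical illustration values, not real AWS settings.
    total_servers: int                 # servers currently in service
    min_servers: int                   # floor the fleet must never drop below
    max_percent_per_command: int = 5   # largest share one command may remove

    def remove(self, count: int) -> int:
        """Remove `count` servers if the request looks intentional and safe.

        Returns the number of servers left in service.
        """
        if count <= 0:
            raise ValueError("count must be a positive number of servers")

        # Cap how much capacity a single command is allowed to take away.
        limit = self.total_servers * self.max_percent_per_command // 100
        if count > limit:
            raise UnsafeCommandError(
                f"refusing to remove {count} servers; "
                f"a single command may remove at most {limit}"
            )

        # Never let the fleet fall below the minimum needed to stay up.
        if self.total_servers - count < self.min_servers:
            raise UnsafeCommandError(
                f"refusing to remove {count} servers; "
                f"at least {self.min_servers} must remain in service"
            )

        self.total_servers -= count
        return self.total_servers


if __name__ == "__main__":
    guard = CapacityGuard(total_servers=1000, min_servers=900)
    guard.remove(10)          # a small, plausible request goes through
    try:
        guard.remove(500)     # a fat-fingered request is denied
    except UnsafeCommandError as err:
        print(f"Denied: {err}")
```

With a check like this in front of the command, a mistyped number produces an error message instead of an outage, and anyone who really does need to remove that much capacity has to do it in several deliberate steps.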