April 29, 2011

"Service failure showed what can go wrong, but one government site showed how to be prepared" - Kevin McCaney

This was no blip. Energy’s OpenEI.org site, an open, Semantic Web collaboration platform for sharing work on clean energy, was down nearly two full days. Sites such as Reddit, Foursquare and Quora weren’t down quite as long, but they were still dark for a day or more, an eternity for social networking sites where users post and consume information from minute to minute. For some, it was panic-as-a-service.

As of this writing, Amazon had restored essentially all of EC2’s lost services and completed its postmortem, though a full explanation had yet to appear on its AWS status dashboard. Initial posts on the dashboard attributed the problem to remirroring among Elastic Block Store volumes that seemed to get carried away, essentially eating up capacity. Whatever. Knowing the actual cause of the failure would be useful. But the bottom line (and for government and private-sector websites, uptime is the bottom line) is that sites that depend on those cloud services were out of commission for a long time.

So what does this mean for agencies going to the cloud? Does the Amazon crash prove that cloud is a crapshoot for any organization that takes its services and data seriously? Will agency applications in the cloud forever be at the mercy of remirroring run amok?
The short answer is almost certainly no. The longer answer is still no, but the EC2 crash is a cautionary tale, a reminder that moving to the cloud involves the same prep work, attention to detail and contingency planning that go into any other critical network.
Case in point: Recovery.gov, which is hosted at the AWS Northern Virginia data center where the failure occurred, remained in operation throughout. The reason? The Recovery Accountability and Transparency Board, which runs Recovery.gov, had a backup plan: an agreement with Amazon to move its operation to another location if trouble cropped up, according to a report in InformationWeek.
One lesson from the crash is that cloud services won’t always be perfect. But the better lesson is the importance of a contingency plan.
Agencies are increasingly moving operations to the cloud, and for good reasons. Cloud computing frees data center space, cuts maintenance costs and power use, increases the availability of systems for mobile users, and, above all, saves money. Also, the Office of Management and Budget has decreed it, requiring agencies to move three applications to the cloud in the next 12 to 18 months.
But although cloud services can make some things easier for agencies, getting there isn’t a snap. Even moving e-mail systems, which are generally deemed the lowest of the low-hanging fruit for cloud migration, is fraught with pitfalls, as Rutrell Yasin reports in this issue.
The best approach, experts say, is a careful, thorough one. The impact of Amazon’s EC2 crash — which sites went dark and which stayed up — proves the point.

