Are cloud-based services just as reliable as the weather forecast?

Jason

14 years ago

The weekend did not start off well for cloud-based services like Netflix, Pinterest, and Instagram. Starting last night and continuing into this morning, web services that rely on Amazon Web Services’ Elastic Compute Cloud (EC2) were down regionally or completely. The cause: thunderstorms on the East Coast. The storms gave data center managers across the region a rough start to the weekend, most notably at Amazon’s data center in Northern Virginia.

Amazon CloudSearch reports elevated error rates, while other services like EC2, Relational Database Service (RDS), and Elastic Beanstalk are all reporting “power issues”. The status details cite electrical storms as the reason for the outages. The Washington Post has pictures of the storm, which rocked Washington, D.C. and left 5 dead.

Many services took to Twitter to report their outages and, later, to announce that they were back up, but some sites are still dealing with issues with their EBS volumes. (Twitter suffered its own downtime in the past week.) Amazon points its users to the AWS Documentation site for help with monitoring status checks.

If every cloud has its silver lining, what’s the lining for this one? Mostly, it seems, that cloud-based service providers, many of which offer their services for free (Netflix being the odd one out), can pass the blame on to Amazon. Amazon Web Services then passes the blame on to the doozy of a storm that walloped the East Coast.

The error details reported by Amazon are included below:

8:21 PM PDT We are investigating connectivity issues for a number of instances in the US-EAST-1 Region.
8:31 PM PDT We are investigating elevated errors rates for APIs in the US-EAST-1 (Northern Virginia) region, as well as connectivity issues to instances in a single availability zone.
8:40 PM PDT We can confirm that a large number of instances in a single Availability Zone have lost power due to electrical storms in the area. We are actively working to restore power.
8:49 PM PDT Power has been restored to the impacted Availability Zone and we are working to bring impacted instances and volumes back online.
9:20 PM PDT We are continuing to work to bring the instances and volumes back online. In addition, EC2 and EBS APIs are currently experiencing elevated error rates.
9:54 PM PDT EC2 and EBS APIs are once again operating normally. We are continuing to recover impacted instances and volumes.
10:36 PM PDT We continue to bring impacted instances and volumes back online. As a result of the power outage, some EBS volumes may have inconsistent data. As we bring volumes back online, any affected volumes will have their status in the “Status Checks” column in the Volume list in the AWS console listed as “Impaired.”
11:19 PM PDT We continue to make progress in recovering affected instances and volumes. Approximately 50% of impacted instances and 33% of impacted volumes have been recovered.
Jun 30, 12:15 AM PDT We continue to make steady progress recovering impacted instances and volumes. Elastic Load Balancers were also impacted by this event. ELBs are still experiencing delays in provisioning load balancers and in making updates to DNS records.
Jun 30, 12:37 AM PDT ELB is currently experiencing delayed provisioning and propagation of changes made in API requests. As a result, when you make a call to the ELB API to register instances, the registration request may take some time to process. As a result, when you use the DescribeInstanceHealth call for your ELB, the state may be inaccurately reflected at that time. To ensure your load balancer is routing traffic properly, it is best to get the IP addresses of the ELB’s DNS name (via dig, etc.) then try your request on each IP address. We are working as fast as possible to get provisioning and the API latencies back to normal range.
Jun 30, 1:42 AM PDT We have now recovered the majority of EC2 instances and are continuing to work to recover the remaining EBS volumes. ELBs continue to experience delays in propagating new changes.
Jun 30, 3:04 AM PDT We have now recovered the majority of EC2 instances and EBS volumes. We are still working to recover the remaining instances, volumes and ELBs.
Jun 30, 4:42 AM PDT We are continuing to work to recover the remaining EC2 instances, EBS volumes and ELBs.
Jun 30, 7:14 AM PDT We are continuing to make progress towards recovery of the remaining EC2 instances, EBS volumes and ELBs.
Jun 30, 8:38 AM PDT We are continuing our recovery efforts for the remaining EC2 instances and EBS volumes. We are beginning to successfully provision additional Elastic Load Balancers.
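
The 12:37 AM update suggests resolving the ELB’s DNS name and trying a request against each IP address. For anyone who would rather script that than run dig by hand, here is a minimal Python sketch using only the standard library. The function names and the hostname in the usage comment are made up for illustration, and it assumes a plain HTTP listener on port 80; it is not an official AWS tool.

```python
import http.client
import socket


def resolve_ipv4(hostname, port=80):
    """Return the sorted, de-duplicated IPv4 addresses behind a DNS name."""
    infos = socket.getaddrinfo(hostname, port, family=socket.AF_INET,
                               type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def probe(ip, hostname, port=80, path="/", timeout=5.0):
    """Send one GET to a specific IP, keeping the original Host header.

    Returns the HTTP status code, or None if the request failed.
    """
    conn = http.client.HTTPConnection(ip, port, timeout=timeout)
    try:
        conn.request("GET", path, headers={"Host": hostname})
        return conn.getresponse().status
    except (OSError, http.client.HTTPException):
        return None
    finally:
        conn.close()


def check_all(hostname, port=80, path="/",
              resolver=resolve_ipv4, prober=probe):
    """Probe every address behind hostname; return {ip: status_or_None}."""
    return {ip: prober(ip, hostname, port=port, path=path)
            for ip in resolver(hostname, port)}


# Usage (hypothetical load balancer name):
#   for ip, status in check_all("my-elb.example.com").items():
#       print(ip, "no response" if status is None else status)
```

Any address that comes back with no response or an error status is one the load balancer is not yet serving traffic from. The resolver and prober are passed in as arguments so the logic can be exercised without touching the network.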
