404 Tech Support

Reddit is Down, Amazon Blamed 2 – This Time With Confirmation

Just over a month ago, a bit of controversy flared up over Reddit’s error message. It implied that Amazon Web Services was to blame, but the AWS status page didn’t reflect the same problems. A nice, long article resulted from the incident to explain why Reddit was down for 6 hours out of 24. This morning, the popular site is down again and blaming Amazon in its error message, but this time the message links to the Amazon status page, which shows three services in the Virginia data center having disruptions. Reddit probably stress-tests Amazon’s services harder than most customers ever would, but with Reddit’s reach, these events can’t be doing Amazon any favors in the marketing department.

Following last month’s event, the consensus seemed to be that Reddit needed to move to its own data center and rewrite its code to work more efficiently with a relational database system. I’d love to verify my memories of the discussions that followed but, of course, I can’t get to them since the site is down – including the Reddit blog.

Unlike last time, at least the Amazon Web Services Service Health Dashboard reflects that there are ongoing issues.

Here is what they say under the “more” link, in case the status isn’t listed later.

Amazon CloudWatch (N. Virginia) Delayed CloudWatch metrics:

2:26 AM PDT We are working on restoring connectivity to a small number of EC2, EBS, and RDS resources in multiple availability zones in the US-EAST-1 region. While we restore connectivity, CloudWatch metrics for those resources will be delayed.

3:04 AM PDT We are continuing to see connectivity issues impacting EC2, EBS, and RDS resources in multiple availability zones in the US-EAST-1 region. While we restore connectivity, CloudWatch metrics for those resources will be delayed. We continue to work towards resolution.

Amazon Elastic Compute Cloud (N. Virginia) – Instance connectivity, latency and error rates:

1:41 AM PDT We are currently investigating latency and error rates with EBS volumes and connectivity issues reaching EC2 instances in the US-EAST-1 region.

2:18 AM PDT We can confirm connectivity errors impacting EC2 instances and increased latencies impacting EBS volumes in multiple availability zones in the US-EAST-1 region. Increased error rates are affecting EBS CreateVolume API calls. We continue to work towards resolution.

2:49 AM PDT We are continuing to see connectivity errors impacting EC2 instances, increased latencies impacting EBS volumes in multiple availability zones in the US-EAST-1 region, and increased error rates affecting EBS CreateVolume API calls. We are also experiencing delayed launches for EBS backed EC2 instances in affected availability zones in the US-EAST-1 region. We continue to work towards resolution.

Amazon Relational Database Service (N. Virginia) – Database instance connectivity and latency issues:

1:48 AM PDT We are currently investigating connectivity and latency issues with RDS database instances in the US-EAST-1 region.

2:16 AM PDT We can confirm connectivity issues impacting RDS database instances across multiple availability zones in the US-EAST-1 region.

3:05 AM PDT We are continuing to see connectivity issues impacting some RDS database instances in multiple availability zones in the US-EAST-1 region. Some Multi AZ failovers are taking longer than expected. We continue to work towards resolution.

So, is Reddit moving to its own data center? Or is it improving the code and sticking with Amazon Web Services? Any site that has grown as quickly as Reddit is obviously going to have growing pains, but it’s time to move beyond that observation and start taking action to resolve the errors. People have even started coming up with mnemonics to remember what the error codes mean: “502, it went through; 504, try once more.”
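For anyone who wants the mnemonic in script form, here is a minimal Python sketch. It only encodes the community folk wisdom quoted above; it is not anything Reddit or Amazon documents, and the function names `advice_for` and `should_resubmit` are made up for illustration. A 502 is not a real guarantee that a submission went through.

```python
from http import HTTPStatus


def advice_for(status: int) -> str:
    """Return the folk advice from the mnemonic for an HTTP status code."""
    if status == HTTPStatus.BAD_GATEWAY:  # 502 Bad Gateway
        return "it went through"
    if status == HTTPStatus.GATEWAY_TIMEOUT:  # 504 Gateway Timeout
        return "try once more"
    # The mnemonic says nothing about any other status code.
    return "no mnemonic for this one"


def should_resubmit(status: int) -> bool:
    """Per the mnemonic, only a 504 suggests submitting again."""
    return status == HTTPStatus.GATEWAY_TIMEOUT


if __name__ == "__main__":
    for code in (502, 504):
        print(code, "->", advice_for(code))
```

The point of treating the two codes differently is to avoid double posts: if the mnemonic holds, resubmitting after a 502 would create a duplicate, while a 504 means the request likely never completed.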

Fortunately for Reddit, its community is loyal, but it won’t be long before traffic growth peaks and starts to decline because visitors get tired of seeing the “Reddit is down” error messages.

Update: Amazon is going to be hit hard with this one. The downtime is affecting big sites like Foursquare, Quora, Hootsuite, and others in addition to Reddit.

The Amazon Web Services Service Health Dashboard has had more updates posted to it but they’re not sounding very optimistic. Downtime seems to be 6+ hours at this point.