We Interrupt You to Tell You More About Being Interrupted

Last weekend we had a bit of a kerfuffle with one of our video streaming servers, which affected those of you on, or connecting to our servers on, the US west coast. Unfortunately, it took a bit longer for us to notice than it should have. We're sorry about that, and we learned a lot in the process. Now that we've had a chance to fix the problem, understand why it happened, and make a plan for preventing it in the future, here's a brief explanation.

Our streaming servers are built on Amazon EC2 instances, which come in numerous shapes, sizes, and configurations. Currently we use m1.large instances for streaming, along with the instance store storage that comes with them. In fact, we use m1.large instances precisely because of their copious instance store storage. The truth is we don't need the RAM or the CPU they provide, but we do need the disk space: the hundreds of hours of video we've got on the site, encoded at various bit rates, currently total about 225GB of storage.

Disappearing videos

We learned a lesson last weekend when the instance store volume on one of our instances, which holds the videos you watch when streaming, disappeared. Why? Who knows. It was just... gone. And if you've ever had someone pull the hard drive out of your computer while you were trying to read and write data to it, you know it's not pretty. I wasn't actually there to witness it, but the logs tell the story of a server that was happily chugging along, doing its own thing, until someone snuck up behind it and yelled "Boo!" It just never really recovered from the initial shock.

[Image: CA Outage Stream/CPU Load Graphs]

Once we recognized what had happened, we were able to get a new instance store volume mounted and configured after a little bit of finagling, and then we just had to sit back and wait for it to sync all 225GB of data from S3, which took about 10 hours. In the meantime, we re-routed all of our US west coast traffic to servers on either the US east coast or in Asia, depending on location. That resolved the playback failures everyone was having, though not perfectly, since the increased network latency degraded streaming quality for some of our users. We're now back to being fully operational, and along the way we got to test a few of our fallback strategies, which worked nicely.
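For the curious, the re-sync itself is conceptually simple: walk the S3 bucket and pull down anything the new volume doesn't already have. Here's a minimal sketch in Python using boto3; the bucket name and mount point are made up for illustration, and our actual sync tooling differs.

    # Minimal sketch of re-syncing encoded videos from S3 to a freshly
    # mounted volume. The bucket name and mount point are hypothetical.
    import os
    import boto3

    BUCKET = "example-video-bucket"   # hypothetical bucket name
    DEST = "/mnt/videos"              # the freshly mounted volume

    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")

    for page in paginator.paginate(Bucket=BUCKET):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            local_path = os.path.join(DEST, key)
            # Skip anything that already exists locally with the same size.
            if os.path.exists(local_path) and os.path.getsize(local_path) == obj["Size"]:
                continue
            os.makedirs(os.path.dirname(local_path), exist_ok=True)
            s3.download_file(BUCKET, key, local_path)
            print("synced %s (%d bytes)" % (key, obj["Size"]))

Something like "aws s3 sync" does the same job from the command line; either way, pulling 225GB over the network is the slow part no matter how you script it.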

The experience left a bit of a sour taste, though. While our fallbacks worked, it was 24 hours from when we noticed the issue to final resolution, which is a lot longer than we would have liked, and the 30+ hours before we even knew there was a problem isn't good either. So what are we going to do about it?

New storage and monitoring

None of us want to work weekends, but that doesn't mean our computer counterparts can't do it for us. We've now got monitoring in place to alert us if something like this happens again, so we can at least flip the fallback switch, get you back to watching your videos, get on with our weekends, and then clean up the mess on Monday.
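The check itself doesn't have to be fancy. As a rough illustration (the mount point and email addresses below are hypothetical, and in practice we wired this into our existing monitoring rather than a hand-rolled mailer), a cron job just needs to verify the volume is still mounted and writable, and yell if it isn't:

    # Sketch of a cron-able health check for the video volume.
    # The mount point and addresses are hypothetical.
    import os
    import smtplib
    import sys
    from email.message import EmailMessage

    MOUNT_POINT = "/mnt/videos"
    ALERT_TO = "ops@example.com"

    def volume_is_healthy(path):
        """The volume should still be mounted and should accept a small test write."""
        if not os.path.ismount(path):
            return False
        try:
            probe = os.path.join(path, ".health_check")
            with open(probe, "w") as f:
                f.write("ok")
            os.remove(probe)
            return True
        except OSError:
            return False

    if not volume_is_healthy(MOUNT_POINT):
        msg = EmailMessage()
        msg["Subject"] = "Video volume check failed on %s" % os.uname().nodename
        msg["From"] = "monitor@example.com"
        msg["To"] = ALERT_TO
        msg.set_content("The streaming volume looks unhealthy. Time to flip the fallback switch.")
        with smtplib.SMTP("localhost") as smtp:
            smtp.send_message(msg)
        sys.exit(1)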

We're also investigating Amazon's Elastic Block Store (EBS) for storing our files instead of the ephemeral instance store we're currently using. We knew going into it that instance store was "temporary", but I mistakenly assumed that meant it wouldn't survive the stop/start of an instance, not that it could also randomly disappear. EBS doesn't provide the same data input/output (I/O) rates as instance store (think of it as network-attached storage), so we were initially concerned about its performance; in our testing, throughput from EBS to EC2 is about 10MB/s. But after about a year of running this setup, we've got the data to say that EBS is probably fast enough.
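Our performance testing was nothing fancier than timing large sequential writes against each volume type. Something along these lines (the target path and size are illustrative, not our exact benchmark):

    # Rough sequential-write throughput test, the kind of thing behind the
    # numbers above. The target path and size are illustrative.
    import os
    import time

    TARGET = "/mnt/ebs-test/throughput.bin"  # hypothetical EBS mount point
    SIZE_MB = 1024                           # write 1GB in 1MB chunks

    chunk = b"\0" * (1024 * 1024)
    start = time.time()
    with open(TARGET, "wb") as f:
        for _ in range(SIZE_MB):
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())  # make sure the data actually reaches the volume
    elapsed = time.time() - start
    print("wrote %d MB in %.1fs (%.1f MB/s)" % (SIZE_MB, elapsed, SIZE_MB / elapsed))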

So as we speak we're in the process of deploying a couple of EC2 m1.small instances, backed by EBS volumes, to our fleet to see how they handle it. If it works, the big win will be that EBS is far more permanent than instance store, and because EBS volumes (and snapshots of them) outlive any single instance, we won't have the painful wait of roughly 24 hours to sync all 225GB of data from S3 whenever we need to create new instances, in Asia or anywhere else. Because who knows, at this rate someone might just come along and pull the CPU out next.
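If you're curious what "backed by an EBS volume" actually involves, it's roughly: create a volume in the instance's availability zone, attach it, then format and mount it from inside the instance. Here's a sketch with Python and boto3 as a modern illustration rather than our exact tooling; the region, zone, size, instance ID, and device name are all placeholders.

    # Sketch of creating an EBS volume and attaching it to an instance.
    # The region, zone, size, instance ID, and device name are placeholders.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-west-1")

    # A volume with some headroom over the ~225GB video library.
    volume = ec2.create_volume(AvailabilityZone="us-west-1a", Size=300)

    # Wait for the volume to become available, then attach it.
    ec2.get_waiter("volume_available").wait(VolumeIds=[volume["VolumeId"]])
    ec2.attach_volume(
        VolumeId=volume["VolumeId"],
        InstanceId="i-0123456789abcdef0",  # placeholder instance ID
        Device="/dev/sdf",
    )
    # From inside the instance, the volume still needs a filesystem (first time)
    # and a mount point before it can serve videos.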

At least we'll be a little better prepared this time though.
