When your systems administrator asks if you’re sitting down, you know it’s not going to be good. He then informed me that a background process I recently wrote had gone a little crazy. How crazy, you ask? $13,000 crazy.
It’s not rare for me to make mistakes. Things happen, our amazing students respond with bug reports, and we fix them. And while this mistake didn’t come from traditional channels, it could have been just as easily caught by listening a little more closely and having a greater awareness of our systems.
In order to understand what went wrong, you’ll need a little background. Since Code School started, we’ve hosted videos on Viddler, an amazing video encoding and hosting service that has served us extremely well and continues to do so. In the years we’ve worked with them, there have only been a handful of times when Viddler was down long enough for us to worry. Videos are a critical part of our business, so having a single point of failure has always been a major concern. So, in mid-2014 I was working to add a hot-swap backup for Viddler.
To better organize backups, we moved the video responsibility to a single application (we call it Projector) that allows us to have backups of our videos and a failover in place for the rare times when Viddler is having issues. If you’re watching videos in a course, playing them from our iOS application, or viewing them anywhere else on Code School, they’re going through Projector.
Part of rolling this project out also meant creating a script that downloaded every existing video we have from Viddler and uploaded it to an alternate CDN acting as our hot spare. This is the piece of the puzzle that went wrong.
Projector itself is an extremely simple Ruby on Rails application. All it needs to do is redirect valid requests for videos to wherever that video is located (Viddler or a backup source). The first time a video is loaded on Projector, it will add a background job to copy that file over to our CDN.
We’re using Delayed Job as our queuing system, which makes creating background jobs as easy as creating a Ruby class and starting a process on the server. If a Delayed Job raises an exception, it’ll be retried later — but there’s a limit on how many times a job will be retried (25 by default). Here’s what our setup looks like for saving these files to our cache server:
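A minimal sketch of that setup, with illustrative names (`UploadVideo`, `save_as`) rather than the real Projector code:

```ruby
# With Delayed Job, any object that responds to #perform can be a job; the
# classic idiom is a small Struct holding just the record's id, so the
# serialized payload in the queue stays tiny.
UploadVideo = Struct.new(:video_file_id) do
  def perform
    video_file = VideoFile.find(video_file_id)
    # save_as downloads the file from Viddler and uploads it to our backup
    # CDN -- internally it calls p.service.save_as(self, video_file) for
    # the storage provider p. This is the call that later caused trouble.
    video_file.save_as
  end
end

# In the Rails model, the job is queued right after the record is created,
# so the web request that created it returns immediately:
#
#   class VideoFile < ActiveRecord::Base
#     after_create { Delayed::Job.enqueue UploadVideo.new(id) }
#   end
```

If the `perform` method raises, Delayed Job marks the attempt failed and retries later, up to the attempt limit.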
All of this code worked as expected! Whenever a new VideoFile object is created, we queue an UploadVideo job for S3. This lets the initial creation of the VideoFile object happen extremely fast; then, in our Delayed Job process, we can download and upload the file, letting that take as long as it needs without impacting the end user.
What Went Wrong?
Did you notice that p.service.save_as(self, video_file) code in the save_as method? That’s our culprit. More specifically, the part of that call that downloads the file. Here’s the offending code that caused this error (somewhat simplified). It’s important to know this worked for any file under about 1GB.
Wait, I didn’t tell you what error we were getting, did I? Well, we weren’t getting any kind of error. Delayed Job didn’t say anything was wrong, and BugSnag was quiet as well. Can you tell what’s wrong with it?
If you’re shaking your head, then you can probably understand my grief. Since this code worked in most cases, what was actually going wrong was that our Delayed Job process was hitting the memory limit we set up for it on the system side, and it was being automatically killed and restarted.
It’s all in how I handled the downloading. The Delayed Job process kept the entire video in memory, ballooning the process’s size during each download and releasing it only once the file had been uploaded. If a file pushed the process above our 1,024MB memory limit, our automated server monitoring killed the Delayed Job worker before it could finish. This caused the job to be re-run over and over.
How It Cost Us
If you have an extremely fast Internet connection and are constantly downloading in multiple processes, 24 hours a day, for multiple days, it turns out you can rack up quite a large bill. Our host at the time charged based on the percentile of our bandwidth usage. And the week this script went crazy, we managed to download a good 30TB — mostly just the same handful of files over and over again — putting it solidly in the 99th percentile for bandwidth.
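To see why a stuck retry loop is so expensive under this model, here’s a rough sketch of percentile (“burstable”) billing. The sampling interval and the 95th-percentile cutoff are assumptions for illustration, not our host’s actual terms:

```ruby
# Percentile billing: sample bandwidth periodically, sort the samples,
# discard the top (100 - p)% as "bursts", and bill the highest remaining
# sample as the sustained rate.
def billable_rate(samples_mbps, percentile)
  sorted = samples_mbps.sort
  index = (percentile / 100.0 * sorted.length).ceil - 1
  sorted[index]
end

# Normally, occasional spikes fall outside the percentile and cost nothing:
normal = [10] * 95 + [900] * 5   # mostly idle, a few bursts
# But a job re-downloading the same files around the clock fills nearly
# every sample, so there are no outliers left to discard:
stuck = [900] * 100

billable_rate(normal, 95)  # => 10 (bursts are discarded)
billable_rate(stuck, 95)   # => 900 (billed at the full burst rate)
```

The constant re-downloading didn’t just waste bandwidth; it turned what billing normally treats as a discardable burst into our sustained rate.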
While this was a mistake that hurt to learn, it did teach me a lot of important lessons that I hope can save you potential trouble. So I’ll leave you with my biggest takeaways from this experience:
Know all system limitations. This includes everything from the database and the web application to the Delayed Job workers and server-side scripts.
Have better reporting. If a server script is being killed non-stop, it’s okay to let it be a little chatty. Communicate any concerns with your team no matter where they are — a log isn’t going to let anyone know.
Know your host. Be aware of how your host handles not just servers, but also bandwidth billing. It’s rare that something will go this haywire, but it’s important to understand the worst-case scenario and plan for it.