Reimagining video infrastructure to empower YouTube

Editor's note by Scott Silver, YouTube’s VP of Engineering:

Running a global platform with massive amounts of video being uploaded, stored, and distributed at every moment of the day for its millions of creators and billions of viewers is a complex and demanding task. But if all works as it should, it’s accomplished in a way that no one ever notices. In this installment of our Innovation series, we give a rare inside look into an important innovation that ushered in a new era of video infrastructure for YouTube. Jeff Calow, the lead software engineer, takes us through the creation of a pioneering system that has powered our platform through a surging pandemic viewership, and will carry us well into the future.

In a nutshell, what is the innovation that you just announced at the ASPLOS conference? Can you explain why it’s important for the average YouTube viewer or creator?

Jeff: Our mission is “To give everyone a voice and show them the world.” Let anyone upload a video to show anyone else in the world, for free. That takes a lot of processing power. Several years ago, as the scale of videos on our platform grew to dramatic levels, we needed to come up with a new system that would let creators continue to upload seamlessly, and viewers watch with all the choices they’ve come to expect.

An important thing to understand is that video is created and uploaded in a single format, but will ultimately be consumed on different devices - from your phone to your TV - at different resolutions. Some viewers will be streaming to a 4K TV at home and others will be watching on their phone while riding the bus. The infrastructure team’s job is to get those videos ready for you to watch in a process called transcoding: compressing videos so that we send the smallest amount of data to your chosen device with the highest possible video quality. But it’s costly and slow, and doing that processing using regular computer “brains” (called CPUs) is pretty inefficient, especially as you add more and more videos.
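To make the "one upload, many outputs" idea concrete, here is a minimal sketch of how a single uploaded video might fan out into several renditions. It is an illustration only, not YouTube's actual pipeline: the list of output heights, the `Rendition` type and the `plan_renditions` function are all hypothetical names chosen for this example, and the only codecs used are the two named in this interview.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical output heights in pixels (assumption, not YouTube's real list).
LADDER_HEIGHTS = [144, 240, 360, 480, 720, 1080, 1440, 2160, 4320]
# The two codecs discussed in this interview.
CODECS = ["h264", "vp9"]


@dataclass(frozen=True)
class Rendition:
    codec: str
    height: int


def plan_renditions(source_height: int) -> list[Rendition]:
    """List every (codec, height) output at or below the source height."""
    heights = [h for h in LADDER_HEIGHTS if h <= source_height]
    return [Rendition(codec, h) for codec in CODECS for h in heights]


# A 1080p upload fans out into one output per codec per height up to 1080.
plan = plan_renditions(1080)
for rendition in plan:
    print(rendition.codec, rendition.height)
```

Even in this toy version, every upload turns into many separate encoding jobs, which is why the cost of each encode matters so much at scale.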

So we created a new system for transcoding video that lets us do this process much more efficiently at our data centers, and at warehouse scale. We decided to leverage an idea that computer scientists have been working on for years - to develop a special “brain” for this specific work. In other fields, there are special brains for graphics (GPUs) or artificial intelligence (TPUs). In our case, we developed a custom chip to transcode video, as well as software to coordinate these chips. And we put it all together to form our transcoding special brain – the Video (trans)Coding Unit (VCU). We’ve seen up to 20-33x improvements in compute efficiency compared to our previous optimized system, which was running software on traditional servers.

Picture of a Video Coding Unit

Except in the rare case when there’s an outage, it’s easy to forget how much work goes on behind the scenes just to keep YouTube running. Can you give us some technical perspective on the scope and complexity of running a global platform of this size 24/7?

Jeff: When I interview candidates for jobs here, I always mention how more than 500 hours of video content on average is uploaded every minute to the platform - that always resonates with them. During the Covid-19 pandemic we saw surges in video consumption as people sheltered at home. In the first quarter of last year, we saw a 25 percent increase in watchtime around the world. And for the first half of last year, total daily livestreams grew by 45 percent. Because we had this system in place, we were able to rapidly scale up to meet this surge. Practically, this meant that videos were available for viewers promptly after the creator uploaded them.

[Left] H.264 [Right] VP9

You first kicked off this project in 2015 - what did you see then that drove the need to find a new infrastructure solution?

Jeff: Several years ago, we saw rising demand for higher quality video (e.g. 1080p, 4K and now 8K). We also saw that the broader Internet wouldn’t be able to accommodate this growth unless we shifted to more data-efficient video codecs (codecs are basically different ways to compress video data). However, data-efficient video codecs like VP9 use more computer resources to encode than H.264. The combination of these dynamics led us to pursue a dramatically more efficient and scalable infrastructure. Here's a comparison of the image quality in a Janelle Monáe video. The VP9 version clearly looks better than the legacy H.264, but it uses 5x more computer resources to encode.
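A rough back-of-the-envelope calculation shows why that 5x figure matters. The sketch below uses only two numbers quoted in this interview (more than 500 hours uploaded per minute, and VP9 costing about 5x as much to encode as H.264). Everything else is a simplifying assumption: it treats encode cost as proportional to video duration, ignores resolution and the number of renditions, and the function name `relative_encode_load` is made up for illustration.

```python
from __future__ import annotations

# Figures quoted in the interview.
UPLOAD_HOURS_PER_MINUTE = 500  # average hours of video uploaded each minute
VP9_COST_VS_H264 = 5           # VP9 needs about 5x the encode compute of H.264


def relative_encode_load(hours_uploaded: float, codec_costs: dict[str, float]) -> float:
    """Encode load in 'H.264-hour' units, assuming cost scales with duration."""
    return hours_uploaded * sum(codec_costs.values())


hours_per_day = UPLOAD_HOURS_PER_MINUTE * 60 * 24
h264_only = relative_encode_load(hours_per_day, {"h264": 1})
h264_and_vp9 = relative_encode_load(
    hours_per_day, {"h264": 1, "vp9": VP9_COST_VS_H264}
)

print(hours_per_day)             # 720000 hours of new video per day
print(h264_and_vp9 / h264_only)  # 6.0
```

Under these assumptions, producing a VP9 version alongside H.264 for every upload takes roughly six times the compute of H.264 alone, applied to about 720,000 hours of new video each day.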

How daunting was it to be a team of software engineers working to create hardware?

Jeff: Luckily, most of what we were doing was a full system, so I had a vertically integrated team that was broadly spread with a clear differentiation of people’s responsibilities. This included colleagues with more hardware experience working lower down, close to the hardware, and other folks who weren't. But to tell you the truth, it didn’t feel that daunting. It was an exciting opportunity to learn a bunch of new and interesting things. Maybe there was a level of optimism and naivety going into it about how hard it would actually be. On the flip side, a lot of the hardware development that we actually talked about in our paper had “software-like” aspects to it, which also made this seem less difficult than it actually was. But when you have the caliber of people and collaboration that you do at Google and YouTube, that makes it even less daunting.

What were some of the biggest risks you faced along the way, and how did you confront them? Did you encounter a lot of naysayers?

Jeff: Hardware in general is a risk because it’s a long-term commitment. So a specific fundamental risk was the development of this new chip and getting it right the first time. You spend a lot of time developing it, and if it doesn’t work, you have to go back and fix it and manufacture another chip. And that would push everything back by a long time. Preemptively, we were actually simulating the hardware with software and specialized emulation hardware. A lot of effort went into these simulations to minimize the risk. As for naysayers, there were some, but we had many strong advocates for this on the hardware side of the company, as well as on the YouTube executive side, who were very prescient and saw the value of what we were doing.

You think of a massive project like this across multiple teams and departments, and all the sophistication involved in bringing together technology at this scale. But we heard that at some point you were derailed by a loose screw? What happened?

Jeff: We deployed a machine in the data center and it failed our burn-in test: one of the chips just didn't come up, and we had no idea why. So we were trying to run a whole bunch of diagnostics, and then the hardware tech opened up the carrier machine and noticed a loose screw sitting on one of the baffles. It was basically shorting out one of the voltage regulators, so that chip couldn't come up. It was a screw that had come loose in shipping. Nothing caught on fire or anything like that, but it was just like, yeah, a screw?

What kind of precedent does this new system set for the future of video infrastructure? What’s next for you?

Jeff: One of the things about this is that it wasn't a one-off program. It was always intended to have multiple generations of the chip with tuning of the systems in between. And one of the key things that we're doing in the next-generation chip is adding in AV1, a new advanced coding standard that compresses more efficiently than VP9, and has an even higher computation load to encode.

As for me, I’ll be continuing my work on this project, developing future generations, which will keep me busy for a while.