Video Codecs, and Rapidly Adapting to Network Conditions

Genius lies in inventing stuff that is totally obvious in retrospect
 — Me
Every now and then you see stuff that genuinely surprises you. Oh, not in the “it’s a floor-wax and a salad-dressing” sense, but in the “good god, that is so ridiculously obvious, why didn’t I think of that?” sense.
Fouladi et al. (•) have pulled off two of these with their new approach to video encoding — in how they architect streaming video, and in how they actually wrote the codec.
Video Streaming
Let’s look at the architecture first. Most (all?) video streaming systems these days have two major components: a codec to encode the video, and a transport protocol to get it to the destination. The way the system usually works is that the transport protocol estimates the network capacity as best it can, by looking at congestion, RTT, etc. The codec then uses this capacity estimate to figure out how much compression to apply to the video, so that the receiver gets something that is at least somewhat reasonable.
This works pretty well in setting up the video stream (hmm, a low-bandwidth connection, let’s compress the f**k out of the stream, so that your Netflix movie is kinda blocky), and adapting when things get better (hey! more bandwidth! let’s send higher quality video down the pipe!).
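To make that concrete, here’s a minimal sketch of the loosely-coupled setup. Every name and number in it is invented for illustration (real systems are much messier), but the shape is the point: the two halves talk through a single capacity number, and only as often as the feedback loop runs.

```python
# Toy sketch of the traditional, loosely-coupled architecture.
# All names and numbers are hypothetical, purely for illustration.

def transport_estimate_kbps(rtt_ms: float, loss_rate: float) -> float:
    """Transport side: guess at capacity from congestion signals."""
    link_ceiling_kbps = 5000.0  # made-up link ceiling
    return link_ceiling_kbps * (1.0 - loss_rate) / max(rtt_ms / 50.0, 1.0)

def codec_pick_quantizer(capacity_kbps: float) -> int:
    """Codec side: map the (possibly stale) estimate to a compression level."""
    if capacity_kbps > 3000:
        return 10   # light compression, nice-looking video
    if capacity_kbps > 500:
        return 40   # medium
    return 63       # compress the f**k out of it

# The two halves only communicate through this one number, now and then:
quantizer = codec_pick_quantizer(transport_estimate_kbps(rtt_ms=80, loss_rate=0.02))
```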
The Problem
The downside, of course, is that the feedback loop can take a while. As long as the network conditions change slower than the feedback-loop response, it’s all good. But if it’s the other way around, e.g., you’re blasting along the freeway on your cell phone (••), well, things can get … interesting. You can even end up with weird resonance-like edge effects where the buffering gets all f**ked up, and systems crap out.
The issue here is that the loose coupling used in most streaming architectures (the video codec is on THIS side of the wall, the network transport is on THAT side of the wall) can end up causing problems when bandwidth is changing more rapidly than the two sides can communicate.
You can’t just speed up the codec’s response time, since, in most implementations, the codec state is basically thrown away in real time. The codec gets a key-frame (think “the first frame after a scene transition”), encodes/compresses it, and ships it out. It then starts shipping out only the changes between each frame — changes between ‘1’ and ‘2’, changes between ‘2’ and ‘3’, and so on — and pretty much forgets everything that happened before (yes, this is a total lie. But it is an understandable lie, and should serve to give you an idea of how things work).
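In code, that understandable lie looks roughly like this toy (frames as flat lists of pixel values; all of it made up for illustration):

```python
# Toy key-frame + delta encoding: ship frame 1 whole, then only differences.

def encode(frames):
    prev = None
    for frame in frames:                       # each frame: a list of pixel values
        if prev is None:
            yield ("key", frame)               # full key-frame
        else:
            yield ("delta", [b - a for a, b in zip(prev, frame)])
        prev = frame                           # only the last frame is remembered

def decode(stream):
    current = None
    for kind, data in stream:
        if kind == "key":
            current = list(data)
        else:
            current = [a + d for a, d in zip(current, data)]
        yield current

frames = [[10, 10, 10], [10, 11, 10], [12, 11, 9]]
assert list(decode(encode(frames))) == frames
```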
Awesome Thing №1 : Functional Programming
What Fouladi et al. did was rewrite the codec they used (VP8) in a functional style, with nice pure functions, no side effects, and explicit, externalized state. This allows the encoder to try out different options when encoding, and if it doesn’t like the result, roll back and try again — it can now change compression settings on the fly, on a frame-by-frame basis!
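Here’s the shape of that idea in a toy Python sketch, not the actual Salsify code (the real encoder is a VP8 implementation, and everything below, names and quantization scheme included, is my invention): encode() is a pure function from (state, frame, quality) to (new state, output), so “rolling back” is simply not using the new state.

```python
from typing import NamedTuple

class CodecState(NamedTuple):
    reference: tuple   # the decoder's view of the previous frame

def encode(state: CodecState, frame: tuple, quality: int):
    """Pure function: (state, frame, quality) -> (next_state, size, payload).
    No mutation anywhere, so 'rolling back' is just not using next_state."""
    step = quality  # bigger step -> coarser quantization -> smaller payload
    residual = [((b - a) // step) * step for a, b in zip(state.reference, frame)]
    payload = [d for d in residual if d != 0]   # pretend zeros compress to nothing
    next_reference = tuple(a + d for a, d in zip(state.reference, residual))
    return CodecState(next_reference), len(payload), payload

state = CodecState(reference=(0,) * 8)
frame = (1, 1, 9, 9, 1, 1, 0, 0)

# Try two quality settings from the SAME starting state, keep whichever fits:
for q in (1, 4):
    candidate_state, size, _ = encode(state, frame, q)
    print(f"step={q}: payload of {size} deltas")
# `state` itself is untouched, so we're free to commit either candidate (or neither).
```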
Awesome Thing №2 : “Combine” the Codec and Transport
Now that one can change the codec on the fly, you can have the transport protocol tell the codec about changes in network capacity in real time, and the codec can adaptively change the encoding to match. This is really cool because network traffic can change on an almost packet-by-packet basis (•••), and if you can get the codec to follow along, then, well, I don’t see how you could do any better!
The way this works is that (a) the transport tells the codec about the latest network conditions, (b) the codec generates frames at a size and quality that match the network capacity, and (c) this prevents the encoder from creating in-network buffer overflows or queueing delays (since the video transmission is directly matched to the network’s varying capacity).
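Putting (a) through (c) together, reusing the toy encode() from above. Caveat: the selection logic here is my drastic simplification for illustration, not the paper’s exact scheme.

```python
def next_frame_to_send(state, frame, budget, steps=(1, 2, 4, 8)):
    """(a) `budget` is the transport's freshest per-frame size target.
    (b) The pure encoder lets us try several candidates risk-free.
    (c) If nothing fits, send nothing rather than queueing up delay."""
    best = None
    for q in steps:
        cand_state, size, payload = encode(state, frame, q)
        if size <= budget and (best is None or size > best[1]):
            best = (cand_state, size, payload)  # richest frame that still fits
    if best is None:
        return state, None      # skip this frame; state is unchanged
    return best[0], best[2]
```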
OK, you don’t actually combine the codec and the transport — that’s why I have “Combine” in quotes! — there is still an abstraction boundary between the two. The feedback loop, tho’, is about as tight as it can get, which is cool!
Results?
It’s totally cool stuff, and as the authors put it,
“Salsify outperforms the commercial state of the art — Skype, FaceTime, Hangouts, and the WebRTC implementation in Google Chrome, with or without scalable video coding — in terms of end-to-end video quality and delay (Figure 1 gives a preview of results).”
Also, these “results suggest that improvements to video codecs may have reached the point of diminishing returns in this setting, but changes to the architecture of video systems can still yield significant benefit.”
(•) “Salsify: Low-Latency Network Video Through Tighter Integration Between a Video Codec and a Transport Protocol”, by Fouladi et al.
(••) I assume you’re the passenger, and not watching video whilst driving…
(•••) Congestion control is a huge thing on the interwebs — read here for some nifty work that Google did recently (BBR) on helping out with this.
