Video Playback - The basics

The basics of video playback

Okay, so let’s cover some of the basics of video playback. You must have heard the term “motion pictures”, which is quite accurate. Video playback is a series of pictures being drawn on the screen in rapid succession, typically at 30 or 60 frames per second. The rate is so quick that the pictures turn into fluid motion and thus become “a video”.

Easy, right? Well, almost. When a video is rendered at “60 frames per second”, these aren’t always 60 full “pictures” per second. Not every frame is stored the same way. To optimize for storage there are “full frames” or key frames, called I-frames, that are basically the full pictures you initially thought of. However, storing 60 of those bad boys per second for a 2 hour movie becomes huge. A single 720p or 1080p image can easily be a few hundred KB even when compressed as a still picture, and several MB when stored raw, which results in huge files for just one movie. Try decoding those on an embedded device, downloading them on a slow connection, etc. Yeah, not very nice. Trying to watch a 57 GB version of 1 episode of “How I Met Your Mother” on your 4G device? Ouch indeed.
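To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes raw 24-bit RGB pictures (3 bytes per pixel) with no compression at all, which is my assumption for illustration and not how any real movie is shipped. The function names are made up for this example.

```python
# Back-of-the-envelope size of a movie stored as uncompressed full frames.
# Assumption: 24-bit RGB, i.e. 3 bytes per pixel, no compression at all.


def raw_frame_bytes(width: int, height: int, bytes_per_pixel: int = 3) -> int:
    """Size of one uncompressed picture in bytes."""
    return width * height * bytes_per_pixel


def raw_movie_bytes(width: int, height: int, fps: int, seconds: int) -> int:
    """Size of a movie if every single frame were stored as a raw picture."""
    return raw_frame_bytes(width, height) * fps * seconds


if __name__ == "__main__":
    two_hours = 2 * 60 * 60  # seconds
    frame = raw_frame_bytes(1920, 1080)
    movie = raw_movie_bytes(1920, 1080, fps=60, seconds=two_hours)
    print(f"one raw 1080p frame:        {frame / 1e6:.1f} MB")
    print(f"2 hour raw movie at 60 fps: {movie / 1e12:.2f} TB")
```

Even if per-picture compression shaved that down by a factor of 20 or more, you would still be far away from something you want to stream to a phone.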

To overcome this, many video encoding profiles (even the older ones) rely on B and P frames. These are predicted frames that only describe what changed relative to other frames: a P-frame refers back to earlier frames, while a B-frame can refer to both earlier and later ones. If you imagine a scene with a person talking, the landscape/scene/surroundings of the person do not change but the person talking does. Exactly that is captured in a B or P frame: the motion, without the static surroundings. Then every once in a while there is an I-frame or key frame to give the full picture.

How far these I-frames are apart is described by the GOP (Group of Pictures) size. Older encoding profiles will have a GOP size of 12 to 15, while more recent ones place I-frames 30 or 60 frames apart. Larger GOP size == better compression == lower total file size. Of course various encodings may use additional techniques to further reduce the file size, but this is the most generic concept of video encoding.
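A tiny sketch of why a larger GOP helps. The frame sizes below (200 KB for an I-frame, 20 KB for a predicted frame) are numbers I picked purely for illustration, and treating every B or P frame as the same size is a simplification. The function name is hypothetical.

```python
def average_frame_kb(gop_size: int, i_frame_kb: float, predicted_frame_kb: float) -> float:
    """Average size per frame for a GOP made of one I-frame followed by
    (gop_size - 1) predicted frames of equal size."""
    if gop_size < 1:
        raise ValueError("a GOP contains at least one frame")
    total_kb = i_frame_kb + (gop_size - 1) * predicted_frame_kb
    return total_kb / gop_size


if __name__ == "__main__":
    # Assumed sizes, for illustration only.
    for gop in (12, 15, 30, 60):
        avg = average_frame_kb(gop, i_frame_kb=200, predicted_frame_kb=20)
        print(f"GOP {gop:>2}: {avg:.1f} KB per frame on average")
```

With these made-up sizes the average drops from 35 KB per frame at a GOP of 12 to 23 KB at a GOP of 60. The flip side is that a glitch has to wait longer for the next I-frame to clean things up.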

Ever had the picture get all “blocky” on the darker grey parts of the movie then turn back to normal in a few seconds? I bet it was around 10 seconds and then it flipped back to normal. What you noticed is a B or P frame going wrong or the decoder on your device tripping on something, then recovering on the next I-frame for a full redraw of the scene. Now you know.
Word of advice: do not go too deep into the above. Once you start noticing the different frames in the movie you are playing and start picking up on these encoding errors (they’re more frequent than you think), it becomes a royal annoyance and can totally ruin your movie experience.

Of course this is not the only factor in file size. Resolution of the video is another key component, in combination with frame rate (how fast the images are going to be displayed). Larger pictures == larger size: as you would expect, a 240p movie is much smaller than a 1080p movie, and a movie with a higher FPS means more pictures per second == larger size. Mix in encoding and compression schemes and you’ve got 3 key variables that determine the size of a movie. Together they set the bitrate: the amount of data needed per second of video.
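Bitrate is normally quoted in bits per second, so a rough file size falls out of a simple multiplication. A minimal sketch, with bitrates I assumed for illustration (real encodes vary a lot) and ignoring audio and container overhead:

```python
def file_size_bytes(bitrate_bps: float, duration_s: float) -> float:
    """Approximate file size: bits per second times seconds, divided by 8 bits per byte."""
    return bitrate_bps * duration_s / 8


if __name__ == "__main__":
    two_hours = 2 * 60 * 60  # seconds
    # Assumed example bitrates, not taken from any specification.
    examples = {"240p": 400_000, "1080p": 5_000_000, "4K": 16_000_000}
    for label, bps in examples.items():
        size_gb = file_size_bytes(bps, two_hours) / 1e9
        print(f"{label:>5} at {bps / 1e6:.1f} Mbit/s -> {size_gb:.2f} GB")
```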


As playback devices and recording equipment advance, so do the different file formats. If you’re like me you remember the 80s with VHS (we even had Betamax!), when the resolution of the video was much lower, often 4:3 and around 240p. Now I’m watching 4K content on a tablet, streaming wirelessly over a 5G connection. As we all made that progression, so did the way we encode video. Higher video resolution demanded more advanced video encoding profiles, to further reduce size and ultimately save bandwidth. The processing power of video playback devices went up, allowing for more calculations to be done while playing back video. This provided more room for advanced interpolation and arithmetic in between full picture key frames, which resulted in higher video compression.

Naturally this means that older devices will support fewer video encoding profiles. An older SD or HD device will do MPEG-2, while most recent devices support MPEG-4 with H.264 (x264 is a popular encoder that produces it). The video encoding determines how the individual frames are stored and what magic encoding schemes can do in between full key frames.

I won’t bore you to death with the different encoding schemes and how they work. If you are intrigued, I’d recommend you google around for how H.264 works or search for a few of the older video encoding schemes. It is a world of its own and would require a whole book to do justice. But it’s important you grasp the overall meaning of a video encoding profile and what it means for a device to be able to play it.

Audio tracks are pretty much the same as video, just with audio samples instead of video frames. Just like video, audio has different encoding profiles as well. The most famous one is MP3, but you will also run into formats like Ogg Vorbis or FLAC. Audio has different encodings and compression schemes (lossy or lossless) that can be used. It also comes with similar decoder requirements as video. For embedded devices like STBs (set-top boxes) audio is often hardware accelerated too and not always handled in software (!), though there is much less variation than with video encoding, and processing audio is far less taxing on the device. Meaning you might get away with decoding audio on the CPU of a modern quad-core ARM device, whereas decoding video on the CPU is an absolute no-go. For high resolution video on an embedded device you need a dedicated hardware decoder.
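To get a feel for why audio is so much lighter than video, compare the raw (uncompressed) data rates. This is only a sketch of the amount of data involved, not of actual decoder cost. It assumes CD-quality stereo PCM and 24-bit 1080p video at 60 frames per second, and the function names are mine.

```python
def pcm_bitrate_bps(sample_rate_hz: int, bits_per_sample: int, channels: int) -> int:
    """Bits per second of uncompressed PCM audio."""
    return sample_rate_hz * bits_per_sample * channels


def raw_video_bitrate_bps(width: int, height: int, bits_per_pixel: int, fps: int) -> int:
    """Bits per second of uncompressed video."""
    return width * height * bits_per_pixel * fps


if __name__ == "__main__":
    audio = pcm_bitrate_bps(44_100, 16, 2)              # CD-quality stereo
    video = raw_video_bitrate_bps(1920, 1080, 24, 60)   # raw 1080p60
    print(f"raw audio: {audio / 1e6:.2f} Mbit/s")
    print(f"raw video: {video / 1e6:.0f} Mbit/s")
    # The raw video stream carries roughly 2000 times more data than the audio.
    print(f"ratio:     {video / audio:.0f}x")
```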

To compare with the PC you might be reading this from: it can decode pretty much everything on the CPU. The CPU is so powerful that even though it is a general-purpose processor, it can decode most video without breaking a sweat (though laptop users, I’m sure you’ve heard your fans spin up while watching something, so it sometimes needs to put in some effort). Embedded systems simply do not have the horsepower to do video decoding efficiently on the CPU and tend to come with high-class dedicated video decoders and scalers on board (outside of the CPU). Compared class to class with a generic desktop computer of the same era, some of the more premium embedded systems will actually look better when it comes down to decoding video on a big screen. For example, an HD video playing on an Intel Groveland embedded chip connected to a TV will look better than on a same-era desktop PC hooked up to the same TV. This is simply because the Intel Groveland chip and decoders are purpose-built for displaying HD on a big HD TV, whereas the desktop PC is not and does a lot more. It tends to expect a monitor (not a TV) and a full desktop environment. The picture often gets grainy because it’s in a weird resolution for the PC, whereas the decoder and scalers on the Intel Groveland are doing what they do best: creating a crisp and clear HD picture.


Okay, so we understand how different frames are stored; however, that’s just the video. A “movie” is rarely just about video. I mean, it was in the early days of motion pictures, but today we want audio too. And audio is pretty darn important for the experience. In fact it is so important that we have different audio tracks to meet different audio needs: from different audio languages (multi-language content) to different audio encodings for different audio setups. For example a 5.1 DTS audio track versus a stereo audio track, to accommodate the different audio setups of our end users.

It is not limited to audio and video though, we’d like some subtitles too please. Throw in different types of subtitles, for example for the hard of hearing versus just plain English or a different language. All of these are data tracks that belong to the audio/video and they all share the same time reference. Meaning that when a certain video frame is being rendered, the matching audio needs to be synchronized and thus processed at the same time (same goes for the subtitles).
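One way to picture that shared time reference: every sample in every track carries a timestamp on the same clock, and the player works through them in timestamp order. The sketch below is a toy model with made-up names (Sample, presentation_order) and made-up timestamps. A real player juggles decode buffers and clocks, which this ignores.

```python
import heapq
from typing import Iterable, Iterator, NamedTuple


class Sample(NamedTuple):
    pts_ms: int   # presentation timestamp in milliseconds, on the shared clock
    track: str    # "video", "audio" or "subs"
    payload: str  # stand-in for the actual frame / audio chunk / subtitle text


def presentation_order(*tracks: Iterable[Sample]) -> Iterator[Sample]:
    """Merge tracks that are each already sorted by time into a single
    stream ordered by presentation timestamp."""
    return heapq.merge(*tracks, key=lambda sample: sample.pts_ms)


if __name__ == "__main__":
    video = [Sample(0, "video", "frame 0"), Sample(33, "video", "frame 1"),
             Sample(67, "video", "frame 2")]
    audio = [Sample(0, "audio", "chunk 0"), Sample(21, "audio", "chunk 1"),
             Sample(43, "audio", "chunk 2"), Sample(64, "audio", "chunk 3")]
    subs = [Sample(50, "subs", "Hello!")]
    for sample in presentation_order(video, audio, subs):
        print(f"{sample.pts_ms:>3} ms  {sample.track:<5}  {sample.payload}")
```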

To help with that, all these tracks (video, audio, subtitles/data) are stored in a container that combines them for a player to process. Each track is stored in its own part of the container. As you can imagine, your device might support a limited set of containers. A container needs to be parsed by the video player, in software, and that requires a parser/code that understands that particular container format.

One of the most common containers is MPEG2-TS, or MPEG transport stream. It is, for example, what digital TV broadcasts and Blu-ray discs use (DVDs use the closely related MPEG program stream). It is the most generic, most commonly used container in media broadcasting. However, I’m sure you’ve run into others. The container is typically what the extension of a video file tells you: .mp4 means it has an MPEG-4 container, .mkv stands for Matroska, etc. Different containers require different software to “understand” them and find the different tracks to feed to the audio/video decoders on the device.
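File extensions can lie, so players usually peek at the first bytes instead. Here is a rough sketch of that idea for the three containers mentioned here. It relies on a few well-known markers: transport stream packets are 188 bytes long and start with the sync byte 0x47, MP4 files normally carry "ftyp" at byte offset 4, and Matroska/WebM files start with the EBML magic number. The function name and return labels are mine, and real detection code handles many more corner cases.

```python
TS_PACKET_SIZE = 188                              # bytes per transport stream packet
TS_SYNC_BYTE = 0x47                               # first byte of every TS packet
EBML_MAGIC = bytes([0x1A, 0x45, 0xDF, 0xA3])      # start of Matroska / WebM files


def sniff_container(data: bytes) -> str:
    """Guess the container from the first bytes of a file (best effort)."""
    if data[:4] == EBML_MAGIC:
        return "matroska/webm"
    if data[4:8] == b"ftyp":
        return "mp4"
    # Check the sync byte at the start of up to the first 5 packets;
    # require at least 2 packets so a stray 0x47 does not fool us as easily.
    packets = min(len(data) // TS_PACKET_SIZE, 5)
    if packets >= 2 and all(
        data[i * TS_PACKET_SIZE] == TS_SYNC_BYTE for i in range(packets)
    ):
        return "mpeg-ts"
    return "unknown"


if __name__ == "__main__":
    fake_ts = (bytes([TS_SYNC_BYTE]) + bytes(TS_PACKET_SIZE - 1)) * 3
    fake_mp4 = bytes([0, 0, 0, 0x18]) + b"ftypisom"
    print(sniff_container(fake_ts))
    print(sniff_container(fake_mp4))
    print(sniff_container(EBML_MAGIC + bytes(8)))
```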

Streaming method

You thought we were done, yeah? Me too. So we’ve got the individual frames, how they’re bundled together in a track, and we covered how the different audio/video/data tracks are stored in a movie container. But we can’t just send monolithic files down the pipe for a device to pick up (though we used to, hehe, but that didn’t work so well), because we now deal with devices that have transient bandwidth: bandwidth that isn’t guaranteed all the time. Think of moving through the different speeds of your less than optimal WiFi connection because you’re trying to watch YouTube from the far bathroom upstairs again. Whatever it is, your bandwidth isn’t guaranteed, especially if you’re wireless and even more so when you’re on the go. This calls for different bitrates that go with your bandwidth and device: for example, send low-res 240p to a mobile on 3G but send full 4K video to your TV hooked up to a direct CAT6 wired connection or a strong WiFi mesh network.

Sending monolithic files to the video player isn’t all bad. The HTTP specification allows for a byte offset (a range request) to jump directly to a specific section of the download. This allows players to “hop” to a known point in the video file. But it’s rather clunky and relies on HTTP headers: the server tells the client what the total file size is, and the player tells the server on each request where it would like to offset to. This method is often called progressive streaming (or progressive download) and is most likely one of the oldest ways of streaming video over IP. It works well for short clips where you do not expect the user to fast forward or rewind: just plain playout of something manageable in size that doesn’t require multiple bitrates for various bandwidth/device capabilities.
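The headers involved are Range on the request and Content-Range on the (206 Partial Content) response. A small sketch of building one and parsing the other, without doing any actual networking. The helper names are made up for this example.

```python
import re
from typing import Dict, Optional, Tuple


def range_header(start: int, end: Optional[int] = None) -> Dict[str, str]:
    """Build the request header asking for bytes start..end (inclusive).
    Leaving end out asks for everything from start to the end of the file."""
    if start < 0 or (end is not None and end < start):
        raise ValueError("invalid byte range")
    value = f"bytes={start}-" if end is None else f"bytes={start}-{end}"
    return {"Range": value}


_CONTENT_RANGE = re.compile(r"bytes (\d+)-(\d+)/(\d+|\*)")


def parse_content_range(value: str) -> Tuple[int, int, Optional[int]]:
    """Parse a Content-Range response header into (first, last, total).
    total is None when the server reports the size as unknown ('*')."""
    match = _CONTENT_RANGE.fullmatch(value.strip())
    if match is None:
        raise ValueError(f"unsupported Content-Range: {value!r}")
    first, last, total = match.groups()
    return int(first), int(last), None if total == "*" else int(total)


if __name__ == "__main__":
    print(range_header(1_000_000))                     # seek: rest of the file
    print(range_header(0, 1023))                       # just the first KiB
    print(parse_content_range("bytes 0-1023/146515"))
```

The clunky part is that the player has to work out for itself which byte offset matches “minute 42” of the movie.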

However, it is quite limited in functionality. To cope with different bandwidths and a wide variety of devices, the brilliant minds of video playback created mechanisms that allow for seamless bitrate changes, easier rewind/forward capabilities and, for better or worse, the ability to splice in other videos (for example advertisements).

We call that switching between different bitrates adaptive streaming, which means the same video is exposed in different qualities. To accommodate switching between bitrates, these protocols chunk the video into fragments so the video player can seamlessly “hop” to a higher or lower bitrate in between chunks.
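The decision a player makes before fetching each chunk can be sketched in a few lines. This is a deliberately naive version: pick the highest bitrate that fits within the measured bandwidth, minus some headroom. The bitrate ladder, the safety_factor parameter and the function name are all assumptions of mine; real players also look at buffer levels, screen size and more.

```python
from typing import Sequence


def pick_bitrate(
    available_bps: Sequence[int],
    measured_bandwidth_bps: float,
    safety_factor: float = 0.8,
) -> int:
    """Pick the highest available bitrate that fits in the measured bandwidth,
    keeping some headroom. Falls back to the lowest bitrate if nothing fits."""
    if not available_bps:
        raise ValueError("no bitrates to choose from")
    budget = measured_bandwidth_bps * safety_factor
    affordable = [bps for bps in available_bps if bps <= budget]
    return max(affordable) if affordable else min(available_bps)


if __name__ == "__main__":
    ladder = [400_000, 1_200_000, 3_000_000, 8_000_000]  # assumed example ladder
    for bandwidth in (300_000, 5_000_000, 20_000_000):
        chosen = pick_bitrate(ladder, bandwidth)
        print(f"measured {bandwidth / 1e6:>5.1f} Mbit/s -> play {chosen / 1e6:.1f} Mbit/s")
```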

As you might imagine by now, there’s more than one specification / streaming method that describes adaptive streaming. The gold standard is HTTP Live Streaming (or HLS), which is by now pretty well supported throughout. However, the newer, more feature-rich MPEG DASH is rising in popularity. Support for adaptive or progressive streaming depends on the video player that is being used on the device: either the browser itself or an MSE (Media Source Extensions) player that handles the parsing of the adaptive streaming manifest and deals with fetching the chunks of data. It does not tend to have a hardware dependency, as this just entails how the chunks of data are fetched before they are handed to the video pipeline on the device that actually handles the video playback.

This is handled in software on the CPU and needs to be supported by the player of choice, which can be either native or part of your application, depending on the approach chosen on a particular platform or device.
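To make the manifest part a bit more concrete: an HLS master playlist is a plain text file in which each #EXT-X-STREAM-INF line describes one quality, followed by the URI of that quality’s own playlist. Below is a minimal parsing sketch. The playlist content and function name are made up for this example, and a real player supports far more tags than this.

```python
import re
from typing import Dict, List, Tuple

# Matches KEY=value pairs where the value is either "quoted" (may contain commas)
# or runs up to the next comma.
_ATTRIBUTE = re.compile(r'([A-Z0-9-]+)=("[^"]*"|[^,]*)')

EXAMPLE_PLAYLIST = """#EXTM3U
#EXT-X-STREAM-INF:BANDWIDTH=800000,RESOLUTION=640x360,CODECS="avc1.4d401e,mp4a.40.2"
low/index.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=5000000,RESOLUTION=1920x1080,CODECS="avc1.640028,mp4a.40.2"
high/index.m3u8
"""


def parse_master_playlist(text: str) -> List[Tuple[Dict[str, str], str]]:
    """Return (attributes, uri) for every variant stream in a master playlist."""
    variants = []
    pending = None  # attributes waiting for the URI line that follows them
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.startswith("#EXT-X-STREAM-INF:"):
            attribute_text = line.split(":", 1)[1]
            pending = {
                key: value.strip('"')
                for key, value in _ATTRIBUTE.findall(attribute_text)
            }
        elif line and not line.startswith("#") and pending is not None:
            variants.append((pending, line))
            pending = None
    return variants


if __name__ == "__main__":
    for attributes, uri in parse_master_playlist(EXAMPLE_PLAYLIST):
        print(attributes["BANDWIDTH"], attributes.get("RESOLUTION"), uri)
```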

Final words

There are about 4 technical layers to the basics of video playback: the actual frames, the encoding, the container and the streaming method. And that creates a lot of possibilities with room for a lot of variations.

For example one movie can be encoded in:

  • SD MPEG-2
  • HD MPEG-2
  • 4K MPEG-4 H.264
  • 4K H.265 (HEVC)

Add in different containers:

  • MPEG2-TS
  • MPEG-4 (.mp4)
  • Matroska

With different audio tracks:

  • MPEG-2
  • MPEG-4
  • HE-AAC
  • FLAC
  • Vorbis (as used in WebM)

In different streaming methods:

  • HTTP Progressive
  • HTTP Live Streaming
  • MPEG DASH streaming

That’s 4 video formats (counting H.264 and H.265 separately) times 3 containers times 5 different audio tracks, available through 3 different streaming protocols: that’s 180 different possibilities for just this simple example. See where I’m going? These different possibilities require video player support and hardware decoding support from the device’s hardware. Video playback on an embedded device with a dedicated audio/video decode pipeline is hard, and I didn’t even mention DRM! DRM is a whole different ball game and requires a chapter of its own.
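If you want to see the 180 for yourself, here is a small sketch that enumerates the combinations and then filters them against what one device can play. The labels are shorthand for the lists above, and the device’s supported sets are entirely made up for illustration.

```python
from itertools import product
from typing import List, Sequence, Set, Tuple

VIDEO = ["SD MPEG-2", "HD MPEG-2", "4K H.264", "4K H.265"]
CONTAINERS = ["MPEG2-TS", "MPEG-4", "Matroska"]
AUDIO = ["MPEG-2", "MPEG-4", "HE-AAC", "FLAC", "Vorbis"]
STREAMING = ["HTTP Progressive", "HLS", "MPEG DASH"]

ALL_OPTIONS = [VIDEO, CONTAINERS, AUDIO, STREAMING]


def playable(options: Sequence[Sequence[str]],
             supported: Sequence[Set[str]]) -> List[Tuple[str, ...]]:
    """Keep only the combinations where every layer is supported by the device."""
    return [
        combo
        for combo in product(*options)
        if all(choice in allowed for choice, allowed in zip(combo, supported))
    ]


if __name__ == "__main__":
    print(len(list(product(*ALL_OPTIONS))))  # 4 * 3 * 5 * 3 combinations

    # Hypothetical device: what it supports per layer (video, container, audio, streaming).
    device = [
        {"HD MPEG-2", "4K H.264"},
        {"MPEG2-TS", "MPEG-4"},
        {"MPEG-2", "HE-AAC"},
        {"HTTP Progressive", "HLS"},
    ]
    print(len(playable(ALL_OPTIONS, device)))  # 2 * 2 * 2 * 2 combinations
```

So this imaginary device plays 16 of the 180 variations, and somebody has to make sure the content it is offered is one of those 16.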

tl;dr: There’s loads of different “movie playback” options and your device might not support all of them (equally). Getting video to run on a bespoke video device requires custom configuration, effort and time to get right.