Intro to FFmpeg & Basic Concepts of Videos

Have you ever wanted to upload a video you had just edited to a website, only to be told that the format is not supported? Or tried to play a video you had just downloaded, only to find that your video player cannot handle the format? In these situations, FFmpeg can help.
What is FFmpeg
FFmpeg is a cross-platform, open-source multimedia framework.
What is the use of FFmpeg
It can be used for processing media (e.g. encoding, decoding, transcoding, muxing, demuxing, streaming, filtering, and playing). For example, you can use it to transform an MP4 video file you downloaded from the Internet into an MKV file.
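That MP4-to-MKV transformation is a single command. The sketch below assumes ffmpeg is installed; the sample clip is generated with a lavfi test source only so the commands are self-contained (in practice, input.mp4 would be your downloaded file):

```shell
# Generate a tiny sample clip so the example can run on its own.
ffmpeg -y -v error -f lavfi -i testsrc=duration=1:size=320x240:rate=10 \
       -c:v mpeg4 input.mp4

# Change the container from MP4 to MKV without re-encoding:
# "-c copy" copies the encoded streams as-is, so this is fast and lossless.
ffmpeg -y -v error -i input.mp4 -c copy output.mkv
```

Because `-c copy` never touches the encoded data, the operation takes seconds even for large files.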
Basic concepts
Structure of video files
First of all, we need to know that the videos we watch contain multiple components. Most videos have at least three parts: the video track, the audio track, and some metadata.
Video track
The video track contains only the image information, without any sound. It is a “muted” stream, completely separate from the audio. In other words, it stores a sequence of images, and only images.
Audio track
The audio track contains only the audio information, without any image data. It is purely sound, no different from an ordinary audio recording.
Metadata
The metadata is descriptive information about the video track, the audio track, and the file as a whole. For example, it may record when the video file was created, how large the video/audio tracks are, or whether other tracks are included (e.g. a subtitle track).
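You can inspect all of this yourself with ffprobe, the inspection tool that ships with FFmpeg. A minimal sketch, assuming ffmpeg and ffprobe are installed (the sample file is generated with lavfi test sources so the commands are self-contained):

```shell
# Create a sample file containing a video track and an audio track.
ffmpeg -y -v error -f lavfi -i testsrc=duration=1:size=320x240:rate=10 \
       -f lavfi -i sine=frequency=440:duration=1 \
       -c:v mpeg4 -c:a aac -shortest input.mp4

# Show container-level metadata (duration, size, format) and per-stream
# details (codec, resolution, sample rate) for every track in the file.
ffprobe -v error -show_format -show_streams input.mp4
```

The output lists one `[STREAM]` section per track, which is an easy way to check whether a file carries a subtitle track or extra audio tracks.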
Other terms
Encode
As mentioned before, the video track in a video file stores a sequence of images. If the track simply stored every image without any processing, the file would become too large to keep on a hard drive. Moreover, since consecutive images share a great deal of redundant information, it would be wasteful not to reuse the parts that stay the same.
Consequently, people came up with algorithms that reuse the shared information between different images, which saves a great deal of space in video files. The process of applying those algorithms to compress the data is called “encoding”.
The “encode” operation applies not only to video files but also to audio files, although audio requires different algorithms to achieve the same goal.
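The size savings are easy to demonstrate. A sketch, assuming ffmpeg is installed: we write one second of uncompressed frames to disk, then encode them, and compare the sizes.

```shell
# Produce one second of uncompressed frames (raw YUV images, no encoding).
ffmpeg -y -v error -f lavfi -i testsrc=duration=1:size=320x240:rate=10 \
       -f rawvideo -pix_fmt yuv420p raw.yuv

# Encode the raw frames; the encoder exploits redundancy between images.
ffmpeg -y -v error -f rawvideo -pix_fmt yuv420p -s 320x240 -r 10 -i raw.yuv \
       -c:v mpeg4 encoded.mp4

# Compare: the encoded file is dramatically smaller than the raw frames.
ls -l raw.yuv encoded.mp4
```

Even at this tiny resolution, the raw frames take over a megabyte per second, while the encoded file is a small fraction of that.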
Decode
As mentioned before, the “encode” process compresses the data; the “decode” process is its inverse, which decompresses (restores) the encoded data.
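Decoding can also be done explicitly. The sketch below (assuming ffmpeg is installed) creates a compressed AAC file, then decodes it back into raw PCM samples:

```shell
# Create a compressed (AAC) audio file from a generated sine tone.
ffmpeg -y -v error -f lavfi -i sine=frequency=440:duration=1 -c:a aac input.m4a

# Decode it back to raw PCM samples (16-bit signed little-endian).
# "-f s16le" is needed because a .pcm file has no container to describe itself.
ffmpeg -y -v error -i input.m4a -f s16le -acodec pcm_s16le output.pcm
```

The resulting .pcm file is just bare samples, the same kind of raw audio that existed before encoding.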
Transcode
Since we already know the definitions of “encode” and “decode”, imagine that we need to transform an AAC audio file into an MP3 file. In this process, we need to change the encoding algorithm (from AAC to MP3).
First, we decode the original AAC file into PCM (pulse-code modulation, a method of digitally representing sampled analog signals) data, which can be treated as the raw audio information. Second, we encode that raw audio with the other algorithm to produce an MP3 file.
The combination of these two steps is called “transcoding”.
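In practice ffmpeg performs both steps in one command, decoding to PCM internally and re-encoding on the fly. A sketch, assuming your FFmpeg build includes the libmp3lame encoder (the AAC source file is generated here only so the example is self-contained):

```shell
# Create an AAC source file from a generated sine tone.
ffmpeg -y -v error -f lavfi -i sine=frequency=440:duration=1 -c:a aac input.aac

# Transcode: ffmpeg decodes AAC to PCM internally, then encodes that to MP3.
ffmpeg -y -v error -i input.aac -c:a libmp3lame -b:a 192k output.mp3
```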
Mux
Since video tracks and audio tracks are separate, we need to encapsulate them into one container to produce a normal video file. That process is called “muxing”, and it happens after the video/audio streams have been encoded.
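Muxing can be sketched as follows (assuming ffmpeg is installed; the two single-track files are generated here so the example is self-contained):

```shell
# A video-only file and an audio-only file, encoded separately.
ffmpeg -y -v error -f lavfi -i testsrc=duration=1:size=320x240:rate=10 \
       -c:v mpeg4 video_only.mp4
ffmpeg -y -v error -f lavfi -i sine=frequency=440:duration=1 \
       -c:a aac audio_only.m4a

# Mux: put both already-encoded tracks into one container.
# "-c copy" means no re-encoding happens; only the container changes.
ffmpeg -y -v error -i video_only.mp4 -i audio_only.m4a -c copy muxed.mp4
```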
Demux
Similar to the concept of “decode”, the “demux” process is the inverse of “mux”; it happens before we decode the video/audio streams.
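Demuxing can likewise be done explicitly, pulling each encoded track out of the container into its own file. A sketch, assuming ffmpeg is installed (the sample input is generated so the commands are self-contained):

```shell
# Create a sample file containing both a video track and an audio track.
ffmpeg -y -v error -f lavfi -i testsrc=duration=1:size=320x240:rate=10 \
       -f lavfi -i sine=frequency=440:duration=1 \
       -c:v mpeg4 -c:a aac -shortest input.mp4

# Demux: extract each encoded track without re-encoding.
ffmpeg -y -v error -i input.mp4 -an -c:v copy video_only.mp4   # -an drops audio
ffmpeg -y -v error -i input.mp4 -vn -c:a copy audio_only.m4a   # -vn drops video
```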
Overview
The procedure of FFmpeg processing media files is shown below:
 _______              ______________
|       |            |              |
| input |  demuxer   | encoded data |   decoder
| file  | ---------> | packets      | -----+
|_______|            |______________|      |
                                           v
                                       _________
                                      |         |
                                      | decoded |
                                      | frames  |
                                      |_________|
 ________             ______________       |
|        |           |              |      |
| output | <-------- | encoded data | <----+
| file   |   muxer   | packets      |   encoder
|________|           |______________|
Refer to the FFmpeg documentation:
- ffmpeg calls the libavformat library (containing demuxers) to read input files and get packets containing encoded data from them. When there are multiple input files, ffmpeg tries to keep them synchronized by tracking the lowest timestamp on any active input stream.
- Encoded packets are then passed to the decoder (unless streamcopy is selected for the stream, see further for a description).
- The decoder produces uncompressed frames (raw video/PCM audio/…) which can be processed further by filtering (see next section).
- After filtering, the frames are passed to the encoder, which encodes them and outputs encoded packets.
- Finally those are passed to the muxer, which writes the encoded packets to the output file.
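The whole pipeline above can be exercised in a single command. A sketch, assuming ffmpeg and ffprobe are installed; the filter step here is a simple rescale, and the sample input is generated so the commands are self-contained:

```shell
# Sample input with a video track and an audio track.
ffmpeg -y -v error -f lavfi -i testsrc=duration=1:size=320x240:rate=10 \
       -f lavfi -i sine=frequency=440:duration=1 \
       -c:v mpeg4 -c:a aac -shortest input.mp4

# demux -> decode -> filter (scale to 160x120) -> encode -> mux, in one run.
ffmpeg -y -v error -i input.mp4 -vf scale=160:120 -c:v mpeg4 -c:a aac output.mp4

# Streamcopy ("-c copy") skips the decode/filter/encode steps entirely:
# packets go straight from the demuxer to the muxer.
ffmpeg -y -v error -i input.mp4 -c copy remuxed.mkv
```

Streamcopy is why container changes (like the MP4-to-MKV example earlier) are so fast: no frames are ever decoded.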