Livestreaming with libav* – Tutorial (Part 2) | Computer Science Blog @ HdM Stuttgart

Green screen live streaming production at Mediehuset København

If you want to create videos using FFmpeg there is a basic pipeline setup to go with. We will first take a brief look at this pipeline and then focus on each individual section.

The basic pipeline

I’m assuming you have already captured your video/audio data. Since this step is highly platform dependent it will not be covered in this tutorial. But there are plenty of great tutorials on this from other people: using v4l2 and using pulseaudio

video capturing --> scaling ----> encoding \
                                            \
                                             muxing --> ....
                                            /
audio capturing --> filtering --> encoding /

Scaling/resampling: This is the first step after capturing your video data. Here the per-pixel manipulation like scaling or resampling is done. Because raw video frames can be quite large, you may want to think about doing some of your pixel-magic on the GPU (compositing would fit nicely there). Because the encoder configured later in this tutorial is fed the planar YUV 4:2:2 pixel format, you might need to convert the pixel format you get from your source device (webcams in particular often output only packed formats).
A list of the FFmpeg pixel formats can be found here.
Filtering: If you want to filter your raw audio input, adjust the volume, do some mixing or other crazy audio stuff, this is the place to do it. But because I’m not great with audio I will skip this part and leave it to the professionals to explain this part 🙂
Encoding: This step is similar for both video and audio. Depending on the codec you want to use, you first have to get some settings straight and then consume the raw frames provided by the previous parts of your pipeline. This step is the most resource-demanding step.
Muxing: This is the step where you combine your audio and video data. Each audio/video/subtitle track will be represented as a stream in the FFmpeg muxer. You will most likely have to do some timestamp-magic in this step. After you have muxed your streams you can then dump the final video into a file or stream it to a server.
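
To put a number on how large the raw frames get, here is a back-of-the-envelope sketch. It assumes 8 bits per sample and the planar YUV 4:2:2 layout, which stores one luma byte per pixel plus two chroma planes at half the horizontal resolution. The helper names are made up for this illustration and are not part of libav*.

```c
#include <stdint.h>

/* Bytes in one 8-bit YUV 4:2:2 frame: a full-resolution Y plane plus
 * U and V planes that are each half as wide (width assumed to be even). */
static int64_t yuv422_frame_bytes(int width, int height)
{
    int64_t luma   = (int64_t)width * height;
    int64_t chroma = (int64_t)(width / 2) * height;  /* per chroma plane */
    return luma + 2 * chroma;
}

/* Raw data rate in bytes per second at a given frame rate. */
static int64_t yuv422_bytes_per_second(int width, int height, int fps)
{
    return yuv422_frame_bytes(width, height) * fps;
}
```

For 1920x1080 this gives 4,147,200 bytes per frame, or roughly 249 MB per second at 60 fps before encoding, which is why avoiding unnecessary copies matters.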

You can pack all of these components into one process/thread, which makes handling memory a little easier and reduces copying of large memory chunks. If you are planning on using OpenGL for pixel manipulation and hardware acceleration like Quick Sync for encoding, it might be a good idea to isolate these steps into their own threads so they won’t mess with the rest of your program. This however makes memory handling and communication between the sections much more complicated. Also keep in mind that some of the libav* calls (e.g. sws_scale()) might be blocking.

Datatypes

If you interact with the FFmpeg API you need to use three data types:

AVFrame: This struct holds raw video or audio data. If you want to manipulate your image or sound on a pixel/signal basis you need to do it while holding this struct.
You can allocate an AVFrame by calling av_frame_alloc().
Depending on your capture technique you can reuse the struct and simply replace the data pointers. In this case, you can reset your frame to its original state with av_frame_unref(). The call releases all buffers referenced by the frame and also resets all of its fields, so you have to set them again before the next use.

AVFrame* raw_frame = av_frame_alloc();

while (1) {
    // DO STUFF with raw_frame; break out when capturing ends
    av_frame_unref(raw_frame);
}

av_frame_free(&raw_frame);

AVPacket: The AVPacket struct holds encoded video or audio data. In the FFmpeg versions this tutorial was written for, the struct itself could live on the stack, so you probably won’t run into many memory issues. Newer FFmpeg releases deprecate this and expect packets to be allocated with av_packet_alloc() and released with av_packet_free().
Bytestream: If you are writing a custom output for the FFmpeg muxer (which you will probably do if you want to do anything other than dumping everything into a file) your custom IO-function will receive a bytestream from the muxer. Because the muxer combines both audio and video it makes sense that this component unpacks and marshals all the structs. So from here on you don’t have to worry about managing memory in crude structs any more 🙂

// buffer contains buffer_size bytes of the muxed bytestream
int custom_io_write(void* opaque, uint8_t *buffer, int buffer_size);

More details, more code

As promised we will now take a closer look at each step of our processing pipeline. There will be quite a lot of code but hopefully this will help make starting with libav* a little less painful.

Resampling

The first step in the scaling component is to set up an SwsContext. This only has to be done once in your program. Here you set the input and target resolution as well as the respective pixel formats. As mentioned earlier we have to transform the pixel format from a packed to a planar format. If you want to change the resolution, bicubic resampling is a reasonable choice, since it produces good image quality with acceptable performance. The last parameter lets you tune the resampling algorithm.

int source_width = 1920, source_height = 1080;
int target_width = 1920, target_height = 1080;

struct SwsContext* scaler = sws_getContext(
    source_width, source_height, AV_PIX_FMT_YUYV422,
    target_width, target_height, AV_PIX_FMT_YUV422P,
    SWS_BICUBIC, NULL, NULL, NULL
);

From now on everything has to be done on a per-image-basis, so you have to wrap the next calls in some kind of loop.

First we allocate an output buffer for the scaler. Because the scaler has to copy all the data anyway we don’t have to trouble ourselves with reusing buffers from the capturing process.

With av_image_alloc() we can allocate the actual memory in the AVFrame container. The last argument is the buffer alignment in bytes. In my tests its value didn’t seem to influence anything, and even the libsws source code doesn’t give any clue as to what value to use. It is probably an optimization for SIMD instructions on the CPU, but I couldn’t find proof for that.

AVFrame* scaled_frame = av_frame_alloc();

scaled_frame->format = AV_PIX_FMT_YUV422P;
scaled_frame->width  = target_width; 
scaled_frame->height = target_height;

av_image_alloc(
    scaled_frame->data, scaled_frame->linesize, 
    scaled_frame->width, scaled_frame->height, 
    scaled_frame->format, 16);

The last step is to call sws_scale(). The source_data and source_linesize parameters are both arrays with an entry for each plane of the source image (4 in total). Because we are provided with a packed pixel format from our webcam, only the first element of the source-arrays will be set.

sws_scale( scaler,
    (const uint8_t * const*) source_data, source_linesize,
    0, source_height,
    scaled_frame->data, scaled_frame->linesize);
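
To make the difference between packed and planar layouts concrete, here is a hand-written sketch of the format conversion alone, without any scaling. It assumes an even width and tightly packed rows without padding; sws_scale() additionally handles line strides, alignment and resizing, so this is only an illustration and not a replacement. The function name is made up for this example.

```c
#include <stdint.h>

/* Split packed YUYV 4:2:2 (byte order Y0 U Y1 V for every pixel pair)
 * into separate Y, U and V planes. Assumes an even width and rows
 * without padding, so the whole image can be walked linearly. */
static void yuyv422_to_planar(const uint8_t *packed, int width, int height,
                              uint8_t *y, uint8_t *u, uint8_t *v)
{
    int pairs = (width / 2) * height;  /* two pixels share one U and one V */

    for (int i = 0; i < pairs; i++) {
        y[2 * i]     = packed[4 * i];
        u[i]         = packed[4 * i + 1];
        y[2 * i + 1] = packed[4 * i + 2];
        v[i]         = packed[4 * i + 3];
    }
}
```

The packed source only needs one pointer (which is why only the first entry of the source arrays is set), while the planar result needs three.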

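After you have passed your scaled frame to the encoder you have to free the frame yourself. Because the image buffer was allocated with av_image_alloc(), it is not reference counted: release it with av_freep() first and then free the frame struct with av_frame_free().

av_freep(&scaled_frame->data[0]);
av_frame_free(&scaled_frame);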

Encoding

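Now to the fun part. With a call to avcodec_register_all() we register all codecs with libavcodec. The call is mandatory before FFmpeg 4.0; since then it is deprecated and no longer needed, and it was removed in FFmpeg 5.0. Afterwards we can get a handle to the codec we want to use.

avcodec_register_all();  // only needed before FFmpeg 4.0
const AVCodec* codec = avcodec_find_encoder(AV_CODEC_ID_H264);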

Next we have to configure the encoder. You might want to adjust these settings according to your needs.

Because most of your users will watch the stream on a computer or mobile device with a 60 Hz display, you should only use frame rates that divide 60 evenly (for example 30 or 60) to avoid stuttering.
If you are using 23.976, 24, 25 or 50 frames per second there might be something wrong with your setup.
Also only use progressive scanning!
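
The reason behind the frame rate advice can be sketched with a simplified model: on a fixed-rate display every frame stays on screen for a whole number of refreshes, and if the frame rate does not divide the refresh rate, that number varies from frame to frame. The function below is a made-up helper for this illustration and ignores effects like variable refresh rate displays.

```c
/* Number of display refreshes during which source frame `index` stays on
 * screen, for a display running at refresh_hz and a video at fps.
 * Simplified model: a frame becomes visible at refresh index * hz / fps
 * (rounded down). */
static int refreshes_for_frame(int index, int fps, int refresh_hz)
{
    int start = index * refresh_hz / fps;        /* first refresh showing it */
    int end   = (index + 1) * refresh_hz / fps;  /* first refresh of the next frame */
    return end - start;
}
```

At 30 fps on a 60 Hz display every frame is shown for 2 refreshes. At 24 fps the pattern alternates between 2 and 3 refreshes, and at 25 fps it is 2, 2, 3, 2, 3, which is perceived as stutter.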

When using H.264 you can also set codec presets. These trade image quality against encoding time: slower presets yield better image quality but take longer to encode, and vice versa. “ultrafast”, “superfast” and “veryfast” seem to be the only presets that can keep up with 1080p 60fps while livestreaming (and they have a nice ring to them). This was tested with an Intel Core i5-6200U.

AVCodecContext* encoder = avcodec_alloc_context3(codec);

encoder->bit_rate = 10 * 1000 * 1000;  // 10 Mbit/s
encoder->width = 1920;
encoder->height = 1080;
encoder->time_base = (AVRational) {1,60};
encoder->gop_size = 30;
encoder->max_b_frames = 1;
encoder->pix_fmt = AV_PIX_FMT_YUV422P;

av_opt_set(encoder->priv_data, "preset", "ultrafast", 0);

avcodec_open2(encoder, codec, NULL);

With the avcodec_send_frame() call we can send our raw frames to the encoder. This of course is also done on a per-image basis, so once again you have to wrap it in a loop. Because the encoder copies the frame to an internal buffer we can then safely free all our frame buffers.

A word on timestamps: simply incrementing an integer isn’t the correct way to do it, but it seems to work since it meets the monotonic-requirement of the encoder.

AVFrame* raw_frame = scaled_frame; 

raw_frame->pts = pts++;
avcodec_send_frame(encoder, raw_frame);

av_freep(&raw_frame->data[0]);
av_frame_free(&raw_frame);

The correct (but in my case untested) solution would be to use this formula below: You basically have to increment the pts for each interval on your timebase, even if you haven’t read a frame. Therefore you could substitute the skipped_frames-variable with the number of past timebase intervals since the previous_pts.

int64_t previous_pts = 0; 

raw_frame->pts = previous_pts + 1 + skipped_frames;
previous_pts = raw_frame->pts;
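
One way to obtain the number of elapsed time base intervals is to derive the pts from a clock instead of counting frames. The sketch below is just as untested against a real encoder as the formula above. It assumes a time base of 1/fps and a timestamp in microseconds since the start of the stream; where that timestamp comes from (for example a monotonic clock) is up to your capture code, and both function names are made up for this example.

```c
#include <stdint.h>

/* Index of the time base interval that contains the moment `elapsed_us`
 * microseconds after the stream started, for a time base of 1/fps. */
static int64_t pts_from_elapsed_us(int64_t elapsed_us, int fps)
{
    return elapsed_us * fps / 1000000;
}

/* Number of intervals that passed without a frame between two
 * consecutive pts values (0 if no frame was dropped). */
static int64_t skipped_intervals(int64_t previous_pts, int64_t current_pts)
{
    return current_pts - previous_pts - 1;
}
```

With a 1/60 time base, a frame captured 50 ms after the start gets pts 3. If the previous frame had pts 0, two intervals were skipped, which matches previous_pts + 1 + skipped_frames. Because the encoder needs increasing timestamps, a frame that lands in the same interval as its predecessor would still have to be dropped or moved to the next interval.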

To read encoded packets from your encoder you simply call the avcodec_receive_packet() function. Because the encoder may combine information from several input frames into one output packet, the first few calls to avcodec_receive_packet() will not return any packets but AVERROR(EAGAIN) instead.

AVPacket* encoded_frame = av_packet_alloc();
int got_output = avcodec_receive_packet(encoder, encoded_frame);

if(got_output == 0) {
    // yeah :)
}

When ending your stream you will have to do two things:

call avcodec_send_frame() with the second argument set to NULL. This will start draining the internal buffers of the encoder and ensures that all frames that have been sent to the encoder actually get put in the encoded video.
call avcodec_receive_packet() until you get an AVERROR_EOF error. This will indicate that all packets have been read from the encoder.

These steps are usually called “draining” or “flushing” the encoder.

Muxing

To set up the muxer we first have to set our output format. While the av_guess_format() call doesn’t seem to be the prettiest solution, it works fairly well.

With avformat_new_stream() we create both an audio and a video track. You can also create subtitle tracks or add multiple audio tracks for different languages in your video. Because at playback time the decoder has to know which codec has been used to encode the tracks we have to embed this information in our output format. The IDs for the codecs can be found here. Because avcodec_parameters_from_context() sets only codec-specific settings we have to set the timebase and framerate of our tracks manually.

AVFormatContext* muxer = avformat_alloc_context();

muxer->oformat = av_guess_format("matroska", "test.mkv", NULL);

AVStream* video_track = avformat_new_stream(muxer, NULL);
AVStream* audio_track = avformat_new_stream(muxer, NULL);
muxer->oformat->video_codec = AV_CODEC_ID_H264;
muxer->oformat->audio_codec = AV_CODEC_ID_OPUS;

avcodec_parameters_from_context(video_track->codecpar, encoder); 
video_track->codecpar->codec_type = AVMEDIA_TYPE_VIDEO;

video_track->time_base = (AVRational) {1,60};
video_track->avg_frame_rate = (AVRational) {60, 1};

The muxer has to know where to write the resulting bytestream. Therefore we must use an IO-context. You can get this by either using the avio_open2()-function or by creating your own custom context. In any case the muxer will handle calling these functions, you don’t have to worry about that. Since I wanted to write the output to a Unix domain socket I had to use a context with a custom write-callback. If you want to read more about custom IO, here is a tutorial.

First we have to set up a buffer for the bytestream which we then provide to avio_alloc_context(). The third parameter sets the buffer to be writeable (0 if you want read-only). The fourth parameter can be used to pass custom data to the IO-functions. The last three parameters are, in this order, the functions for reading input (not required here since we are not decoding anything), writing, and seeking (also only needed when building a player).

To add the IO-context to the muxer set the pb field of the muxer context.

int avio_buffer_size = 4 * 1024;  // 4 KiB
void* avio_buffer = av_malloc(avio_buffer_size);

AVIOContext* custom_io = avio_alloc_context (
    avio_buffer, avio_buffer_size,
    1,
    (void*) 42,
    NULL, &custom_io_write, NULL);
    
muxer->pb = custom_io;

The custom writing function has the following signature. You can access the muxer’s bytestream via the buffer-parameter.

int custom_io_write(void* opaque, uint8_t *buffer, int buffer_size);
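
As an illustration of what such a callback might look like, here is a sketch that forwards the bytestream to a POSIX file descriptor such as a connected socket. It assumes that the opaque pointer handed to avio_alloc_context() points to an int holding that descriptor; this convention and the function name are choices made for this example, not something libav* prescribes.

```c
#include <errno.h>
#include <stdint.h>
#include <unistd.h>

/* Write callback sketch: `opaque` points to an int holding an open file
 * descriptor (an assumption of this example). Returns the number of
 * bytes written, or a negative errno value on failure. */
static int fd_write_callback(void *opaque, uint8_t *buffer, int buffer_size)
{
    int fd = *(int *)opaque;
    int written = 0;

    /* write() may accept fewer bytes than requested, so loop until done. */
    while (written < buffer_size) {
        ssize_t n = write(fd, buffer + written, (size_t)(buffer_size - written));
        if (n < 0) {
            if (errno == EINTR)
                continue;   /* interrupted by a signal: retry */
            return -errno;  /* report the error to the caller */
        }
        written += (int)n;
    }
    return written;
}
```

To use it, you would pass a pointer to the descriptor as the fourth argument of avio_alloc_context() and the function as the write callback.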

Before we can start to put the actual data into our output format we first have to write a header. Here you can also provide additional options to the muxer. The avformat_write_header()-function consumes all the options it recognizes from your dictionary and leaves the rest in it. The “live” option tells the muxer to output frames with strictly ascending presentation timestamps and prevents the muxer from reordering frames. Also the muxer writes the entire header at the beginning of the video (with a placeholder for the length of the video) instead of just a placeholder for the entire header.

AVDictionary *options = NULL;
av_dict_set(&options, "live", "1", 0);
avformat_write_header(muxer, &options);

With everything set up we can now send packets to our muxer. Once again this is done on a per-packet basis, a loop would do nicely here.

To add the packets to the correct track (audio or video) we need to add an identifying stream index to each packet. The index of the track is simply incremented for each call to avformat_new_stream().

Now for the timestamp-magic-part: Because some containers (e.g. Matroska) force a fixed timebase on their tracks (in this case 1/1000) we need to scale the timestamps of each track (timebase 1/60) to match the container’s timebase. If we didn’t do this, decoders would play the video with a wrong framerate, which would at best look funny.
This has to be done both for the presentation and decoding time stamps. Because the documentation is very vague on how to use this function: av_rescale_q() first expects the timestamp, then the timebase of your track (1/60) and finally the target timebase of your container (1/1000).

From there it’s as simple as calling av_write_frame() and freeing your input packets. The muxer then writes the resulting bytestream to the IO-context.

// encoded_packet is an AVPacket* from av_packet_alloc(), filled by avcodec_receive_packet()
AVRational encoder_time_base = (AVRational) {1, 60};

encoded_packet->stream_index = video_track->index;

int64_t scaled_pts = av_rescale_q(encoded_packet->pts, encoder_time_base, video_track->time_base);
encoded_packet->pts = scaled_pts;

int64_t scaled_dts = av_rescale_q(encoded_packet->dts, encoder_time_base, video_track->time_base);
encoded_packet->dts = scaled_dts;

int ret = av_write_frame(muxer, encoded_packet);

av_packet_unref(encoded_packet);  // release the payload, the packet can be reused
av_packet_free(&encoded_packet);  // only once, after the last packet of the stream
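
To see what the rescaling does to the numbers, here is the arithmetic written out by hand. This is a simplified stand-in with a made-up name: it only handles non-negative timestamps and ignores overflow, both of which av_rescale_q() takes care of, so real code should keep using the library function.

```c
#include <stdint.h>

/* Convert a non-negative timestamp from time base src_num/src_den to
 * time base dst_num/dst_den, rounding to the nearest integer. */
static int64_t rescale_ts(int64_t ts, int src_num, int src_den,
                          int dst_num, int dst_den)
{
    int64_t numerator   = (int64_t)src_num * dst_den;
    int64_t denominator = (int64_t)src_den * dst_num;

    /* Adding half the denominator turns truncation into rounding. */
    return (ts * numerator + denominator / 2) / denominator;
}
```

Going from 1/60 to 1/1000, the timestamps 1, 2, 3 and 60 become 17, 33, 50 and 1000, i.e. the presentation times in milliseconds.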

At the end of your stream you have to remember to write a trailer to your video stream.

av_write_trailer(muxer);

Testing

And that’s it. With some glue code and coffee you should now be able to see a moving picture. If you can, simply write the video to stdout and pipe it into ffplay.

ffplay -f matroska pipe:0

If this doesn’t work for you, you can dump the video to a file and watch it with any videoplayer.

If you want to test the streaming capabilities of your program you can use this command to open an HTTP server, listen for incoming Matroska data and display it directly.

ffplay -f matroska -listen 1 -i http://<SERVER_IP>:<SERVER_PORT>

If you have gained some experience with video streaming yourself, feel free to post helpful tutorials, improvements to this post, or any other tips in the comments.

Image sources:

title image: https://commons.wikimedia.org/wiki/File:Green_screen_live_streaming_production_at_Mediehuset_K%C3%B8benhavn.jpg, Author: Rehak
