
Evaluation of AI/ML Based Deep Packet Loss Concealment in Opus Codec Version 1.5
Sukumar Arasada, Mohan Prasad Sappidi, and Basharath Hussain Khaja
1. Introduction
The open-source Opus codec is a versatile tool for speech and audio coding. It is also among the first codecs to introduce Artificial Intelligence (AI) and Machine Learning (ML) techniques for improving speech quality at low bit-rates and under transmission loss conditions [1]. In [2], we presented our evaluation of LACE, an AI/ML based de-noising algorithm introduced in Opus codec version 1.5.2. This version of the codec also provides two AI/ML based Packet Loss Concealment (PLC) algorithms: (a) Deep PLC, recommended for occasional packet losses, and (b) the Deep REDundancy algorithm (DRED), intended for contiguous, prolonged bursts of packet losses. In this report, we present the results of our evaluation of the Opus codec with and without the Deep PLC algorithm enabled. When Deep PLC is disabled, Opus falls back to its legacy baseline PLC algorithm. We report our findings in terms of both speech quality and the computational complexity of using Deep PLC. The rest of the report is organized as follows: Section 2 provides a brief overview of PLC algorithms, Section 3 describes the Deep PLC algorithm in Opus, Section 4 describes the steps we followed in our evaluation, and Section 5 presents our conclusions.
2. Packet Loss Concealment
Packetized voice communication over the Internet uses the User Datagram Protocol (UDP), which is a best-effort service. Each packet carries one or more frames of encoded voice data, and each frame consists of the bit-stream generated by encoding a time-window of digitized PCM voice samples. The number of voice samples encoded per frame depends on the codec configuration; for example, a 20 ms frame at a 16000 Hz sampling rate covers 320 PCM samples. UDP provides no quality-of-service guarantees, so transmission losses are inevitable.
The receiver has several techniques in its arsenal to mask, or conceal, the effect of such losses. These concealment techniques rest on the assumption that voice attributes, or rather their rates of change, remain constant over small time-scales. They can be broadly classified into two categories: (a) concealment based on information already available at the receiver from past good frames, and (b) transmitter-guided algorithms, wherein the transmitter embeds perceptually important speech information (e.g., transients) of the current frame in a subsequent frame. The waveform substitution technique is an example of the former category, while Forward Error Correction (FEC) algorithms belong to the latter.
The simplest waveform substitution algorithm is specified in ITU-T G.711 Appendix I [3]. It recommends concealing a lost frame by cutting and pasting voice samples from past good frames, aligned to the estimated pitch period. As speech coding evolved beyond the G.711 standard and began using parametric models such as analysis-by-synthesis, codebook excitation, and voiced/unvoiced classification with appropriate handling for each class, the PLC techniques evolved as well. The receiver in such cases masks the lost frame using the model parameters from past good frames.
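To make the idea concrete, the following C sketch shows pitch-aligned cut-and-paste concealment. This is a toy illustration under simplifying assumptions, not the actual G.711 Appendix I algorithm, which additionally applies overlap-add smoothing and gradual attenuation over consecutive lost frames; the function name and buffer layout are our own.

#include <stddef.h>

/* Toy pitch-aligned waveform substitution: fill a lost frame by
 * repeating the last full pitch cycle from the history of past good
 * samples (newest sample last). Assumes 0 < pitch <= history_len. */
void conceal_by_repetition(const short *history, size_t history_len,
                           size_t pitch, short *out, size_t frame_len)
{
    const short *cycle = history + history_len - pitch; /* last pitch cycle */
    for (size_t i = 0; i < frame_len; i++)
        out[i] = cycle[i % pitch]; /* cut and paste, wrapping per cycle */
}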
Transmitter-guided concealment algorithms such as FEC are supported in Opus and in the 3GPP EVS codec. In Opus, an FEC flag, along with a packet-loss-percentage statistic obtained from the network, can be passed as input to the encoder. In the case of EVS, a channel-aware mode is supported in the wide-band and super-wide-band options [4]. In this mode, the encoder operates at 13.2 kbps and perceptually important information is piggybacked onto a future frame while still maintaining a constant overall bit-rate.
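As a sketch of how an application might enable transmitter-guided concealment in Opus, the encoder exposes standard ctls for the FEC flag and the expected loss percentage. The 20% value below is an arbitrary placeholder for a statistic that would normally come from network feedback (e.g., RTCP receiver reports):

#include <opus.h>

/* Sketch: mono 16 kHz VoIP encoder with in-band FEC enabled. The
 * expected-loss value steers how much redundancy the encoder adds. */
OpusEncoder *create_fec_encoder(void)
{
    int err;
    OpusEncoder *enc = opus_encoder_create(16000, 1,
                                           OPUS_APPLICATION_VOIP, &err);
    if (err != OPUS_OK)
        return NULL;
    opus_encoder_ctl(enc, OPUS_SET_INBAND_FEC(1));        /* FEC flag */
    opus_encoder_ctl(enc, OPUS_SET_PACKET_LOSS_PERC(20)); /* loss statistic */
    return enc;
}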
3. Opus Deep PLC
In Opus Deep PLC, a combination of generative and predictive AI models is used [5]. The generative AI model works with acoustic features such as the Bark-frequency spectral coefficients, the pitch period, and a voiced/unvoiced indicator to generate synthesized speech output. The predictive AI model estimates these acoustic features whenever a packet loss occurs. The estimated acoustic features are then fed as input to the generative AI model to synthesize speech during packet loss conditions.
Figure 1 shows the block diagram of the receiver flow. Good frames are processed via the normal flow of the Opus codec. Whenever a packet loss is detected, the acoustic parameters from the last good frame (or the last synthesized frame) are extracted and fed as input to the Deep Neural Network based predictive AI model. Its output is given to the generative AI model, called FARGAN [6], which synthesizes the speech frame to be played out to the listener. Both the predictive and generative AI models are extensively trained offline. Note that the AI models come into action only during packet loss conditions; there is no additional computational overhead related to Deep PLC during the normal flow.
Figure 1: Opus flow at the receiver (decoder).
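From the application's perspective, both the baseline PLC and Deep PLC are invoked through the same decoder API: passing a NULL payload to opus_decode() asks the decoder to conceal the missing frame. A minimal sketch (the wrapper function name is ours):

#include <opus.h>
#include <stddef.h>

/* Sketch: decode one 20 ms frame at 16 kHz (frame_size = 320 samples).
 * A NULL payload asks the decoder to conceal the lost frame; with Deep
 * PLC compiled in and decoder complexity >= 5, the AI models are used. */
int decode_or_conceal(OpusDecoder *dec, const unsigned char *payload,
                      int payload_len, opus_int16 *pcm, int frame_size)
{
    if (payload == NULL) /* packet lost: conceal */
        return opus_decode(dec, NULL, 0, pcm, frame_size, 0);
    return opus_decode(dec, payload, payload_len, pcm, frame_size, 0);
}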
4. Evaluation of Opus Deep PLC
The floating-point implementation of libopus version 1.5.2 [7], released on April 12, 2024, was used in our evaluation. To enable the Deep PLC algorithm, the following compile-time option has to be used while building the library:
./configure --enable-deep-plc
This increases the overall memory requirements of libopus: program memory grows by approximately 1.5 Mbytes, state memory by 60 Kbytes, constants by 1.4 Mbytes, and scratch buffer requirements by 70 Kbytes. Libopus 1.5.2 also introduces a "complexity" parameter at the decoder (on a scale of 0 to 10); Deep PLC is used for concealment only if the complexity parameter is set to 5 or higher, and only for mono speech channels.
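In an application (as opposed to opus_demo), the decoder complexity can be raised through the decoder ctl interface, which the 1.5 series extends to accept OPUS_SET_COMPLEXITY. A minimal sketch, assuming a mono 16 kHz stream:

#include <opus.h>

/* Sketch: create a mono 16 kHz decoder and raise its complexity to 5
 * so that Deep PLC (when compiled in via --enable-deep-plc) is used
 * for concealment instead of the legacy baseline PLC. */
OpusDecoder *create_deep_plc_decoder(void)
{
    int err;
    OpusDecoder *dec = opus_decoder_create(16000, 1, &err);
    if (err != OPUS_OK)
        return NULL;
    opus_decoder_ctl(dec, OPUS_SET_COMPLEXITY(5));
    return dec;
}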
An example command used for encoding a wideband test vector is given below:
./opus_demo -e voip 16000 1 24000 -cbr -bandwidth WB -framesize 20 -complexity 3 -loss 0 <input.pcm> <bitstream.cod>
To simulate packet loss conditions, we have two options at the decoder: either specify a loss percentage, or provide an input meta file that indicates, frame by frame, whether the current frame should be treated as received or lost.
An example command using the loss percentage (20% in the command below) is as follows:
./opus_demo -d 16000 1 -dec_complexity 5 -loss 20 <bitstream.cod> <output.pcm>
An example command using the loss meta file option to mark each frame as good or bad is as follows:
./opus_demo -d 16000 1 -dec_complexity 5 -lossfile meta_loss.txt <bitstream.cod> <output.pcm>
Note that at the encoder, the frame size is set to 20 ms, a constant bit rate of 24 kbps is specified, the sampling frequency is 16000 Hz, and the complexity is set to 3. The decoder complexity parameter is set to 5 to enable the Deep PLC feature. The equivalent encoder configuration through the API is sketched below.
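As an illustration, the opus_demo encoder flags used above roughly correspond to the following standard encoder ctls (a sketch with error handling omitted; the 20 ms frame size is not a ctl but is chosen per opus_encode() call, i.e., 320 samples at 16 kHz):

#include <opus.h>

/* Sketch: encoder configuration mirroring the opus_demo command above:
 * VoIP application, 16 kHz mono, 24 kbps CBR, wideband, complexity 3. */
OpusEncoder *create_test_encoder(void)
{
    int err;
    OpusEncoder *enc = opus_encoder_create(16000, 1,
                                           OPUS_APPLICATION_VOIP, &err);
    if (err != OPUS_OK)
        return NULL;
    opus_encoder_ctl(enc, OPUS_SET_BITRATE(24000));
    opus_encoder_ctl(enc, OPUS_SET_VBR(0)); /* constant bit rate */
    opus_encoder_ctl(enc, OPUS_SET_BANDWIDTH(OPUS_BANDWIDTH_WIDEBAND));
    opus_encoder_ctl(enc, OPUS_SET_COMPLEXITY(3));
    return enc;
}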
The gold standard for comparing the speech quality of two algorithms is to have listeners assign a score to each test sample and to compute the Mean Opinion Score (MOS) across all listener scores [8]. However, this approach is subjective, time-consuming, and prohibitively expensive. To partially overcome these limitations, a crowd-sourced technique has also been proposed for evaluating speech quality [9]. In parallel, objective evaluation tools have been developed over the last three decades, either by industry standards bodies or by the open-source community. PESQ (Perceptual Evaluation of Speech Quality) [10] and POLQA (Perceptual Objective Listening Quality Assessment) [11] are examples of ITU-T standards, while ViSQOL (Virtual Speech Quality Objective Listener) [12] and PLCMOS [13] are examples of open-source tools. Some of these tools require the clean reference input for comparison with the packet-loss-concealed output, while others do not; PLCMOS comes in two versions, where version 0 uses the reference input and version 2 is non-intrusive. Note that PLCMOS was developed specifically for evaluating PLC algorithms. In our evaluation, we use the objective tools PESQ, ViSQOL, and PLCMOS version 0.
We used two different sets of test vectors in our evaluation of Deep PLC. A deep PLC challenge was announced at the INTERSPEECH 2022 conference [14], followed by another at ICASSP 2024 [15], to accelerate research and development of AI/ML based PLC algorithms; Opus Deep PLC won second prize in [14]. To evaluate the submissions to these challenges, clean audio files, lossy audio files, and meta data files describing the loss information were provided. All the audio files were sampled at 16000 Hz and were approximately 10 seconds in duration. We used these data files in our evaluation with the following pre-processing steps: (a) we first categorized the files based on the number of lost frames, and only used the files falling into the following bins of packet loss percentages: (0%–5%], (5%–10%], (10%–15%], (15%–20%], (20%–25%], and (25%–30%]; and (b) instead of evaluating the performance of Deep PLC directly on the provided lossy audio files, we first encoded the clean audio files at 24 kbps and then introduced losses at the decoder using the information in the associated meta file. Encoding the audio files first allows the Deep PLC algorithm to be evaluated as it would be exercised in real-world use cases. We also used a second dataset (CIT test vectors) consisting of five male and five female recordings in English, sampled at 16000 Hz, with an average duration of 23 seconds. These test vectors were derived from the speech recordings available at [16] and can be downloaded from [17].
We profiled the CPU resource consumption on an Intel host system (Intel(R) Core(TM) i5-11400 CPU @ 2.60 GHz, 8 GB DDR4 @ 2667 MT/s) running 64-bit CentOS Stream 9, with gcc version 11.3.1 20221121 (Red Hat 11.3.1-4) and glibc version 2.34 (64-bit). Intrinsics for the x86-64 architecture (SSE, SSE2, SSE4.1, AVX2) were enabled for optimal performance of libopus.
Figures 2 and 3 show the average quality scores and the CPU resource utilization (in MIPS), respectively, obtained from the INTERSPEECH 2022 deep PLC challenge test vectors. Figures 4 and 5 show the corresponding plots for the CIT test vectors. For comparison, the quality scores and the MIPS are plotted for both the baseline Opus PLC algorithm and the Opus Deep PLC algorithm.
Figure 2: Comparison of speech quality scores between baseline PLC and Deep PLC obtained from INTERSPEECH 2022 test vectors.
Figure 3: CPU resource usage comparison between baseline PLC and Deep PLC for INTERSPEECH 2022 test vectors.
Figure 4: Comparison of speech quality scores between baseline PLC and Deep PLC obtained from CIT test vectors.
Figure 5: CPU resource usage comparison between baseline PLC and Deep PLC for CIT test vectors.
From these figures, we summarize our findings as follows:
- From Figures 2 and 4, we observe that Deep PLC consistently yields a better quality score than the baseline PLC algorithm for the packet loss conditions considered in this study (up to 30%).
- Although the percentage increase in quality scores varies across the objective tools used in this study, Deep PLC performs better than the baseline PLC with every tool.
- The improvement in quality scores provided by Deep PLC grows with the packet loss rate: the score increases by about 2% for packet losses of up to 5%, and by 10%–20% for packet losses of up to 30%.
- From Figures 3 and 5, we observe that the MIPS consumed by the baseline PLC algorithm remain almost constant irrespective of the packet loss rate.
- The MIPS consumption of Deep PLC remains of the same order as the baseline PLC for packet losses of up to 5% (roughly a factor of 2 higher), but grows significantly with the packet loss rate, reaching roughly a factor of 9 higher for packet losses of up to 30%.
- Figure 5 also shows that under no packet loss conditions, the MIPS overhead of enabling the Deep PLC feature is negligible.
5. Conclusion
The Opus codec version 1.5 introduced an optional AI/ML based packet loss concealment algorithm at the receiver, called Deep PLC. The receiver can select either the baseline concealment algorithm or the Deep PLC algorithm. Deep PLC does not depend on any side information from the transmitter; it uses a combination of generative and predictive AI models for concealment. Under normal conditions (no transmission loss), enabling Deep PLC does not result in any significant additional CPU resource usage. During packet loss conditions, Deep PLC consistently performs better than the baseline PLC: in our study, we observed an improvement in speech quality scores in the range of 2%–20%, with larger gains at higher packet loss rates. As is the case with most AI/ML algorithms, Deep PLC consumes significantly more CPU resources than the baseline PLC. While the MIPS are comparable at low packet loss rates, they increase by a factor of 9 in 30% packet loss scenarios. Enabling Deep PLC on terminals that have AI/ML hardware accelerators will definitely be beneficial. However, the high MIPS requirement will be a constraint for battery-operated terminals and for scalable enterprise applications such as Media Resource Functions and Session Border Controllers.
6. Acknowledgement
We thank Dr. Krishna Nagarajan for his guidance during the evaluation and in writing this report. For any queries, please contact support@couthit.com.