The Complexities of Audio Signal Processing in Video Conferencing
Video conferencing connects two remote parties through video and audio. Three major factors influence video conferencing quality:
- Media transmission quality, i.e., no jitter, HD quality;
- Audio quality, i.e., less noise, more voice clarity, invariance to voice pickup distance;
- Video quality, i.e., better picture clarity, more tracking intelligence.
However, the second factor, audio quality, is not a trivial problem. A simple drawing illustrates the video conferencing scenario.


For the near-end room, there are many possibilities:
- N1: Small huddle room fitting fewer than 3 people.
- N2: Medium-sized room fitting a team of around 6-8 people.
- N3: Large meeting room fitting a team of around 15 people.
- N4: Extra-large room such as a training room or board room.
For the far end, there are also many possibilities:
- F1: People might dial in from a quiet conference room using professional devices.
- F2: People might dial in from their work desk with an earphone.
- F3: People might dial in from their car, or even on a train.
- F4: People might dial in from a home office.
Each of these cases raises its own audio problems:
- N1: This is the most straightforward case. The person usually sits close to the audio capture device, so a device with echo cancellation, basic noise suppression, and automatic gain control is sufficient. However, most audio devices cannot handle even this case, because they do not handle double talk well. This case mainly involves one issue.
  - Double talk: When the far end is talking or has background sound while the near end is also talking, both ends are active at once, which is the double talk scenario. For the far end to hear the conversation clearly, the audio device needs strong double talk capability, i.e., full duplex.
- N2 & N3: These cases usually involve two problems.
  - Distance to audio device: With only one device in the room, some people sit close to the device while others sit far from it. This makes the captured voice uneven: for people sitting close, the captured voice is rich and powerful, while for people sitting far away it is thin and weak. For this case, we recommend daisy-chained devices so that all seats are covered equally. The following graphs show the difference between our device and the Jabra 710 device. Our device captures a much richer and more powerful voice at distances up to 3 meters.
    [Figures: captured voice at 1 m and 3 m, Jabra 710 (aligned_jabra_1m, aligned_jabra_3m) versus our device (aligned_aw_1m, aligned_aw_3m)]
  - Noise: With more people in the conference room, there is more noise. One person might be typing on a keyboard, another knocking on the desk, and another sneezing. Unfortunately, none of the claimed "NoiseBlock" technologies can handle these noises when they occur while people are talking. We are working on a next-generation deep-learning-based solution to handle this issue, so stay tuned for our future product updates.
- F1: When people dial in from quiet environments, the conference system should be able to handle double talk. As stated earlier, most existing solutions fail here, while our solution handles double talk very well.
  [Figures: our device (aligned_aw_dtd) versus Jabra (aligned_jabra_dtd)]
- F2: When people dial in from their desks, their microphone might pick up background sounds such as people walking by. This is a continuous double talk case.
  [Figures: Jabra (aligned_jabra_sbg) versus our device (aligned_aw_sbg)]
- F3: People might also dial in from their car, where the far end is full of instantaneous noise.
- F4: People might dial in from a home office.
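Several of the cases above (N1, F1, F2) come down to noticing that the near end is active while the far end is also playing through the loudspeaker. As a rough illustration only, the sketch below implements the classic Geigel double talk detector from the echo cancellation literature. It is not a description of how our device or any other product works. The function name, the parameter names, and the default values are assumptions chosen for this example; the threshold of 0.5 assumes the echo path attenuates the loudspeaker signal by at least 6 dB.

```python
from collections import deque


def geigel_double_talk(far_end, mic, window=4, threshold=0.5, hangover=0):
    """Flag samples where near-end activity is detected (Geigel detector).

    far_end   -- samples played through the loudspeaker
    mic       -- samples captured by the microphone (echo plus near-end sound)
    window    -- number of recent far-end samples to compare against
    threshold -- assumed upper bound on the echo path gain (0.5 is about -6 dB)
    hangover  -- extra samples to keep the flag raised after a detection
    """
    recent = deque(maxlen=window)  # magnitudes of the latest far-end samples
    flags = []
    hold = 0
    for x, d in zip(far_end, mic):
        recent.append(abs(x))
        # Echo alone should stay below threshold * recent far-end peak.
        # Anything louder is attributed to sound from the near-end room.
        if abs(d) > threshold * max(recent):
            hold = hangover
            flags.append(True)
        elif hold > 0:
            hold -= 1
            flags.append(True)
        else:
            flags.append(False)
    return flags


if __name__ == "__main__":
    far = [0.0, 0.8, -0.8, 0.8, -0.8, 0.8]    # far-end talker
    near = [0.0, 0.0, 0.0, 0.5, -0.5, 0.0]    # near-end talker joins briefly
    mic = [0.3 * x + s for x, s in zip(far, near)]  # toy echo gain of 0.3
    print(geigel_double_talk(far, mic))
    # prints [False, False, False, True, True, False]
```

In an echo canceller, a flag like this is typically used to freeze filter adaptation so that near-end speech does not corrupt the echo estimate. Note that the detector also fires when only the near end is talking, which is harmless for that purpose. A detector this simple struggles when the echo is loud or the room is reverberant, which is part of why double talk is hard in practice.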
To handle all these cases, a holistic approach is needed. Fortunately, after two years of hard work by more than 30 engineers and scientists, we are able to deliver such a device.