Hello great person,
this and next week you will get two papers about the following idea: how could we produce better audio and video encodings? (With "better" meaning similar quality at less bandwidth.) If it works out, this will result in better audio and video quality in all our video calls. Previous approaches were to compress the audio/video down to a lower bit rate, e.g. by removing audio signals the human ear cannot hear. This new wave of encodings leverages machine learning and a fascinating paradigm shift: we do not try to compress the original data, but to rebuild it at the destination.
- At the source, a machine learning model learns about your voice and the things you said
- Only the minimum data required to rebuild your voice is transferred
- At the destination, another machine learning model rebuilds your voice
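The three steps above can be sketched in code. This is a deliberately toy illustration of the idea (analyze, send a tiny payload, resynthesize at the other end), not the paper's actual model: the band pooling, quantization, and noise-excited reconstruction below are all my own simplifications, and a real system would run a trained generative model in the decoder.

```python
import numpy as np

def encode(frame, n_coeffs=8, bits=4):
    """Toy analysis step: reduce a frame to a few coarse, quantized
    spectral band energies (the 'minimum required data')."""
    spectrum = np.abs(np.fft.rfft(frame))
    bands = np.array_split(spectrum, n_coeffs)
    feats = np.array([b.mean() for b in bands])
    scale = feats.max() + 1e-9
    # Quantize each band energy to `bits` bits.
    q = np.round(feats / scale * (2**bits - 1)).astype(np.uint8)
    return q, scale  # this small payload is all that crosses the wire

def decode(q, scale, frame_len, bits=4):
    """Toy synthesis step: a real codec would condition a generative
    model on the features; here we just shape random-phase noise with
    the coarse spectral envelope."""
    feats = q.astype(float) / (2**bits - 1) * scale
    n_bins = frame_len // 2 + 1
    envelope = np.repeat(feats, int(np.ceil(n_bins / len(feats))))[:n_bins]
    phase = np.random.uniform(0, 2 * np.pi, n_bins)
    return np.fft.irfft(envelope * np.exp(1j * phase), n=frame_len)

frame = np.sin(2 * np.pi * 440 * np.arange(320) / 16000)  # 20 ms at 16 kHz
payload, scale = encode(frame)
rebuilt = decode(payload, scale, len(frame))
```

With these (made-up) numbers the payload is 8 bands × 4 bits per 20 ms frame, roughly 1.6 kb/s before overhead, which gives a feel for why rebuilding instead of compressing can land in the low-kilobit range the paper targets.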
This week covers the paper about voice, and next week will be about how to rebuild your face at the destination. Scary times, but also super fascinating technology.
The recent emergence of machine-learning-based generative models for speech suggests a significant reduction in bit rate for speech codecs is possible. However, the performance of generative models deteriorates significantly with the distortions present in real-world input signals. We argue that this deterioration is due to the sensitivity of the maximum likelihood criterion to outliers and the ineffectiveness of modeling a sum of independent signals with a single autoregressive model. We introduce predictive-variance regularization to reduce the sensitivity to outliers, resulting in a significant increase in performance. We show that noise reduction to remove unwanted signals can significantly increase performance. We provide extensive subjective performance evaluations that show that our system based on generative modeling provides state-of-the-art coding performance at 3 kb/s for real-world speech signals at reasonable computational complexity.
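To make the "sensitivity of the maximum likelihood criterion to outliers" point concrete, here is a hedged sketch of what a variance-regularized training loss could look like. The Gaussian negative log-likelihood is standard; the penalty term, its exponential form, and the weight `lam` are my assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def gaussian_nll(x, mu, log_var):
    """Per-sample negative log-likelihood under a Gaussian
    predictive distribution N(mu, exp(log_var))."""
    return 0.5 * (np.log(2 * np.pi) + log_var + (x - mu) ** 2 / np.exp(log_var))

def regularized_loss(x, mu, log_var, lam=0.0):
    """Maximum-likelihood loss plus a penalty on large predictive
    variance, discouraging the model from inflating its uncertainty
    to 'explain away' outliers. The penalty form is a hypothetical
    stand-in for the paper's predictive-variance regularization."""
    nll = gaussian_nll(x, mu, log_var)
    penalty = lam * np.exp(log_var)  # grows with predictive variance
    return np.mean(nll + penalty)
```

The intuition the sketch captures: under plain maximum likelihood, a few outlier samples can dominate training because the model is pushed to assign them probability mass; penalizing large predictive variance bounds how far the model will drift to accommodate them.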