We humans can, to some extent, understand speech just from lip movements. In this research, the authors gave that capability to a model, using facial videos as input.
A lot of these applications are built using facial landmarks, and they are pretty cool.
Applications include security: being able to understand what a person is saying from a distance, just by looking at their lip movements. One of the challenges is synthesizing a human-like voice, since that is still an active research area.
A lot of datasets related to this task are small and hence cover a limited vocabulary. So naturally, the authors also released a much larger dataset that covers a broader vocabulary.
One thing to keep in mind: since the authors created the dataset themselves, their SOTA is not really a SOTA. I will have to see how this area of research progresses.
Another great example of how learning a good manifold can enable powerful applications.
So those are the people in their dataset; each speaker has 20+ hours of speech video. There is naturally going to be a bias in the generated voices: they are mostly going to be male voices.
But it is good to see this area of research getting started. The problem is formulated as translating 2D images into a mel spectrogram, and the authors frame it as sequence-to-sequence generation. They are able to train the whole thing end to end, and I believe that is a really powerful framework.
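To make the prediction target concrete, here is a minimal numpy sketch of how a waveform becomes a mel spectrogram. The frame sizes, hop length, and filterbank construction are illustrative assumptions on my part, not the paper's exact preprocessing:

```python
import numpy as np

def mel_filterbank(n_mels, n_fft, sr):
    """Build a simplified triangular mel filterbank (illustrative only)."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # Equally spaced points on the mel scale, converted back to FFT bins
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):
            fb[m - 1, k] = (k - l) / max(c - l, 1)   # rising slope
        for k in range(c, r):
            fb[m - 1, k] = (r - k) / max(r - c, 1)   # falling slope
    return fb

def mel_spectrogram(wav, sr=16000, n_fft=512, hop=160, n_mels=80):
    """Frame the waveform, take the magnitude FFT, project onto mel bands, log-compress."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wav) - n_fft) // hop
    frames = np.stack([wav[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))   # (n_frames, n_fft//2 + 1)
    mel = mag @ mel_filterbank(n_mels, n_fft, sr).T      # (n_frames, n_mels)
    return np.log(mel + 1e-6)

# Toy waveform: one second of a 440 Hz tone
wav = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
spec = mel_spectrogram(wav)
print(spec.shape)  # (97, 80)
```

The model's job is then to predict a matrix like `spec` directly from the video frames.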
Training in high dimensions is a huge problem, and it seems the authors faced some challenges training this model in such a high-dimensional space.
Since we are solving a sequence problem, recurrent networks are, of course, going to be used. A limitation of this work is that it cannot generate voices for videos containing two or more speakers.
The frames are extracted one by one, and each image is used as the input data.
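As a toy illustration of the recurrent sequence-to-sequence idea (random weights and made-up sizes, not the paper's architecture), one mel frame can be emitted per visual time step like this:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes, not the paper's: 25 video frames, 128-d visual
# features per frame, 64-d hidden state, 80 mel bins per output frame.
T, feat_dim, hid_dim, n_mels = 25, 128, 64, 80

# Random matrices stand in for trained parameters.
W_xh = rng.normal(0, 0.1, (feat_dim, hid_dim))   # input -> hidden
W_hh = rng.normal(0, 0.1, (hid_dim, hid_dim))    # hidden -> hidden
W_hy = rng.normal(0, 0.1, (hid_dim, n_mels))     # hidden -> mel frame

def rnn_seq2seq(frames):
    """Run a plain Elman RNN over per-frame visual features and emit
    one mel-spectrogram frame per step (a toy seq2seq stand-in)."""
    h = np.zeros(hid_dim)
    outputs = []
    for x in frames:                       # one feature vector per video frame
        h = np.tanh(x @ W_xh + h @ W_hh)   # recurrent state update
        outputs.append(h @ W_hy)           # predicted mel frame
    return np.stack(outputs)

frames = rng.normal(size=(T, feat_dim))
mels = rnn_seq2seq(frames)
print(mels.shape)  # (25, 80)
```

The recurrence is what lets the prediction at each step depend on the lip motion seen so far, not just the current frame.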
Wow, I had no idea these were the metrics used for voice problems. I wonder if those metrics could be used as a loss function: differentiate them and optimize the model directly.
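As a rough sketch of that idea, here is gradient descent on a toy spectral-distance "metric" (a made-up stand-in I chose for illustration, not the paper's evaluation metrics), using finite differences, which works even when the metric has no analytic derivative:

```python
import numpy as np

def toy_metric(pred, ref):
    """Toy stand-in for a speech metric: MSE between magnitude spectra.
    Real perceptual metrics are far more involved than this."""
    return np.mean((np.abs(np.fft.rfft(pred)) - np.abs(np.fft.rfft(ref))) ** 2)

def numerical_grad(f, x, eps=1e-4):
    """Finite-difference gradient: usable for black-box metrics."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        x[i] += eps; hi = f(x)
        x[i] -= 2 * eps; lo = f(x)
        x[i] += eps
        g[i] = (hi - lo) / (2 * eps)
    return g

rng = np.random.default_rng(1)
ref = np.sin(2 * np.pi * 5 * np.arange(64) / 64)   # clean reference signal
pred = ref + 0.3 * rng.normal(size=64)             # noisy "prediction"

before = toy_metric(pred, ref)
for _ in range(30):
    pred -= 0.05 * numerical_grad(lambda x: toy_metric(x, ref), pred)
after = toy_metric(pred, ref)
print(before, after)  # after < before: the metric decreased
```

In practice one would want an analytically differentiable (or autograd-friendly) version of the metric rather than finite differences, but the principle is the same.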
And they were able to outperform the other methods, but again, we need to take the results with a bit of caution, since the authors also created the dataset. Another question that has to be answered: can we choose what kind of voice is generated?
The facial encoder model focuses heavily on the lip region of a given video. This is a promising result; we want the model to focus on the lip portion.
The 3D CNN was the best, since it is able to encode the temporal information as well.
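A naive 3D convolution makes it clear why: each output value mixes several consecutive frames, so lip motion (not just lip shape) reaches the features. This is just an illustrative numpy sketch, not the paper's encoder:

```python
import numpy as np

def conv3d_naive(video, kernel):
    """Naive valid 3D convolution over (time, height, width).
    Each output voxel sums over a block spanning several frames,
    which is how a 3D CNN picks up temporal cues that a
    per-frame 2D CNN cannot."""
    T, H, W = video.shape
    kt, kh, kw = kernel.shape
    out = np.zeros((T - kt + 1, H - kh + 1, W - kw + 1))
    for t in range(out.shape[0]):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[t, i, j] = np.sum(video[t:t+kt, i:i+kh, j:j+kw] * kernel)
    return out

rng = np.random.default_rng(0)
video = rng.normal(size=(8, 16, 16))   # 8 frames of 16x16 mouth crops
kernel = rng.normal(size=(3, 3, 3))    # spans 3 consecutive frames
feat = conv3d_naive(video, kernel)
print(feat.shape)  # (6, 14, 14)
```

Stacking such layers grows the temporal receptive field, so deeper features summarize longer stretches of lip movement.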