By Tat Banerjee
Today we are looking at a very interesting piece of work. A team from IIIT-H (the International Institute of Information Technology, Hyderabad) has built an AI that translates and lip-syncs a video from one language to another.
This originally came to our attention in this article from the folks at TNW. The paper the article is based on is here, and if you are interested in AI and language, you should absolutely have a look.
Video Translator: Face-To-Face Translation
So what is the team at IIIT-H doing? In the paper’s words, the goal is to extend systems that “only provide textual transcripts or translated speech for talking face videos to also translate the visual modality, i.e. lip and mouth movements. Consequently, our proposed pipeline produces fully translated talking face videos with corresponding lip synchronisation.”
So this is really very cool. What is happening here is:
- First there is a text-to-text translation happening
- Next there is a speech-to-speech translation happening
- This gets added to the visual translation (which is the bit the IIIT-H team worked on)
Together these give what the researchers are calling Face-to-Face Translation. One of the research team members has a YouTube channel; there is a sample below.
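To make the moving parts concrete, here is a minimal sketch of how the stages might compose, as we read the paper (including the speech recognition step that produces the transcript in the first place). Every function name here is a placeholder we made up for illustration; this is not the IIIT-H team’s code or any real library API.

```python
from typing import Callable

def face_to_face_translate(
    frames,                 # video frames of the talking face
    audio,                  # the original speech audio
    recognise: Callable,    # speech -> transcript in the source language
    translate: Callable,    # transcript -> translated text
    synthesise: Callable,   # translated text -> translated speech
    lip_sync: Callable,     # (frames, speech) -> re-rendered frames
):
    """Compose the stages of a Face-to-Face Translation pipeline."""
    transcript = recognise(audio)                   # transcribe the source speech
    translated_text = translate(transcript)         # text-to-text translation
    translated_audio = synthesise(translated_text)  # speech synthesis
    # Visual translation, the IIIT-H contribution: re-render the lip and
    # mouth movements so they match the newly synthesised audio.
    translated_frames = lip_sync(frames, translated_audio)
    return translated_frames, translated_audio
```

Passing the stages in as functions is just to emphasise that the pipeline is modular; the novel bit in the paper is the final lip-sync stage.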
While the technology is very cool, it’s not something we really do. The approach we have taken is quite different.
Video Translator: Speech-To-Speech Translation (Our App!)
So how is this technology different to what is happening here at VideoTranslator?
The difference is that we expect human intervention at every step. That is:
- Do a Speech-To-Text AI Transcription (human post-editing expected)
- With the transcript, do a Text-To-Text Translation (human post-editing expected)
- With the translated transcript, do a Text-To-Speech Dubbing (human post-editing expected)
Generally this means the end-to-end flow is better suited to assets that are expected to stay online for a long time. Hence we put extra effort into making the asset really nice.
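Here is a minimal sketch of that workflow, in the same spirit as the pipeline sketch above. The function names are again placeholders, not our actual app’s API; the point is where the human review sits.

```python
from typing import Callable

def dub_video(
    audio,                   # original speech audio from the video
    transcribe: Callable,    # speech -> transcript
    translate: Callable,     # transcript -> translated transcript
    synthesise: Callable,    # translated transcript -> dubbed speech
    human_review: Callable,  # a person checks and post-edits each AI output
):
    """Run the three AI stages, with human post-editing after each one."""
    transcript = human_review(transcribe(audio))          # (1) Speech-To-Text
    translation = human_review(translate(transcript))     # (2) Text-To-Text
    dubbed_audio = human_review(synthesise(translation))  # (3) Text-To-Speech
    return dubbed_audio
```

The human_review calls are the whole point: each AI output is treated as a draft that a person corrects before it feeds the next stage.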
What do such assets look like? This is an English video a client recently provided us.
This is the AI-dubbed Vietnamese version produced with our app.
Obviously the work IIIT-H has done is a scientific paper, whereas you can try our technology for free, because it’s a production SaaS app.
Which Is Better?
Clearly we are biased, and we think our approach is superior. But let’s talk about why.
Our clients report that you always want to over-disclose with AI.
OK, here is what is happening. You always want to tell people it is an AI. If they don’t know, most people feel like you are trying to fool them somehow, and then they react badly.
Basically AI Dubbing is pretty good, but it’s not that good. A human will always work out that something is up, and if you do not disclose, people get cranky.
Disclosing that you are using an AI is a very good idea, mostly because normal people (outside tech) are excited about technology and so will (paradoxically) pay more attention. That is, you get a win when you disclose, and people get cranky when you do not.
The lip-syncing IIIT-H has built is a very cool feature, but it comes awfully close to fake news, and communities worldwide have deep (and totally legitimate!) concerns about fake news.
No really — disclose that you are using an AI!
We think standards and regulation are needed here, and lip-syncing with AI is probably not going to reassure the community. That being said, if properly disclosed, there is almost certainly a place for this new technology.
We wish the team at IIIT-H the best, and hopefully we see their tech out in the wild sooner rather than later. Best of luck ladies and gents, and very nicely done!