data from games was quite easy, thanks to the very open and moddable nature of Bethesda games, and great tools like BAE, LazyAudio, and Bethesda’s Creation Kit. To start preparing the data for training, the audio files were first extracted from the game files, then decomposed into .lip and .wav files. The transcripts were extracted using the Creation Kit. Following this, it was just a matter of matching the audio files up to their transcripts, which was easy to do via a Python script. To finalize the transcripts, a number of filtering passes were run to exclude invalid data such as screams, shouts, music, and lines with extra unspoken text.
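As a rough illustration, the matching and filtering step can be as simple as the sketch below; the export file name, the column names, and the exclusion keywords are assumptions for illustration rather than the exact script behind xVASynth.

```python
import csv
from pathlib import Path

# Minimal sketch of the matching/filtering step. The export file name,
# the "filename"/"text" columns, and the exclusion markers are illustrative
# assumptions, not the exact script used for xVASynth.
AUDIO_DIR = Path("extracted_audio")
EXCLUDE = ("scream", "shout", "music", "[")  # rough markers for non-speech or annotated lines

pairs = []
with open("creation_kit_export.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        wav = AUDIO_DIR / (row["filename"] + ".wav")
        text = row["text"].strip()
        # Keep only lines that have a matching .wav file and clean spoken text
        if wav.exists() and text and not any(m in text.lower() for m in EXCLUDE):
            pairs.append((wav, text))

# Write the result in the common "path|transcript" training-metadata format
with open("metadata.csv", "w", encoding="utf-8") as out:
    out.writelines(f"{wav}|{text}\n" for wav, text in pairs)
```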
For the actual audio files, a couple of additional pre-processing steps were required, starting with down-sampling the audio to 22,050 Hz mono. Using pydub, silence was trimmed from either end of the audio, as well as from the middle where long pauses were present (sox can also do this, but it introduced audio artifacts).
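A minimal sketch of that pass, assuming pydub and illustrative threshold values:

```python
from pydub import AudioSegment
from pydub.silence import detect_nonsilent

# Sketch of the resampling and silence-trimming pass using pydub. The
# thresholds (-40 dBFS, 300 ms) and the 150 ms gap are illustrative values,
# not necessarily the ones used in xVASynth's actual pipeline.
def preprocess(in_path, out_path, silence_thresh=-40, min_silence_len=300):
    audio = AudioSegment.from_wav(in_path)
    audio = audio.set_frame_rate(22050).set_channels(1)  # 22,050 Hz mono

    # Find the spoken regions (in ms); keeping only these also trims both ends
    spoken = detect_nonsilent(audio, min_silence_len=min_silence_len,
                              silence_thresh=silence_thresh)
    if not spoken:
        return  # nothing but silence

    # Re-join the spoken regions with a short fixed gap, collapsing long pauses
    gap = AudioSegment.silent(duration=150, frame_rate=22050)
    trimmed = audio[spoken[0][0]:spoken[0][1]]
    for start, end in spoken[1:]:
        trimmed += gap + audio[start:end]
    trimmed.export(out_path, format="wav")
```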
Models
I started early experiments on this project in 2018, using Tacotron. At that point, the project was a proof of concept and was riddled with issues, as can be seen in this video I made showcasing progress at the v0.1 pre-release version.
Demo video for early experiments in xVASynth v0.1
Though it somewhat worked, the audio quality was terrible (with very heavy reverb), and the output was quite unstable. The model was also very slow to load. Additionally, the model required very large datasets, which limited me to voices from voice actors who had also recorded audio books (the data pre-processing for which was a whole other messy can of worms). Finally, any artistic control was limited to clever use of punctuation.
Fast-forward to 2020, and NVIDIA released its PyTorch FastPitch model. Several key features of this model make it useful for this project.
One of the main selling points of this model is its focus on letter-by-letter values for pitch and duration. This means that by hijacking these intermediate model values, a user can exert artistic control over how the audio is generated, a great plus for the acting part of voice acting.
Pitch sliders and duration editing tools in xVASynth
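Conceptually, the editing workflow looks something like the sketch below; the method names are hypothetical stand-ins for the intermediate hooks of a FastPitch-style model rather than NVIDIA's actual API.

```python
import torch

# Conceptual sketch of the "hijacking" idea: predict the per-character pitch
# and duration values, let the user edit them, then feed the edited tensors
# back into spectrogram generation. The method names (predict_prosody,
# generate_mel) are hypothetical stand-ins, not NVIDIA's actual API.
def synthesize_with_edits(model, text_ids, pitch_scale=None, duration_scale=None):
    with torch.no_grad():
        pitch, durations = model.predict_prosody(text_ids)  # one value per character
        if pitch_scale is not None:
            pitch = pitch * pitch_scale              # e.g. raise or lower single letters
        if duration_scale is not None:
            durations = durations * duration_scale   # e.g. stretch a word for emphasis
        return model.generate_mel(text_ids, pitch=pitch, durations=durations)
```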
Another plus is the support for multiple speakers per model. The best quality was still achieved when training a single speaker at a time. However, this meant that training (or at least pre-training) voices with only a small amount of data was now possible, by grouping them together.
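The usual way this kind of grouping works is to condition the model on a learned speaker embedding, as in the sketch below; the dimensions are illustrative rather than FastPitch's exact hyper-parameters.

```python
import torch
import torch.nn as nn

# Sketch of the standard multi-speaker conditioning trick: a learned embedding
# per speaker ID is added to the text encoder's output, so a single model can
# be (pre-)trained on several small voice datasets at once. Dimensions are
# illustrative, not FastPitch's exact hyper-parameters.
class SpeakerConditioning(nn.Module):
    def __init__(self, n_speakers: int, d_model: int = 384):
        super().__init__()
        self.speaker_emb = nn.Embedding(n_speakers, d_model)

    def forward(self, encoder_out: torch.Tensor, speaker_id: torch.Tensor):
        # encoder_out: [batch, text_len, d_model], speaker_id: [batch]
        return encoder_out + self.speaker_emb(speaker_id).unsqueeze(1)
```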
Image source: FastPitch paper on arxiv (https://arxiv.org/pdf/2006.06873.pdf)
However, the most important point for this project is that audio generation is bootstrapped by a pre-trained Tacotron 2 model, instead of being learned from scratch. As a pre-processing step, the Tacotron 2 model outputs the mel spectrograms and character durations that are used to compute the loss, as shown in the diagram above, from the paper (the pitch information is extracted using a different method).
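A minimal sketch of the duration-extraction idea, assuming the argmax-and-count approach described in the FastPitch paper (the repo's exact extraction code may differ):

```python
import torch

# Sketch of reading per-character durations off a trained Tacotron 2 attention
# alignment: for each mel frame, take the character it attends to most, then
# count the frames assigned to each character.
def durations_from_alignment(attention: torch.Tensor) -> torch.Tensor:
    # attention: [mel_frames, text_chars] soft alignment from Tacotron 2
    best_char = attention.argmax(dim=1)  # index of the character attended to per frame
    durations = torch.bincount(best_char, minlength=attention.shape[1])
    return durations  # number of mel frames spent on each input character
```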
Just like Tacotron and Tacotron 2, the FastPitch model generates mel spectrograms, so another model is needed to generate actual .wav audio files from those spectrograms. The FastPitch repo uses WaveGlow for this, which works really well.
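A minimal sketch of that second stage, loading WaveGlow via the torch.hub entrypoint from NVIDIA's published hub example and vocoding a placeholder spectrogram (both the entrypoint and the placeholder are assumptions for illustration):

```python
import torch
from scipy.io.wavfile import write

# Sketch of the mel-to-wav stage with WaveGlow. The torch.hub entrypoint
# follows NVIDIA's published hub example; treat it, and the random placeholder
# spectrogram, as assumptions rather than xVASynth's actual inference code.
waveglow = torch.hub.load("NVIDIA/DeepLearningExamples:torchhub",
                          "nvidia_waveglow", model_math="fp32")
waveglow.eval()

# Stand-in for a FastPitch output: [batch, n_mel_channels, frames]
mel = torch.randn(1, 80, 400)

with torch.no_grad():
    audio = waveglow.infer(mel)

write("output.wav", 22050, audio[0].cpu().numpy())
```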
This dependency on Tacotron 2 has meant that training has been far quicker, simpler, and more successful. However, an issue still persists when the speaker’s style is very different from that of LJSpeech, the dataset the pre-trained Tacotron 2 was trained on. LJSpeech is a female-speaker dataset, meaning deep male voices cannot be converted very well into the data FastPitch requires.
xVASynth right now
At the time of writing, xVASynth v1.0 comes with 34 models trained for 53 voice sets across 6 games, with many more planned for future releases. The issue with Tacotron 2 persists for most male voices, as I don’t currently have the hardware needed to train/fine-tune a Tacotron 2 model well enough, though this is the next step after my next hardware upgrade.
The video at the top of this post details usage examples for the app, which can be downloaded from either GitHub or the Nexus.
I additionally added HiFi-GAN as an alternative vocoder to WaveGlow, which is orders of magnitude faster on the CPU, albeit at lower audio quality.