What are the losses to be minimized?
- Contrastive Loss ($L_m$): This objective pulls the model's prediction at a masked timestep towards the true quantized latent speech representation for that timestep, while pushing it away from distractors (quantized latents sampled from other masked timesteps). In the paper it takes the form

  $$L_m = -\log \frac{\exp\big(\mathrm{sim}(c_t, q_t)/\kappa\big)}{\sum_{\tilde{q} \in Q_t} \exp\big(\mathrm{sim}(c_t, \tilde{q})/\kappa\big)}$$

  Here, $\mathrm{sim}(\cdot,\cdot)$ is the cosine-similarity function, $c_t$ is the output of the model (the context network) at timestep $t$, $q_t$ is the true quantized latent, $Q_t$ contains $q_t$ together with the distractors, and $\kappa$ is a temperature. A small code sketch follows the list.
- Diversity Loss ($L_d$): This objective encourages equal use of all $V$ entries in each of the $G$ codebooks. Concretely, we maximize the entropy of the softmax distribution over codebook entries, averaged across the utterances in a batch:

  $$L_d = \frac{1}{GV}\sum_{g=1}^{G} -H(\bar{p}_g) = \frac{1}{GV}\sum_{g=1}^{G}\sum_{v=1}^{V} \bar{p}_{g,v}\,\log \bar{p}_{g,v}$$

  Here, $\bar{p}_{g,v}$ is the batch-averaged probability of selecting the $v$-th entry of the $g$-th codebook. Note that this is the plain softmax distribution, without the Gumbel noise or the non-negative temperature used for quantization. A sketch of this term also follows the list.
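
To make the contrastive objective concrete, here is a minimal PyTorch sketch of the masked-timestep loss described above. The function name, tensor shapes, and the temperature value are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(c_t, q_t, distractors, temperature=0.1):
    """Minimal sketch of the masked contrastive objective.

    c_t:         (D,) context-network output at a masked timestep
    q_t:         (D,) true quantized latent for that timestep
    distractors: (K, D) quantized latents sampled from other masked timesteps
    """
    # Candidates = true target followed by the K distractors.
    candidates = torch.cat([q_t.unsqueeze(0), distractors], dim=0)    # (K+1, D)
    # Cosine similarity between the prediction and every candidate.
    sims = F.cosine_similarity(c_t.unsqueeze(0), candidates, dim=-1)  # (K+1,)
    logits = sims / temperature
    # The true target sits at index 0, so this is a (K+1)-way cross-entropy.
    target = torch.zeros(1, dtype=torch.long)
    return F.cross_entropy(logits.unsqueeze(0), target)
```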
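Similarly, a minimal sketch of the diversity term, assuming the quantizer exposes a `(batch, G, V)` tensor of plain softmax probabilities; the function name and the epsilon used for numerical stability are my own choices.

```python
import torch

def diversity_loss(probs):
    """Minimal sketch of the diversity objective.

    probs: (B, G, V) softmax distribution over the V entries of each of the
           G codebooks, one row per masked timestep/utterance in the batch.
    """
    G, V = probs.shape[1], probs.shape[2]
    # Average the distribution over the batch: one aggregate distribution
    # p̄_g per codebook.
    p_bar = probs.mean(dim=0)                                   # (G, V)
    # Entropy of each averaged distribution; maximizing it encourages all
    # V entries to be used equally often.
    entropy = -(p_bar * torch.log(p_bar + 1e-7)).sum(dim=-1)    # (G,)
    # Minimizing the negative normalized entropy pushes the entropy up.
    return -entropy.sum() / (G * V)
```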
The setup is changed slightly when pretraining on the smaller Librispeech dataset: an L2 penalty is applied to the activations of the final layer of the feature encoder to regularize training, and the gradients flowing into the feature encoder are scaled down by a factor of 10. The authors also suggest removing layer normalization from the feature encoder and instead normalizing only the output of its first layer. The checkpoint chosen is the one with the lowest contrastive loss on the validation set.
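These two regularization tweaks are easy to express as small hooks. The sketch below is an illustration under assumed tensor shapes; the helper names, the penalty weight, and where exactly the penalty is added are assumptions rather than the authors' code.

```python
import torch

def feature_penalty(encoder_out, weight=1.0):
    # encoder_out: (B, T, D) activations of the last feature-encoder layer.
    # A squared-L2 penalty on these activations is added to the total loss
    # as an extra regularizer (weight is a tunable coefficient).
    return weight * encoder_out.float().pow(2).mean()

class GradScale(torch.autograd.Function):
    """Identity in the forward pass; scales gradients in the backward pass."""
    @staticmethod
    def forward(ctx, x, scale):
        ctx.scale = scale
        return x

    @staticmethod
    def backward(ctx, grad_output):
        # One gradient per forward input: scale the gradient w.r.t. x,
        # no gradient for the scale constant.
        return grad_output * ctx.scale, None

# Inside the model's forward pass (names illustrative):
#   z = feature_encoder(waveform)     # (B, T, D)
#   z = GradScale.apply(z, 0.1)       # encoder gradients scaled down 10x
#   loss = L_m + alpha * L_d + feature_penalty(z)
```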
For decoding, there are two schools of thought: one uses a 4-gram LM, the other a Transformer LM pretrained on the Librispeech LM corpus; the best of both worlds is then chosen via a beam-search decoding scheme. For more details, please refer to the original paper.
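To illustrate what beam search with an external LM means in practice, here is a toy sketch of shallow fusion over a pre-tokenized output sequence. Everything here (function names, the `lm_weight`, the input format) is a simplification made up for illustration; it is not the decoder used in the paper.

```python
import heapq

def beam_search_with_lm(step_logprobs, lm_score, beam_size=5, lm_weight=0.5):
    """Toy beam search that mixes acoustic scores with an external LM.

    step_logprobs: list of dicts, one per output step, mapping token -> acoustic
                   log-probability (stand-in for the fine-tuned model's output).
    lm_score:      callable(prefix, token) -> LM log-probability; a 4-gram LM or
                   a Transformer LM would plug in here.
    """
    beams = [((), 0.0)]  # (token prefix, accumulated score)
    for logprobs in step_logprobs:
        candidates = []
        for prefix, score in beams:
            for token, acoustic_lp in logprobs.items():
                total = score + acoustic_lp + lm_weight * lm_score(prefix, token)
                candidates.append((prefix + (token,), total))
        # Keep only the beam_size best prefixes at every step.
        beams = heapq.nlargest(beam_size, candidates, key=lambda b: b[1])
    return max(beams, key=lambda b: b[1])
```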