The methods for predicting the V-segment and the J-segment are very similar because there are only a handful of classes each segment can belong to, and a model can be trained to predict which class a given segment is. This is a multiclass classification problem.
Input Data Representation
As mentioned, the input to the neural networks is the epitope protein sequence of the antigen. However, the epitope protein sequence is represented as letters and has a variable length. This is problematic because neural networks work exclusively with numerical values in a fixed-size input format.
Examples of epitope protein sequences: LLWNGPMAV (Yellow Fever Virus), CPSQEPMSIYVY (Cytomegalovirus), CTPYDINQM (Simian Immunodeficiency Virus)
Luckily, assigning each letter an ID makes this easy. We can map each letter in the protein sequence to a number: for example, the letter “A” becomes 1, the letter “B” becomes 2, and so on.
So the yellow fever virus sequence becomes:
LLWNGPMAV → 12 12 23 14 7 16 13 1 22
This newly encoded sequence of numbers has a length of 9. However, some sequences will have a length of 8, 10, 11, or up to 20 amino acids. A neural network needs a fixed-size input, so to achieve this we can pad every sequence to the maximum possible length of 20 with 0’s. So our encoded protein sequence for the Yellow Fever Virus becomes:
12 12 23 14 7 16 13 1 22 → 12 12 23 14 7 16 13 1 22 0 0 0 0 0 0 0 0 0 0 0
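To make this concrete, here is a minimal sketch of the encoding and padding step described above; the helper name encode_epitope is mine for illustration and not from the original code.

```python
MAX_LEN = 20  # longest epitope length mentioned above

def encode_epitope(sequence, max_len=MAX_LEN):
    # Map each letter to its position in the alphabet (A=1, B=2, ..., Z=26).
    ids = [ord(letter) - ord("A") + 1 for letter in sequence.upper()]
    # Pad with 0's on the right so every encoded sequence has length 20.
    return ids + [0] * (max_len - len(ids))

print(encode_epitope("LLWNGPMAV"))
# [12, 12, 23, 14, 7, 16, 13, 1, 22, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```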
These inputs are fed into our neural network in the form of arrays. So with the input handled, how does the neural network decide which segment is appropriate for the sequence?
A neural network is just a mathematical function that takes an input “x” and produces an output “y”. It has weights, or learned parameters, that transform the x into the y. There are different neural network architectures that use these parameters in different ways. The parameters are optimized to produce the best output possible. This is done by calculating a loss function, which represents how good the model is: the lower the loss, the better the model. We can use an optimizer (a calculus-based algorithm) to minimize the loss. This process is called “machine learning” or “training the model.”
The model has over 30 million parameters that can be optimized. It uses standard dense layers, which perform the most basic neural network operations on an input, yet it is very effective.
The last dense layer in the neural network has 126 neurons, which represent the 126 classes the V-segment of the T-cell can be. The output takes the form of a one-hot vector, meaning every neuron’s output is 0 except for one neuron whose output is 1. The position of the neuron with a value of 1 determines which class the V-segment is.
For the J-segment model, the last layer has 68 neurons, representing the 68 classes the J-segment can be. It’s the same model, just with a different number of classes in the last dense layer.
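The exact architecture isn’t listed here, so the following is only a rough Keras sketch of this kind of model, assuming a stack of dense layers ending in a softmax over the 126 V-segment classes (swap 126 for 68 to get the J-segment model); the layer sizes are illustrative, not the real ones.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

MAX_LEN = 20          # padded epitope length
NUM_V_CLASSES = 126   # 126 V-segment classes (use 68 for the J-segment model)

# Illustrative layer sizes only; the real model reportedly has over 30 million parameters.
model = models.Sequential([
    layers.Input(shape=(MAX_LEN,)),
    layers.Dense(1024, activation="relu"),
    layers.Dense(1024, activation="relu"),
    layers.Dense(NUM_V_CLASSES, activation="softmax"),  # one output per class
])

# Categorical cross-entropy is the standard loss for multiclass problems with
# one-hot labels; the optimizer adjusts the weights to make this loss lower.
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```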
So, for example, when predicting the V-segment class, if the output of the model is 1 0 0 0…, it means the V-segment is the 1st class, i.e. the TRBV6-8 variable segment.
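In practice the softmax output is a probability vector rather than an exact one-hot vector, so the predicted class is simply the position of the largest value. Reusing the sketches above:

```python
import numpy as np

# The output is a vector of 126 probabilities; the predicted class is the
# index of the largest one (index 0 would correspond to TRBV6-8 in the example).
probs = model.predict(np.array([encode_epitope("LLWNGPMAV")], dtype="float32"))[0]
predicted_class = int(np.argmax(probs))
print(predicted_class)
```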
Training / Results
After training both models, similar results were achieved. The training loss went down to 3 on the entire dataset, yet the validation loss was much worse. The model had a tough time generalizing to new data, yet it works in theory: many classes can be valid for a given epitope, so the loss in practice is effectively lower.
After training for 30 epochs each, the models work decently well on new, unseen data. I can predict one of the correct classes up to 80% of the time.
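For reference, a training run like the one described could look like the following, reusing the model sketch above; the data variables and the validation split are assumptions, and only the 30 epochs come from the text.

```python
# x_train: padded epitope encodings, shape (num_samples, 20)  (hypothetical variable)
# y_train: one-hot segment labels, shape (num_samples, 126)   (hypothetical variable)
history = model.fit(
    x_train, y_train,
    epochs=30,             # 30 epochs, as described above
    validation_split=0.2,  # hold out part of the data to compute the validation loss
)
```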