Hello, friends. My name is Cami. Welcome to another one of my mind dumps.
Chances are if you are reading this post, you are already familiar with my first PyTorch blog post Contributing to PyTorch by someone who doesn’t know a ton about PyTorch. If you haven’t read that post yet, plz go read and clap thx luv u.
I started my journey with PyTorch by taking the PyTorch Udacity class online… almost. I am actually still going through it. But as any impatient engineer would, I just wanted to dive in and get started without having to step through the entire thing. Probably not the best decision, but hey, we laugh in the face of danger in these posts.
We are going to be building with the MNIST dataset. By the end of this post, you should have an AI that is able to distinguish between handwritten numbers.
With that said, to all my friends who are like me and just wanna start hecking building .✫*ﾟ･ﾟ｡.☆.*｡･ﾟ✫* LET’S DIVE RIGHT IN .✫*ﾟ･ﾟ｡.☆.*｡･ﾟ✫*.
Just kidding. BEFORE we start coding, here are some key words and phrases that I will be using throughout the post:
- Model: A neural network.
- GPU: Graphics processing unit, used to make your code run faster.
- TPU: Tensor processing unit, an accelerator Google built specifically for machine learning workloads, often even faster than a GPU for them.
- Activation Function: The activation function of a node defines the output of that node given an input or set of inputs.
- Hidden Layer: A layer in between input layers and output layers in a neural network, where artificial neurons take in a set of weighted inputs and produce an output through an activation function.
- Tensor: Multidimensional array that contains your data.
- Loss: How wrong the model is. Your goal in training should be to have the loss decrease.
- Gradient: The gradient says how we should change the weights of the network, so that the loss would be most minimized (but looking only at the local “gradient” of the neural network, at this point. So going this way a lot might not actually get you to the optimal solution!) <- This is a direct quote from Edward Yang, he is awesome go give him a Twitter follow.
- Optimizer: Optimizes according to the loss. An optimizer goes through and adjusts the weights based on the loss and gradients, taking steps sized by the learning rate.
- Learning Rate: Dictates the size of the step that your optimizer will take to get to minimum loss and optimal accuracy.
- Gradient Descent: An optimization algorithm used to minimize some function by iteratively moving in the direction of steepest descent as defined by the negative of the gradient. In our case, the algorithm to minimize our loss. (There's a tiny code sketch of a single descent step right after this list.)
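If, like me, you understand definitions better when you can run them, here is a minimal sketch of a single gradient descent step. The weight, input, target, and learning rate are all made up for illustration; this is only meant to connect the words above to code.

import torch

w = torch.tensor(2.0, requires_grad=True)  # a single weight in our "model"
x = torch.tensor(3.0)                      # input
target = torch.tensor(15.0)                # what we want the model to predict

prediction = w * x                         # forward pass
loss = (prediction - target) ** 2          # loss: how wrong is the model
loss.backward()                            # gradient: how to change w to shrink the loss

learning_rate = 0.01                       # size of the step we take
with torch.no_grad():
    w -= learning_rate * w.grad            # one gradient descent step
print(w)                                   # w has moved toward a value that fits the data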
You got it? I got it. Let’s do this.
First things first, we need to decide where we want to code. If you come from a software engineering background like me, chances are you already have an allegiance to an IDE (I love Sublime, we can debate over editors later).
Abandon that, right now. Because honestly the best place to start building your first t̶h̶i̶n̶g̶ application with PyTorch is either Jupyter Notebooks or Google Colab. Both of these are awesome. Colab in particular is a cloud environment that requires no setup: the Colab website says it is a “Google research project created to help disseminate machine learning education and research. We provide notebooks for several of our models that allow you to interact with them on a hosted Google Cloud instance for free.”
It’s safe to say that you CAN develop locally. On PyTorch.org you can find the “QUICK START LOCALLY” and “QUICK START WITH CLOUD PARTNERS” sections to get you up and moving.
If you have a local GPU, it definitely makes a big difference in training times. I didn’t have one, so I chose to start with Colab. Colab is awesome because you can actually set a hardware accelerator within your notebook, to use a GPU or a TPU.
I don’t need a hardware accelerator for this “Hello World” project, but it’s really nice to have. Instead of my code taking 3-ish seconds to run, the accelerator makes it take less than a second. Eventually when I want to scale, I will probably use one of the Cloud Partners, but for learning and where I am at now, Colab is fantastic.
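As an aside, if you want to confirm that your notebook actually picked up a GPU accelerator, a quick check like this works (we won't need to move anything onto the GPU in this post):

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)  # prints "cuda" if a GPU accelerator is enabled, otherwise "cpu"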
If you have heard of PyTorch, you have probably heard of tensors. The best way to solidify your understanding of tensors is to experiment a little.
Open a new Python3 project in Colab.
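Something along these lines works; the specific numbers are just ones I picked so the math below is easy to follow:

import torch

x = torch.Tensor([6, 3])
y = torch.Tensor([4, 1])

print(x + y)          # element-wise addition of the two tensors
print((x + y).shape)  # how many elements are in the resulting tensor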
Your output is tensor([10., 4.]), which is just 6+4 and 3+1, and torch.Size([2]), which is just how many elements are in the tensor. There are a lot of functions you can use to help you build your tensors, like torch.ones. You can read more about constructing tensors in the tech docs.
Tensors are the object in PyTorch that we will use to house our data. If you have a 28×28 image, you would have a tensor that looks like torch.Tensor([[0, 1, 2, ..., 27], [0, 1, 2, ..., 27], ..., [0, 1, 2, ..., 27]]): 28 rows of 28 values each.
MNIST is a really popular dataset that we can access within PyTorch, and a great place to begin our experiments with PyTorch. It is a collection of images of handwritten numbers zero through nine. In our first project with PyTorch, we will be training on the MNIST dataset, so our neural network will be able to differentiate between handwritten numbers and assign them a label 0 through 9.
I hope that at some point, I will write a blog post using my own curated dataset. Sending positive vibes to you, future Cami. You can do it.
Going back to Google Colab, delete our experiments with tensors, and import the following items.
import torch
import torchvision
from torchvision import transforms, datasets
If you are doing anything in PyTorch with vision, you will use torchvision. There are also packages like torchtext and torchaudio, intuitively named for their uses.

The datasets module from torchvision will give us access to MNIST. There are two different datasets that you want to keep track of within your code. First is a “training” dataset, and the other is a “testing” dataset. They’re pretty intuitive, but basically the training dataset is what your neural network will be trained on, what will help the neural network define what numbers 0 through 9 look like. The testing dataset is a dataset never seen by the neural network during the training phase. We want this “out of network” data to assure that our neural network isn’t over-fitting to the training dataset, and to assure that it can actually label independent items properly.
train = datasets.MNIST("", train=True, download=True, transform=transforms.Compose([transforms.ToTensor()]))
test = datasets.MNIST("", train=False, download=True, transform=transforms.Compose([transforms.ToTensor()]))
Let’s look at these parameters:
- "" is where we want the dataset to live; it’s the root directory the data gets downloaded into, and empty quotes just means “right here, locally.”
- train=<True/False> is a boolean indicating whether we want the training portion or the testing portion of the dataset.
- download=True allows you to download the data if it isn’t already there.
- transform= is saying what we want to transform the data to; here transforms.ToTensor() converts each image into a tensor.
Next we want to create data loaders. The trainset and testset loaders will allow us to iterate over this data.
trainset = torch.utils.data.DataLoader(train, batch_size=10, shuffle=True)
testset = torch.utils.data.DataLoader(test, batch_size=10, shuffle=False)
For these parameters:
- batch_size= is how much data we want our loader to hand over at once (here, 10 images at a time).
- shuffle= assures that your dataset is randomized and thus your neural network will generalize. For example, training on all the zeros in MNIST first isn’t the best approach, because the network could over-fit for zeros. Shuffling the dataset makes sure that we get a good mix of data to iterate over so we can train the neural network.
The objects in trainset are tuples, where the first element is a multidimensional tensor containing a batch of 10 examples of handwritten digits, and the second element is a 1-dimensional tensor containing the numbers corresponding to each of those digits. trainset isn’t indexable like a list, but in pseudocode, for one batch data pulled from trainset: data[0][0] will give us a tensor representing a handwritten digit, and data[1][0] will tell us that the handwritten digit is, say, a 7.
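If you want to see this for yourself, grab a single batch and poke at it. (The 7 above is just whatever digit happened to be first in my shuffled batch; yours will almost certainly differ.)

for data in trainset:
    X, y = data       # X: a batch of 10 image tensors, y: the 10 matching labels
    print(X.shape)    # torch.Size([10, 1, 28, 28])
    print(y)          # something like tensor([7, 0, 3, 9, 1, 4, 2, 8, 6, 5])
    print(y[0])       # the label for the first image in this batch
    break             # we only want to look at one batch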
Now to the fun stuff. Building the actual neural network. Buckle up kiddos, because you are about to feel tremendous computational POWER.
In order to build our model, we need to import some more things.
import torch.nn as nn
import torch.nn.functional as F
torch.nn gives us the classes for building our neural network, while torch.nn.functional gives us the functions we can use with the neural network we construct.

Next we have to define what our neural network will look like. We want to build a fully connected neural network, where the input is images that are 28 by 28, and the output is the digits 0 through 9. We will have 3 layers of 64 neurons for our hidden layers.
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 64)
        self.fc2 = nn.Linear(64, 64)
        self.fc3 = nn.Linear(64, 64)
        self.fc4 = nn.Linear(64, 10)
In a neural network, a “layer” is a group of nodes operating together at a specific depth within the network. Think of a layer as a stage the input has to go through in order to help define it. When you’re connecting layers, the output size of one layer has to match the input size of the next, so input can be accepted and output passed along consistently between layers. Before putting the image through a neural network, we have to flatten it to be one dimensional. Flattening changes it from a 28 by 28 tensor to a 1 by 784 tensor.
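As a quick sanity check on the shapes involved (we'll get to exactly which PyTorch function we use for this on our real data in a minute):

image = torch.rand(28, 28)             # stand-in for one handwritten digit
flat = torch.reshape(image, (1, 784))  # one dimensional: 1 by 784
print(image.shape)                     # torch.Size([28, 28])
print(flat.shape)                      # torch.Size([1, 784])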
Couple notes here:
- fc1, fc2, etc. are fully connected layers.
- nn.Linear(<input>, <output>) creates a fully connected (linear) layer with that many inputs and outputs.
- We start with 784 because it is our flattened image (ie: 28*28).
- fc2 must take in 64 inputs, because fc1 outputs 64. The same goes for fc3 and fc4.
- We want our output to end with 10 nodes. These 10 nodes represent digits 0 through 9.
Instantiating and printing the network (net = Net(), then print(net)) shows the structure we just defined:

Net(
  (fc1): Linear(in_features=784, out_features=64, bias=True)
  (fc2): Linear(in_features=64, out_features=64, bias=True)
  (fc3): Linear(in_features=64, out_features=64, bias=True)
  (fc4): Linear(in_features=64, out_features=10, bias=True)
)
We need to create a path for data to pass through our layers. We want our data to “feed forward”, which means to go in one direction, from input to output. The input of this function would be our data, and the output would be the result of the data passing through our neural network.
    # this goes inside the Net class, right after __init__
    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = F.relu(self.fc3(x))
        x = self.fc4(x)
        return F.log_softmax(x, dim=1)
We use the function F.relu, which is our rectified linear activation function. Our activation function defines the output of a node given an input or set of inputs. When we perform “feed forward” we need an activation function to help pass our data through. The rectified linear activation function is known for making your model easier to train while still achieving good performance.
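If you want to see what relu actually does to numbers: anything negative becomes zero, anything positive passes through unchanged.

print(F.relu(torch.tensor([-2.0, -0.5, 0.0, 1.0, 3.0])))
# tensor([0., 0., 0., 1., 3.])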
In writing these two functions, 🎉 we have constructed our neural network, and told data how to pass through our neural network 🎉. Feel free to pause and weep over your beautiful code.
Testing this on a flattened image, I should expect the output to be a tensor giving me a prediction for each digit for that single data point.
However WHEN I tested my functions, I got an error that I wanted to share. Apparently in PyTorch there are the functions flatten, reshape, and view, which I honestly thought all did the same thing. Allow me to explain…
- torch.flatten(input, start_dim=0, end_dim=-1) flattens a contiguous range of dimensions in a tensor.
- torch.reshape(input, shape) returns a tensor with the same data and number of elements as input, but with the specified shape.
- And finally view(*shape) -> Tensor returns a new tensor with the same data as the self tensor but of a different shape.

Yeah, they all sound super similar. My guess is that flatten and reshape are closest in similarity, in that you could reshape a tensor to a similar output that flatten would give you. In any case, we want to use view, so we can perform it directly on our data rather than calling a separate torch function on it.
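Here's a tiny comparison of the three on the same tensor. As far as I can tell they land you in the same place for our purposes; view is simply the one we can call directly on our data:

t = torch.rand(28, 28)

a = torch.flatten(t)            # shape: torch.Size([784])
b = torch.reshape(t, (1, 784))  # shape: torch.Size([1, 784])
c = t.view(-1, 28 * 28)         # shape: torch.Size([1, 784])

print(torch.equal(b, c))            # True: same data, same shape
print(torch.equal(a, c.view(784)))  # True once the shapes match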
To test, I made a random tensor representing a 28 by 28 image and flattened it using view. …what? I know. Then I passed it through our neural network.
A = torch.rand((28, 28))   # a fake 28 by 28 "image"
A = A.view(-1, 28*28)      # flatten it to shape (1, 784)
output = net(A)            # pass it through the network
print(output)
This resulted in the output:

tensor([[-2.2614, -2.3526, -2.3101, -2.3914, -2.2236, -2.2715, -2.2735, -2.3167, -2.2197, -2.4265]], grad_fn=<LogSoftmaxBackward>)

Each of these elements is a prediction for one of our digits. WOW! We are doing something!
Ideally these would all sum up to 1. Actually, if you switch from returning F.log_softmax(x, dim=1) to F.softmax(x, dim=1) in forward, the sum of the output is 1. The difference is that log_softmax is basically log(softmax(x)). We use log_softmax, though, because it yields improved numerical performance and gradient optimization.
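You can check both claims with a made-up tensor of raw scores (the numbers here are arbitrary):

scores = torch.tensor([[1.0, 2.0, 3.0]])
probs = F.softmax(scores, dim=1)
print(probs.sum())  # tensor(1.): softmax output sums to 1
print(torch.allclose(F.log_softmax(scores, dim=1), torch.log(probs)))  # True: log_softmax is log(softmax)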
Now it is time to pass our datasets through our model to train it. I tried to find a diagram that 1. I thought described the training process well and 2. that was free for me to use, but it proved to be difficult. So I tried making my own:
Okay, I get it, it's rough. Also if you’ll notice I made it at 8:35PM on a Sunday and so my brain was basically mush. But let me explain and maybe it’ll make more sense.
You have the skeleton for your neural network, and your data. First, you iterate through your data and pass it forward into your neural network. This output gives us a loss, a.k.a. how wrong our initial prediction was compared to its label. Then with that loss, we back-propagate through the neural network’s parameters to get gradients. We then use an optimizer to adjust the weights within the layers of the neural network according to what the loss and gradients specified.
This diagram that I found the next morning is also nice:
Code-wise, this isn’t overly complicated thanks to the wonderful contributors of PyTorch.
First, import and initialize our optimizer:
import torch.optim as optim

optimizer = optim.Adam(net.parameters(), lr=0.001)
Adam is an algorithm for first-order gradient-based optimization, built into PyTorch.
- net.parameters() hands the optimizer everything that is adjustable in our model, or everything that the loss can back-propagate over.
- lr is our learning rate, or the step size that the optimizer takes to help find minimum loss.
On more complicated tasks, we can use a decaying learning rate, where over time the learning rate gets smaller and smaller. But in our case, this is “Hello World”, we don’t need to add to the complexity. Taking the same step size for every iteration using the optimizer is fine by me.
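For the curious, PyTorch ships learning rate schedulers for exactly this. A sketch (not something we'll actually use in this post) could look like:

scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.9)
# calling scheduler.step() at the end of each epoch would shrink the learning rate by 10%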
Next, we need to define how many times we will go through this training loop:
EPOCHS = 3
Finally, we will write our training loop. We will iterate over our trainset. Remember that each object in the trainset is a batch of our feature sets (multidimensional tensors) that depict the hand-drawn images, and the labels (1-dimensional tensors) that say what number each image represents. On each iteration of the loop, we want to zero out the gradient, to reset the loss from the previous back-propagation and optimization. Once the gradient is zero, we will calculate the loss based upon the current data, apply the loss backwards through the model’s parameters, and then optimize the weights to account for that loss.
for epoch in range(EPOCHS):
    for data in trainset:
        X, y = data
        net.zero_grad()
        output = net(X.view(-1, 28*28))
        loss = F.nll_loss(output, y)
        loss.backward()
        optimizer.step()
With each step of this loop, our loss should go down. I added a print statement `print(loss)` within my loop to assure this was the case. My output looked something like this:
tensor(0.1253, grad_fn=<NllLossBackward>)
tensor(0.0011, grad_fn=<NllLossBackward>)
tensor(6.5088e-06, grad_fn=<NllLossBackward>)
If yours don’t match mine, no worries. As long as on each iteration they are going down, we are good.
And holy wow that is it for training! We are almost at the finish line, but like any good engineer, we must TEST.
Okay so testing is going to be the most simple aspect of this whole blog post. Essentially, we are just re-writing the training loop but going over our testset instead of our trainset.

The one key piece for testing, however, is that we do not want to calculate gradients. Because this is out-of-network data, we don’t care about our loss. All we care about is whether every prediction we made with our test data matches its target value.
correct = 0
total = 0

with torch.no_grad():
    for data in testset:
        X, y = data
        output = net(X.view(-1, 784))
        for index, i in enumerate(output):
            if torch.argmax(i) == y[index]:
                correct += 1
            total += 1
output is a batch of predictions based upon the hand-drawn images in testset. Within that batch are 1-dimensional tensors of size 10, where each index represents the prediction for the corresponding label (0 through 9).
I wanted to add a side note that in PyTorch 1.3 they recently added Named Tensors, where rather than relying on yourself remembering what each index or dimension means, you can actually give dimensions names and refer to them by name throughout your code. This will be helpful in the future when your labels are things like “book”, “microwave”, or “car” and you have more dimensions to keep track of.
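A tiny taste of what that looks like (named tensors are still experimental, so this sketch may shift between versions):

imgs = torch.zeros(10, 28, 28, names=('batch', 'height', 'width'))
print(imgs.names)          # ('batch', 'height', 'width')
print(imgs.size('batch'))  # 10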
To compare our prediction to the target value, we use something called argmax, in order to get the label with the highest prediction. I used my correct and total variables to calculate an accuracy:

print("Accuracy: ", round(correct/total, 3))

For me, this printed an accuracy of about 0.97.
And HUZZAH. We did it. Our trained neural network gave us 97% accuracy on our test data. All in all, we have created a usable model for the MNIST dataset.
Finally, dear friends, if you are like me, and someone who just wants to copy code and run it, this section is for you. Here is all the code, in one block, that we wrote in this blog post on my GitHub.
I want to give kudos to sentdex on YouTube and Twitter. His quick tutorials were super helpful to fill in the blanks.
If you have read this far, you are quite the champion. Pat yourself on the back, give this a clap, and please, follow me on Twitter (@cwillycs).
Aaaaand a link dump for MORE LEARNING: