In this article, I’m going to show you the main steps for colorizing black-and-white images using machine learning with Keras and TensorFlow.
Before getting into the code, let’s first discuss the problem in general terms. As we go deeper, I’ll walk you through the details and the code.
The first thing to do when solving a problem is to identify it, so let’s begin by asking: what exactly is the problem here?
Before answering this question, let me show you what an image consists of. An image is usually stored as red, green, and blue (RGB) components, which are added together to make the image you see on your computer. As you can see below, a single image can be separated into its R, G, and B components.
If you combine them again, you end up with exactly the image in the figure below.
Now that you know what an image consists of, you can see that you can’t generate just one channel to take care of the colorization. In the RGB color space, the color information is spread across all three channels, and if any one of them were missing, the colors of the image would be destroyed.
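To make the channel idea concrete, here is a minimal sketch with NumPy. The 2x2 image is made up for illustration; a real photo loaded as an array behaves the same way.

```python
import numpy as np

# A tiny 2x2 RGB image (values 0-255), made up for illustration.
image = np.array([[[255, 0, 0], [0, 255, 0]],
                  [[0, 0, 255], [255, 255, 255]]], dtype=np.uint8)

# Each channel is a 2-D array with the same height and width as the image.
red, green, blue = image[:, :, 0], image[:, :, 1], image[:, :, 2]

# Stacking the channels along the last axis rebuilds the original image.
rebuilt = np.stack([red, green, blue], axis=-1)

# Zeroing one channel (here: blue) changes the colors of the image.
no_blue = image.copy()
no_blue[:, :, 2] = 0
```

The rebuilt array is identical to the original, while the copy without its blue channel is not. That is why a colorizer working in RGB has to produce all three channels.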
So the main problem is this: you have a black-and-white image as your input, and you want to generate the three RGB channels of that image as your output. Imagine you put a black-and-white image into a black box from the left side and receive the three colorized RGB components from the right side. What is that black box? The answer is “Auto-encoders”.
Ian Goodfellow, the inventor of GANs, describes Auto-encoders in his book “Deep Learning” as follows:
An Auto-encoder is a neural network that is trained to attempt to copy its input to its output. Internally, it has a hidden layer h that describes a code used to represent the input. The network may be viewed as consisting of two parts: an encoder function h=f(x) and a decoder that produces a reconstruction r=g(h). This architecture is presented in figure 1.1. If an Auto-encoder succeeds in simply learning to set g(f(x))=x everywhere, then it is not especially useful. Instead, Auto-encoders are designed to be unable to learn to copy perfectly. Usually they are restricted in ways that allow them to copy only approximately, and to copy only input that resembles the training data. Because the model is forced to prioritize which aspects of the input should be copied, it often learns useful properties of the data. Modern Auto-encoders have generalized the idea of an encoder and a decoder beyond deterministic functions to stochastic mappings p_encoder(h|x) and p_decoder(x|h).
Traditionally, Auto-encoders were used for dimensionality reduction or feature learning. Recently, theoretical connections between Auto-encoders and latent variable models have brought Auto-encoders to the forefront of generative modeling. Auto-encoders may be thought of as being a special case of feedforward networks and may be trained with all the same techniques, typically minibatch gradient descent following gradients computed by back-propagation.
The general structure of an Auto-encoder, mapping an input x to an output (called reconstruction) r through an internal representation or code h. The Auto-encoder has two components: the encoder f (mapping x to h) and the decoder g (mapping h to r).
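The notation h=f(x) and r=g(h) can be sketched in a few lines of NumPy. This is only an illustration of the structure: the sizes (8 and 3) are hypothetical and the weights are random and untrained, so the reconstruction will be poor.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: an 8-dimensional input squeezed into a 3-dimensional code.
input_dim, code_dim = 8, 3
W_enc = rng.normal(size=(input_dim, code_dim))
W_dec = rng.normal(size=(code_dim, input_dim))

def f(x):
    """Encoder: maps the input x to the code h."""
    return np.tanh(x @ W_enc)

def g(h):
    """Decoder: maps the code h to the reconstruction r."""
    return h @ W_dec

x = rng.normal(size=(1, input_dim))
h = f(x)  # code (smaller than the input)
r = g(h)  # reconstruction (same shape as the input)

# Training would adjust W_enc and W_dec to make this number small.
reconstruction_error = float(np.mean((r - x) ** 2))
```

Because the code h has fewer dimensions than x, the network cannot simply copy its input, which is the restriction the quote describes.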
As you can see, an Auto-encoder can reconstruct an image. In other words, it has the ability to generate, and that’s exactly what we want to do: generate the three RGB channels.
One approach is to make two copies of each image. The first is a grayscale copy that acts as the input to the encoder, which extracts the features of the image (the “latent space representation”) that the decoder then uses to reconstruct the image. The second is the same image in color, which serves as the target for the decoder (supervised learning), so that training minimizes the error between the original color image and the generated one. The Auto-encoder architecture would look something like the figure below.
The encoder is a convolutional neural network (CNN) that extracts features. The decoder also contains convolutional layers like those in the encoder (with different filters), but they are followed by upsampling layers for the reconstruction part.
You can control the number of layers and the number of filters in each layer. Of course, the last layer should contain three filters, which become the RGB channels of the reconstructed image. But there is something smarter you can do. What if you used another color space, instead of RGB, that isolates the color information from the rest of the image? Then you would get the pure black-and-white information of an image in one single channel, and the other two channels would hold the color information. That sounds like a very good idea, and it is exactly what the Lab color space does for you.
There are many color spaces, such as RGB, CMYK, Lab, and XYZ.
Lab (also known as CIELAB or L*a*b*) is a color space that completely separates lightness from color. Think of lightness as a sort of grayscale image: it has luminosity but no colors at all. The L channel is responsible for that lightness (grayscale), and the other two channels, a and b, are responsible for the colors. As you can see in the images below, the color information is embedded in the ab channels. You may notice that without L it is very hard to tell what is in the picture by looking at ab alone. A commonly cited explanation is that about 94% of the receptors in our eyes determine lightness (L), which leaves only about 6% to act as sensors for color (ab).
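If you are curious how the separation works numerically, here is a from-scratch sketch of the standard sRGB to CIELAB formulas in NumPy. It assumes sRGB input in [0, 1] and the D65 white point; it is not the code used later in this article, which relies on skimage’s rgb2lab instead.

```python
import numpy as np

def srgb_to_lab(rgb):
    """Convert sRGB values in [0, 1] (shape (..., 3)) to CIELAB under D65."""
    rgb = np.asarray(rgb, dtype=float)
    # Undo the sRGB gamma curve to get linear light.
    linear = np.where(rgb <= 0.04045, rgb / 12.92, ((rgb + 0.055) / 1.055) ** 2.4)
    # Linear sRGB -> CIE XYZ (D65).
    m = np.array([[0.4124564, 0.3575761, 0.1804375],
                  [0.2126729, 0.7151522, 0.0721750],
                  [0.0193339, 0.1191920, 0.9503041]])
    xyz = linear @ m.T
    # Normalize by the D65 reference white.
    xyz = xyz / np.array([0.95047, 1.0, 1.08883])
    delta = 6 / 29
    f = np.where(xyz > delta ** 3, np.cbrt(xyz), xyz / (3 * delta ** 2) + 4 / 29)
    L = 116 * f[..., 1] - 16          # lightness
    a = 500 * (f[..., 0] - f[..., 1])  # green-red axis
    b = 200 * (f[..., 1] - f[..., 2])  # blue-yellow axis
    return np.stack([L, a, b], axis=-1)

white = srgb_to_lab([1.0, 1.0, 1.0])
gray = srgb_to_lab([0.5, 0.5, 0.5])
red = srgb_to_lab([1.0, 0.0, 0.0])
```

White and mid-gray come out with a and b close to zero and differ only in L, while pure red gets a large positive a. That is the property we exploit: a grayscale image already gives us L, so the network only has to predict a and b.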
For further explanation, see the video below, presented by Marco Olivotto (a physicist who works in the field of image color correction).
Lab color space (further explanation)
For the technical part, I coded this approach on Google Colaboratory. If you don’t know what Google Colab is: Colaboratory is a free Jupyter notebook environment that requires no setup and runs entirely in the cloud. With Colaboratory you can write and execute code, save and share your analyses, and access powerful computing resources (an Nvidia Tesla K80 GPU), all for free from your browser. It allows you to use up to 12 GB of RAM.
1. Choose your dataset
Choosing the dataset may seem easy, but it is not. You must choose a dataset that is consistent. For example, many of the datasets I found contain a lot of black-and-white images, and those increase your error during training: instead of showing the model a grayscale image together with its colored version, so it can learn the mapping from grayscale to color, you are showing it a grayscale image with another grayscale image as the target, and there is nothing useful to learn from that. So remove any grayscale images from your dataset and make sure everything that remains is a color (RGB) image. After filtering your data, you are ready for the preprocessing part.
2. Preprocess your data
Now you have a dataset of RGB images. As you saw in the general explanation, Lab is a better choice than RGB here, because you are only interested in generating colors, and in Lab the color information lives in the ab channels alone. So during preprocessing you need to convert the whole dataset from RGB to Lab. Before converting to Lab, normalize your dataset by dividing by the maximum value an RGB pixel can reach: each color channel ranges from 0 to 255, so that maximum is 255. This normalization makes the prediction error comparable across images and helps training converge faster. Also, all the images must be the same size, so we will resize them to 256 x 256.
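Here is a minimal NumPy sketch of those two preprocessing steps on a made-up 4x4 image. The nearest-neighbour resize is my own stand-in to keep the example dependency-free; in the article’s pipeline Keras does the resizing for you.

```python
import numpy as np

# A made-up 4x4 RGB image with 8-bit values.
rng = np.random.default_rng(0)
image_uint8 = rng.integers(0, 256, size=(4, 4, 3), dtype=np.uint8)

# Dividing by the maximum 8-bit value maps every pixel into [0, 1].
image_float = image_uint8.astype(np.float32) / 255.0

def resize_nearest(img, size):
    """Resize an (H, W, C) image to size=(rows, cols) by nearest neighbour."""
    rows = np.arange(size[0]) * img.shape[0] // size[0]
    cols = np.arange(size[1]) * img.shape[1] // size[1]
    return img[rows][:, cols]

# Bring the image to a fixed size (8x8 here, 256x256 in the article).
resized = resize_nearest(image_float, (8, 8))
```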
from keras.preprocessing import image
from keras.engine import Layer
from keras.layers import Conv2D, Conv3D, UpSampling2D, InputLayer, Conv2DTranspose, Input, Reshape, merge, concatenate
from keras.layers import Activation, Dense, Dropout, Flatten
from keras.layers.normalization import BatchNormalization
from keras.callbacks import TensorBoard
from keras.models import Sequential, Model
from keras.preprocessing.image import ImageDataGenerator, array_to_img, img_to_array, load_img
from skimage.color import rgb2lab, lab2rgb, rgb2gray, gray2rgb
from skimage.transform import resize
from skimage.io import imsave
import os
from time import time
import keras
import numpy as np
import tensorflow as tf
from PIL import Image, ImageFile
The first thing we do is import the libraries that we will use. One of the things Google Colab lets you do is mount your Google Drive in your environment, so you don’t have to download or upload the dataset if it already exists in your Google Drive account. We can do so with the following code.
from google.colab import drive
drive.mount('/Mahmoud')  # mount point inferred from the dataset path used below
After you import your data, it’s time to normalize the images (divide them by 255). ImageDataGenerator lets you rescale your data on the fly, without consuming extra memory.
path = '/Mahmoud/My Drive/dataset/'
train_datagen = ImageDataGenerator(rescale=1. / 255)
Then resize the images to 256×256:
# `length` is assumed to be defined earlier as the number of images to load in one batch
train = train_datagen.flow_from_directory(path, target_size=(256, 256), batch_size=length, class_mode=None)
There is a catch in using train_datagen.flow_from_directory: you should put your dataset inside a folder and then pass the path of the folder that contains it. For example, if your dataset lives in “My Drive/dataset/Data”, pass “My Drive/dataset”, without the last folder “Data”. The reason is that every subfolder of the path you pass is treated as a separate class, so three subfolders would be read as three classes. Because we are only using one class for now, we have a single folder that contains our dataset.
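To see why the parent folder matters, here is a small standard-library sketch that mimics how classes are discovered, as I understand it: one class per immediate subfolder of the path you pass. The folder and file names are placeholders, and the files are empty.

```python
import os
import tempfile

# Build a throwaway folder tree that mimics "dataset/Data/<images>".
with tempfile.TemporaryDirectory() as root:
    dataset_dir = os.path.join(root, "dataset")
    data_dir = os.path.join(dataset_dir, "Data")
    os.makedirs(data_dir)
    for name in ("a.jpg", "b.jpg"):
        open(os.path.join(data_dir, name), "wb").close()

    # Each immediate subfolder of the directory you pass counts as one class.
    classes_if_parent_passed = sorted(
        d for d in os.listdir(dataset_dir)
        if os.path.isdir(os.path.join(dataset_dir, d))
    )
    classes_if_data_passed = sorted(
        d for d in os.listdir(data_dir)
        if os.path.isdir(os.path.join(data_dir, d))
    )
```

Passing the parent folder yields exactly one class ("Data"), while passing "Data" itself yields no class folders at all, so no images would be found.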
The last part is the conversion from RGB to Lab.
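X = []
Y = []
for img in train[0]:  # first batch from the generator
    lab = rgb2lab(img)
    X.append(lab[:, :, 0])         # L channel: network input
    Y.append(lab[:, :, 1:] / 128)  # ab channels, scaled to about (-1, 1)
X = np.array(X)
Y = np.array(Y)
X = X.reshape(X.shape + (1,))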
By iterating over the images, we convert each one from RGB to Lab. Since the input to the network is the L channel, we put the L channel of each image in the array X and the ab channels in the array Y. We divide Y by 128 because the values of the ab channels lie roughly between -128 and 128, so after this normalization they fall between -1 and 1. We add the line X=X.reshape(X.shape+(1,)) because we want X and Y to have the same number of dimensions.
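A quick way to convince yourself about the shapes and ranges is to run the same split on stand-in data. The random arrays below are not real Lab images; they only have the right shape and value ranges.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a batch of 2 Lab images of size 256x256:
# L in [0, 100], a and b in [-128, 127).
lab_batch = np.empty((2, 256, 256, 3))
lab_batch[..., 0] = rng.uniform(0, 100, size=(2, 256, 256))
lab_batch[..., 1:] = rng.uniform(-128, 127, size=(2, 256, 256, 2))

X = lab_batch[..., 0]         # lightness only, shape (2, 256, 256)
Y = lab_batch[..., 1:] / 128  # ab scaled into [-1, 1], shape (2, 256, 256, 2)

# Add a trailing channel axis so X has the same number of dimensions as Y.
X = X.reshape(X.shape + (1,))
```

After the reshape, X is (2, 256, 256, 1) and Y is (2, 256, 256, 2), which matches the input and output shapes of the network built in the next step.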
3. Train your network
Now, you are ready to build the model.
- The encoder part consists of convolutional layers with the ReLU activation function; some of them use strides=2 to reduce the width and height of the feature maps that form the latent representation.
- The decoder part consists of convolutional layers with upsampling layers that restore the dimensions of the original input image (256×256); the last convolutional layer has 2 filters, which represent the ab channels. You may notice that this last layer uses the tanh activation function instead of ReLU. The reason is that we normalized the ab values to lie between -1 and 1, and tanh squashes its output into exactly that range.
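The difference between the two activations is easy to check with NumPy. The input values below are arbitrary and chosen only to include large positive and negative numbers.

```python
import numpy as np

# Arbitrary pre-activation values, including very large ones.
z = np.array([-50.0, -2.0, 0.0, 2.0, 50.0])

relu_out = np.maximum(z, 0.0)  # never negative, so negative ab values are unreachable
tanh_out = np.tanh(z)          # always within [-1, 1]

# Multiplying by 128 brings tanh outputs back to the ab scale.
ab_values = tanh_out * 128
```

With ReLU on the last layer, the network could never output the negative half of the a and b axes, which tanh covers.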
#Encoder
encoder_input = Input(shape=(256, 256, 1,))
encoder_output = Conv2D(64, (3,3), activation='relu', padding='same', strides=2)(encoder_input)
encoder_output = Conv2D(128, (3,3), activation='relu', padding='same')(encoder_output)
encoder_output = Conv2D(128, (3,3), activation='relu', padding='same', strides=2)(encoder_output)
encoder_output = Conv2D(256, (3,3), activation='relu', padding='same')(encoder_output)
encoder_output = Conv2D(256, (3,3), activation='relu', padding='same', strides=2)(encoder_output)
encoder_output = Conv2D(512, (3,3), activation='relu', padding='same')(encoder_output)
encoder_output = Conv2D(512, (3,3), activation='relu', padding='same')(encoder_output)
encoder_output = Conv2D(256, (3,3), activation='relu', padding='same')(encoder_output)
#Decoder
decoder_output = Conv2D(128, (3,3), activation='relu', padding='same')(encoder_output)
decoder_output = UpSampling2D((2, 2))(decoder_output)
decoder_output = Conv2D(64, (3,3), activation='relu', padding='same')(decoder_output)
decoder_output = UpSampling2D((2, 2))(decoder_output)
decoder_output = Conv2D(32, (3,3), activation='relu', padding='same')(decoder_output)
decoder_output = Conv2D(16, (3,3), activation='relu', padding='same')(decoder_output)
decoder_output = Conv2D(2, (3, 3), activation='tanh', padding='same')(decoder_output)
decoder_output = UpSampling2D((2, 2))(decoder_output)
model = Model(inputs=encoder_input, outputs=decoder_output)
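If you want to check that the decoder really restores the input resolution, you can trace the spatial size by hand. The sketch below uses the usual rules (a padding='same' convolution outputs ceil(size / stride), and upsampling multiplies the size by its factor) and follows the layer order of the model above. The helper names are my own.

```python
import math

def conv_same(size, stride=1):
    """Spatial size after a convolution with padding='same'."""
    return math.ceil(size / stride)

def upsample(size, factor=2):
    """Spatial size after an upsampling layer with the given factor."""
    return size * factor

size = 256
trace = [size]
# Encoder: the three stride-2 convolutions each halve the resolution.
for stride in (2, 1, 2, 1, 2, 1, 1, 1):
    size = conv_same(size, stride)
    trace.append(size)
bottleneck = size
# Decoder: the three upsampling layers restore the original resolution.
for op in ("conv", "up", "conv", "up", "conv", "conv", "conv", "up"):
    size = upsample(size) if op == "up" else conv_same(size)
    trace.append(size)
```

The encoder shrinks 256 down to 32, and the three upsampling steps bring it back to 256, so the 2-channel output lines up pixel for pixel with the L input.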
After constructing the model, train it using mean squared error as the loss function and Adam as the optimizer. The number of epochs is a matter of trial and error, and it also depends on the dataset: for example, a dataset of flowers may behave very differently from a dataset of animals.
model.compile(optimizer='adam', loss='mse', metrics=['accuracy'])
model.fit(X, Y, validation_split=0.2, epochs=1000)
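For reference, the 'mse' loss is just the average squared difference between the target and predicted ab values. Here is a hand-written NumPy version on a toy 2x2 example; the numbers are made up.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average of squared differences over every pixel and channel."""
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

# Toy ab targets for a 2x2 image (values already scaled to [-1, 1]).
y_true = np.array([[[0.5, -0.5], [0.0, 0.0]],
                   [[1.0, -1.0], [0.2, 0.2]]])
# A "gray" prediction: no color at all.
y_pred = np.zeros_like(y_true)

loss = mean_squared_error(y_true, y_pred)
perfect = mean_squared_error(y_true, y_true)
```

A prediction with no color is penalized most on the strongly colored pixels, and a perfect prediction gives a loss of zero.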
We preprocess the test images the same way we preprocessed the training images. It doesn’t matter whether a test image is grayscale or colored, because in both cases we extract the L channel and predict from that.
test_path = 'MyDrive/Test/'
test = os.listdir(test_path)
color_me = []
for imgName in test:
    img = img_to_array(load_img(test_path + imgName))
    img = resize(img, (256, 256))
    color_me.append(img)
color_me = np.array(color_me, dtype=float)
color_me = rgb2lab(1.0 / 255 * color_me)[:, :, :, 0]  # keep only the L channel
color_me = color_me.reshape(color_me.shape + (1,))
After predicting, we expect the values to be in the range (-1, 1), because we used the tanh function at the last layer. So we multiply the values by 128 to restore the scale of the ab channels.
output = model.predict(color_me)
output = output * 128
The last thing is to concatenate the L channel that you used for testing with your output ab channels to rebuild the Lab image, convert it to RGB, and save the result.
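# Output colorizations
for i in range(len(output)):
    result = np.zeros((256, 256, 3))
    result[:, :, 0] = color_me[i][:, :, 0]  # L channel from the input
    result[:, :, 1:] = output[i]            # predicted ab channels
    imsave("result" + str(i) + ".png", lab2rgb(result))  # file name is illustrative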
Now you know the basic idea and the main steps of the task. There is another approach that is similar to what we have done but is more efficient and gets results much faster: transfer learning.
As Jason Brownlee said in his article “A Gentle Introduction to Transfer Learning for Deep Learning”:
Transfer learning is a machine learning method where a model developed for a task is reused as the starting point for a model on a second task.
It is a popular approach in deep learning where pre-trained models are used as the starting point on computer vision and natural language processing tasks given the vast compute and time resources required to develop neural network models on these problems and from the huge jumps in skill that they provide on related problems.
You can use transfer learning to speed up training and improve the performance of your deep learning model.
Transfer learning is a machine learning technique where a model trained on one task is re-purposed on a second related task.
Transfer learning and domain adaptation refer to the situation where what has been learned in one setting … is exploited to improve generalization in another setting
— Page 526, Deep Learning, 2016.
Transfer learning is an optimization that allows rapid progress or improved performance when modeling the second task.
Transfer learning is the improvement of learning in a new task through the transfer of knowledge from a related task that has already been learned.
— Chapter 11: Transfer Learning, Handbook of Research on Machine Learning Applications, 2009.
Transfer learning is related to problems such as multi-task learning and concept drift and is not exclusively an area of study for deep learning.
Nevertheless, transfer learning is popular in deep learning given the enormous resources required to train deep learning models or the large and challenging datasets on which deep learning models are trained.
Transfer learning only works in deep learning if the model features learned from the first task are general.
In transfer learning, we first train a base network on a base dataset and task, and then we repurpose the learned features, or transfer them, to a second target network to be trained on a target dataset and task. This process will tend to work if the features are general, meaning suitable to both base and target tasks, instead of specific to the base task.
— How transferable are features in deep neural networks?
This form of transfer learning used in deep learning is called inductive transfer. This is where the scope of possible models (model bias) is narrowed in a beneficial way by using a model fit on a different but related task.
Andrew Ng, co-founder of Coursera and Adjunct Professor of Computer Science at Stanford University, said during his widely popular NIPS 2016 tutorial that transfer learning will be, after supervised learning, the next driver of ML commercial success.
In particular, he sketched out a chart on a whiteboard, shown in the figure below. According to Andrew Ng, transfer learning will become a key driver of machine learning success in industry.
In this section, the work is done on a Microsoft Azure virtual machine instead of Google Colab, because I wanted to increase the size of my dataset, and Google Colab crashes the session once it uses 12 GB of RAM. Microsoft Azure provides 56 GB of RAM. If you are a student, you can use your college email to get $100 of credit for Microsoft Azure VMs; with $100 you can get up to 160 hours of use on your VM. Make sure to choose a Data Science Virtual Machine, because these come with pre-configured environments in the cloud for data science and AI development. It uses an Nvidia Tesla K80 GPU for training.
Continuing from the previous section, we are going to change the model architecture and use transfer learning instead of training the network from scratch: we will use the pre-trained VGG16 model as the encoder. VGG16 (also called OxfordNet) is a convolutional neural network architecture named after the Visual Geometry Group at Oxford, who developed it. It took first place in localization and second place in classification at the ILSVRC (ImageNet) competition in 2014, and it is still considered an excellent vision model. It was trained to classify images into the 1000 classes of the ImageNet dataset.
The figure below illustrates the architecture of VGG16: the input layer takes an image of size 224 x 224 x 3, and the output layer is a softmax prediction over 1000 classes. From the input layer to the last max-pooling layer (labeled 7 x 7 x 512) is regarded as the feature extraction part of the model, while the rest of the network is regarded as the classification part of the model.
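The 7 x 7 x 512 volume follows directly from the architecture: VGG16 has five convolutional blocks, each ending in a 2x2 max-pooling layer that halves the height and width. A small sketch of that arithmetic, with no deep learning library needed:

```python
# Output channels of the five convolutional blocks in VGG16.
block_channels = [64, 128, 256, 512, 512]

height = width = 224
shapes = []
for channels in block_channels:
    # The 3x3 convolutions in each block keep the spatial size;
    # the 2x2 max-pooling layer that ends the block halves it.
    height, width = height // 2, width // 2
    shapes.append((height, width, channels))

feature_volume = shapes[-1]
```

Five halvings take 224 down to 7, which is where the 7 x 7 x 512 label on the last max-pooling layer comes from.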
Because we are replacing the encoder with VGG16, we need it as a feature extractor, not as a classifier, so the last dense layers aren’t needed and we have to remove them.
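vggmodel = keras.applications.vgg16.VGG16()
newmodel = Sequential()
num = 0
for i, layer in enumerate(vggmodel.layers):
    if i < 19:  # keep everything up to the last max-pooling layer
        newmodel.add(layer)
for layer in newmodel.layers:
    layer.trainable = False  # keep the pre-trained weights fixed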
Here we iterate over the layers and skip the last dense layers, so we add 19 layers to our model. The output volume of the last layer is 7x7x512. We use that latent space volume as the feature vector that feeds the decoder, and the decoder learns the mapping from the latent space to the ab channels. We want to keep the layers of VGG16 with their original weights, so we set the trainable parameter of each layer to False because we don’t want to train them again. The final architecture is shown in the figure below.
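To show what freezing means in practice, here is a toy NumPy stand-in, not the actual VGG16 setup: a fixed random matrix plays the role of the pre-trained encoder, and only the “decoder” weights are updated by gradient descent. All sizes, the synthetic targets, and the learning rate are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Pre-trained" encoder weights: kept fixed, like layers with trainable=False.
W_frozen = 0.5 * rng.normal(size=(4, 3))
W_frozen_before = W_frozen.copy()
# Decoder weights: the only parameters that get updated.
W_decoder = np.zeros((3, 2))

x = rng.normal(size=(64, 4))
features = np.maximum(x @ W_frozen, 0.0)      # frozen feature extractor (ReLU)
targets = features @ rng.normal(size=(3, 2))  # synthetic targets

def mse(pred, true):
    return float(np.mean((pred - true) ** 2))

initial_loss = mse(features @ W_decoder, targets)
learning_rate = 0.05
for _ in range(200):
    error = features @ W_decoder - targets
    grad = 2.0 * features.T @ error / error.size  # gradient w.r.t. W_decoder only
    W_decoder -= learning_rate * grad             # W_frozen is never touched
final_loss = mse(features @ W_decoder, targets)
```

The loss drops even though the encoder never changes, which is the whole point: the decoder learns on top of features that someone else already paid to train.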