On performing a quick Google search ( like “age estimation keras” ), we come across several GitHub repositories which demonstrate age and gender estimation models in Keras.
Our problem statement is,
To implement two models on Android: one for age estimation and another for gender classification. As these models are to be deployed on an Android smartphone ( with relatively low computational power ), we needed models with fewer parameters, resulting in a lower inference time*. The models should also produce satisfactory results on the UTKFace dataset, thus ensuring better generalization.
* : We’ll make use of the term “inference time” frequently in this story. Informally, it means the time taken by our model to perform a single inference ( i.e. to execute a single forward pass ).
This motivated us to train a custom NN architecture, as the existing ones ( implemented in the repositories mentioned above ) were larger, both in terms of the no. of parameters and the file size, which conflicted with our problem statement.
Another approach which most developers might suggest would be to perform Transfer Learning, which has been widely adopted by the machine learning community. In our case, this option had its own constraints, as attaching a different architecture ( as a backbone for our model ) like InceptionV3 ( Christian Szegedy et al, 2015 ), ResNet ( Kaiming He et al, 2015 ) or MobileNets ( Andrew G. Howard et al, 2017 ) would lead to a significant increase in the no. of parameters of the model. Another arguable point could be to use a *frozen backbone model; although this reduces the no. of trainable parameters during training, the total no. of parameters of the model ( as a whole ) remains unchanged. This won’t reduce the inference time or file size of the model, as the frozen parameters are still used to make a prediction.
*: Same as setting trainable=False on the backbone’s layers in Keras.
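For instance, freezing a Keras model ( a hypothetical MobileNet backbone here, used purely for illustration ) is a one-liner:

```python
import tensorflow as tf

# A hypothetical backbone, used only to illustrate freezing.
backbone = tf.keras.applications.MobileNet(include_top=False, weights="imagenet")
backbone.trainable = False  # weights stay fixed during training,
                            # but are still used in every forward pass
```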
Interestingly, we use the same architecture for both age and gender estimation. The only difference is the output layer of each NN: the age estimation NN produces a continuous output ( in ( 0 , 1 ] ), whereas the gender classification NN outputs a probability distribution.
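As a minimal sketch of that difference ( the feature shape, the sigmoid head and the normalization of ages into ( 0 , 1 ] are assumptions, not the repo’s exact code ):

```python
import tensorflow as tf

# Shared features from the common backbone ( the shape is an assumption ).
x = tf.keras.layers.Input(shape=(64,))

# Age estimation head: a single sigmoid unit, assuming age labels
# were normalized into ( 0 , 1 ] beforehand.
age_output = tf.keras.layers.Dense(1, activation="sigmoid")(x)

# Gender classification head: a probability distribution over two labels.
gender_output = tf.keras.layers.Dense(2, activation="softmax")(x)
```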
Here’s a high-level overview of what our model looks like:
The “Convolutional Layers” in the diagram depict a set of convolution blocks with the structure Conv2D -> BatchNorm -> LeakyReLU, which is described in the next section.
* The Convolutional Layers
As discussed above, the convolutional layers actually consist of blocks, where each block has the structure Conv2D -> BatchNorm -> LeakyReLU ( a sketch of such a block follows this list ). The no. of blocks to be included in the model is determined by the num_blocks variable.
- Another feature of our model is that we provide two versions of it: one which uses vanilla ( standard ) convolutions and another which uses separable convolutions. We refer to the model using separable convolutions as the “lite” model in the README of the GitHub repo.
- Whether we train a “lite” model or a “vanilla” model ( the one which uses standard convolutions ) is determined by a boolean flag in the code.
- As observed, the tf.keras.layers.BatchNormalization layer allows us to use Batch Normalization ( Sergey Ioffe et al., 2015 ), a technique which normalizes incoming signals ( from the convolutional layer, in our case ) so as to reduce internal covariate shift. Batch Normalization has been widely adopted in the ML community, as it enables the use of larger learning rates and also regularizes the model.
- It has been adopted in other popular architectures as well, such as MobileNets and DenseNets ( Gao Huang et al, 2016 ).
- As a side note, we set use_bias=False in the convolutional layers ( both standard and separable convolutions ) as the bias has no significance because of Batch Normalization [ 1 ].
- We add L2 weight regularization, which helps reduce overfitting by directly penalizing the parameters of the convolutional layer ( i.e. the filters of the convolutional layer ). The weight decay constant ( 1e-5 ) was taken from [ 2 ].
- LeakyReLU ( Bing Xu et al, 2015 ) is a variant of the ReLU ( Rectified Linear Unit ) activation function which returns x * alpha for inputs x < 0. Setting alpha to 0 gives a standard ReLU function. It helps solve the dying-ReLU problem as described in [ 3 ].
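Putting the points above together, here’s a minimal sketch of such a block ( the name conv_block matches the method referenced later in this story; the exact signature, strides and LeakyReLU slope are assumptions ):

```python
import tensorflow as tf

def conv_block(x, num_filters, kernel_size, lite=False, alpha=0.2):
    # Conv2D -> BatchNorm -> LeakyReLU; SeparableConv2D for the "lite" model.
    # use_bias=False since BatchNorm makes the bias redundant [ 1 ],
    # and an L2 weight decay of 1e-5 penalizes the filters [ 2 ].
    if lite:
        x = tf.keras.layers.SeparableConv2D(
            num_filters,
            kernel_size,
            use_bias=False,
            depthwise_regularizer=tf.keras.regularizers.l2(1e-5),
            pointwise_regularizer=tf.keras.regularizers.l2(1e-5),
        )(x)
    else:
        x = tf.keras.layers.Conv2D(
            num_filters,
            kernel_size,
            use_bias=False,
            kernel_regularizer=tf.keras.regularizers.l2(1e-5),
        )(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.LeakyReLU(alpha=alpha)(x)
    return x
```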
This completes the discussion on the convolutional layers of our model. We’ll now discuss the Dense layers and the model compilation.
* Dense Layers and Compiling the model
Similar to conv_block, we implement a dense() method which creates a set of layers with the structure Dense -> LeakyReLU -> Dropout.
- We add Dropout ( Nitish Srivastava et al, 2014 ) following every Dense layer in our model. Dropout is a regularization technique which randomly sets activations to 0 with a certain probability; it helps reduce interdependent learning among neurons.
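A minimal sketch of this block ( the method name dense matches the one mentioned above; the units and dropout rate shown are assumptions ):

```python
import tensorflow as tf

def dense(x, units, alpha=0.2, dropout_rate=0.2):
    # Dense -> LeakyReLU -> Dropout, as described above.
    x = tf.keras.layers.Dense(units)(x)
    x = tf.keras.layers.LeakyReLU(alpha=alpha)(x)
    x = tf.keras.layers.Dropout(rate=dropout_rate)(x)  # randomly zeroes activations
    return x
```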
Once we’ve constructed the conv_block as mentioned in Snippet 3, we’re ready to stack these sets of layers end-to-end. The no. of blocks is determined by the num_blocks argument. The no. of filters in each convolutional layer ( of each block ) is retrieved from the num_filters array. The same goes for the kernel size, as retrieved from the kernel_sizes array. The variable conv_output holds the output of the convolutional layers, which is later passed to the Dense layers.
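A sketch of this assembly, reusing the conv_block and dense sketches above ( the values of num_filters and kernel_sizes, the input shape and the pooling between blocks are all assumptions ):

```python
import tensorflow as tf

num_blocks = 3
num_filters = [32, 64, 128]  # example values, one per block
kernel_sizes = [3, 3, 3]     # example values, one per block

inputs = tf.keras.layers.Input(shape=(128, 128, 3))  # assumed input size
x = inputs
for i in range(num_blocks):
    x = conv_block(x, num_filters[i], kernel_sizes[i], lite=False)
    x = tf.keras.layers.MaxPooling2D()(x)  # assumption: downsampling between blocks

conv_output = tf.keras.layers.Flatten()(x)  # passed on to the Dense layers
x = dense(conv_output, 64)
```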
Snippet 5 shows the outputs for the age estimation model.
Snippet 6 shows the outputs for the gender classification model. The softmax activation function outputs a probability distribution over the two labels ( male and female ).
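For context, here’s a hypothetical post-processing of the two outputs, using hard-coded arrays in place of real predictions ( MAX_AGE is an assumption about how the age labels were normalized ):

```python
import numpy as np

MAX_AGE = 116  # assumption: age labels were normalized by the dataset's maximum age

age_pred = np.array([[0.22]])           # sketch of an age model output, in ( 0 , 1 ]
print(age_pred[0, 0] * MAX_AGE)         # ~25.5 years, undoing the normalization

gender_pred = np.array([[0.85, 0.15]])  # sketch of a softmax output, sums to 1
print(np.argmax(gender_pred[0]))        # 0 -> the more probable of the two labels
```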
We’ve defined the architectures for both models; we now head towards training them.