Now the idea is to implement the FOTS unified network for text detection. For this, we need 2 models, one for detecting the text area (bounding box) and another for recognizing the text, and an intermediate layer between the 2 models to straighten the angled bounding boxes.
Here we are going to use 2 datasets (or more). Ideally, we use the SynthText dataset for training and the ICDAR dataset to fine-tune our model.
What we have: 2 models, 1 intermediate layer, and 2 datasets. Let's start.
We have 2 datasets in different formats, so we have to bring them into a single format for convenience.
I have converted the SynthText dataset to ICDAR format since that format is simpler.
Data format: x1,y1,x2,y2,x3,y3,x4,y4,<>word in block<>
Note: Coordinates of the bounding boxes run from the top-left corner, clockwise.
Please check my EDA notebook for a detailed dataset explanation.
Now that we have all the data in a single format, let's start with the actual preprocessing.
We will have to augment our images for the model to work better. The steps, sketched in code after this list, are:
- Rotate by a random angle between -10 and 10 degrees
- Rescale the height by a random ratio between 0.8 and 1.2
- Crop random 600×600 samples from the image
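Below is a minimal sketch of these augmentations with OpenCV/NumPy. It is a simplification: the ground-truth polygons must be transformed with the same parameters, which is omitted here.

```python
import numpy as np
import cv2

def augment(image):
    """Sketch of the three augmentation steps (box-coordinate updates omitted)."""
    h, w = image.shape[:2]

    # 1. Rotate by a random angle in [-10, 10] degrees.
    angle = np.random.uniform(-10, 10)
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    image = cv2.warpAffine(image, M, (w, h))

    # 2. Rescale the height by a random ratio in [0.8, 1.2].
    scale = np.random.uniform(0.8, 1.2)
    image = cv2.resize(image, (w, int(h * scale)))

    # 3. Take a random 600x600 crop.
    h, w = image.shape[:2]
    y0 = np.random.randint(0, max(1, h - 600))
    x0 = np.random.randint(0, max(1, w - 600))
    return image[y0:y0 + 600, x0:x0 + 600]
```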
Once the augmented data is produced, we initialize the ground truth for the models. We have 2 models, 1 for detection and the other for recognition.
For detection, we have to find the bounding boxes of text, so we need:
- Score maps: An image where each pixel has a value of 0 or 1; 1 indicates that the pixel is part of a bounding box and 0 indicates the opposite.
- Geo maps: An image with 5 channels (top, bottom, right, left, angle). For each pixel inside a box, the first 4 channels store its distance to the corresponding edge of that box, and the 5th stores the box orientation.
- Training mask: This marks which bounding boxes should be considered while training. We neglect boxes with an area less than the mentioned threshold value (the threshold can be set by us). A sketch of this generation follows below.
For ROIRotate, we have the coordinates of the boxes and the angle at which they are aligned.
For Recognition, we have the final text present in the bounding box.
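For the detection targets, here is a rough sketch of generating the score map and training mask with OpenCV. The function name and min_area parameter are hypothetical, and the geo map (per-pixel edge distances and angle) is omitted for brevity.

```python
import numpy as np
import cv2

def make_score_map_and_mask(image_shape, polys, min_area=100):
    """Rasterize quads into a score map; mask out boxes below min_area."""
    h, w = image_shape[:2]
    score_map = np.zeros((h, w), dtype=np.float32)
    training_mask = np.ones((h, w), dtype=np.float32)

    for poly in polys:  # poly: (4, 2) array of corner coordinates
        quad = poly.astype(np.int32)
        if cv2.contourArea(quad) < min_area:
            # Too small: exclude this region from the loss entirely.
            cv2.fillPoly(training_mask, [quad], 0)
            continue
        cv2.fillPoly(score_map, [quad], 1)

    return score_map, training_mask
```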
Generators are used to fetch the data, since image data occupies a lot of memory.
input_data = [batch_images, batch_text_polyses, batch_text_labels, batch_boxes_masks, transform_matrixes, box_widths, batch_rboxes]
output = [batch_image_fns, batch_score_maps, batch_geo_maps, batch_training_masks, batch_text_labels_sparse, box_widths]
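A simplified generator sketch, yielding only the detection targets rather than the full tuples listed above; load_sample and make_targets are hypothetical helpers standing in for the loading and ground-truth code.

```python
import numpy as np

def data_generator(image_paths, batch_size=8):
    """Yield one batch at a time so the full dataset never sits in memory."""
    while True:
        for start in range(0, len(image_paths), batch_size):
            images, scores, geos, masks = [], [], [], []
            for path in image_paths[start:start + batch_size]:
                image, polys, words = load_sample(path)        # hypothetical loader
                score, geo, mask = make_targets(image, polys)  # hypothetical GT builder
                images.append(image)
                scores.append(score)
                geos.append(geo)
                masks.append(mask)
            yield np.array(images), (np.array(scores), np.array(geos), np.array(masks))
```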
The whole model building process is divided into 4 parts.
1.Shared Convolution: Here we take the image and pass it through a pre-trained ResNet-50, followed by deconvolutional and upsampling layers, such that the final output is 1/4th the size of the original image. The final convolutions are called shared convolutions, and they are feature representations of the image.
As shown in the sketch below, we can consider any pre-trained network, but ResNet-50 is preferred; for real-time latency, a shallower ResNet can be used.
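Here is a minimal tf.keras sketch of the idea (the tapped ResNet-50 layer names and filter counts are assumptions, not the exact repo code): the backbone is tapped at strides 32/16/8/4 and upsampled back to 1/4 resolution.

```python
import tensorflow as tf

def build_shared_convolutions(input_shape=(512, 512, 3)):
    """ResNet-50 backbone with upsample-and-merge stages, ending at 1/4 scale."""
    backbone = tf.keras.applications.ResNet50(
        include_top=False, weights="imagenet", input_shape=input_shape)
    # Feature maps at strides 32, 16, 8 and 4.
    c5 = backbone.get_layer("conv5_block3_out").output
    c4 = backbone.get_layer("conv4_block6_out").output
    c3 = backbone.get_layer("conv3_block4_out").output
    c2 = backbone.get_layer("conv2_block3_out").output

    x = c5
    for skip, filters in [(c4, 128), (c3, 64), (c2, 32)]:
        x = tf.keras.layers.UpSampling2D(2)(x)        # upsample stage
        x = tf.keras.layers.Concatenate()([x, skip])  # merge with skip feature
        x = tf.keras.layers.Conv2D(filters, 1, activation="relu")(x)
        x = tf.keras.layers.Conv2D(filters, 3, padding="same", activation="relu")(x)

    # Output stride 4, i.e. 1/4th of the input image size.
    return tf.keras.Model(backbone.input, x, name="shared_convolutions")
```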
2.Detection: A custom model is created with 3 layers which are trained separately and concatenated. The text detection layer outputs a dense per-pixel prediction of text using features produced by shared convolutions.
- F_score: Predicts the probability of the pixel being positive (a part of the bounding box).
- geo_map: Each pixel predicts the distance from the top, bottom, right, left and hence it has 4 filters.
We multiply this layer by 512 (the image dimension) so that the pixel distances are scaled according to the size of the image.
- angle_map: Predicts the orientation of the related bounding box.
Similarly, we multiply the angle map by pi/2 so that the angles are mapped with reference to the vertical line.
Finally, the geo_map and angle_map are concatenated to get an output with 5 channels, as in the sketch below.
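A sketch of this branch in tf.keras, following the scalings just described (the 1×1 convolutions and sigmoid activations are assumptions about the exact layer shapes):

```python
import numpy as np
import tensorflow as tf

def build_detection_head(shared_features):
    """Per-pixel score, edge distances, and angle from shared features."""
    # Probability of each pixel being inside a text box.
    f_score = tf.keras.layers.Conv2D(
        1, 1, activation="sigmoid", name="f_score")(shared_features)
    # Distances to the top/bottom/right/left edges, scaled by the image size.
    geo_map = 512 * tf.keras.layers.Conv2D(
        4, 1, activation="sigmoid")(shared_features)
    # Box orientation, scaled by pi/2 relative to the vertical.
    angle_map = (np.pi / 2) * tf.keras.layers.Conv2D(
        1, 1, activation="sigmoid")(shared_features)
    # 5-channel geometry output: 4 distances + 1 angle.
    geometry = tf.keras.layers.Concatenate(name="geometry")([geo_map, angle_map])
    return f_score, geometry
```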
Once we have the boxes, we use locally aware NMS (Non-max suppression) to get the final boxes that have the highest probability.
3.ROIRotate: Now, the boxes obtained from our detect branch may be inclined and some may not. So in order to bring all boxes to the same level, i.e. inclined at 0 degrees, we use ROIRotate.
The paper suggests implementing this with the help of affine transformation, but I have used tf.image.crop_to_bounding_box to crop out the boxes and tfa.image.rotate to align all the boxes.
While training, we use the ground-truth boxes, which later will be used to train the recognition model. But in the inference phase, we get the boxes from the predictions of the detect branch (with NMS) and align those boxes.
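A minimal sketch of this crop-and-rotate variant; the box layout (y, x, height, width) and the fixed output height are assumptions:

```python
import tensorflow as tf
import tensorflow_addons as tfa

def roi_rotate(image, box, angle, target_height=8):
    """Crop the axis-aligned rectangle enclosing a box, then straighten it.
    box = (y, x, height, width) as Python ints; angle is the tilt in radians."""
    y, x, h, w = box
    crop = tf.image.crop_to_bounding_box(image, y, x, h, w)
    # Rotate by the negative angle so the text ends up at 0 degrees.
    straight = tfa.image.rotate(crop, -angle, interpolation="bilinear")
    # Resize to a fixed height so crops of any width can feed the recognizer.
    new_width = int(round(w * target_height / h))
    return tf.image.resize(straight, (target_height, new_width))
```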
4.Recognition: Once the boxes are obtained, we create another custom model which should be able to recognize the text in the box. For this, we use a neural network as mentioned below.
The final fully connected layer serves as the input layer to the CTC(Connectionist Temporal Classification) decoder. The whole network is known as CRNN (Convolutional Recurrent Neural Network).
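A rough CRNN sketch in tf.keras (the charset size, filter counts, and pooling scheme are assumptions): convolutions extract features, the height axis is collapsed so each column becomes a time step, a bidirectional LSTM models the sequence, and the final dense layer emits the per-step logits consumed by CTC.

```python
import tensorflow as tf

NUM_CLASSES = 63  # assumed charset size + 1 for the CTC blank

def build_recognition_branch(height=8, channels=32):
    """CRNN: conv stack -> per-column sequence -> BiLSTM -> CTC logits."""
    inp = tf.keras.Input(shape=(height, None, channels))
    x = tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu")(inp)
    x = tf.keras.layers.MaxPool2D(pool_size=(2, 1))(x)  # shrink height only
    x = tf.keras.layers.Conv2D(128, 3, padding="same", activation="relu")(x)
    x = tf.keras.layers.MaxPool2D(pool_size=(2, 1))(x)
    # Collapse the height axis so each column becomes one time step.
    x = tf.keras.layers.Lambda(lambda t: tf.reduce_max(t, axis=1))(x)
    x = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(128, return_sequences=True))(x)
    # Per-time-step class logits, fed to the CTC loss/decoder.
    logits = tf.keras.layers.Dense(NUM_CLASSES)(x)
    return tf.keras.Model(inp, logits, name="recognition")
```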
In practice, shared convolutions and detection are trained as 1 part and recognition is trained as another.
The detection loss includes the balanced cross-entropy (as per the paper), which is also known as the classification loss.
Detection loss (Classification component):
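Written out in the balanced cross-entropy form from EAST (on which FOTS builds), with beta down-weighting the dominant negative pixels:

$$L_{cls} = -\beta\, y^{*} \log \hat{y} \;-\; (1-\beta)\,(1-y^{*})\log(1-\hat{y}), \qquad \beta = 1 - \frac{\sum_{i} y_i^{*}}{|Y|}$$

where y* is the ground-truth score map and ŷ the predicted one, averaged over the sampled pixels.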
The other component of detection loss considers IoU between predicted and ground-truth bounding boxes as well as the rotation of the predicted bounding boxes.
Detection loss (regression loss):
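As given in the FOTS paper (using EAST's -log IoU form for the box term), averaged over the positive pixels Ω:

$$L_{reg} = \frac{1}{|\Omega|} \sum_{x \in \Omega} \Big[ -\log \mathrm{IoU}(R_x, R_x^{*}) \;+\; \lambda_{\theta}\,\big(1 - \cos(\theta_x - \theta_x^{*})\big) \Big]$$

where R_x and R*_x are the predicted and ground-truth boxes at pixel x, and theta the corresponding angles.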
The total detection loss is composed of the above losses:
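In symbols:

$$L_{detect} = L_{cls} + \lambda_{reg}\, L_{reg}$$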
- The lambda_reg is a hyperparameter and set to 20 in this implementation.
CTC loss is used as the recognition loss. The total FOTS loss is composed of the detection loss plus the recognition loss:
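$$L = L_{detect} + \lambda_{recog}\, L_{recog}$$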
- The lambda_recog is also a hyperparameter and is set to 1 in this implementation.
While training, we train both models individually and check their respective results. Based on the performance, we can change the weight of each loss to improve the respective model.
Due to hardware limitations, I couldn't train the model to the full extent. I have trained it on 4,000 images from the SynthText dataset and used Tesseract instead of the recognition model, since training text recognition models requires high-end hardware.
To get good results:
Train your models on SynthText and then fine-tune them on the ICDAR dataset.
Train the detect model to a loss of 0.02 or less, and the recognition model to a loss of 0.5 or less.
Use tf.GradientTape to have fine-grained control over training the models; a sketch follows.
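A minimal custom training step with tf.GradientTape; detection_loss is a hypothetical function wrapping the losses above.

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)

@tf.function
def train_step(model, images, score_maps, geo_maps, training_masks):
    """One update of the detect branch with explicit gradient control."""
    with tf.GradientTape() as tape:
        f_score, geometry = model(images, training=True)
        loss = detection_loss(score_maps, f_score,          # hypothetical loss fn
                              geo_maps, geometry, training_masks)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```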
Please visit my GitHub repo for the code and feel free to contact me on my LinkedIn for further discussion.