Capstone Project - Text Recognition Dataset

Some of the images provided as training data for text detection and recognition are not suitable for training. For recognition, we need to crop each word and align it horizontally according to its angle, which gives around 600k images in the recognition training data. In many of these images the text is not even visible, and after cropping they become even less clear. If this repo was used to generate the training images, it will only produce more images of the same kind. The test dataset images, on the other hand, look good: every image is clear and the words are legible, because they are real images and not synthetic.
Please let me know if I’m wrong. I’m attaching a sample image in which no word is visible, even though the annotations list 2 words for it. @GokulNC @Ishvinder


In fact, when I use the test data as the training data, train for 1000 epochs, and use the original training data as my test data, I don’t get a high validation accuracy, but I do get an extremely high test accuracy of ~98%.

Maybe only a few images are like this. Can you please point out the image name?

Also, please tell us the corresponding 4 points for all the words in the image.
If possible, please plot the quadrilateral of the words and post the image here.

The image path is:- Text Recognition/Train/Image/2/1.jpg
Annotations are:-
544.0747 563.01514 565.32263 546.3822 274.24966 272.74448 301.78024 303.28543 तक
1.5635462 28.617405 31.684746 4.630887 179.54037 176.55576 204.3595 207.34412 रख
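
For anyone who wants to check these boxes themselves, here is a minimal sketch that parses one annotation line and estimates the tilt of the word. It assumes each line is laid out as four x-coordinates, then four y-coordinates, then the word; that layout is a guess from the numbers above and is not confirmed anywhere in this thread. The function names are hypothetical.

```python
import math

def parse_annotation(line):
    """Split one annotation line into four (x, y) corners and the word.

    Assumed layout (not confirmed in the thread):
    x1 x2 x3 x4 y1 y2 y3 y4 text
    """
    parts = line.split()
    coords = [float(v) for v in parts[:8]]
    text = " ".join(parts[8:])
    xs, ys = coords[:4], coords[4:]
    return list(zip(xs, ys)), text

def top_edge_angle_degrees(points):
    """Angle of the edge from corner 0 to corner 1, relative to the x-axis.

    Image coordinates have y pointing down, so a negative angle means the
    edge rises towards the right of the image.
    """
    (x0, y0), (x1, y1) = points[0], points[1]
    return math.degrees(math.atan2(y1 - y0, x1 - x0))

line = ("544.0747 563.01514 565.32263 546.3822 "
        "274.24966 272.74448 301.78024 303.28543 तक")
points, text = parse_annotation(line)
angle = top_edge_angle_degrees(points)
print(text, points, round(angle, 2))
```

The returned corner list can be passed to any plotting or cropping routine to draw the quadrilateral or rotate the crop to horizontal.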

You cannot see anything in these images because the text is too small compared to the image size and the background noise is too high. There are many such images in the dataset, where the text is too small, the background noise is too high, or there is no contrast between the text and the background at that spot.

We cannot handpick and remove such images, as there are close to 600k images in the train set if we crop out texts from all images.

Images like those in the test set are good for training, but very few of those are available to improve accuracy.

Okay… I have not checked this dataset manually. Yes, hand-picking is not possible.
One thing you could do is ignore all the images with boxes less than a minimum size (say minimum_avg_width=40 and minimum_avg_height=30), and work with the dataset after removing those images.

Please let us know if you do so, and how many images get ignored if the above thresholding is applied.
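
A rough sketch of this thresholding, applied per word box. It assumes the corners are ordered around the quadrilateral (so corner 0 to corner 1 is a horizontal side) and that each annotation line is laid out as four x-values followed by four y-values; both are assumptions, and the helper names are hypothetical.

```python
import math

MIN_AVG_WIDTH = 40   # thresholds suggested above
MIN_AVG_HEIGHT = 30

def side(p, q):
    """Euclidean length of the side between two corners."""
    return math.hypot(q[0] - p[0], q[1] - p[1])

def avg_width_height(points):
    """Average the two horizontal sides and the two vertical sides."""
    p0, p1, p2, p3 = points
    width = (side(p0, p1) + side(p3, p2)) / 2
    height = (side(p1, p2) + side(p0, p3)) / 2
    return width, height

def keep_box(points):
    """True if the box meets both minimum-size thresholds."""
    width, height = avg_width_height(points)
    return width >= MIN_AVG_WIDTH and height >= MIN_AVG_HEIGHT

def corners(values):
    """Assumed layout: x1 x2 x3 x4 y1 y2 y3 y4."""
    return list(zip(values[:4], values[4:]))

# The two boxes posted earlier in this thread.
box_a = corners([544.0747, 563.01514, 565.32263, 546.3822,
                 274.24966, 272.74448, 301.78024, 303.28543])
box_b = corners([1.5635462, 28.617405, 31.684746, 4.630887,
                 179.54037, 176.55576, 204.3595, 207.34412])

boxes = [box_a, box_b]
kept = [b for b in boxes if keep_box(b)]
print(f"kept {len(kept)} of {len(boxes)} boxes")
```

Under these assumptions, both boxes from the sample image fall below the thresholds and would be ignored. Counting `len(kept)` over the whole train set gives the number requested above.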

Using this threshold left me with around 370k images (earlier the number was 670k). The images look better now and the text is visible, so I will try training on these. But the problems of noise, contrast and unusual fonts remain. Here are a few example images that show some of those problems @GokulNC

1_48_3 1_63_0 1_65_1 1_73_3


The results did not improve drastically. With the original train data it was 85% validation and 33% test. With the thresholded train data it was 86% validation and 41% test.

When I use the test set as my training data and the original train set as my test set (just swapping the datasets), I get ~50% validation and 98% test! Does this mean the test set has better images for training and we just need more of them to improve validation accuracy? Do we have more such images? Let me know if I’m wrong.

Any views on this @GokulNC?