Doubts in Text Recognition CRNN: Capstone Project

I have some doubts regarding the implementation of the “Text Recognition” module of this project. I’m not sure I’ve understood it completely.

This is what I’ve understood:-

  • Pass the image through a CNN (without the final average pool and fully connected layers)
  • Upscale the feature volume obtained to the input dimensions
  • Slice the volume and pass the slices through an RNN
  • Predict on the slices to get a sequence of characters
  • Use CTC loss on the predicted sequence of characters to get the appropriate output/word

Please confirm whether I’ve understood this correctly or not.

Things I have doubts in:-

  • How do I upscale the feature cube obtained at the end of the CNN? Won’t upscaling tamper with the information stored in the filters?
  • How do I implement CTC loss on top of the RNN’s predicted output to get the appropriate sequence from the predictions?

Generally, for a text recognition network, you at least fix the height of the input image; the width can be variable (say, Wx32).
Now, we define the CNN layers such that the final layer gives us an output of dimension W’xH’xC’, for example W’x4x512. This is the feature output from the CNN, which we will send to an RNN with W’ time steps.

That is, for time step 1, you pass the first 4x512 chunk to the RNN after flattening it to 2048; for the 2nd time step, the next 4x512 chunk; and so on for W’ time steps. Hence your RNN takes an input vector of dimension 2048.
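A minimal PyTorch sketch of this reshaping step (the tensor sizes follow the W’x4x512 example above; the layer sizes are illustrative, not prescribed):

```python
import torch
import torch.nn as nn

# Assume `features` is the CNN output for a batch, shaped (N, C', H', W'),
# e.g. (N, 512, 4, W') as in the W' x 4 x 512 example above.
features = torch.randn(8, 512, 4, 100)    # dummy batch: N=8, W'=100

N, C, H, W = features.shape
# Make width the time axis and flatten each 4x512 chunk to a 2048-dim vector:
# result is (W', N, H' * C')
seq = features.permute(3, 0, 2, 1).reshape(W, N, H * C)

# A bidirectional LSTM over W' time steps, each step seeing a 2048-dim input.
rnn = nn.LSTM(input_size=H * C, hidden_size=256, bidirectional=True)
out, _ = rnn(seq)                         # (W', N, 512)

# Per-time-step classification over the character set (+1 for the CTC blank).
num_classes = 129
classifier = nn.Linear(2 * 256, num_classes)
logits = classifier(out)                  # (W', N, 129)
```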

You can now train your model, say using CTC loss. If you’re using PyTorch, it already supports CTC loss out of the box; just ensure your ground truth and predictions are structured the right way before using it. If you want to implement CTC from scratch, I’d suggest first reading up on how CTC loss works before proceeding; it can be tedious to implement yourself.
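For reference, a rough sketch of how PyTorch’s built-in `nn.CTCLoss` expects its inputs to be structured (the shapes continue the sketch above; the dummy targets and the choice of 0 as the blank index are assumptions for illustration):

```python
import torch
import torch.nn as nn

T, N, C = 100, 8, 129                     # time steps (W'), batch, classes
logits = torch.randn(T, N, C, requires_grad=True)  # stand-in for RNN output

# nn.CTCLoss expects log-probabilities of shape (T, N, C).
log_probs = logits.log_softmax(dim=2)

# Ground truth: a flat tensor of label indices (no blanks), plus per-sample
# lengths. Here every dummy sample has a 6-character label.
targets = torch.randint(1, C, (N * 6,))            # indices 1..128; 0 = blank
target_lengths = torch.full((N,), 6, dtype=torch.long)
input_lengths = torch.full((N,), T, dtype=torch.long)  # all W' steps valid

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```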

For more intricate details on the implementation, please look into the CRNN paper and possibly some PyTorch code for reference.


Where can I find a pre-trained CRNN network for the project?

Thank you for the prompt and in-depth reply. I think I know what to do now. I’ll give it a shot and get back to you if any problems arise.

I have one doubt. Let’s say my CNN gives an output of shape 512x32x32, so I’ll be doing 32 time steps in the RNN and thus making 32 character predictions. But what if my ground truth is longer than 32 (longer than W’)? What do we do in such cases, or how do we avoid this?

Also, at the RNN end we would get 32 outputs, i.e. a 32x129 tensor. How do I compare it to the ground truth? The ground truth could be of any length (say 10). So how do I compare those 10 ground-truth character indices to the 32x129 tensor from the RNN? Or is this taken care of by the CTC loss? In other models, a prediction of length 5 is compared against a ground truth that is also of length 5, but here the lengths differ. How does this work?

So I have tried implementing the model on the test set provided in the capstone for recognition.
The problem I’m having is that the loss doesn’t seem to decrease. Could you have a look, @GokulNC?
Link to the notebook:-
https://colab.research.google.com/drive/1Ghlz8qZj1-D9BEbo9lspgmzq0GA8Ekb7

Also, how do I run inference after training the model?

The point is, you must train the model yourself for the project. 🙂
There will be no learning at all if you just use a pre-trained model to do detection and recognition.
That being said, you can check this repo for a pre-trained checkpoint for demo purposes: CRNN.PyTorch

Generally, your ground truth will always be smaller than your number of time steps.
If not, you need to design your last CNN feature map to have a width at least as large as your ground-truth length. In practice, we just pass a crop containing only 1 word (or at most 2) to the recognition network, though in general this need not be true; it depends on how you design the detection and recognition modules.
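If you do need more time steps, one common trick (used in the CRNN paper’s architecture) is to pool only along the height in the later layers so the width survives deeper into the network. A rough sketch with illustrative channel sizes:

```python
import torch.nn as nn

# Pool height aggressively but keep the width (and hence the number of RNN
# time steps W') large. For a W x 32 input this ends at roughly W/4 steps.
cnn = nn.Sequential(
    nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2, 2),            # 32 x W   -> 16 x W/2
    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2, 2),            # 16 x W/2 ->  8 x W/4
    nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d((2, 1)),          #  8 x W/4 ->  4 x W/4 (width preserved)
    nn.Conv2d(256, 512, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d((2, 1)),          #  4 x W/4 ->  2 x W/4 (width preserved)
)
```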

Yes, this is taken care of by the CTC loss. Please check the article I linked in my previous reply to see how CTC outputs are handled; it’s simple.
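To give a flavour of it: at inference time, a simple best-path decode takes the argmax at each time step, collapses repeated symbols, and drops blanks. A minimal sketch, assuming blank index 0:

```python
import torch

def ctc_greedy_decode(log_probs, blank=0):
    """Best-path CTC decoding: argmax per step, collapse repeats, drop blanks.
    log_probs: (T, N, C) tensor of per-step log-probabilities."""
    best = log_probs.argmax(dim=2)            # (T, N) best class per step
    decoded = []
    for n in range(best.shape[1]):
        seq, prev = [], blank
        for idx in best[:, n].tolist():
            if idx != prev and idx != blank:  # new, non-blank symbol
                seq.append(idx)
            prev = idx
        decoded.append(seq)
    return decoded                            # label indices per sample
```

Mapping the resulting indices back to characters gives the final word; this also covers the earlier question about running inference after training.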

Please try cross-comparing your implementation with the repo I sent yesterday. The code there is not complicated.

As I said, please check the article I mentioned; it tells you how to process the CTC output.

Yes, I did see the code and read the document on CTC. I have tried to implement a simplified version similar to that repo, but I don’t understand what’s wrong in my code. Could you please take a look? @GokulNC

https://colab.research.google.com/drive/19N142zOad2K-sRC6C1XW4KjlYr_nhnc6

I didn’t look into your code in detail, but one thing I noticed is that you do not seem to have handled the blank token ('-' as in CTC) and the end-of-text token. Please check that.
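Concretely, the blank index (0 by `CTCLoss`’s default) should be reserved and never appear in your encoded ground-truth labels. A hypothetical encoding sketch:

```python
# Hypothetical character set: index 0 is reserved for the CTC blank, so real
# characters are encoded from index 1 upward and never collide with it.
CHARS = "abcdefghijklmnopqrstuvwxyz0123456789"
char_to_idx = {c: i + 1 for i, c in enumerate(CHARS)}
idx_to_char = {i: c for c, i in char_to_idx.items()}

def encode(word):
    return [char_to_idx[c] for c in word]     # never emits 0 (the blank)

def decode(indices):
    return "".join(idx_to_char[i] for i in indices)
```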

Hello, I tried using an end-of-text token. I had already used the blank token earlier. CTCLoss has a default argument blank=0, and my blank token was 0 in the ground_truth labels. I’ve been trying to figure out what’s wrong in my code but can’t find it. I’ve matched the arguments I pass to CTCLoss against other CRNN projects, and they match, but my loss still isn’t changing; it stays at the same level from start to end. Please, I need help. Could you please find out what’s wrong in my code? @GokulNC

https://colab.research.google.com/drive/1S1KsqT66OeOVrsKYBmoFfqLhl09-Hx2X

What are the expected validation and test accuracies on the provided train and test data for Text Recognition? I’m using a generic CRNN and getting a validation accuracy of 80%. What accuracy should I expect?