I have some doubts about implementing the “Text Recognition” module of this project; I’m not sure I’ve understood it completely.
This is what I’ve understood:
- Pass the image through a CNN (without the final average pool and fully connected layers)
- Upscale the resulting feature volume to the input dimensions
- Slice the upscaled volume and pass the slices through an RNN
- Predict on each slice to get a sequence of characters
- Apply CTC loss to the predicted character sequence to get the appropriate output/word
Please confirm whether I’ve understood this correctly.
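In case it makes my understanding easier to check, here is a minimal sketch of how I picture the pipeline in PyTorch. The model name, layer sizes, and the width-as-time-steps reshape are my own assumptions, not taken from the project:

```python
import torch
import torch.nn as nn

class TinyCRNN(nn.Module):
    """Toy CNN -> RNN -> per-slice logits pipeline (my own sketch)."""

    def __init__(self, num_classes):  # num_classes includes the CTC blank
        super().__init__()
        # CNN backbone without the final avg-pool / fully connected head
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),               # halves height and width
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),          # halves height only, keeps width
        )
        # For a 32-pixel-tall input, the feature map is 8 tall with 64 channels
        self.rnn = nn.LSTM(input_size=64 * 8, hidden_size=128,
                           bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * 128, num_classes)

    def forward(self, x):                  # x: (B, 1, 32, W)
        f = self.cnn(x)                    # (B, 64, 8, W/2)
        b, c, h, w = f.shape
        # Treat each width position as one "slice" / RNN time step
        f = f.permute(0, 3, 1, 2).reshape(b, w, c * h)
        out, _ = self.rnn(f)               # (B, W/2, 256)
        return self.fc(out)                # per-time-step class logits
```

Is slicing the feature map column-wise like this roughly what the module intends, or is the upscaling step supposed to happen before the slicing?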
Things I have doubts about:
- How do I upscale the feature volume obtained at the end of the CNN? Won’t upscaling tamper with the information stored in the filters?
- How do I implement CTC loss on top of the RNN’s predictions to recover the final sequence from the per-slice predictions?
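For the CTC doubt, this is my current guess, assuming PyTorch’s `nn.CTCLoss` is the intended tool; the toy shapes and the `greedy_decode` helper are my own invention, so please correct me if this is off:

```python
import torch
import torch.nn as nn

# Toy dimensions: T time steps (slices), B batch size, C classes.
T, B, C = 50, 2, 11                              # class 0 is the CTC blank
log_probs = torch.randn(T, B, C).log_softmax(2)  # CTCLoss expects log-probs, shape (T, B, C)

# Targets: label indices concatenated across the batch, without blanks.
targets = torch.tensor([1, 2, 3, 4, 5, 6])       # sample 1: [1,2,3,4]; sample 2: [5,6]
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.tensor([4, 2])

loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)

# Decoding is separate from the loss; the simplest is greedy (best-path):
def greedy_decode(path, blank=0):
    """Collapse consecutive repeats, then drop blanks."""
    out, prev = [], None
    for p in path:
        if p != prev and p != blank:
            out.append(p)
        prev = p
    return out
```

So my understanding is that the loss handles the alignment during training, and at inference time I just take the argmax per time step and collapse it as above. Is that right, or does the project expect beam-search decoding?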