Extract data from PDF or Image

I need suggestions on how to read data from PDFs sent by customers, such as tax forms and application forms. The PDFs may be image-based (scanned) or text-based. My current approach is to convert each PDF to PNG, pre-process the images, and use Tesseract to extract the text, but the results are not good. I have tried OpenCV, PIL, PyPDF2, PyMuPDF, PDFMiner, etc. without getting the expected results. I have also tried edge detection, contour detection, bounding boxes, feature extraction, Hough lines, hOCR, and ALTO XML, but the results are not consistent. Can anyone please suggest the right approach? Thanks.

You can use the tabula or camelot library.

Hi @vvr72,
You may refer to Key value extraction from documents using OCR.

Thanks, but tabula does not support scanned images saved as PDF. Anyway, I will try it.

Thanks. I have seen this page before but did not try it, for a few reasons. I will give it a try and see if it helps.

This one takes an image file as input. I guess he wants to process scanned documents saved as PDF. But we can convert the PDF to PNG files, one image per page, and then apply this.
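
The page-by-page conversion could be sketched roughly as below. This assumes the `pdftoppm` command-line tool from Poppler is installed and on the PATH; the function names, the `page` output prefix, and the 300 DPI default are illustrative choices, not something from this thread. A higher rendering resolution often helps OCR on small form text, so it may be worth experimenting with the DPI value.

```python
import subprocess
from pathlib import Path


def build_pdftoppm_cmd(pdf_path, out_prefix, dpi=300):
    """Return the argument list that renders every page of a PDF to PNG."""
    # -r sets the resolution in DPI; -png selects PNG output.
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(out_prefix)]


def render_pages(pdf_path, out_dir, dpi=300):
    """Render each page into out_dir and return the image paths in page order.

    Assumes pdftoppm is installed; it writes one file per page, named
    from the prefix plus a zero-padded page number (e.g. page-1.png).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(build_pdftoppm_cmd(pdf_path, out_dir / "page", dpi), check=True)
    return sorted(out_dir.glob("page-*.png"))
```

Each returned PNG can then go through the same pre-processing and OCR steps as any other image input.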

Yes, he’s mentioned this in the description. I thought he was asking for an approach to use after the preprocessing step on the PNG.

Thanks. Yes, I convert the PDF to PNG first, do pre-processing such as thresholding, opening, closing, erosion, dilation, and Canny edge detection, and then run Tesseract OCR. I have tried OpenCV Hough line detection, feature extraction, etc. I have also tried all the PDF text-based libraries such as PyPDF2, PyMuPDF, Tabula, and camelot. I have not tried deep learning techniques, so I wanted a simple approach to extract data from forms. I am not interested in AWS Textract, Azure Form Recognizer, Google AI, etc.

I have not yet tried your suggestion of key-value pair extraction. I will try it and update here. Thanks.

It's a bit of an old thread, but I suggest using a bounding-box approach. You can do that using only OpenCV, or if you want, try EAST with OpenCV. That should be OK in general.

If you need to extract the paragraphs and not only key-value pairs, then it will be a little tricky.

If you have already used bounding boxes and still have the issue, you can upload a sample image and I can try to help.
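
To make the bounding-box idea a bit more concrete, here is a minimal sketch of the step that comes after detection: grouping word boxes into rows and reading each row as a key-value pair. It assumes the boxes already exist (for example from Tesseract's word-level output or from contour detection) and that form labels end with a colon, which will not hold for every form. The `Word` class, the function names, and the `tolerance` parameter are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One recognized word and its bounding box, in pixels."""
    text: str
    left: int
    top: int
    width: int
    height: int


def group_into_rows(words, tolerance=0.5):
    """Group words into rows when their vertical centers are close.

    A word joins the current row if its center is within
    tolerance * word.height of the row's first word's center.
    """
    rows = []  # each entry: (reference_center_y, [words])
    for word in sorted(words, key=lambda w: w.top + w.height / 2):
        center = word.top + word.height / 2
        if rows and abs(center - rows[-1][0]) <= tolerance * word.height:
            rows[-1][1].append(word)
        else:
            rows.append((center, [word]))
    # Within each row, read left to right.
    return [sorted(row, key=lambda w: w.left) for _, row in rows]


def rows_to_fields(rows):
    """Turn rows shaped like 'Label: value' into a dict (assumes colon labels)."""
    fields = {}
    for row in rows:
        line = " ".join(w.text for w in row)
        key, sep, value = line.partition(":")
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    return fields


# Made-up sample boxes; slightly uneven tops imitate scan jitter.
sample = [
    Word("Name:", 40, 100, 60, 20),
    Word("Jane", 120, 102, 45, 20),
    Word("Doe", 170, 101, 40, 20),
    Word("SSN:", 40, 140, 50, 20),
    Word("000-00-0000", 120, 141, 110, 20),
]
fields = rows_to_fields(group_into_rows(sample))
# fields == {"Name": "Jane Doe", "SSN": "000-00-0000"}
```

Grouping by geometry rather than by raw OCR reading order tends to be more forgiving when a scan is slightly skewed. Multi-line values and paragraphs would need extra logic, such as merging rows that share a left edge.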