Extract data from PDF or Image

I need suggestions on how to read data from customer-submitted PDFs such as tax forms and application forms. The PDFs may be scanned images or text-based. My current approach is to convert the PDF to PNG, preprocess the images, and use Tesseract to extract the text, but the results are poor. I have tried OpenCV, PIL, PyPDF2, PyMuPDF, PDFMiner, etc. without getting the expected results. I have also tried edge detection, contour detection, bounding boxes, feature extraction, Hough lines, hOCR, and ALTO XML, but the results are not consistent. Can anyone please suggest the right approach? Thanks.

You can use the tabula or camelot library.

Hi @vvr72,
You may refer to "Key value extraction from documents using OCR".
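
To make the key-value idea concrete, here is a rough sketch, not the method from the linked page. It assumes the OCR words come from pytesseract.image_to_data(..., output_type=Output.DICT), that each label and its value sit on the same printed line, and that you know the label texts in advance. The function names words_to_lines and extract_fields and the sample labels are made up for illustration.

```python
from collections import defaultdict


def words_to_lines(data):
    """Group OCR words into text lines, with words ordered left to right.

    `data` is assumed to have the shape of pytesseract's image_to_data
    DICT output: parallel lists under "text", "left", "block_num",
    "par_num" and "line_num".
    """
    lines = defaultdict(list)
    for i, text in enumerate(data["text"]):
        if not text.strip():
            continue  # skip the empty entries Tesseract emits for layout levels
        key = (data["block_num"][i], data["par_num"][i], data["line_num"][i])
        lines[key].append((data["left"][i], text.strip()))
    return [" ".join(word for _, word in sorted(words))
            for _, words in sorted(lines.items())]


def extract_fields(lines, labels):
    """Return {label: text that follows the label on the same line}."""
    fields = {}
    for line in lines:
        lowered = line.lower()
        for label in labels:
            pos = lowered.find(label.lower())
            if pos != -1 and label not in fields:
                # Drop separators such as ":" or "___" between label and value.
                fields[label] = line[pos + len(label):].lstrip(" :.-_").strip()
    return fields


# Hand-written sample in the assumed shape (not real OCR output).
sample = {
    "text": ["", "Name:", "Jane", "Doe", "SSN:", "123-45-6789"],
    "left": [0, 10, 80, 130, 10, 80],
    "block_num": [1, 1, 1, 1, 1, 1],
    "par_num": [1, 1, 1, 1, 1, 1],
    "line_num": [0, 1, 1, 1, 2, 2],
}
fields = extract_fields(words_to_lines(sample), ["Name", "SSN"])
```

This only handles values printed to the right of their label. Forms where the value sits below the label would need matching on the "top" coordinates as well.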

Thanks, but tabula does not support scanned images saved as PDF. Anyway, I will try it.

Thanks. I have seen this page before but did not try it, for a few reasons. I will give it a try and see if it helps.

This one takes an image file as input. I guess he wants to process scanned documents saved as PDF. But I guess we can convert the PDF to PNG files, one image per page, and then apply this.
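
A minimal sketch of that per-page conversion, assuming PyMuPDF (imported as fitz) is installed. Since the PDFs may be either text or scanned, it keeps the embedded text when a page has some and only renders a PNG when it does not. The names has_text_layer and split_pdf, the 20-character threshold, and the 300 dpi default are assumptions for illustration, not tested settings.

```python
from pathlib import Path


def has_text_layer(page_text: str, min_chars: int = 20) -> bool:
    """Heuristic: a page counts as digital text if it yields enough characters."""
    return sum(not ch.isspace() for ch in page_text) >= min_chars


def split_pdf(pdf_path: str, out_dir: str, dpi: int = 300):
    """Return one ("text", str) or ("image", png_path) entry per page."""
    import fitz  # PyMuPDF, imported here so the helper above works without it

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    results = []
    with fitz.open(pdf_path) as doc:
        for index, page in enumerate(doc, start=1):
            text = page.get_text()
            if has_text_layer(text):
                # Embedded text is usually more reliable than OCR of a render.
                results.append(("text", text))
            else:
                png_path = out / f"page_{index:03d}.png"
                page.get_pixmap(dpi=dpi).save(str(png_path))
                results.append(("image", str(png_path)))
    return results
```

Calling split_pdf("form.pdf", "pages") with a hypothetical file name would write pages/page_001.png and so on for the scanned pages only. Those PNGs can then go through whichever OCR step is chosen.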

Yes, he’s mentioned this in the description. I thought he was asking for an approach to use after the preprocessing of the PNGs.

Thanks. Yes, I convert the PDF to PNG first, do preprocessing such as thresholding, opening, closing, erosion, dilation, and Canny edge detection, and then run Tesseract OCR. I have tried OpenCV Hough line detection, feature extraction, etc. I have also tried the PDF text libraries such as PyPDF2, PyMuPDF, Tabula, and camelot. I have not tried deep learning techniques, so I wanted a simple approach to extract data from forms. I am not interested in AWS Textract, Azure Form Recognizer, Google AI, etc.
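
For reference, the pipeline described above can be sketched roughly as follows, assuming opencv-python and pytesseract are installed and the Tesseract binary is on the PATH. It uses only grayscale loading plus Otsu thresholding, with none of the morphology or Canny steps listed; that is a choice made for the sketch, not a claim that it gives better results. The function names and the default page segmentation mode 6 (a single uniform block of text) are assumptions to adjust per form.

```python
def tesseract_config(psm: int = 6, oem: int = 3) -> str:
    """Build the Tesseract CLI options string passed through pytesseract."""
    return f"--oem {oem} --psm {psm}"


def ocr_png(png_path: str, psm: int = 6) -> str:
    """Binarize one page image with Otsu's threshold and run Tesseract on it."""
    import cv2          # opencv-python
    import pytesseract  # wrapper around the Tesseract binary

    image = cv2.imread(png_path, cv2.IMREAD_GRAYSCALE)
    if image is None:
        raise FileNotFoundError(png_path)  # cv2.imread returns None on failure
    # Otsu picks the threshold automatically; the 0 passed here is ignored.
    _, binary = cv2.threshold(image, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return pytesseract.image_to_string(binary, config=tesseract_config(psm))
```

It may be worth comparing the output of a few --psm values on the same page before adding more preprocessing.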

I have not yet tried your suggestion of key-value pair extraction. I will try it and update here. Thanks.