Applying OCR on scanned documents

ionut.orzu · 10 June 2021 08:02

To extract text from scanned documents with OCR we used the Tesseract library, and for image preprocessing we applied some of the techniques recommended in this documentation page.

One problem we’ve encountered is that Tesseract v4 uses a new LSTM engine, which does not support extracting the font family or font size when applying OCR.

So far, our results on a PDF containing a scanned image are as follows; they vary with the page segmentation mode used:

PDF scanned document

Default page segmentation (#3)

[Screenshot from 2021-06-10 10-51-29]

“Assume a single uniform block of text” page segmentation (#6)

[Screenshot from 2021-06-10 10-54-06]

“Assume a single column of text of variable sizes” page segmentation (#4) - this method gave the best result on this scanned document

[Screenshot from 2021-06-10 10-55-55]
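For anyone who wants to reproduce this kind of comparison, here is a minimal sketch of running the three page segmentation modes on the same page. It is not the exact code we used: it assumes the `tesseract` command-line tool is installed and on the PATH, that the scanned PDF page has already been exported as an image, and the helper names (`build_tesseract_cmd`, `run_ocr`, `compare_modes`) and the default language `eng` are made up for illustration.

```python
import subprocess

# Page segmentation modes compared in this thread (values for Tesseract's --psm flag).
PSM_MODES = {
    3: "Fully automatic page segmentation, no OSD (default)",
    4: "Assume a single column of text of variable sizes",
    6: "Assume a single uniform block of text",
}


def build_tesseract_cmd(image_path, psm, lang="eng"):
    """Return the argv list for one Tesseract run that prints the text to stdout."""
    # Tesseract 4 accepts page segmentation modes 0 through 13.
    if not 0 <= psm <= 13:
        raise ValueError(f"psm must be between 0 and 13, got {psm}")
    return ["tesseract", image_path, "stdout", "--psm", str(psm), "-l", lang]


def run_ocr(image_path, psm, lang="eng"):
    """Run Tesseract on one image and return the recognised text."""
    result = subprocess.run(
        build_tesseract_cmd(image_path, psm, lang),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if Tesseract exits with an error
    )
    return result.stdout


def compare_modes(image_path, modes=PSM_MODES):
    """OCR the same image once per mode and return a {psm: text} dict."""
    return {psm: run_ocr(image_path, psm) for psm in modes}
```

Calling `compare_modes("page.png")` (with `page.png` standing in for whatever image the PDF page was exported to) gives one text result per mode, which makes it easier to compare the outputs side by side instead of through screenshots.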

putt1ck · 12 June 2021 07:51

I think I read somewhere about a separate library/tool that can look at images to do font guessing (not doing OCR as such). Will dig about in my notes.