Applying OCR on scanned documents

For extracting text from scanned documents using OCR we’ve used the tesseract library and for the image processing we have used some techniques recommended in this documentation page.

Some problems that we’ve encountered are that Tesseract v4 version is using a new LSTM engine which does not support extracting the font family or font size when applying the OCR.

So far our results on a PDF which contains a scanned image are the following, varying depending on the page segmentation methods used:

PDF scanned document

  1. Default page segmentation (#3)

  2. “Assume a single uniform block of text” page segmentation (#6)

  3. “Assume a single column of text of variable sizes” page segmentation (#4) - this method gave the best result on this scanned document

1 Like

I think I read somewhere about a separate library/tool that can look at images to do font guessing (not doing OCR as such). Will dig about in my notes.