Some problems that we’ve encountered are that Tesseract v4 version is using a new LSTM engine which does not support extracting the font family or font size when applying the OCR.
So far our results on a PDF which contains a scanned image are the following, varying depending on the page segmentation methods used:
PDF scanned document
Default page segmentation (#3)
“Assume a single uniform block of text” page segmentation (#6)
“Assume a single column of text of variable sizes” page segmentation (#4) - this method gave the best result on this scanned document