Extracting non-text images from scanned documents - OCR

ionut.orzu · 17 June 2021 12:30

We’ve used OpenCV on the scanned document image to get the contours of the objects found in that image, which should mainly be fragments of text or other images.
The results contain both text fragments and images, and we can filter them by their height, for example, to eliminate most of the text fragments.
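
A minimal sketch of the approach described above, not the exact code we run. The Otsu threshold, the `min_height` value of 60 pixels, and the function names are assumptions for illustration; the `ROI_n` file naming just mirrors the result name shown below.

```python
def filter_boxes_by_height(boxes, min_height):
    """Keep only (x, y, w, h) boxes at least min_height pixels tall.

    Text fragments are usually short, so this drops most of them.
    """
    return [box for box in boxes if box[3] >= min_height]


def extract_image_regions(image_path, min_height=60):
    """Find object contours in a scanned page and save the tall ones as crops."""
    # Imported here so the height filter above can be used without OpenCV.
    import cv2

    image = cv2.imread(image_path)
    if image is None:
        raise FileNotFoundError(image_path)

    # Grayscale, then an inverted binary image so dark objects become white.
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(
        gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU
    )

    # [-2] picks the contour list in both OpenCV 3.x and 4.x return layouts.
    contours = cv2.findContours(
        binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE
    )[-2]

    boxes = [cv2.boundingRect(contour) for contour in contours]
    kept = filter_boxes_by_height(boxes, min_height)

    # Save each remaining region as ROI_0.png, ROI_1.png, ...
    for index, (x, y, w, h) in enumerate(kept):
        cv2.imwrite(f"ROI_{index}.png", image[y:y + h, x:x + w])
    return kept
```

Calling `extract_image_regions("scan.png")` (a hypothetical file name) would write one crop per kept region. The right `min_height` depends on the scan resolution and font size, so it likely needs tuning per document type.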

Some results:

Large font, large logos.

Result:

ROI_0

An invoice

Result:

Small text, text overlapping an image, and images with various backgrounds (the objects need to have a different foreground color than the image background).

Result:

lucian.pricop · 17 June 2021 12:40

Looks promising. It seems to be cutting a few pixels from the left of the image?

ionut.orzu · 17 June 2021 12:45

It does on the first one, and that seems to be because the area it cuts is yellow. The OCR process converts the image to grayscale, where yellow is ‘closer’ to white than to black. The detection only searches the dark portions of the image, so the yellow area is treated as background.
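
To illustrate the grayscale point, here is a small sketch. It assumes the standard BT.601 weights (the ones OpenCV's `COLOR_BGR2GRAY` conversion uses) and a mid-range cutoff of 128; the actual threshold in our pipeline may differ.

```python
def luma(r, g, b):
    """Grayscale value of an RGB color using BT.601 weights (assumed)."""
    return round(0.299 * r + 0.587 * g + 0.114 * b)


def is_foreground(r, g, b, cutoff=128):
    """True if the color ends up on the dark side of the cutoff (assumed 128)."""
    return luma(r, g, b) < cutoff


yellow = luma(255, 255, 0)   # 226, very close to white (255)
black = luma(0, 0, 0)        # 0

print(yellow, is_foreground(255, 255, 0))  # yellow is treated as background
print(black, is_foreground(0, 0, 0))       # black is treated as foreground
```

Pure yellow lands at 226 out of 255, so after thresholding it falls on the same side as the white page.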

ionut.orzu · 17 June 2021 12:46

This is the object it detected
Screenshot from 2021-06-17 15-46-14

ionut.orzu · 17 June 2021 12:49

Also, this is the result we get if we don’t filter by the height of the objects.
Screenshot from 2021-06-17 15-48-51