Extracting non-text images from scanned documents - OCR

We’ve used OpenCV on the scanned document image to get the contours of the objects found in that image, which should mainly be fragments of text or other images.
The results contain both text fragments and images, and we can filter them, for example by height, to eliminate most of the text fragments.

Some results:

  1. Large font, large logos.
    logo

Result:

ROI_0

  2. An invoice

Result:

  • ROI_0

  • ROI_3

  • ROI_2

  • ROI_1

  3. Small text, text overlapping an image, and images with various backgrounds (the objects need to have a different foreground than the image background)

Result:

Looks promising. It seems to be cutting a few pixels from the left of the image?

It does on the first one, and that seems to be because that part of the image is yellow. The OCR process converts the image to grayscale, where yellow is ‘closer’ to white than to black, and OCR performs the search on the black portions of the image.
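To make the yellow case concrete, here is a small sketch using the standard BT.601 luma weights, which are the ones OpenCV's BGR-to-gray conversion uses. The fixed cutoff of 128 is only an assumption for illustration; the actual pipeline may use a different or adaptive threshold.

```python
def luma(r, g, b):
    """Grayscale value of an RGB color using the BT.601 weights."""
    return 0.299 * r + 0.587 * g + 0.114 * b


THRESHOLD = 128  # hypothetical fixed cutoff between "dark" and "light"

yellow = luma(255, 255, 0)  # pure yellow
black = luma(0, 0, 0)       # pure black

# Anything above the cutoff is treated as background (white).
yellow_is_background = yellow > THRESHOLD
black_is_background = black > THRESHOLD
```

Pure yellow lands at roughly 226 out of 255, so after thresholding it falls on the white side and is not picked up as part of the object.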

This is the object it detected:
Screenshot from 2021-06-17 15-46-14

Also, this is the result we get if we don’t filter based on the height of the objects:
Screenshot from 2021-06-17 15-48-51
