Baseline requirements for each step in workflow

lucian.pricop · 29 May 2020 06:54

ingest:

We need a tool or toolchain that can open as many file types as possible and extract their text and/or images. At a minimum, it must support doc, docx, odt, odf and pdf.
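
As a rough illustration of the ingest step, the sketch below extracts plain text from docx and odt files using only the Python standard library, relying on the fact that both formats are ZIP archives with an XML body. The names `extract_text` and `_paragraphs` are hypothetical, images are ignored, and doc and pdf are left out because they would need external tools.

```python
import zipfile
import xml.etree.ElementTree as ET
from pathlib import Path

# XML namespaces of the paragraph elements in each format.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
TEXT_NS = "{urn:oasis:names:tc:opendocument:xmlns:text:1.0}"


def _paragraphs(source, member, para_tag):
    """Read one XML member of the archive and return the text of each paragraph."""
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read(member))
    return ["".join(p.itertext()) for p in root.iter(para_tag)]


def extract_text(source, suffix=None):
    """Dispatch on the file extension; source is a path or a binary file object."""
    suffix = (suffix or Path(source).suffix).lower()
    if suffix == ".docx":
        return "\n".join(_paragraphs(source, "word/document.xml", W_NS + "p"))
    if suffix == ".odt":
        return "\n".join(_paragraphs(source, "content.xml", TEXT_NS + "p"))
    # Assumption: other formats (doc, pdf, ...) would plug in here via external tools.
    raise ValueError(f"no extractor registered for {suffix!r}")
```

A dispatcher keyed on the extension keeps it easy to add further formats later without touching the existing extractors.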

privacy shield

A tool that would remove the embedded metadata from ingested documents.

OCR:

Because we aim to extract text from images as well, including the images of scanned documents sent as pdf, we need a set of tools that would be able to extract text from document images.

Font detection

We need a tool that would recognize the font used in scanned documents or images.

Tagging:

We need a way to apply each search operation and tag its matches accordingly. This could be done with a set of regex formulas or, for more complex searches, an AI module that bases the search on semantics rather than on literal characters.
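
A minimal sketch of the regex variant of tagging, assuming each search operation is a named pattern. The `RULES` table, the `Tag` record and both patterns are made up for illustration and are not meant as production-grade matchers.

```python
import re
from dataclasses import dataclass

# Hypothetical rule set: tag label -> compiled pattern (illustrative only).
RULES = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}


@dataclass(frozen=True)
class Tag:
    label: str
    start: int
    end: int
    text: str


def tag_text(text, rules=RULES):
    """Run every rule over the text and return the matches sorted by position."""
    tags = [
        Tag(label, m.start(), m.end(), m.group())
        for label, pattern in rules.items()
        for m in pattern.finditer(text)
    ]
    return sorted(tags, key=lambda t: t.start)


sample = "Contact ana@example.org or call +40 721 000 111."
tags = tag_text(sample)
```

Keeping the character offsets in each tag means a later step can replace or delete exactly the tagged span without searching again.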

displace:

This will be a (non-)simple regex formula to quickly replace or delete the tagged text.
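
One way this could look, assuming each displace rule is a pattern paired with a replacement string, where an empty replacement means deletion. `DISPLACE_RULES` and `displace` are hypothetical names, and the patterns are examples only.

```python
import re

# Hypothetical rules: (pattern, replacement). An empty replacement deletes the match.
DISPLACE_RULES = [
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "[EMAIL]"),
    (re.compile(r"\s*\(internal\)"), ""),
]


def displace(text, rules=DISPLACE_RULES):
    """Apply each rule in order and return the rewritten text."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text


result = displace("Write to ana@example.org (internal) for details.")
# result == "Write to [EMAIL] for details."
```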

output

We need a tool that would generate the actual ODT document from the resulting text.
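
For illustration only, the sketch below writes a bare-bones ODT package by hand with the standard library: a `mimetype` entry, a manifest and a `content.xml` with one paragraph per line of text. `build_odt` is a hypothetical helper, styles and metadata are omitted, and the output has not been checked against an ODF validator; a dedicated ODF library would likely be the safer choice in practice.

```python
import io
import zipfile
from xml.sax.saxutils import escape

ODT_MIMETYPE = "application/vnd.oasis.opendocument.text"

MANIFEST = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<manifest:manifest'
    ' xmlns:manifest="urn:oasis:names:tc:opendocument:xmlns:manifest:1.0"'
    ' manifest:version="1.2">'
    f'<manifest:file-entry manifest:full-path="/" manifest:media-type="{ODT_MIMETYPE}"/>'
    '<manifest:file-entry manifest:full-path="content.xml" manifest:media-type="text/xml"/>'
    '</manifest:manifest>'
)

CONTENT_TEMPLATE = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<office:document-content'
    ' xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"'
    ' xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0"'
    ' office:version="1.2">'
    '<office:body><office:text>{paragraphs}</office:text></office:body>'
    '</office:document-content>'
)


def build_odt(text):
    """Return the bytes of a minimal ODT file with one paragraph per input line."""
    paragraphs = "".join(
        f"<text:p>{escape(line)}</text:p>" for line in text.splitlines()
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # The "mimetype" entry goes first and is stored uncompressed.
        archive.writestr("mimetype", ODT_MIMETYPE, compress_type=zipfile.ZIP_STORED)
        archive.writestr("META-INF/manifest.xml", MANIFEST)
        archive.writestr("content.xml", CONTENT_TEMPLATE.format(paragraphs=paragraphs))
    return buffer.getvalue()
```

The returned bytes can be written to a file with an `.odt` extension, for example via `Path("result.odt").write_bytes(build_odt(text))`.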

diff highlighter

Useful for seeing the difference between the original text and the result of applying the search and displace operation(s), and even more useful during the testing phase for finding differences between the actual result and the expected result.
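
A small sketch of such a highlighter, assuming a line-based unified diff is good enough for a first version. It uses Python's standard `difflib`; the `highlight_diff` name and the sample texts are made up.

```python
import difflib


def highlight_diff(original, result):
    """Return a unified diff of two texts, compared line by line."""
    return list(
        difflib.unified_diff(
            original.splitlines(),
            result.splitlines(),
            fromfile="original",
            tofile="result",
            lineterm="",
        )
    )


diff_lines = highlight_diff(
    "Dear Ana Pop,\nYour file is ready.",
    "Dear [NAME],\nYour file is ready.",
)
print("\n".join(diff_lines))
```

During testing, the same function could compare the produced result with the expected result; an empty list means the two texts are identical.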