We need a tool or toolchain that can open as many file types as possible and extract their text and/or images. At a minimum, it must support doc, docx, odt, odf and pdf.
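As a rough illustration of text extraction for one of the listed formats, the sketch below reads paragraph text from a docx file using only the Python standard library. It relies on the fact that a docx file is a ZIP archive whose main body is stored in `word/document.xml`. The function name `docx_paragraphs` is hypothetical, and the other formats (doc, odt, odf, pdf) would need their own extractors; this is not a proposal for the final toolchain.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(source):
    """Return the text of each paragraph in a docx file.

    `source` is a path or a binary file-like object. Hypothetical helper:
    it ignores headers, footers, footnotes and embedded images.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for paragraph in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> runs.
        runs = (node.text or "" for node in paragraph.iter(W_NS + "t"))
        paragraphs.append("".join(runs))
    return paragraphs
```

Calling `docx_paragraphs("report.docx")` on a hypothetical file would return one string per paragraph, which can then be passed on to the search step.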
We need a tool that removes the embedded metadata from ingested documents.
Because we also aim to extract text from images, including scanned documents sent as pdf, we need a set of tools that can extract text from document images.
We need a tool that recognizes the font used in scanned documents or images.
We need a way to apply each search operation and tag its matches accordingly. This could be done with a set of regular expressions or, for more complex searches, an AI module that bases the search on semantics rather than on literal characters.
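The regex-based variant of this step might look roughly like the sketch below. The rule names (`EMAIL`, `PHONE`), their patterns, and the `tag_text` helper are illustrative assumptions, not part of the requirements. The semantic (AI-based) variant is not covered here.

```python
import re
from typing import NamedTuple


class TaggedMatch(NamedTuple):
    tag: str    # name of the rule that produced the match
    start: int  # offset of the first matched character
    end: int    # offset one past the last matched character
    text: str   # the matched text itself


# Hypothetical rule set: tag name -> compiled pattern (deliberately simplistic).
RULES = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}


def tag_text(text, rules=RULES):
    """Run every rule over the text and return its matches, ordered by position."""
    matches = [
        TaggedMatch(tag, m.start(), m.end(), m.group())
        for tag, pattern in rules.items()
        for m in pattern.finditer(text)
    ]
    return sorted(matches, key=lambda match: match.start)
```

Keeping the offsets alongside each tag lets a later step replace or delete exactly the tagged spans without searching again.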
This will be a (not necessarily simple) regular expression that quickly replaces or deletes the tagged text.
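A minimal sketch of the replace-or-delete step, assuming each rule carries its own replacement string and that an empty replacement means deletion. The rule names, patterns, and the `displace` helper are hypothetical.

```python
import re

# Hypothetical rules: (tag, pattern, replacement). An empty replacement deletes the match.
RULES = [
    ("EMAIL", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "[EMAIL]"),
    ("DRAFT_MARK", re.compile(r"\s*\(draft\)", re.IGNORECASE), ""),
]


def displace(text, rules=RULES):
    """Apply every rule in order; return the new text and a per-tag count of changes."""
    counts = {}
    for tag, pattern, replacement in rules:
        # subn returns the rewritten text together with the number of substitutions.
        text, changed = pattern.subn(replacement, text)
        counts[tag] = changed
    return text, counts
```

For example, `displace("Report (draft) by jane@example.org")` returns the text `"Report by [EMAIL]"` with one change counted for each rule. Rules run in list order, so a later rule sees the output of earlier ones.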
We need a tool that generates the actual ODT document from the resulting text.
This is useful for seeing the difference between the original text and the result of applying the search and displace operation(s). It is even more useful during the testing phase, for finding differences between the actual result and the expected result.
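One possible way to produce such a comparison is Python's standard `difflib` module, sketched below. The `text_diff` helper and its labels are assumptions for illustration; a dedicated diff tool could serve the same purpose.

```python
import difflib


def text_diff(original, result, from_label="original", to_label="result"):
    """Return a unified diff of two texts, or an empty string if they are identical."""
    lines = difflib.unified_diff(
        original.splitlines(),
        result.splitlines(),
        fromfile=from_label,
        tofile=to_label,
        lineterm="",  # input lines carry no newline, so the header lines should not either
    )
    return "\n".join(lines)
```

During testing, the same helper can compare the actual output with the expected output: an empty return value means the two texts match line for line.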