Workflow and toolchain requirements

lucian.pricop · 29 May 2020 06:52

Here’s a proposed workflow:

The user runs a search and displace command from the console or from the interface (the interface is assumed for interactive users; the CLI would be for scripted processes where the search and displace operations are predetermined)
The user specifies an input document or documents for markup and a list of operations to be run
ingest : searchanddisplace opens a copy of the document and runs “ingest” on it to extract text
privacy shield : document(s) get stripped of metadata
ocr : if there are images in the document (including when the document is a scanned PDF), ingest will run OCR on them to extract text
font detection : ingest runs font detection on images
UI to create search/displace operations (optional in workflow, probably used in initial runs)
tagging : searchanddisplace will run tagging depending on each operation (searching part)
UI to review effect of search (and displace) operations (optional in workflow, probably used in initial runs)
displace : searchanddisplace will run displacement operations on tags found previously depending on each operation (displace part)
output : searchanddisplace will save the resulting text in the preferred output format (e.g. ODT, PDF, md document), with stages that reconstruct the original document as closely as possible, including use of matching fonts (or free equivalents) and reinserting structures, images (redacted as needed), etc. Reconstruction to the highest level is the option to use when redacting to share.
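
To make the order of the stages concrete, here is a rough sketch of the core of the workflow (privacy shield, tagging, displace) on plain text. It is only an illustration: the names `Operation`, `Tag`, `Document`, `privacy_shield`, `tag`, `displace` and `run` are made up for this example and are not an existing searchanddisplace API, operations are assumed to be simple regex rules, and the ingest, OCR, font detection and output stages are left out.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Operation:
    """Hypothetical search/displace rule: a regex to find and its replacement."""
    name: str
    pattern: str
    replacement: str


@dataclass
class Tag:
    """A match found during tagging, recorded as a character span."""
    operation: Operation
    start: int
    end: int


@dataclass
class Document:
    """Extracted text plus whatever metadata came with the source file."""
    text: str
    metadata: dict = field(default_factory=dict)


def privacy_shield(doc):
    """Return a copy of the document with all metadata removed."""
    return Document(text=doc.text, metadata={})


def tag(doc, operations):
    """Search part: collect every match of every operation."""
    tags = []
    for op in operations:
        for match in re.finditer(op.pattern, doc.text):
            tags.append(Tag(op, match.start(), match.end()))
    return tags


def displace(doc, tags):
    """Displace part: replace tagged spans from right to left so that the
    offsets of earlier spans stay valid. Assumes the spans do not overlap."""
    text = doc.text
    for t in sorted(tags, key=lambda t: t.start, reverse=True):
        text = text[:t.start] + t.operation.replacement + text[t.end:]
    return Document(text=text, metadata=doc.metadata)


def run(doc, operations):
    """Run the stages in workflow order: privacy shield, tagging, displace."""
    shielded = privacy_shield(doc)
    return displace(shielded, tag(shielded, operations))


if __name__ == "__main__":
    source = Document("Contact Ana at ana@example.org or 555-0100.", {"author": "Ana"})
    operations = [
        Operation("email", r"\b[\w.]+@[\w.]+\.\w+\b", "[EMAIL]"),
        Operation("phone", r"\b\d{3}-\d{4}\b", "[PHONE]"),
    ]
    print(run(source, operations).text)
```

Keeping tagging separate from displacement is what would let the optional review UI sit between the two: the list of tags can be shown to the user before anything is replaced.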

anon13131114 · 4 June 2020 06:26

I think for this we could make our own api/library:

The first step generates our “labeled font data set”, as described here, using the most commonly used fonts.
The second step uses our trained model to detect the fonts in images/documents with a neural network, as described on the same website here.
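
As a rough idea of what the first step could look like, the sketch below only builds the list of labeled samples (which font, which text, which output file). The names `Sample` and `build_manifest` and the file naming scheme are assumptions made for this example; actually rendering each sample to an image and training the network are not shown and would need an imaging library and an ML framework.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Sample:
    font: str      # label the classifier should learn to predict
    text: str      # text to render in that font
    filename: str  # where the rendered image would be saved


def build_manifest(fonts, texts):
    """Pair every font with every text snippet and assign an image file name."""
    samples = []
    for index, (font, text) in enumerate(product(fonts, texts)):
        safe_name = font.lower().replace(" ", "_")
        samples.append(Sample(font, text, f"{safe_name}_{index:05d}.png"))
    return samples


if __name__ == "__main__":
    for sample in build_manifest(["Liberation Serif", "DejaVu Sans"], ["Hello", "World"]):
        print(sample)
```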

lucian.pricop · 4 June 2020 10:54

That sounds promising

putt1ck · 18 June 2021 07:40

I think we should write this up as a future feature of interest, but identifying fonts from OCR text is not a core feature. This might be useful though https://github.com/robinreni96/Font_Recognition-DeepFont

If we can see a way to get basic font recognition (e.g. serif, sans serif, Courier, “handwriting font”) with minimal effort, that would be useful, but the most important part of the OCR side is accuracy.