Workflow and toolchain requirements

Here’s a proposed workflow:

  1. The user runs a search and displace command from the console or from the interface (the interface is assumed for users; the CLI would be for scripted processes where the search and displace operations are predetermined)
  2. The user specifies an input document or documents for markup and a list of operations to be run
  3. ingest : searchanddisplace opens a copy of the document and runs “ingest” on it to extract text
  4. privacy shield : metadata is stripped from the document(s)
  5. ocr : if there are images in the document (including when the document is a scanned PDF), ingest will run OCR on them to extract text
  6. font detection : ingest runs font detection on images
  7. UI to create search/displace operations (optional in workflow, probably used in initial runs)
  8. tagging : searchanddisplace will run tagging depending on each operation (searching part)
  9. UI to review effect of search (and displace) operations (optional in workflow, probably used in initial runs)
  10. displace : searchanddisplace will run displacement operations on tags found previously depending on each operation (displace part)
  11. output : searchanddisplace will save the resulting text in the preferred output format, e.g. ODT, PDF, or md document, with stages that reconstruct the original document as closely as possible, including use of matching fonts (or free equivalents) and reinserting structures, images (redacted as needed), etc. Reconstruction to the highest level is the option to use when redacting to share.
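The core of steps 4, 8 and 10 can be sketched in a few lines of Python. This is only a rough sketch under assumptions: the `Operation` and `Document` classes and all function names are hypothetical and not part of any existing searchanddisplace code, and ingest, OCR, font detection, the UI steps and output are left out. It also assumes that the operations do not match overlapping spans.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Operation:
    """One search/displace rule (hypothetical structure)."""
    name: str
    pattern: str      # regular expression used by the tagging stage
    replacement: str  # text substituted by the displace stage


@dataclass
class Document:
    """Working copy of an ingested document (hypothetical structure)."""
    text: str
    metadata: dict = field(default_factory=dict)
    tags: list = field(default_factory=list)  # (operation name, start, end)


def privacy_shield(doc):
    """Step 4: strip metadata from the working copy."""
    doc.metadata.clear()
    return doc


def tag(doc, operations):
    """Step 8 (search part): record where each operation matches; text is unchanged."""
    doc.tags = [
        (op.name, m.start(), m.end())
        for op in operations
        for m in re.finditer(op.pattern, doc.text)
    ]
    return doc


def displace(doc, operations):
    """Step 10 (displace part): replace tagged spans from right to left,
    so that the offsets of earlier tags stay valid."""
    by_name = {op.name: op for op in operations}
    for name, start, end in sorted(doc.tags, key=lambda t: t[1], reverse=True):
        doc.text = doc.text[:start] + by_name[name].replacement + doc.text[end:]
    doc.tags = []
    return doc


def run(doc, operations):
    """Run the three stages in workflow order."""
    doc = privacy_shield(doc)
    doc = tag(doc, operations)
    return displace(doc, operations)


if __name__ == "__main__":
    ops = [
        Operation("email", r"[\w.]+@[\w.]+", "[EMAIL]"),
        Operation("phone", r"\d{3}-\d{4}", "[PHONE]"),
    ]
    doc = Document("Contact jane@example.com or 555-1234", {"author": "x"})
    print(run(doc, ops).text)
```

Keeping tagging separate from displacement is what would let the optional review UI (step 9) show the matches before anything is changed.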

I think for this we could make our own API/library:

  • The first step generates our “labeled font data set” as described here, using the most commonly used fonts.

  • The second step uses our trained module to detect the fonts in images/documents with a neural network, as described on the same website here.

That sounds promising

I think we should write this up as a future feature of interest, but identifying fonts from OCR text is not a core feature. This might be useful though

If we can see a way to get basic font recognition (e.g. serif, sans serif, courier, “handwriting font”) with minimal effort, that would be useful, but the most important part of the OCR side is accuracy.
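One low-effort way to get to those basic classes, if a labeled font data set does get built, would be to collapse the per-font labels into the four coarse classes before any training. A minimal sketch, assuming samples are stored as (image path, font name) pairs; the mapping, the font names in it and the function names are illustrative only, not a project list:

```python
# Illustrative mapping from font name to coarse class (assumption, not a project list).
COARSE_CLASS = {
    "Times New Roman": "serif",
    "Liberation Serif": "serif",
    "Arial": "sans serif",
    "Liberation Sans": "sans serif",
    "Courier New": "courier",
    "Liberation Mono": "courier",
    "Caveat": "handwriting",
}


def coarse_label(font_name):
    """Return the coarse class for a font, or "unknown" if it is not mapped."""
    return COARSE_CLASS.get(font_name, "unknown")


def relabel(samples):
    """Turn (image_path, font_name) pairs into (image_path, coarse class) pairs."""
    return [(path, coarse_label(font)) for path, font in samples]


if __name__ == "__main__":
    print(relabel([("a.png", "Arial"), ("b.png", "Courier New")]))
```

A classifier with four output classes needs far fewer labeled samples per class than one that has to tell individual fonts apart, which fits the "minimal effort" goal.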