Here’s a proposed workflow:
- The user runs a search-and-displace command from the console or from the interface (the interface is assumed for interactive users; the CLI is for scripted runs where the search and displace operations are predetermined)
- The user specifies an input document or documents for markup and a list of operations to be run
- ingest : searchanddisplace opens a copy of the document and runs “ingest” on it to extract text
- privacy shield : the document(s) are stripped of metadata
- ocr : if the document contains images (including when the document is a scanned PDF), ingest runs OCR on them to extract text
- font detection : ingest runs font detection on images
- UI to create search/displace operations (optional in workflow, probably used in initial runs)
- tagging : searchanddisplace will run tagging depending on each operation (searching part)
- UI to review effect of search (and displace) operations (optional in workflow, probably used in initial runs)
- displace : searchanddisplace will run displacement operations on tags found previously depending on each operation (displace part)
- output : searchanddisplace will save the resulting text in the preferred output format (e.g. ODT, PDF, or Markdown), with stages that reconstruct the original document as closely as possible, including use of matching fonts (or free equivalents), reinserting structures, images (redacted as needed) etc… Reconstruction at the highest fidelity is the option to use when redacting a document for sharing.
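
The tagging and displace steps above can be sketched as follows. This is a minimal illustration only, not searchanddisplace's actual code: the `Operation` and `Tag` classes, the `tag` and `displace` functions, and the two example regular expressions are all hypothetical names and patterns chosen for the example. It assumes ingest has already produced plain text and that each operation is a regular expression paired with a replacement string.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Operation:
    """One search/displace rule (hypothetical structure)."""
    name: str
    pattern: str       # regular expression used by the tagging (search) step
    replacement: str   # text substituted by the displace step


@dataclass(frozen=True)
class Tag:
    """A match found during tagging: which operation hit, and where."""
    operation: Operation
    start: int
    end: int


def tag(text, operations):
    """Search step: collect matches for every operation, in document order."""
    found = []
    for op in operations:
        for m in re.finditer(op.pattern, text):
            found.append(Tag(op, m.start(), m.end()))
    # Sort by position; at the same start, prefer the longer match.
    found.sort(key=lambda t: (t.start, -(t.end - t.start)))
    # Drop any match that overlaps one already kept.
    kept, cursor = [], 0
    for t in found:
        if t.start >= cursor:
            kept.append(t)
            cursor = t.end
    return kept


def displace(text, tags):
    """Displace step: rebuild the text, swapping each tagged span for its replacement."""
    parts, cursor = [], 0
    for t in tags:
        parts.append(text[cursor:t.start])
        parts.append(t.operation.replacement)
        cursor = t.end
    parts.append(text[cursor:])
    return "".join(parts)


# Example operations (illustrative patterns, not a complete PII detector).
operations = [
    Operation("email", r"[\w.+-]+@[\w-]+\.[\w.-]+", "[EMAIL]"),
    Operation("phone", r"\+?\d[\d -]{7,}\d", "[PHONE]"),
]
sample = "Contact Jane at jane.doe@example.org or +44 20 7946 0958."
tags = tag(sample, operations)      # could be shown in the review UI here
result = displace(sample, tags)
print(result)
```

Keeping tagging separate from displacement means the list of tags can be inspected, or filtered in the optional review UI, before any text is changed.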