Existing tools that could be used or adapted for use in the toolchain

ingest:

  • LibreOffice
  • pandoc – for document conversion at input

OCR:

Font detection – detect the font in scanned documents or embedded images:

  • Typefont – apparently not very accurate, so not reliable.
  • Create a set of fonts we support and, for each of them, a large set of images to train an AI module to detect them in new images. As suggested by Florin here

Tagging:

  • a set of regular expressions run through Sift or ripgrep
  • use facebook/duckling for what it can easily detect, such as phone numbers, times, URLs, etc.
  • AI module(s) for recognizing more specialized text and for semantic search
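To make the regex-tagging idea concrete, here is a minimal pure-Python sketch (independent of Sift/ripgrep; the tag names and patterns are invented for the example and far less thorough than what duckling would give us):

```python
import re

# Hypothetical patterns for a few of the dimensions we want to tag.
# Real patterns would be much more thorough (duckling handles these better).
PATTERNS = {
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "url": re.compile(r"https?://\S+"),
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
}

def tag_text(text):
    """Return (tag, start, end, matched_text) tuples for every pattern hit."""
    hits = []
    for tag, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            hits.append((tag, m.start(), m.end(), m.group()))
    return sorted(hits, key=lambda h: h[1])  # in document order

hits = tag_text("Call +40 721 123 456 or see https://example.org")
print(hits)
```

The (tag, offset) tuples are the important part: with offsets recorded per file, the later "displace" step can rewrite exactly the tagged spans and nothing else.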

Displace:

  • internal regex to replace tagged text (Sift)

output – for creating clean ODT:

  • LibreOffice
  • MAT2
  • pandoc for document conversion at the output

Just tested it out on a simple text using these fonts:

  • “Aldrich Regular” – failed
  • “Raleway Regular” – worked
  • “Roboto Regular” – failed
  • “Lora Regular” – failed

I’ve tested using different font sizes!

Maybe worth looking for another library/tool.

Sure, whatever you can find as long as it’s open source.

Found this article; the author says he was able to get 98% accuracy after 1.2 million training steps using a neural network.

We will also need a way/tool to identify the text language; it can be used when building the regexes and/or in the third-party libraries used to identify different dimensions like “quantity”, “distance”, “duration”, “phone numbers”, “temperature”, “time”, “volume”.
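For reference, a very naive language guess can be done with nothing but stopword counting – a minimal sketch (the word lists here are tiny and purely illustrative; a real detector such as langdetect, CLD3 or fastText would be far more reliable):

```python
# Tiny, illustrative stopword lists; a real detector (langdetect, CLD3,
# fastText, ...) would be far more robust than this sketch.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in"},
    "fr": {"le", "la", "et", "est", "de", "les"},
    "ro": {"si", "este", "de", "la", "un", "cu"},
}

def guess_language(text):
    """Pick the language whose stopword list overlaps the text the most."""
    words = set(text.lower().split())
    scores = {lang: len(words & sw) for lang, sw in STOPWORDS.items()}
    return max(scores, key=scores.get)

print(guess_language("the cat is in the house"))  # en
```

Whatever detector we pick, its output would feed both the regex selection and the dimension libraries mentioned above.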


There are a couple of natural-language libraries that could probably do this. But the quick and dirty initial solution is to limit multi-language ability to “what is set in the document” (with a fallback to English if it isn’t specified or the input format doesn’t support it).

Related: we’d want to flag the document language so it can be set again in the output, i.e. one of the things we’d want to recreate in ODT/MS-OOXML output would be the same language settings – even down to the paragraph level if it varies. For the text-phase output it would be good to have a language tag or tags for the document as a whole, as this might impact what happens to that output afterwards.
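In OOXML that per-run language setting lives in `w:lang` elements inside the run properties (in word/document.xml and styles.xml). A minimal stdlib sketch of collecting them, using a hand-made XML fragment as a stand-in for a real file:

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hand-made fragment of a word/document.xml run; real files carry the
# same w:lang element inside w:rPr (run properties).
DOCUMENT_XML = f"""
<w:document xmlns:w="{W}">
  <w:body><w:p><w:r>
    <w:rPr><w:lang w:val="en-GB"/></w:rPr>
    <w:t>Hello</w:t>
  </w:r></w:p></w:body>
</w:document>
"""

def document_languages(xml_text):
    """Collect every w:lang value so the output stage can restore them."""
    root = ET.fromstring(xml_text)
    return {el.get(f"{{{W}}}val") for el in root.iter(f"{{{W}}}lang")}

print(document_languages(DOCUMENT_XML))  # {'en-GB'}
```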

We could start by using facebook/duckling to create the initial tags; it currently supports 13 dimensions, but can be extended!

From what I’ve read and checked, Sift is really fast. It will give us the line numbers and the lines matching a pattern across multiple files/directories, which would make it easy to tag the searched element in multiple files. We should also maybe do a timing comparison with ripgrep.

pandoc:

  • It is a very good converter; it can convert many formats (except PDF files – where we can use pdftohtml instead – DOC, and RTF) to HTML, which then allows us to run search and displace over the files.
  • It loses the font but keeps the format, even if we try rebuilding the file in the original format after running search and displace on it.
  • It preserves the structural elements of a document, but not formatting details such as margin size.

LibreOffice:

  • Keeps the fonts while converting doc/docx/dot etc. files to HTML.
  • Supports all the MS Office formats and many others.

Poppler

  • Use pdftohtml from poppler-utils to convert PDF files into a format to which we can apply search and displace.
  • This works well; I’ve tested it on documents like docx, doc and images!
  • On the tested files the metadata was removed, as it should be!
  • Encountered some warnings while using MS Office files, but I checked the underlying XML and the private info was successfully removed.

Exiftool

  • This works well on images/videos!
  • It also supports a lot of formats, but from what I’ve checked it doesn’t really remove metadata from documents like doc and docx – it just reads it.
  • It’s great for using as it is, but it won’t be easy to extend the library from “user input” – basically, writing new rules will require a developer.

Can’t LibreOffice remove metadata instead of Exiftool? Exiftool sounds like a great tool for privacy-shielding images.

From what I’ve found, LibreOffice lets you remove metadata through the UI, but we can use MAT2, or we can write our own script to do that:

  • e.g. unzip the .docx and modify the XML file that contains the privacy info.
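A minimal sketch of that “unzip and edit the XML” idea with the stdlib – the in-memory .docx below is a hand-made stand-in (namespace declarations omitted), and MAT2 does this, and much more, properly:

```python
import io
import re
import zipfile

def scrub_core_props(docx_bytes):
    """Rewrite the archive, blanking author fields in docProps/core.xml.
    A .docx is just a zip; most personal info lives in core.xml."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as src, \
         zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == "docProps/core.xml":
                # Crude: empty out dc:creator / cp:lastModifiedBy.
                data = re.sub(rb"(<dc:creator>).*?(</dc:creator>)",
                              rb"\1\2", data)
                data = re.sub(rb"(<cp:lastModifiedBy>).*?(</cp:lastModifiedBy>)",
                              rb"\1\2", data)
            dst.writestr(item.filename, data)
    return out.getvalue()

# Hand-made stand-in for a real .docx (namespaces omitted for brevity).
core = (b'<cp:coreProperties><dc:creator>Jane Doe</dc:creator>'
        b'<cp:lastModifiedBy>Jane Doe</cp:lastModifiedBy></cp:coreProperties>')
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("docProps/core.xml", core)

scrubbed = scrub_core_props(buf.getvalue())
with zipfile.ZipFile(io.BytesIO(scrubbed)) as z:
    print(z.read("docProps/core.xml"))
```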

Exiftool is also very good for checking the metadata to see whether MAT2 did a good job.

You mean check the metadata with Exiftool to see if it was removed properly?

Yes, use MAT2 and then check with ExifTool that the metadata was successfully removed. I’ve encountered some warnings while using MAT2 on files converted from doc to docx.

Just saw this demo using GPT-3 to convert English to regex; it could be useful: https://losslesshq.com/
