We will also need a way or tool to identify the language of the text. The result can be used when building the regexes and/or passed to the third-party libraries that identify dimensions such as “quantity”, “distance”, “duration”, “phone numbers”, “temperature”, “time” and “volume”.
There are a couple of natural language libraries that could probably do this. But the quick and dirty initial solution is to limit multi-language support to whatever language is set in the document (falling back to English if it is not specified or the input format doesn’t support it).
Related: we’d want to flag the document language so it can be set again in the output, i.e. one of the things we’d want to recreate in odt/msooxml output is the same language settings, even down to the paragraph level if they vary. For text phase output it would be good to have a language tag or tags for the doc as a whole, as this might affect what happens to that output afterwards.
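A minimal sketch of the “document language with English fallback” rule, including the per-paragraph case. The function names (`resolve_language`, `effective_language`) and the tag normalisation are assumptions made for illustration only; how the language value is actually read from an odt/docx file is not covered here.

```python
from typing import Optional

DEFAULT_LANGUAGE = "en"  # fallback when the document declares nothing


def resolve_language(declared: Optional[str]) -> str:
    """Return a normalised language tag, or English if nothing is declared."""
    if declared is None or not declared.strip():
        return DEFAULT_LANGUAGE
    # Accept both "en_GB" and "en-gb" style values.
    parts = declared.strip().replace("_", "-").split("-")
    primary = parts[0].lower()
    # Simplification: upper-case two-letter region codes, leave other subtags alone.
    rest = [p.upper() if len(p) == 2 else p for p in parts[1:] if p]
    return "-".join([primary] + rest)


def effective_language(paragraph_language: Optional[str],
                       document_language: Optional[str]) -> str:
    """A paragraph-level language wins over the document-level one."""
    if paragraph_language and paragraph_language.strip():
        return resolve_language(paragraph_language)
    return resolve_language(document_language)
```

The value returned here is what would be handed to the regex builder or the dimension libraries, and stored so it can be written back into the output.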
From what I’ve read and checked, Sift is really fast. It gives us the line numbers and the lines matching a pattern across multiple files/directories, which would make it easy to tag the searched element in multiple files. We should also maybe run a timing comparison against ripgrep.
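For the timing comparison, something like the following could work. It is only a sketch: `time_command` and `compare` are hypothetical helper names, and it measures wall-clock time of the whole process (best of a few runs), which is a rough measure and not a proper benchmark.

```python
import shutil
import subprocess
import time
from typing import Dict, List, Optional


def time_command(command: List[str], repeats: int = 3) -> Optional[float]:
    """Best wall-clock time in seconds, or None if the tool is not installed."""
    if shutil.which(command[0]) is None:
        return None
    best: Optional[float] = None
    for _ in range(repeats):
        start = time.perf_counter()
        # check=False: grep-like tools exit non-zero when nothing matches.
        subprocess.run(command, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL, check=False)
        elapsed = time.perf_counter() - start
        best = elapsed if best is None else min(best, elapsed)
    return best


def compare(commands: Dict[str, List[str]],
            repeats: int = 3) -> Dict[str, Optional[float]]:
    """Time each named command and return {name: seconds or None}."""
    return {name: time_command(cmd, repeats) for name, cmd in commands.items()}
```

It could be called as `compare({"sift": ["sift", "-n", pattern, path], "ripgrep": ["rg", "-n", pattern, path]})`, where `pattern` and `path` are placeholders. As far as I know `-n` turns on line numbers in both tools, but check each tool’s `--help` before relying on it.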
It is a very good converter: it can convert many formats to HTML (except PDF files, where we can use pdftohtml, and DOC and RTF), which then lets us run search and displace over the files.
It loses the font but keeps the format, even when we rebuild the file in its original format after running search and displace on it.
It preserves the structural elements of a document, but not formatting details such as margin size.
LibreOffice:
Keeps the fonts while converting doc/docx/dot etc. files to HTML.
Supports all the MS Office formats and many others.
It also supports a lot of formats, but from what I’ve checked it does not really remove metadata from documents such as doc and docx; it just reads it.
It’s great to use as it is, but extending the library from “user input” won’t be easy: basically, writing new rules will require a developer.
Yes, use MAT2 and then check with ExifTool whether the metadata was successfully removed. I’ve encountered some warnings while using MAT2 on files converted from doc to docx.
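The ExifTool check could be scripted roughly like this. It is a sketch under assumptions: the file has already been cleaned by MAT2 in a separate step, `exiftool` is on the PATH, and its output is requested with `-j -G` so each key looks like `Group:Tag`. Treating the `File` and `ExifTool` groups as “not document metadata” is my own assumption, and `leftover_tags` / `read_tags` are hypothetical names.

```python
import json
import subprocess
from typing import Dict, Iterable, List

# Assumption: these groups describe the file on disk and ExifTool itself,
# not metadata embedded in the document, so the check ignores them.
IGNORED_GROUPS = ("File", "ExifTool")


def leftover_tags(record: Dict[str, object],
                  ignored_groups: Iterable[str] = IGNORED_GROUPS) -> List[str]:
    """Tag names in one `exiftool -j -G` record outside the ignored groups."""
    prefixes = tuple(f"{group}:" for group in ignored_groups)
    return sorted(
        key for key in record
        if key != "SourceFile" and not key.startswith(prefixes)
    )


def read_tags(path: str) -> Dict[str, object]:
    """Run ExifTool on one file and return its parsed JSON record."""
    output = subprocess.run(
        ["exiftool", "-j", "-G", path],
        capture_output=True, text=True, check=True,
    ).stdout
    return json.loads(output)[0]  # -j prints a JSON array, one object per file
```

An empty list from `leftover_tags(read_tags(cleaned_path))` would suggest the clean-up worked; anything left over still needs a human look, since some remaining tags can be structural and not personal data.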