Applying S&D operations on the original document format

ionut.orzu · 8 September 2021 14:12

This process requires several round trips between the Interface and Ingest apps.
Ingest converts every document format to ‘Docx’ so that it can use the PhpWord library to extract the structure and style data along with the text contents. Ingest sends this data to the Interface app, which applies the S&D operations to the text contents. The Interface app then sends the result back to Ingest together with the structure and style data, so that Ingest can recreate the ‘Docx’ document and convert it back to the original document’s format (e.g. PDF, ODT, RTF).
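
The round trip described above can be illustrated with a minimal Python sketch. It is not the actual Ingest or Interface code: the `Segment` shape, the `apply_operations` helper and the sample replacement are all hypothetical, and a plain string replacement merely stands in for a real S&D operation. The point it shows is that only the text is transformed, while the style data travels alongside it unchanged so the document can be rebuilt afterwards.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Segment:
    """One run of text plus the style data needed to rebuild it (hypothetical shape)."""
    text: str
    style: Dict[str, object] = field(default_factory=dict)


def apply_operations(
    segments: List[Segment],
    operations: List[Callable[[str], str]],
) -> List[Segment]:
    """Interface side: transform the text of each segment, copy the style as-is."""
    result = []
    for seg in segments:
        text = seg.text
        for op in operations:
            text = op(text)
        result.append(Segment(text=text, style=dict(seg.style)))
    return result


# Payload as Ingest might send it after parsing the 'Docx' (illustrative data).
extracted = [
    Segment("Quarterly Report", {"bold": True, "size": 16}),
    Segment("Contact: jane@example.com", {"italic": True}),
]


def remove_email(text: str) -> str:
    """Placeholder operation; the real S&D rules live in the Interface app."""
    return text.replace("jane@example.com", "[removed]")


# Result as the Interface app might send it back to Ingest.
processed = apply_operations(extracted, [remove_email])
for seg in processed:
    print(seg.style, seg.text)
```

One consequence of working per segment, at least in this sketch, is that a match spanning two differently styled segments would not be found, so the granularity of the extracted text matters.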

A limitation of this strategy is that PhpWord does not currently support images. To work around this, the images and their positions in the document’s structure can be extracted with another tool.

To convert a document to Markdown, Ingest first converts it to PDF and then runs the ‘pdftohtml’ command, which produces an XML file from which Ingest extracts the text contents. This tool can also extract images from the PDF document.
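
As a rough illustration of reading that XML, the Python sketch below parses a small hand-written sample modeled on the output of `pdftohtml -xml`. The sample data, the `extract_items` helper and the assumption that every `text` and `image` element carries integer `top` and `left` attributes are illustrative only; real output should be checked before relying on this.

```python
import xml.etree.ElementTree as ET

# Hand-written sample modeled on `pdftohtml -xml` output (not real output).
SAMPLE_XML = """<pdf2xml>
  <page number="1" top="0" left="0" height="1188" width="918">
    <text top="100" left="90" width="300" height="20" font="0"><b>Quarterly</b> Report</text>
    <image top="140" left="90" width="200" height="120" src="report-1_1.png"/>
    <text top="280" left="90" width="400" height="16" font="1">Figures are in EUR.</text>
  </page>
</pdf2xml>"""


def extract_items(xml_string):
    """Return text lines and images, ordered by page and then by position."""
    root = ET.fromstring(xml_string)
    items = []
    for page in root.iter("page"):
        page_no = int(page.get("number"))
        for el in page:
            if el.tag == "text":
                # itertext() also collects text inside inline tags such as <b>.
                value = "".join(el.itertext())
            elif el.tag == "image":
                value = el.get("src")
            else:
                continue  # skip other elements, e.g. font definitions
            items.append({
                "kind": el.tag,
                "page": page_no,
                "top": int(el.get("top")),
                "left": int(el.get("left")),
                "value": value,
            })
    items.sort(key=lambda item: (item["page"], item["top"], item["left"]))
    return items


items = extract_items(SAMPLE_XML)
for item in items:
    print(item["page"], item["top"], item["kind"], item["value"])
```

Because every item keeps its page and coordinates, an image's place relative to the surrounding text can be read off the sorted list.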

So, to resolve the image limitation described above, we can use the ‘pdftohtml’ command and combine both strategies to obtain the complete document data: structure, styles and text from PhpWord, and images with their positions from ‘pdftohtml’.