Handling of Outlook MSG files and RTF bodies #3893

pudo · 2021-04-22T09:15:13Z

You have to give Microsoft credit for its consistency: instead of storing email messages in Outlook as RFC 822 plain text, they came up with their own super funky file format based on OLE. We often see these in leaks (for example: the entire Panama Papers).

In Python, the most popular parser for MSG files is msg-extractor, but it's maintained by a developer who seems to prioritise implementing the spec over building a tool that can parse all the files found in the wild. I tried to fix up the encoding support in the library at some point, but the PR was rejected on the basis that I should ask the source of the files to fix their Exchange settings. This did not seem like a healthy option, vis-à-vis the Russian mafia.

So I started to maintain a fork and eventually ended up cleaning it up quite significantly. Still, two issues are pretty persistent:

a) Encodings in this file format are a mess and many files seem to be outright damaged. msglite 0.30 does much more work on this, but I'm still pretty sure we'll see issues in the future.

b) Outlook re-formats the body of many emails into RTF (Rich Text Format) upon receipt. In the best case, this means that each .msg file contains an HTML, a plain-text and an RTF version. But sometimes the plain text is encoding-fucked, while the RTF is not. Now, msglite does provide a msg.rtfBody property with that version, but processing it further in Python is a bit difficult.

One option, which is what we do currently, is to save the RTF as its own file and essentially declare it an attachment to the message. The attachment is then processed using convert-document and turned into a PDF. This is at best annoying (because people now need to understand that the attachment is actually the body of the email), and also duplicative if the main message body was extracted correctly.

The other option would be to use striprtf to turn the RTF body into a plain text body. Unfortunately, the lib currently provides an extremely naive implementation of RTF parsing that does not handle encodings other than Unicode (cf. joshy/striprtf#11, but the issue is larger than described there). We might want to consider PRing proper encoding support into striprtf and then adopting it.
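To illustrate the encoding problem: RTF writes non-ASCII bytes as `\'xx` hex escapes, and those bytes are meant to be read in the code page declared by `\ansicpgN` in the header. The following is a minimal sketch of just that step, using only the standard library. It is not what striprtf does and it is nowhere near a full RTF parser: it ignores `\uN` escapes, per-font charsets (`\fcharset`) and escaped backslashes, and it leaves all control words in place. The function name `decode_hex_escapes` and the `cp1252` fallback are assumptions made for this example.

```python
import codecs
import re

# One or more consecutive \'xx escapes. Decoding them as a single run keeps
# multi-byte code pages (for example cp932) intact.
HEX_RUN = re.compile(r"(?:\\'[0-9a-fA-F]{2})+")
HEX_BYTE = re.compile(r"\\'([0-9a-fA-F]{2})")
ANSI_CODEPAGE = re.compile(r"\\ansicpg(\d+)")


def decode_hex_escapes(rtf: str, fallback: str = "cp1252") -> str:
    r"""Decode \'xx escapes using the code page declared via \ansicpgN.

    Hypothetical helper for illustration: control words are left untouched.
    """
    declared = ANSI_CODEPAGE.search(rtf)
    encoding = f"cp{declared.group(1)}" if declared else fallback
    try:
        codecs.lookup(encoding)
    except LookupError:
        # Python has no codec for this code page, so use the fallback (assumption).
        encoding = fallback

    def decode_run(run: re.Match) -> str:
        raw = bytes(int(h, 16) for h in HEX_BYTE.findall(run.group(0)))
        # Damaged files are common, so replace undecodable bytes instead of raising.
        return raw.decode(encoding, errors="replace")

    return HEX_RUN.sub(decode_run, rtf)


sample = r"{\rtf1\ansi\ansicpg1251 \'cf\'f0\'e8\'e2\'e5\'f2}"
print(decode_hex_escapes(sample))
```

A converter that decodes these bytes without looking at the declared code page will turn Cyrillic (or any other non-Western) text into mojibake, which is the class of problem described above.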


vsessink · 2024-05-03T11:44:05Z

Same goes for PST files (mailbox archives). See #3897 for a workaround: after unpacking the PST archive, I go over all messages to check whether the first MIME part is an application/rtf file, and if so, I convert the RTF part to HTML and replace the content. It's kind of a hack. I didn't even bother to find a Python RTF-to-HTML library; I'm calling an external utility, which is rather expensive computationally. Maybe I should also check if filename=="rtf-body.rtf", but I just wanted to ingest 60 GB of data and I don't need a perfect script ;-)
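A rough sketch of that check, using only the standard library `email` package. This is not the script from #3897: the names `first_leaf` and `replace_rtf_body` are made up for this example, and `rtf_to_html` is a hypothetical callable standing in for whatever external converter is used. The optional `required_filename` argument covers the extra filename check mentioned above.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage
from typing import Callable, Optional


def first_leaf(msg: EmailMessage) -> Optional[EmailMessage]:
    """Return the first non-multipart part of the message, if any."""
    for part in msg.walk():
        if not part.is_multipart():
            return part
    return None


def replace_rtf_body(
    raw: bytes,
    rtf_to_html: Callable[[bytes], str],  # hypothetical converter (assumption)
    required_filename: Optional[str] = None,
) -> EmailMessage:
    """Swap a leading application/rtf part for its HTML rendering."""
    msg = message_from_bytes(raw, policy=policy.default)
    part = first_leaf(msg)
    if part is None or part.get_content_type() != "application/rtf":
        return msg  # nothing to do, leave the message as parsed
    if required_filename is not None and part.get_filename() != required_filename:
        return msg
    rtf = part.get_payload(decode=True)  # raw RTF bytes after transfer decoding
    # set_content() clears the old Content-* headers and payload of this part.
    part.set_content(rtf_to_html(rtf), subtype="html")
    return msg
```

Passing the converter in as a callable keeps the expensive external call in one place, so it can later be swapped for a pure-Python implementation without touching the MIME handling.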

pudo assigned Rosencrantz Apr 22, 2021

Rosencrantz removed their assignment Nov 2, 2022

stchris transferred this issue from alephdata/ingest-file Oct 21, 2024

stchris added the ingest-file label Oct 21, 2024
