You have to give Microsoft credit for its consistency: instead of storing E-Mail messages in Outlook as RFC822 plain text, they came up with their own super funky file format based on OLE. We often see these in leaks (for example: the entire Panama Papers).
In Python, the most popular parser for MSG files is msg-extractor, but it's maintained by a developer who seems to prioritise implementing the spec over building a tool that can parse all the files found in the wild. I tried to fix up the encoding support in the library at some point, but the PR was rejected on the basis that I should ask the source of the files to fix their Exchange settings. This did not seem like a healthy option, vis-à-vis the Russian mafia.
So I started to maintain a fork and eventually ended up cleaning it up quite significantly. Still, two issues are pretty persistent:
a) Encodings in this file format are a mess and many files seem to be outright damaged. msglite 0.30 does much more work on this, but I'm still pretty sure we'll see issues in the future.
b) Outlook re-formats the body of many emails into RTF (Rich Text Format) upon receipt. In the best case, this means that each .msg file contains an HTML, a plain text and an RTF version of the body. But sometimes the plain text is encoding-fucked, while the RTF is not. Now, msglite does provide a msg.rtfBody property with the RTF version, but processing it further in Python is a bit difficult.
One option, which is what we do currently, is to save the RTF as its own file and essentially declare it an attachment to the message. The attachment is then processed using convert-document and turned into a PDF. This is at best annoying (because people now need to understand that the attachment is really the body of the email), and it is also duplicative if the main message body was extracted correctly.
The other option would be to use striprtf to turn the RTF body into a plain text body. Unfortunately, the lib currently provides an extremely naive implementation of RTF that does not handle encodings other than Unicode (cf. joshy/striprtf#11 - but the issue is larger than described there). We might want to consider contributing proper encoding support to striprtf via a PR and then adopting it.
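To make "proper encoding support" a bit more concrete, here is a minimal sketch of the one step that matters most for non-Unicode RTF: decoding `\'xx` hex escapes with the code page declared by the `\ansicpgN` control word. This is not how striprtf or msglite work; the function name `decode_hex_escapes` and the cp1252 default are assumptions made for illustration. It also ignores per-font charsets (`\fcharset`), `\uN` escapes and double-byte code pages whose trail byte is written as a literal character, all of which a real fix would need to handle.

```python
import codecs
import re

# \ansicpgN declares the ANSI code page used for \'xx escapes in the document.
_ANSICPG = re.compile(rb"\\ansicpg(\d+)")
# One or more consecutive \'xx escapes; grouping them keeps multi-byte
# sequences together when every byte is escaped.
_HEX_RUN = re.compile(rb"(?:\\'[0-9a-fA-F]{2})+")


def decode_hex_escapes(rtf: bytes, default_codepage: int = 1252) -> str:
    """Replace \\'xx escapes with the characters they encode.

    Control words and groups are left untouched: this illustrates only the
    encoding step, it is not an RTF parser.
    """
    match = _ANSICPG.search(rtf)
    codepage = int(match.group(1)) if match else default_codepage
    encoding = f"cp{codepage}"
    try:
        codecs.lookup(encoding)
    except LookupError:
        # Unknown code page: fall back to the (assumed) default.
        encoding = f"cp{default_codepage}"

    def replace(run: "re.Match[bytes]") -> bytes:
        hex_digits = run.group(0).replace(b"\\'", b"").decode("ascii")
        raw = bytes.fromhex(hex_digits)
        # Re-encode as UTF-8 so the whole buffer can be decoded in one go.
        return raw.decode(encoding, errors="replace").encode("utf-8")

    return _HEX_RUN.sub(replace, rtf).decode("utf-8", errors="replace")
```

For example, `decode_hex_escapes(rb"{\rtf1\ansi\ansicpg1251 \'cf\'f0\'e8\'e2\'e5\'f2}")` yields a string containing "Привет", where treating the escaped bytes as Latin-1 would produce mojibake.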
The same goes for PST files (mailbox archives). See #3897 for a workaround: after unpacking the PST archive, I go over all messages to check whether the first MIME part is an application/rtf file, and if so, I convert the RTF part to HTML and replace the content. It's kind of a hack. I didn't even bother to find a Python RTF-to-HTML library; I'm calling an external utility, which is rather expensive computationally. Maybe I should also check if filename=="rtf-body.rtf", but I just wanted to ingest 60 GB of data and I don't need a perfect script ;-)
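For anyone who wants to reproduce the workaround, the check-and-replace step could look roughly like the sketch below, using only the standard library `email` package. It is not the script from #3897: the function name `replace_leading_rtf` is made up, and the converter is passed in as a hypothetical `rtf_to_html` callable, since the external utility is not named here.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage
from typing import Callable


def replace_leading_rtf(
    msg: EmailMessage, rtf_to_html: Callable[[bytes], str]
) -> bool:
    """If the first leaf MIME part is application/rtf, swap it for HTML.

    `rtf_to_html` is a placeholder for whatever converter is available
    (for instance a wrapper around an external utility). Returns True if
    the message was modified.
    """
    for part in msg.walk():
        if part.is_multipart():
            continue  # skip containers, we only care about the first leaf
        if part.get_content_type() != "application/rtf":
            return False
        rtf = part.get_payload(decode=True)
        # set_content() clears the old Content-* headers and payload first.
        part.set_content(rtf_to_html(rtf), subtype="html")
        return True
    return False


def load_message(raw: bytes) -> EmailMessage:
    # policy.default gives EmailMessage objects with set_content() support.
    return message_from_bytes(raw, policy=policy.default)
```

Checking the part's filename as well (via `part.get_filename()`) would be a one-line addition before the conversion, if the stricter match turns out to be needed.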