Handling of Outlook MSG files and RTF bodies #3893

Open
pudo opened this issue Apr 22, 2021 · 1 comment

Comments


pudo commented Apr 22, 2021

You have to give Microsoft credit for its consistency: instead of storing email messages in Outlook as RFC 822 plain text, they came up with their own super funky file format based on OLE. We often see these in leaks (for example, the entire Panama Papers).

In Python, the most popular parser for MSG files is msg-extractor, but it's maintained by a developer who seems to prioritise implementing the spec over building a tool that can parse all the files found in the wild. I tried to fix up the encoding support in the library at some point, but the PR was rejected on the basis that I should ask the source of the files to fix their Exchange settings. This did not seem like a healthy option vis-à-vis the Russian mafia.

So I started to maintain a fork and eventually ended up cleaning it up quite significantly. Still, two issues are pretty persistent:

a) Encodings in this file format are a mess and many files seem to be outright damaged. msglite 0.30 does much more work on this, but I'm still pretty sure we'll see issues in the future.
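
To illustrate the kind of lenient decoding (a) calls for, here is a minimal sketch. It is not how msglite does it; `decode_lenient` and the `FALLBACK_ENCODINGS` order are hypothetical, and a real fallback list would depend on the files being ingested.

```python
from typing import Iterable, Optional

# Hypothetical fallback order; a real list would depend on the corpus.
FALLBACK_ENCODINGS = ("utf-8", "cp1252", "latin-1")


def decode_lenient(data: bytes, declared: Optional[str] = None,
                   fallbacks: Iterable[str] = FALLBACK_ENCODINGS) -> str:
    """Decode bytes, trying the declared encoding first, then the fallbacks."""
    candidates = ([declared] if declared else []) + list(fallbacks)
    for name in candidates:
        try:
            return data.decode(name)
        except (LookupError, UnicodeDecodeError):
            # LookupError: unknown codec name; UnicodeDecodeError: bad bytes.
            continue
    # Only reached if every candidate failed: keep what can be decoded.
    return data.decode("utf-8", errors="replace")
```

The declared encoding is only treated as a hint, so a file that claims UTF-8 but actually contains Windows-1252 bytes still produces readable text instead of an exception.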

b) Outlook re-formats the body of many emails into RTF (Rich Text Format) upon receipt. In the best case, this means that each .msg file contains HTML, plain text and RTF versions of the body. But sometimes the plain text has a mangled encoding while the RTF does not. Now, msglite does provide a msg.rtfBody property with that version, but processing it further in Python is a bit difficult.
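
To make the fallback in (b) concrete, here is a rough sketch of choosing between the two versions once the RTF has somehow been turned into text. Everything in it is an assumption: `looks_damaged`, `pick_body` and the 5% threshold are hypothetical names and values, and counting `?` will misfire on short bodies that are genuine questions.

```python
from typing import Optional

REPLACEMENT = "\ufffd"  # U+FFFD, emitted when bytes could not be decoded


def looks_damaged(text: str, threshold: float = 0.05) -> bool:
    """Guess whether text was mis-decoded, based on suspicious characters."""
    stripped = text.strip()
    if not stripped:
        return True
    suspicious = sum(1 for ch in stripped if ch in (REPLACEMENT, "?"))
    return suspicious / len(stripped) > threshold


def pick_body(plain: Optional[str], rtf_text: Optional[str]) -> Optional[str]:
    """Prefer the plain text body unless it looks damaged and RTF text exists."""
    if plain is not None and not looks_damaged(plain):
        return plain
    if rtf_text:
        return rtf_text
    return plain
```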

One option, which is what we do currently, is to save the RTF as its own file and essentially declare it an attachment to the message. The attachment is then processed using convert-document and turned into a PDF. This is annoying at best (because people now need to understand that the attachment is actually part of the email itself), and also duplicative if the main message body was extracted correctly.

The other option would be to use striprtf to turn the RTF body into a plain text body. Unfortunately, the lib currently provides an extremely naive implementation of RTF parsing that does not handle encodings other than Unicode (cf. joshy/striprtf#11 - but the issue is larger than described there). We might want to consider PRing proper encoding support into striprtf and then adopting it.
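
For context on what "proper encoding support" involves: RTF declares a code page with `\ansicpgN` and writes non-ASCII bytes as `\'hh` hex escapes, alongside `\uN` Unicode escapes that are followed by a fallback character. The sketch below only decodes those escapes and leaves every other control word in place, so it is not a text extractor and not striprtf's code. `decode_rtf_escapes` is a hypothetical name, and it assumes the default of one fallback character per `\uN` (it ignores `\ucN`, per-font charsets and surrogate pairs).

```python
import re

ANSICPG = re.compile(r"\\ansicpg(\d+)")
# Either a \uN escape plus its optional fallback (group 1 = N),
# or a run of consecutive \'hh byte escapes (group 2).
TOKEN = re.compile(
    r"\\u(-?\d+) ?(?:\\'[0-9a-fA-F]{2}|[^\\{}])?"
    r"|((?:\\'[0-9a-fA-F]{2})+)"
)


def decode_rtf_escapes(rtf: str, default_codepage: int = 1252) -> str:
    """Decode \\'hh and \\uN escapes using the document's \\ansicpg code page."""
    match = ANSICPG.search(rtf)
    codec = f"cp{match.group(1) if match else default_codepage}"

    def repl(m: "re.Match[str]") -> str:
        if m.group(1) is not None:
            code = int(m.group(1))
            # RTF stores \uN as a signed 16-bit value.
            return chr(code + 65536 if code < 0 else code)
        # Decode the whole byte run at once so multi-byte code pages work.
        raw = bytes.fromhex(m.group(2).replace("\\'", ""))
        try:
            return raw.decode(codec)
        except (LookupError, UnicodeDecodeError):
            return raw.decode("cp1252", errors="replace")

    return TOKEN.sub(repl, rtf)
```

For example, `decode_rtf_escapes(r"{\rtf1\ansi\ansicpg1251 \'cf\'f0\'e8\'e2\'e5\'f2}")` turns the escaped bytes into "Привет" because they are interpreted as Windows-1251 rather than as Latin-1.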

@Rosencrantz Rosencrantz removed their assignment Nov 2, 2022

vsessink commented May 3, 2024

Same goes for PST files (mailbox archives). See #3897 for a workaround: after unpacking the PST archive, I go over all messages to see whether the first MIME part is an application/rtf file and, if so, I convert the RTF part to HTML and replace the content. It's kind of a hack. I didn't even bother to find a Python RTF-to-HTML library; I'm calling an external utility, which is rather expensive computationally. Maybe I should also check if filename=="rtf-body.rtf", but I just wanted to ingest 60 GB of data and I don't need a perfect script ;-)
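
A rough sketch of the check described above, using only the standard library `email` package. This is not the script from #3897: `replace_rtf_body` is a hypothetical name, the converter is passed in as a callable because the external utility is not named here, and the message is assumed to be an `EmailMessage` (for instance parsed with `policy=email.policy.default`).

```python
from email.message import EmailMessage
from typing import Callable, Optional


def replace_rtf_body(msg: EmailMessage,
                     rtf_to_html: Callable[[bytes], str],
                     filename: Optional[str] = None) -> bool:
    """Replace a leading application/rtf part with HTML; return True if done."""
    # The first non-multipart part in document order is treated as the body.
    first = next((p for p in msg.walk() if not p.is_multipart()), None)
    if first is None or first.get_content_type() != "application/rtf":
        return False
    # Optional stricter check, e.g. filename="rtf-body.rtf".
    if filename is not None and first.get_filename() != filename:
        return False
    raw = first.get_payload(decode=True) or b""
    # set_content() clears the old Content-* headers and payload first.
    first.set_content(rtf_to_html(raw), subtype="html")
    return True
```

Passing the converter in keeps the expensive external call in one place, so it could later be swapped for a pure Python conversion without touching the MIME handling.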

@stchris stchris transferred this issue from alephdata/ingest-file Oct 21, 2024
Status: 📋 Backlog