
Using msg extractor In Your Own Code

Destiny Peterson edited this page Jan 2, 2023 · 10 revisions

Basic Usage

Before anything else, you need to import the module.

import extract_msg

All code examples on this page assume that you have imported the module this way.

Opening and Closing an MSG File

It is highly recommended that you do not open an MSG file directly using a class, but rather use openMsg to do this. This function will automatically determine if the MSG file has support in any form, and if it does, which class to use for reading it.

msg = extract_msg.openMsg('path/to/msg/file.msg')

MSG files can be closed in the same way that a normal file can, simply using the close method of the class.

msg.close()

MSGFile, the base class for all MSG file types, supports the __enter__ and __exit__ magic methods, allowing you to use it in a with statement. At the end of the with block, the file will automatically be closed.

with extract_msg.openMsg('path/to/msg/file.msg') as msg:
    # Do some stuff
    pass

openMsg takes a filename for its first argument. It can also take the raw bytes that would make up an MSG file or a file-like object. A file-like object requires, at minimum, read, seek, tell, and close methods. The read method MUST return bytes and MUST return at most the number of bytes requested.
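
As a rough illustration of the file-like requirements above, the sketch below checks an object for the four methods and shows that the standard library's io.BytesIO satisfies them. The helper name looks_file_like is hypothetical and not part of extract_msg, and the bytes used are placeholder data rather than a valid MSG file.

```python
import io

# The four methods the text above lists as the minimum for a file-like object.
REQUIRED_METHODS = ("read", "seek", "tell", "close")


def looks_file_like(obj) -> bool:
    """Hypothetical helper: True if obj has every required method."""
    return all(callable(getattr(obj, name, None)) for name in REQUIRED_METHODS)


# Placeholder bytes only; a real caller would supply the contents of an MSG file.
stream = io.BytesIO(b"placeholder bytes, not a real MSG file")

assert looks_file_like(stream)
chunk = stream.read(8)
# read returns bytes and never more than the number of bytes requested.
assert isinstance(chunk, bytes) and len(chunk) <= 8
stream.seek(0)  # rewind before handing the stream to anything else
```

With real MSG data in the stream, it could then be passed to openMsg in place of a path, for example extract_msg.openMsg(stream).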

Saving

While most of the classes support saving, not all of them do. Calling the save method on a class that does not implement it raises a NotImplementedError. The save method requires no arguments, but you will likely want to use some of the optional ones. The method also returns a reference to the current instance, allowing you to chain certain methods directly. However, given the possibility of errors, chaining opening, saving, and closing all together is not recommended, as this could leave a file handle open when you expect it to be closed.

msg.save()
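
One way to avoid the chaining problem described above is to let a with block handle the closing. The following is a minimal sketch, not part of the module: save_and_close is a hypothetical helper, and its opener parameter exists only so that the function can be exercised without a real MSG file.

```python
def save_and_close(path, opener=None, **save_kwargs):
    """Hypothetical helper: open, save, and always close, without chaining."""
    if opener is None:
        # Imported lazily so this sketch can be loaded without the package.
        import extract_msg
        opener = extract_msg.openMsg
    with opener(path) as msg:
        # The with block closes the file even if save raises
        # (for example NotImplementedError on a class without save).
        msg.save(**save_kwargs)
```

Calling save_and_close('path/to/msg/file.msg') would then be roughly equivalent to opening, saving, and closing by hand, with the close guaranteed.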

Additionally, MSGFile has the saveAttachments method. This will, by default, save all attachments to the current directory. See the section in Intermediate Usage for more information on customizing this method.

msg.saveAttachments()

Intermediate Usage

The previous section covered the absolute basics of using the module. This section goes into more depth, mostly by covering the advanced details of the functions and methods mentioned in the Basic Usage section.

Opening and Closing an MSG File

The openMsg function takes a number of keyword arguments that can allow you to customize the way it (and the MSGFile instance it produces) behaves. The first and most important keyword argument is strict. If this is set to False, the function will return an instance of MSGFile (not a subclass) in the event that it cannot find a more suitable class to open it with. This allows your code to have at least minimal support for a file that would otherwise not be supported, so long as it uses the MSG standard. This argument is set to True by default.

While strict customizes the behavior of openMsg directly, it also affects the behavior of attachments that are, themselves, embedded MSG files. While normally an unsupported embedded MSG file would be a source of NotImplementedErrors, strict will stop those. This is because all keyword arguments given to openMsg are given to the MSGFile when it is opened, and these keyword arguments (with the exception of those noted below) get used when opening an embedded MSG file.
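
To make the effect of strict concrete, here is a small sketch of a wrapper that always opens with strict=False. The name open_lenient and its opener parameter are hypothetical and exist only for illustration and testing; only the strict keyword itself comes from the description above.

```python
def open_lenient(source, opener=None, **kwargs):
    """Hypothetical helper: open an MSG source with strict disabled."""
    if opener is None:
        # Imported lazily so this sketch can be loaded without the package.
        import extract_msg
        opener = extract_msg.openMsg
    # Force strict off, so that an otherwise unsupported file comes back
    # as a plain MSGFile instead of causing an error.
    kwargs["strict"] = False
    return opener(source, **kwargs)
```

After opening, type(msg).__name__ shows which class was actually chosen, which is a quick way to tell whether the fallback to the base MSGFile was used.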

The rest of the keyword arguments are exclusively used for customizing the behavior of the class used to open the MSGFile, and their descriptions are as follows, with an * marking those that do not apply to embedded MSG files of the file you are opening.

Arguments for all MSGFiles:

  • prefix*: An advanced argument that is used internally for handling embedded MSG files. The value tells the code where to find the directory containing the data for the embedded MSG file. If you know exactly where in the main MSG file your desired file is and want to reduce the number of MSGFile instances produced, then this argument is for you. Otherwise, this should probably be left alone.
  • parentMsg*: Another advanced argument that is used internally for handling embedded MSG files. It's used for ensuring that only one OleFileIO instance is created when opening an MSGFile (except in the case of signed messages, where embedded MSG files are not stored in the same way) and for syncing the named properties. Embedded MSG files have the details of the named properties stored in the top level MSG file, which reduces the data for shared streams. This also means that we can get away with parsing that data once and sharing the result. It is not recommended to ever set this yourself outside of the internal code.
  • attachmentClass: The class the MSGFile will use for attachments, should they be supported by it. If not set, the default Attachment class will be used. If you need to change the behavior of the Attachment class in any way for your file, this is the way to do it.
  • delayAttachments: Delays the initialization of attachments until the user attempts to retrieve them. Setting this to True is one of the ways MSG files with bad/unsupported attachments can be loaded.
  • filename: The filename to be used by default when saving. This is related to specific save arguments. If the argument used to open the MSG file is an actual path, this will default to that path.
  • attachmentErrorBehavior: The behavior to use in the event of an error when parsing the attachments. Should be an int or an instance of extract_msg.enums.AttachErrorBehavior. This is the other method of opening an MSG file with bad/unsupported attachments.
  • overrideEncoding: An encoding to use that overrides the value that is found in the MSG file. This is used if something is wrong with the value in the file that causes issues decoding the data, including no value being set. If you have manually set this and are getting encoding errors, do not report them.
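
As a sketch of how delayAttachments might be used in practice, the helper below tries a normal open first and retries with attachment loading delayed if that fails. The helper name and the opener parameter are hypothetical. The broad except clause is an assumption as well, since the list above does not say which exception a bad attachment raises.

```python
def open_with_delayed_attachments(source, opener=None, **kwargs):
    """Hypothetical helper: retry with delayAttachments=True on failure."""
    if opener is None:
        # Imported lazily so this sketch can be loaded without the package.
        import extract_msg
        opener = extract_msg.openMsg
    try:
        return opener(source, **kwargs)
    except Exception:
        # Assumption: the failure came from attachment initialization.
        # The exact exception type is not stated above, so this is broad.
        retry_kwargs = dict(kwargs, delayAttachments=True)
        return opener(source, **retry_kwargs)
```

If the retry also fails, the exception propagates as usual. Passing attachmentErrorBehavior is the other route mentioned in the list.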

Arguments for subclasses of MessageBase:

  • recipientSeparator: The separator string to use between recipients.
  • ignoreRtfDeErrors: If set to True, tells the code to silently ignore any errors from the RTFDE module. RTFDE is used for extracting encapsulated HTML from the RTF body should no HTML body be found, however it is not perfect. Currently there are several critical errors, and the last commit was in January 2022. The developer is working on fixing many of these errors with a significant rewrite, but until then this argument can be used to try to work around it. Alternatively, the next argument can be used to override how deencapsulation works.
  • deencapsulationFunc: A callable that will override the way that HTML/text is deencapsulated from the RTF body. This function must take exactly 2 arguments, the first being the RTF body from the message (a bytes instance) and the second being an instance of the enum extract_msg.enums.DeencapType which will tell the function what data has been requested. This function must return a string if the requested data is text, otherwise it must return bytes for the HTML. If an error occurs, the function must return either None or raise one of the appropriate exceptions from extract_msg.exceptions to signify what happened. If any other exceptions are thrown, they will not be caught by the class.
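
The contract for deencapsulationFunc can be sketched as a small factory that wraps two converters of your own. Everything named here is hypothetical: make_deencapsulation_func, to_html, and to_text are illustrative only, and the code assumes that the DeencapType member for HTML requests is named HTML, which the description above does not confirm.

```python
def make_deencapsulation_func(to_html, to_text):
    """Hypothetical factory for a callable suitable for deencapsulationFunc.

    to_html and to_text are user-supplied converters that take the RTF body
    (bytes) and return the extracted content, or None on failure.
    """

    def deencapsulate(rtf_body, deencap_type):
        # Assumption: the enum member for HTML requests is named "HTML".
        if deencap_type.name == "HTML":
            result = to_html(rtf_body)
            if isinstance(result, str):
                result = result.encode("utf-8")  # HTML must be returned as bytes
            return result
        result = to_text(rtf_body)
        if isinstance(result, bytes):
            # Text must be returned as a str.
            result = result.decode("utf-8", errors="replace")
        return result

    return deencapsulate
```

The resulting callable takes exactly two arguments and returns None when a converter reports failure, matching the rules above. It would be passed as openMsg(path, deencapsulationFunc=...).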

Arguments for subclasses of MessageSignedBase:

  • recipientSeparator: The separator string to use between recipients.
  • signedAttachmentClass: Like attachmentClass, except specifically used for handling signed attachments. Defaults to the SignedAttachment class.

Saving

Saving is, unfortunately, a rather complicated ordeal with a lot of ways to customize it. The save method, where implemented, accepts a number of keyword arguments, many of which are passed down, without modification, to its attachments and embedded MSG files. Additionally, the saveAttachments method passes its keyword arguments down to the attachments. Below is a series of lists containing the various arguments and where they are used. As before, arguments with a * after them are not preserved (they are either removed or modified) when passed to the next save function. The exception is saveAttachments, which does not directly modify or remove any arguments. One special case is the zip argument: if a path is provided, it is converted to an instance before being passed down.

Arguments for MSGFile.saveAttachments and MessageBase.save:

  • skipHidden: If set to True, skips any attachments marked as being hidden from being saved. This is mostly for attachments that get embedded into the text. Default is False.

Arguments for MessageBase.save and Attachment.save:

  • zip*: If set, indicates that the data should be saved to a zip file. Value must be a path, a zipfile.ZipFile instance, or an object that fulfils the following criteria:
    • Has an open method that accepts either a str or zipfile.ZipInfo for the path, and supports the modes "r" and "w" for reading and writing bytes, respectively.
    • Has a namelist method which takes no arguments and returns a list of strings representing the paths of all the files (and folders, if any empty folders exist) in the "zip file". This is to handle name collisions.
  • customFilename*: The name to use for the file (in the case of attachments) or the folder (in the case of MSG files) being saved.
  • customPath*: The directory to create the file or folder inside of. If using a zip file for saving, this will be the location inside of it to use.
  • maxNameLength: Forces all file names to be shortened to fit within this length (with the extension included in the length). If a number is added to the directory name, it is not included in the length, so it is recommended to plan for up to 5 extra characters in the name. Default is 256.
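
To illustrate the two criteria for a custom zip object, here is a minimal in-memory sketch with an open method and a namelist method. The class names InMemoryArchive and _ArchiveWriter are hypothetical, and the code only demonstrates the interface described above; whether extract_msg relies on any further behavior is not covered here.

```python
import io
import zipfile


class _ArchiveWriter(io.BytesIO):
    """Write handle that stores its bytes in the archive when closed."""

    def __init__(self, store, name):
        super().__init__()
        self._store = store
        self._name = name

    def close(self):
        if not self.closed:
            self._store[self._name] = self.getvalue()
        super().close()


class InMemoryArchive:
    """Hypothetical zip-like object holding its files in a dict."""

    def __init__(self):
        self._files = {}

    def namelist(self):
        # Paths of everything stored so far, used to detect name collisions.
        return list(self._files)

    def open(self, name, mode="r"):
        if isinstance(name, zipfile.ZipInfo):
            name = name.filename  # accept ZipInfo as well as str
        if mode == "r":
            return io.BytesIO(self._files[name])
        if mode == "w":
            return _ArchiveWriter(self._files, name)
        raise ValueError("mode must be 'r' or 'w'")
```

An instance of such a class could then be tried as the zip argument, for example msg.save(zip=InMemoryArchive()).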

Arguments for MessageBase.save:

  • json: Enables saving the body (and some other data) in a JSON format. Incompatible with other saving formats.
  • html: Enables saving the body in HTML format. Incompatible with other saving formats.
  • rtf: Enables saving the body in RTF format. Incompatible with other saving formats.
  • raw: Saves the MSG file as a zip file containing all of its streams and storages. Incompatible with other saving formats.
  • pdf: Enables saving the body as a PDF file. Requires (currently) an install of wkhtmltopdf and that the html argument would be valid. Incompatible with other saving formats.
  • allowFallback: Allows for the save method to pick the best method to save the data if it is unable to save it in the format requested.
  • preparedHtml: Prepares the HTML body a little bit before it is given to the save method. This requires the html argument to have an effect, and is automatically set to True if pdf is set to True.
  • useMsgFilename: Tells the save method to use the filename property of the MSG file for the name of the output folder.
  • attachmentsOnly: Skip saving the body in each MSG file, only saving the attachments.
  • skipBodyNotFound: If set to True, skips saving the body if no valid format can be found given the other arguments.
  • saveHeader: If set to True, the header is saved as a separate file.
  • charset: If preparedHtml is True (or will be based on other options), the charset to use for the Content-Type meta tag to insert. Set to None or an empty string to skip inserting the tag.
  • wkPath: If pdf is True, the path to use (instead of looking for it) for the wkhtmltopdf executable that will be used for converting the HTML body to PDF.
  • wkOptions: A list or list-like object containing str and bytes instances. These will be the additional options passed to the wkhtmltopdf executable.
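
Because the format arguments in the list above are mutually incompatible, it can help to assemble the keyword arguments in one place before calling save. The helper below is a hypothetical sketch, not part of the module. It only builds a dictionary and rejects combinations of more than one format.

```python
# The format arguments described above as incompatible with one another.
SAVE_FORMATS = ("json", "html", "rtf", "raw", "pdf")


def build_save_kwargs(fmt=None, **options):
    """Hypothetical helper: build keyword arguments for a save call."""
    kwargs = dict(options)
    if fmt is not None:
        if fmt not in SAVE_FORMATS:
            raise ValueError(f"unknown format: {fmt!r}")
        kwargs[fmt] = True
    # Collect every format that ended up enabled, from fmt or from options.
    chosen = [name for name in SAVE_FORMATS if kwargs.get(name)]
    if len(chosen) > 1:
        raise ValueError(f"formats cannot be combined: {chosen}")
    return kwargs
```

A call might then look like msg.save(**build_save_kwargs('html', allowFallback=True, customPath='out')), where the path 'out' is only a placeholder.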

Arguments for Attachment.save:

  • skipEmbedded: If set to True, embedded MSG file attachments will be skipped entirely from saving.
  • contentId: If set to True, uses the content ID as the name to save with, if available.
  • extractEmbedded: If set to True, embedded MSG files will be saved as a new MSG file, as if they were simply attached to the original as bytes.
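
Since saveAttachments passes its keyword arguments down to the attachments, the attachment-level options can be combined with skipHidden in a single call. This is a small sketch under that assumption. The helper name save_visible_attachments is hypothetical, and the msg parameter is expected to be an already opened MSG file.

```python
def save_visible_attachments(msg, out_dir):
    """Hypothetical helper: save attachments, skipping hidden and embedded ones."""
    return msg.saveAttachments(
        customPath=out_dir,  # directory to create the files in
        skipHidden=True,     # leave out attachments marked as hidden
        skipEmbedded=True,   # leave out embedded MSG file attachments
    )
```

Swapping skipEmbedded=True for extractEmbedded=True would, per the list above, write embedded MSG files out as new MSG files instead of skipping them.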