Using msg extractor In Your Own Code
Before anything else, you need to import the module.
import extract_msg
All of the examples assume that you have imported the module this way.
It is highly recommended that you do not open an MSG file directly using a class, but rather use openMsg to do this. This function automatically determines whether the MSG file is supported in any form and, if it is, which class to use for reading it.
msg = extract_msg.openMsg('path/to/msg/file.msg')
MSG files can be closed in the same way that a normal file can, simply by using the close method of the class.
msg.close()
MSGFile, the base class for all MSG file types, supports the __enter__ and __exit__ magic methods, allowing you to use instances in a with statement. At the end of the with block, the file is closed automatically.
with extract_msg.openMsg('path/to/msg/file.msg') as msg:
    # Do some stuff with msg; the file is closed when the block exits.
    pass
openMsg takes a filename for its first argument. It can also take the raw bytes that would make up an MSG file, or a file-like object. File-like objects require, at minimum, read, seek, tell, and close methods. The read method MUST return bytes and MUST return at most the number of bytes requested.
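As a rough illustration of the file-like requirements above, the sketch below wraps in-memory bytes in a minimal object that provides only read, seek, tell, and close. The names BoundedReader and open_from_bytes are hypothetical and not part of extract_msg. Since openMsg also accepts raw bytes directly, a wrapper like this is only worthwhile when the data comes from a custom source.

```python
import io


class BoundedReader:
    """Minimal file-like object exposing read, seek, tell, and close."""

    def __init__(self, data: bytes) -> None:
        self._buffer = io.BytesIO(data)

    def read(self, size: int = -1) -> bytes:
        # BytesIO.read returns bytes and never more than `size` bytes.
        return self._buffer.read(size)

    def seek(self, offset: int, whence: int = io.SEEK_SET) -> int:
        return self._buffer.seek(offset, whence)

    def tell(self) -> int:
        return self._buffer.tell()

    def close(self) -> None:
        self._buffer.close()


def open_from_bytes(data: bytes):
    """Open an MSG file held in memory (assumes extract_msg is installed)."""
    import extract_msg  # imported lazily so BoundedReader has no dependency

    return extract_msg.openMsg(BoundedReader(data))
```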
While most of the classes support saving, not all of them do. If you call the save method on a class that does not implement it, the method raises a NotImplementedError. The save method requires no arguments, but it is likely that you will want to use some of them. The method also returns a reference to the current instance, allowing you to chain certain methods directly. However, given the possibility of errors, chaining opening, saving, and closing all together is not recommended, as an exception partway through the chain can leave a file handle open when you expect it to be closed.
msg.save()
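To make the warning about chaining concrete, here is a minimal sketch that keeps opening, saving, and closing as separate steps so the file is closed even if save raises. The helper name save_safely is hypothetical; only openMsg, save, and close come from the text above.

```python
def save_safely(path: str) -> None:
    """Save an MSG file and always close it, even if saving fails."""
    import extract_msg  # third-party; assumed to be installed

    msg = extract_msg.openMsg(path)
    try:
        # save() may raise (for example NotImplementedError), so it is
        # not chained directly onto openMsg().
        msg.save()
    finally:
        msg.close()
```

Using the with statement described earlier gives the same guarantee in fewer lines.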
Additionally, MSGFile has the saveAttachments method. By default, this saves all attachments to the current directory. See the section in Intermediate Usage for more information on customizing this method.
msg.saveAttachments()
While the previous section covered the absolute basics of using the module, this section goes into slightly more depth, mostly covering the advanced details of the functions and methods mentioned in the Basic Usage section.
The openMsg function takes a number of keyword arguments that let you customize the way it (and the MSGFile instance it produces) behaves. The first and most important keyword argument is strict. If this is set to False, the function returns an instance of MSGFile (not a subclass) in the event that it cannot find a more suitable class to open the file with. This gives your code at least minimal support for a file that would otherwise not be supported, so long as it follows the MSG standard. This argument is set to True by default.
While strict customizes the behavior of openMsg directly, it also affects the behavior of attachments that are, themselves, embedded MSG files. Normally an unsupported embedded MSG file would be a source of NotImplementedErrors; setting strict to False stops those. This is because all keyword arguments given to openMsg are passed to the MSGFile when it is opened, and these keyword arguments (with the exception of those noted below) are reused when opening an embedded MSG file.
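A short sketch of the strict argument in use: the hypothetical helper below opens a file with strict=False and reports which class was chosen, which is one way to spot files that fell back to the plain MSGFile base class.

```python
def describe_msg_class(path: str) -> str:
    """Return the name of the class openMsg picked for the file at `path`."""
    import extract_msg  # third-party; assumed to be installed

    # With strict=False, an unrecognized type yields a plain MSGFile
    # instead of raising an exception.
    with extract_msg.openMsg(path, strict=False) as msg:
        return type(msg).__name__
```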
The rest of the keyword arguments are exclusively used for customizing the behavior of the class used to open the MSGFile, and their descriptions are as follows, with an * marking those that do not apply to embedded MSG files of the file you are opening.
Arguments for all MSGFiles:
- prefix*: An advanced argument that is used internally for handling embedded MSG files. The value tells the code where to find the directory containing the data for the embedded MSG file. If you know exactly where in the main MSG file your desired file is and want to reduce the number of MSGFile instances produced, then this argument is for you. Otherwise, this should probably be left alone.
- parentMsg*: Another advanced argument that is used internally for handling embedded MSG files. It is used to ensure that only one OleFileIO instance is created when opening an MSGFile (except in the case of signed messages, where embedded MSG files are not stored in the same way) and to sync the named properties. Embedded MSG files have the details of the named properties stored in the top level MSG file, which reduces the data for shared streams. This also means that we can get away with parsing that data once and sharing the result. It is not recommended to ever set this yourself outside of the internal code.
- attachmentClass: The class the MSGFile will use for attachments, should they be supported by it. If not set, the default Attachment class will be used. If you need to change the behavior of the Attachment class in any way for your file, this is the way to do it.
- delayAttachments: Delays the initialization of attachments until the user attempts to retrieve them. Setting this to True is one of the ways MSG files with bad/unsupported attachments can be loaded.
- filename: The filename to be used by default when saving. This is related to specific save arguments. If the argument used to open the MSG file is an actual path, this will default to that path.
- attachmentErrorBehavior: The behavior to use in the event of an error when parsing the attachments. Should be an int or an instance of extract_msg.enums.AttachErrorBehavior. This is the other method of opening an MSG file with bad/unsupported attachments.
- overrideEncoding: An encoding to use that overrides the value that is found in the MSG file. This is used if something is wrong with the value in the file that causes issues decoding the data, including no value being set. If you have manually set this and are getting encoding errors, do not report them.
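The following sketch shows how a couple of these keyword arguments might be combined when a file refuses to load normally. The helper name open_with_workarounds is hypothetical, and the encoding value is passed through unchanged, so which names are accepted is an assumption to check against your installed version.

```python
from typing import Optional


def open_with_workarounds(path: str, encoding: Optional[str] = None):
    """Open an MSG file while deferring attachment parsing."""
    import extract_msg  # third-party; assumed to be installed

    # Attachments are only initialized when first accessed.
    options = {"delayAttachments": True}
    if encoding is not None:
        # Only override the encoding when the caller asks for it.
        options["overrideEncoding"] = encoding
    return extract_msg.openMsg(path, **options)
```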
Arguments for subclasses of MessageBase:
- recipientSeparator: The separator string to use between recipients.
- ignoreRtfDeErrors: If set to True, tells the code to silently ignore any errors from the RTFDE module. RTFDE is used for extracting encapsulated HTML from the RTF body should no HTML body be found; however, it is not perfect. Currently there are several critical errors, and the last commit was in January 2022. The developer is working on fixing many of these errors with a significant rewrite, but until then this argument can be used to try to work around it. Alternatively, the next argument can be used to override how deencapsulation works.
- deencapsulationFunc: A callable that will override the way that HTML/text is deencapsulated from the RTF body. This function must take exactly 2 arguments, the first being the RTF body from the message (a bytes instance) and the second being an instance of the enum extract_msg.enums.DeencapType, which tells the function what data has been requested. This function must return a string if the requested data is text, otherwise it must return bytes for the HTML. If an error occurs, the function must either return None or raise one of the appropriate exceptions from extract_msg.exceptions to signify what happened. If any other exceptions are raised, they will not be caught by the class.
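The contract for deencapsulationFunc can be sketched as a small factory that builds a conforming two-argument callable from two converters. Everything here is illustrative: make_deencap_func, to_html, and to_text are hypothetical names, the converters are supplied by you, and the check on the member name "HTML" is an assumption about how DeencapType names its members.

```python
from typing import Callable, Optional, Union


def make_deencap_func(
    to_html: Callable[[bytes], bytes],
    to_text: Callable[[bytes], str],
) -> Callable[[bytes, object], Optional[Union[bytes, str]]]:
    """Build a two-argument callable suitable for deencapsulationFunc."""

    def deencapsulate(rtf_body: bytes, deencap_type) -> Optional[Union[bytes, str]]:
        # Assumption: the enum member for HTML requests is named "HTML".
        wants_html = getattr(deencap_type, "name", "") == "HTML"
        try:
            # HTML requests must yield bytes; text requests must yield str.
            return to_html(rtf_body) if wants_html else to_text(rtf_body)
        except ValueError:
            # Returning None signals that deencapsulation failed.
            return None

    return deencapsulate
```

The resulting callable would then be passed as the deencapsulationFunc keyword argument of openMsg.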
Arguments for subclasses of MessageSignedBase:
- recipientSeparator: The separator string to use between recipients.
- signedAttachmentClass: Like attachmentClass, except specifically used for handling signed attachments. Defaults to the SignedAttachment class.
Saving is, unfortunately, a rather complicated ordeal with a lot of ways to customize it. The save method, where implemented, accepts a number of keyword arguments, many of which are passed down, without modification, to its attachments and embedded MSG files. Additionally, the saveAttachments method passes its keyword arguments down to the attachments. Below is a series of lists containing the various arguments and where they are used. Like before, arguments with a * after them will not be preserved (they either get removed or modified) when being passed to the next save function (with the exception being when calling saveAttachments, which does not directly modify or remove any arguments). The only note is the zip argument: if a path is provided, it is converted to the opened zip file instance before being passed down.
Arguments for MSGFile.saveAttachments and MessageBase.save:
- skipHidden: If set to True, skips saving any attachments marked as hidden. This is mostly for attachments that get embedded into the text. Default is False.
Arguments for MessageBase.save and Attachment.save:
- zip*: If set, indicates that the data should be saved to a zip file. Value must be a path, a zipfile.ZipFile instance, or an object that fulfils the following criteria:
  - Has an open method that accepts either a str or zipfile.ZipInfo for the path, and supports the modes "r" and "w" for reading and writing bytes, respectively.
  - Has a namelist method which takes no arguments and returns a list of strings representing the paths of all the files (and folders, if any empty folders exist) in the "zip file". This is to handle name collisions.
- customFilename*: The name to use for the file (in the case of attachments) or the folder (in the case of MSG files) being saved.
- customPath*: The directory to create the file or folder inside of. If using a zip file for saving, this will be the location inside of it to use.
- maxNameLength: Forces all file names to be shortened to fit in the space (with the extension included in the length). If a number is added to the directory name, it will not be included in the length, so it is recommended to plan for up to 5 extra characters to be part of the name. Default is 256.
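The two criteria for a custom zip-like object can be illustrated with an in-memory stand-in. This is only a sketch built from the criteria listed above; MemoryArchive and its helper class are hypothetical names, and whether extract_msg calls any further methods on the object has not been verified here.

```python
import io
import zipfile
from typing import Dict, List, Union


class _Entry(io.BytesIO):
    """Writable buffer that stores its contents in the archive on close."""

    def __init__(self, store: Dict[str, bytes], name: str) -> None:
        super().__init__()
        self._store = store
        self._name = name

    def close(self) -> None:
        if not self.closed:
            # Capture the bytes before the buffer is released.
            self._store[self._name] = self.getvalue()
        super().close()


class MemoryArchive:
    """In-memory object offering open() and namelist() like a zip file."""

    def __init__(self) -> None:
        self._files: Dict[str, bytes] = {}

    def open(self, name: Union[str, zipfile.ZipInfo], mode: str = "r"):
        # Accept either a plain path or a ZipInfo describing the path.
        path = name.filename if isinstance(name, zipfile.ZipInfo) else name
        if mode == "r":
            return io.BytesIO(self._files[path])
        if mode == "w":
            return _Entry(self._files, path)
        raise ValueError(f"unsupported mode: {mode!r}")

    def namelist(self) -> List[str]:
        # Paths of everything written so far, used for collision checks.
        return list(self._files)
```

An instance of such a class would be passed as the zip argument in place of a path or a zipfile.ZipFile.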
Arguments for MessageBase.save:
- json: Enables saving the body (and some other data) in a JSON format. Incompatible with other saving formats.
- html: Enables saving the body in HTML format. Incompatible with other saving formats.
- rtf: Enables saving the body in RTF format. Incompatible with other saving formats.
- raw: Saves the MSG file as a zip file containing all of its streams and storages. Incompatible with other saving formats.
- pdf: Enables saving the body as a PDF file. Requires (currently) an install of wkhtmltopdf and that the html argument would be valid. Incompatible with other saving formats.
- allowFallback: Allows the save method to pick the best method to save the data if it is unable to save it in the format requested.
- preparedHtml: Prepares the HTML body a little bit before it is given to the save method. This requires the html argument to have an effect, and is automatically set to True if pdf is set to True.
- useMsgFilename: Tells the save method to use the filename property of the MSG file for the name of the output folder.
- attachmentsOnly: Skips saving the body in each MSG file, only saving the attachments.
- skipBodyNotFound: If set to True, skips saving the body if no valid format can be found given the other arguments.
- saveHeader: If set to True, the header is saved as a separate file.
- charset: If preparedHtml is True (or will be based on other options), the charset to use for the Content-Type meta tag to insert. Set to None or an empty string to skip inserting the tag.
- wkPath: If pdf is True, the path to use (instead of looking for it) for the wkhtmltopdf executable that will be used for converting the HTML body to PDF.
- wkOptions: A list or list-like object containing str and bytes instances. These will be the additional options passed to the wkhtmltopdf executable.
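As an example of combining these save arguments, the sketch below saves the HTML body of a message into a chosen directory and lets the method fall back to another format if HTML is unavailable. The helper name save_body_as_html is hypothetical; the keyword arguments are the ones described in the lists above.

```python
def save_body_as_html(path: str, out_dir: str) -> None:
    """Save a message as HTML under `out_dir`, falling back if needed."""
    import extract_msg  # third-party; assumed to be installed

    with extract_msg.openMsg(path) as msg:
        msg.save(
            html=True,           # request the HTML body
            allowFallback=True,  # let save pick another format if HTML fails
            customPath=out_dir,  # directory to create the output folder in
        )
```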
Arguments for Attachment.save:
- skipEmbedded: If set to True, embedded MSG file attachments will be skipped entirely from saving.
- contentId: If set to True, uses the content ID as the name to save with, if available.
- extractEmbedded: If set to True, embedded MSG files will be saved as a new MSG file, as if they were simply attached to the original as bytes.
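For finer control than saveAttachments offers, attachments can also be saved one at a time. The sketch below assumes the opened instance exposes its attachments through an attachments property; the helper name save_regular_attachments is hypothetical.

```python
def save_regular_attachments(path: str, out_dir: str) -> None:
    """Save each attachment to `out_dir`, skipping embedded MSG files."""
    import extract_msg  # third-party; assumed to be installed

    with extract_msg.openMsg(path) as msg:
        # Assumption: `attachments` is an iterable of Attachment instances.
        for attachment in msg.attachments:
            attachment.save(customPath=out_dir, skipEmbedded=True)
```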