ExtendedEmailReplyParser

Join the chat at https://gitter.im/fiedl/extended_email_reply_parser

When implementing a "reply or comment by email" feature, it's necessary to filter out signatures and the previous conversation, so that only the parts relevant to the conversation or comment section of the application remain. This Ruby gem helps with that.

This gem is an extended version of GitHub's email_reply_parser. It wraps the original email_reply_parser and allows you to build extensions, such as support for i18n and detection of previous conversation that the sender's mail client has not properly marked as a quotation.

Usage

Parsing incoming emails

To extract the relevant text of an email reply, call ExtendedEmailReplyParser.parse. It accepts a path to the email file, a Mail::Message object, or the email body text itself, and returns the parsed (i.e. relevant) text, which can be used as a comment in the application:

ExtendedEmailReplyParser.parse "/path/to/email.eml"
ExtendedEmailReplyParser.parse message
ExtendedEmailReplyParser.parse body_text             # => parsed text as String

Or, if you prefer to call #parse on the Mail::Message:

message.parse  # => parsed text as String

For example, for an incoming Mail::Message, message:

@comment.author = User.where(email: message.from.first).first
@comment.text = ExtendedEmailReplyParser.parse message
@comment.save!

Extracting the body text

This gem extends Mail::Message to conveniently extract the text body as UTF-8:

message.extract_text

Or, to see where this comes from:

ExtendedEmailReplyParser.extract_text message
ExtendedEmailReplyParser.extract_text '/path/to/email.eml'

Or, in two separate steps:

ExtendedEmailReplyParser.read("/path/to/email.eml")  # => Mail::Message
ExtendedEmailReplyParser.read("/path/to/email.eml").extract_text  # => String

Optional: How to handle HTML-only emails. Some emails have no text part, only an HTML part. For those emails, ExtendedEmailReplyParser.parse uses the HTML part. To give you more control over these situations, however, ExtendedEmailReplyParser.extract_text returns nil for them. If you want your text extraction to fall back to the HTML part when the text part is missing, use this:

ExtendedEmailReplyParser.extract_text_or_html message
ExtendedEmailReplyParser.extract_text_or_html '/path/to/email.eml'

The Mail::Message object itself is extended to support message.extract_text, message.extract_html_body_content as well as message.extract_text_or_html.

Writing a parser

The parsing system allows you to add your own parser to the parsing chain. Just define a class inheriting from ExtendedEmailReplyParser::Parsers::Base and implement a parse method. The text before parsing is accessed via text.

# app/models/email_parsers/shout_parser.rb
class EmailParsers::ShoutParser < ExtendedEmailReplyParser::Parsers::Base

  # This parser SHOUTS THE WHOLE EMAIL.
  # The text before parsing is accessed by `text`.
  #
  def parse
    text.upcase
  end
end

Selecting parsers to use

By default, all parsers that inherit from ExtendedEmailReplyParser::Parsers::Base are used. One simply has to call:

ExtendedEmailReplyParser.parse message

In order to select specific parsers, just chain them yourself:

EmailParsers::ShoutParser.parse \
  ExtendedEmailReplyParser::Parsers::Github.parse \
  message
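
Conceptually, chaining means that each parser receives the output of the previous one. The following plain-Ruby sketch illustrates that idea without loading the gem. SketchBase, TrimParser and the reduce-based pipeline are hypothetical stand-ins; only the `SomeParser.parse input` call style is taken from the example above, and the gem's actual Base class may work differently.

```ruby
# Hypothetical stand-in for ExtendedEmailReplyParser::Parsers::Base.
# This is NOT the gem's implementation, just enough to show chaining.
class SketchBase
  attr_reader :text

  def initialize(text)
    @text = text
  end

  # Class-level entry point, mirroring the `SomeParser.parse input` call style.
  def self.parse(text)
    new(text).parse
  end
end

# Removes leading and trailing whitespace.
class TrimParser < SketchBase
  def parse
    text.strip
  end
end

# Upcases the whole text, like the ShoutParser example.
class ShoutParser < SketchBase
  def parse
    text.upcase
  end
end

# Explicit nesting: the innermost parser runs first.
nested = ShoutParser.parse(TrimParser.parse("  hello world \n"))

# The same chain written as a left-to-right pipeline.
chain = [TrimParser, ShoutParser]
piped = chain.reduce("  hello world \n") { |text, parser| parser.parse(text) }

puts nested
puts piped
```

Order matters in such a chain: a parser that cuts off text should usually run before one that only reformats what is left.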

Installation

Add this line to your application's Gemfile:

# Gemfile
gem 'extended_email_reply_parser'

Or, for the edge version:

# Gemfile
gem 'extended_email_reply_parser', github: 'fiedl/extended_email_reply_parser'

And then execute:

➜ bundle

Or install it yourself as:

➜ gem install extended_email_reply_parser

Development

After checking out the repo, run bin/setup to install dependencies. Then, run rake spec to run the tests. You can also run bin/console for an interactive prompt that will allow you to experiment.

To install this gem onto your local machine, run bundle exec rake install. To release a new version, update the version number in version.rb, and then run bundle exec rake release, which will create a git tag for the version, push git commits and tags, and push the .gem file to rubygems.org.

Helper methods for writing parsers

To accomplish the most common parsing operations, there are a couple of helper methods.

This, for example, is the English parser.

module ExtendedEmailReplyParser
  class Parsers::I18nEn < Parsers::Base

    def parse
      except_in_visible_block_quotes do
        hide_everything_after ["From: ", "Sent: ", "To: "]
      end
    end

  end
end

add_quote_header_regex

The github parser needs to know how to identify the header line of quotes, for example "On Tue, 2011-03-01 at 18:02 +0530, Abhishek Kona wrote":

Hi,

On Tue, 2011-03-01 at 18:02 +0530, Abhishek Kona wrote:
> Hi folks
>
> What is the best way to clear a Riak bucket of all key, values after
> running a test?
> I am currently using the Java HTTP API.

You can list the keys for the bucket and call delete for each. Or if you
put the keys (and kept track of them in your test) you can delete them
one at a time (without incurring the cost of calling list first.)

By default, it uses the regex /^On .* wrote:$/ for that. To make it recognize other header lines, specify their patterns using add_quote_header_regex.

Since this is needed by the github parser, i.e. possibly before the parse method of your custom parser is run, make sure to add the quote header regex at the class level rather than inside the parse method:

module ExtendedEmailReplyParser
  class Parsers::I18nDe < Parsers::Base
    add_quote_header_regex '^Am .* schrieb.*$'
    # ...
  end
end
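
Before registering a new header pattern, it can help to try it against sample lines in plain Ruby. The sketch below does not use the gem. It assumes that the string passed to add_quote_header_regex behaves like a pattern built with Regexp.new, and the German sample line is made up for illustration.

```ruby
# Default pattern quoted in this README.
default_header = /^On .* wrote:$/

# Pattern from the I18nDe example. Converting the string with Regexp.new
# is an assumption about how the gem treats it.
german_header = Regexp.new('^Am .* schrieb.*$')

english_line = "On Tue, 2011-03-01 at 18:02 +0530, Abhishek Kona wrote:"
german_line  = "Am 01.03.2011 um 18:02 schrieb Abhishek Kona:" # made-up sample

default_matches_english = default_header.match?(english_line) # => true
default_matches_german  = default_header.match?(german_line)  # => false
german_matches_german   = german_header.match?(german_line)   # => true

puts default_matches_english, default_matches_german, german_matches_german
```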

hide_everything_after

Some email clients do not mark the previous conversation as a quote:

Hi Chris,
this is great, thanks!
Cheers, John


From: Chris <chris@example.com>
Sent: Saturday, July 09, 2016 3:27 PM
To: John <john@example.com>
Subject: The solution!

Hi John,
I've just found a solution to our big problem!
...

To remove the previous conversation, give the parser a list of expressions that identify where the previous conversation starts:

module ExtendedEmailReplyParser
  class Parsers::I18nEn < Parsers::Base
    def parse
      except_in_visible_block_quotes do
        hide_everything_after ["From: ", "Sent: ", "To: "]
      end
      # ...
    end
  end
end

(The parser will combine the expressions to a regex: /(#{expressions.join(".*?")}.*?\n)/m, for example: /(From: .*?Sent: .*?To: .*?\n)/m.)
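
The effect of that combined regex can be reproduced in plain Ruby. The sketch below builds the pattern exactly as quoted and keeps only the text in front of the first match. The helper name text_before_header is hypothetical and not part of the gem, and the gem's own cutting logic may differ in detail. Note that the expressions are interpolated as regex fragments, so special characters in them would be interpreted as regex syntax.

```ruby
# Returns the part of `text` before the first match of the combined pattern.
# `text_before_header` is a hypothetical helper, not part of the gem.
def text_before_header(text, expressions)
  pattern = /(#{expressions.join(".*?")}.*?\n)/m
  match = pattern.match(text)
  match ? text[0...match.begin(0)].rstrip : text
end

email = <<~EMAIL
  Hi Chris,
  this is great, thanks!
  Cheers, John


  From: Chris <chris@example.com>
  Sent: Saturday, July 09, 2016 3:27 PM
  To: John <john@example.com>
  Subject: The solution!

  Hi John,
  I've just found a solution to our big problem!
EMAIL

visible = text_before_header(email, ["From: ", "Sent: ", "To: "])
puts visible
```

Because of the m flag, `.*?` also matches across line breaks, which is what lets the three expressions sit on separate lines of the header.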

To avoid cutting off the email within a visible quote, wrap the hide_everything_after call in an except_in_visible_block_quotes block, as shown above.

Hi Chris,

> From: Chris <chris@example.com>
> Sent: Saturday, July 09, 2016 3:27 PM
> To: John <john@example.com>
> Subject: The solution!
>
> Hi John,
> I've just found a solution to our big problem!

this is great, thanks!
Cheers, John

If not wrapped in except_in_visible_block_quotes, the parsed email would just be "Hi Chris,", because everything after "From: Sent: To:" would be cut off.
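
One way to picture this protection is to mask the quoted lines before cutting and restore them afterwards. The following plain-Ruby sketch does that for the example above. Both helpers, cut_at_header and protecting_block_quotes, are hypothetical, and the placeholder approach is an assumption for illustration, not a description of how the gem implements except_in_visible_block_quotes.

```ruby
# Hypothetical helper: cut the text at the first "From: ... Sent: ... To: ..." header.
def cut_at_header(text)
  match = /(From: .*?Sent: .*?To: .*?\n)/m.match(text)
  match ? text[0...match.begin(0)] : text
end

# Hypothetical helper: hide ">"-quoted lines from the block, then restore them.
def protecting_block_quotes(text)
  quoted_lines = []
  masked = text.gsub(/^>.*$/) do |line|
    quoted_lines << line
    "[[QUOTE_#{quoted_lines.size - 1}]]"
  end
  yield(masked).gsub(/\[\[QUOTE_(\d+)\]\]/) { quoted_lines[Regexp.last_match(1).to_i] }
end

email = <<~EMAIL
  Hi Chris,

  > From: Chris <chris@example.com>
  > Sent: Saturday, July 09, 2016 3:27 PM
  > To: John <john@example.com>
  > Subject: The solution!
  >
  > Hi John,
  > I've just found a solution to our big problem!

  this is great, thanks!
  Cheers, John
EMAIL

# Without protection, the header inside the quote triggers the cut.
unprotected = cut_at_header(email)

# With protection, the quoted header is invisible to the cut and the reply survives.
protected_result = protecting_block_quotes(email) { |masked| cut_at_header(masked) }

puts protected_result
```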

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/fiedl/extended_email_reply_parser.

License

The gem is available as open source under the terms of the MIT License.