NOTE: This project is maintained. It may seem inactive, but only because there is nothing to add. To request an enhancement or report a bug, please go to the issues.
A markdown reverser.
Unmarkd is a BeautifulSoup-powered Markdown reverser written in Python and for Python.
This was created as a dependency of StackSearch (one of my other projects). To build a better API there, I needed a way to convert HTML back to Markdown, so I created this.
There are similar projects written in Ruby, but at first I could not find any written in Python (or for Python). Later, I found a popular library, html2text.
You know the drill
pip install unmarkd
TL;DR: Html2Text is fast. If you don't need much configuration, you could use Html2Text for the little speed increase.
TL;DR: Unmarkd < Html2Text
Html2Text is basically faster:
(The DOC variable used can be found here)
Unmarkd sacrifices speed for power.
Html2Text directly uses Python's html.parser module (in the standard library). Unmarkd, on the other hand, uses the powerful HTML parsing library beautifulsoup4. BeautifulSoup can be configured to use different HTML parsers; in Unmarkd, we configure it to use Python's html.parser, too.
But another layer of code means more code is run.
I hope that's a good explanation of the speed difference.
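If you'd like to measure the difference on your own input, timeit from the standard library is enough. The sketch below times any callable that takes an HTML string. The stand_in_convert function and this DOC value are hypothetical placeholders (a bare html.parser text extractor and a repeated snippet) so the block runs without third-party packages; swap in unmarkd.unmark or html2text.html2text to compare the two libraries.

```python
import timeit
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Hypothetical stand-in converter: keeps text nodes, drops tags."""

    def __init__(self) -> None:
        super().__init__()
        self.parts = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def stand_in_convert(html: str) -> str:
    """Placeholder for a real converter such as unmarkd.unmark."""
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


def seconds_per_call(convert, html: str, number: int = 200) -> float:
    """Average wall-clock seconds for one convert(html) call."""
    return timeit.timeit(lambda: convert(html), number=number) / number


# Placeholder input; not the DOC used for the comparison above.
DOC = "<p>Some <b>bold</b> text</p>" * 50
print(f"{seconds_per_call(stand_in_convert, DOC):.6f} s per call")
```

Numbers will vary by machine and input, so treat them as relative rather than absolute.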
TL;DR: Unmarkd == Html2Text
I actually found two html-to-markdown libraries. One of them was Tomd which had an incorrect implementation:
It seems to be abandoned, anyway.
Now with Html2Text and Unmarkd:
In other words, both work.
TL;DR: Unmarkd > Html2Text
This is Unmarkd's strong point.
In Html2Text, you only have a limited set of options.
In Unmarkd, you can subclass the BaseUnmarker and implement conversions for new tags (e.g. <q>), etc. In my opinion, it's much easier to extend and configure Unmarkd.
Unmarkd was originally written as a StackSearch dependency.
Html2Text has no options for configuring parsing of code blocks. Unmarkd does.
Here's an example of basic usage:
import unmarkd
print(unmarkd.unmark("<b>I <i>love</i> markdown!</b>"))
# Output: **I *love* markdown!**
or something more complex (shamelessly taken from here):
import unmarkd
html_doc = R"""<h1 id="sample-markdown">Sample Markdown</h1>
<p>This is some basic, sample markdown.</p>
<h2 id="second-heading">Second Heading</h2>
<ul>
<li>Unordered lists, and:<ol>
<li>One</li>
<li>Two</li>
<li>Three</li>
</ol>
</li>
<li>More</li>
</ul>
<blockquote>
<p>Blockquote</p>
</blockquote>
<p>And <strong>bold</strong>, <em>italics</em>, and even <em>italics and later <strong>bold</strong></em>. Even <del>strikethrough</del>. <a href="https://markdowntohtml.com">A link</a> to somewhere.</p>
<p>And code highlighting:</p>
<pre><code class="lang-js"><span class="hljs-keyword">var</span> foo = <span class="hljs-string">'bar'</span>;
<span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">baz</span><span class="hljs-params">(s)</span> </span>{
<span class="hljs-keyword">return</span> foo + <span class="hljs-string">':'</span> + s;
}
</code></pre>
<p>Or inline code like <code>var foo = 'bar';</code>.</p>
<p>Or an image of bears</p>
<p><img src="http://placebear.com/200/200" alt="bears"></p>
<p>The end ...</p>
"""
print(unmarkd.unmark(html_doc))
and the output:
# Sample Markdown
This is some basic, sample markdown.
## Second Heading
- Unordered lists, and:
1. One
2. Two
3. Three
- More
>Blockquote
And **bold**, *italics*, and even *italics and later **bold***. Even ~~strikethrough~~. [A link](https://markdowntohtml.com) to somewhere.
And code highlighting:
```js
var foo = 'bar';
function baz(s) {
return foo + ':' + s;
}
```
Or inline code like `var foo = 'bar';`.
Or an image of bears
![bears](http://placebear.com/200/200)
The end ...
Most functionality should be covered by the BasicUnmarker class defined in unmarkd.unmarkers.
If you need to reverse markdown from StackExchange (as is the case for my other project), you may use the StackOverflowUnmarker (or its alias, StackExchangeUnmarker), which is also defined in unmarkd.unmarkers.
If the above two classes do not suit your needs, you can subclass the unmarkd.unmarkers.BaseUnmarker abstract class.
Currently, you can optionally override the following methods:
- detect_language (parameters: 1)
  - Parameters:
    - html: bs4.BeautifulSoup
  - When a fenced code block is encountered, this function is called with a parameter of type bs4.BeautifulSoup; this is the element the code block was detected from (i.e. pre).
  - This function is responsible for detecting the programming language of the code block (or returning '' if none was detected).
  - Note: This method differs from the one in unmarkd.unmarkers.BasicUnmarker. It is simpler and does less checking/filtering.
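As a rough idea of what a detect_language override might do, here is a minimal sketch of the class-name logic on its own. It assumes the language is encoded as a lang- or language- class, as in class="lang-js" from the sample HTML above. The helper name and prefix list are hypothetical; in a real override you would gather the class names from the pre element or its code child (for example with bs4's tag.get("class")) and pass them in.

```python
from typing import Iterable

# Assumed prefixes; "lang-" matches the sample HTML above (class="lang-js").
LANGUAGE_PREFIXES = ("language-", "lang-")


def language_from_classes(classes: Iterable[str]) -> str:
    """Return the language named by the first prefixed class, or '' if none."""
    for name in classes:
        for prefix in LANGUAGE_PREFIXES:
            if name.startswith(prefix):
                return name[len(prefix):]
    return ""


print(language_from_classes(["lang-js"]))     # js
print(repr(language_from_classes(["hljs"])))  # ''
```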
But Unmarkd is more flexible than that.
There are currently 3 constants you may override:
- Formats (NOTE: Use the Format String Syntax):
  - UNORDERED_FORMAT
    - The string format of unordered (bulleted) lists.
  - ORDERED_FORMAT
    - The string format of ordered (numbered) lists.
- Miscellaneous:
  - ESCAPABLES
    - A container (preferably a set) of length-1 str that should be escaped
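For anyone unfamiliar with the Format String Syntax, here is a small illustration using str.format. The template values and the field names (index, item) are hypothetical and chosen only for this example; the replacement fields Unmarkd actually fills in are not listed here, so check the library's defaults before overriding UNORDERED_FORMAT or ORDERED_FORMAT.

```python
# Hypothetical templates and field names, for illustration only.
UNORDERED_TEMPLATE = "- {item}"
ORDERED_TEMPLATE = "{index}. {item}"

items = ["One", "Two", "Three"]

# Each list entry is produced by filling the template's named fields.
unordered = [UNORDERED_TEMPLATE.format(item=item) for item in items]
ordered = [
    ORDERED_TEMPLATE.format(index=index, item=item)
    for index, item in enumerate(items, start=1)
]

print("\n".join(unordered))
print("\n".join(ordered))
```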
For an HTML tag some_tag, you can customize how it's converted to markdown by overriding a method like so:
from unmarkd.unmarkers import BaseUnmarker
class MyCustomUnmarker(BaseUnmarker):
    def tag_some_tag(self, element) -> str:
        ...  # parse code here
To reduce code duplication, if your tag also has aliases (e.g. strong is an alias for b in HTML) then you may modify TAG_ALIASES.
If you really need to, you may also modify DEFAULT_TAG_ALIASES. Be warned: if you do so, you will also need to implement the aliases (currently em and strong).
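To show the idea behind aliases, here is a toy dispatcher. It is not Unmarkd's real implementation: the TinyConverter class, its convert method, and the alias-to-tag direction of the mapping are assumptions made for illustration. The point is only that an alias lets several tag names share one tag_* handler.

```python
class TinyConverter:
    """Toy dispatcher; not Unmarkd's real implementation."""

    # Assumed shape: maps an alias to the tag whose handler it reuses.
    TAG_ALIASES = {"strong": "b", "em": "i"}

    def tag_b(self, text: str) -> str:
        return f"**{text}**"

    def tag_i(self, text: str) -> str:
        return f"*{text}*"

    def convert(self, tag_name: str, text: str) -> str:
        # Resolve the alias first, then look up the tag_* handler by name.
        canonical = self.TAG_ALIASES.get(tag_name, tag_name)
        handler = getattr(self, f"tag_{canonical}", None)
        if handler is None:
            raise NotImplementedError(f"no handler for <{tag_name}>")
        return handler(text)


converter = TinyConverter()
print(converter.convert("strong", "bold"))  # **bold**
print(converter.convert("em", "italics"))   # *italics*
```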
I find myself iterating through the children of the tag a lot. But that would lead to us needing to handle new tags, which could be anything. So here's the template/pattern I recommend:
import bs4
from unmarkd.unmarkers import BaseUnmarker
class MyCustomUnmarker(BaseUnmarker):
    def tag_some_tag(self, element) -> str:
        output = ""  # accumulated markdown for this tag
        for child in element.children:
            # Let parse_non_tags deal with anything that is not a tag.
            if non_tag_output := self.parse_non_tags(child):
                output += non_tag_output
                continue
            assert isinstance(child, bs4.Tag), type(child)
            ...  # Do whatever you want with the child
        return output
You may use (when extending) the following functions:
- __parse, 2 parameters:
  - html: bs4.BeautifulSoup
    - The html to unmark. This is used internally by the unmark method and is slightly faster.
  - escape: bool
    - Whether to escape the characters inside the string or not. Defaults to False.
- escape, 1 parameter:
  - string: str
    - The string to escape and make markdown-safe
- wrap, 2 parameters:
  - element: bs4.BeautifulSoup
    - The element to wrap.
  - around_with: str
    - The character to wrap the element around with. WILL NOT BE ESCAPED
- And, of course, tag_* and detect_language.
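To make the escape and wrap descriptions concrete, here is a standalone sketch of the behavior they describe. These are not the library's implementations: the character set is a hypothetical stand-in for ESCAPABLES, and this wrap takes already-converted text rather than a bs4 element, so the block runs with the standard library alone.

```python
# Hypothetical character set; Unmarkd's own ESCAPABLES may differ.
ESCAPABLES = {"*", "_", "`", "\\"}


def escape(string: str, escapables=ESCAPABLES) -> str:
    """Prefix every escapable character with a backslash."""
    return "".join("\\" + char if char in escapables else char for char in string)


def wrap(text: str, around_with: str) -> str:
    """Surround text with around_with; around_with is used as-is, unescaped."""
    return f"{around_with}{text}{around_with}"


print(wrap(escape("2 * 3"), "**"))  # **2 \* 3**
```

Escaping the content before wrapping is what keeps the literal asterisk from being read as emphasis, while the wrapping markers themselves stay untouched.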