optdown: Pure-ruby GFM[1] parser.

This simple script converts a GFM document into HTML.

Background stories

In Japan, Markdown is widely used by programmers and tech writers, especially when it comes to writing a physical book on computer science or programming.

This sounds like a good thing, but in reality it is not an easy job. The Japanese writing system tends to be complex. HTML-rendered composition of Japanese has not (yet) reached a sufficient level of quality, so editors are forced to use other layout engines such as InDesign. There is therefore a need to convert a Markdown document into formats other than HTML. The situation is severe because there is no such thing as a "standardized markup for Japanese books". Each book editor has a distinct home-grown text markup so that they can cut & paste the marked-up text file into InDesign (by hand).

The authors hope the status quo will ease someday, somehow. In the meantime we developed a tiny program to tackle one technical part of the problem: that there is no Markdown parser that renders arbitrary 3rd-party markup. That former program was not made public.

In developing such a non-HTML Markdown parser the authors found that (1) Markdown is not a trivial language, although it seems to be; and (2) it can nevertheless be parsed using pure Ruby only. Based on this finding we thought that this Markdown-parsing problem could be a good benchmark for the Ruby interpreter itself.

We added an HTML rendering mode to the library, deleted 3rd-party markup modes, and this program was born.

What it is

This parser understands GFM.

The result of parsing a Markdown document is a DOM object. At the very least you can traverse the DOM tree. Along with the DOM object we provide several "visitor classes" that understand the given DOM and render transformed output from it. You can create your own by subclassing a pre-existing visitor class.
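As a rough illustration of the DOM-plus-visitor idea, here is a minimal, self-contained sketch. It does not use optdown's actual classes: the `VisitorSketch` module, the node types, and the `visit_*` method names are all hypothetical and exist only to show how a second output format can be added by writing another visitor.

```ruby
# Hypothetical sketch of a DOM with visitors; NOT optdown's real API.
module VisitorSketch
  # Tiny stand-in node types for a parsed document.
  Text      = Struct.new(:content)
  Emphasis  = Struct.new(:children)
  Paragraph = Struct.new(:children)
  Document  = Struct.new(:children)

  # Base visitor: dispatches to visit_<node type> and joins child output.
  class Visitor
    def visit(node)
      name = node.class.name.split("::").last.downcase
      send("visit_#{name}", node)
    end

    def visit_children(node)
      node.children.map { |child| visit(child) }.join
    end
  end

  # Renders the tree as HTML.
  class HTMLVisitor < Visitor
    ESCAPES = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;", '"' => "&quot;" }.freeze

    def visit_document(node)
      visit_children(node)
    end

    def visit_paragraph(node)
      "<p>#{visit_children(node)}</p>\n"
    end

    def visit_emphasis(node)
      "<em>#{visit_children(node)}</em>"
    end

    def visit_text(node)
      node.content.gsub(/[&<>"]/, ESCAPES)
    end
  end

  # A different target format: reuse the HTML traversal, override what differs.
  class PlainTextVisitor < HTMLVisitor
    def visit_paragraph(node)
      "#{visit_children(node)}\n"
    end

    def visit_emphasis(node)
      visit_children(node)
    end

    def visit_text(node)
      node.content
    end
  end
end

doc = VisitorSketch::Document.new([
  VisitorSketch::Paragraph.new([
    VisitorSketch::Text.new("a < b, "),
    VisitorSketch::Emphasis.new([VisitorSketch::Text.new("really")])
  ])
])

puts VisitorSketch::HTMLVisitor.new.visit(doc)
puts VisitorSketch::PlainTextVisitor.new.visit(doc)
```

In this sketch the tree itself knows nothing about any output format, so supporting a home-grown markup would only require another visitor subclass.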

Why yet another Markdown parser?

General

  • It is standards compliant. As of writing we support 0.28-gfm (2017-08-01).
  • It is fully multilingualized; proper handling of complex Japanese text is implemented through Ruby's multilingualization features (note: emoji originated in Japan).
  • It is extensible; though none are bundled, renderers for any 3rd-party target format can be written.
  • It is pure-ruby; this property sacrifices runtime speed for maximum portability.

Why not redcarpet[2]

  • First off, the authors would like to say that redcarpet IS great. We started by extending the library before we ended up writing this repo.
  • Yet, it is not CommonMark compatible.
  • Also, it does not expose its DOM. All you can do is customize the HTML rendering.
  • One of the authors experienced SEGV using it before[3].

Why not cmark[4] / commonmarker[5]

  • Cmark is pretty well written. It is extraordinarily fast to run, and not error-prone.
  • That being said, cmark supports only a fixed set of output formats: MAN, HTML, XML, COMMONMARK, and LATEX. It seems that github/cmark[6] ended up forking the library to support FORMAT_PLAINTEXT. This is not a good property.
  • Commonmarker is a good gem. People should consider using it if possible. However, it is a C extension that needs a compiler, so it does not fit our needs.

Limitations

  • Slower than any other known Markdown library as of this writing. We are dead serious. This program is an order of magnitude slower than other practical Markdown parsers which claim to be GFM compatible. The authors think this is due to Ruby's interpreter not being optimized enough.
  • While it is relatively straightforward to write a new output format, it is not known to be easy to override existing Markdown syntax; e.g. introducing elastic tabstops is a challenge.

Legal

The authors believe they have not used any 3rd-party intellectual property except the items listed below. Consult LICENSE.txt[7] for the detailed terms of use of this software. TL;DR: it is MIT-licensed.

3rd-party intellectual property

  • Lawful quotations of the CommonMark spec are included; the spec is under CC-BY-SA 4.0.

  • The file named html5entity.rb is a derivation of source code provided by W3C[8], which is licensed under the W3C document license[9]. Here is the copyright notice that the license requires us to show:

    Copyright © 2015 W3C® (MIT, ERCIM, Keio, Beihang). This software or document includes material copied from or derived from https://www.w3.org/TR/html5/entities.json
    

HTML5 related legal twist

At one point the GFM spec refers to the HTML Living Standard[10]. However, the authors cannot find any license terms for it. Circumstantial evidence suggests that the standard is not seriously licensed in any way. We are afraid of such a thing polluting our code; at least, we have no idea whether the WHATWG HTML is MIT compatible.

So instead, we hereby refer to the W3C definition of HTML5[11]. It is at least explicitly licensed and the license tells us that it is safe to use.

Acknowledgment

The authors would like to express their gratitude to Tanaka Akira (akr@fsij.org) for his work on extending regular expression grammars. The two key features which made this entire project possible (namely the absent operator and the subexp call) were both invented by him.
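For readers unfamiliar with the two regular expression features named above, here is a small sketch of each in plain Ruby. The patterns are textbook illustrations chosen for this note, not patterns taken from optdown's source; the constant names are made up. The absent operator needs a Ruby version whose regex engine supports `(?~...)` (2.4.1 or later).

```ruby
# Absent operator (?~X): matches any string that does NOT contain X.
# Here: a C-style comment is "/*", then text without "*/", then "*/".
C_COMMENT = %r{/\*(?~\*/)\*/}

# Subexp call \g<name>: re-enters a named group, which permits recursion.
# Here: a string consisting of one balanced parenthesised group.
BALANCED = /\A(?<paren>\((?:[^()]|\g<paren>)*\))\z/

first_comment = "/* a */ b /* c */"[C_COMMENT]
p first_comment                    # => "/* a */"
p BALANCED.match?("(a(b)c)")       # => true
p BALANCED.match?("(a(b)c")        # => false
```

Nested, delimiter-sensitive constructs of this kind are common in Markdown, which is presumably why these two features matter so much for a regex-driven parser.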