tao2tex

Examples: https://www.youtube.com/watch?v=PXW_sWKoHnI

Description

Goes through the HTML of a WordPress math blog post (mainly, Prof. Terry Tao’s blog) using a combination of regexes and BeautifulSoup, and spits out a $\rm\LaTeX$ version. In some ways, it is a partial inverse for LaTeX2WP. However, we also include the comments (which sometimes have great information). This should work well for many of Tao's blog posts, and issues with the generated .tex should be few and easy to fix.

Note: please observe Prof Tao's copyright notice on this page and do not redistribute large numbers of Tao's blogposts without asking him for permission:

Readers are welcome to copy, link to, quote from, or translate reasonable portions of the content of this blog (e.g. a single article) into other media, though for items longer than one or two paragraphs, I would appreciate it if a reference or citation to the URL that the content originates from is provided. If you wish to copy a significantly larger fraction of the content (e.g. an entire series of articles), please contact me about it first.

Requirements and Installation

You need reasonably up-to-date installations of Python 3 and $\rm\LaTeX$ (software to compile the output of tao2tex.py). In addition, the Python packages that tao2tex.py imports (among them BeautifulSoup and the requests module) must be installed, e.g. via pip.

You could also use a cloud service like Overleaf in lieu of a new $\rm\TeX$ installation.

Usage

  1. Clone the repo and install the dependencies. One way to do this is with pdm install.

  2. Go to Terry’s blog and find a post you want to convert to $\rm\LaTeX$.

  3. Copy the URL.

  4. cd to the repo and run python3 tao2tex.py URL. (If using pdm, run pdm run python tao2tex.py URL instead.)

  5. Wait a few seconds and a .tex file will be produced.

  6. Run the .tex file in your favourite $\rm\LaTeX$ workflow to create a finished PDF.

For instance, to convert the post at the URL below, we would type python3 tao2tex.py https://terrytao.wordpress.com/2018/12/09/254a-supplemental-weak-solutions-from-the-perspective-of-nonstandard-analysis-optional/.

tao2tex also supports a local mode, and a batch mode:

  • For local mode, save the HTML of the page and then use the name of the file in place of the URL, with the option -l, e.g. python3 tao2tex.py file.html -l.
  • For batch mode, save the list of URLs in a file, e.g. batch.txt, and call python3 tao2tex.py batch.txt -b. If you have a list of local files, you can combine the options as -b -l, e.g. with the provided tested.txt file. Everything after the first whitespace in each line is ignored, so you can leave comments after a space.
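The batch-file convention described above (one URL or filename per line, with anything after the first whitespace treated as a comment) can be illustrated with a short sketch. This is not the code used by tao2tex.py; the function name `read_batch_entries` and the sample entries are hypothetical, and skipping blank lines is an assumption.

```python
def read_batch_entries(text):
    """Return the first whitespace-delimited token of each non-blank line."""
    entries = []
    for line in text.splitlines():
        # split(maxsplit=1) cuts at the first run of whitespace;
        # everything after it is treated as a comment and dropped.
        parts = line.split(maxsplit=1)
        if parts:  # assumption: blank lines are skipped
            entries.append(parts[0])
    return entries


# Hypothetical batch file contents, for illustration only.
sample = (
    "https://example.com/post-one/ a comment after the space\n"
    "\n"
    "saved_post.html\tanother comment\n"
)
print(read_batch_entries(sample))
```

Note that `str.split` also discards leading whitespace, so an indented line still yields its first token; the real tool may treat such lines differently.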

In addition, the -o option specifies the name of the .tex file, -p prints the output to the command line, and -d enables a rudimentary debugger. If you do not have a specific post in mind, you can run python3 tao2tex.py -i https://terrytao.wordpress.com to get a list of blog posts on Prof Tao's front page.

Testing

Since the desired output is not precisely defined, we provide a test.html file which may be used for debugging (in particular, for adding features, adjusting to breaking changes, or for adapting to other blogs). It is a short sample HTML file that can be used to test the output of tao2tex via the command python3 tao2tex.py test.html -l.

Customizing the output

The easiest way to customise the output is to modify preamble.tex. The theorems look very close to how they appear online. This is achieved with \usepackage[framemethod=tikz]{mdframed} and the simple style \mdfdefinestyle{tao}{outerlinewidth = 1,roundcorner=2pt,innertopmargin=0}. The more standard amsthm environments are provided as a commented-out block.

There are a number of keywords in the given preamble.tex; they are in all-caps and begin with TTT-, e.g. TTT-BLOG-TITLE. These are substituted via regex by tao2tex.py to create the .tex output. It is possible to create more of these keywords; to make tao2tex see them, you should modify the preamble_formatter function.
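To make the keyword mechanism concrete, here is a minimal sketch of regex-based substitution of TTT- keywords in a preamble template. It is an illustration only and does not reproduce preamble_formatter; the function name `fill_preamble`, the keyword pattern, and the keyword TTT-AUTHOR are assumptions made for the example.

```python
import re


def fill_preamble(template, values):
    """Replace each TTT-... keyword in template that has an entry in values."""

    def substitute(match):
        keyword = match.group(0)
        # Unknown keywords are left untouched so they stay visible in the output.
        return values.get(keyword, keyword)

    # Assumed keyword shape: TTT- followed by hyphen-separated capital words.
    return re.sub(r"TTT-[A-Z]+(?:-[A-Z]+)*", substitute, template)


# TTT-AUTHOR is a hypothetical keyword used only for this example.
template = "\\title{TTT-BLOG-TITLE}\n\\author{TTT-AUTHOR}"
print(fill_preamble(template, {"TTT-BLOG-TITLE": "A sample post"}))
```

Passing a function rather than a string as the replacement means that backslashes in the substituted text (common in $\rm\LaTeX$ titles) are inserted literally instead of being interpreted as regex escapes.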

Emoji that appear (for instance, in certain comments) are processed (e.g. 😂 becomes \emoji{face_with_tears_of_joy}); \emoji is defined to simply be \texttt, as $\rm\LaTeX$ is unable to render emoji without help. But you can get the actual emoji if you comment out this definition, import the emoji package, and compile with $\rm Lua\TeX$, a variant of $\rm pdf\TeX$.
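The emoji handling can be pictured with the following sketch, which turns emoji characters into \emoji{...} macros using only the Python standard library. This is not necessarily how tao2tex.py does it: the function name `emoji_to_macro` is hypothetical, the code point range used to detect emoji is a rough assumption, and names derived from the Unicode character database may not always match the names that tao2tex emits.

```python
import unicodedata


def emoji_to_macro(text):
    """Replace characters in an assumed emoji range with \\emoji{name}."""
    out = []
    for ch in text:
        # Assumption: treat U+1F300..U+1FAFF as the emoji range.
        if 0x1F300 <= ord(ch) <= 0x1FAFF:
            name = unicodedata.name(ch, "")
            if name:
                # e.g. "FACE WITH TEARS OF JOY" -> "face_with_tears_of_joy"
                out.append("\\emoji{" + name.lower().replace(" ", "_") + "}")
                continue
        out.append(ch)
    return "".join(out)


print(emoji_to_macro("Nice proof 😂"))
```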

Known Limitations or Issues

  • The more recent versions (since 2018) of $\rm pdf\LaTeX$ cope with many (but not all) unicode symbols, because UTF-8 is assumed to be the default input encoding. If you do not want to install a newer version, you can try using Overleaf. You might be able to get away with adding \usepackage[utf8]{inputenc} and \usepackage[T1]{fontenc} to the preamble.

  • Sometimes (in section names, theorem names, etc.) the mathematics is skipped. This should be easy to fix once I have time to look into it.

  • In string_formatter, we escape only a few unicode characters to attempt to please the $\rm\TeX$ engine. We replace Greek characters, which do appear in some of the blog posts, in an arguably naive and counterproductive manner (e.g. α into \(\alpha\)). $\rm{}pdf\LaTeX$ will complain about unicode characters it cannot handle, whereas $\rm{}Xe\LaTeX$ and $\rm{}Lua\LaTeX$ will work if you switch to a font that has the glyphs (without such a font, these two will still compile).

  • Since we pull website data using the requests module, we do not see any HTML generated from Javascript. For example, we are unable to process the occasional polls that Tao makes. However, the rest of the post should work as expected.

  • In some posts, there are so many comments that they are spread over multiple pages, which we check in turn. We skip this when running in -l/--local mode.

  • The heuristics we use for labels are not perfect. However, we definitely include all labelled tags (formatted as <a name="...">eq. number</a>). Most issues seem to be easy to regex away after running tao2tex; for example, I had success replacing end{align}\\label{[a-z-]*} with end{align} globally.

  • Most likely, modification of the BeautifulSoup part is needed to work with other blogs, even those that are on WordPress. Despite looking quite similar, the precise way that the tags are laid out seems to differ from blog to blog.

  • For similar reasons, if Prof Tao ever updates the layout of the blog, this tool will break. Hopefully such a new version will directly support a good print option, but in any case the posts pre-update with the older layout will still be accessible, thanks to the Internet Archive.
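The label cleanup mentioned in the list above (replacing end{align}\\label{[a-z-]*} with end{align}) can also be scripted. The sketch below applies an equivalent pattern with Python's re module; the function name `strip_labels_after_align` and the sample text are hypothetical, and the braces are escaped explicitly to keep the pattern unambiguous.

```python
import re


def strip_labels_after_align(tex):
    """Drop a \\label{...} that directly follows \\end{align}."""
    # Group 1 captures \end{align}; the label after it is discarded.
    pattern = r"(\\end\{align\})\\label\{[a-z-]*\}"
    return re.sub(pattern, r"\1", tex)


# Hypothetical fragment of generated output, for illustration only.
sample = r"\begin{align} a &= b \end{align}\label{eq-one} Some text."
print(strip_labels_after_align(sample))
```

As in the original pattern, only labels made of lowercase letters and hyphens are removed; a label containing digits is left in place.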