The fast & forgiving HTML/XML parser, on JSR.io.
htmlparser2 is the fastest HTML parser, and takes some shortcuts to get there. If you need strict HTML spec compliance, have a look at parse5.
Name | Description |
---|---|
htmlparser2 | Fast & forgiving HTML/XML parser |
domhandler | Handler for htmlparser2 that turns documents into a DOM |
domutils | Utilities for working with domhandler's DOM |
css-select | CSS selector engine, compatible with domhandler's DOM |
cheerio | The jQuery API for domhandler's DOM |
dom-serializer | Serializer for domhandler's DOM |
htmlparser2
itself provides a callback interface that allows consumption of documents with minimal allocations. For a
more ergonomic experience, read Getting a DOM below.
import * as htmlparser2 from "@victr/htmlparser2";
const parser = new htmlparser2.Parser({
onopentag(name, attributes) {
/*
* This fires when a new tag is opened.
*
* If you don't need an aggregated `attributes` object,
* have a look at the `onopentagname` and `onattribute` events.
*/
if (
name === "script" && attributes.type === "text/javascript"
) {
console.log("JS! Hooray!");
}
},
ontext(text) {
/*
* Fires whenever a section of text was processed.
*
* Note that this can fire at any point within text and you might
* have to stitch together multiple pieces.
*/
console.log("-->", text);
},
onclosetag(tagname) {
/*
* Fires when a tag is closed.
*
* You can rely on this event only firing when you have received an
* equivalent opening tag before. Closing tags without corresponding
* opening tags will be ignored.
*/
if (tagname === "script") {
console.log("That's it?!");
}
},
});
parser.write(
"Xyz <script type='text/javascript'>const foo = '<<bar>>';</script>",
);
parser.end();
Output (with multiple text events combined):
--> Xyz
JS! Hooray!
--> const foo = '<<bar>>';
That's it?!
This example only shows three of the possible events. Read more about the parser, its events and options in the wiki.
The DomHandler
produces a DOM (document object model) that can be manipulated using the
DomUtils
helper.
import * as htmlparser2 from "@victr/htmlparser2";
const dom = htmlparser2.parseDocument(htmlString);
The DomHandler
, while still bundled with this module, was moved to its
own module. Have a look at that for further information.
htmlparser2
makes it easy to parse RSS, RDF and Atom feeds, by providing a parseFeed
method:
const feed = htmlparser2.parseFeed(content, options);
After having some artificial benchmarks for some time, @AndreasMadsen published his
htmlparser-benchmark
, which benchmarks HTML parses based on
real-world websites.
At the time of writing, the latest versions of all supported parsers show the following performance characteristics on GitHub Actions (sourced from here):
htmlparser2 : 2.17215 ms/file ± 3.81587
node-html-parser : 2.35983 ms/file ± 1.54487
html5parser : 2.43468 ms/file ± 2.81501
neutron-html5parser: 2.61356 ms/file ± 1.70324
htmlparser2-dom : 3.09034 ms/file ± 4.77033
html-dom-parser : 3.56804 ms/file ± 5.15621
libxmljs : 4.07490 ms/file ± 2.99869
htmljs-parser : 6.15812 ms/file ± 7.52497
parse5 : 9.70406 ms/file ± 6.74872
htmlparser : 15.0596 ms/file ± 89.0826
html-parser : 28.6282 ms/file ± 22.6652
saxes : 45.7921 ms/file ± 128.691
html5 : 120.844 ms/file ± 153.944