GitHub - mitica/html-explorer: HTML Page Explorer

html-explorer - HTML page explorer

html-explorer extracts the main information from an HTML page.

Currently it extracts:

Page meta:
- title
- description
- keywords
- canonical
- feeds
Main images - an ordered list of images;
Main videos - an ordered list of videos;
Page content - main page content/article;
Page encoding;

Usage

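var explorer = require('html-explorer');
// explore() returns a promise that resolves with the page object
explorer.explore('http://edition.cnn.com/')
  .then(function (page) {
    console.log(page.title);
  })
  .catch(function (error) {
    console.error(error); // exploring the page failed
  });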

Result structure

url (String) - input url param;
href (String) - server response url;
canonical (String) - page canonical;
title (String);
description (String);
keywords (String);
content (String);
encoding (String): utf8, windows-1251, iso-8859-2, etc.;
feeds ([Feed]) - list of feeds:
- title (String);
- href (String) - feed url;
images ([Image]) - a list of images:
- src (String) - image src;
- viewWidth (Number) - image view width, if found;
- viewHeight (Number);
- width (Number) - real image width;
- height (Number);
- alt (String);
- title (String);
- rating (Number) - count of words matching page title words;
- type (String) - (only if identify option is true) - can be: bmp, gif, jpg, png, psd, svg, tiff or webp;
- data (Buffer) - (only if identify option is true) - image data.
videos ([Video]) - a list of videos:
- sourceType (String) - video source type: URL, YOUTUBE, VIMEO or IFRAME;
- sourceId (String) - depends on sourceType: a URL or a source id;
- width (Number) - video width;
- height (Number) - video height;
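
As a rough illustration of working with this structure, the sketch below picks a lead image by `rating` and turns a video entry into a link. The `page` object, `pickLeadImage` and `videoUrl` are hypothetical names used only for this example and are not part of html-explorer. The sketch also assumes that `sourceId` holds the plain video id for YOUTUBE and VIMEO and a URL for the other source types.

```javascript
// Hypothetical page object shaped like the documented result structure.
var page = {
  title: 'Example page',
  images: [
    { src: 'http://example.com/a.jpg', width: 640, height: 360, rating: 1 },
    { src: 'http://example.com/b.jpg', width: 800, height: 600, rating: 3 }
  ],
  videos: [
    { sourceType: 'YOUTUBE', sourceId: 'abc123', width: 560, height: 315 }
  ]
};

// Return the image with the highest rating, or null when there are none.
// On a tie the earlier image wins, because the list is already ordered.
function pickLeadImage(page) {
  var images = page.images || [];
  return images.reduce(function (best, image) {
    if (!best) { return image; }
    return (image.rating || 0) > (best.rating || 0) ? image : best;
  }, null);
}

// Build a link for a video entry (assumption: sourceId is the video id
// for YOUTUBE and VIMEO, and already a URL for URL and IFRAME).
function videoUrl(video) {
  if (video.sourceType === 'YOUTUBE') {
    return 'https://www.youtube.com/watch?v=' + video.sourceId;
  }
  if (video.sourceType === 'VIMEO') {
    return 'https://vimeo.com/' + video.sourceId;
  }
  return video.sourceId;
}

var lead = pickLeadImage(page);
console.log(lead ? lead.src : 'no image');
console.log(videoUrl(page.videos[0]));
```

In real use the `page` object would come from the promise returned by `explorer.explore(url)` rather than being written by hand.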

API

`explorer.explore(url, [options])`

Explores a URL and returns a promise that resolves with the page object described under Result structure.

Options

page - html page options:
- timeout (Number) [5000] - request timeout;
- headers (Object) [{}] - request headers;
- canonical (Boolean) [true] - whether to find the canonical url;
- feeds (Boolean|Function) - whether to find feeds; a function is used to validate each feed;
- validator (Function) [noop] - validates the page after exploring its info; throw an error if the page is invalid;
- html (Boolean|String) [false] - whether to return the HTML text. If it is a string, it is used as the remote HTML body;
- lang (String) - two-letter page language code;
content (Boolean|Object) - content options:
- filter (Boolean|Object):
  - minLine: (Number) [50] - accepted minimum line length;
  - minPhrase: (Number) [100] - accepted minimum phrase length;
  - phraseEndRegex: (Regex) default: /[.!?:;¡¿%]$/ - phrase-ending punctuation regex;
  - phraseEnd: (Boolean) [false] - require a phrase to end with punctuation;
  - maxInvalidLines: (Number) [3] - maximum consecutive invalid lines;
  - minScore: (Number) [0.3] - minimum in-text-search score, from 0 to 1;
images (Boolean|Object) - images explorer options:
- limit (Number) [5] - maximum number of images to return;
- filter (Object):
  - minViewHeight (Number) [180] - accepted minimum image view height;
  - minViewWidth (Number) [220] - accepted minimum image view width;
  - minHeight (Number) [200] - accepted minimum image height;
  - minWidth (Number) [250] - accepted minimum image width;
  - minRating (Number) [0] - accepted minimum image rating(...);
  - minRatio (Number) [null] - accepted minimum image ratio (ratio=width/height);
  - maxRatio (Number) [null] - accepted maximum image ratio;
  - invalidRatio (Number | [Number]) [1] - example: value [1] will exclude all images with width=height;
  - invalidExtensions ([String]) [gif, png] - invalid image extensions;
  - src (RegExp) [see source code] - invalidate image by SRC;
  - extraSrc (RegExp) - invalidate image by SRC;
  - cssClass (RegExp) - filter image by its css class;
  - types (String|[String]) - accepted image types (bmp, gif, jpg, png, psd, svg, tiff, webp), default: ['jpg'];
  - invalidTypes (String|[String]) - invalid image types;
- identify (Boolean) [false] - identify image width, height and type by downloading data;
- data (Boolean) [false] - set image data property. Works only if identify is true.
- timeout (Number) [1000] - image downloading timeout, in ms.
video (Boolean|Object) - video explorer options:
- limit (Number) [1] - maximum number of videos to return;
- filter (Object):
  - minHeight (Number) [200] - accepted minimum video height;
  - minWidth (Number) [250] - accepted minimum video width;
  - minRatio (Number) [null] - accepted minimum video ratio (ratio=width/height);
  - maxRatio (Number) [null] - accepted maximum video ratio;
  - invalidRatio (Number | [Number]) [1] - example: value [1] will exclude all videos with width=height;
  - src (RegExp) [see source code] - invalidate video by SRC;
  - extraSrc (RegExp) - invalidate video by SRC;
- priority ([String]) - video source type priority - default: ['YOUTUBE', 'VIMEO', 'URL', 'IFRAME'];
- customFinders ([Finder]) - a list of custom video finders.
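
The options above nest under four top-level keys: `page`, `content`, `images` and `video`. The sketch below shows one possible options object. The key names and defaults are taken from the list, while the chosen values, the `User-Agent` string and the `explorePage` wrapper are illustrative assumptions. The `ratioAllowed` helper is not library code: it is only one reading of how `minRatio`, `maxRatio` and `invalidRatio` are described, with ratio = width / height.

```javascript
// Illustrative options; key names follow the documented list.
var options = {
  page: {
    timeout: 10000,                              // default is 5000
    headers: { 'User-Agent': 'my-crawler/1.0' }, // hypothetical header value
    lang: 'en'
  },
  content: {
    filter: { minLine: 50, minPhrase: 100, minScore: 0.3 }
  },
  images: {
    limit: 3,
    identify: true,  // download image data to detect width, height and type
    timeout: 2000,   // image downloading timeout, in ms
    filter: { minWidth: 400, minHeight: 300, invalidRatio: [1], types: ['jpg', 'png'] }
  },
  video: {
    limit: 1,
    priority: ['YOUTUBE', 'VIMEO', 'URL', 'IFRAME']
  }
};

// Hypothetical wrapper; the module is loaded lazily so that the options
// object can be inspected without html-explorer being installed.
function explorePage(url) {
  var explorer = require('html-explorer');
  return explorer.explore(url, options);
}

// One reading of the ratio filters (not the library's implementation).
function ratioAllowed(width, height, filter) {
  var ratio = width / height;
  // invalidRatio may be a single number or a list of numbers
  var invalid = [].concat(filter.invalidRatio == null ? [] : filter.invalidRatio);
  if (filter.minRatio != null && ratio < filter.minRatio) { return false; }
  if (filter.maxRatio != null && ratio > filter.maxRatio) { return false; }
  return invalid.indexOf(ratio) === -1;
}

console.log(ratioAllowed(300, 300, options.images.filter)); // square image
console.log(ratioAllowed(400, 300, options.images.filter)); // landscape image
```

Calling `explorePage('http://edition.cnn.com/')` would then pass these options as the second argument of `explorer.explore`.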

Changelog

v0.1.12 - July 16, 2016

filter page content by relevancy score option;
added lang option;
using ascripe module instead of readability-js;
using in-text-search module;

v0.1.11 - August 16, 2016

find videos from known iframes

v0.1.9 - August 15, 2015

explore content with readability-js
fix videos explore bug

v0.1.6 - August 3, 2015

explore videos from microdata

v0.1.5 - August 3, 2015

filter page content
better encoding detection & add to the response object

v0.1.4 - August 2, 2015

tests
extracting page content
editorconfig, eslint

v0.1.2 - June 17, 2015

custom video finders
sort videos by priority option
head(og:video) video finder

v0.1.1 - June 13, 2015

decode page urls
image downloading timeout

v0.1.0 - May 30, 2015

detect embedded videos
better images order

v0.0.8 - May 29, 2015

detect charset from content-type response header
image filter: invalidRatio

v0.0.7 - May 22, 2015

filter images by view size - width & height detected in image attributes
merge images with same src
