Customizing Spidergram's settings

Spidergram ships with some sensible defaults for things like URL normalization and HTML parsing. Out of the box, it also assumes you're storing your data in a locally-installed copy of ArangoDB without a password.

It will treat whatever directory you're in when you run the spidergram command as its current "project directory," and store its temp files, downloads, final report output, etc. there.

Getting the most out of Spidergram, however, means cracking open a settings file and customizing how it does its business.

Config files and locations

When you run Spidergram from the command line, it checks the current directory for a spidergram.config.json file; if it doesn't find one there, it also checks for a dedicated config subdirectory. Config files can also be written in YAML or JSON5 (a less strict version of JSON that supports inline comments and reads more like JavaScript code and less like an explosion in a quotation-mark factory).
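As a rough illustration of the lookup order described above, the sketch below lists the paths a loader might check. Only `spidergram.config.json` is named in this document; the other file extensions, the `config` directory name, and the `candidateConfigPaths` helper are assumptions made for the example, not Spidergram's actual loader.

```typescript
// Assumed base name and extensions; only `spidergram.config.json` is confirmed above.
const BASE_NAME = "spidergram.config";
const EXTENSIONS = ["json", "json5", "yaml"];

// Assumed name of the "dedicated config subdirectory".
const CONFIG_SUBDIR = "config";

// Hypothetical helper: list candidate config paths in priority order,
// project directory first, then the config subdirectory.
function candidateConfigPaths(projectDir: string): string[] {
  const dirs = [projectDir, `${projectDir}/${CONFIG_SUBDIR}`];
  const candidates: string[] = [];
  for (const dir of dirs) {
    for (const ext of EXTENSIONS) {
      candidates.push(`${dir}/${BASE_NAME}.${ext}`);
    }
  }
  return candidates;
}
```

A loader built on this would use the first candidate that exists on disk and fall back to the internal defaults if none do.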

If no config files are found, Spidergram will use its internal default settings for everything, including the connection to ArangoDB for data storage.

Configuration options

Spidergram has a lot of internal settings that can be customized. We'll cover the basics here, but more complete documentation on every individual flag will be coming soon on a dedicated API documentation site.

Global settings

option	default	notes
storageDirectory	`<current-dir>/storage`
outputDirectory	`<current-dir>/output`

database option	default	notes
arango.url	`'http://127.0.0.1:8529'`
arango.databaseName	`'spidergram'`
arango.auth.username	`'root'`
arango.auth.password	`''`
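The dotted option names suggest a nested structure. The sketch below assumes that nesting and shows how a partial override could be layered over the database defaults listed above. The `ArangoSettings` type and the `withOverrides` helper are hypothetical names used only for illustration.

```typescript
// Shape assumed from the dotted option names (arango.auth.username, etc.).
interface ArangoSettings {
  url: string;
  databaseName: string;
  auth: { username: string; password: string };
}

// Default values copied from the table above.
const defaultArango: ArangoSettings = {
  url: "http://127.0.0.1:8529",
  databaseName: "spidergram",
  auth: { username: "root", password: "" },
};

// Hypothetical helper: any option left out of `overrides` keeps its default.
function withOverrides(
  defaults: ArangoSettings,
  overrides: {
    url?: string;
    databaseName?: string;
    auth?: Partial<ArangoSettings["auth"]>;
  },
): ArangoSettings {
  return {
    url: overrides.url ?? defaults.url,
    databaseName: overrides.databaseName ?? defaults.databaseName,
    auth: { ...defaults.auth, ...overrides.auth },
  };
}

// Example: change only the database name and password.
const arangoSettings = withOverrides(defaultArango, {
  databaseName: "my-crawl",
  auth: { password: "s3cret" },
});
```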

URL normalizer option	default	notes
normalizer.forceProtocol	`'https:'`
normalizer.forceLowercase	`'hostname'`
normalizer.discardSubdomain	`'ww*'`
normalizer.discardAnchor	`true`
normalizer.discardAuth	`true`
normalizer.discardIndex	`'**/{index,default}.{htm,html,aspx,php}'`
normalizer.discardSearch	`'!{page,p}'`
normalizer.sortSearchParams	`true`
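To make the combined effect of these defaults concrete, here is a simplified sketch built on the standard `URL` class. It is not Spidergram's implementation: the glob-style options are approximated with plain checks, and reading `'!{page,p}'` as "discard every query parameter except `page` and `p`" is an assumption.

```typescript
// Simplified approximation of the default normalizer settings.
function normalizeUrl(input: string): string {
  const url = new URL(input);

  // forceProtocol: 'https:' (only plain http is upgraded in this sketch)
  if (url.protocol === "http:") url.protocol = "https:";

  // forceLowercase: 'hostname' (the URL parser already lowercases http(s) hostnames)

  // discardSubdomain: 'ww*' (drop a leading ww* label when a domain remains)
  const labels = url.hostname.split(".");
  if (labels.length > 2 && /^ww[^.]*$/.test(labels[0])) {
    url.hostname = labels.slice(1).join(".");
  }

  // discardAnchor and discardAuth
  url.hash = "";
  url.username = "";
  url.password = "";

  // discardIndex: trailing index.* or default.* file names collapse to the directory
  url.pathname = url.pathname.replace(/\/(index|default)\.(htm|html|aspx|php)$/i, "/");

  // discardSearch: '!{page,p}' (assumed to mean: keep only `page` and `p`)
  const keep = ["page", "p"];
  for (const key of Array.from(url.searchParams.keys())) {
    if (keep.indexOf(key) === -1) url.searchParams.delete(key);
  }

  // sortSearchParams
  url.searchParams.sort();

  return url.href;
}
```

Under these assumptions, `http://user:pw@WWW.Example.com/docs/index.html?utm_source=x&page=2#top` and `https://example.com/docs/?page=2` normalize to the same string, so they would be treated as one page.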

Spider settings

option	default	notes
spider.userAgent	`'Spidergram'`
spider.maxConcurrency	`1`	The number of headless browsers to run simultaneously
spider.maxRequestsPerMinute	`120`
spider.downloadMimeTypes	`[]`	An array of mime types to download for later parsing (`*` wildcards are supported)
spider.saveCookies	`true`	Save all set cookies for later parsing
spider.savePerformance	`true`	Save page loading and rendering data
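The `*` wildcard support in `spider.downloadMimeTypes` can be pictured with a small matcher like the one below. The `mimeMatches` and `shouldDownload` helpers are hypothetical, and treating `*` as "any run of characters" is an assumption about how the patterns are interpreted.

```typescript
// Hypothetical matcher: `*` stands for any run of characters, case-insensitively.
function mimeMatches(pattern: string, mimeType: string): boolean {
  // Escape regex metacharacters other than `*`, then expand `*` to `.*`.
  const escaped = pattern.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
  const regex = new RegExp("^" + escaped.replace(/\*/g, ".*") + "$", "i");
  return regex.test(mimeType);
}

// A response is downloaded when any configured pattern matches its MIME type.
function shouldDownload(patterns: string[], mimeType: string): boolean {
  return patterns.some((pattern) => mimeMatches(pattern, mimeType));
}
```

With this reading, a setting of `['application/pdf', 'image/*']` would match PDFs and every image type, while the default empty array matches nothing.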

URL filter option	default	notes
spider.urls.selectors	`'a'`
spider.urls.save	`'all'`	Save links that match these criteria
spider.urls.crawl	`'same-domain'`	Visit and crawl links that match these criteria
spider.urls.discardNonWeb	`false`	Discard non-http/https links
spider.urls.discardUnparsable	`false`	Discard malformed or incomplete links
spider.urls.recursionThreshold	`3`	Do not follow links if a path segment repeats more than this many times (e.g., `example.com/directory/~/~/~/`)
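The recursion threshold guards against crawler traps where a broken relative link keeps appending the same path segment. A minimal sketch of such a check follows. The `exceedsRecursionThreshold` helper is hypothetical, and counting every occurrence of a segment (rather than only consecutive repeats) is an assumption.

```typescript
// Hypothetical check: true when any path segment appears more than `threshold` times.
function exceedsRecursionThreshold(url: string, threshold = 3): boolean {
  const segments = new URL(url).pathname
    .split("/")
    .filter((segment) => segment.length > 0);

  const counts = new Map<string, number>();
  for (const segment of segments) {
    const count = (counts.get(segment) ?? 0) + 1;
    if (count > threshold) return true;
    counts.set(segment, count);
  }
  return false;
}
```

A spider using this check would save the link as usual but skip visiting it once the function returns `true`.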

Page Analysis

data extraction setting	default	notes
analysis.data.all	`false`
analysis.data.attributes	`true`	Parse HTML attributes on the `body` tag; these are often used to store pagewide settings and design options
analysis.data.meta	`true`	Parse meta tags, including keywords, OpenGraph data, etc.
analysis.data.json	`true`	Parse JSON data embedded in `script` tags
analysis.data.schemaOrg	`true`	Parse Schema.org information embedded as `JSON-LD` data
analysis.data.links	`false`	Parse `link` tags
analysis.data.noscript	`false`	Parse `noscript` tags
analysis.data.scripts	`false`	Parse `script` tags
analysis.data.styles	`false`	Parse `style` tags
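For a sense of what the `analysis.data.schemaOrg` option collects, the sketch below pulls JSON-LD blocks out of an HTML string. It uses a regular expression purely to stay dependency-free; this is an illustrative assumption, not how Spidergram parses pages, and `extractJsonLd` is a hypothetical helper.

```typescript
// Hypothetical helper: collect the parsed contents of every
// <script type="application/ld+json"> block in an HTML string.
function extractJsonLd(html: string): unknown[] {
  const pattern =
    /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
  const results: unknown[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(html)) !== null) {
    try {
      results.push(JSON.parse(match[1]));
    } catch {
      // Skip blocks that do not contain valid JSON.
    }
  }
  return results;
}
```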

content analysis setting	default	notes
analysis.content.selector	`'body'`	CSS selector to use when extracting the page's core content
analysis.content.defaultToFullDocument	`false`	If the selector can't be found, fall back to the full page body
analysis.content.trim	`true`
analysis.content.readability	`true`	Calculate the core content's readability score
analysis.content.readability.formula	`'FleschKincaid'`
analysis.content.readability.stats	`true`	Collect additional stats like word and sentence count
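The default `'FleschKincaid'` formula is assumed here to refer to the standard Flesch-Kincaid grade level, which is computed from word, sentence, and syllable counts. The sketch below applies that published formula to counts that are already known; how Spidergram gathers the counts is not covered, and the function name is hypothetical.

```typescript
// Flesch-Kincaid grade level:
//   0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
function fleschKincaidGrade(
  words: number,
  sentences: number,
  syllables: number,
): number {
  // Avoid dividing by zero for empty content.
  if (words === 0 || sentences === 0) return 0;
  return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59;
}
```

Longer sentences and longer words both push the score up, so a higher number indicates text that is harder to read.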

general analysis setting	default	notes
analysis.tech	`true`	Scan each page for known web technologies
analysis.links	`false`	Rebuild each page's list of outbound links
analysis.site	`'parsed.hostname'`	The dot-notation path of a Page property to use as its "site name"
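The "dot-notation path" used by `analysis.site` can be read as a chain of property lookups. The sketch below resolves such a path against a plain object. The `getByPath` helper and the shape of `examplePage` are hypothetical, apart from the `parsed.hostname` path taken from the default above.

```typescript
// Hypothetical helper: walk an object one key at a time, returning
// undefined as soon as a step cannot be followed.
function getByPath(source: unknown, path: string): unknown {
  let current: unknown = source;
  for (const key of path.split(".")) {
    if (current === null || typeof current !== "object") return undefined;
    current = (current as Record<string, unknown>)[key];
  }
  return current;
}

// Illustrative page record; real Page objects carry many more properties.
const examplePage = {
  url: "https://example.com/about",
  parsed: { hostname: "example.com", pathname: "/about" },
};

const siteName = getByPath(examplePage, "parsed.hostname");
```

Pointing the setting at a different path would group pages by that property instead of by hostname.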

Complex analysis options

These settings are relatively complex sub-structures, and fuller documentation for them is planned. For the moment, some of the examples in the create-spidergram project demonstrate how they can be used.

setting	notes
analysis.properties	A key/value structure describing how page properties should be remapped after parsing. (e.g., moving a page's OpenGraph publish date to the `content.published` property)
analysis.patterns	An array of design pattern definitions that can be used to detect individual pattern appearances on each page.
queries	A key/value structure containing named, reusable ArangoDB queries
reports	A key/value structure containing named, reusable reports and output formatting instructions
