Spidergram ships with some sensible defaults for things like URL normalization and HTML parsing. Out of the box, it also assumes you're storing your data in a locally-installed copy of ArangoDB without a password.
It will treat whatever directory you're in when you run the spidergram command as its current "project directory," and store its temp files, downloads, final report output, etc. there.
Getting the most out of Spidergram, however, means cracking open a settings file and customizing how it does its business.
Config files and locations
When you run Spidergram from the command line, it checks the current directory for a spidergram.config.json file; if it doesn't find one there, it also checks for a dedicated config subdirectory. Config files can also be written in YAML or JSON5 (JSON5 is a less-strict version of JSON that supports inline comments and reads more like JavaScript code and less like an explosion in a quotation-mark factory).
If no config files are found, Spidergram will use its internal default settings for everything, including the connection to ArangoDB for data storage.
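As a rough illustration, a small JSON5 config file might look like the sketch below. The filename spidergram.config.json5 is an assumption (only spidergram.config.json is named above), as is the idea that relative paths resolve against the project directory; the two option names come from the Global settings table on this page.

```json5
// spidergram.config.json5 (assumed filename; only the .json name is confirmed above)
{
  // JSON5 allows comments, unquoted keys, single quotes, and trailing commas.

  // Documented default is <current-dir>/storage; the relative form is an assumption.
  storageDirectory: './storage',

  // Documented default is <current-dir>/output; 'reports' is a hypothetical override.
  outputDirectory: './reports',
}
```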
Configuration options
Spidergram has a lot of internal settings that can be customized. We'll cover the basics here, but more complete documentation on every individual flag will be coming soon on a dedicated API documentation site.
Global settings
| option | default | notes |
| --- | --- | --- |
| storageDirectory | <current-dir>/storage | |
| outputDirectory | <current-dir>/output | |
| database option | default | notes |
| --- | --- | --- |
| arango.url | 'http://127.0.0.1:8529' | |
| arango.databaseName | 'spidergram' | |
| arango.auth.username | 'root' | |
| arango.auth.password | '' | |
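For a password-protected ArangoDB instance, the database options might be overridden along these lines. This is a sketch, not a confirmed schema: it assumes that dotted option names such as arango.auth.password map to nested objects in the config file, and the database name and password shown are placeholders.

```json5
{
  arango: {
    url: 'http://127.0.0.1:8529',   // documented default
    databaseName: 'client_audit',   // hypothetical name; the default is 'spidergram'
    auth: {
      username: 'root',             // documented default
      password: 'change-me',        // placeholder; the default is an empty string
    },
  },
}
```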
| url normalizer option | default | notes |
| --- | --- | --- |
| normalizer.forceProtocol | 'https:' | |
| normalizer.forceLowercase | 'hostname' | |
| normalizer.discardSubdomain | 'ww*' | |
| normalizer.discardAnchor | true | |
| normalizer.discardAuth | true | |
| normalizer.discardIndex | '**/{index,default}.{htm,html,aspx,php}' | |
| normalizer.discardSearch | '!{page,p}' | |
| normalizer.sortSearchParams | true | |
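The sketch below shows how a normalizer override could be written. It rests on two assumptions: that dotted option names nest as objects, and that discardSearch is a glob-style pattern in which the negated names are the query parameters to keep. Under that reading, the hypothetical change here would also preserve a q parameter; everything else is left at its documented default.

```json5
{
  normalizer: {
    forceProtocol: 'https:',       // documented default
    forceLowercase: 'hostname',    // documented default
    discardSubdomain: 'ww*',       // documented default
    discardAnchor: true,           // documented default
    // Hypothetical override: the default is '!{page,p}'.
    // Assumed meaning: discard every search parameter except page, p, and q.
    discardSearch: '!{page,p,q}',
    sortSearchParams: true,        // documented default
  },
}
```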
Spider settings
| option | default | notes |
| --- | --- | --- |
| spider.userAgent | 'Spidergram' | |
| spider.maxConcurrency | 1 | The number of headless browsers to run simultaneously |
| spider.maxRequestsPerMinute | 120 | |
| spider.downloadMimeTypes | [] | An array of mime types to download for later parsing (* wildcards are supported) |
| spider.saveCookies | true | Save all set cookies for later parsing |
| spider.savePerformance | true | Save page loading and rendering data |
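A gentler crawl that also downloads PDFs and images might be configured roughly as follows. The nesting under a spider key is an assumption based on the dotted option names, and the specific numbers and MIME types are illustrative rather than recommended values.

```json5
{
  spider: {
    userAgent: 'Spidergram',      // documented default
    maxConcurrency: 2,            // illustrative; the default is 1 headless browser
    maxRequestsPerMinute: 60,     // illustrative; the default is 120
    // The default is an empty array; * wildcards are supported per the notes above.
    downloadMimeTypes: ['application/pdf', 'image/*'],
    saveCookies: false,           // illustrative; the default is true
    savePerformance: true,        // documented default
  },
}
```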
| url filter setting option | default | notes |
| --- | --- | --- |
| spider.urls.selectors | 'a' | |
| spider.urls.save | 'all' | Save links that match these criteria |
| spider.urls.crawl | 'same-domain' | Visit and crawl links that match these criteria |
| spider.urls.discardNonWeb | false | Discard non-http/https links |
| spider.urls.discardUnparsable | false | Discard malformed or incomplete links |
| spider.urls.recursionThreshold | 3 | Do not follow links if a path segment repeats more than this many times (e.g., example.com/directory/~/~/~/) |
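To keep a crawl free of mailto: links and broken hrefs, the URL filter settings could be tightened as in this sketch. It assumes the options nest under spider.urls in the config file; only the two discard flags differ from the documented defaults.

```json5
{
  spider: {
    urls: {
      selectors: 'a',            // documented default
      save: 'all',               // documented default
      crawl: 'same-domain',      // documented default
      discardNonWeb: true,       // illustrative; the default is false
      discardUnparsable: true,   // illustrative; the default is false
      recursionThreshold: 3,     // documented default
    },
  },
}
```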
Page Analysis
| data extraction setting | default | notes |
| --- | --- | --- |
| analysis.data.all | false | |
| analysis.data.attributes | true | Parse HTML attributes on the body tag; these are often used to store pagewide settings and design options |
| analysis.data.meta | true | Parse meta tags, including keywords, OpenGraph data, etc. |
| analysis.data.json | true | Parse JSON data embedded in script tags |
| analysis.data.schemaOrg | true | Parse Schema.org information embedded as JSON-LD data |
| analysis.data.links | false | Parse link tags |
| analysis.data.noscript | false | Parse noscript tags |
| analysis.data.scripts | false | Parse script tags |
| analysis.data.styles | false | Parse style tags |
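If link tags (canonical URLs, alternates, and so on) matter for a given audit, the data extraction flags might be adjusted like this. The nested analysis.data structure is an assumption drawn from the dotted option names; only the links flag is changed from its documented default.

```json5
{
  analysis: {
    data: {
      attributes: true,   // documented default
      meta: true,         // documented default
      json: true,         // documented default
      schemaOrg: true,    // documented default
      links: true,        // illustrative; the default is false
      noscript: false,    // documented default
      scripts: false,     // documented default
      styles: false,      // documented default
    },
  },
}
```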
| content analysis setting | default | notes |
| --- | --- | --- |
| analysis.content.selector | 'body' | CSS selector to use when extracting the page's core content |
| analysis.content.defaultToFullDocument | false | If the selector can't be found, fall back to the full page body |
| analysis.content.trim | true | |
| analysis.content.readability | true | Calculate the core content's readability score |
| analysis.content.readability.formula | 'FleschKincaid' | |
| analysis.content.readability.stats | true | Collect additional stats like word and sentence count |
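For a site that wraps its primary content in a main element, the content settings could be narrowed as sketched here. The selector value is hypothetical, and the nesting under analysis.content is an assumption. The readability options are left out because the listing above does not make clear how analysis.content.readability and its formula and stats sub-options combine in a config file.

```json5
{
  analysis: {
    content: {
      selector: 'main',              // hypothetical CSS selector; the default is 'body'
      defaultToFullDocument: true,   // illustrative; the default is false
      trim: true,                    // documented default
    },
  },
}
```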
| general analysis setting | default | notes |
| --- | --- | --- |
| analysis.tech | true | Scan each page for known web technologies |
| analysis.links | false | Rebuild each page's list of outbound links |
| analysis.site | 'parsed.hostname' | The dot-notation path of a Page property to use as its "site name" |
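The general analysis flags could be toggled in the same way. In this sketch, which again assumes that dotted names nest as objects, technology detection is switched off and link rebuilding is switched on; the site value is left at its documented default.

```json5
{
  analysis: {
    tech: false,               // illustrative; the default is true
    links: true,               // illustrative; the default is false
    site: 'parsed.hostname',   // documented default
  },
}
```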
Complex analysis options
These settings in particular are relatively complicated sub-structures that will receive additional documentation attention shortly. For the moment, some of the examples in the create-spidergram project demonstrate how they can be used.
| setting | notes |
| --- | --- |
| analysis.properties | A key/value structure describing how page properties should be remapped after parsing (e.g., moving a page's OpenGraph publish date to the content.published property) |
| analysis.patterns | An array of design pattern definitions that can be used to detect individual pattern appearances on each page |
| queries | A key/value structure containing named, reusable ArangoDB queries |
| reports | A key/value structure containing named, reusable reports and output formatting instructions |