- Cookies added to `spider.cookies` in the config file will be injected into all crawl requests, making crawling/analysis of logged-in sites possible. (A config sketch appears at the end of this list.)
- The accompanying `spidergram login <url>` CLI command pops up a browser window, lets you log into a site, then records the cookies that were set during the login process. The cookies are saved to a JSON file for analysis or incorporation into the crawl config file.
- Add support for both GET and HEAD in prefetch requests to determine a URL's status and mimetype; HEAD is still the default, but some sites only respond to GET, because the internet is terrible.
- Dependency updates (particularly Crawlee)
- Behind the scenes changes to make it easier to work with non-spider data, like CMS exports and spreadsheets generated by other spidering tools. This work rests on top of the analyzer changes, and in future versions of Spidergram will make it easier to build complex analysis operations on top of existing data, without the need for an explicit crawl.
- Related work to decouple Spidergram's property discovery and mapping code from its page-centric crawling and reporting. That property manipulation system is what allows the `analyze` command to combine information from a page's headers, CSS queries for HTML properties, calculated stats about its content, and so on into new properties; it's quite powerful, but was built in a way that makes it tough to use in other situations where it would make sense — like analyzing pattern library usage and analytics information. While this version won't see inherent changes to the reporting features, they're coming — along with simpler analyzer configuration that's less dependent on the overall crawl process.
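As a rough illustration of the cookie injection option above: the property name comes from these notes, but the cookie fields are an assumption based on Playwright's cookie shape, so treat this as a sketch rather than a canonical schema.

```ts
// spidergram.config.ts — illustrative only; cookie fields assume Playwright's shape
export default {
  spider: {
    cookies: [
      // injected into every crawl request, enabling logged-in crawling
      { name: 'session_id', value: 'abc123', domain: '.example.com', path: '/' },
    ],
  },
};
```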
- Fix sitemap and robots.txt processing — they were saving but failing to read from the correct location
- Add a `spidergram sitemap` command to gather URLs from one or more sites' robots.txt and sitemap.xml files
- Use an in-memory URL cache during crawling to avoid thrashing the DB with "have we seen it before?" checks
- Fix ingestion of JSON files when using the `spidergram import` command
- Made 'resume' behavior consistent across the sitemap and crawl commands; resume is now the default, but can be toggled off as desired.
- Added `spidergram import <file>` to manually populate target URLs and import arbitrary datasets for reporting.
- Added `spidergram pagespeed` to run Google's PageSpeed API reporting tool on sets of URLs.
- Pass all config values to the crawler even when other options are set from the CLI
- Support split/slice/join operations for string and array values when property mapping
- Re-enable Robots.txt and Sitemap downloading
- Recover gracefully from JSON parsing errors during page analysis
- Added a 'replace' operation to the global URL normalizer for correcting specific borked URL patterns
- Update to the latest version of Crawlee; this enables the `config.spider.sameDomainDelaySecs` property
- Remove an axe library workaround that's no longer necessary
- Grab link tags in the HTML HEAD by default during extraction
- Fixed a flag-handling error that caused auto-extracted data to be overwritten in some analysis modes
- Fixed issues in shadow DOM expansion; set `config.spider.shadowDom` to TRUE to activate the behavior. NOTE: By default Spidergram only waits until 'domcontentloaded', which may not populate shadow elements. If shadow DOM elements aren't appearing, try setting `config.spider.waitUntil` to 'networkidle'. (A config sketch follows this list.)
- Screenshot pathnames now include the cropped/full flag, so batch output won't overwrite older screenshots.
- Excel reports that split based on a column with empty values no longer error out.
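A minimal sketch of the shadow DOM settings described above, using the property names from these notes:

```ts
// spidergram.config.ts — sketch only
export default {
  spider: {
    shadowDom: true,          // expand shadow DOM elements during extraction
    waitUntil: 'networkidle', // default is 'domcontentloaded', which can fire before shadow elements populate
  },
};
```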
The Wappalyzer project has taken its GPL-licensed code repositories private. While the latest release exists on NPM, the uncached technology fingerprint definitions are no longer available from Wappalyzer's GitHub account. For the time being, a fork of the last public Wappalyzer release is being used.
- Complex multi-source property matches no longer skip matches when filtering results
- Fixed persistence of patterns and pattern instances
- Use a dedicated fork of the Wappalyzer project for tech definitions (longer-term fix in progress)
- Fix early bailout on property map values with multiple sources
- Update urlFilter tests to match the true/false/null response scenarios
- Fixed URL crawl/save filtering. If multiple filters are supplied, any match will cause the URL to be treated as a match. Explicit rejection of URLs is still possible using the full UrlFilter syntax, e.g. `crawl: { property: 'hostname', glob: '*foo.com', reject: true }`.
- Added a `collapseSearchParams` normalizer option, so borked URL Search Param values like `page=1?page=2?page=3` can be collapsed to the last value in the list. The config value should be a glob pattern matching Search Param keys, e.g. `'name'` or `'{name,id,search}'`.
- Added support for stealth crawling; setting `spider.stealth` to TRUE in the Spidergram config will use the `playwright-extras` plugin to mask the crawler's identity. This is experimental and turned off by default; some pages currently cause it to crash the spider, requiring repeated restarts of the crawler to finish a site. (A config sketch covering these filtering and normalizer options follows this list.)
- Added a `delete` CLI command that can be used to remove crawl records and dependent relationships. It uses the same filtering syntax as the `query` CLI command, but is obviously much more dangerous. Using `query` first, then `delete`-ing once you're sure of the results, is strongly recommended. It's particularly useful when you'd like to 'forget' and re-crawl a set of pages. In the future we'll add support for explicitly recrawling without this dangerous step, but for now it's quite handy.
- Bumped `@axe-core/playwright` to version 4.7.1
- Disable pattern discovery and site name extraction when using the `ping` command to avoid altering crawl data
- Removed an outdated reference to the old pattern_instances collection.
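A sketch pulling together the filtering and normalizer options above; exactly where each key lives (and whether `crawl` takes an array of filters) is inferred from these notes rather than a documented schema:

```ts
// spidergram.config.ts — illustrative sketch, not a canonical config
export default {
  spider: {
    stealth: true, // experimental identity masking
    urls: {
      crawl: [
        { property: 'hostname', glob: '*.example.com' },          // crawl anything on example.com...
        { property: 'path', glob: '**/archive/**', reject: true }, // ...but explicitly reject archive pages
      ],
    },
  },
  normalizer: {
    collapseSearchParams: '{page,sort}', // collapse borked repeats like page=1?page=2 to the last value
  },
};
```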
This release is dedicated to Peter Porker of Earth-8311, an innocent pig raised by animal scientist May Porker. After a freak accident with the world's first atomic-powered hairdryer, Peter was bitten by the scientist and transformed into a crime-fighting superhero pig.
- Custom queries and multi-query reports can be defined in the Spidergram config files; Spidergram now ships with a handful of simple queries and an overview report as part of its core configuration.
- Spidergram can run an Axe Accessibility Report on every page as it crawls a site; this behavior can be turned on and off via the `spider.auditAccessibility` config property.
- Spidergram can now save cookies, performance data, and remote API requests made during page load using the `config.spider.saveCookies`, `.savePerformance`, and `.saveXhr` config properties.
- Spidergram can identify and catalog design patterns during the post-crawl page analysis process; pattern definitions can also include rules for extracting pattern properties like a card's title and CTA link.
- Resources with attached downloads can be processed using file parsing plugins; Spidergram 0.10.0 comes with support for PDF and .docx content and metadata, image EXIF metadata, and audio/video metadata in a variety of formats.
- The `config.spider.seed` setting lets you set one or more URLs as the default starting points for crawling.
- For large crawls, an experimental `config.offloadBodyHtml` settings flag has been added to Spidergram's global configuration. When it's set to 'db', all body HTML will be stored in a dedicated key-value collection rather than the `resources` collection. On sites with many large pages (50k+ pages of 500k+ HTML or more) this can significantly improve the speed of filtering, queries, and reporting. (Both options are sketched below.)
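A minimal sketch of the two settings above; whether `seed` takes a single string or an array is an assumption based on the "one or more URLs" description:

```ts
// spidergram.config.ts — sketch only
export default {
  offloadBodyHtml: 'db', // store body HTML in a dedicated key-value collection instead of `resources`
  spider: {
    seed: ['https://example.com', 'https://blog.example.com'], // default crawl starting points
  },
};
```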
- Spidergram's CLI commands have been overhauled; vestigial commands from the 0.5.0 era have been removed and replaced. Of particular interest:
  - `spidergram status` summarizes the current config and DB state
  - `spidergram init` generates a fresh configuration file in the current directory
  - `spidergram ping` tests a remote URL using the current analysis settings
  - `spidergram query` displays and saves filtered snapshots of the saved crawl graph
  - `spidergram report` outputs a collection of query results as a combined workbook or JSON file
  - `spidergram go` crawls one or more URLs, analyzes the crawled files, and generates a report in a single step
  - `spidergram url test` tests a URL against the current normalizer and filter settings
  - `spidergram url tree` replaces the old `urls` command for building site hierarchies
- CLI consistency is significantly improved. For example, `analyze`, `query`, `report`, and `url tree` all support the same `--filter` syntax for controlling which records are loaded from the database.
- URL matching and filtering has been smoothed out, and a host of tests have been added to ensure things stay solid. Previously, filter strings were treated as globs matched against the entire URL. Now, `{ property: 'hostname', glob: '*.foo.com' }` objects can be used to explicitly specify glob or regex matches against individual URL components, as sketched below.
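To make the component-scoped filters concrete, here is a hedged sketch using `UrlTools.filterUrl()` (mentioned further down in these notes); the import path and exact signature are assumptions:

```ts
import { UrlTools } from 'spidergram';

// Property-scoped glob and regex filters, per the notes above
const url = new URL('https://docs.example.com/guides/intro.pdf');
UrlTools.filterUrl(url, { property: 'hostname', glob: '*.example.com' }); // matches
UrlTools.filterUrl(url, { property: 'path', regex: '.*\\.pdf' });         // matches
```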
- Major improvements to the report structure.
  - Reports defined in project configuration can reuse pre-defined queries with additional filters and return values.
  - Reports can include custom settings to control output format and export options.
  - XLSX formatted reports can override header styling, alter column header names, and populate Excel document metadata.
  - XLSX formatted reports now auto-resize column widths (with an imposed max width) by default.
- Useful base queries and partial "query fragments" are now available in the `config.queries` global config property. Queries in that collection can be used in reports or referenced as 'base queries'.
- A fix in the underlying AqBuilder library means queries and reports with aggregates should now sort correctly.
- Added a new `NamedEntity` abstract base type that can be used for graph entities that should be referred to by unique names rather than arbitrary IDs.
- Added `Site` and `Pattern` NamedEntity types, and refocused the unused `AppearsOn` Relationship type as a dedicated Resource-To-Pattern relationship, replacing the old `Fragment` entity type.
- Added a `config.analysis.site` setting; it can specify either a `PropertyMap` structure or a custom function to determine the site a given page belongs to. By default, it's set to `parsed.hostname`.
- Added a `-d --designPatterns` flag to `spidergram analyze`; when set, it will use the `config.analyze.patterns` settings to extract and catalog the patterns that appear on a given page.
- Added an `-r --reprocess` flag to `spidergram analyze`; unless this flag is set, it will ignore pages that have already been analyzed.
- Renamed the `PropertySource` type to `PropertyMap`
- Renamed several configuration options:
  - `config.pageAnalysis` is now `config.analysis`
  - `config.pageAnalysis.propertyMap` is now `config.analysis.properties`
  - `config.urlNormalizer` is now `config.normalizer`
- Improved error handling for fetched downloads
- Efficiency improvements to propertyMapping in situations where many rules use DOM selectors; the resulting DOM object is now created once and reused for each rule.
- `spidergram crawl` was overwriting the save and crawl rules with the fallback defaults. Those responsible for the defaulting have been defaulted.
- Changed `spidergram test url` to `spidergram url test` and `spidergram tree` to `spidergram url tree` for consistency.
- Added a `urlNormalizer.discardFirstSegment` settings option for more focused removal of 'www' and similar prefixes. It accepts literals and globs like `urlNormalizer.discardSubdomain`, but only discards the first hostname segment. This allows `www.subdomain.domain.com` to become `subdomain.domain.com`, while the `urlNormalizer.stripSubdomain` setting would transform it to `domain.com`.
- Added `PropertySource.value` so mapped properties can return a 'clean' hard-coded value after finding an ugly one.
- `UrlTools.filterUrl()` now supports property-scoped `{ property: 'hostname', glob: '*.example.com' }` and `{ property: 'path', regex: '.*\.pdf' }` expressions, simplifying the patterns necessary to match specific URL components.
- Expanded the configuration options for reports:
  - `report.dropEmptyQueries` does what it says on the tin
  - `report.pivotSingleResults` triggers a check for queries that return only one row, and pivots them for friendlier display. Still experimental.
  - The `report.queries` list allows a new 'modified query' structure, which includes both a pointer to an already-defined base query and a set of additional filters, return values, and so on. This allows you to reuse complex base queries, then filter them to a specific subdomain or other criteria without copying and pasting the underlying definition.
  - `report.modifications` is an optional list of modifications that will be made to each query in the report.
  - The `spidergram report` command now supports the `--filter` flag; any filters from the command line will be added as 'modifications' to the report when it runs, allowing you to build a universal report and run it multiple times with different filters. (A config sketch follows this list.)
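The 'modified query' structure is described above only in prose; here is a loose sketch of how such a report definition might look — the field names (`base`, `filters`) are guesses at the shape described, not a documented schema:

```ts
// spidergram.config.ts — hypothetical report definition
export default {
  reports: {
    errorOverview: {
      dropEmptyQueries: true,
      pivotSingleResults: true,
      queries: [
        'errorPages', // reuse a pre-defined base query as-is
        {
          base: 'errorPages', // hypothetical 'modified query' shape
          filters: [{ property: 'parsed.hostname', glob: 'docs.example.com' }],
        },
      ],
      modifications: [], // optional tweaks applied to every query in the report
    },
  },
};
```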
- Added a simple check for certain recursive URL chains (like `http://example.com/~/~/~/~`). `spider.urls.recursivePathThreshold` is set to 3 by default, and setting it to 1 or less turns off the recursion check.
- The `spider.auditAccessibility` setting now allows the full audit to be saved to a separate table, with several summary formats (by impact, by category) for the primary results saved to the Resource.
- XmlHttpRequests detected during page load are fed into the Tech Fingerprinting tool; this improves the detection rate for many third-party APIs.
- Custom technology fingerprint rules can be added to the project configuration for site-specific libraries and APIs.
- Report definitions can specify their output filename. `report.outputPath` can also include `{{date}}` and `{{name}}` wildcards; the name of the report and the current ISO date will be inserted when the file is written. If no output path is given, `{{date}} - {{name}}` is the fallback.
- Moved a number of our standalone browser manipulation functions to the `BrowserTools` collection
- Moved core URL filtering code out of the 'spider' codebase into `UrlTools`, where it's easier to use outside of the browser context.
- Added the ability to save a list of XMLHttpRequests made during the page load. This can be toggled on and off with the `spider.saveXhrList` option.
- Added a `--concurrency` flag to `spidergram analyze`, allowing multiple pages to be processed simultaneously. (Several of the options above are sketched in the config example below.)
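A combined sketch of several options from the items above, using the property names given in these notes; the report name and exact nesting are illustrative:

```ts
// spidergram.config.ts — sketch only
export default {
  spider: {
    auditAccessibility: 'summary',       // or TRUE for the full Axe results
    saveXhrList: true,                   // record XHRs made during page load
    urls: { recursivePathThreshold: 3 }, // default; 1 or less disables the check
  },
  reports: {
    overview: { outputPath: '{{date}} - {{name}}' }, // also the fallback when unset
  },
};
```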
- The new `QueryFragment` utility collection holds reusable query specs that can be used to instantiate a new Query object before adding custom filters and aggregations.
- The `spidergram crawl` CLI command features a `--resume` flag that attempts to pick up a crawl where it was paused or aborted. It will be improved in coming versions, but for now it can be tinkered with.
- When PDF files are parsed, any clickable links inside the file are saved as URLs; although they're not yet fed back into the crawler, it's a start.
- The `spider.urlOptions` setting has been renamed to `spider.urls`, and `spider.urlOptions.enqueue` is now `spider.urls.crawl`. This is the visible part of a broader refactoring of URL filtering to make it faster, more flexible, and more reliable when using wildcard matches or custom filter logic.
- The "generic" fallback normalizer is now exposed as the `genericNormalizer` property on any Spidergram instance. That makes it easier for custom normalizer functions to leverage it for most URLs while special-casing exceptions.
- The genericNormalizer now supports a `supplySubdomain` option; it's FALSE by default, but if it's set to a string, that string will be used as the subdomain when raw TLDs are encountered. For example, if it's set to 'www', `http://example.com` will be transformed to `http://www.example.com` but `http://news.example.com` will remain untouched.
- A new (hidden) `spidergram test url` command allows you to paste in any URL and see how the current normalizer settings will process it. In addition, it will make a best guess at whether the URL will be saved or enqueued during the crawling process, based on the current config settings.
- On extremely large crawls (100K+ pages, 500K+ HTML for each page) ad-hoc queries and reports can become very slow; the 'resources' collection that holds page metadata also holds the raw HTML, and scans through it for other properties can bog down. We've introduced a very experimental `offloadBodyHtml` flag to the global Spidergram options: when it's set to `db`, Spidergram will stick Resource body HTML into a separate key-value store and look it up as needed. The intent is to be as invisible as possible to most code, though there may be some situations where it's necessary to call `await res.loadBody()` manually after loading a resource. In the future we'll be experimenting with filesystem-based storage of body HTML as well.
- Downloaded files attached to a resource can now be parsed as part of the analysis process; the metadata extracted from them appears in the `content` and `data` properties of the resource object, just like information extracted from HTML files.
- Saved entities now include `_created` and `_modified` timestamps in ISO format; this can make identifying stale crawl data much simpler.
- When a crawl completes, the final statistics are saved to the `ds_crawl_stats` Dataset. Again, this can be useful for tracking the time windows and performance profiles of multiple crawls, or partial crawls of a single site.
- The `WorkerQuery` class now supports concurrency and rate-limiting. This is most useful when loading a bunch of entities and performing a remote API request for each of them.
- Speaking of remote APIs, Google's PageSpeed Insights API is now supported via the `GoogleTools.PageSpeed` class.
- The Axe Accessibility auditor and Wappalyzer fingerprinting tool have been refactored to match the PageSpeed class's conventions. Each has a static class with an async run() method that kicks off its remote API request, and optional formatting functions that can be used on the results. This change is invisible for anyone who was using the CLI tools, but it does change the syntax slightly for anyone who was using the Spidergram API in a custom NodeJS project.
- The `getPageData` function can now (attempt to) parse out Schema.org metadata; if `options.schemaOrg` is set to FALSE, it will leave the Schema.org structures untouched as raw JSON+LD in the `pageData.json` values.
- Property maps can include `matches`, `limit`, and `join` properties; if a property value is found during mapping, and it's an array, `matches` filters the list to items matching a wildcard, `limit` ensures a max number of results, and `join` concatenates them into a single string using a specified delimiter. While they can't cover every scenario, when combined they can handle simple tasks like turning multiple found tags and authors into a delimited list, or grabbing the first class that starts with 'tmpl-' from the body attributes. (A sketch follows below.)
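A hedged sketch of mapped properties using those options; the `analysis.properties` placement and the `source` key are assumptions drawn from other notes in this document:

```ts
// spidergram.config.ts — illustrative property maps only
export default {
  analysis: {
    properties: {
      // join every author found during extraction into one comma-delimited string
      authors: { source: 'data.authors', matches: '*', join: ', ' },
      // grab the first body class that starts with 'tmpl-'
      template: { source: 'data.bodyClasses', matches: 'tmpl-*', limit: 1 },
    },
  },
};
```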
- Fixing overly-optimistic handling of page data in the `spidergram ping` command.
- The `spidergram crawl` command no longer dumps a pile of JSON to the console; now it just summarizes the pages crawled.
- The `HtmlTools.getPlaintext` helper function now has `getReadableText` and `getVisibleText` variations that use different rendering presets to approximate visible-in-browser and heard-by-screenreader text. While it's imperfect and likely to be refined, it can be useful for quick smoke tests.
- The `HtmlTools.getUniqueSelector` helper function constructs a best-guess unique CSS selector string for a given cheerio element.
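A quick sketch of those helpers; the exact signatures (HTML string in, plain text out) are assumptions:

```ts
import { HtmlTools } from 'spidergram';

const html = '<main><h1>Hello</h1><p>Welcome to the example site.</p></main>';
const readable = HtmlTools.getReadableText(html); // approximates what a screen reader announces
const visible = HtmlTools.getVisibleText(html);   // approximates what renders in the browser
```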
- Fixing path resolution for core configuration when installed globally.
- The core config directory is now included, which helps the default configuration work properly.
- Axe accessibility testing can be enabled for all pages in a crawl using the `spider.auditAccessibility` configuration flag. Setting it to TRUE returns full raw results for every page, while setting it to 'summary' yields a more manageable summary of a11y violations by severity.
- The `spidergram ping` (formerly `probe`) command now uses the core data extraction and page analysis functions; this makes it a useful way to preview what kinds of data Spidergram will find at a given URL.
- The `spidergram analyze` command now supports the `--filter <x>` and `--limit <n>` flags, making it easier to reprocess specific subsets of the crawl data after tweaking content extraction rules.
- The `spidergram tree` (formerly `urls`) command now supports the `--filter <x>` flag when pulling URLs from the database. This makes it easier to build trees of specific sections or subsites.
- The `spidergram query` command no longer defaults to a limit of 20 records; while it's now easier to spam yourself, it's also less frustrating to generate useful output.
- The `spidergram ga`, `spidergram init`, `spidergram db`, `spidergram project`, and `spidergram sitemap` commands have been removed, as they rely on deprecated internal helpers or have been superseded by other functions. In particular, `spidergram status` now displays global settings and DB connection info, while `spidergram cleanup` performs a variety of tidying tasks that were previously `db` subcommands.
- Technology fingerprinting now correctly includes header, meta tag, and script data when processing saved resources; technology fingerprinting should be noticeably more accurate.
- Multi-query reports can be defined using the `Report` class, either by defining a `ReportConfig` object, or by building a custom Report subclass that handles its own data collection and output.
- Named Reports (or ReportConfig objects) can be defined in the `reports` property of the Spidergram config for reuse. The `spidergram report` CLI command can be used to trigger and run any defined reports.
- A new `spidergram go` CLI command accepts one or more URLs, and kicks off a crawl, page analysis, and report generation based on the currently loaded config, in a single command.
- The previous `spidergram report` command has been renamed `spidergram query`, as it's meant for running single ad-hoc queries.
- Raw AQL queries can be embedded as strings in the config's `queries` collection.
- Config files in JSON5 format are now supported, for folks who don't buy 50gal drums of double-quotes at Costco.
- The shared `analyzePage()` method can now rebuild a page's `LinksTo` metadata; this is useful when region selectors have changed and you want to make sure your link reporting stays up to date. The `analyze` CLI command now has a `--links` flag, though it defaults to false. The `relink` CLI command has been removed.
- The `BrowserTools.getAxeReport` method, given a Playwright page handle, runs a full accessibility audit on the live page.
- To help keep final artifacts separate from internal data, a separate `outputDirectory` property has been added to Spidergram's configuration. A new 'output' file bucket is also part of the default configuration, and can be written to/read from by passing its name into the Spidergram config object's `files()` method. Passing in the name of a non-existent bucket will fall back to the default storage directory.
- Graph queries can be saved as named reporting presets in the 'queries' section of Spidergram's configuration. The `spidergram report` CLI now offers a `--query` flag that can use these presets by name. An 'errorPages' query is built into the default settings: `spidergram report -q errorPages`. A `spidergram report --list` flag is now available to list stored queries, as well.
- The spider now saves any cookies set during the page load on `Resource.cookies`. This improves the accuracy of technology fingerprinting, and can be useful in reconstructing certain on-page behavior. It can be turned off using the `spider.saveCookies` setting.
- Page analysis PropertyMaps now support a 'limit' option to be used in conjunction with the 'selector' option. It enforces a hard limit on the number of values that will be returned to populate the new property; using '1' as the limit will ensure a single value, never an array of values.
- The `getPageContent` function now uses Cheerio to narrow down the page to its content region before passing HTML into `htmlToText`. Although HtmlToText can accept body selectors in its own config, its selector engine lacks support for some common CSS combinators, making common queries fail silently.
- Several deprecated options have been culled from the `EnqueueUrlOptions` interface. Flags that controlled robots.txt and sitemap.xml auto-discovery previously lived here, but complicated the URL search code unnecessarily. As we've accumulated better lifecycle control options for the Spider itself, they're no longer needed.
- An additional option — `regions` — has been added to `EnqueueUrlOptions`. It can contain a dictionary of named selectors that will be used to chop the page up into named regions before searching for links. Links that are found will be labeled with the name of the region they're found in, and those labels will be preserved in the `LinksTo.label` property in the final crawl database. That property can then be used to filter graph traversal queries that map paths through a site. Super fun stuff. (A sketch follows below.)
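A hedged sketch of the `regions` option; nesting it under `spider.urlOptions` follows the option name used in this era of the config, and the selectors are placeholders:

```ts
// spidergram.config.ts — illustrative only
export default {
  spider: {
    urlOptions: {
      regions: {
        header: 'header, .site-header', // links found here get LinksTo.label = 'header'
        footer: 'footer',
        body: 'main',
      },
    },
  },
};
```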
- Config can now live in a YAML file — it's clunky for large structured configuration blocks, but just right for a handful of properties like DB connection strings or a few normalizer settings.
- The `getPageData` function now parses JSON and JSON+LD chunks by default. Normal scripts are still unparsed by default, but JSON+LD in particular often holds a wealth of useful Schema.org properties and configuration data. In Drupal, for example, the `drupal-settings-json` block often exposes internal category and tag information for each piece of content.
- `@oclif/core` has been out of date for a while, and is now updated to the latest release.
- Unmatched selectors in `propertyMap` no longer return the full HTML document.
- Mapping to deep property paths now sets `object.prop.subprop` rather than `object['prop.subprop']`
- Nicer handling of single-column result sets in the `report` CLI
- Spidergram-specific env variables are now shown in the `project` CLI output
- `EntityQuery` execution now works when passed a string collection name.
- `Spider`, `WorkerQuery`, and `ScreenshotTool` events are now standardized on 'progress' and 'end'. The first parameter for both is a JobStatus object, making it easy to route the event to a shared progress bar or status display function.
- Event subscription methods return references to instances; this makes it easy to chain `spider.on(...)` and `worker.on(...)` calls during setup.
- `UrlEnqueueOptions.selector` is now `UrlEnqueueOptions.selectors`, and can accept a single CSS selector or a dictionary of named CSS selectors to categorize found links (for example, header vs. footer vs. body links).
- A `remapLinks` helper function, and the `relink` CLI command, can be used to rebuild existing LinkTo relationships.
- The `analyzePage` helper function runs data extraction, content analysis, and technology fingerprinting on a `Resource` using the current project configuration. Custom configuration can be passed in at runtime as well.
- `analyzePage` also supports simple property normalization via `propertyMap` on its options object. New properties on the Resource can be created from existing properties found during data extraction and content analysis, with fallbacks if specific properties weren't found.
- The `screenshot` CLI command now attempts to give a progress update.
- The `crawl`, `analyze`, and `probe` CLI functions now use global defaults.
- Added a convenience wrapper (`getPageTechnologies`) around the Fingerprint library.
- Config files can supply `pageTechnologies`, `pageData`, and `pageContent` config options.
- Config scripts can supply `getPageTechnologies`, `getPageData`, and `getPageContent` functions to globally override their operation.
- Fixed an issue that prevented `Spidergram.init()` from loading without a database.
- Added a handful of example queries in the default configuration.
- `Entity.get('property.path')` fallback value types should be `unknown`, not `undefined`.
- `WorkerQuery` wasn't updating the start/finish times in its status property.
This release is dedicated to teen crime-fighter Gwen Stacy of Earth-65. She juggles high school, her band, and wisecracking web-slinging until her boyfriend Peter Parker becomes infatuated with Spider-Woman. Unable to reveal her secret identity, Spider-Woman is blamed for Peter's tragic lizard-themed death on prom night… and Gwen goes on the run.
- Major Changes
  - `Vertice` and `Edge` have been renamed to `Entity` and `Relationship` to avoid confusion with ArangoDB graph traversal and storage concepts. With the arrival of the `Dataset` and `KeyValueStore` classes (see below), we also needed the clarity when dealing with full-fledged Entities vs. random datatypes.
  - HtmlTools.getPageContent() and .getPageData() are both async, allowing them to use some of the async parsing and extraction tools in our toolbox. If your extracted data and content suddenly appear empty, make sure you're awaiting the results of these two calls in your handlers and scripts.
  - Improved report/query helpers. The `GraphWorker` and `VerticeQuery` — both of which relied on raw snippets of AQL for filtering — have been replaced by a new query-builder system. A unified `Query` class can take a query definition in JSON format, or construct one piecemeal using fluent methods like `filterBy()` and `sort()`. A related `EntityQuery` class returns pre-instantiated Entity instances to eliminate boilerplate code, and a `WorkerQuery` class executes a worker function against each query result while emitting progress events for easy monitoring.
  - The `Project` class has been replaced by the `Spidergram` class, as part of the configuration management overhaul mentioned below. In most code, changing `const project = await Project.config();` to `const spidergram = await Spidergram.load();` and `const db = await project.graph();` to `const db = spidergram.arango;` should be sufficient. (A migration sketch follows below.)
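A short migration sketch based on the replacements spelled out above; the import path and the fluent-query arguments are illustrative guesses rather than documented signatures:

```ts
import { Spidergram, Query } from 'spidergram';

// Old: const project = await Project.config(); const db = await project.graph();
const spidergram = await Spidergram.load();
const db = spidergram.arango;

// New query-builder style, constructed piecemeal with fluent methods
const errorPages = new Query('resources').filterBy('code', 404).sort('url');
```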
- New Additions
  - Spidergram configuration can now live in .json, .js, or .ts files — and can control a much wider variety of internal behaviors. JS and TS configuration files can also pass in custom functions where appropriate, like the `urlNormalizer` and `spider.requestHandlers` settings. Specific environment variables, or `.env` files, can also be used to supply or override sensitive properties like API account credentials.
  - Ad-hoc data storage with the `Dataset` and `KeyValueStore` classes. Both offer static `open` methods that give quick access to default or named data stores — creating new storage buckets if needed, or pulling up existing ones. Datasets offer `pushItem(anyData)` and `getItems()` methods, while KeyValueStores offer `setItem(key, value)` and `getItem(key)` methods. Behind the scenes, they create and manage dedicated ArangoDB collections that can be used in custom queries. (A usage sketch follows this list.)
  - PDF and DocX parsing via `FileTools.Pdf` and `FileTools.Document`, based on the pdf-parse and mammoth projects. Those two formats are a first trial run for more generic parsing/handling of arbitrary formats; both can return filetype-specific metadata and plaintext versions of file contents. For consistency, the Spreadsheet class has also been moved to `FileTools.Spreadsheet`.
  - Site technology detection via `BrowserTools.Fingerprint`. Fingerprinting is currently based on the Wappalyzer project and uses markup, script, and header patterns to identify the technologies and platforms used to build/host a page.
  - CLI improvements. The new `spidergram report` command can pull up filtered, aggregated, and formatted versions of Spidergram crawl data. It can output to tabular overviews on the command line, raw JSON files for use with data visualization tools, or ready-to-read Excel worksheets. The `spidergram probe` command allows the new Fingerprint tool to be run from the command line, as well.
  - Groundwork for cleaner CLI code. While it's not as obvious to end users, we're moving more and more code away from the Oclif-dependent `SgCommand` class and putting it into the shared `SpiderCli` helper class where it can be used in more contexts. In the next version, we'll be leveraging these improvements to make Spidergram's built-in CLI tools take better advantage of the new global configuration settings.
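A usage sketch of the storage classes above, using the method names from these notes; the import path is an assumption:

```ts
import { Dataset, KeyValueStore } from 'spidergram';

// Datasets: append-style storage backed by a dedicated ArangoDB collection
const broken = await Dataset.open('broken-links');
await broken.pushItem({ url: 'https://example.com/missing', status: 404 });
const items = await broken.getItems();

// KeyValueStores: simple keyed lookups
const store = await KeyValueStore.open(); // default store
await store.setItem('lastCrawl', new Date().toISOString());
const lastCrawl = await store.getItem('lastCrawl');
```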
- Fixes and minor improvements
  - Internal errors (i.e., pre-request DNS problems or errors thrown during response processing) save a wider range of error codes rather than mapping everything to `-1`. Any thrown errors are also saved in `Resource.errors` for later reference.
  - A subtle but long-standing issue with the `downloadHandler` (and by extension `sitemapHandler` and `robotsTxtHandler`) meant it choked on most downloads but "properly" persisted status records rather than erroring out. The improved error handling caught it, and downloads now work consistently.
  - A handful of request handlers were `await`ing promises unnecessarily, clogging up the Spider's request queue. Crawls with multiple concurrent browser sessions will see some performance improvements.
This release is dedicated to Miles Morales of Earth-6160, star of Into The Spider-Verse.
- Improvements to structured data and content parsing; `HtmlTools.getPageData()` and `HtmlTools.getPageContent()` are now useful as general-purpose extractors across most crawl data.
- `HtmlTools.findPattern()` now pulls more data for each component, including raw internal markup if desired.
- URL hierarchy building and rendering, and a general-purpose utility class for building hierarchies from other types of relationships like breadcrumb trails.
- CLI improvements (the `spidergram urls` command gives quick access to the new hierarchy builder)
This release is dedicated to Cindy Moon, aka Silk, of Earth-616.
- New `Fragment` entity type for storing sub-page elements
- `findPattern` helper function to extract recurring elements from pages
- Automatic extraction of schema.org metadata
- Google Analytics integration
- Sitemap and Robots.txt parsing
- `Query` class to build and execute simple queries
- New `create-spidergram` project to spin up custom crawlers
This release is dedicated to Peter Parker of Earth-616, the original friendly neighborhood Spider-Man.
Initial public release of Spidergram. Please do not fold, spindle, or mutilate.