An easy way to archive any publicly accessible website locally using Docker and HTTrack.
This image uses a single YAML file, `website-archiver.yml`, to specify which sites to back up and how.
```yaml
webarchiver_sites:
  - site: "https://example.com"
    dest: "/public/example.com"
```
Where:
- `site` is the URL of the site to archive. Required.
- `dest` is the path inside the container where the archive is saved. Required.
If you want to archive multiple sites, simply add another entry:
```yaml
webarchiver_sites:
  - site: "https://example.com"
    dest: "/public/example.com"
  - site: "https://example.net"
    dest: "/public/example.net"
```
You may wish to include additional URL patterns along with the site, such as CSS, JS, or media files. Do so with `additional_url_patterns`:
```yaml
webarchiver_sites:
  - site: "https://example.com"
    dest: "/public/example.com"
    additional_url_patterns:
      - "+https://example.com/*"
      - "+*.css"
      - "+*.js"
      - "+mime:image/*"
      - "+mime:video/*"
      - "+mime:audio/*"
```
Where:
- `additional_url_patterns` is a list of patterns to include in the archive. Optional, defaults to CSS, JS, and media files.
You may also restrict the crawl to a folder within the site by including it in `additional_url_patterns`. For example, to crawl only https://example.com/folder and the pages beneath it:
```yaml
- site: "https://example.com/folder"
  dest: "/public/example.com"
  additional_url_patterns:
    - "+https://example.com/folder/*"
    - "+*.css"
    - "+*.js"
    - "+mime:image/*"
    - "+mime:video/*"
    - "+mime:audio/*"
```
The archiver will attempt to follow all links it finds when backing up a site. You can control this with `max_links`:
```yaml
webarchiver_sites:
  - site: "https://example.com"
    dest: "/public/example.com"
    max_links: 500000
```
Where:
- `max_links` is the maximum number of links to back up. Optional, defaults to 500000.

Note that setting this too low can cause the archiver to fail.
You can instruct the archiver to follow links in robots.txt and meta tags using `follow_robots_txt`:
```yaml
webarchiver_sites:
  - site: "https://example.com"
    dest: "/public/example.com"
    follow_robots_txt: "never"
```
Where the value of `follow_robots_txt` is one of:
- `never`: Never follow. Default.
- `sometimes`: Follow some links. See the HTTrack documentation for more information.
- `always`: Follow even more.
- `even strict`: Follow even strictly disallowed links.
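For example, the same entry as above with a non-default value:

```yaml
webarchiver_sites:
  - site: "https://example.com"
    dest: "/public/example.com"
    follow_robots_txt: "always"
```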
You can control the archive further with the following options:
```yaml
webarchiver_sites:
  - site: "https://example.com"
    dest: "/public/example.com"
    extra_log: yes
    single_log: yes
    disable_security_limits: yes
    update: yes
    max_transfer_rate: 0
    max_links: 500000
    include_near_files: yes
```
Where:
- `extra_log`: Write extra information to the log. Optional, defaults to `yes`.
- `single_log`: Write to a single log file per archive. Optional, defaults to `yes`.
- `disable_security_limits`: Bypass internal limits on bandwidth abuse. Optional, defaults to `yes`.
- `update`: Update the existing archive if it was previously taken. Optional, defaults to `yes`.
- `max_transfer_rate`: The maximum transfer rate in bytes/sec. Optional, defaults to no limit.
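Putting it all together, a single site entry combining the options described above might look like this (the values shown are only illustrative):

```yaml
webarchiver_sites:
  - site: "https://example.com"
    dest: "/public/example.com"
    additional_url_patterns:
      - "+https://example.com/*"
      - "+*.css"
      - "+*.js"
    follow_robots_txt: "never"
    max_links: 500000
    max_transfer_rate: 0
    extra_log: yes
    single_log: yes
    disable_security_limits: yes
    update: yes
    include_near_files: yes
```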
You can run this image in several ways. First, create `website-archiver.yml`, using the `website-archiver.yml.example` file as a template.
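Assuming you have a copy of the repository (or at least the example file) in your working directory, that can be as simple as:

```shell
cp website-archiver.yml.example website-archiver.yml
```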
To run using Docker:

```shell
docker run -it \
    --volume `pwd`/public:/public \
    --volume `pwd`/website-archiver.yml:/config/httrack/website-archiver.yml \
    ten7/website-archiver
```
Or, to use the included `docker-compose.yml` file:

```shell
docker-compose run httrack
```
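The repository ships its own `docker-compose.yml`, which should be preferred; if you need to write one from scratch, a minimal sketch along these lines should behave like the `docker run` command above (the service name and mount paths mirror that example):

```yaml
# Hypothetical minimal docker-compose.yml; the file included with the
# repository is the authoritative version.
version: "3"
services:
  httrack:
    image: ten7/website-archiver
    volumes:
      - ./public:/public
      - ./website-archiver.yml:/config/httrack/website-archiver.yml
```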
This container uses Ansible to perform start-up tasks. To get even more verbose output from the start-up scripts, set the `ANSIBLE_VERBOSITY` environment variable to `4`.
If the container will not start due to a failure of the entrypoint, set the `WEBARCHIVER_SKIP_ENTRYPOINT` environment variable to `true` or `1`, then restart the container.
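Both variables are ordinary environment variables, so they can be passed with Docker's standard `-e` flag, for example:

```shell
# Example: enable verbose start-up output. Add -e WEBARCHIVER_SKIP_ENTRYPOINT=true
# in the same way if the entrypoint is failing.
docker run -it \
    -e ANSIBLE_VERBOSITY=4 \
    --volume `pwd`/public:/public \
    --volume `pwd`/website-archiver.yml:/config/httrack/website-archiver.yml \
    ten7/website-archiver
```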
There is an example `.gitlab-ci.yml` file named `.gitlab-ci.yml.example` that can be used to host a site archived into the default `/public` folder using GitLab Pages. Copy the file, push your repo to GitLab, and you'll be set.
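The bundled example is the authoritative version, but for orientation, a GitLab Pages job that publishes a pre-archived `public/` folder typically looks something like this (a sketch, not a copy of `.gitlab-ci.yml.example`):

```yaml
# Hypothetical sketch of a GitLab Pages job that publishes an archive
# already committed to the repository's public/ folder.
pages:
  stage: deploy
  script:
    - echo "Publishing archived site from public/"
  artifacts:
    paths:
      - public
  only:
    - main
```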
Website Archiver is licensed under GPLv3. See `LICENSE` for the complete language.