A dockerized, queued, high fidelity web archiver based on Squidwarc (headless Chrome), RabbitMQ and a small web front end. Using the scripting abilities of Squidwarc, you can add scripts to be run for a specific job (e.g. srcset enrichment, comment expansion etc.). Please note that Warcworker is not a crawler: it will not crawl a website automatically, so you have to use other software to build the lists of URLs you send to Warcworker.
Copy .env_example to .env. Update information in .env.
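For example, on a typical Linux or macOS setup:

    # copy the example configuration, then edit the values for your environment
    cp .env_example .env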
Start with docker-compose up -d --scale worker=3
(wait a minute for everything to start up)
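If you want to verify that everything started, the standard docker-compose commands work as usual, for example:

    # list the running containers and follow the worker logs
    docker-compose ps
    docker-compose logs -f worker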
Open the web front end at http://0.0.0.0:5555 to enter URLs for archiving. You can prefill the text fields with the url and description request parameters.
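For example, a link like the following (the target URL and description here are just placeholders) opens the form with both fields prefilled:

    http://0.0.0.0:5555?url=https%3A%2F%2Fexample.com%2F&description=Example%20page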
Play back the resulting WARC files with Webrecorder Player.
Add a bookmarklet to your browser with the following link:
javascript:window.open('http://0.0.0.0:5555?url='+encodeURIComponent(location.href) + '&description=' + encodeURIComponent(document.title));window.focus();
Now you have two-click web archiving from your browser.
To use from the command line with curl:
curl -d "scripts=srcset&scripts=scroll_everything&url=https://www.peterkrantz.com/" -X POST http://0.0.0.0:5555/process/
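The scripts parameter can be repeated once per script; going by the handler example below, the scripts are applied to the page in the order they are listed.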
To use from archivenow, add a handler file handlers/ww_handler.py like this:

    import requests
    import json


    class WW_handler(object):
        def __init__(self):
            self.enabled = True
            self.name = 'Warcworker'
            self.api_required = False

        def push(self, uri_org, p_args=[]):
            msg = ''
            try:
                # add scripts in the order you want them to be run on the page
                payload = {"url": uri_org,
                           "scripts": ["scroll_everything", "srcset"]}
                r = requests.post('http://0.0.0.0:5555/process/', timeout=120,
                                  data=payload,
                                  allow_redirects=True)
                r.raise_for_status()
                return "%s added to queue" % uri_org
            except Exception as e:
                msg = "Error (" + self.name + "): " + str(e)
            return msg
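Once the handler file is in place, archivenow should pick it up as a new option. The exact flag depends on how your archivenow version derives option names from handler file names, but the invocation would typically look something like this (placeholder URL):

    # push a page to Warcworker via the archivenow handler
    archivenow --ww https://example.com/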