URL Collector

The URL collector is an application that crawls the Common Crawl corpus for URLs with the specified file extensions.

Architecture

The url-collector suite is a distributed system, which allows the processing speed to be scaled up. It can be deployed to a single machine, but if the crawling speed needs to be increased, the slave application can be deployed to multiple machines. A short single-machine deployment sketch follows the application descriptions below.

The url-collector suite is built from three applications.

URL Collector Master

Knows which WARC files (work units) should be crawled, stores this data in a database and distributes the work to the slave applications.

URL Collector Slave

Gets a work unit (effectively a Common Crawl WARC file) from the Master application, parses it for URLs, then uploads the results to a warehouse.

URL Collector Merger

Merges the crawl results stored in the warehouse into one big file, removing duplicate entries while doing so.
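
To give a feel for how the three applications fit together, a single-machine deployment might look like the sketch below. The commands reuse the release naming and parameters documented later in this document; the parameter values shown are illustrative assumptions, not required settings.

Start the Master (it stores the work units in MongoDB and hands them out over HTTP):

java -jar url-collector-master-application-{release-number}.jar --server.port=8080

Start one or more Slaves (they fetch work units from the Master and upload the parsed URLs to the warehouse):

java -jar url-collector-slave-application-{release-number}.jar --master.host=localhost --master.port=8080 --warehouse.type=local --warehouse.local.target-directory=C:\url-collector\results --validation.types=pdf

Once the crawl has finished, run the Merger to combine the results into one deduplicated file:

java -jar url-collector-merger-application-{release-number}.jar --warehouse.type=local --warehouse.local.target-directory=C:\url-collector\results --result.path=C:\url-collector\merged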

Installation

To start crawling, you need to have Java and MongoDB installed on your machine.

Installing Java

First, you need to download the Java 17 Runtime Environment. It's available here. After the download is complete, run the installer and follow the directions it provides until the installation is complete.

Once it's done, open a command line (type cmd into the Start menu's search bar) and you will be able to use the java command. Try running java -version. You should get output similar to the following (the sample below comes from an older Java release; your output should report the version you actually installed, i.e. 17 or newer):

java version "15" 2020-09-15
Java(TM) SE Runtime Environment (build 15+36-1562)
Java HotSpot(TM) 64-Bit Server VM (build 15+36-1562, mixed mode, sharing)

Installing MongoDB

Download MongoDB 5.0 from here. After the download is complete, run the installer and follow the directions it provides. If possible, install the MongoDB Compass tool as well, because you might need it later for administrative tasks.
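
If you want to verify that the database is up before starting the applications, and the mongosh shell was installed alongside the server, you can ping it from a command line (assuming the default host and port):

mongosh --eval "db.runCommand({ ping: 1 })"

A response containing "ok: 1" means the server is reachable.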

Running the Master Application

You can download the Master Application from our release page. Please take care to choose a version that is not marked as a "pre-release"!

After the download is complete, run the application via the following command:

java -jar url-collector-master-application-{release-number}.jar ...

In the place of the ... you should write the various parameters, each prefixed with two dashes and with an equals sign between the parameter's name and its value. For example:

java -jar url-collector-master-application-{release-number}.jar --example.parameter=value

Parameters

Table 1. Parameters
Parameter Description

database.host

The host location of the MongoDB database server. (Default value: localhost)

database.port

The port open for the MongoDB database server. (Default value: 27017)

database.uri

If present and not empty, it overrides the host and port parameters. It lets the user inject a MongoDB Connection String directly and should be used to define credentials and other custom connection parameters. (Default value: "")

server.port

The port that the master server should listen on. (Default value: 8080)
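
For example, a Master started against a local MongoDB instance, with the defaults from the table above spelled out explicitly, could be launched like this:

java -jar url-collector-master-application-{release-number}.jar --database.host=localhost --database.port=27017 --server.port=8080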

Running the Slave Application

You can download the Slave Application from our release page. Please take care to choose a version that is not marked as a "pre-release"!

After the download is complete, run the application via the following command:

java -jar url-collector-slave-application-{release-number}.jar ...

You can start the slave application on more than one machine if necessary, to improve the crawling speed.

In the place of the ... you should write the various parameters, each prefixed with two dashes and with an equals sign between the parameter's name and its value. For example:

java -jar url-collector-slave-application-{release-number}.jar --example.parameter=value

Parameters

Table 2. Parameters
Parameter Description

execution.parallelism-target

How many work units should be processed at the same time. (Default value: the number of processor cores multiplied by two)

warehouse.type

The type of the warehouse. Can be either 'local' or 'aws'.

warehouse.local.target-directory

The target directory where the crawled URLs should be saved. Only used if the warehouse.type is 'local'.

warehouse.aws.region

The region of the S3 bucket where the crawled URLs should be uploaded. Only used if the warehouse.type is 'aws'.

warehouse.aws.bucket-name

The name of the S3 bucket where the crawled URLs should be uploaded. Only used if the warehouse.type is 'aws'.

warehouse.aws.access-key

The access key for the user that has access to the S3 bucket where the crawled URLs should be uploaded. Only used if the warehouse.type is 'aws'.

warehouse.aws.secret-key

The secret key for the user that has access to the S3 bucket where the crawled URLs should be uploaded. Only used if the warehouse.type is 'aws'.

validation.types

The list of file extensions that URLs should be collected for.

master.host

The host location of the Master Application. (Default value: localhost)

master.port

The port of the Master Application. (Default value: 8080)
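
As an illustration, a Slave that connects to a Master on the same machine and writes its results into a local warehouse directory might be started like this. The target directory and the extension list are illustrative assumptions, and the comma-separated format shown for validation.types is an assumption as well:

java -jar url-collector-slave-application-{release-number}.jar --master.host=localhost --master.port=8080 --warehouse.type=local --warehouse.local.target-directory=C:\url-collector\results --validation.types=pdf,doc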

Starting a crawl

To start a crawl, the Master application's crawl initialization endpoint should be called. Send a POST request to the /crawl endpoint on the master with a body that contains the crawlId of the Common Crawl dataset that should be processed.

For example:

curl --location --request POST 'http://185.191.228.214:8081/crawl' \
--header 'Content-Type: application/json' \
--data-raw '{
"crawlId": "CC-MAIN-2021-31"
}'
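
If the Master runs on the same machine and listens on the default server.port of 8080, the equivalent local request looks like this:

curl --location --request POST 'http://localhost:8080/crawl' \
--header 'Content-Type: application/json' \
--data-raw '{
"crawlId": "CC-MAIN-2021-31"
}'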

Once the request has been accepted, the Slave applications should automatically pick up the new work units in a matter of minutes.

Starting the Merger Application

You can download the Merger Application from our release page. Please take care to choose a version that is not marked as a "pre-release"!

After the download is complete, run the application via the following command:

java -jar url-collector-merger-application-{release-number}.jar ...

In the place of the ... you should write the various parameters, each prefixed with two dashes and with an equals sign between the parameter's name and its value. For example:

java -jar url-collector-merger-application-{release-number}.jar --example.parameter=value

Parameters

Table 3. Parameters
Parameter Description

database.host

The host location of the MongoDB database server. (Default value: localhost)

database.port

The port open for the MongoDB database server. (Default value: 27017)

database.uri

If present and not empty, it overrides the host and port parameters. It lets the user inject a MongoDB Connection String directly and should be used to define credentials and other custom connection parameters. (Default value: "")

warehouse.type

The type of the warehouse. Can be either 'local' or 'aws'.

warehouse.local.target-directory

The target directory where the crawled URLs should be saved. Only used if the warehouse.type is 'local'.

warehouse.aws.region

The region of the S3 bucket where the crawled URLs should be uploaded. Only used if the warehouse.type is 'aws'.

warehouse.aws.bucket-name

The name of the S3 bucket where the crawled URLs should be uploaded. Only used if the warehouse.type is 'aws'.

warehouse.aws.access-key

The access key for the user that has access to the S3 bucket where the crawled URLs should be uploaded. Only used if the warehouse.type is 'aws'.

warehouse.aws.secret-key

The secret key for the user that has access to the S3 bucket where the crawled URLs should be uploaded. Only used if the warehouse.type is 'aws'.

result.path

The directory where the result of the merge should be saved. The result file will be written there with the filename 'result.ubds'.
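
For example, a Merger reading from a local warehouse and writing the merged file into a separate directory might be started like this (the directories are illustrative assumptions):

java -jar url-collector-merger-application-{release-number}.jar --database.host=localhost --database.port=27017 --warehouse.type=local --warehouse.local.target-directory=C:\url-collector\results --result.path=C:\url-collector\merged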

Peeking into the results

The individual result files are LZMA encoded. If you want to peek into them, first decompress the files (7-Zip can help you with this on Windows). The URLs are stored in a JSON array.

The merged result file is NOT compressed. In it, the URLs are separated by newline characters.
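
On Linux or macOS the same can be done from the command line. The sketch below assumes an individual result file named results-0.lzma (the actual file names depend on your warehouse) next to the merged result.ubds file:

xz --decompress --format=lzma results-0.lzma
wc -l result.ubds

The first command unpacks one LZMA encoded result file into its JSON form; the second counts the lines of the merged file, which roughly equals the number of collected URLs.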
