The URL collector is an application that crawls the Common Crawl corpus for URLs with the specified file extensions.
The application suite is designed as a distributed system to increase processing speed. It can be deployed to a single machine, but if the crawling speed needs to be increased, the slave application can be deployed to multiple machines.
The url-collector suite is built from three applications:

- Master Application: Knows which WARC files (work units) should be crawled, stores this data in a database, and distributes the work to the slave applications.
- Slave Application: Gets a work unit (effectively a Common Crawl WARC file) from the Master Application, parses it for URLs, then uploads the results to a warehouse.
- Merger Application: Merges the individual result files produced by the slaves into a single result file.
To start crawling, you need to have Java and MongoDB installed on your machine.
First, you need to download the Java 17 Runtime Environment. It’s available here. After the download is complete, run the installer and follow the directions it provides until the installation is complete.
Once it’s done, if you open a command line (type cmd into the Start menu’s search bar) you will be able to use the java command. Try running java -version. You should see output similar to this (the exact version and build numbers may differ):

java version "15" 2020-09-15
Java(TM) SE Runtime Environment (build 15+36-1562)
Java HotSpot(TM) 64-Bit Server VM (build 15+36-1562, mixed mode, sharing)
Download MongoDB 5.0 from here. After the download is complete, run the installer and follow the directions it provides. If possible, install the MongoDB Compass tool as well, because you might need it later for administrative tasks.
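If you want to verify the installation, you can check the MongoDB server version from a command line (this assumes the MongoDB binaries were added to your PATH during installation):

mongod --version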
You can download the Master Application from our release page. Please take care to choose a version that is not marked as a "pre-release"!
After the download is complete, run the application via the following command:
java -jar url-collector-master-application-{release-number}.jar ...
In place of the …, write the parameters, each starting with two dashes and with an equals sign between the parameter’s name and its value. For example:
java -jar url-collector-master-application-{release-number}.jar --example.parameter=value
Parameter | Description |
---|---|
database.host | The host location of the MongoDB database server. (Default value: localhost) |
database.port | The port open for the MongoDB database server. (Default value: 27017) |
database.uri | If present and not empty, it overrides the host and port parameters. Lets the user inject a MongoDB Connection String directly. Should be used to define credentials and other custom connection parameters. (Default value: "") |
server.port | The port the master server should listen on. (Default value: 8080) |
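For example, a master that uses a locally running MongoDB instance and listens on the default port could be started like this (the values shown are only illustrative; every parameter falls back to its default if omitted):

java -jar url-collector-master-application-{release-number}.jar --database.host=localhost --database.port=27017 --server.port=8080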
You can download the Slave Application from our release page. Please take care to choose a version that is not marked as a "pre-release"!
After the download is complete, run the application via the following command:
java -jar url-collector-slave-application-{release-number}.jar ...
You can start the slave application on more than one machine if necessary, to improve the crawling speed.
In place of the …, write the parameters, each starting with two dashes and with an equals sign between the parameter’s name and its value. For example:
java -jar url-collector-slave-application-{release-number}.jar --example.parameter=value
Parameter | Description |
---|---|
execution.parallelism-target | How many work units should be processed at the same time. (Default value: the number of processor cores multiplied by two) |
warehouse.type | The type of the warehouse. Can be either 'local' or 'aws'. |
warehouse.local.target-directory | The target directory where the crawled URLs should be saved. Only used if warehouse.type is 'local'. |
warehouse.aws.region | The region of the S3 bucket where the crawled URLs should be uploaded. Only used if warehouse.type is 'aws'. |
warehouse.aws.bucket-name | The name of the S3 bucket where the crawled URLs should be uploaded. Only used if warehouse.type is 'aws'. |
warehouse.aws.access-key | The access key of the user that has access to the S3 bucket where the crawled URLs should be uploaded. Only used if warehouse.type is 'aws'. |
warehouse.aws.secret-key | The secret key of the user that has access to the S3 bucket where the crawled URLs should be uploaded. Only used if warehouse.type is 'aws'. |
validation.types | The list of file extensions that the URLs should be saved for. |
master.host | The host location of the Master Application. (Default value: localhost) |
master.port | The port of the Master Application. (Default value: 8080) |
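For example, a slave that saves its results to a local directory and connects to a master running on the same machine could be started like this (the target directory is only an illustrative value; only the warehouse- and master-related parameters are shown):

java -jar url-collector-slave-application-{release-number}.jar --master.host=localhost --master.port=8080 --warehouse.type=local --warehouse.local.target-directory=/tmp/url-collector-results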
To start a crawl, call the Master application’s crawl initialization endpoint. The request should be a POST request to the /crawl endpoint on the master, with a body that contains the crawlId of the Common Crawl dataset that should be processed.
For example:
curl --location --request POST 'http://185.191.228.214:8081/crawl' \
--header 'Content-Type: application/json' \
--data-raw '{
"crawlId": "CC-MAIN-2021-31"
}'
Once the request has been processed, the slave applications should automatically pick up the new work units within a few minutes.
You can download the Merger Application from our release page. Please take care to choose a version that is not marked as a "pre-release"!
After the download is complete, run the application via the following command:
java -jar url-collector-merger-application-{release-number}.jar ...
In place of the …, write the parameters, each starting with two dashes and with an equals sign between the parameter’s name and its value. For example:
java -jar url-collector-merger-application-{release-number}.jar --example.parameter=value
Parameter | Description |
---|---|
database.host | The host location of the MongoDB database server. (Default value: localhost) |
database.port | The port open for the MongoDB database server. (Default value: 27017) |
database.uri | If present and not empty, it overrides the host and port parameters. Lets the user inject a MongoDB Connection String directly. Should be used to define credentials and other custom connection parameters. (Default value: "") |
warehouse.type | The type of the warehouse. Can be either 'local' or 'aws'. |
warehouse.local.target-directory | The target directory where the crawled URLs should be saved. Only used if warehouse.type is 'local'. |
warehouse.aws.region | The region of the S3 bucket where the crawled URLs should be uploaded. Only used if warehouse.type is 'aws'. |
warehouse.aws.bucket-name | The name of the S3 bucket where the crawled URLs should be uploaded. Only used if warehouse.type is 'aws'. |
warehouse.aws.access-key | The access key of the user that has access to the S3 bucket where the crawled URLs should be uploaded. Only used if warehouse.type is 'aws'. |
warehouse.aws.secret-key | The secret key of the user that has access to the S3 bucket where the crawled URLs should be uploaded. Only used if warehouse.type is 'aws'. |
result.path | The directory where the result of the merge should be saved. The result file will be written there with the filename 'result.ubds'. |
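For example, a merger that reads the partial results from a local warehouse directory and writes the merged file into a separate directory could be started like this (both paths are only illustrative values):

java -jar url-collector-merger-application-{release-number}.jar --database.host=localhost --database.port=27017 --warehouse.type=local --warehouse.local.target-directory=/tmp/url-collector-results --result.path=/tmp/url-collector-merged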
The individual result files are LZMA encoded. If you want to peek into them, first decompress the files (7-Zip can help you with this on Windows). The URLs are stored in a JSON array.
The merged result file is NOT compressed. In it, the URLs are separated by newline characters.
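For example, on Linux or macOS you can inspect one of the individual result files like this (the xz utility and the file name are assumptions; any LZMA-capable tool, such as 7-Zip on Windows, works just as well):

xz --decompress --keep part-00001.lzma
head part-00001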