Minicrawler parses URLs, executes HTTP (HTTP/2) requests while handling cookies, network connection management and SSL/TLS protocols. By default it follows redirect locations and returns a full response, the final URL, parsed cookies and more. It is designed to handle many requests in parallel in a single thread. It multiplexes connections, running the read/write communication asynchronously. The whole Minicrawler suite is licensed under the AGPL license.
WHATWG URL Standard compliant parsing and serializing library written in C. It is fast and has only one external dependency – libicu. The library is licensed under the AGPL license.
#include <stdio.h>
#include <stdlib.h>
#include <minicrawler/minicrawler-url.h>

/**
 * First argument input URL, second (optional) base URL
 */
int main(int argc, char *argv[]) {
    if (argc < 2) return 2;

    char *input = argv[1];
    char *base = NULL;
    if (argc > 2) {
        base = argv[2];
    }

    mcrawler_url_url url, *base_url = NULL;
    if (base) {
        base_url = (mcrawler_url_url *)malloc(sizeof(mcrawler_url_url));
        if (mcrawler_url_parse(base_url, base, NULL) == MCRAWLER_URL_FAILURE) {
            printf("Invalid base URL\n");
            return 1;
        }
    }

    if (mcrawler_url_parse(&url, input, base_url) == MCRAWLER_URL_FAILURE) {
        printf("Invalid URL\n");
        return 1;
    }

    printf("Result: %s\n", mcrawler_url_serialize_url(&url, 0));
    return 0;
}
More in test/url.c.
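Passing a base URL (the third argument of mcrawler_url_parse, as above) also resolves relative references. A minimal sketch using only the calls shown above; the expected result follows from the WHATWG URL algorithm:

#include <stdio.h>
#include <minicrawler/minicrawler-url.h>

int main(void) {
    mcrawler_url_url base, url;

    /* Parse the base first, then resolve a relative reference against it. */
    if (mcrawler_url_parse(&base, "https://example.com/a/b", NULL) == MCRAWLER_URL_FAILURE ||
        mcrawler_url_parse(&url, "../img?x=1", &base) == MCRAWLER_URL_FAILURE) {
        return 1;
    }

    /* Per the WHATWG algorithm this prints "https://example.com/img?x=1". */
    printf("%s\n", mcrawler_url_serialize_url(&url, 0));
    return 0;
}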
#include <stdio.h>
#include <string.h>
#include <minicrawler/minicrawler.h>

static void onfinish(mcrawler_url *url, void *arg) {
    printf("%d: Status: %d\n", url->index, url->status);
}

int main(void) {
    mcrawler_url url[2];
    mcrawler_url *urls[] = {&url[0], &url[1], NULL};
    mcrawler_settings settings;

    memset(&url[0], 0, sizeof(mcrawler_url));
    memset(&url[1], 0, sizeof(mcrawler_url));
    mcrawler_init_url(&url[0], "http://example.com");
    url[0].index = 0;
    mcrawler_init_url(&url[1], "http://example.com");
    url[1].index = 1;

    mcrawler_init_settings(&settings);
    mcrawler_go(urls, &settings, &onfinish, NULL);
    return 0;
}
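The last argument of mcrawler_go (NULL above) appears to be passed through to each onfinish call as arg; assuming that, it can carry per-run state, as in this sketch:

#include <stdio.h>
#include <string.h>
#include <minicrawler/minicrawler.h>

struct stats {
    int finished;
};

/* Called once per URL; arg is assumed to be the pointer given as the
 * last argument of mcrawler_go(). */
static void onfinish(mcrawler_url *url, void *arg) {
    struct stats *stats = (struct stats *)arg;
    stats->finished++;
    printf("%d: Status: %d\n", url->index, url->status);
}

int main(void) {
    mcrawler_url url;
    mcrawler_url *urls[] = {&url, NULL};
    mcrawler_settings settings;
    struct stats stats = {0};

    memset(&url, 0, sizeof(url));
    mcrawler_init_url(&url, "http://example.com");
    url.index = 0;

    mcrawler_init_settings(&settings);
    mcrawler_go(urls, &settings, &onfinish, &stats);

    printf("Finished requests: %d\n", stats.finished);
    return 0;
}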
minicrawler [options] [urloptions] url [[url2options] url2]...
options:
-2 disable HTTP/2
-6 resolve host to IPv6 address only
-8 convert from page encoding to UTF-8
-A STRING custom user agent (max 255 bytes)
-b STRING cookies in the netscape/mozilla file format (max 20 cookies)
-c convert content to text format (with UTF-8 encoding)
-DMILIS set delay time in milliseconds when downloading more pages from the same IP (default is 100 ms)
-g accept gzip encoding
-h enable output of HTTP headers
-i enable impatient mode (minicrawler exits a few seconds earlier if it doesn't make enough progress)
-k disable SSL certificate verification (allow insecure connections)
-l do not follow redirects
-mINT maximum page size in MiB (default 2 MiB)
-pSTRING password for HTTP authentication (basic or digest, max 31 bytes)
-S disable SSL/TLS support
-tSECONDS set timeout (default is 5 seconds)
-u STRING username for HTTP authentication (basic or digest, max 31 bytes)
-v verbose output (to stderr)
-w STRING write this custom header to all requests (max 4095 bytes)
urloptions:
-C STRING parameter which replaces '%' in the custom header
-P STRING HTTP POST parameters
-X STRING custom request HTTP method, no validation performed (max 15 bytes)
Minicrawler prepends its own header to the output, with the following meaning:
- URL: Requested URL
- Redirected-To: Final absolute URL
- Redirect-info: Info about each redirect
- Status: HTTP Status of final response (negative in case of error)
  | Status | Meaning                                |
  |--------|----------------------------------------|
  | -10    | Invalid input                          |
  | -9, -8 | DNS error                              |
  | -7, -6 | Connection error                       |
  | -5     | SSL/TLS error                          |
  | -4, -3 | Error during sending a HTTP request    |
  | -2     | Error during receiving a HTTP response |
  | -1     | Decoding or converting error           |
- Content-length: Length of the downloaded content in bytes
- Timeout: Reason for the timeout, in case of a timeout
- Error-msg: Error message in case of error (negative Status)
- Content-type: Correct content type of the output content
- WWW-Authenticate: WWW-Authenticate header
- Cookies: Number of cookies followed by that number of lines of parsed cookies in Netscape/Mozilla file format
- Downtime: Length of the interval between the time of the first connection and the time of the last received byte; the time of the start of the first connection
- Timing: Timing of request (DNS lookup, Initial connection, SSL, Request, Waiting, Content download, Total)
- Index: Index of URL from command line
- Asynchronous hostname resolving – c-ares
- Gzip decoding – zlib
- TLS/SSL – OpenSSL
- HTTP2 – Nghttp2
- Unicode processing – ICU
Tested platforms: Debian Linux, Red Hat Linux, OS X.
Install the following dependencies (including header files, i.e. dev packages):
On Linux with apt-get run:
apt install libc-ares-dev zlib1g-dev libicu-dev libssl-dev libnghttp2-dev
The GNU Autotools and the GNU Compiler Collection are also needed; they can be installed with:
apt install make autoconf automake autotools-dev libtool gcc
On macOS with Homebrew, CFLAGS and LDFLAGS need to contain the proper paths. You can pass them directly as options to the configure script:
./configure CFLAGS="-I/usr/local/include" LDFLAGS="-L/usr/local/opt -L/usr/local/lib"
After installation, you can link libminicrawler by adding this to your Makefile:
CFLAGS += $(shell pkg-config --cflags libminicrawler-4)
LDFLAGS += $(shell pkg-config --libs libminicrawler-4)
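As a quick sanity check, a minimal program like the sketch below (the file name check.c is just an example) should then compile and link with those flags:

/* check.c: verifies that the libminicrawler header and library are found */
#include <minicrawler/minicrawler.h>

int main(void) {
    mcrawler_settings settings;
    mcrawler_init_settings(&settings); /* references a symbol from the library */
    return 0;
}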
First create a .env file with COMPOSE_PROJECT_NAME=minicrawler, then build the docker image:
docker compose build minicrawler
docker compose run --rm minicrawler
Then run:
./autogen.sh
./configure --prefix=$PREFIX --with-ca-bundle=/var/lib/certs/ca-bundle.crt --with-ca-path=/etc/ssl/certs
make
make install
make check # for tests
Unit tests are done by simply running make check. They need php-cli to be installed.
Integration tests require a running instance of httpbin. You can use a public one, such as the one on nghttp2.org, or install it locally, for example as a library from PyPI, and run it using Gunicorn:
apt install -y python3-pip
pip install httpbin
gunicorn httpbin:app
Then run the following command:
make -C integration-tests check
Or, using Docker Compose:
docker compose up -d httpbin
make -C integration-tests check
To use the built binaries and libraries in your own image, copy them from the minicrawler image in your Dockerfile:
COPY --from=minicrawler:latest /var/lib/minicrawler/usr /usr
- Testomato – A simple website monitoring tool
- add me here