Simple web crawling package.
Starts crawling from given domain_url. Visits every html link within specified domain via HTTP/HTTPS, collects each site title and links and repeats the process for collected links.
Returns dictionary of dictionaries as follows:
{
'http://0.0.0.0:8000': {
'title': 'Index',
'links': {'http://0.0.0.0:8000/example.html', 'http://0.0.0.0:8000/site.html'}
}
...
}
pip install git+https://github.com/myslak71/web_crawler.git
In scripts:
from web_crawler import site_map
site_map(url)
CLI:
$ web-crawler --url URL
OPTION | DESCRIPTION | |
---|---|---|
-u, --url | REQUIRED | Domain URL to start crawling from |
-h, --help | OPTIONAL | Help |