This project crawls, extracts, indexes and processes the content of daily published news articles. The extracted content is indexed in Elasticsearch for further processing. The project also provides tooling to extract and preprocess the content for the NoD project.
git clone https://github.com/thesoenke/news-crawler
cd news-crawler
docker-compose up --build
This will start the crawler, Elasticsearch and Kibana. If Elasticsearch crashes, you might need to increase the virtual memory limit for Elasticsearch:
sysctl -w vm.max_map_count=262144
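Before changing the setting it can help to see what the current value is. The following is a minimal sketch, assuming a Linux host where the value is readable from /proc/sys/vm/max_map_count; it only reads the value and prints a hint, it does not change anything.

```shell
# Compare the current vm.max_map_count with the value used above.
required=262144
# Fall back to 0 when the file is not readable (e.g. on a non-Linux host).
current=$(cat /proc/sys/vm/max_map_count 2>/dev/null || echo 0)

if [ "$current" -lt "$required" ]; then
  echo "vm.max_map_count is $current, lower than $required"
  echo "run as root: sysctl -w vm.max_map_count=$required"
else
  echo "vm.max_map_count is $current, no change needed"
fi
```

Note that a value set with sysctl -w does not survive a reboot; it has to be set again or made persistent through the system's sysctl configuration.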
If everything worked, three containers should be running.
It is also possible to generate the NoDCore input in docker by running make nod-docker. To generate only the data for the day before, run make nod-yesterday. In case you want to run make nod-yesterday in a cronjob, you have to provide the full path to docker-compose in the Makefile, e.g. /usr/local/bin/docker-compose. The output will be available in out/nod/german. This will only work when the docker-compose setup is already running and the scraper has run at least once.
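Cron usually runs jobs with a much shorter PATH than an interactive shell, which is why the full path is needed. A small sketch for finding that path is shown below; the crontab line in the comment is hypothetical (the schedule and /path/to/news-crawler are placeholders, not part of the project).

```shell
# Print the absolute path of docker-compose so it can be copied into the Makefile.
# Prints a message instead when docker-compose is not installed.
compose_path=$(command -v docker-compose || true)
echo "${compose_path:-docker-compose not found in PATH}"

# Hypothetical crontab line (added with `crontab -e`), running the export daily at 03:00:
# 0 3 * * * cd /path/to/news-crawler && make nod-yesterday
```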
Starts the crawler and NoDWeb and automatically runs NoDCore when new data is available.
git clone https://github.com/thesoenke/news-crawler
cd news-crawler
docker-compose -f docker-compose.nod.yml up --build
- Open localhost:9000
- Make sure Go is installed
go get github.com/thesoenke/news-crawler
$GOPATH/bin should be in your PATH, or run it directly with $GOPATH/bin/news-crawler
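One way to put the Go binary directory on the PATH for the current shell session is sketched below. It assumes that $HOME/go is the GOPATH when the go tool cannot be asked, which is Go's default location.

```shell
# Ask the go tool for GOPATH; fall back to Go's default of $HOME/go.
gopath=$(go env GOPATH 2>/dev/null || echo "$HOME/go")

# Append the bin directory so that news-crawler can be called by name.
export PATH="$PATH:$gopath/bin"
echo "added $gopath/bin to PATH"
```

To make this permanent, the export line would go into the shell's startup file.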
Download articles from a list of feeds
news-crawler feeds data/feeds_de.txt --lang german
The scraper downloads articles found by the feedreader and indexes them in Elasticsearch.
news-crawler scrape out/feeds/german/26-6-2017.json --lang german
An Elasticsearch instance needs to be running. Credentials can be set with the environment variables ELASTIC_URL, ELASTIC_USER and ELASTIC_PASSWORD. Defaults are url: http://localhost:9200, user: elastic, password: changeme.
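As an illustration, the variables can be exported in the shell before the scraper is started. The values below are simply the documented defaults; the curl call is an optional reachability check added for this sketch and is not part of news-crawler.

```shell
# Documented defaults; replace them for a real deployment.
export ELASTIC_URL="http://localhost:9200"
export ELASTIC_USER="elastic"
export ELASTIC_PASSWORD="changeme"

# Optional: check that Elasticsearch answers before scraping (requires curl).
curl -s --max-time 5 -u "$ELASTIC_USER:$ELASTIC_PASSWORD" "$ELASTIC_URL" ||
  echo "Elasticsearch not reachable at $ELASTIC_URL"
```

After that, the scrape command shown above picks up the values from the environment.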
news-crawler nod --lang german
Generated files can be found in out/nod/<lang> by default. This command will only work when the scraper has run at least once to insert data into Elasticsearch.
It is also possible to only output the content from yesterday with the following command:
news-crawler nod --from yesterday --lang german
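The feeds, scrape and nod commands can be chained into one daily run. The sketch below is only an illustration: run_daily and feed_file are hypothetical helper names, and the feed file name is assumed to follow the D-M-YYYY.json pattern without zero padding, inferred from the 26-6-2017.json example above.

```shell
# Build the feedreader output path for today.
# Assumption: files are named D-M-YYYY.json without zero padding.
feed_file() {
  echo "out/feeds/$1/$(date +%-d-%-m-%Y).json"
}

# Run feedreader, scraper and NoD export in sequence; stop at the first failure.
run_daily() {
  lang="$1"
  feeds="$2"
  news-crawler feeds "$feeds" --lang "$lang" &&
    news-crawler scrape "$(feed_file "$lang")" --lang "$lang" &&
    news-crawler nod --from yesterday --lang "$lang"
}
```

A call would then look like: run_daily german data/feeds_de.txt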
- Successful runs of the feedreader, scraper and nod exporter are logged in out/events.log
- The feedreader writes a log of feeds that could not be fetched to out/feeds/<lang>/failures.log
- Articles that could not be fetched are logged in the Elasticsearch index failures-<lang>
All fetched webpages are by default stored on disk in out/content/<lang>/<date>/<md5sum of the url>.html
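To locate the stored page for a given article, the file name can be derived from its URL. This is a sketch under assumptions: the hash is taken over the URL string exactly as given (whether news-crawler normalizes the URL first is not documented here), the example URL is made up, and md5sum from GNU coreutils is used. The <date> part is left as a placeholder.

```shell
# Hypothetical article URL used only for illustration.
url="https://example.com/article"

# MD5 over the URL without a trailing newline; keep only the hex digest.
hash=$(printf '%s' "$url" | md5sum | cut -d' ' -f1)

echo "out/content/german/<date>/$hash.html"
```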
When using the docker-compose setup, open localhost:5601 and add the index mapping. The index news-* will contain all languages. A language specific index mapping can be created by using news-<lang>. The index failures-<lang> logs all failures.
Warning: The scraper has to be run at least once to create the indices
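Whether the indices already exist can also be checked from the command line. The sketch below uses Elasticsearch's _cat/indices endpoint with the documented default credentials as fallbacks; it assumes curl is installed and is not part of news-crawler itself.

```shell
# Use the environment values when set, otherwise the documented defaults.
ELASTIC_URL="${ELASTIC_URL:-http://localhost:9200}"
ELASTIC_USER="${ELASTIC_USER:-elastic}"
ELASTIC_PASSWORD="${ELASTIC_PASSWORD:-changeme}"

# List the article and failure indices together with their document counts.
curl -s --max-time 5 -u "$ELASTIC_USER:$ELASTIC_PASSWORD" \
  "$ELASTIC_URL/_cat/indices/news-*,failures-*?v" ||
  echo "Elasticsearch not reachable at $ELASTIC_URL"
```

An empty listing suggests the scraper has not run yet.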
Before a new language can be added, a list of feeds is required. After that, the following 3 environment variables need to be changed in docker-compose.yml:
- CRAWLER_FEEDS_FILE: Path to a file with a list of feeds
- CRAWLER_LANGUAGE: Language of the parser
- CRAWLER_TIMEZONE: Timezone that should be used when storing the article publish dates
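To see where these variables are currently set, the compose file can be searched from the repository root. This is a small sketch that only reads docker-compose.yml and prints the matching lines with their line numbers.

```shell
# Show the lines in docker-compose.yml that set the three crawler variables.
for name in CRAWLER_FEEDS_FILE CRAWLER_LANGUAGE CRAWLER_TIMEZONE; do
  grep -n "$name" docker-compose.yml 2>/dev/null ||
    echo "$name: not found in docker-compose.yml"
done
```

After editing the values, docker-compose up --build starts the setup with the new language.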