Scripts

This directory is reserved for custom scripts. Several scripts that grew out of experience on the project are already in place and may be useful:

  • crawl-logstats.sh - Extracts the crawl statistics from Nutch logs. Please see USAGE for more details.
  • crawl-fetchstats.sh - Extracts the fetch statistics from Nutch segments. Please see USAGE for more details.
  • memex_cca_esindex.py - Converts the Nutch Common Crawl Dump to CDRv2 format. Please see USAGE for more details.
  • splitter.py - Splits the CDRv2 JSON into multiple JSONs based on target websites. Please see USAGE for more details.
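
To illustrate the kind of per-website split that the last item describes, here is a minimal sketch. It is not the code of splitter.py: it assumes the input is one JSON document per line and that each document carries its source address in a `url` field, and the names `split_by_site` and `write_groups` are hypothetical. Please check USAGE for the actual input format and options.

```python
import json
from collections import defaultdict
from pathlib import Path
from urllib.parse import urlparse


def split_by_site(lines, url_field="url"):
    """Group JSON-lines records by the hostname found in `url_field`.

    Assumption: one JSON object per line, with the page address stored
    under `url_field`. Records without a usable URL go to "unknown".
    """
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        record = json.loads(line)
        host = urlparse(record.get(url_field) or "").netloc or "unknown"
        groups[host].append(record)
    return dict(groups)


def write_groups(groups, out_dir):
    """Write each group to <out_dir>/<host>.json, one JSON object per line."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for host, records in groups.items():
        safe_name = host.replace(":", "_")  # a port separator is awkward in file names
        with open(out / f"{safe_name}.json", "w", encoding="utf-8") as fh:
            for record in records:
                fh.write(json.dumps(record) + "\n")


if __name__ == "__main__":
    sample = [
        '{"url": "http://a.example/page1"}',
        '{"url": "http://b.example/page2"}',
        '{"url": "http://a.example/page3"}',
    ]
    for host, records in split_by_site(sample).items():
        print(host, len(records))
```

Grouping on the hostname keeps the sketch simple; grouping on a list of target domains instead would only change how `host` is derived.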