Nutch-Analytics

Nutch-Analytics is an Apache Spark-based project for analyzing crawls generated by Apache Nutch. The project is still in incubation and currently provides the CDRv2 dump feature.

The vision is to continue developing analytical features for Nutch using Spark. This work will also intersect with areas such as machine learning and natural language processing.

Build and Deploy

mvn clean install

Run Analytics

java -cp analytics-1.0.jar gov.nasa.jpl.analytics.dump.Cdrv2Dump -m local[*] -s PATH_TO_SEGMENT_FOLDER -o OUTPUT_FILE -l PATH_TO_LINK_DB
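
The command above can be wrapped in a small shell function so the paths are passed as arguments rather than edited by hand. This is only a sketch: the function name `run_cdrv2_dump` and the `DRY_RUN` variable are hypothetical, and the flag descriptions in the comments are inferred from the placeholder names in the command, not from the project's documentation. Passing the master value as a quoted variable also keeps shells that treat `[*]` as a glob pattern from expanding it.

```shell
# Hypothetical wrapper around the run command shown above.
# Flag meanings are inferred from the placeholder names:
#   -m  Spark master (local[*] runs Spark locally on all cores)
#   -s  path to the Nutch segment folder
#   -o  output file
#   -l  path to the Nutch link DB

JAR="analytics-1.0.jar"                              # jar name as used in the command above
MAIN_CLASS="gov.nasa.jpl.analytics.dump.Cdrv2Dump"   # main class as used in the command above

# Usage: run_cdrv2_dump SEGMENT_DIR OUTPUT_FILE LINK_DB [SPARK_MASTER]
run_cdrv2_dump() {
  segment="$1"
  output="$2"
  linkdb="$3"
  master="${4:-local[*]}"   # default to the master value from the command above

  # Rebuild the argument list; quoting keeps each value as a single word.
  set -- java -cp "$JAR" "$MAIN_CLASS" -m "$master" -s "$segment" -o "$output" -l "$linkdb"

  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "$*"               # print the command instead of running it
  else
    "$@"
  fi
}
```

With the function defined, setting `DRY_RUN=1` before calling `run_cdrv2_dump PATH_TO_SEGMENT_FOLDER OUTPUT_FILE PATH_TO_LINK_DB` prints the command without starting Java, which is a quick way to check the arguments first.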

Contact Us

If you have any questions or suggestions, please send them to irds-l@mymaillists.usc.edu

Website: http://irds.usc.edu