This repository has been archived by the owner on Feb 1, 2024. It is now read-only.

Implementation for TFIDF and Searching of queries using keywords, using Java and Apache Hadoop


prabhuvashwin/TFIDF-SearchQuery


ITCS 6190 - Cloud Computing for Data Analysis - Assignment 2

This is the README for Assignment 2. It explains all the files involved.

The input dataset is the Canterbury Corpus, which consists of 8 files in different formats.

When the program runs, the output is saved as follows. If OUTPUT_VALUE is the output location passed by the user, then:

  • DocWordCount output location - OUTPUT_VALUE/docwordcount
  • TermFrequency output location - OUTPUT_VALUE/tf
  • TFIDF output location - OUTPUT_VALUE/tfidf
  • Search output location - OUTPUT_VALUE/search
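The output layout above can be sketched in plain Java. This is only an illustration: the class `OutputLayout` and its `locations` method are hypothetical names, not taken from the repository, and the trailing-slash handling is an assumption about how a user might pass the path.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the repository: derives the four
// per-job output locations from the user-supplied base path.
final class OutputLayout {
    static Map<String, String> locations(String outputValue) {
        // ASSUMPTION: strip one trailing slash so "out/" and "out" give the same paths.
        String base = outputValue.endsWith("/")
                ? outputValue.substring(0, outputValue.length() - 1)
                : outputValue;
        Map<String, String> paths = new LinkedHashMap<>();
        paths.put("DocWordCount", base + "/docwordcount");
        paths.put("TermFrequency", base + "/tf");
        paths.put("TFIDF", base + "/tfidf");
        paths.put("Search", base + "/search");
        return paths;
    }

    public static void main(String[] args) {
        // Falls back to the literal placeholder when no argument is given.
        locations(args.length > 0 ? args[0] : "OUTPUT_VALUE")
                .forEach((job, path) -> System.out.println(job + " -> " + path));
    }
}
```

Keeping the sub-directory names in one place like this would let each chained job read the previous job's output without repeating string literals, though the repository may organise this differently.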


The commands used to run the files below are the same as those provided in ClusterExample.pdf and are also given here:

  • Copy all the source code files onto the cluster at path ~/assignment2/build/org/myorg/
  • Copy all the input files to HDFS at path /user//wordcount/input
  • Compile the source code files using the command - $ javac -cp /opt/cloudera/parcels/CDH/lib/hadoop/:/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/ .java -d build -Xlint
  • Create jar file using the command - $ jar -cvf .jar -C build/ .
  • Execute the jar file using - $ hadoop jar .jar org.myorg. INPUT_PATH OUTPUT_PATH

The assignment has four sections:

  1. DocWordCount - The code for DocWordCount is in DocWordCount.java. The output from running DocWordCount on the Canterbury Corpus is available in output/docwordcount.out

  2. TermFrequency - The code for TermFrequency is in TermFrequency.java. This file is not executed directly; it runs as part of the job chain in TFIDF.java. The output from running TermFrequency on the Canterbury Corpus is available in output/TermFrequency.out

  3. TFIDF - The code for TFIDF is in TFIDF.java. The output from running TFIDF on the Canterbury Corpus is available in output/TFIDF.out

  4. Search - The code for Search is in Search.java. Two queries are run using Search.java. The first query is “computer science”, and its output is available in output/query1.out. The second query is “data analysis”, and its output is available in output/query2.out.
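To make the four stages concrete, here is a minimal in-memory sketch of term counting, TF-IDF weighting, and query scoring. It does not use Hadoop and is not the repository's code. The formulas are assumptions, since this README does not state them: term frequency is taken as 1 + log10(count), inverse document frequency as log10(1 + N / df), and a file's search score as the sum of the TF-IDF weights of the query terms it contains. The class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory sketch; the repository's classes run as Hadoop
// MapReduce jobs and may use different formulas and tokenization.
final class TfIdfSketch {
    // ASSUMPTION: logarithmic term frequency, 1 + log10(count), 0 when absent.
    static double tf(int count) {
        return count > 0 ? 1.0 + Math.log10(count) : 0.0;
    }

    // ASSUMPTION: idf = log10(1 + totalDocs / docsWithTerm).
    static double idf(int totalDocs, int docsWithTerm) {
        return Math.log10(1.0 + (double) totalDocs / docsWithTerm);
    }

    // Counts lowercase, whitespace-separated tokens per document.
    static Map<String, Map<String, Integer>> count(Map<String, String> docs) {
        Map<String, Map<String, Integer>> counts = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            Map<String, Integer> perDoc = new HashMap<>();
            for (String token : doc.getValue().toLowerCase().split("\\s+")) {
                if (!token.isEmpty()) perDoc.merge(token, 1, Integer::sum);
            }
            counts.put(doc.getKey(), perDoc);
        }
        return counts;
    }

    // Scores each document as the sum of TF-IDF weights of the query terms.
    // Documents that contain none of the terms are left out of the result.
    static Map<String, Double> search(Map<String, String> docs, String query) {
        Map<String, Map<String, Integer>> counts = count(docs);
        Map<String, Double> scores = new TreeMap<>();
        for (String term : query.toLowerCase().split("\\s+")) {
            if (term.isEmpty()) continue;
            int df = 0;
            for (Map<String, Integer> perDoc : counts.values()) {
                if (perDoc.containsKey(term)) df++;
            }
            if (df == 0) continue; // term appears in no document
            double weight = idf(docs.size(), df);
            for (Map.Entry<String, Map<String, Integer>> e : counts.entrySet()) {
                int c = e.getValue().getOrDefault(term, 0);
                if (c > 0) scores.merge(e.getKey(), tf(c) * weight, Double::sum);
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        // Toy documents for illustration only; not the Canterbury Corpus.
        Map<String, String> docs = new HashMap<>();
        docs.put("a.txt", "computer science is the science of computing");
        docs.put("b.txt", "data analysis with hadoop");
        search(docs, "computer science")
                .forEach((file, score) -> System.out.println(file + "\t" + score));
    }
}
```

In a MapReduce setting the same steps would typically be split across chained jobs, with each stage reading the previous stage's output directory; the sketch collapses them into one process purely for readability.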
