GitHub - aman-nidhi/IR-Text-Document-Retrieval: simple Command line tool to search text file using tf-idf indexing and cosine similarity

## A Simple Text File Retrieval System

Documents and queries are represented as vectors. Retrieved text files are ranked by the cosine similarity between each document vector and the query vector. The vector for a document is an array of the tf-idf scores of the terms present in that document.
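The ranking idea above can be sketched in a few lines of Python. This is a minimal illustration and not the code in createIndex.py or queryDoc.py: the function names (`tf_idf_vectors`, `cosine`, `rank`), the sparse dict representation, and the particular tf-idf weighting (term frequency divided by document length, times `log(N / df)`) are all assumptions made for the example.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Build sparse tf-idf vectors.

    docs maps a (hypothetical) file name to its list of tokens.
    Returns (vectors, idf), where vectors[name] maps term -> weight.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))  # count each term once per document
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}

    vectors = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        vectors[name] = {
            term: (count / len(tokens)) * idf[term]
            for term, count in counts.items()
        }
    return vectors, idf


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query_tokens, vectors, idf):
    """Score every document against the query, best match first."""
    counts = Counter(t for t in query_tokens if t in idf)  # drop unseen terms
    total = sum(counts.values())
    query_vec = {t: (c / total) * idf[t] for t, c in counts.items()} if total else {}
    scores = {name: cosine(query_vec, vec) for name, vec in vectors.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    docs = {
        "a.txt": ["cat", "sat", "mat"],
        "b.txt": ["dog", "sat", "log"],
        "c.txt": ["cat", "cat", "hat"],
    }
    vectors, idf = tf_idf_vectors(docs)
    for name, score in rank(["cat"], vectors, idf):
        print(name, round(score, 3))
```

With this weighting, a term that occurs in every document gets an idf of zero and contributes nothing to the ranking, which is why very common words are usually removed as stopwords beforehand.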

First run the create index program:

    python createIndex.py

Then run the query index program:

    python queryDoc.py pq

To run the query program, specify the type of query as an argument:

- pq: phrase query
- ftq: free text query
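The difference between the two query types can be illustrated with a small positional index. The sketch below assumes the usual meanings: a free text query matches any document containing at least one query term, while a phrase query matches only documents where the terms appear consecutively and in order. The function names and the index layout are hypothetical and may differ from what queryDoc.py actually does.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map term -> {file name -> sorted list of token positions}."""
    index = defaultdict(lambda: defaultdict(list))
    for name, tokens in docs.items():
        for position, term in enumerate(tokens):
            index[term][name].append(position)
    return index


def free_text_query(index, terms):
    """Documents containing at least one of the query terms."""
    hits = set()
    for term in terms:
        hits.update(index.get(term, {}))
    return hits


def phrase_query(index, terms):
    """Documents containing the query terms consecutively, in order."""
    if not terms or any(term not in index for term in terms):
        return set()
    # Only documents that contain every term can contain the phrase.
    candidates = set(index[terms[0]])
    for term in terms[1:]:
        candidates &= set(index[term])
    hits = set()
    for name in candidates:
        for start in index[terms[0]][name]:
            # The i-th term of the phrase must sit at position start + i.
            if all(start + i in index[term][name] for i, term in enumerate(terms)):
                hits.add(name)
                break
    return hits


if __name__ == "__main__":
    docs = {
        "a.txt": ["new", "york", "city"],
        "b.txt": ["york", "new", "city"],
    }
    index = build_positional_index(docs)
    print(sorted(free_text_query(index, ["new", "york"])))
    print(sorted(phrase_query(index, ["new", "york"])))
```

Storing positions in the posting list is what makes the phrase check possible; an index that records only which documents contain a term can answer free text queries but not phrase queries.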

- english_stopwords.txt: the stopwords file
- index_db.json: the inverted index of the corpus; stores each term and its corresponding posting list
- index_score_db.json: the tf-idf database for each word
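As a rough illustration of how an inverted index with posting lists can be built and stored as JSON, here is a short sketch. The JSON layout shown (term -> file name -> list of positions), the whitespace tokenizer, and the `build_index` helper are assumptions for the example; the actual structure of index_db.json may differ.

```python
import json


def build_index(docs, stopwords):
    """Build term -> {file name -> [positions]}, skipping stopwords.

    docs maps a (hypothetical) file name to its raw text.
    """
    index = {}
    for name, text in docs.items():
        for position, token in enumerate(text.lower().split()):
            if token in stopwords:
                continue
            index.setdefault(token, {}).setdefault(name, []).append(position)
    return index


if __name__ == "__main__":
    docs = {"a.txt": "The cat sat", "b.txt": "the dog sat"}
    index = build_index(docs, stopwords={"the"})
    serialized = json.dumps(index, indent=2, sort_keys=True)
    print(serialized)
    # The index survives a JSON round trip because it uses only
    # string keys, lists, and integers.
    assert json.loads(serialized) == index
```

Keeping the index in a JSON file means the corpus only has to be parsed once; later queries can load the file instead of rebuilding the index.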

Folders and files:

- data/pos
- demo_images
- progressbar
- stemmer
- stopwords
- .gitignore
- README.md
- createIndex.py
- index_db.json
- index_score_db.json
- queryDoc.py
