In this project, a parser and an inverter were built to parse HTML pages and create an inverted index. Four scoring functions (Okapi-TF, Okapi-TFIDF, Okapi-BM25, and a language model with Jelinek-Mercer smoothing) were also implemented for document retrieval.
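
As a rough illustration of what "inverting" means here, the sketch below turns a forward index (document to term sequence) into an inverted index (term to postings with positions). It is not the project's actual code: the function name `invert`, the in-memory dictionaries, and the 1-based positions are assumptions made for the example, and the on-disk formats of doc_index.txt and term_index.txt are not reproduced.

```python
from collections import defaultdict


def invert(forward_index):
    """Turn {doc_id: [term_id, ...]} into {term_id: {doc_id: [positions]}}.

    Positions are 1-based offsets of the term within the document
    (an assumption for this sketch).
    """
    inverted = defaultdict(dict)
    for doc_id, term_ids in forward_index.items():
        for position, term_id in enumerate(term_ids, start=1):
            # Append this occurrence to the posting list for (term, doc).
            inverted[term_id].setdefault(doc_id, []).append(position)
    return dict(inverted)


# Hypothetical toy data: two documents, three distinct term ids.
forward = {1: [10, 11, 10], 2: [11, 12]}
index = invert(forward)
```

With this layout, the document frequency of a term is `len(index[term_id])` and its frequency in one document is the length of that document's position list.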
Run the scripts in the following order:
- `python parser.py <folder containing HTML files>`
  - uses stoplist.txt and the HTML files in the folder passed on the command line
  - creates docids.txt, termids.txt, doc_index.txt
- `python inverter.py`
  - uses docids.txt, termids.txt, doc_index.txt
  - creates term_info.txt, term_index.txt
- `python docLengthCalculator.py`
  - uses doc_index.txt
  - creates doc_lengths.txt
- `python query.py --score <score function> --query <search query>`
  - available score functions: TF, TF-IDF, BM25, JM
  - uses docids.txt, termids.txt, stoplist.txt, term_index.txt, doc_lengths.txt
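
To give a feel for what two of the score functions compute, here is a minimal sketch of per-term BM25 and Jelinek-Mercer scoring. It is not taken from query.py: the function names, the parameter defaults (`k1=1.2`, `b=0.75`, `lam=0.2`), the choice of the non-negative IDF variant `log(1 + (N - df + 0.5) / (df + 0.5))`, and the convention that `lam` weights the document model are all assumptions; the project's own formulas and constants may differ.

```python
import math


def bm25_term(tf, df, doc_len, avg_doc_len, num_docs, k1=1.2, b=0.75):
    """BM25 contribution of one query term to one document's score."""
    # Non-negative IDF variant (assumed here, not confirmed by the project).
    idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
    # Length normalisation: longer-than-average documents are penalised.
    norm = k1 * ((1 - b) + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / (tf + norm)


def jm_term(tf, doc_len, cf, collection_len, lam=0.2):
    """Log-probability of one query term under Jelinek-Mercer smoothing.

    Mixes the document model (tf / doc_len) with the collection model
    (cf / collection_len); lam is the weight on the document model.
    Assumes cf > 0, i.e. the term occurs somewhere in the collection.
    """
    p_doc = tf / doc_len if doc_len else 0.0
    p_coll = cf / collection_len
    return math.log(lam * p_doc + (1 - lam) * p_coll)


def score(term_stats, term_fn, **kwargs):
    """Sum per-term scores; term_stats is a list of keyword-argument dicts."""
    return sum(term_fn(**stats, **kwargs) for stats in term_stats)
```

A document's score for a query is the sum of the per-term values, for example `score([{"tf": 2, "df": 3, "doc_len": 90}], bm25_term, avg_doc_len=100, num_docs=50)`. The statistics involved (term frequency, document frequency, document length) correspond to what term_index.txt and doc_lengths.txt are used for above.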
You can get in touch with me on my LinkedIn Profile: Farhan Shoukat
MIT License. Copyright (c) 2018 Farhan Shoukat