A search engine built on the Cranfield collection for "CS7IS3 INFORMATION RETRIEVAL AND WEB SEARCH". Read Report - here
Similarities evaluated:
- tfidf
- boolean
- bm25
Each similarity was run with the StandardAnalyzer and a CustomAnalyzer.
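To give a feel for what the bm25 similarity rewards, here is a minimal, standalone sketch of the textbook Okapi BM25 score for a single query term. It is not the project's code and not Lucene's internal implementation. The class name `Bm25Sketch` and the sample numbers are illustrative assumptions; `k1 = 1.2` and `b = 0.75` are the default parameters of Lucene's `BM25Similarity`.

```java
/** Illustrative textbook BM25 for one query term; hypothetical helper, not the project's code. */
final class Bm25Sketch {
    static final double K1 = 1.2;  // term-frequency saturation
    static final double B = 0.75;  // strength of document-length normalisation

    /** Inverse document frequency of a term that occurs in docFreq of docCount documents. */
    static double idf(long docFreq, long docCount) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    /** BM25 contribution of one query term to one document. */
    static double score(double termFreq, long docFreq, long docCount,
                        double docLength, double avgDocLength) {
        double norm = K1 * (1.0 - B + B * docLength / avgDocLength);
        return idf(docFreq, docCount) * termFreq * (K1 + 1.0) / (termFreq + norm);
    }

    public static void main(String[] args) {
        // 1400 documents, as in cran.all.1400; every other number is made up.
        System.out.println(score(1, 50, 1400, 100, 100));
        System.out.println(score(5, 50, 1400, 100, 100)); // higher, but with diminishing returns
    }
}
```

Repeated occurrences of a term raise the score with diminishing returns, and documents longer than average are penalised. tfidf has neither of these saturation and length controls in this form, and boolean ignores term frequency altogether.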
Grant execute permission to the bash script, which automatically unzips trec_eval.zip, builds the Java Lucene project, and runs trec_eval:
git clone https://github.com/QUzair/LuceneSE.git
cd LuceneSE
chmod u+x trecEval.sh
./trecEval.sh
Main Classes:
- CranFileParser.java
Parses the Cran docs file and indexes it with the specified Analyzer
- CranfieldQueries
Parses the Cran queries file and creates the DocRanks for the queries
- CranfieldModel
Basic model for the fields of a Cranfield doc (id, title, author, biblio, content)
- PersonalQueries
Class to create custom queries for created Index
- Main
Main class which indexes and searches with different analyzers and similarity classes
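The parsing step can be sketched without Lucene. In cran.all.1400 each document starts with a `.I <id>` line, followed by `.T` (title), `.A` (author), `.B` (bibliography) and `.W` (content) markers, each on its own line. The code below is a minimal sketch under that assumption. `CranParseSketch` and its nested `Doc` class are hypothetical stand-ins, not the repository's CranFileParser or CranfieldModel.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of parsing the Cranfield document format. */
final class CranParseSketch {

    /** Stand-in for a document model with id, title, author, biblio and content. */
    static final class Doc {
        final String id, title, author, biblio, content;

        Doc(String id, Map<String, StringBuilder> fields) {
            this.id = id;
            this.title = text(fields, ".T");
            this.author = text(fields, ".A");
            this.biblio = text(fields, ".B");
            this.content = text(fields, ".W");
        }
    }

    /** Returns the collected text of a field, or an empty string if the field is missing. */
    private static String text(Map<String, StringBuilder> fields, String marker) {
        StringBuilder sb = fields.get(marker);
        return sb == null ? "" : sb.toString().trim();
    }

    static List<Doc> parse(List<String> lines) {
        List<Doc> docs = new ArrayList<>();
        Map<String, StringBuilder> fields = new HashMap<>();
        String id = null;      // id of the document being read
        String marker = null;  // field marker the current lines belong to
        for (String raw : lines) {
            String line = raw.trim();
            if (line.startsWith(".I ")) {
                if (id != null) docs.add(new Doc(id, fields)); // close the previous document
                id = line.substring(3).trim();
                fields = new HashMap<>();
                marker = null;
            } else if (line.equals(".T") || line.equals(".A")
                    || line.equals(".B") || line.equals(".W")) {
                marker = line;
            } else if (marker != null) {
                // Field text may span several lines; join them with single spaces.
                fields.computeIfAbsent(marker, k -> new StringBuilder()).append(line).append(' ');
            }
        }
        if (id != null) docs.add(new Doc(id, fields)); // close the last document
        return docs;
    }

    public static void main(String[] args) {
        List<Doc> docs = parse(List.of(".I 1", ".T", "sample title", ".W", "sample", "content"));
        System.out.println(docs.get(0).id + ": " + docs.get(0).title + " / " + docs.get(0).content);
    }
}
```

In the actual project each parsed field would then be added to a Lucene document and indexed with the chosen Analyzer.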
Within cran folder:
- cran.all.1400
Contains 1400 documents from the Cranfield Collection.
- cran.qry
Queries used to test our implementation of the search engine with trec_eval
- QRelsCorrectedforTRECeval
Relevance judgments (RelDocs) used to evaluate our own search results
Output/Other files:
- similarityFiles
Contains the 'DocRanks' results produced by our scoring with bm25, boolean and tfidf
- trecEval.sh
Bash script that unzips and builds trec_eval from trec_eval.zip, builds the Java Lucene project, and runs trec_eval on the generated similarityFiles (which contain the 'DocRanks') against QRelsCorrectedforTRECeval
- stopWords.txt
List of stopwords taken from https://www.ranks.nl/stopwords
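For the 'DocRanks' in similarityFiles to be readable by trec_eval, each result line has to follow the standard run format: query id, the literal `Q0`, document id, rank, score and a run tag, separated by whitespace. The following is a minimal sketch of producing such a line. `TrecRunSketch`, the sample ids and the `bm25` run tag are illustrative assumptions, not names taken from the repository.

```java
import java.util.Locale;

/** Hypothetical sketch of formatting one trec_eval result line. */
final class TrecRunSketch {

    /** Formats: <queryId> Q0 <docId> <rank> <score> <runTag>. */
    static String line(int queryId, String docId, int rank, double score, String runTag) {
        // Locale.ROOT keeps the decimal separator a dot regardless of system locale.
        return String.format(Locale.ROOT, "%d Q0 %s %d %.4f %s",
                queryId, docId, rank, score, runTag);
    }

    public static void main(String[] args) {
        System.out.println(line(1, "184", 1, 12.5, "bm25"));
    }
}
```

trec_eval is then run with the relevance judgments file first and the results file second.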
Basic Custom analyzer with stopwords taken from https://www.ranks.nl/stopwords
// `source` is assumed to be the Tokenizer created earlier in the analyzer (not shown here)
// Create the token stream, lower-casing every token
TokenStream tokenStream = new LowerCaseFilter(source);
// Add the possessive and stemming filters
tokenStream = new EnglishPossessiveFilter(tokenStream);
tokenStream = new PorterStemFilter(tokenStream);
tokenStream = new EnglishMinimalStemFilter(tokenStream);
tokenStream = new KStemFilter(tokenStream);
// Start from an empty stop set so a failed read cannot leave newStopSet null
CharArraySet newStopSet = CharArraySet.EMPTY_SET;
try {
    newStopSet = StopFilter.makeStopSet(getStopWords()); // set of words from ranks.nl/stopwords
} catch (IOException e) {
    e.printStackTrace();
}
tokenStream = new StopFilter(tokenStream, newStopSet);
return new TokenStreamComponents(source, tokenStream);
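The `getStopWords()` helper is not shown above. One plausible shape for it, assuming stopWords.txt holds one word per line, is sketched here. The class name `StopWordsSketch` and the `Path` parameter are illustrative assumptions, since the original method takes no arguments.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch of loading a stop-word list from a text file. */
final class StopWordsSketch {

    /** Reads one stop word per line, skipping blank lines and lower-casing each entry. */
    static List<String> getStopWords(Path file) throws IOException {
        List<String> words = new ArrayList<>();
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            String word = line.trim().toLowerCase(Locale.ROOT);
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    public static void main(String[] args) throws IOException {
        // A temporary file stands in for stopWords.txt.
        Path file = Files.createTempFile("stopWords", ".txt");
        Files.write(file, List.of("a", "about", "", "The "));
        System.out.println(getStopWords(file)); // prints [a, about, the]
        Files.delete(file);
    }
}
```

Entries are lower-cased because the StopFilter sits after the LowerCaseFilter in the chain. It also sits after the stemming filters, so stop words are compared against stemmed tokens.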
Similarity | StandardAnalyzer | CustomAnalyzer |
---|---|---|
tfidf | 0.1557 | 0.2796 |
boolean | 0.1782 | 0.2781 |
bm25 | 0.2864 | 0.3375 |
As the table shows, bm25 combined with the CustomAnalyzer gives the best result (0.3375), and the CustomAnalyzer outperforms the StandardAnalyzer for all three similarities.