WikiSe : Wikipedia Search Engine

Arpit Bhayani

A Wikipedia search engine built using:

Java
XML parsing with a SAX parser
Ranking Algorithms

It works on Wikipedia XML dumps.
XML Dump Name : enwiki-latest-pages-articles.xml.bz2
XML Dump Link : http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
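
As a rough illustration of the SAX-based parsing step, the sketch below streams through a dump and collects page titles. It is not the project's actual parser: the class name PageTitleHandler and the method parseTitles are made up for this example, and it assumes the .bz2 file has already been decompressed, since the JDK has no built-in bzip2 reader.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the text of every <title> element in a dump.
class PageTitleHandler extends DefaultHandler {
    final List<String> titles = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private boolean inTitle = false;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        if ("title".equals(qName)) {
            inTitle = true;
            text.setLength(0); // start a fresh buffer for this title
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // SAX may deliver one text node in several chunks, so append them all.
        if (inTitle) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("title".equals(qName)) {
            inTitle = false;
            titles.add(text.toString().trim());
        }
    }

    // Streams the XML once; memory use does not grow with the dump size
    // (apart from the collected titles themselves).
    static List<String> parseTitles(InputStream in) {
        try {
            PageTitleHandler handler = new PageTitleHandler();
            SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
            return handler.titles;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse dump", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<mediawiki>"
                + "<page><title>Mumbai</title><revision><text>...</text></revision></page>"
                + "<page><title>Maharashtra</title><revision><text>...</text></revision></page>"
                + "</mediawiki>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(parseTitles(in)); // prints [Mumbai, Maharashtra]
    }
}
```

A streaming parser suits a dump of this size because, unlike a DOM parser, it never holds the whole document in memory.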

Implementation basics:
Multi-level indexing (primary + secondary) to reduce search time.
Index terms are hashed into buckets 'a' - 'z'.
The index is compressed at the bit level (total size = 9.7 GB).
Special infobox parsing provides direct answers where possible.
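
The README does not say how terms are mapped to 'a' - 'z' or which bit-level scheme is used, so the following is only a sketch under two assumptions: terms are bucketed by their first letter, and posting lists are stored as Elias gamma coded gaps between sorted document ids. IndexSketch, bucketOf, encodeDocIds and decodeDocIds are names invented for this example.

```java
import java.util.BitSet;

class IndexSketch {
    // Minimal append-only bit buffer.
    static final class Bits {
        final BitSet bits = new BitSet();
        int length = 0;

        void write(boolean bit) {
            if (bit) {
                bits.set(length);
            }
            length++;
        }
    }

    // Assumed bucketing rule: first letter of the term; anything that does not
    // start with a-z is spread over the 26 buckets by hash code.
    static char bucketOf(String term) {
        char c = term.isEmpty() ? 'a' : Character.toLowerCase(term.charAt(0));
        if (c >= 'a' && c <= 'z') {
            return c;
        }
        return (char) ('a' + Math.floorMod(term.hashCode(), 26));
    }

    // Elias gamma code of each gap: n zero bits, then the gap in n + 1 bits,
    // where n = floor(log2(gap)). Ids must be positive and strictly increasing.
    static Bits encodeDocIds(int[] sortedDocIds) {
        Bits out = new Bits();
        int prev = 0;
        for (int id : sortedDocIds) {
            int gap = id - prev;
            if (gap < 1) {
                throw new IllegalArgumentException("ids must be positive and strictly increasing");
            }
            prev = id;
            int n = 31 - Integer.numberOfLeadingZeros(gap);
            for (int i = 0; i < n; i++) {
                out.write(false);
            }
            for (int i = n; i >= 0; i--) {
                out.write(((gap >>> i) & 1) == 1);
            }
        }
        return out;
    }

    // Reverses encodeDocIds for a known number of ids.
    static int[] decodeDocIds(Bits in, int count) {
        int[] ids = new int[count];
        int pos = 0;
        int prev = 0;
        for (int k = 0; k < count; k++) {
            int n = 0;
            while (pos < in.length && !in.bits.get(pos)) {
                n++;
                pos++;
            }
            if (pos + n >= in.length) {
                throw new IllegalArgumentException("bit stream ended early");
            }
            int gap = 0;
            for (int i = 0; i <= n; i++) {
                gap = (gap << 1) | (in.bits.get(pos++) ? 1 : 0);
            }
            prev += gap;
            ids[k] = prev;
        }
        return ids;
    }
}
```

With this scheme the posting list {3, 7, 8, 20} has gaps 3, 4, 1, 12 and takes 16 bits, compared with 128 bits as four plain 32-bit integers. The saving on a real index depends on how dense the posting lists are.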

Special Features :

Index compression reduces the index to half its size (bit-level compression).
Special search fields let the user search infobox fields directly.

e.g. Search Query : website:mumbai

Title : mumbai
Title : mumbai indians
Title : 2012-13 mumbai f.c. season
Title : 2008 mumbai attacks
Title : maharashtra
Title : public transport in mumbai
Title : attribution of the 2008 mumbai attacks
Title : list of constituencies of maharashtra vidhan sabha
Title : wikipedia:files for deletion/2010 april 13
Title : list of colleges in mumbai
****** {{url|www.mcgm.gov.in}} ****** <------ Website link
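
The raw value shown above comes from the article's infobox. A minimal sketch of pulling such fields out of wikitext could look like the following; it assumes one "| key = value" pair per line and an infobox closed by a line containing only "}}", which real articles do not always follow. The class InfoboxFields is hypothetical and not taken from the project source.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

class InfoboxFields {
    // Returns the infobox fields of an article as lower-cased key -> raw value.
    static Map<String, String> parse(String wikitext) {
        Map<String, String> fields = new LinkedHashMap<>();
        boolean inInfobox = false;
        for (String line : wikitext.split("\n")) {
            String t = line.trim();
            if (t.toLowerCase(Locale.ROOT).startsWith("{{infobox")) {
                inInfobox = true;
                continue;
            }
            if (!inInfobox) {
                continue;
            }
            if (t.equals("}}")) {
                break; // end of the infobox (simplifying assumption)
            }
            // Split at the first '=' so values such as {{url|...}} stay intact.
            int eq = t.indexOf('=');
            if (t.startsWith("|") && eq > 0) {
                String key = t.substring(1, eq).trim().toLowerCase(Locale.ROOT);
                String value = t.substring(eq + 1).trim();
                fields.put(key, value);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        String wikitext = "{{Infobox settlement\n"
                + "| name = Mumbai\n"
                + "| website = {{url|www.mcgm.gov.in}}\n"
                + "}}\n"
                + "'''Mumbai''' is a city in Maharashtra.";
        System.out.println(parse(wikitext).get("website")); // prints {{url|www.mcgm.gov.in}}
    }
}
```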

Interesting searches :

pratieik
chudail
joey tribbiani
cartoon

nick:phoebe buffay powers:batman age:dimple kapadia population:amravati location:takla lake portrayer:joey tribbiani series:joey tribbiani t:priyanka t:priyanka age:priyanka
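
Queries of the form field:value, like the ones above, have to be split into an infobox field and the search terms. The sketch below shows one way to do that for a single query; FieldQuery is an illustrative name, and treating everything before the first colon as the field is an assumption about the query syntax.

```java
import java.util.Locale;

class FieldQuery {
    final String field; // empty string for a plain query without a field
    final String terms;

    FieldQuery(String field, String terms) {
        this.field = field;
        this.terms = terms;
    }

    // Splits "website:mumbai" into field "website" and terms "mumbai".
    // A query without a colon is treated as a plain full-text query.
    static FieldQuery parse(String query) {
        String q = query.trim().toLowerCase(Locale.ROOT);
        int colon = q.indexOf(':');
        if (colon <= 0) {
            return new FieldQuery("", q);
        }
        return new FieldQuery(q.substring(0, colon).trim(), q.substring(colon + 1).trim());
    }

    public static void main(String[] args) {
        FieldQuery a = parse("age:dimple kapadia");
        System.out.println(a.field + " -> " + a.terms); // prints age -> dimple kapadia
        FieldQuery b = parse("joey tribbiani");
        System.out.println(b.field.isEmpty());          // prints true
    }
}
```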

Statistics :

Machine configuration : Lenovo Z580, 4 GB of RAM, 5400 rpm hard disk

For 100 MB of data :
- Size of index ( primary+secondary ) : 24.3 MB
- Time to primary index : 9.031 sec
- Time to secondary index : 1.041 sec
- Time to search : 0.007 sec

For 46.7 GB of data ( Wiki XML Dump ) :
- Size of index ( primary+secondary ) : 9.7 GB
- Time to index : 2hr 28min (average)
- Time to search : 0.251 sec (average over 100 searches)
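
An average search time like the one above can be measured with a small harness along these lines. This is only a sketch: SearchTimer is a made-up name, the search function is passed in by the caller, and the README does not describe how the original measurements were taken.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

class SearchTimer {
    // Runs every query once and returns the mean wall-clock time in seconds.
    static double averageSeconds(List<String> queries, Consumer<String> search) {
        if (queries.isEmpty()) {
            throw new IllegalArgumentException("need at least one query");
        }
        long start = System.nanoTime();
        for (String q : queries) {
            search.accept(q);
        }
        long elapsedNanos = System.nanoTime() - start;
        return elapsedNanos / 1e9 / queries.size();
    }

    public static void main(String[] args) {
        List<String> queries = Arrays.asList("website:mumbai", "joey tribbiani", "cartoon");
        // Placeholder search that does nothing useful; plug in the real search call here.
        double avg = averageSeconds(queries, q -> q.hashCode());
        System.out.println("average seconds per search: " + avg);
    }
}
```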
