WikiSe: Wikipedia Search Engine

Arpit Bhayani

A Wikipedia search engine built using:

  • Java
  • XML parsing using a SAX parser
  • Ranking algorithms (see the scoring sketch at the end of this README)

It works on Wikipedia XML dumps.
XML dump name: enwiki-latest-pages-articles.xml.bz2
XML dump link: http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
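The dump can be streamed with the JDK's built-in SAX parser. The sketch below only illustrates that approach and is not WikiSe's actual code (the class name and helper method are made up for the example); it walks the decompressed XML and hands each page's title and text to an indexing step:

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;
import java.io.File;

// Streams <page> elements out of a decompressed Wikipedia dump.
// The dump schema puts each article's <title> and <text> inside a <page>.
public class DumpReader extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    private String title;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        buffer.setLength(0);                 // start collecting a fresh element's text
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        buffer.append(ch, start, length);    // SAX may deliver text in several chunks
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals("title")) {
            title = buffer.toString();
        } else if (qName.equals("text")) {
            index(title, buffer.toString()); // hand the page to the indexer
        }
    }

    private void index(String title, String text) {
        // placeholder: tokenize, build postings, parse the infobox, etc.
    }

    public static void main(String[] args) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new File("enwiki-latest-pages-articles.xml"), new DumpReader());
    }
}

SAX is a natural fit here because the full dump is far too large to load into memory as a DOM tree.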

Implementation basics:

  • Two levels of indexing (a primary and a secondary index) reduce search time.
  • Index terms are hashed into 26 buckets, one per character 'a'-'z'.
  • The index is compressed at the bit level (total size: 9.7 GB).
  • Infoboxes are parsed separately so that direct answers can be returned when possible.
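The README does not spell out the bit-level scheme. One standard technique for bit-level postings compression is Elias gamma coding of the gaps between sorted document IDs; the sketch below shows that idea purely as an assumption about what "bit level compression" could look like, not as WikiSe's actual encoder:

import java.util.BitSet;
import java.util.List;

// Illustrative bit-level compression of a postings list: keep doc IDs sorted,
// store the gaps between consecutive IDs, and Elias-gamma-code each gap.
public class GammaPostings {

    // Writes n (>= 1) in Elias gamma code: floor(log2 n) zero bits, then n in binary.
    static int writeGamma(BitSet bits, int pos, int n) {
        int len = 32 - Integer.numberOfLeadingZeros(n);   // bits needed to write n
        pos += len - 1;                                    // len-1 leading zeros (bits stay clear)
        for (int i = len - 1; i >= 0; i--) {
            if (((n >> i) & 1) == 1) bits.set(pos);
            pos++;
        }
        return pos;                                        // next free bit position
    }

    // Encodes a strictly ascending list of doc IDs as gamma-coded gaps.
    static BitSet encode(List<Integer> sortedDocIds) {
        BitSet bits = new BitSet();
        int pos = 0, prev = -1;
        for (int id : sortedDocIds) {
            pos = writeGamma(bits, pos, id - prev);        // gap is always >= 1
            prev = id;
        }
        return bits;
    }

    public static void main(String[] args) {
        // Doc IDs 3, 7, 8, 20 become gaps 4, 4, 1, 12 before coding.
        System.out.println(encode(List.of(3, 7, 8, 20)));
    }
}

Gap coding works because the document IDs in a postings list are ascending, so the gaps are small positive integers that gamma coding stores in very few bits.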

Special features:

  1. Bit-level index compression that roughly halves the size of the index.
  2. Special search fields that let the user query infobox data directly.

For example, the search query website:mumbai returns:

Title : mumbai
Title : mumbai indians
Title : 2012-13 mumbai f.c. season
Title : 2008 mumbai attacks
Title : maharashtra
Title : public transport in mumbai
Title : attribution of the 2008 mumbai attacks
Title : list of constituencies of maharashtra vidhan sabha
Title : wikipedia:files for deletion/2010 april 13
Title : list of colleges in mumbai
****** {{url|www.mcgm.gov.in}} ****** <------ Website link
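A fielded query like website:mumbai first has to be split into its field and its term before the infobox index is consulted. The minimal parser below is a sketch of that split under assumed rules (single colon separator, case-insensitive); the class name and grammar are illustrative, not WikiSe's:

// Splits a query such as "website:mumbai" into a field part and a term part.
// A query with no ':' is treated as an ordinary full-text search.
public final class FieldedQuery {
    final String field;   // e.g. "website", or null for a plain query
    final String term;    // e.g. "mumbai"

    private FieldedQuery(String field, String term) {
        this.field = field;
        this.term = term;
    }

    static FieldedQuery parse(String raw) {
        String q = raw.trim().toLowerCase();
        int colon = q.indexOf(':');
        if (colon <= 0) {
            return new FieldedQuery(null, q);              // plain search
        }
        return new FieldedQuery(q.substring(0, colon),     // infobox field, e.g. "website"
                                q.substring(colon + 1));   // search term, e.g. "mumbai"
    }

    public static void main(String[] args) {
        FieldedQuery q = parse("website:mumbai");
        System.out.println(q.field + " -> " + q.term);     // prints: website -> mumbai
    }
}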

Interesting searches:

  • pratieik
  • chudail
  • joey tribbiani
  • cartoon

More fielded queries to try:

  • nick:phoebe buffay
  • powers:batman
  • age:dimple kapadia
  • population:amravati
  • location:takla lake
  • portrayer:joey tribbiani
  • series:joey tribbiani
  • t:priyanka
  • t:priyanka age:priyanka

Statistics, measured on a Lenovo Z580 with 4 GB of RAM and a 5400 rpm hard disk:

  • For 100 MB of data:
    • Size of index (primary + secondary): 24.3 MB
    • Time to build the primary index: 9.031 sec
    • Time to build the secondary index: 1.041 sec
    • Time to search: 0.007 sec
  • For the full 46.7 GB Wikipedia XML dump:
    • Size of index (primary + secondary): 9.7 GB
    • Time to index: 2 hr 28 min (average)
    • Time to search: 0.251 sec (average over 100 searches)
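
The bullet list at the top mentions ranking algorithms without naming one. A TF-IDF-style score is a common choice for ranking matched pages in engines like this; the sketch below is an assumed example of such a scorer, not necessarily the function WikiSe uses:

// Illustrative TF-IDF scoring of one document against a single query term.
// termFreq  : occurrences of the term in the document
// docFreq   : number of documents containing the term
// totalDocs : number of documents in the collection
public class TfIdf {

    static double score(int termFreq, int docFreq, long totalDocs) {
        if (termFreq == 0 || docFreq == 0) return 0.0;
        double tf = 1 + Math.log(termFreq);                   // dampened term frequency
        double idf = Math.log((double) totalDocs / docFreq);  // rarer terms weigh more
        return tf * idf;
    }

    public static void main(String[] args) {
        // A term appearing 5 times in a page, present in 1,000 of 4,000,000 pages.
        System.out.println(score(5, 1_000, 4_000_000));
    }
}

In practice the per-term scores are summed over all query terms and the matching documents are sorted by the total before being shown to the user.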
