About

This is a search engine built from the ground up. It can handle tens of thousands of documents or web pages under harsh operational constraints, with a query response time under 300 milliseconds.

Install Dependencies

Follow the Install Dependencies steps only if the bundled virtual environment does not work. Otherwise, you can skip this section.

Install Python3

If you do not have Python 3.6+:

The program requires Python 3.6+ because some of the functions it uses are not available in Python 2.
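A quick way to confirm that the interpreter meets this requirement is a short check like the one below. This is only a minimal sketch: the name `meets_minimum` is illustrative and is not part of this project.

```python
import sys

MINIMUM = (3, 6)  # minimum version stated in this README


def meets_minimum(version_info=sys.version_info, minimum=MINIMUM):
    """Return True if the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    found = sys.version.split()[0]
    if meets_minimum():
        print("Python version OK:", found)
    else:
        print("Python 3.6+ is required, found:", found)
```

Running it with `python3` should report whether the interpreter on your PATH is new enough.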

Windows: https://www.python.org/downloads/windows/

Linux: https://docs.python-guide.org/starting/install3/linux/

macOS: https://docs.python-guide.org/starting/install3/osx/

Check whether pip is installed by opening a terminal/command prompt and typing the command python3 -m pip. This should show the help menu listing all the commands available in pip. If it does not, get pip by following the instructions at https://pip.pypa.io/en/stable/installing/

To install the dependencies for this project, run the commands below after ensuring pip is installed for the version of Python you are using. Admin privileges might be required to execute the commands. Also make sure that the terminal is at the root folder of this project.

Virtual Environment Tutorial

$ mkdir my_virtual_environment
$ cd my_virtual_environment
$ python3 -m venv venv
$ cd ..
$ source my_virtual_environment/venv/bin/activate
(venv) $ pip install --upgrade pip
(venv) $ pip install flask
(venv) $ pip install flask-wtf
(venv) $ pip install flask-sqlalchemy
(venv) $ pip install nltk
(venv) $ pip install BeautifulSoup4

Run this line in a terminal to activate the virtual environment in the venv folder:

$ source my_virtual_environment/venv/bin/activate

Resource Requirements

Option 1: Use the crawler program to crawl the pages. Store the page results in a DEV folder inside the folder containing all files.
Option 2: Download and decompress the zip folder. Add the DEV folder inside the folder containing all files.

Web Browser Launch

If the output folder with the inverted index already exists, you can skip this step and update directly from the web UI. Otherwise, to create the output folder with the inverted index:

(venv) $ python3 web_launch.py

Use the Makefile to run the web UI (it automatically runs all 5 lines below):

(venv) $ make

Instead of the make command, you can set the Flask environment variables and run the web UI yourself:

(venv) $ export FLASK_APP=web_launch.py
(venv) $ export FLASK_ENV=development
(venv) $ export FLASK_RUN_HOST=localhost
(venv) $ export FLASK_RUN_PORT=8000
(venv) $ python3 -m flask run

Use a web browser to open http://localhost:8000/ (if you set a different host name or port number, use the link shown in the console output).

To exit the virtual environment:

(venv) $ deactivate

Program File Descriptions

config.ini

configurations for file names, variables, etc.

config.py

read config.ini for the program

indexer.py

M1 part for creating inverted index
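As a rough illustration of what creating an inverted index involves, the sketch below maps each term to the documents that contain it, along with a frequency and token positions. It is not the code in indexer.py: the regex tokenizer and the function names `tokenize` and `build_index` are assumptions made for this example, and the sample documents are made up.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to {doc_id: {"freq": int, "positions": [int, ...]}}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(tokenize(text)):
            entry = index[term].setdefault(doc_id, {"freq": 0, "positions": []})
            entry["freq"] += 1
            entry["positions"].append(position)
    return dict(index)


# Hypothetical sample documents, keyed by doc_id.
docs = {0: "Search engines index pages", 1: "Pages link to pages"}
index = build_index(docs)
print(index["pages"])
# {0: {'freq': 1, 'positions': [3]}, 1: {'freq': 2, 'positions': [0, 3]}}
```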

search.py

M2 part for search query

ranking.py

M3 part for ranking
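One common way to rank with tf-idf weights is to sum, for each document, the weights of the query terms it contains. The sketch below shows that idea only; it is an assumption about the general technique, not a description of what ranking.py does, and the function name `rank` and the simplified `{term: {doc_id: tf_idf}}` index are hypothetical.

```python
from collections import defaultdict


def rank(query_terms, index, top_k=10):
    """Return up to top_k (doc_id, score) pairs, best score first."""
    scores = defaultdict(float)
    for term in query_terms:
        # Terms missing from the index contribute nothing.
        for doc_id, tf_idf in index.get(term, {}).items():
            scores[doc_id] += tf_idf
    # Sort by descending score, then by doc_id so ties are deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:top_k]


# Hypothetical index: term -> {doc_id: tf_idf weight}.
index = {
    "search": {1: 0.5, 2: 0.25},
    "engine": {2: 0.5, 3: 0.125},
}
print(rank(["search", "engine"], index))
# [(2, 0.75), (1, 0.5), (3, 0.125)]
```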

posting.py

class Entry_Posting

Entry_Posting(doc_id,freq,tf_idf, positions)
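To make the constructor signature above concrete, here is a minimal stand-in for such a class. Only the class name and the four parameter names come from this README; the meaning given to each field in the comments is inferred from its name, and the real class in the repository may differ.

```python
class Entry_Posting:
    """Illustrative stand-in for the posting entry of one document."""

    def __init__(self, doc_id, freq, tf_idf, positions):
        self.doc_id = doc_id        # identifier of the document (assumed)
        self.freq = freq            # occurrences of the term in it (assumed)
        self.tf_idf = tf_idf        # tf-idf weight of the term (assumed)
        self.positions = positions  # token positions of the term (assumed)

    def __repr__(self):
        return "Entry_Posting({!r}, {!r}, {!r}, {!r})".format(
            self.doc_id, self.freq, self.tf_idf, self.positions)


# Made-up values, for illustration only.
entry = Entry_Posting(doc_id=7, freq=2, tf_idf=0.35, positions=[4, 19])
posting = {entry.doc_id: entry}  # same shape as posting = { doc_id : entry }
print(posting)
```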

helper.py

some helper functions for web_launch.py, indexer.py, search.py and ranking.py
some of these functions read the inverted index file (at a specific line), the doc_ids file, and the term_line_relation file

forms.py

query search form in WebUI

web_launch.py

main program for web launch using Flask
using HTML and CSS files in static & templates folders to build a webUI

You can find more specific function descriptions in each file. Check the output files after running to confirm their format if you need to read them again or use some of the functions in helper.py.

Output File Descriptions

Because the output files are binary, this section shows the data structure stored in each file.

output/doc_ids.bin

# a dictionary whose key is doc_id and whose value is doc_name
{ doc_id : doc_name }

output/index.bin

# Each line is a dictionary whose key is the term and whose value is its posting.
# Use line offset to read the posting of each term
# posting = { doc_id : entry }
{ term1 : posting1 }
{ term2 : posting2 }
{ term3 : posting3 }

output/strong_terms.bin

# a dictionary with key as strong terms (title, bold), value is a list of doc_ids
{term : [doc_id]}

output/anchor_terms.bin

# a dictionary with key as anchor terms, value is a list of doc_ids
{term : [doc_id]}

output/term_line_relationships.bin

# a dictionary whose key is the term and whose value is the line_offset of its posting in index.bin
{ term : line_offset}
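The point of storing an offset per term is that a query can jump straight to one line of index.bin instead of loading the whole index. The sketch below demonstrates the idea under two loudly stated assumptions: records are JSON-encoded, one per line, and the offset is a byte offset. The README does not say which serialization or offset unit the project uses, and `write_index` and `read_posting` are hypothetical names.

```python
import json
import os
import tempfile


def write_index(path, index):
    """Write one {term: posting} record per line; return {term: byte_offset}."""
    offsets = {}
    with open(path, "wb") as f:
        for term in sorted(index):
            offsets[term] = f.tell()  # where this term's line starts
            line = json.dumps({term: index[term]}) + "\n"
            f.write(line.encode("utf-8"))
    return offsets


def read_posting(path, offsets, term):
    """Seek to the term's line and decode only that record."""
    if term not in offsets:
        return {}
    with open(path, "rb") as f:
        f.seek(offsets[term])
        record = json.loads(f.readline().decode("utf-8"))
    return record[term]


# Made-up index; doc_ids are strings because JSON object keys are strings.
index = {"apple": {"1": 2}, "banana": {"2": 1, "3": 4}}
path = os.path.join(tempfile.mkdtemp(), "index.bin")
offsets = write_index(path, index)
print(read_posting(path, offsets, "banana"))
# {'2': 1, '3': 4}
```

With this layout, the memory needed per query depends on the size of one posting rather than on the size of the full index.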

output/partial_index/[0-N]

All the partial index files and their folder are deleted automatically after merging.

# Each line is a partial_posting, which is a dictionary
# whose key is the doc_id and whose value is the entry of that doc_id for a term
# partial_posting = { doc_id : entry }
{ partial_posting1 }
{ partial_posting2 }
{ partial_posting3 }
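Merging partial indexes generally means combining the partial postings of the same term into one full posting. The sketch below shows one way to do that with a k-way merge over term-sorted streams. It is an assumption-laden illustration: the README does not say how a partial file associates a term with each partial_posting, so the example pairs them explicitly as `(term, partial_posting)` tuples, and `merge_partials` is a hypothetical name.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def merge_partials(partials):
    """Merge term-sorted (term, partial_posting) streams into full postings."""
    # heapq.merge keeps the combined stream sorted by term without
    # loading every partial index into memory at once.
    merged = heapq.merge(*partials, key=itemgetter(0))
    for term, group in groupby(merged, key=itemgetter(0)):
        posting = {}
        for _, partial_posting in group:
            # Assumes each doc_id appears in only one partial index.
            posting.update(partial_posting)
        yield term, posting


# Made-up partial indexes, each already sorted by term.
partial_0 = [("apple", {1: 2}), ("cherry", {2: 1})]
partial_1 = [("apple", {5: 1}), ("banana", {4: 3})]
result = list(merge_partials([partial_0, partial_1]))
print(result)
# [('apple', {1: 2, 5: 1}), ('banana', {4: 3}), ('cherry', {2: 1})]
```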

vudh1/Search_Engine_Website
