A search engine, built with Django, where anyone can register a website; the engine then crawls the site and indexes its web pages.
Most search engines have to perform the following tasks:
- Crawling
- Scraping or Extracting Data
- Indexing
- PageRank
Python has many packages for crawling and scraping, such as BeautifulSoup and Scrapy. Here we use BeautifulSoup, which parses the HTML of each page we fetch, to crawl a website.
pip install beautifulsoup4
To crawl a website you need its URL, and then you have to send a request to that website.
Generate a request
First install the requests library to generate the request:
pip install requests
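Generate a request and crawl a webpage:

    import requests
    from bs4 import BeautifulSoup

    # Fetch the page and parse its HTML (the 'lxml' parser needs `pip install lxml`)
    source = requests.get('enter your url here').text
    soup = BeautifulSoup(source, 'lxml')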
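A single request fetches one page; crawling a whole site means repeating that step for every link that turns up. The loop below is a minimal sketch of that idea, not the project's actual crawler. `crawl`, `get_links` and `max_pages` are names made up for illustration, and `get_links` is assumed to be any callable that returns the URLs linked from a page (for example, one built on requests and BeautifulSoup).

```python
from collections import deque


def crawl(start_url, get_links, max_pages=50):
    """Visit pages breadth-first, starting from start_url.

    get_links is a hypothetical callable: it takes a URL and returns
    the URLs linked from that page.
    """
    seen = {start_url}          # URLs already queued, to avoid loops
    queue = deque([start_url])  # URLs waiting to be visited
    visited = []                # URLs in the order they were visited
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Usage with an in-memory link map standing in for real pages.
fake_site = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/b", "https://example.com/"],
    "https://example.com/b": [],
}
order = crawl("https://example.com/", lambda url: fake_site.get(url, []))
```

Keeping the `seen` set separate from the queue is what stops the crawler from revisiting pages that link back to each other.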
Whenever a page is crawled, a copy of the webpage is stored, and its URL, title and other information are saved in the database. Below is an example of extracting data:
    title = soup.title.text
    h2 = soup.find('h2')
    # finding link
    link = soup.find('a')
    # finding href attribute from link
    href = link['href']
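An `href` taken from a page is often relative (`/about`) or not a web link at all (`mailto:`), so it usually needs cleaning before it can be requested or stored. The helper below is a small sketch using only the standard library; `absolute_link` is a hypothetical name, not part of this project.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def absolute_link(page_url, href):
    """Resolve href against page_url; return None for non-web links."""
    # urljoin handles relative paths; urldefrag drops any '#fragment' part.
    absolute, _fragment = urldefrag(urljoin(page_url, href))
    if urlparse(absolute).scheme not in ("http", "https"):
        return None
    return absolute


# Usage
print(absolute_link("https://example.com/blog/", "post-1"))
print(absolute_link("https://example.com/blog/", "/about#team"))
print(absolute_link("https://example.com/blog/", "mailto:me@example.com"))
```

Dropping the fragment means `page#top` and `page#bottom` are treated as the same page, which avoids storing duplicates.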
Indexing means storing every word together with the URLs of the pages it appears on. You can think of it as a dictionary where the word is the key and the value is the URL. This is called backward indexing (more commonly known as an inverted index). If you want to know more about backward indexing and other kinds of search engine indexing, follow
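The dictionary picture can be made concrete with a few lines of plain Python. This is only an in-memory sketch of the idea, separate from the Django model used in this project; `build_index` and the `pages` dict are assumptions for illustration.

```python
from collections import defaultdict


def build_index(pages):
    """Map each word to the set of URLs whose text contains it.

    pages is a hypothetical dict of {url: page_text}.
    """
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


pages = {
    "https://example.com/a": "Django search engine",
    "https://example.com/b": "Python search tips",
}
index = build_index(pages)
# index["search"] holds both URLs; index["django"] holds only the first.
```

Looking up a search term is then a single dictionary access, which is why this structure suits search better than scanning every stored page.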
Below is the database model for backward indexing:
    from django.db import models

    class Backward_Index(models.Model):
        data = models.CharField(max_length=100)  # the indexed word
        urls = models.TextField()  # comma-separated URLs containing the word

        def __str__(self):
            return self.data
Code to perform backward indexing:
    # Perform backward indexing: record that every word appears at URL `name`.
    def backward_indexing(name, list_of_words):
        for word in list_of_words:
            indexes = Backward_Index.objects.filter(data=word)
            if indexes.exists():
                for index in indexes:
                    # Compare whole URLs; a substring test would wrongly
                    # skip 'a.com/p' when 'a.com/page' is already stored.
                    if name in index.urls.split(','):
                        continue
                    index.urls = index.urls + ',' + name
                    index.save()
            else:
                Backward_Index.objects.create(data=word, urls=name)
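`backward_indexing` expects a `list_of_words`, but the text above does not show how that list is produced. One plausible way, sketched here under that assumption, is to lowercase the page text and split it into alphanumeric words. `extract_words` and `MAX_WORD_LENGTH` are hypothetical names; the length limit mirrors `max_length=100` on `Backward_Index.data`.

```python
import re

MAX_WORD_LENGTH = 100  # mirrors max_length of Backward_Index.data


def extract_words(text):
    """Return the unique lowercase words in text, in first-seen order."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    # dict.fromkeys removes duplicates while keeping insertion order.
    unique = dict.fromkeys(w for w in words if len(w) <= MAX_WORD_LENGTH)
    return list(unique)


# Usage
print(extract_words("Django, django! Search-Engine 2024"))
```

With a parsed page, the call might then look like `backward_indexing(url, extract_words(soup.get_text()))`.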
The page rank algorithm is the heart of every search engine, because it decides which results are shown first. Many page ranking algorithms exist; here a point distribution algorithm is implemented.
To run this algorithm, build a graph of every URL in which the search keyword appears, and then distribute points across it. The graph is built using the networkx library.
Install networkx:
pip install networkx
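The text does not spell out how the points are distributed, so the following is only one possible reading: every page starts with the same number of points and repeatedly hands them out in equal shares to the pages it links to. A plain dict stands in for the networkx graph so the sketch has no dependencies; `distribute_points`, `rank`, `rounds` and `start_points` are hypothetical names.

```python
def distribute_points(graph, rounds=20, start_points=1.0):
    """Repeatedly pass each page's points to the pages it links to.

    graph is a hypothetical dict of {url: [linked urls]}; every linked
    URL must also appear as a key. Pages with no outgoing links keep
    their own points.
    """
    points = {url: start_points for url in graph}
    for _ in range(rounds):
        new_points = {url: 0.0 for url in graph}
        for url, links in graph.items():
            if not links:
                new_points[url] += points[url]
                continue
            share = points[url] / len(links)
            for target in links:
                new_points[target] += share
        points = new_points
    return points


def rank(graph):
    """Return URLs ordered from most to fewest points."""
    points = distribute_points(graph)
    return sorted(points, key=points.get, reverse=True)


# Usage: both 'a' and 'b' link to 'c', so 'c' collects their points.
graph = {"a": ["c"], "b": ["c"], "c": []}
print(rank(graph))
```

With networkx, the same structure could be held in a `networkx.DiGraph`, and the library's built-in `networkx.pagerank` function offers a damped variant of this kind of scoring.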