concurrent-web-crawler

Web crawler and indexer using parallel processing.

Three crawler designs with different execution models were analysed for this project. The best-performing version is a hybrid of the other two. The configurations are as follows:

Serial Crawler and Indexer - SCSI - serial_crawler.py
Concurrent Crawler and Indexer - CCCI - concurrent_crawler.py
Concurrent Crawler with Serial Indexer - CCSI - hybrid_crawler.py

The third version, "Concurrent Crawler with Serial Indexer" (CCSI), produced the best results when tested.
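The hybrid idea (concurrent crawling, serial indexing) can be sketched roughly as follows. This is a minimal illustration and not the code in hybrid_crawler.py: the use of a thread pool, the `fetch` callable, the `tokenize` helper, and the stand-in pages are all assumptions made for the example.

```python
import re
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def hybrid_crawl(urls, fetch, max_workers=8):
    """Fetch pages concurrently, then build the inverted index serially.

    `fetch` is a hypothetical callable that maps a URL to its page text.
    """
    # Concurrent stage: downloads are I/O-bound, so threads overlap the waiting.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        pages = list(pool.map(fetch, urls))  # map() keeps the input order

    # Serial stage: only this thread touches the index, so no locking is needed.
    index = defaultdict(set)
    for url, text in zip(urls, pages):
        for word in tokenize(text):
            index[word].add(url)
    return {word: sorted(found) for word, found in index.items()}


if __name__ == "__main__":
    # Stand-in pages (assumed content) so the sketch runs without network access.
    fake_pages = {
        "page_a": "A black hole bends light",
        "page_b": "Light is radiation",
    }
    print(hybrid_crawl(list(fake_pages), fake_pages.get))
```

One plausible reason this split performs well is that fetching spends most of its time waiting on the network, while index updates are cheap and would otherwise contend for a shared structure. The report in the repository remains the authority on the measured behaviour.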

The results above were obtained with the following start URL and base:

urlinput = https://en.wikipedia.org/wiki/Black_hole

base = https://en.wikipedia.org/wiki/
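One way a `base` value like this could be used is to keep the crawl inside a single site by discarding links that do not start with it. The sketch below assumes that interpretation; the `LinkCollector` class and `in_scope_links` function are hypothetical names, not taken from the repository.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin

urlinput = "https://en.wikipedia.org/wiki/Black_hole"
base = "https://en.wikipedia.org/wiki/"


class LinkCollector(HTMLParser):
    """Collect the raw href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def in_scope_links(html, page_url, base_prefix):
    """Return absolute, de-duplicated links that start with base_prefix."""
    parser = LinkCollector()
    parser.feed(html)
    links = []
    for href in parser.hrefs:
        # Resolve relative links against the page and drop any #fragment.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        if absolute.startswith(base_prefix) and absolute not in links:
            links.append(absolute)
    return links


if __name__ == "__main__":
    sample = '<a href="/wiki/Event_horizon">in</a> <a href="https://example.com/">out</a>'
    print(in_scope_links(sample, urlinput, base))
```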


parthnamdev/concurrent-web-crawler
