webCrawler

A web crawler CLI
Given a starting URL, the crawler scans that page for images, then follows every link on the page and scans those pages as well. Crawling stops once the maximum depth is reached: depth=3 means the crawler may go as deep as 3 pages away from the source URL (as set by the depth parameter), while depth=0 scans only the first page.
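
As a rough sketch of what such a depth-limited crawl looks like (illustrative only, not the code in crawler.py; it assumes the third-party requests and beautifulsoup4 packages):

```python
# Minimal sketch of a depth-limited image crawl (illustrative only,
# not the actual implementation in crawler.py).
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def crawl(start_url, max_depth):
    results, seen = [], set()
    frontier = [(start_url, 0)]  # BFS queue of (url, depth)
    while frontier:
        url, depth = frontier.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            page = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue  # skip unreachable pages
        soup = BeautifulSoup(page, "html.parser")
        for img in soup.find_all("img", src=True):
            results.append({"imageUrl": urljoin(url, img["src"]),
                            "sourceUrl": url,
                            "depth": depth})
        if depth < max_depth:  # follow links one level deeper
            for a in soup.find_all("a", href=True):
                frontier.append((urljoin(url, a["href"]), depth + 1))
    return results
```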

Results are saved to the results.json file in the following format:
{
  "results": [
    {
      "imageUrl": string,
      "sourceUrl": string,  // the page URL this image was found on
      "depth": number       // the depth of the page this image was found on
    }
  ]
}
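
For example, a run could produce a results.json like this (hypothetical values):

```json
{
  "results": [
    {
      "imageUrl": "https://example.com/logo.png",
      "sourceUrl": "https://example.com/",
      "depth": 0
    }
  ]
}
```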

Files:

crawler.py - The main file to run.
results.json - the JSON file to be filled with the images that were found; it is assumed to be empty before a run.

How to use the crawler:

Assuming both files are in the same directory, open a terminal in that directory and run the crawler:

python crawler.py "your_starting_node_url" depth

An example run would be:

python crawler.py "https://www.geeksforgeeks.org/" 1

In this run the crawler searches for images on https://www.geeksforgeeks.org/ (i.e. depth 0) and on its 'neighbor' pages, the pages it links to (i.e. depth 1), and stores them.
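
Since the crawler takes two positional arguments, its entry point presumably reads them from sys.argv. A minimal sketch of such an entry point, reusing the crawl() sketch above (the real crawler.py may differ):

```python
# Sketch of an assumed CLI entry point (illustrative; the real
# crawler.py may parse arguments or write output differently).
import json
import sys

if __name__ == "__main__":
    start_url = sys.argv[1]       # e.g. "https://www.geeksforgeeks.org/"
    max_depth = int(sys.argv[2])  # e.g. 1
    with open("results.json", "w") as f:
        json.dump({"results": crawl(start_url, max_depth)}, f, indent=2)
```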
