A web crawler CLI
Given a URL, the crawler scans the webpage for any images, then follows every link inside that page and scans those pages as well.
The crawling stops once the specified depth is reached: depth=3 means we can go as deep as 3 pages from the source URL (given by the depth parameter), and depth=0
means just the first page.
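
For reference, here is a minimal sketch of such a depth-limited crawl. It assumes the third-party requests and beautifulsoup4 packages and a breadth-first traversal; the actual logic in crawler.py may differ:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def crawl(start_url, max_depth):
    results, visited = [], set()
    queue = [(start_url, 0)]          # BFS frontier of (url, depth) pairs
    while queue:
        url, depth = queue.pop(0)
        if url in visited or depth > max_depth:
            continue
        visited.add(url)
        try:
            soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
        except requests.RequestException:
            continue                  # skip pages that fail to load
        for img in soup.find_all("img", src=True):
            results.append({"imageUrl": urljoin(url, img["src"]),
                            "sourceUrl": url,
                            "depth": depth})
        if depth < max_depth:         # only follow links while under the depth limit
            for a in soup.find_all("a", href=True):
                queue.append((urljoin(url, a["href"]), depth + 1))
    return results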
Results are saved to the results.json file in the following format:
{
  results: [
    {
      imageUrl: string,
      sourceUrl: string, // the page url this image was found on
      depth: number      // the depth of the source page this image was found on
    }
  ]
}
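
As a concrete example, a run at depth 1 might produce a file like the following (the URLs here are hypothetical):

{
  "results": [
    {
      "imageUrl": "https://example.com/logo.png",
      "sourceUrl": "https://example.com/",
      "depth": 0
    },
    {
      "imageUrl": "https://example.com/about/banner.jpg",
      "sourceUrl": "https://example.com/about",
      "depth": 1
    }
  ]
}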
crawler.py - The main file to run.
results.json - the JSON file to be filled with the images that were found (assumed to be empty before a run).
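
A minimal sketch of how the collected records could be written out to results.json (the exact code in crawler.py may differ):

import json

def save_results(results):
    # Overwrite results.json with the collected image records
    with open("results.json", "w") as f:
        json.dump({"results": results}, f, indent=2)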
Assuming both files are in the same directory,
open CMD in the crawler directory and run the crawler:
'python crawler.py "your_starting_url" depth'
An example run would be: 'python crawler.py "https://www.geeksforgeeks.org/" 1'
In this run we search for the images in https://www.geeksforgeeks.org/ (i.e. depth 0) and the images
of its 'neighbor' web pages (i.e. depth 1), and store them.
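
For illustration, the command-line entry point could look like the sketch below, reusing the crawl() and save_results() sketches from above (again, not necessarily the actual crawler.py code):

import sys

if __name__ == "__main__":
    start_url = sys.argv[1]       # "your_starting_url", quoted on the command line
    max_depth = int(sys.argv[2])  # the depth parameter
    save_results(crawl(start_url, max_depth))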