A super simple multi-threaded website URL crawler. Returns a Python list of all found URLs. It can be configured to return internal URLs, external URLs, or both.
- `pip install requests`
- `pip install beautifulsoup4`
- Super simple; two lines of code to get a list of URLs on a website.
- Multi-threaded (see the sketch after this list).
- Enable or disable logging.
- Can return internal URLs, external URLs, or both.
- Accepts an optional callback method for LIVE URL finds.
- Not much else.
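For a rough sense of what the multi-threading looks like under the hood, here is a minimal sketch built on the same requests and beautifulsoup4 dependencies. It is not the library's actual implementation: each page in the current frontier is fetched on its own thread, links are collected, and only links on the starting host are followed.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def find_links(page_url):
    """Return the absolute URL of every <a href> on page_url."""
    try:
        response = requests.get(page_url, timeout=10)
    except requests.RequestException:
        return set()
    soup = BeautifulSoup(response.text, "html.parser")
    return {urljoin(page_url, a["href"]) for a in soup.find_all("a", href=True)}

def crawl(start_url, workers=5):
    """Report every URL seen, but only follow links on the starting host."""
    host = urlparse(start_url).netloc
    seen = {start_url}
    frontier = [start_url]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while frontier:
            # Fetch the whole frontier in parallel, one page per thread.
            discovered = set()
            for links in pool.map(find_links, frontier):
                discovered |= links
            # Queue only new, same-host links for the next round.
            frontier = [u for u in discovered - seen
                        if urlparse(u).netloc == host]
            seen |= discovered
    return seen
```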
The following code sample scans "strongscot.com" using 5 threads, with logging disabled.
```python
from SiteUrlCrawler import SiteUrlCrawler

crawler = SiteUrlCrawler("https://strongscot.com", 5, False)

# Print the found URLs
for url in crawler.crawl(SiteUrlCrawler.Mode.ALL):
    print("Found: " + url)
```
Will output something similar to this:
```
Found: https://strongscot.com/
Found: https://strongscot.com/projects/
Found: https://strongscot.com/cv/
Found: https://strongscot.com/contact/
Found: https://github.com/strongscot
Found: https://strongscot.com/blog/20/03/03/simple-site-crawler.html
Found: https://strongscot.com/blog/20/02/19/birthday.html
Found: https://strongscot.com/blog/19/12/09/new-site.html
Found: https://strongscot.com/blog/19/09/09/body-goals.html
Found: https://strongscot.com/blog/19/09/09/cool-dropdown-ui.html
Found: https://strongscot.com/blog/19/09/09/flying-in-a-flight-machine.html
Found: https://github.com/strongscot/simple-python-url-crawler
```
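The `crawl()` method takes one of three modes. A hypothetical sketch of how such an enum could be declared; the library's actual definition may differ:

```python
from enum import Enum

class Mode(Enum):
    ALL = 0       # every URL found, internal and external
    INTERNAL = 1  # only URLs on the starting host
    EXTERNAL = 2  # only URLs pointing away from the starting host
```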
To return only internal URLs:

```python
crawler = SiteUrlCrawler("https://strongscot.com")

# Print the found URLs
for url in crawler.crawl(SiteUrlCrawler.Mode.INTERNAL):
    print("Found: " + url)
```
To return only external URLs:

```python
crawler = SiteUrlCrawler("https://strongscot.com")

# Print the found URLs
for url in crawler.crawl(SiteUrlCrawler.Mode.EXTERNAL):
    print("Found: " + url)
```
Will output:
```
Found: https://github.com/strongscot
Found: https://twitter.com/thestrongscot
```
If you wish to receive each URL as it is found, rather than all at once in a list at the end, you can pass an optional callback to the `crawl()` method. For example:
```python
crawler = SiteUrlCrawler("https://strongscot.com")

def callback(url):
    print("Found: " + url)

# Get ALL urls and print them as they are found
crawler.crawl(SiteUrlCrawler.Mode.ALL, callback)
```
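Because the callback fires as each URL is discovered, it also works for streaming results somewhere other than stdout. A small sketch; the file name is just an example:

```python
from SiteUrlCrawler import SiteUrlCrawler

# Stream every discovered URL to a file as it is found.
# "found_urls.txt" is a hypothetical name, not part of the library.
crawler = SiteUrlCrawler("https://strongscot.com")

with open("found_urls.txt", "w") as output:
    crawler.crawl(SiteUrlCrawler.Mode.ALL, lambda url: output.write(url + "\n"))
```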
Want to turn it into a small Google Bot? Comment out lines 134-136 in SiteUrlCrawler.py and it will trawl external links too.
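Those lines presumably apply a same-host guard that keeps the crawl on the starting site. A hypothetical sketch of that kind of check; the file's actual code may differ:

```python
from urllib.parse import urlparse

# Hypothetical sketch of the kind of same-host guard lines 134-136 likely
# apply; the file's actual code may differ. Inside the crawl loop, a URL
# failing this check would be skipped, keeping the crawl on one site.
def is_internal(url, base_url):
    """True when url lives on the same host as base_url."""
    return urlparse(url).netloc == urlparse(base_url).netloc
```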
@strongscot
MIT