## Description

This is a web scraping tool that allows you to extract links and data from a website. The tool provides the following scraping methods:
- Scrape subdomain & related links
- Scrape pages links
- Scrape robots.txt
- Scrape embedded links
- Extract metadata
- Analyze content
- Check links
- Performance metrics
- Handle cookies
- Parse sitemap
- Detect language
- Get JavaScript content
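To illustrate what methods like "Scrape subdomain & related links" and "Scrape embedded links" involve, here is a minimal sketch of link extraction and subdomain classification. It uses only the Python standard library (`html.parser`, `urllib.parse`) so it runs without dependencies; ScrapEZ itself uses `beautifulsoup4`, and the function names here are illustrative, not the tool's actual API.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    """Collect href targets from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(base_url, html):
    """Return absolute URLs for every anchor in the page."""
    parser = LinkExtractor()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.links]

def is_subdomain_link(base_url, link):
    """Rough check: the link's host ends with the base host but is not identical.
    (A real implementation would compare registered domains, not raw suffixes.)"""
    base_host = urlparse(base_url).hostname or ""
    link_host = urlparse(link).hostname or ""
    return link_host.endswith(base_host) and link_host != base_host

page = '<a href="/about">About</a><a href="https://blog.example.com/post">Blog</a>'
links = extract_links("https://example.com", page)
# links now holds the absolute form of both anchors
```

The same resolved-URL list can then be partitioned into page links, subdomain links, and external links before being written out.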
To use the ScrapEZ web scraping tool, follow these steps:
- Clone the repository: run `git clone https://github.com/jakk-er/ScrapEZ.git` to copy the repository to your local machine.
- Install dependencies: run `pip install -r requirements.txt` to install the required packages, including `requests` and `beautifulsoup4` (`urllib.parse` ships with the Python standard library and needs no installation).
- Run the script: run `python scrapez.py`.

Note: Make sure you have Python installed on your system, along with the required dependencies. If you're using a virtual environment, activate it before running the script.
Run the script and enter the website URL when prompted. Choose which scraping methods to use by entering the corresponding numbers (separated by commas). The tool will extract the requested data and print it to the console.
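The comma-separated selection step can be sketched as a small dispatch table. This is a hypothetical reconstruction, not the actual `scrapez.py` menu code; the method names and numbering here only mirror the first four methods listed above.

```python
# Hypothetical menu: map the numbers the user types to method names.
METHODS = {
    "1": "subdomain_links",
    "2": "pages_links",
    "3": "robots_txt",
    "4": "embedded_links",
}

def parse_selection(raw):
    """Turn input like '1,2' into chosen method names, skipping blanks and unknown numbers."""
    chosen = []
    for token in raw.split(","):
        token = token.strip()
        if token in METHODS:
            chosen.append(METHODS[token])
    return chosen
```

Each chosen name would then be dispatched to the corresponding scraping routine in turn.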
Here are some example usage scenarios:
- To scrape subdomain & related links and pages links from `https://example.com`, enter `https://example.com` (or just `example.com`), press Enter, and then choose `1,2`.
- To scrape robots.txt and embedded links from `https://www.example.net`, enter `https://www.example.net` (or just `www.example.net`), press Enter, and then choose `3,4`.
- To scrape all available data from `https://subdomain.example.io`, enter `https://subdomain.example.io` (or just `subdomain.example.io`), press Enter, and then choose `1,2,3,4`.
The script saves data into Markdown files within the `Results` directory, with each file named according to the data type:
- Subdomain Links: `scraped_data_subdomain_links.md`
- Page Links: `scraped_data_pages_links.md`
- Robots.txt: `scraped_data_robots_txt.md`
- Embedded Links: `scraped_data_embedded_links.md`
- Metadata: `scraped_data_metadata.md`
- Broken Links: `scraped_data_broken_links.md`
- Performance Metrics: `scraped_data_performance_metrics.md`
- Cookies: `scraped_data_cookies.md`
- Sitemap URLs: `scraped_data_sitemap_urls.md`
- Detected Language: `scraped_data_language.md`
- JavaScript Content: `[sanitized_url]-js-content.html`
- Content Analysis: `url_analysis.md`
This work is licensed under a Creative Commons Attribution 4.0 International License. You must give appropriate credit, provide a link to the license, and indicate if changes were made. Details: https://creativecommons.org/licenses/by/4.0/
This software is intended for educational purposes only. You agree to use the software solely for educational, research, or academic purposes, and not for any commercial or malicious activities.
You acknowledge that you are solely responsible for any misuse of the software, including but not limited to using it to target websites or systems without their permission. The authors and copyright holders shall not be liable for any damages or claims arising from such misuse.
You are permitted to modify the software for your own educational purposes, but you agree not to modify the software in a way that would compromise its integrity or security. You also agree not to remove or alter any copyright notices, trademarks, or other proprietary rights notices from the software.
When redistributing or sharing modified versions of the software, you must provide appropriate attribution, indicate if changes were made, and include a link to the original license.
This software requires the following dependencies:
- `requests`
- `beautifulsoup4`
- `langdetect`
- `playwright`
- `selenium`
- `webdriver-manager`
Author: jakk-er

Version: 2.0