web_crawler is an asynchronous, gevent-based link crawler that maps all local links reachable from, and constrained to, the input webpage URL.
Make sure you run the following command so the crawler executes correctly and saves the local link relations JSON file to the /data folder:
python3 crawl_website.py -l <url> -s True
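For reference, the block below is a minimal sketch of this kind of gevent crawl, not the project's actual implementation. The fetch_links/crawl structure, the same-host rule for deciding which links count as "local", and the data/local_link_relations.json output path are assumptions made for illustration only.

```python
# A minimal sketch (not the project's actual implementation) of the approach
# described above: gevent greenlets fetch pages concurrently, BeautifulSoup
# extracts anchors, and only links on the same host as the start URL are kept.
from gevent import monkey

monkey.patch_all()  # patch the standard library before importing requests

import json
import os
from urllib.parse import urljoin, urlparse

import gevent
import requests
from bs4 import BeautifulSoup


def fetch_links(url, host):
    """Fetch one page and return (url, sorted local links found on it)."""
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
    except requests.RequestException:
        return url, []
    soup = BeautifulSoup(response.text, "html.parser")
    links = {
        urljoin(url, anchor["href"])
        for anchor in soup.find_all("a", href=True)
        if urlparse(urljoin(url, anchor["href"])).netloc == host
    }
    return url, sorted(links)


def crawl(start_url, max_pages=10):
    """Breadth-first crawl constrained to the start URL's host."""
    host = urlparse(start_url).netloc
    seen, frontier, relations = {start_url}, [start_url], {}
    while frontier and len(relations) < max_pages:
        batch = frontier[: max_pages - len(relations)]
        frontier = frontier[len(batch):]
        # Spawn one greenlet per page and wait for the whole batch to finish.
        jobs = [gevent.spawn(fetch_links, url, host) for url in batch]
        gevent.joinall(jobs)
        for url, links in (job.value for job in jobs):
            relations[url] = links
            frontier.extend(link for link in links if link not in seen)
            seen.update(links)
    return relations


if __name__ == "__main__":
    # Hypothetical output path; the project writes its JSON under the data folder.
    os.makedirs("data", exist_ok=True)
    with open("data/local_link_relations.json", "w") as f:
        json.dump(crawl("https://webscrapethissite.org"), f, indent=2)
```

Joining the whole batch of greenlets before reading their results is one way to avoid consuming results while greenlets are still being spawned.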
Dependencies (included in requirements.txt)
- bs4
- requests
- gevent
Python Version Tested
- 3.7.10
Setup (Windows):
python -m venv venv
venv\Scripts\activate
pip install -r requirements.txt

Setup (Linux/macOS):
python3 -m venv venv
source venv/bin/activate
pip3 install -r requirements.txt
Run the crawler:
python3 crawl_website.py
python3 crawl_website.py -l https://webscrapethissite.org -n 10
Run the tests:
pytest test/
Or, for a more detailed view:
pytest -v test/
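As an illustration of the kind of test these commands would pick up, the example below is hypothetical, not a file from the repository; it assumes crawl_website exposes an is_local_link(link, base_url) helper that returns a boolean.

```python
# Hypothetical test for illustration only; the real tests under test/ may differ.
# Assumes crawl_website exposes is_local_link(link, base_url) returning a bool.
import pytest

from crawl_website import is_local_link


@pytest.mark.parametrize(
    "link,expected",
    [
        ("https://webscrapethissite.org/pages/", True),
        ("https://example.com/other", False),
        ("mailto:someone@example.com", False),
    ],
)
def test_is_local_link(link, expected):
    assert is_local_link(link, "https://webscrapethissite.org") is expected
```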
Project's Kanban Board
- Create a Kanban board to structure project management and break the work down into bite-sized, actionable tickets.
- To work with others in a team, I would have held a meeting to discuss the tasks required to complete the project.
- Assign each person tickets that can be worked on simultaneously without conflict, and set deadlines.
- Create a system of accountability to review each other's code via the Kanban board columns and to resolve any blocked tasks.
- Schedule team meetings that coincide with the deadlines set at important milestones.
- Create branches in version control so that multiple approaches to implementing or fixing a feature can be explored.
- Peer-review branches to decide what code goes into the main branch and into deployment.
- Create tests for use during development to ensure correct functionality.
- Create a set of held-back tests that weren't used during development to give the deployed code a final check, ideally written by a team member who didn't implement the functionality.
Known Issues
- The web crawler is unable to handle erroneous URLs whose responses contain no body.
- HTTP GET requests can fail with authorisation errors, partly due to the request headers being sent (see the sketch after this list).
- Asynchronous gevent threads can leave the Crawler's internal queue empty while greenlets are still being spawned.
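A possible mitigation for the header-related failures, sketched under the assumption that the affected sites reject requests lacking a browser-like User-Agent; the exact headers needed depend on the target site.

```python
# Sketch of a possible fix for the unauthorised GET failures: send a
# browser-like User-Agent and surface 401/403 responses instead of parsing them.
# The header values are assumptions; adjust them for the target site.
import requests

HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; web_crawler/0.1)"}


def fetch(url):
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()  # raises HTTPError for 401/403 and other errors
    return response.text
```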