This project is a multiprocess email address scraper for the De La Salle University website staff directory.
This is the major course output in an advanced operating systems class for master's students under Mr. Gregory G. Cu of the Department of Software Technology, De La Salle University. The task is to create an email address scraper that employs parallel programming techniques. The complete project specifications can be found in the document Project Specifications.pdf.
- Technical Paper: Technical Paper.pdf
- Video Demonstration: https://www.youtube.com/watch?v=zYA5TIbF9UE
Combining both functional and data decomposition, our proposed approach models the scraping task as a multiple-producer, multiple-consumer problem (a minimal sketch follows the list below):
- The set of personnel IDs in the staff directory is divided by department, and multiple producers are mapped to different department directories. Each producer retrieves the personnel IDs from its assigned department directory and stores them in a synchronized queue.
- Concurrently, the IDs are dequeued by consumer subprocesses, which use them to visit the staff members' individual web pages, scrape pertinent information (names, email addresses, and departments) from there, and store these details in another queue.
- A dedicated subprocess gets the details from this queue and writes them to the output file.
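The sketch below illustrates this pipeline using Python's `multiprocessing` module. It is a simplified illustration, not the project's actual code: `fetch_ids()` and `scrape_page()` are hypothetical stubs standing in for the real Selenium-based scraping logic, and the department names and process counts are placeholders.

```python
# Minimal sketch of the multiple-producer, multiple-consumer pipeline.
# fetch_ids() and scrape_page() are hypothetical stubs, not the real scraper.
from multiprocessing import Process, Queue

SENTINEL = None  # tells a consumer (or the writer) that no more work is coming

def fetch_ids(department):
    # Stub: the real producer scrapes personnel IDs from a department directory.
    return [f"{department}-{i}" for i in range(5)]

def scrape_page(personnel_id):
    # Stub: the real consumer visits the staff member's page with Selenium.
    return (f"Name {personnel_id}", f"{personnel_id}@example.com", "Department")

def producer(department, id_queue):
    # Each producer is mapped to one department directory.
    for personnel_id in fetch_ids(department):
        id_queue.put(personnel_id)

def consumer(id_queue, detail_queue):
    # Consumers dequeue IDs and enqueue the scraped details.
    while True:
        personnel_id = id_queue.get()
        if personnel_id is SENTINEL:
            break
        detail_queue.put(scrape_page(personnel_id))

def writer(detail_queue, path):
    # A single dedicated subprocess drains the detail queue into the output file.
    with open(path, "w") as f:
        while True:
            details = detail_queue.get()
            if details is SENTINEL:
                break
            f.write(",".join(details) + "\n")

if __name__ == "__main__":
    id_queue, detail_queue = Queue(), Queue()
    departments = ["dept-a", "dept-b"]  # placeholder department directories
    producers = [Process(target=producer, args=(d, id_queue)) for d in departments]
    consumers = [Process(target=consumer, args=(id_queue, detail_queue)) for _ in range(3)]
    out = Process(target=writer, args=(detail_queue, "Scraped_Emails.csv"))

    for p in producers + consumers + [out]:
        p.start()
    for p in producers:
        p.join()
    for _ in consumers:  # one sentinel per consumer, sent once all producers finish
        id_queue.put(SENTINEL)
    for c in consumers:
        c.join()
    detail_queue.put(SENTINEL)  # stop the writer after all consumers finish
    out.join()
```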
Running our proposed approach with five threads achieves a 7.22× speedup over serial execution, a superlinear result since it exceeds the ideal five-fold gain. Further experiments show that it achieves better scalability and performance than baseline parallel programming approaches that scrape from the root directory.
- Create a copy of this repository:
  - If git is installed, type the following command on the terminal:
    ```
    git clone https://github.com/memgonzales/parallel-email-scraper
    ```
  - If git is not installed, click the green `Code` button near the top right of the repository and choose `Download ZIP`. Once the zipped folder has been downloaded, extract its contents.
- Install Google Chrome. It is recommended to retain the default installation directory.
- Install the necessary dependencies. All the dependencies can be installed via `pip` (a sample command is given after this list).
- Run the following command on the terminal:
  ```
  python scraper.py
  ```
- The following output files will be produced once the program is finished running:
  - `Scraped_Emails.csv` - A text file containing the scraped details (names, email addresses, and departments)
  - `Website_Statistics.txt` - A text file containing the number of pages scraped, the number of email addresses found, and the URLs scraped
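Assuming `pip` targets the same Python environment used to run the scraper, the two third-party dependencies listed at the end of this README can be installed with a command along these lines (pinning the exact versions is optional):

```
pip install selenium==4.7.2 webdriver-manager==3.8.5
```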
Sample screenshots of the running program and output files are also provided in this repository.
This project was built using Python 3.8 with the following libraries and modules:
| Libraries/Modules | Description | License |
| --- | --- | --- |
| Selenium 4.7.2 | Provides functions for enabling web browser automation | Apache License 2.0 |
| Webdriver Manager 3.8.5 | Simplifies management of binary drivers for different browsers | Apache License 2.0 |
| `multiprocessing` | Offers both local and remote concurrency, effectively side-stepping the Global Interpreter Lock | Python Software Foundation License |
| `time` | Provides various time-related functions | Python Software Foundation License |
The descriptions are taken from their respective websites.
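For reference, the two third-party libraries are typically combined as in the sketch below. This is a generic usage example, not an excerpt from scraper.py; the URL is a placeholder:

```python
# Generic Selenium 4 + webdriver-manager usage, not taken from scraper.py
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# webdriver-manager downloads a ChromeDriver binary that matches the locally
# installed Chrome, which is one reason the default installation directory
# is recommended above.
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get("https://www.dlsu.edu.ph/")  # placeholder URL
print(driver.title)
driver.quit()
```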
- Mark Edward M. Gonzales
  - mark_gonzales@dlsu.edu.ph
  - gonzales.markedward@gmail.com
- Hans Oswald A. Ibrahim
  - hans_oswald_ibrahim@dlsu.edu.ph
  - hans.ibrahim2001@gmail.com