This project is a multiprocess email address scraper for the De La Salle University website staff directory.
This is the major course output in an advanced operating systems class for master's students under Mr. Gregory G. Cu of the Department of Software Technology, De La Salle University. The task is to create an email address scraper that employs parallel programming techniques. The complete project specifications can be found in the document Project Specifications.pdf.
- Technical Paper: Technical Paper.pdf
- Video Demonstration: https://www.youtube.com/watch?v=zYA5TIbF9UE
Combining both functional and data decomposition, our proposed approach models the scraping task as a multiple-producer, multiple-consumer problem (a minimal sketch follows the list below):
- The set of personnel IDs in the staff directory is divided by department, and multiple producers are mapped to different department directories. Each producer retrieves the personnel IDs from its assigned department directory and stores them in a synchronized queue.
- Concurrently, the IDs are dequeued by consumer subprocesses, which use them to visit the staff members' individual web pages, scrape pertinent information (names, email addresses, and departments) from there, and store these details in another queue.
- A dedicated subprocess gets the details from this queue and writes them to the output file.
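The sketch below illustrates this pipeline using Python's `multiprocessing` module. It is a simplified illustration, not the project's actual code: `fetch_ids()` and `scrape_page()` are hypothetical stubs standing in for the real Selenium-based scraping logic, and the department names and process counts are placeholders.

```python
# Minimal sketch of the multiple-producer, multiple-consumer pipeline.
# fetch_ids() and scrape_page() are hypothetical stubs, not the real scraper.
from multiprocessing import Process, Queue

SENTINEL = None  # tells a consumer (or the writer) that no more work is coming

def fetch_ids(department):
    # Stub: the real producer scrapes personnel IDs from a department directory.
    return [f"{department}-{i}" for i in range(5)]

def scrape_page(personnel_id):
    # Stub: the real consumer visits the staff member's page with Selenium.
    return (f"Name {personnel_id}", f"{personnel_id}@example.com", "Department")

def producer(department, id_queue):
    # Each producer is mapped to one department directory.
    for personnel_id in fetch_ids(department):
        id_queue.put(personnel_id)

def consumer(id_queue, detail_queue):
    # Consumers dequeue IDs and enqueue the scraped details.
    while True:
        personnel_id = id_queue.get()
        if personnel_id is SENTINEL:
            break
        detail_queue.put(scrape_page(personnel_id))

def writer(detail_queue, path):
    # A single dedicated subprocess drains the detail queue into the output file.
    with open(path, "w") as f:
        while True:
            details = detail_queue.get()
            if details is SENTINEL:
                break
            f.write(",".join(details) + "\n")

if __name__ == "__main__":
    id_queue, detail_queue = Queue(), Queue()
    departments = ["dept-a", "dept-b"]  # placeholder department directories
    producers = [Process(target=producer, args=(d, id_queue)) for d in departments]
    consumers = [Process(target=consumer, args=(id_queue, detail_queue)) for _ in range(3)]
    out = Process(target=writer, args=(detail_queue, "Scraped_Emails.csv"))

    for p in producers + consumers + [out]:
        p.start()
    for p in producers:
        p.join()
    for _ in consumers:  # one sentinel per consumer, sent once all producers finish
        id_queue.put(SENTINEL)
    for c in consumers:
        c.join()
    detail_queue.put(SENTINEL)  # stop the writer after all consumers finish
    out.join()
```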
Running our proposed approach with five threads achieves a 7.22× speedup over serial execution, a superlinear result since it exceeds the ideal five-fold gain. Further experiments show that it achieves better scalability and performance than baseline parallel programming approaches that scrape from the root directory.
- Create a copy of this repository:
  - If git is installed, type the following command on the terminal:
    ```
    git clone https://github.com/memgonzales/parallel-email-scraper
    ```
  - If git is not installed, click the green `Code` button near the top right of the repository and choose `Download ZIP`. Once the zipped folder has been downloaded, extract its contents.
- Install Google Chrome. It is recommended to retain the default installation directory.
- Install the necessary dependencies. All the dependencies can be installed via `pip` (a sample command is given after this list).
- Run the following command on the terminal:
  ```
  python scraper.py
  ```
- The following output files will be produced once the program is finished running:
  - `Scraped_Emails.csv` - A text file containing the scraped details (names, email addresses, and departments)
  - `Website_Statistics.txt` - A text file containing the number of pages scraped, the number of email addresses found, and the URLs scraped
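Assuming `pip` targets the same Python environment used to run the scraper, the two third-party dependencies listed at the end of this README can be installed with a command along these lines (pinning the exact versions is optional):

```
pip install selenium==4.7.2 webdriver-manager==3.8.5
```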
Sample screenshots of the running program and output files are also provided in this repository.
This project was built using Python 3.8 with the following libraries and modules:
| Libraries/Modules | Description | License |
| --- | --- | --- |
| Selenium 4.7.2 | Provides functions for enabling web browser automation | Apache License 2.0 |
| Webdriver Manager 3.8.5 | Simplifies management of binary drivers for different browsers | Apache License 2.0 |
| `multiprocessing` | Offers both local and remote concurrency, effectively side-stepping the Global Interpreter Lock | Python Software Foundation License |
| `time` | Provides various time-related functions | Python Software Foundation License |
The descriptions are taken from their respective websites.
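For reference, the two third-party libraries are typically combined as in the sketch below. This is a generic usage example, not an excerpt from scraper.py; the URL is a placeholder:

```python
# Generic Selenium 4 + webdriver-manager usage, not taken from scraper.py
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# webdriver-manager downloads a ChromeDriver binary that matches the locally
# installed Chrome, which is one reason the default installation directory
# is recommended above.
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get("https://www.dlsu.edu.ph/")  # placeholder URL
print(driver.title)
driver.quit()
```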
- Mark Edward M. Gonzales
  - mark_gonzales@dlsu.edu.ph
  - gonzales.markedward@gmail.com
- Hans Oswald A. Ibrahim
  - hans_oswald_ibrahim@dlsu.edu.ph
  - hans.ibrahim2001@gmail.com