Comments from the original author: "A simple HTML Lexer/Parser to wrap my head around lexers and parsers, the foundations of a compiler/interpreter. Since I love web scraping, I thought I'd give it a try!"
Munchlex is a multi-threaded web scraper written in C that revolves around the idea of a lexer/parser. The project originated from the need to improve the performance of a Python-based web scraper: Python-based scrapers, though effective, were sequential in nature, meaning each page was parsed line by line, which was time-consuming. Hence Munchlex, and the idea of a multi-threaded web scraper, was born.
The scraper is designed to take advantage of concurrent execution by utilizing multiple threads, a departure from traditional sequential web scrapers written in languages like Python. The use of multiple threads improves the speed and efficiency of the scraper, making it capable of handling large-scale data extraction and processing.
- Multi-threaded: harnesses the parallel processing capabilities of multiple threads to scrape data more efficiently and reduce the time taken to scrape it
- Fast: files are lexed and parsed concurrently rather than one after another, so total scraping time drops as more threads are used
- Performance optimization: addresses the limitations of sequential web scrapers written in interpreted languages, ensuring a faster and more efficient data extraction process
- Scalability: thanks to multi-threading, the scraper is capable of handling large-scale data extraction and processing
This section contains the steps you can follow to get your own copy of Munchlex.
- Firstly, fork this repository using the Fork button in the top-right corner of the page to get your own copy of the repository. This ensures that you can actively experiment with the web scraper's code.
- Secondly, once you have made the fork, clone the repository to your local machine using the following command:

  ```
  git clone <paste URL of the forked repo here>
  ```
- Thirdly, once you have cloned the repository, navigate into the repository's directory and run the following command to compile the code:

  ```
  make all
  ```
This compiles and links the project to create the executable file. You can then enter the test directory and run the shell script there to test the web scraper.
- Note: the web scraper has not yet been tested or made ready for Windows systems natively. The use of a Linux-based system is recommended.
- Note: to run the scraper on a Windows system, you can use the Windows Subsystem for Linux (WSL2).
The web scraper works by using multiple threads to scrape data from web pages. The function `munchLex` processes each line of the input file, tokenizes it, and then constructs a tree of tokens that represents the document structure; the resulting tree is also printed to the log file. In effect, the function performs lexical analysis of the page.
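To make the token tree concrete, here is a minimal sketch of how such a structure could be represented in C. The `TokenType` values, field names, and `node_new`/`node_add_child` helpers are illustrative assumptions, not Munchlex's actual definitions (error handling is also omitted for brevity):

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical token kinds an HTML lexer might emit. */
typedef enum { TOKEN_OPEN_TAG, TOKEN_CLOSE_TAG, TOKEN_TEXT } TokenType;

/* One node of the document tree: a token plus its children. */
typedef struct Node {
    TokenType type;          /* tag or text */
    char *value;             /* tag name or text content */
    struct Node **children;  /* child nodes, in document order */
    size_t n_children;
} Node;

Node *node_new(TokenType type, const char *value) {
    Node *n = calloc(1, sizeof(Node));
    n->type = type;
    n->value = strdup(value);
    return n;
}

void node_add_child(Node *parent, Node *child) {
    parent->children = realloc(parent->children,
                               (parent->n_children + 1) * sizeof(Node *));
    parent->children[parent->n_children++] = child;
}
```

Nesting open tags as parents and text as leaves in a structure like this is what lets the tree mirror the hierarchical layout of the HTML document.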
- When the user begins execution, the main driver program `main.c` orchestrates the scraping process using a thread pool.
- `main` parses the command-line arguments to gather the number of threads, the files to scrape, and optional flags such as "daemon" mode (still in development).
- A thread pool with the specified number of threads is then created to parallelize the process (see the sketch after this list).
- For each file, a thread from the pool is assigned to execute the `munchLex` function. This function is the main parser and lexer of the scraper: it performs lexical analysis of the page and tokenizes the input file, ending with the construction of a tree structure that represents the document.
- The `lexer` function then identifies tokens such as tags and text, and a tree structure is generated to represent the hierarchical structure of the document. The tree structure is also stored in a log file.
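For readers new to the pattern, the per-file dispatch described above might look roughly like this with plain pthreads. This is a simplified sketch, not Munchlex's actual pool: the `munch_lex` entry point and `scrape_all` driver are hypothetical names, and a real pool would reuse a fixed set of threads fed by a work queue rather than spawning one thread per file:

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical per-file entry point, standing in for Munchlex's munchLex. */
extern void munch_lex(const char *path);

static void *worker(void *arg) {
    munch_lex((const char *)arg);  /* lex/parse one file on this thread */
    return NULL;
}

/* Run every file on its own thread, then wait for all of them to finish. */
int scrape_all(char *files[], int n_files) {
    pthread_t *tids = malloc((size_t)n_files * sizeof *tids);
    if (!tids)
        return -1;
    for (int i = 0; i < n_files; i++)
        pthread_create(&tids[i], NULL, worker, files[i]);
    for (int i = 0; i < n_files; i++)
        pthread_join(tids[i], NULL);
    free(tids);
    return 0;
}
```

A snippet like this would be compiled with `-pthread`; the join loop is what guarantees every file has been fully lexed before the program moves on.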
- Note: it is recommended to run the scraper yourself to understand its workings in more detail, including how the tree structure is generated and how the lexical analysis is performed.
- The scraper is still in development and there are many features yet to be implemented.
- Daemon mode is still in development and will be implemented in the future.
- The scraper is yet to be tested on Windows systems; this will be done in the future to ensure a smooth experience for Windows users.
In short, the multi-threaded design aims to make Munchlex fast and efficient, capable of handling large-scale data extraction and processing.