A little web app written entirely in Go, aiming to illustrate the potential of concurrent programming for web scraping
This is a little web application written in the Go programming language. The user can input the URL of any website through the Front End interface, and the app returns a short analysis containing the following pieces of information:
- The HTML version used for writing the provided web page;
- The title of the page;
- The number of headings of the webpage, sorted by level;
- The number of links contained in the webpage, sorted by type: internal or external;
- The number of inaccessible links.
The application also checks for the presence of any login forms in the web page.
As is common in web scraping tasks (even limited-scope ones like this one), requests may fail for some web pages for a number of reasons. In such cases, the application reports the error it encountered and attempts to provide a meaningful, if short, explanation of it.
Fig. 1: The Web Page Analyzer in action
- I decided to keep the application as simple as possible, using Gin as a web framework and implementing a pleasant, yet minimal Front End;
- The backbone of the application is the `Analyzer` interface, which contains an `HTMLAnalyzer` and a `LinkChecker`, dealing with the two main tasks to be carried out, respectively.
  - The `HTMLAnalyzer` interface provides methods for going through the tokenized HTML and singling out the elements that the app aims at retrieving, based on their tags. This includes a list of links, which are then passed to the `LinkChecker`;
  - The `LinkChecker` interface is responsible for checking the availability of the links retrieved by the `HTMLAnalyzer`. It first checks that links are formally correct, and then, whenever applicable, it performs GET requests in parallel, in order to minimize the execution time, which may be significant for long HTML documents.
- At the end of the process, the results of the analysis (or the error message) are collected in a Go struct named `AnalysisResult` and passed to the server. They are then integrated into the HTML templates and displayed by the Front End.
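The validate-then-fetch-in-parallel behaviour described for the `LinkChecker` can be illustrated with a minimal sketch. It is not the repository's code: the method name `CountInaccessible`, the `concurrentChecker` type, the `newConcurrentChecker` constructor, the 5-second timeout, and the rule that any status of 400 or above counts as inaccessible are all assumptions made for illustration. The demo in `main` uses a local `httptest` server so that it runs without network access.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/url"
	"sync"
	"time"
)

// LinkChecker mirrors the role described in this README. The method name and
// signature are assumptions, not the repository's actual API.
type LinkChecker interface {
	CountInaccessible(links []string) int
}

// concurrentChecker is a hypothetical implementation of LinkChecker.
type concurrentChecker struct {
	client *http.Client
}

// newConcurrentChecker builds a checker with an assumed 5-second timeout.
func newConcurrentChecker() concurrentChecker {
	return concurrentChecker{client: &http.Client{Timeout: 5 * time.Second}}
}

// wellFormed reports whether link is an absolute http(s) URL with a host.
func wellFormed(link string) bool {
	u, err := url.Parse(link)
	return err == nil && (u.Scheme == "http" || u.Scheme == "https") && u.Host != ""
}

// CountInaccessible first validates each link; well-formed links are then
// fetched concurrently, one goroutine per link.
func (c concurrentChecker) CountInaccessible(links []string) int {
	var (
		wg        sync.WaitGroup
		mu        sync.Mutex
		malformed int // touched only by the calling goroutine
		failed    int // shared between goroutines, guarded by mu
	)
	for _, link := range links {
		if !wellFormed(link) {
			malformed++ // counted without issuing a request
			continue
		}
		wg.Add(1)
		go func(target string) {
			defer wg.Done()
			resp, err := c.client.Get(target)
			if err == nil {
				resp.Body.Close()
			}
			// Assumption: transport errors and 4xx/5xx mean "inaccessible".
			if err != nil || resp.StatusCode >= 400 {
				mu.Lock()
				failed++
				mu.Unlock()
			}
		}(link)
	}
	wg.Wait() // block until every request has finished
	return malformed + failed
}

func main() {
	// Local test server: "/missing" answers 404, everything else 200.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path == "/missing" {
			http.NotFound(w, r)
			return
		}
		fmt.Fprintln(w, "ok")
	}))
	defer srv.Close()

	var checker LinkChecker = newConcurrentChecker()
	links := []string{srv.URL + "/", srv.URL + "/missing", "/relative", "::bad"}
	fmt.Println("inaccessible links:", checker.CountInaccessible(links))
}
```

This sketch starts one goroutine per link; for pages with a very large number of links, one might prefer to bound the concurrency, for example with a buffered channel used as a semaphore.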
During development, I had to make a few assumptions and take some arbitrary decisions:
- Both links and login forms can be embedded into HTML in a number of ways. After some research, I singled out some of these and decided to concentrate on them. The code has been written so as to allow easy extension should one decide to include other search strategies, but it should be noted that the application's output does not cover all possible scenarios;
- Internal links are typically relative, and therefore not "complete" on their own. A GET request on such a link cannot succeed unless its absolute URL is first reconstructed from the address of the page that contains it. Because my code does not carry out this reconstruction, the application classifies internal links as inaccessible. I have deemed this acceptable because, in the end, their accessibility depends on one's point of view: from where should they be accessible?
- Needless to say, it would have been possible to use goroutines more extensively than I have done. After several attempts, though, I decided to limit them to the `LinkChecker`, as I could not observe any significant performance improvement from also employing them in the `HTMLAnalyzer`, and I did not wish to needlessly complicate the code.
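Regarding internal links: should one wish to lift that limitation, the reconstruction of absolute URLs could be sketched with the standard library's `net/url` package. This is not part of the application; the `resolveLink` helper and the rule "same host means internal" are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveLink is a hypothetical helper: it resolves href against the URL of
// the page that contains it and reports whether the result stays on the same
// host (an assumed definition of "internal").
func resolveLink(pageURL, href string) (absolute string, internal bool, err error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", false, err
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", false, err
	}
	abs := base.ResolveReference(ref) // relative-reference resolution per RFC 3986
	return abs.String(), abs.Host == base.Host, nil
}

func main() {
	page := "https://example.com/blog/post.html"
	for _, href := range []string{"/about", "../contact", "https://go.dev/doc/"} {
		abs, internal, err := resolveLink(page, href)
		if err != nil {
			fmt.Println("cannot resolve", href, ":", err)
			continue
		}
		fmt.Printf("%-22s -> %s (internal: %t)\n", href, abs, internal)
	}
}
```

With such a helper, internal links could be resolved before the availability check rather than being counted as inaccessible outright.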
First of all, clone this repository and navigate inside the folder:
git clone https://github.com/fra-mari/home24
cd home24
Then, use the following instructions to build and start the application, either directly or, if you do not wish to install `Go`, using `Docker`.
⚠️ N.B.: If you use a Windows system, or you prefer to use `Docker`, please follow the instructions in the following paragraph.
- Ensure you have `Go` installed on your system. You can download it from the official Go website.
- In the project directory, download the dependencies:
  go mod tidy
- Build the application:
  go build -o analyzer_build
  Note: The `-o` flag specifies the output file name. In this example, the compiled binary will be named `analyzer_build` and placed in the current directory.
- Set the server to release mode (the variable must be exported so that the application process can read it):
  export GIN_MODE=release
- Start the application:
  ./analyzer_build
  The application will be accessible at http://localhost:8080. To gracefully shut it down, you may press `Ctrl+C`.
- Ensure you have `Docker` installed on your system. You can download it from the official Docker website.
- Build the Docker image:
  docker build -t analyzer .
- Run the Docker container:
  docker run -p 8080:8080 analyzer
  The application will be accessible at http://localhost:8080. To gracefully shut it down, you may press `Ctrl+C`.
- Add unit tests. The code has been written with tests in mind: the `Analyzer` interface, as well as the `HTMLAnalyzer` and `LinkChecker` interfaces, allow for a straightforward implementation of mock methods, which in turn facilitates complete and granular testing of the business logic;
- Implement strategies for preventing web pages from refusing requests (403 errors);
- Improve the mechanism that tries to recognize login forms, as the app currently identifies only a fraction of them, albeit a significant one;
- Benchmark the code to identify the remaining bottlenecks that hinder performance, so as to refactor it and adopt strategies that further boost the speed of the Analyzer, especially when it processes particularly long HTML documents.
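As an illustration of the first point, here is a rough sketch of how a mock could stand in for the `LinkChecker` so that the surrounding logic can be tested without network access. The `CountInaccessible` method, the `mockLinkChecker` type, and the `summarize` function are hypothetical names introduced for this example; the real interfaces may look different.

```go
package main

import "fmt"

// LinkChecker is an assumed shape for the interface named in this README.
type LinkChecker interface {
	CountInaccessible(links []string) int
}

// mockLinkChecker returns a canned result and records the links it received,
// so that a test can assert on both.
type mockLinkChecker struct {
	result int
	got    []string
}

func (m *mockLinkChecker) CountInaccessible(links []string) int {
	m.got = links
	return m.result
}

// summarize is a hypothetical piece of business logic that depends only on
// the interface, never on a concrete checker.
func summarize(lc LinkChecker, links []string) string {
	return fmt.Sprintf("%d of %d links inaccessible", lc.CountInaccessible(links), len(links))
}

func main() {
	mock := &mockLinkChecker{result: 1}
	fmt.Println(summarize(mock, []string{"https://example.com", "/about"}))
}
```

In an actual test file, the same mock would be used inside a `func TestXxx(t *testing.T)` function, comparing the returned value against the expected one with `t.Errorf`.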