The machine learning part of this application is a recommendation system. We built a model that classifies article titles into fire, crime, and health-related topics. Each reader's reading history is sent to the cloud, where the titles of the articles they have read are processed by the model. Based on what the user has read, the model then recommends appropriate categories.
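As a rough illustration of the last step, the sketch below turns the categories predicted for a reader's history into a recommendation. It assumes that "recommending" simply means returning the most frequent predicted categories; the function name `recommend_categories`, the `top_k` parameter, and the sample history are hypothetical and not taken from the project.

```python
from collections import Counter


def recommend_categories(predicted_categories, top_k=2):
    """Return the top_k most frequent categories in a reader's history.

    predicted_categories: one predicted label per article the user has read.
    This frequency rule is an assumption made for illustration only.
    """
    counts = Counter(predicted_categories)
    return [category for category, _ in counts.most_common(top_k)]


# Hypothetical classifier output for six titles a user has read.
history = ["health", "fire", "health", "crime", "health", "fire"]
print(recommend_categories(history, top_k=2))  # ['health', 'fire']
```

A real deployment would likely also weight recent reads more heavily, but that is outside what this section describes.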
The dataset comes from CNN Indonesia (training and validation data) and Kompas.com (test data). We collect the article title, author, category, article link, image link, and part of the article content. To retrieve the data, we use two tools:
- Beautiful Soup, an HTML parser
- Selenium, a web browser automation tool
We extract the HTML tags that store the data: `<h1>` and `<h2>` for the title, `<p>` for the content, `<a>` for the article links, and `<img>` for the image links. The scraped data is then stored in CSV format.
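The extraction described above can be sketched with Beautiful Soup as follows. This is a minimal example that runs on an inline HTML string rather than a live page; the sample markup, the helper names `parse_article` and `to_csv`, and the CSV column names are assumptions for illustration, and the real scraper may select tags differently.

```python
import csv
import io

from bs4 import BeautifulSoup

# Made-up page used instead of a real CNN Indonesia or Kompas.com article.
SAMPLE_HTML = """
<html><body>
  <h1>Sample headline</h1>
  <a href="https://example.com/article-1">Read more</a>
  <img src="https://example.com/image-1.jpg">
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</body></html>
"""


def parse_article(html):
    """Pull the title, link, image link, and content out of one page."""
    soup = BeautifulSoup(html, "html.parser")
    heading = soup.find(["h1", "h2"])   # first <h1> or <h2> is the title
    link = soup.find("a", href=True)    # first anchor that has an href
    image = soup.find("img", src=True)  # first image that has a src
    return {
        "title": heading.get_text(strip=True) if heading else "",
        "link": link["href"] if link else "",
        "image": image["src"] if image else "",
        "content": " ".join(p.get_text(strip=True) for p in soup.find_all("p")),
    }


def to_csv(rows):
    """Serialize parsed rows to CSV text (written to memory, not to disk)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["title", "link", "image", "content"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


row = parse_article(SAMPLE_HTML)
print(to_csv([row]))
```

For pages that render their content with JavaScript, the HTML string would come from Selenium's rendered page source instead of a plain HTTP response; the parsing step stays the same.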
For more details visit datasets.
- Embedding layer: converts words into a numerical representation. Each word is represented as a dense vector in the embedding space.
- Bidirectional LSTM layer: LSTM is a type of recurrent neural network designed to mitigate the vanishing gradient problem. The bidirectional wrapper reads each title in both directions.
- Dropout layer: used to reduce overfitting in the model.
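A minimal sketch of how these three layers might be stacked, assuming the model is built with Keras (suggested by the `.h5` export, but not stated outright). The vocabulary size, embedding dimension, sequence length, LSTM units, dropout rate, and the final softmax `Dense` layer are all assumptions; only the three output classes (fire, crime, health) come from the description above.

```python
import numpy as np
import tensorflow as tf

# All hyperparameters below are assumed values, not the project's settings.
VOCAB_SIZE = 10000   # number of distinct word indices
EMBEDDING_DIM = 64   # size of each word vector
MAX_LEN = 20         # padded title length in tokens
NUM_CLASSES = 3      # fire, crime, health


def build_model():
    """Embedding -> Bidirectional LSTM -> Dropout -> softmax classifier."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(MAX_LEN,)),
        tf.keras.layers.Embedding(VOCAB_SIZE, EMBEDDING_DIM),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
        tf.keras.layers.Dropout(0.5),
        # Assumed output layer: one probability per category.
        tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model


model = build_model()
# Two all-padding titles, just to check the output shape.
dummy_batch = np.zeros((2, MAX_LEN), dtype="int32")
probabilities = model(dummy_batch).numpy()
print(probabilities.shape)  # (2, 3)
```

Each output row holds one probability per category, so the predicted category for a title is the index of the largest value in its row.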
The model achieved a loss of 0.1744 and an accuracy of 0.9292 on the training data. On the validation data, it achieved a loss of 0.6373 and an accuracy of 0.7874.
To run this model, follow these steps:
- Download the datasets here
- Upload the dataset to your notebook environment
- Install the required libraries
- Pre-process the data
- Tokenize and vectorize the text corpus
- Build and compile the model with the architecture described above
- Evaluate the model
- Convert the model to `.h5` format
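The pre-processing and tokenization steps above can be illustrated without any library. The sketch below is a hand-rolled stand-in, not the project's actual tokenizer: the cleaning rule, the reserved indices for padding and unknown words, the helper names, and the sample titles are all assumptions.

```python
import re

PAD_ID = 0  # assumed index reserved for padding
OOV_ID = 1  # assumed index for words not seen when building the vocabulary


def preprocess(title):
    """Lowercase a title, replace punctuation with spaces, split into words."""
    cleaned = re.sub(r"[^a-z0-9\s]", " ", title.lower())
    return cleaned.split()


def build_vocab(titles):
    """Assign each distinct word an integer index in order of first appearance."""
    vocab = {"<pad>": PAD_ID, "<oov>": OOV_ID}
    for title in titles:
        for word in preprocess(title):
            vocab.setdefault(word, len(vocab))
    return vocab


def vectorize(title, vocab, max_len):
    """Convert a title to a fixed-length list of word indices."""
    ids = [vocab.get(word, OOV_ID) for word in preprocess(title)][:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))


# Made-up sample titles, not taken from the dataset.
titles = ["Kebakaran gudang di Jakarta", "Polisi tangkap pelaku pencurian"]
vocab = build_vocab(titles)
print(vectorize("Kebakaran di Bandung!", vocab, max_len=5))  # [2, 4, 1, 0, 0]
```

The resulting integer sequences are the kind of input an embedding layer expects. For the final step, a Keras model can be written to HDF5 with `model.save("model.h5")`, where the file name is only an example.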