ApplianceInsight: Web Scraping, ML Label Validation, and Visualization for Energy-Efficient Appliances
This project, developed as part of the 4th semester of Datamatiker/Computer Science at UCN, aims to extract information using a web scraper and validate the data through machine learning.
The project comprises several components:
- Web Scraping:
  - Utilizes Scrapy, a Python framework, to extract links to household appliance pages from the sitemaps of a predefined list of websites.
  - Employs Playwright to open those links, extract product information, and capture screenshots of the relevant content.
- Machine Learning Validation:
  - Utilizes FastAPI and a pre-trained object detection model based on YOLO via the Ultralytics framework.
  - Screenshots are processed by the model to check adherence to EU energy labelling rules: it identifies whether a screenshot shows a new (post-2021) EU energy label, a pre-2021 label, or no label at all.
  - The result, along with the previously collected data, is saved to a MongoDB database.
- Database:
  - FastAPI is used to access the MongoDB database through endpoints for data manipulation (POST and DELETE).
- Frontend:
  - An Angular-based frontend interacts with the MongoDB API to display products from the specified sites.
  - Features a grid-like list of products, statistics, and a pie chart showing the distribution of new, old, and unlabeled products.
  - The website is in Danish to cater to local users.
To get started:
- Set up the project environment.
- Install the necessary dependencies (install the Python requirements files with `pip install -r requirements.txt`, and run `npm install` for the frontend).
- Run the different components of the project (the web scraper, the FastAPI services, and the Angular frontend).
The dashboard displays products from the specified sites in a grid-like list, along with statistics and a pie chart showing the distribution of new, old, and unlabeled products.
Technologies used in the project:

Scrapy:
- Purpose: Web scraping framework in Python used to extract relevant links to household appliances from a predefined list of websites.
- Key Features:
  - Efficiently extracts structured data from websites.
  - Enables the creation of robust web crawlers.
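As a rough sketch of how the Scrapy part can be put together, the example below uses a sitemap spider to collect appliance links. The spider name, sitemap URL, and URL filter are hypothetical placeholders, not the project's actual configuration.

```python
# Minimal sketch of a Scrapy sitemap spider (hypothetical names/URLs).
from scrapy.spiders import SitemapSpider


class ApplianceSitemapSpider(SitemapSpider):
    name = "appliance_sitemap"                               # hypothetical spider name
    sitemap_urls = ["https://example-shop.dk/sitemap.xml"]   # placeholder sitemap
    # Only follow sitemap entries that look like appliance product pages.
    sitemap_rules = [("/hvidevarer/", "parse_product")]

    def parse_product(self, response):
        # Yield the product URL so a later step (Playwright) can open it.
        yield {"url": response.url, "title": response.css("title::text").get()}
```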
Playwright:
- Purpose: Headless browser automation library used alongside Scrapy to open links, extract information, and capture screenshots of relevant content.
- Key Features:
  - Provides cross-browser compatibility for web automation.
  - Allows programmatic interaction with web pages.
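Below is a minimal sketch of the Playwright step, assuming the synchronous API; the selector and output path are illustrative only, and the real project may wire Playwright into the scraping pipeline differently.

```python
# Sketch: open a product page, grab some text, and capture a screenshot.
# The selector and file path are hypothetical examples.
from playwright.sync_api import sync_playwright


def capture_product(url: str, screenshot_path: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        # Extract the product name (placeholder selector).
        name = page.inner_text("h1")
        # Screenshot the page so the ML model can look for an energy label.
        page.screenshot(path=screenshot_path, full_page=True)
        browser.close()
        return name
```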
FastAPI:
- Purpose: Python-based web framework used to create the APIs for the machine learning validation process and the MongoDB database.
- Key Features:
  - High performance and asynchronous support.
  - Simple, easy-to-use API development.
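For illustration, the sketch below shows what the MongoDB-facing FastAPI endpoints could look like. The routes, database/collection names, and the choice of the motor driver are assumptions, not the project's actual code.

```python
# Sketch of FastAPI endpoints backed by MongoDB (routes and names are assumptions).
from bson import ObjectId
from fastapi import FastAPI
from motor.motor_asyncio import AsyncIOMotorClient

app = FastAPI()
client = AsyncIOMotorClient("mongodb://localhost:27017")   # placeholder connection string
products = client["appliance_insight"]["products"]         # hypothetical db/collection


@app.post("/products")
async def create_product(product: dict):
    # Store a scraped product document and return its generated id.
    result = await products.insert_one(product)
    return {"id": str(result.inserted_id)}


@app.delete("/products/{product_id}")
async def delete_product(product_id: str):
    # Remove a product by its MongoDB ObjectId.
    await products.delete_one({"_id": ObjectId(product_id)})
    return {"deleted": product_id}
```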
YOLO (Ultralytics):
- Purpose: Pre-trained object detection model based on the YOLO (You Only Look Once) architecture, used to validate the screenshots.
- Key Features:
  - Efficient real-time object detection.
  - Flexible and accurate identification of objects within images.
  - Requires very little code to run inference.
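A minimal sketch of how a screenshot can be classified with Ultralytics is shown below; the weights file and class names are hypothetical stand-ins for the project's trained model.

```python
# Sketch: classify a screenshot with a YOLO model via Ultralytics.
# "energy_label.pt" and the class names are hypothetical placeholders.
from ultralytics import YOLO

model = YOLO("energy_label.pt")  # custom-trained weights (placeholder path)


def classify_screenshot(image_path: str) -> str:
    results = model(image_path)          # run detection on one image
    boxes = results[0].boxes
    if boxes is None or len(boxes) == 0:
        return "no_label"
    # Take the class of the highest-confidence detection.
    best = int(boxes.cls[boxes.conf.argmax()])
    return results[0].names[best]        # e.g. "new_label" or "old_label"
```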
MongoDB:
- Purpose: NoSQL database used to store the extracted data, including information about the appliances and their respective energy labels.
- Key Features:
  - Document-oriented database, offering flexibility for storing unstructured data.
  - Scalability and easy integration with Python.
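As an illustration of the kind of record that could be stored per product, the sketch below uses pymongo with a hypothetical database, collection, and document shape.

```python
# Sketch: store a scraped product and its label-validation result.
# Connection string, db/collection names, and fields are hypothetical.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")       # placeholder connection string
collection = client["appliance_insight"]["products"]    # hypothetical db/collection

document = {
    "site": "example-shop.dk",          # placeholder site name
    "url": "https://example-shop.dk/hvidevarer/vaskemaskine-123",
    "name": "Vaskemaskine 123",
    "label_status": "new_label",        # e.g. new_label / old_label / no_label
    "screenshot": "screenshots/vaskemaskine-123.png",
}
inserted = collection.insert_one(document)
print("Stored product with id", inserted.inserted_id)
```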
Angular:
- Purpose: Frontend framework used to build the user interface that interacts with the MongoDB API to display product information.
- Key Features:
  - Component-based architecture for building dynamic web applications.
  - Two-way data binding and dependency injection for efficient development.
The project was developed by a group of 4 students.
The project is not complete, and there are several areas that could be improved upon:
- The web scraper could be improved to extract more information from the websites.
- The machine learning validation could be improved to extract more information from the screenshots, and expanded to cover the full range of energy labels.
- The frontend could be improved to display more information from the database.