This project provides hands-on experience building a complete data pipeline using Azure services and implementing a machine learning model for movie recommendations. It is an excellent opportunity for me to understand the practical aspects of data engineering and machine learning.
In this project, we will build an end-to-end data pipeline for a movie recommendation system using Azure services. The recommendation system is developed using collaborative filtering and PySpark ML, Spark's machine learning library.
- We will use the MovieLens datasets, which include ratings and movie data of up to 25 million records.
- The data will be stored in Azure Blob Storage, a scalable and secure data storage solution.
- The data transformation process will be handled by Azure Databricks, a fast, easy, and collaborative Apache Spark-based analytics platform.
- The data pipeline will be orchestrated in Azure Data Factory, a cloud-based data integration service.
- Additional components like Azure Logic Apps, Azure Active Directory, and Key Vault will be used for automation, security, and identity management.
- The model will be trained and tested on a large dataset, achieving an RMSE of 0.814 on the test set. The system will be capable of recommending the top 10 movies for a given user (see the sketch after this list).
- Getting started is fairly easy because Azure gives free credits to new users, with access to all the services (in this project I used my university email and got $100 in free credits, with no Visa or debit card required).
- And of course, there are plenty of resources to learn from...
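To make the modeling step concrete, here is a minimal sketch of collaborative filtering with PySpark ML's ALS estimator. The file path and column names are assumptions based on the standard MovieLens `ratings.csv` layout, not the exact notebook code:

```python
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

spark = SparkSession.builder.appName("movie-recs").getOrCreate()

# Assumed layout: MovieLens ratings.csv with userId, movieId, rating, timestamp.
ratings = spark.read.csv("/mnt/movielens/ratings.csv", header=True, inferSchema=True)

train, test = ratings.randomSplit([0.8, 0.2], seed=42)

# coldStartStrategy="drop" avoids NaN predictions for users/items unseen in training.
als = ALS(
    userCol="userId",
    itemCol="movieId",
    ratingCol="rating",
    coldStartStrategy="drop",
)
model = als.fit(train)

# Evaluate with RMSE on the held-out test split.
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")
rmse = evaluator.evaluate(model.transform(test))
print(f"Test RMSE: {rmse:.3f}")

# Top 10 movie recommendations for every user.
top10 = model.recommendForAllUsers(10)
```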
First, we ingest the data. Thanks to MovieLens, we have millions of records available as flat files (CSV).
The first task is to create a Resource Group. A resource group is a container that holds related resources for an Azure solution.
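If you prefer scripting over the Azure portal, here is a minimal sketch using the Azure Python SDK (`pip install azure-identity azure-mgmt-resource`). The resource group name and region are hypothetical, so pick your own:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

# Assumes you are already authenticated (e.g., via `az login`).
subscription_id = "<your-subscription-id>"
client = ResourceManagementClient(DefaultAzureCredential(), subscription_id)

# Hypothetical names: choose your own resource group name and region.
client.resource_groups.create_or_update(
    "rg-movie-recommendation",
    {"location": "eastus"},
)
```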
Within the Resource Group, create an "Azure Storage Account" to store your data files. Choose the type of storage (e.g., Blob storage) and specify the region for the storage account. Blob storage is suitable for storing unstructured data like movie data files (e.g., CSV files).
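The storage account can also be created from Python, as a rough sketch (`pip install azure-mgmt-storage`; the account name is hypothetical and must be globally unique):

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

subscription_id = "<your-subscription-id>"
storage_client = StorageManagementClient(DefaultAzureCredential(), subscription_id)

# begin_create returns a poller; result() waits for provisioning to finish.
poller = storage_client.storage_accounts.begin_create(
    "rg-movie-recommendation",   # resource group from the previous step
    "movierecstorage",           # hypothetical, globally unique account name
    {
        "location": "eastus",
        "kind": "StorageV2",
        "sku": {"name": "Standard_LRS"},
    },
)
account = poller.result()
```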
Upload your movie data files to the Blob storage containers within the Azure Storage Account. These files will be stored securely and can be accessed by Azure services for further processing and analysis.
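A minimal upload sketch with the `azure-storage-blob` package. The connection string comes from the storage account's "Access keys" blade, and the container name here is a hypothetical example:

```python
from azure.storage.blob import BlobServiceClient

# Assumption: connection string copied from the storage account's Access keys blade.
service = BlobServiceClient.from_connection_string("<your-connection-string>")

# Hypothetical container name; create_container fails if it already exists,
# so run it only once.
container = service.get_container_client("movielens-raw")
container.create_container()

# Upload the MovieLens flat files into the container.
for name in ["ratings.csv", "movies.csv"]:
    with open(name, "rb") as data:
        container.upload_blob(name=name, data=data, overwrite=True)
```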
Configure Azure services such as Azure Data Factory or Azure Databricks to access and process the data stored in the Azure Storage Account. These services can perform tasks like data transformations, ETL operations, and generating movie recommendations.
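As a sketch of what the Databricks transformation step might look like, here is one way to split the raw ratings into validated and rejected sets. The validation rule and output paths are assumptions for illustration; the actual notebook may differ:

```python
from pyspark.sql import functions as F

# `spark` is predefined inside a Databricks notebook.
# Assumed input path on the mounted storage (see the Key Vault section below).
raw = spark.read.csv("/mnt/movielens/ratings.csv", header=True, inferSchema=True)

# A row is "validated" if key fields are present and the rating is in range.
is_valid = (
    F.col("userId").isNotNull()
    & F.col("movieId").isNotNull()
    & F.col("rating").between(0.5, 5.0)
)

validated = raw.filter(is_valid)
rejected = raw.filter(~is_valid)

validated.write.mode("overwrite").parquet("/mnt/movielens/validated/ratings")
rejected.write.mode("overwrite").parquet("/mnt/movielens/rejected/ratings")
```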
- Raw data is ingested into Azure Blob Storage; we can trigger the pipeline run whenever new data arrives (I will show you later).
- We transform the data using Azure Databricks and split it into validated data and rejected data.
- Then we orchestrate the ETL using Azure Data Factory.
- Additionally, I will show you how to use Key Vault to store your credentials and use them to mount the services to each other (see the sketch after this list).
- Eventually, the notebook we run in Azure Databricks outputs the results, and an Azure Logic App sends the movie recommendations to us.
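Here is a minimal sketch of mounting Blob Storage from a Databricks notebook with a storage key kept in Key Vault. It assumes you have already created a Key Vault-backed secret scope (called `movie-kv` here) holding the key under `storage-account-key`; `dbutils` is only available inside a Databricks notebook:

```python
# Read the storage account key from the Key Vault-backed secret scope.
storage_key = dbutils.secrets.get(scope="movie-kv", key="storage-account-key")

# Mount the raw-data container; the names match the earlier hypothetical examples.
dbutils.fs.mount(
    source="wasbs://movielens-raw@movierecstorage.blob.core.windows.net",
    mount_point="/mnt/movielens",
    extra_configs={
        "fs.azure.account.key.movierecstorage.blob.core.windows.net": storage_key
    },
)
```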
What you need to run the project:
- Azure account - You need at least $15 in credits for this project, and more for further use.
- Azure Databricks - I highly recommend using Azure Databricks with your Azure account, not the Community Edition (it will not let you generate the Databricks token needed for external connections).
- Documentation - Check the documentation for updates and stay up to date.
There are two ways to run this project:
-- The first is to manually trigger the pipeline in Azure Data Factory, with the ETL and mounting pre-defined.
-- The second is to have the pipeline triggered automatically whenever a file is uploaded to a specific location.
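For the first option, you can also kick off a run programmatically. A minimal sketch with the Data Factory Python SDK (`pip install azure-mgmt-datafactory`); the resource group, factory, and pipeline names are hypothetical:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

subscription_id = "<your-subscription-id>"
adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# Hypothetical resource group, data factory, and pipeline names.
run = adf_client.pipelines.create_run(
    "rg-movie-recommendation",
    "movie-rec-adf",
    "etl-movielens-pipeline",
)
print(f"Triggered pipeline run: {run.run_id}")
```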
Inspired by the following code, articles, and videos:
- An amazing tutor and YouTuber providing such detailed videos
- Documentation and answers by Microsoft
- Mounting and configuration of your choice
- Link to the Demo:
Link