A multi-stage movie recommendation system using YouTube's Two-Tower architecture.
To set up the virtual environment and install the necessary dependencies, run the following command:
make install
This will ensure your environment is prepared with all required packages, ready to run the project.
For a quick start, run test.py.
Different versions of the MovieLens dataset are used for training and evaluation. Because the original datasets contain significant redundancy, I wrote the script dataset.py to generate separate files for movies, users, and ratings. This eliminates duplication and reduces the overall size of each dataset. The script also handles feature engineering and feature selection for the task. The resulting layout is closer to the data you would typically have at hand in a production environment.
- Retained all features from the original dataset except `raw_user_age`, which is only available in the 100k version of the dataset.
- Removed `user_occupation_text`, since `user_occupation_label` was derived from it, making it redundant.
- Most of the `movie_title` values in the dataset include their release years in parentheses, which were extracted and treated as a separate feature. For movies without a listed release year, the corresponding values are left as `NaN`.
- Converted the `user_gender` feature from boolean values to integer representations.
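
The preprocessing steps above can be illustrated with a minimal, standard-library-only sketch. It is not the actual dataset.py: the function names, the `release_year` key, and the `movie_id`, `user_id`, and `user_rating` columns are assumptions made for the example, and the real script may well use a different library and layout.

```python
import math
import re

# Matches a trailing four-digit year in parentheses, e.g. "Toy Story (1995)".
YEAR_PATTERN = re.compile(r"\((\d{4})\)\s*$")


def extract_release_year(movie_title):
    """Return the release year as a float, or NaN when the title has none."""
    match = YEAR_PATTERN.search(movie_title)
    return float(match.group(1)) if match else math.nan


def normalize(rows):
    """Split flat rating rows into de-duplicated movies, users, and ratings."""
    movies, users, ratings = {}, {}, []
    for row in rows:
        # Keyed by ID, so repeated movie and user details are stored once.
        movies[row["movie_id"]] = {
            "movie_title": row["movie_title"],
            "release_year": extract_release_year(row["movie_title"]),
        }
        users[row["user_id"]] = {
            "user_gender": int(row["user_gender"]),  # boolean -> 0/1
            "user_occupation_label": row["user_occupation_label"],
        }
        ratings.append(
            {
                "user_id": row["user_id"],
                "movie_id": row["movie_id"],
                "user_rating": row["user_rating"],
            }
        )
    return movies, users, ratings


# Hypothetical flat rows, in which movie and user details repeat per rating.
rows = [
    {"user_id": "1", "movie_id": "10", "movie_title": "Toy Story (1995)",
     "user_gender": True, "user_occupation_label": 4, "user_rating": 5.0},
    {"user_id": "1", "movie_id": "11", "movie_title": "Untitled Film",
     "user_gender": True, "user_occupation_label": 4, "user_rating": 3.0},
    {"user_id": "2", "movie_id": "10", "movie_title": "Toy Story (1995)",
     "user_gender": False, "user_occupation_label": 7, "user_rating": 4.0},
]
movies, users, ratings = normalize(rows)
```

Keying movies and users by ID is what removes the duplication here: each rating row keeps only the two IDs and the rating, while the descriptive fields live in one place.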
- data/ - Directory for storing the 100k and 1m versions of the MovieLens dataset.
- src/ - Contains the core implementation, including model definitions, architecture, and API functions.
- scripts/ - Useful scripts.
Contributions are welcome and greatly appreciated! If you have an idea for improvement, or if you find a bug, feel free to contribute by opening an issue or a pull request.
This project is licensed under the MIT License. See the LICENSE file for details.