Recommends the k most similar movies to a selected movie, based on the similarity of their plot texts.
5000 American movies are selected from a wiki dataset (see Credits). For each movie plot, I created a text embedding with OpenAI's "text-embedding-3-small" model.
Text embeddings measure the relatedness of text strings by turning the texts into high-dimensional vectors of floating point numbers. The distance between two vectors measures their relatedness: small distances suggest high relatedness and large distances suggest low relatedness.
To list movie recommendations for a selected movie, I selected the records with the smallest vector distances.
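The lookup can be sketched in plain Python; `cosine_distance`, `k_most_similar`, and the toy embeddings below are illustrative names for this sketch, not the notebook's actual code:

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity: small values mean the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

def k_most_similar(query_vec, embeddings, k=3):
    """Rank titles by distance to the query vector and return the k closest."""
    ranked = sorted(embeddings.items(), key=lambda kv: cosine_distance(query_vec, kv[1]))
    return [title for title, _ in ranked[:k]]
```

In the real project the vectors have many more dimensions, but the ranking step is the same idea.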
By visualising the high-dimensional text embeddings in a 2D map with the help of NOMIC Atlas, we can see distinguishable clusters.
https://atlas.nomic.ai/data/csernusszilvi/experimental-arora/map
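Atlas computes its own 2D projection; as a rough illustration of the idea (not what Atlas actually runs), a PCA projection of embeddings down to two dimensions, with a hypothetical helper name:

```python
import numpy as np

def project_to_2d(vectors):
    """Project high-dimensional embeddings onto their top-2 principal components."""
    X = np.asarray(vectors, dtype=float)
    X = X - X.mean(axis=0)                      # centre the data
    _, _, vt = np.linalg.svd(X, full_matrices=False)  # principal directions
    return X @ vt[:2].T                         # shape: (n_points, 2)
```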
-
Prerequisites:
- Make sure Python3 is installed.
- If you don't have an account with OpenAI, create one here: https://openai.com/
- Create a project API key under Dashboard / API keys
- Create a NOMIC Atlas account here: https://atlas.nomic.ai/
-
Clone the project. - Be aware that the project includes the original dataset I used (
wiki_movie_plots_deduped.csv
) as well as the cached
movie_embeddings.pkl
file, which are 81MB and 86MB in size, respectively. Assuming you run the embedding function with the same parameters as in the project, the cache file helps you avoid charges from OpenAI. If you plan to use the embedding function for a different dataset / model, downloading these files won't be necessary. -
Create a virtual environment inside the project folder:
python -m venv venv
-
Activate the virtual environment:
Mac:
source venv/bin/activate
Windows:
venv\Scripts\activate
-
Select interpreter in VSCode:
(on Mac) Cmd + Shift + P ---> Select Interpreter ---> Select the created venv environment
-
Create an
.env
file in the root folder and add your project's API key: OPENAI_API_KEY=your-unique-openai-project-key
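The notebook presumably loads this key with a library such as python-dotenv; a minimal stdlib-only sketch of the same idea, assuming the simple KEY=value format above:

```python
import os

def load_env(path=".env"):
    """Minimal .env loader: one KEY=value per line; blanks and '#' comments ignored.
    (Illustrative only - a real project would typically use python-dotenv.)"""
    if not os.path.exists(path):
        return
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())
```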
-
Install the Python dependencies:
pip install -r requirements.txt
-
Log in to
NOMIC Atlas
- In the terminal, run
nomic login
- Click the link to retrieve your API key, then return to the terminal and run
nomic login <your-api-key>
to authenticate.
-
Run the Jupyter Notebook:
jupyter notebook
The command will open the Notebook in the browser. - Run the cells in the given order in the
movies-embedding.ipynb
file, adjusting the models and cost calculations as necessary. - I used caching when I ran the embedding function myself. The cached pickle file,
movie_embeddings.pkl
is part of this project folder. If you don't change the dataset or the text-embedding model, you won't be charged, as the embedding function uses the cached data whenever it's available. - Be aware that OpenAI will charge you for running the embedding function if you use a different dataset and / or embedding model.
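The caching described above can be sketched as a pickle-backed wrapper; `embed_with_cache`, `embed_fn`, and the file name default are illustrative for this sketch, not the notebook's exact code:

```python
import os
import pickle

def embed_with_cache(text, embed_fn, cache_path="movie_embeddings.pkl"):
    """Return the embedding for `text`, calling the (paid) embed_fn only on a cache miss."""
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            cache = pickle.load(f)
    if text not in cache:
        cache[text] = embed_fn(text)      # the only place an API charge can occur
        with open(cache_path, "wb") as f:
            pickle.dump(cache, f)
    return cache[text]
```

Repeated calls with the same text and cache file hit the pickle and never reach the API, which is why rerunning the notebook unchanged costs nothing.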
-
This project was adapted from Colt Steele's walkthrough project on Udemy: Mastering OpenAI Python APIs.
Changes made: my code and logic differ significantly from Colt's version; I used updated APIs and improved the code's logic.
-
Original dataset: https://www.kaggle.com/datasets/jrobischon/wikipedia-movie-plots?resource=download