diff --git a/(Study Case V) Film Recommendation System/Film Recommender.ipynb b/(Study Case V) Film Recommendation System/Film Recommender.ipynb new file mode 100644 index 0000000..4669bde --- /dev/null +++ b/(Study Case V) Film Recommendation System/Film Recommender.ipynb @@ -0,0 +1,4488 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "view-in-github", + "colab_type": "text" + }, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "QXLJDtxsguzf" + }, + "source": [ + "> Name : Muhammad Ammar Nabil
\n", + "Class  : M03
\n", + "Email  : mammarnabil1@gmail.com" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xt9VQpBWz-XA" + }, + "source": [ + "
\n", + "\n", + "# **Movie Recommender**\n", + "###### [Zahra Nazari, Hamidreza Koohi, Javad Mousavi](https://jad.shahroodut.ac.ir/article_2390.html)\n", + "---\n", + "\n", + "
\n", + "\n", + "In this notebook, we build a recommender model that suggests similar films based on what a user likes (Content-Based Filtering) or on what the community likes (Collaborative Filtering). The model uses the MovieLens dataset from Kaggle, which includes:\n", + "* Film Title \n", + "* Genre\n", + "* Tag\n", + "* Rating\n", + "\n", + "> Number of films listed: **27262**\n", + "\n", + "> Number of ratings listed: about **20.000.000**\n", + "\n", + "
\n", + "\n", + "## • ***Background***\n", + "\n", + "I chose this problem because a recommender can improve the experience of a streaming app by suggesting films similar to those a user likes or those the community likes. This can raise customer satisfaction, which in turn can affect the company's revenue.\n", + "\n", + "My reference is the paper _**\"Increasing Performance of Recommender Systems by Combining Deep Learning and Extreme Learning Machine\"**_ by **Zahra Nazari, Hamidreza Koohi, Javad Mousavi**. In it, the authors apply new deep learning-based clustering methods to overcome the data sparsity problem and to improve the efficiency of recommender systems as measured by precision, accuracy, F-measure, and recall. They use the [MovieLens 20M Dataset](https://www.kaggle.com/datasets/grouplens/movielens-20m-dataset) from Kaggle. For more details about my model, see this [Details Report](https://colab.research.google.com/drive/14zROOHUuS7qmjisQAtCLewyGE_GZ5ML1?usp=sharing) (only available in Bahasa Indonesia).\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "FSKDD17gzGf0" + }, + "outputs": [], + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "import seaborn as sns\n", + "import tensorflow as tf\n", + "import matplotlib.pyplot as plt\n", + "from keras import layers\n", + "from tensorflow import keras\n", + "from google.colab import files\n", + "from sklearn.preprocessing import Normalizer\n", + "from sklearn.model_selection import train_test_split\n", + "from sklearn.metrics.pairwise import cosine_similarity\n", + "from sklearn.feature_extraction.text import TfidfVectorizer\n", + "from tensorflow.keras.callbacks import EarlyStopping, CSVLogger" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "h_7HniFhz08I" + }, + "source": [ + "## **Importing and Understanding the Dataset**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5_FM0R6A-V88" + }, + "source": [ + 
"### *1. Data Loading*" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 74 + }, + "collapsed": true, + "id": "PiqjgZaZqeHu", + "outputId": "cab22b3b-6fc8-44da-d598-f059d8c9152e" + }, + "outputs": [ + { + "output_type": "display_data", + "data": { + "text/plain": [ + "" + ], + "text/html": [ + "\n", + " \n", + " \n", + " Upload widget is only available when the cell has been executed in the\n", + " current browser session. Please rerun this cell to enable.\n", + " \n", + " " + ] + }, + "metadata": {} + }, + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Saving kaggle.json to kaggle.json\n" + ] + } + ], + "source": [ + "# Upload kaggle.json API\n", + "!mkdir ~/.kaggle\n", + "files.upload()\n", + "!mv kaggle.json ~/.kaggle/" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "collapsed": true, + "id": "0Wa0VGWxJRj3", + "outputId": "857d8b28-1c6c-4e0d-eb0d-3c05477f1795" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "total 16\n", + "drwxr-xr-x 2 root root 4096 Sep 25 16:58 .\n", + "drwx------ 1 root root 4096 Sep 25 16:58 ..\n", + "-rw------- 1 root root 63 Sep 25 16:58 kaggle.json\n" + ] + } + ], + "source": [ + "# Change permission\n", + "!chmod 600 ~/.kaggle/kaggle.json\n", + "!ls ~/.kaggle/ -la" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "collapsed": true, + "id": "QN_62ntdKq11", + "outputId": "ec3bed35-d4eb-4253-ba91-2f406536835e" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Downloading movielens-20m-dataset.zip to /content\n", + " 97% 189M/195M [00:08<00:00, 34.6MB/s]\n", + "100% 195M/195M [00:08<00:00, 24.9MB/s]\n", + "Archive: movielens-20m-dataset.zip\n", + " inflating: genome_scores.csv 
\n", + " inflating: genome_tags.csv \n", + " inflating: link.csv \n", + " inflating: movie.csv \n", + " inflating: rating.csv \n", + " inflating: tag.csv \n" + ] + } + ], + "source": [ + "# Download and extract the Kaggle dataset\n", + "!kaggle datasets download -d grouplens/movielens-20m-dataset\n", + "!unzip movielens-20m-dataset.zip\n", + "!rm movielens-20m-dataset.zip" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true, + "id": "PN-F2Sfey7yy" + }, + "outputs": [], + "source": [ + "# Load the dataset\n", + "movies = pd.read_csv('movie.csv')\n", + "ratings = pd.read_csv('rating.csv')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "JAahsD9kj0Y2", + "outputId": "647cc67a-ca5a-4c68-f153-148341313a63" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Total movies \t\t= 27262 movies\n", + "Total rated movies \t= 26744 movies\n" + ] + } + ], + "source": [ + "# Count unique titles, and unique movies that have at least one rating\n", + "print(f'Total movies \\t\\t= {(len(movies.title.unique()))} movies')\n", + "print(f'Total rated movies \\t= {(len(ratings.movieId.unique()))} movies')" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "V4bj8rD0-PdR" + }, + "source": [ + "### *2. 
Exploratory Data Analysis - Variable Description*" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "80_TMAaBl6sm" + }, + "source": [ + "#### **EDA Movies**" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "NZnSsxMjmGz8", + "outputId": "67ae125d-b2d5-4617-8dea-103b635553ac" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Shape of movies (27278, 3)\n", + "\n", + "\n", + "RangeIndex: 27278 entries, 0 to 27277\n", + "Data columns (total 3 columns):\n", + " # Column Non-Null Count Dtype \n", + "--- ------ -------------- ----- \n", + " 0 movieId 27278 non-null int64 \n", + " 1 title 27278 non-null object\n", + " 2 genres 27278 non-null object\n", + "dtypes: int64(1), object(2)\n", + "memory usage: 639.5+ KB\n" + ] + } + ], + "source": [ + "# Check the data type of each attribute\n", + "print(f'Shape of movies {movies.shape}\\n')\n", + "movies.info()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "poBQrSCy1C5W", + "outputId": "e7f73eb1-590b-4f86-ac8c-b5c6b1677cee" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "\n", + "RangeIndex: 27278 entries, 0 to 27277\n", + "Data columns (total 3 columns):\n", + " # Column Non-Null Count Dtype \n", + "--- ------ -------------- ----- \n", + " 0 movieId 27278 non-null category\n", + " 1 title 27278 non-null object \n", + " 2 genres 27278 non-null category\n", + "dtypes: category(2), object(1)\n", + "memory usage: 1.6+ MB\n" + ] + } + ], + "source": [ + "# Change the data type of movieId and genres to category\n", + "movies = movies.astype({'movieId': 'category', 'genres': 'category'})\n", + "movies.info()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "VvBpuPtxnSTC", + "outputId": 
"72f00982-8b5f-4b5d-9766-126526f01192" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Total movies : 27262 movie\n", + "\n" + ] + }, + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array(['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)',\n", + " ..., 'The Pirates (2014)', 'Rentun Ruusu (2001)',\n", + " 'Innocence (2014)'], dtype=object)" + ] + }, + "metadata": {}, + "execution_count": 9 + } + ], + "source": [ + "# Check title attribute\n", + "print(f'Total movies : {len(movies.title.unique())} movie\\n') \n", + "movies.title.unique()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "JDHIRePrnsHB", + "outputId": "51f32f0e-0c7b-418e-de92-05dc86348473" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Total genres : 1342 genre\n", + "\n" + ] + }, + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "['Adventure|Animation|Children|Comedy|Fantasy', 'Adventure|Children|Fantasy', 'Comedy|Romance', 'Comedy|Drama|Romance', 'Comedy', ..., 'Adventure|Children|Drama|Sci-Fi', 'Children|Documentary|Drama', 'Action|Adventure|Animation|Fantasy|Horror', 'Animation|Children|Comedy|Fantasy|Sci-Fi', 'Animation|Children|Comedy|Western']\n", + "Length: 1342\n", + "Categories (1342, object): ['(no genres listed)', 'Action', 'Action|Adventure',\n", + " 'Action|Adventure|Animation', ..., 'Thriller|Western', 'War', 'War|Western', 'Western']" + ] + }, + "metadata": {}, + "execution_count": 10 + } + ], + "source": [ + "# Check genre attribute\n", + "print(f'Total genres : {len(movies.genres.unique())} genre\\n') \n", + "movies.genres.unique()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "znMufTthmIyf" + }, + "source": [ + "#### **EDA Ratings**" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": 
"https://localhost:8080/" + }, + "id": "DdK7qhCDmHXM", + "outputId": "7a9f4b97-285b-4414-8a11-aa0893de4d7f" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Shape of Ratings (20000263, 4)\n", + "\n", + "\n", + "RangeIndex: 20000263 entries, 0 to 20000262\n", + "Data columns (total 4 columns):\n", + " # Column Dtype \n", + "--- ------ ----- \n", + " 0 userId int64 \n", + " 1 movieId int64 \n", + " 2 rating float64\n", + " 3 timestamp object \n", + "dtypes: float64(1), int64(2), object(1)\n", + "memory usage: 610.4+ MB\n" + ] + } + ], + "source": [ + "# Check data type each attribute\n", + "print(f'Shape of Ratings {ratings.shape}\\n')\n", + "ratings.info()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "i5koKrww1_mY", + "outputId": "863d3d77-7959-4af7-c7fc-bf3acf07af91" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "\n", + "RangeIndex: 20000263 entries, 0 to 20000262\n", + "Data columns (total 4 columns):\n", + " # Column Dtype \n", + "--- ------ ----- \n", + " 0 userId category \n", + " 1 movieId category \n", + " 2 rating float32 \n", + " 3 timestamp datetime64[ns]\n", + "dtypes: category(2), datetime64[ns](1), float32(1)\n", + "memory usage: 349.6 MB\n" + ] + } + ], + "source": [ + "# Change datatype to get less memory\n", + "ratings = ratings.astype({'movieId': 'category', 'userId': 'category', \n", + " 'rating': 'float32', 'timestamp': 'datetime64[ns]'})\n", + "ratings.info()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "a2fvOjE4o6nb", + "outputId": "14be5312-13e2-43e0-c5ab-3f39a7e9b190" + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "count 20000263.0\n", + "mean 4.0\n", + "std 1.0\n", + "min 0.0\n", + "25% 3.0\n", + "50% 4.0\n", + "75% 4.0\n", + 
"max 5.0\n", + "Name: rating, dtype: float64" + ] + }, + "metadata": {}, + "execution_count": 13 + } + ], + "source": [ + "# Range of ratings\n", + "ratings['rating'].describe().round()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "stGZ4WzqoAZn", + "outputId": "a70a97d6-8915-484e-c665-56ce40b8ee5c" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Total Ratings count : 26744 rating\n", + "\n" + ] + }, + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0 3.5\n", + "1 3.5\n", + "2 3.5\n", + "3 3.5\n", + "4 3.5\n", + "Name: rating, dtype: float32" + ] + }, + "metadata": {}, + "execution_count": 14 + } + ], + "source": [ + "# Check rating attribute\n", + "print(f'Total Ratings count : {len(ratings.movieId.unique())} rating\\n') \n", + "ratings.rating.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xXMgxzzPXqsA" + }, + "source": [ + "### *3. 
Exploratory Data Analysis - Checking Missing Value*" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "8fHcuJG9XP5e", + "outputId": "65a994ed-313c-4229-a16d-6140aa399071" + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "movieId 0\n", + "title 0\n", + "genres 0\n", + "dtype: int64" + ] + }, + "metadata": {}, + "execution_count": 15 + } + ], + "source": [ + "# Check Null value in movies\n", + "movies.isna().sum()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "wz1thDRKXJQc", + "outputId": "c1ea24ea-8567-4908-f8c8-81e6d5cab825" + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "userId 0\n", + "movieId 0\n", + "rating 0\n", + "timestamp 0\n", + "dtype: int64" + ] + }, + "metadata": {}, + "execution_count": 16 + } + ], + "source": [ + "# Check null value in ratings\n", + "ratings.isna().sum()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "z0RYJwW2R6W2" + }, + "source": [ + "## **Data Preprocessing**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VEtZKK2YR8gI" + }, + "source": [ + "### *1. 
Content Based Filtering*" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 480 + }, + "id": "Ixvj9ynhQ4Ef", + "outputId": "292bca8d-fe03-430e-d4f8-a7614805f533" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stderr", + "text": [ + "/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:12: FutureWarning: In a future version of pandas all arguments of DataFrame.drop except for the argument 'labels' will be keyword-only\n", + " if sys.path[0] == '':\n" + ] + }, + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " movieId film genre\n", + "0 1 toy story Adventure\n", + "1 2 jumanji Adventure\n", + "2 3 grumpier old men Comedy\n", + "3 4 waiting to exhale Comedy\n", + "4 5 father of the bride part ii Comedy\n", + "... ... ... ...\n", + "27273 131254 kein bund für's leben Comedy\n", + "27274 131256 feuer, eis & dosenbier Comedy\n", + "27275 131258 the pirates Adventure\n", + "27276 131260 rentun ruusu NaN\n", + "27277 131262 innocence Adventure\n", + "\n", + "[27278 rows x 3 columns]" + ], + "text/html": [ + "\n", + "
\n", + " " + ] + }, + "metadata": {}, + "execution_count": 21 + } + ], + "source": [ + "# Keep only the first listed genre for each film\n", + "genre_fix = movies['genres'].map(lambda genre: genre.split('|')[0])\n", + "genre_fix = pd.DataFrame(genre_fix.replace('(no genres listed)', np.nan))\n", + "\n", + "# Rename some genres; lowercase each title and strip the trailing ' (YYYY)' year\n", + "genre_fix = genre_fix.replace({'genres': {'Sci-Fi': 'Scifi', 'Film-Noir': 'Noir'}})\n", + "title_fix = pd.DataFrame(movies['title'].map(lambda title: title.lower()[:-7]))\n", + "\n", + "movies_fix = movies.copy()\n", + "movies_fix['film'] = title_fix\n", + "movies_fix['genre'] = genre_fix\n", + "# Pass axis by keyword: the positional form is deprecated in pandas\n", + "movies_fix.drop(['title', 'genres'], axis=1, inplace=True)\n", + "movies_fix['genre'] = movies_fix['genre'].astype('category')\n", + "movies_fix" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 638 + }, + "id": "9Tsud3emVO6O", + "outputId": "b83a7d11-3f45-4700-d11e-c2e92ffe449d" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Before droping null value\n", + "movieId 0\n", + "film 0\n", + "genre 246\n", + "dtype: int64 \n", + "\n", + "After droping null value\n", + "movieId 0\n", + "film 0\n", + "genre 0\n", + "dtype: int64 \n", + "\n" + ] + }, + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " movieId film genre\n", + "0 1 toy story Adventure\n", + "1 2 jumanji Adventure\n", + "2 3 grumpier old men Comedy\n", + "3 4 waiting to exhale Comedy\n", + "4 5 father of the bride part ii Comedy\n", + "... ... ... ...\n", + "27272 131252 forklift driver klaus: the first day on the job Comedy\n", + "27273 131254 kein bund für's leben Comedy\n", + "27274 131256 feuer, eis & dosenbier Comedy\n", + "27275 131258 the pirates Adventure\n", + "27277 131262 innocence Adventure\n", + "\n", + "[27032 rows x 3 columns]" + ], + "text/html": [ + "\n", + "
\n", + " " + ] + }, + "metadata": {}, + "execution_count": 22 + } + ], + "source": [ + "# Drop null value\n", + "print('Before droping null value')\n", + "print(movies_fix.isna().sum(), '\\n')\n", + "\n", + "print('After droping null value')\n", + "movies_fix.dropna(inplace=True)\n", + "print(movies_fix.isna().sum(), '\\n')\n", + "movies_fix" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 424 + }, + "id": "NbTmyWDuX9X9", + "outputId": "40a28f4b-60f9-4ea1-8ba6-c8ec8d759cfc" + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " movieId film genre\n", + "0 1 toy story Adventure\n", + "1 2 jumanji Adventure\n", + "2 3 grumpier old men Comedy\n", + "3 4 waiting to exhale Comedy\n", + "4 5 father of the bride part ii Comedy\n", + "... ... ... ...\n", + "27271 131250 no more school Comedy\n", + "27272 131252 forklift driver klaus: the first day on the job Comedy\n", + "27273 131254 kein bund für's leben Comedy\n", + "27274 131256 feuer, eis & dosenbier Comedy\n", + "27275 131258 the pirates Adventure\n", + "\n", + "[25966 rows x 3 columns]" + ], + "text/html": [ + "\n", + "
\n", + " " + ] + }, + "metadata": {}, + "execution_count": 23 + } + ], + "source": [ + "# Drop duplicate film titles, keeping the first occurrence\n", + "movies_fix.drop_duplicates('film', inplace=True)\n", + "movies_fix" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 424 + }, + "id": "63h1PdlteBBh", + "outputId": "fc8d11bd-61d0-48b5-c025-8cf9b78c5660" + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " id film genre\n", + "0 1 toy story Adventure\n", + "1 2 jumanji Adventure\n", + "2 3 grumpier old men Comedy\n", + "3 4 waiting to exhale Comedy\n", + "4 5 father of the bride part ii Comedy\n", + "... ... ... ...\n", + "27271 131250 no more school Comedy\n", + "27272 131252 forklift driver klaus: the first day on the job Comedy\n", + "27273 131254 kein bund für's leben Comedy\n", + "27274 131256 feuer, eis & dosenbier Comedy\n", + "27275 131258 the pirates Adventure\n", + "\n", + "[25966 rows x 3 columns]" + ], + "text/html": [ + "\n", + "
\n", + " " + ] + }, + "metadata": {}, + "execution_count": 24 + } + ], + "source": [ + "# Create data variable\n", + "data = movies_fix.copy().rename(columns={'movieId':'id'})\n", + "data" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "inLs88GJSKXV" + }, + "source": [ + "### *2. Collaborative Filtering*" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "a-qeCNoQqx8l" + }, + "source": [ + "#### **Generate movie, user to index variable**" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 661 + }, + "collapsed": true, + "id": "ulQFfTL-S21d", + "outputId": "c3a027d3-1d6b-492e-8026-98035422e619" + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " userId movieId rating timestamp\n", + "0 1 2 3.5 2005-04-02 23:53:47\n", + "1 1 29 3.5 2005-04-02 23:31:16\n", + "2 1 32 3.5 2005-04-02 23:33:39\n", + "3 1 47 3.5 2005-04-02 23:32:07\n", + "4 1 50 3.5 2005-04-02 23:29:40\n", + "... ... ... ... ...\n", + "20000258 138493 68954 4.5 2009-11-13 15:42:00\n", + "20000259 138493 69526 4.5 2009-12-03 18:31:48\n", + "20000260 138493 69644 3.0 2009-12-07 18:10:57\n", + "20000261 138493 70286 5.0 2009-11-13 15:42:24\n", + "20000262 138493 71619 2.5 2009-10-17 20:25:36\n", + "\n", + "[20000263 rows x 4 columns]" + ], + "text/html": [ + "\n", + "
\n", + " " + ] + }, + "metadata": {}, + "execution_count": 41 + } + ], + "source": [ + "# Create rate dataset\n", + "rate_raw = ratings.copy()\n", + "rate_raw" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "lDD2DmZJhgWv", + "outputId": "7529143a-c812-45f2-a03b-eb817ba71449" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Total userID: 138493\n", + "Total movieID: 26744 \n", + "\n" + ] + } + ], + "source": [ + "# Get unique user and movie IDs\n", + "user_id = rate_raw['userId'].unique().tolist()\n", + "movie_id = rate_raw['movieId'].unique().tolist()\n", + "print('Total userID: ', len(user_id))\n", + "print('Total movieID: ', len(movie_id), '\\n')\n", + " \n", + "# Create dict userId -> index\n", + "user_to_index = {x: i for i, x in enumerate(user_id)}\n", + "# print('Encoded userID : ', user_to_index)\n", + " \n", + "# Create dict index -> userId\n", + "index_to_user = {i: x for i, x in enumerate(user_id)}\n", + "# print('Encoded index:userID: ', index_to_user)\n", + "\n", + "# Create dict movieId -> index\n", + "movie_to_index = {x: i for i, x in enumerate(movie_id)}\n", + "# print('Encoded movieID: ', movie_to_index)\n", + "\n", + "# Create dict index -> movieId\n", + "index_to_movie = {i: x for i, x in enumerate(movie_id)}\n", + "# print('Encoded index:movieID: ', index_to_movie)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kh-qi7hblkqB" + }, + "source": [ + "Apologies: the output of the mappings above is not displayed, because printing that much data (>200000 entries) crashes the application." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "CCy0UYH22U-O", + "outputId": "c2ba9b65-02c1-46ff-fc6f-ad434fa786a6" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Total of users = 138493\n", + "Total of movies = 26744\n" 
+ ] + } + ], + "source": [ + "# Store the number of users and movies\n", + "num_user = len(user_to_index)\n", + "num_movie = len(index_to_movie)\n", + "print('Total of users = ', num_user)\n", + "print('Total of movies = ', num_movie)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "eVrgPqW1oaoB" + }, + "outputs": [], + "source": [ + "# Add user and movie columns holding the encoded user and movie indices\n", + "rate_raw['user'] = rate_raw['userId'].map(user_to_index)\n", + "rate_raw['movie'] = rate_raw['movieId'].map(movie_to_index)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 357 + }, + "id": "Mt0bn317qi1H", + "outputId": "3114af17-a32e-44b4-dad4-0a9893c71491" + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " userId movieId rating timestamp user movie\n", + "0 1 2 3.5 2005-04-02 23:53:47 0 0\n", + "1 1 29 3.5 2005-04-02 23:31:16 0 1\n", + "2 1 32 3.5 2005-04-02 23:33:39 0 2\n", + "3 1 47 3.5 2005-04-02 23:32:07 0 3\n", + "4 1 50 3.5 2005-04-02 23:29:40 0 4" + ], + "text/html": [ + "\n", + "
\n", + " " + ] + }, + "metadata": {}, + "execution_count": 45 + } + ], + "source": [ + "rate_raw.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "IjXngGc7vJhn" + }, + "source": [ + "#### **Normalize**" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "YDXzpupTyI1t" + }, + "outputs": [], + "source": [ + "# Store the min and max rating, used to normalize the target\n", + "min_rate = rate_raw['rating'].min()\n", + "max_rate = rate_raw['rating'].max()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 318 + }, + "collapsed": true, + "id": "9-taeuarvJhw", + "outputId": "2ee29525-ab7a-43ea-f9ec-27ff95417528" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Before normalize : \n" + ] + }, + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " rating\n", + "count 20000263.00\n", + "mean 3.53\n", + "std 1.05\n", + "min 0.50\n", + "25% 3.00\n", + "50% 3.50\n", + "75% 4.00\n", + "max 5.00" + ], + "text/html": [ + "\n", + "
\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
rating
count20000263.00
mean3.53
std1.05
min0.50
25%3.00
50%3.50
75%4.00
max5.00
\n", + "
\n", + " \n", + " \n", + " \n", + "\n", + " \n", + "
\n", + "
\n", + " " + ] + }, + "metadata": {}, + "execution_count": 77 + } + ], + "source": [ + "# Split features and label, then min-max normalize the label\n", + "print('Before normalize : ')\n", + "\n", + "x = rate_raw[['user', 'movie']].to_numpy()\n", + "# Vectorized min-max scaling (same result as a row-wise apply, but faster)\n", + "y = ((rate_raw['rating'] - min_rate) / (max_rate - min_rate)).to_numpy()\n", + "\n", + "rate_raw.describe().round(2)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 300 + }, + "collapsed": true, + "id": "KxjLG9KZvJh0", + "outputId": "70361749-2a13-4436-bf20-1174ba134520" + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " 0\n", + "count 20000263.0\n", + "mean 0.7\n", + "std 0.2\n", + "min 0.0\n", + "25% 0.6\n", + "50% 0.7\n", + "75% 0.8\n", + "max 1.0" + ], + "text/html": [ + "\n", + "
\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
0
count20000263.0
mean0.7
std0.2
min0.0
25%0.6
50%0.7
75%0.8
max1.0
\n", + "
\n", + " \n", + " \n", + " \n", + "\n", + " \n", + "
\n", + "
\n", + " " + ] + }, + "metadata": {}, + "execution_count": 78 + } + ], + "source": [ + "# After Normalize\n", + "pd.DataFrame(y).describe().round(1)" + ] + }, + { + "cell_type": "code", + "source": [ + "# Change x dtype to get less memory\n", + "x = x.astype('int32')\n", + "x.dtype" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "mRcddjD3c5AL", + "outputId": "d83b0e25-ea92-4754-87c8-a10b2f3bba1b" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "dtype('int32')" + ] + }, + "metadata": {}, + "execution_count": 80 + } + ] + }, + { + "cell_type": "code", + "source": [ + "# Change y dtype to get less memory\n", + "y = y.astype('float16')\n", + "y.dtype" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "rIBN5H5Rd29D", + "outputId": "a830303d-4805-419b-90ce-4356bd1d2815" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "dtype('float16')" + ] + }, + "metadata": {}, + "execution_count": 81 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hwE9MmNyXDuo" + }, + "source": [ + "#### **Split Dataset**" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "BYQWJqlmXJyE" + }, + "outputs": [], + "source": [ + "# Split train and test set\n", + "x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.001, \n", + " random_state = 123)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "collapsed": true, + "id": "qjZmCZf7XTDc", + "outputId": "b3fd6b8c-e6f4-4957-9d36-2e33d53e33d5" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Total # of sample in whole dataset: 20000263\n", + "Total # of sample in train dataset: 19980262\n", + "Total # of sample in test dataset: 20001\n" + ] + } + 
], + "source": [ + "# Check total of test and train set\n", + "print(f'Total # of sample in whole dataset: {len(x)}')\n", + "print(f'Total # of sample in train dataset: {len(x_train)}')\n", + "print(f'Total # of sample in test dataset: {len(x_test)}')" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9zpiVQdCZpou" + }, + "source": [ + "## **Model Development**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0SLgSmWLfwzp" + }, + "source": [ + "### *1. Content Based Filtering*" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "U004XzEwf4iw", + "outputId": "86633ba6-5768-4d2a-9407-e66ed99e0abd" + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "['Adventure', 'Comedy', 'Action', 'Drama', 'Crime', ..., 'Romance', 'War', 'Scifi', 'Musical', 'IMAX']\n", + "Length: 19\n", + "Categories (19, object): ['Action', 'Adventure', 'Animation', 'Children', ..., 'Scifi', 'Thriller',\n", + " 'War', 'Western']" + ] + }, + "metadata": {}, + "execution_count": 26 + } + ], + "source": [ + "data.genre.unique()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "collapsed": true, + "id": "cn5nPatEZu2Q", + "outputId": "ef0b6d36-43b2-4ade-d8c0-33c2e0bfa2ba" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Shape of TF-IDF Matrix = (25966, 19) \n", + "\n" + ] + }, + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array(['action', 'adventure', 'animation', 'children', 'comedy', 'crime',\n", + " 'documentary', 'drama', 'fantasy', 'horror', 'imax', 'musical',\n", + " 'mystery', 'noir', 'romance', 'scifi', 'thriller', 'war',\n", + " 'western'], dtype=object)" + ] + }, + "metadata": {}, + "execution_count": 27 + } + ], + "source": [ + "# Vectorize data with TF-IDF\n", + "# Named tfidf (not tf) so the tensorflow alias tf is not overwritten\n", + "tfidf = TfidfVectorizer()\n", + "tfidf_matrix = tfidf.fit_transform(data['genre'])\n", + "\n", + "print('Shape of TF-IDF Matrix =', tfidf_matrix.shape, '\\n')\n", + "tfidf.get_feature_names_out()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "QWmqupDtjiBZ", + "outputId": "9038b96d-a7b4-45bb-efb0-777242707e6a" + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "matrix([[0., 1., 0., ..., 0., 0., 0.],\n", + " [0., 1., 0., ..., 0., 0., 0.],\n", + " [0., 0., 0., ..., 0., 0., 0.],\n", + " ...,\n", + " [0., 0., 0., ..., 0., 0., 0.],\n", + " [0., 0., 0., ..., 0., 0., 0.],\n", + " [0., 1., 0., ..., 0., 0., 0.]])" + ] + }, + "metadata": {}, + "execution_count": 28 + } + ], + "source": [ + "# Dense TF-IDF Matrix\n", + "tfidf_matrix.todense()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "gSquNe3gjpZ5", + "outputId": "237ec34f-63ce-472f-cbc8-27c0997729c2" + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[1., 1., 0., ..., 0., 0., 1.],\n", + " [1., 1., 0., ..., 0., 0., 1.],\n", + " [0., 0., 1., ..., 1., 1., 0.],\n", + " ...,\n", + " [0., 0., 1., ..., 1., 1., 0.],\n", + " [0., 0., 1., ..., 1., 1., 0.],\n", + " [1., 1., 0., ..., 0., 0., 1.]])" + ] + }, + "metadata": {}, + "execution_count": 29 + } + ], + "source": [ + "# Calculate cosine similarity of tfidf_matrix\n", + "cosine_matrix = cosine_similarity(tfidf_matrix)\n", + "cosine_matrix" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 540 + }, + "collapsed": true, + "id": "L0-oM3Hfj2O9", + "outputId": "b4b28684-e3ef-4b35-aec3-85d441b43056" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Shape: (25966, 25966)\n" + ] + }, + { 
+ "output_type": "execute_result", + "data": { + "text/plain": [ + "film toy story jumanji grumpier old men \\\n", + "film \n", + "toy story 1.0 1.0 0.0 \n", + "jumanji 1.0 1.0 0.0 \n", + "grumpier old men 0.0 0.0 1.0 \n", + "waiting to exhale 0.0 0.0 1.0 \n", + "father of the bride part ii 0.0 0.0 1.0 \n", + "\n", + "film waiting to exhale father of the bride part ii \\\n", + "film \n", + "toy story 0.0 0.0 \n", + "jumanji 0.0 0.0 \n", + "grumpier old men 1.0 1.0 \n", + "waiting to exhale 1.0 1.0 \n", + "father of the bride part ii 1.0 1.0 \n", + "\n", + "film heat sabrina tom and huck sudden death \\\n", + "film \n", + "toy story 0.0 0.0 1.0 0.0 \n", + "jumanji 0.0 0.0 1.0 0.0 \n", + "grumpier old men 0.0 1.0 0.0 0.0 \n", + "waiting to exhale 0.0 1.0 0.0 0.0 \n", + "father of the bride part ii 0.0 1.0 0.0 0.0 \n", + "\n", + "film goldeneye ... what men talk about \\\n", + "film ... \n", + "toy story 0.0 ... 0.0 \n", + "jumanji 0.0 ... 0.0 \n", + "grumpier old men 0.0 ... 1.0 \n", + "waiting to exhale 0.0 ... 1.0 \n", + "father of the bride part ii 0.0 ... 
1.0 \n", + "\n", + "film three quarter moon ants in the pants \\\n", + "film \n", + "toy story 0.0 0.0 \n", + "jumanji 0.0 0.0 \n", + "grumpier old men 1.0 1.0 \n", + "waiting to exhale 1.0 1.0 \n", + "father of the bride part ii 1.0 1.0 \n", + "\n", + "film werner - gekotzt wird später brother bear 2 \\\n", + "film \n", + "toy story 0.0 1.0 \n", + "jumanji 0.0 1.0 \n", + "grumpier old men 0.0 0.0 \n", + "waiting to exhale 0.0 0.0 \n", + "father of the bride part ii 0.0 0.0 \n", + "\n", + "film no more school \\\n", + "film \n", + "toy story 0.0 \n", + "jumanji 0.0 \n", + "grumpier old men 1.0 \n", + "waiting to exhale 1.0 \n", + "father of the bride part ii 1.0 \n", + "\n", + "film forklift driver klaus: the first day on the job \\\n", + "film \n", + "toy story 0.0 \n", + "jumanji 0.0 \n", + "grumpier old men 1.0 \n", + "waiting to exhale 1.0 \n", + "father of the bride part ii 1.0 \n", + "\n", + "film kein bund für's leben feuer, eis & dosenbier \\\n", + "film \n", + "toy story 0.0 0.0 \n", + "jumanji 0.0 0.0 \n", + "grumpier old men 1.0 1.0 \n", + "waiting to exhale 1.0 1.0 \n", + "father of the bride part ii 1.0 1.0 \n", + "\n", + "film the pirates \n", + "film \n", + "toy story 1.0 \n", + "jumanji 1.0 \n", + "grumpier old men 0.0 \n", + "waiting to exhale 0.0 \n", + "father of the bride part ii 0.0 \n", + "\n", + "[5 rows x 25966 columns]" + ], + "text/html": [ + "\n", + "
\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
filmtoy storyjumanjigrumpier old menwaiting to exhalefather of the bride part iiheatsabrinatom and hucksudden deathgoldeneye...what men talk aboutthree quarter moonants in the pantswerner - gekotzt wird späterbrother bear 2no more schoolforklift driver klaus: the first day on the jobkein bund für's lebenfeuer, eis & dosenbierthe pirates
film
toy story1.01.00.00.00.00.00.01.00.00.0...0.00.00.00.01.00.00.00.00.01.0
jumanji1.01.00.00.00.00.00.01.00.00.0...0.00.00.00.01.00.00.00.00.01.0
grumpier old men0.00.01.01.01.00.01.00.00.00.0...1.01.01.00.00.01.01.01.01.00.0
waiting to exhale0.00.01.01.01.00.01.00.00.00.0...1.01.01.00.00.01.01.01.01.00.0
father of the bride part ii0.00.01.01.01.00.01.00.00.00.0...1.01.01.00.00.01.01.01.01.00.0
\n", + "

5 rows × 25966 columns

\n", + "
\n", + " \n", + " \n", + " \n", + "\n", + " \n", + "
\n", + "
\n", + " " + ] + }, + "metadata": {}, + "execution_count": 30 + } + ], + "source": [ + "# Create cosine similarity dataframe indexed by film\n", + "cosine_df = pd.DataFrame(cosine_matrix, index=data['film'], columns=data['film'])\n", + "print('Shape:', cosine_df.shape)\n", + "\n", + "cosine_df.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "QuF4iwhf0RzL" + }, + "source": [ + "### *2. Collaborative Filtering*" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "SwKfyUIe0UGd" + }, + "outputs": [], + "source": [ + "# Define the RecommenderNet class\n", + "class RecommenderNet(tf.keras.Model):\n", + "\n", + "    # Define Constructor\n", + "    def __init__(self, num_users, num_movie, embedding_size, **kwargs):\n", + "        super(RecommenderNet, self).__init__(**kwargs)\n", + "        self.num_users = num_users\n", + "        self.num_movie = num_movie\n", + "        self.embedding_size = embedding_size\n", + "        self.user_embedding = layers.Embedding(\n", + "            num_users,\n", + "            embedding_size,\n", + "            embeddings_initializer = 'he_normal',\n", + "            embeddings_regularizer = keras.regularizers.l2(1e-6)\n", + "        )\n", + "        self.user_bias = layers.Embedding(num_users, 1)\n", + "        self.movie_embedding = layers.Embedding(\n", + "            num_movie,\n", + "            embedding_size,\n", + "            embeddings_initializer = 'he_normal',\n", + "            embeddings_regularizer = keras.regularizers.l2(1e-6)\n", + "        )\n", + "        self.movie_bias = layers.Embedding(num_movie, 1)\n", + "\n", + "    def call(self, inputs):\n", + "        user_vector = self.user_embedding(inputs[:, 0])\n", + "        user_bias = self.user_bias(inputs[:, 0])\n", + "        movie_vector = self.movie_embedding(inputs[:, 1])\n", + "        movie_bias = self.movie_bias(inputs[:, 1])\n", + "\n", + "        # Per-sample dot product; tf.tensordot(..., 2) would also sum over the batch\n", + "        dot_user_movie = tf.reduce_sum(user_vector * movie_vector, axis=1, keepdims=True)\n", + "\n", + "        x = dot_user_movie + user_bias + movie_bias\n", + "\n", + "        return tf.nn.sigmoid(x)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "EcLEC-Xm0dOA" + }, 
+ "outputs": [], + "source": [ + "# Create Model\n", + "model = RecommenderNet(num_user, num_movie, 50)\n", + "model.compile(optimizer='Adam', loss='binary_crossentropy',\n", + "              metrics=[tf.keras.metrics.MeanSquaredError()])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "NYPeZ1L1hMHZ" + }, + "outputs": [], + "source": [ + "# Create Callback\n", + "callback = [\n", + "    EarlyStopping(monitor='val_mean_squared_error', min_delta=0.1,\n", + "                  patience=8, verbose=1),\n", + "]" + ] + }, + { + "cell_type": "code", + "source": [ + "# Train Model\n", + "history = model.fit(\n", + "    x_train,\n", + "    y_train,\n", + "    batch_size=200000,\n", + "    epochs=50,\n", + "    callbacks=callback,\n", + "    validation_data=(x_test, y_test)\n", + ")" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "B4_HCA1NeX5e", + "outputId": "7c6dd734-a0d3-4aae-ffa2-e9a543c366e0" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Epoch 1/50\n", + "100/100 [==============================] - 29s 265ms/step - loss: 0.6538 - mean_squared_error: 0.0639 - val_loss: 0.6472 - val_mean_squared_error: 0.0611\n", + "Epoch 2/50\n", + "100/100 [==============================] - 27s 267ms/step - loss: 1.4197 - mean_squared_error: 0.1478 - val_loss: 0.8099 - val_mean_squared_error: 0.1398\n", + "Epoch 3/50\n", + "100/100 [==============================] - 33s 331ms/step - loss: 6.1476 - mean_squared_error: 0.2614 - val_loss: 1.3699 - val_mean_squared_error: 0.1526\n", + "Epoch 4/50\n", + "100/100 [==============================] - 26s 262ms/step - loss: 6.5779 - mean_squared_error: 0.1901 - val_loss: 2.0804 - val_mean_squared_error: 0.4447\n", + "Epoch 5/50\n", + "100/100 [==============================] - 26s 258ms/step - loss: 4.5104 - mean_squared_error: 0.2610 - val_loss: 0.8934 - val_mean_squared_error: 0.1162\n", + "Epoch 6/50\n", + "100/100 [==============================] - 27s 266ms/step - loss: 4.1191 - mean_squared_error: 0.1977 - val_loss: 2.6865 - 
val_mean_squared_error: 0.4805\n", + "Epoch 7/50\n", + "100/100 [==============================] - 33s 324ms/step - loss: 3.3527 - mean_squared_error: 0.2311 - val_loss: 0.6211 - val_mean_squared_error: 0.0457\n", + "Epoch 8/50\n", + "100/100 [==============================] - 26s 259ms/step - loss: 4.7516 - mean_squared_error: 0.2012 - val_loss: 4.0370 - val_mean_squared_error: 0.5024\n", + "Epoch 9/50\n", + "100/100 [==============================] - 29s 284ms/step - loss: 6.0048 - mean_squared_error: 0.2688 - val_loss: 0.6586 - val_mean_squared_error: 0.0594\n", + "Epoch 9: early stopping\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7rcvfAibTuI6" + }, + "source": [ + "## **Model Evaluation**" + ] + }, + { + "cell_type": "markdown", + "source": [ + "### *1. Content Based Filtering*" + ], + "metadata": { + "id": "HF4rnIr8oi2n" + } + }, + { + "cell_type": "markdown", + "source": [ + "Before running the code below, first run the Content Based Filtering testing cells at the end of the notebook." + ], + "metadata": { + "id": "YX2hMvT8pBvx" + } + }, + { + "cell_type": "code", + "source": [ + "# Check genre of toy story\n", + "genre_target = data[data.film.eq('toy story')]['genre'].iloc[0]\n", + "genre_target" + ], + "metadata": { + "id": "jL5h-fFReF5w", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 35 + }, + "outputId": "063ef040-a773-4522-b051-111171e1eae8" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "'Adventure'" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "string" + } + }, + "metadata": {}, + "execution_count": 33 + } + ] + }, + { + "cell_type": "code", + "source": [ + "# Get recommendations\n", + "num_recommendation = 15\n", + "\n", + "result = film_recommendations('toy story', num_recommendation)\n", + "num_correct = result[result.genre == genre_target].genre.count()\n", + "pred_score = (num_correct / 
num_recommendation)*100\n", + "num_all_genre_target = data[data.genre == genre_target].genre.count()\n", + "recall_score = (num_correct / num_all_genre_target)*100\n", + "\n", + "\n", + "print(f'Precision of model is : {int(pred_score)} %')\n", + "print(f'Recall of model is : {recall_score} %')" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "dni1uUiPmTcd", + "outputId": "2d7d780c-4eaa-4569-c908-3c590a31a9ac" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Precision of model is : 100 %\n", + "Recall of model is : 1.1727912431587177 %\n" + ] + } + ] + }, + { + "cell_type": "code", + "source": [ + "film_recommendations('toy story', num_recommendation)" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 520 + }, + "id": "zZRyCax7tmUk", + "outputId": "0c8b2067-5c3e-408c-93ae-51114b85b64b" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " film genre\n", + "0 The Pirates Adventure\n", + "1 Toy Story 3 Adventure\n", + "2 Descent: Part 2, The Adventure\n", + "3 The Black Rose Adventure\n", + "4 Young Winston Adventure\n", + "5 St Trinian'S 2: The Legend Of Fritton'S Gold Adventure\n", + "6 Sky Crawlers, The (Sukai Kurora) Adventure\n", + "7 Shrek Forever After (A.K.A. Shrek: The Final C... Adventure\n", + "8 Percy Jackson & The Olympians: The Lightning T... Adventure\n", + "9 B.N.B. (Bunty Aur Babli) Adventure\n", + "10 Agora Adventure\n", + "11 When Dinosaurs Ruled The Earth Adventure\n", + "12 North Face (Nordwand) Adventure\n", + "13 Men Who Tread On The Tiger'S Tail, The (Tora N... Adventure\n", + "14 How To Train Your Dragon Adventure" + ], + "text/html": [ + "\n", + "
\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
filmgenre
0The PiratesAdventure
1Toy Story 3Adventure
2Descent: Part 2, TheAdventure
3The Black RoseAdventure
4Young WinstonAdventure
5St Trinian'S 2: The Legend Of Fritton'S GoldAdventure
6Sky Crawlers, The (Sukai Kurora)Adventure
7Shrek Forever After (A.K.A. Shrek: The Final C...Adventure
8Percy Jackson & The Olympians: The Lightning T...Adventure
9B.N.B. (Bunty Aur Babli)Adventure
10AgoraAdventure
11When Dinosaurs Ruled The EarthAdventure
12North Face (Nordwand)Adventure
13Men Who Tread On The Tiger'S Tail, The (Tora N...Adventure
14How To Train Your DragonAdventure
\n", + "
\n", + " \n", + " \n", + " \n", + "\n", + " \n", + "
\n", + "
\n", + " " + ] + }, + "metadata": {}, + "execution_count": 68 + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "### *2. Collaborative Filtering*" + ], + "metadata": { + "id": "tGr1gqR5omsa" + } + }, + { + "cell_type": "code", + "source": [ + "plt.style.use('dark_background')\n", + "\n", + "plt.plot(history.history['mean_squared_error'], '#1f77b4')\n", + "plt.plot(history.history['val_mean_squared_error'], '#ff7f0e')\n", + "plt.title('Mean Squared Error')\n", + "plt.ylabel('MSE')\n", + "plt.xlabel('epoch')\n", + "plt.legend(['train', 'test'], loc='upper left')\n", + "plt.show()" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 295 + }, + "outputId": "d3732c35-9da7-4d07-b595-232cb6aa0912", + "id": "OEFtKSGbomsd" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "display_data", + "data": { + "text/plain": [ + "
" + ], + "image/png": "\n" + }, + "metadata": {} + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vxHfPGCs8RTV" + }, + "source": [ + "## **Testing**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "l0lc1ZeODqCa" + }, + "source": [ + "### *1. Content Based Filtering*" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "pjPmobAQ8Ulv" + }, + "outputs": [], + "source": [ + "# Create recommendations function\n", + "def film_recommendations(film: str,\n", + "                         n: int=5,\n", + "                         similarity_data: pd.DataFrame=cosine_df,\n", + "                         items: pd.DataFrame=data[['film', 'genre']]):\n", + "    \"\"\"\n", + "    Recommends the top N films most similar to the given film, based on genre.\n", + "\n", + "    Parameter:\n", + "    ---\n", + "    film : string (str)\n", + "        The film used as the reference for recommendations.\n", + "    n : integer (int)\n", + "        Number of recommendations to return.\n", + "    similarity_data : pd.DataFrame (object)\n", + "        Symmetric similarity dataframe with film titles as both index and\n", + "        columns.\n", + "    items : pd.DataFrame (object)\n", + "        Dataframe of film info.\n", + "\n", + "    Returns\n", + "    ---\n", + "    A dataframe of the top N recommendations.\n", + "    \"\"\"\n", + "\n", + "    # Partition so the n+1 most similar films sit, sorted, at the end\n", + "    index = similarity_data.loc[:, film.lower()].to_numpy().argpartition(\n", + "        range(-1, -(n + 2), -1))\n", + "\n", + "    # Take the n+1 closest films, most similar first\n", + "    closest = similarity_data.columns[index[-1:-(n+2):-1]]\n", + "\n", + "    # Drop the reference film itself\n", + "    closest = closest.drop(film.lower(), errors='ignore')\n", + "\n", + "    result = pd.DataFrame(closest).merge(items).head(n)\n", + "    film_fix = result['film'].map(lambda title: title.title())\n", + "    result['film'] = film_fix\n", + "\n", + "    return result" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 81 + }, + "id": "57632FgrBTOh", + "outputId": 
"babff8b0-add9-46b5-824f-d8f9b1dd6423" + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " id film genre\n", + "0 1 toy story Adventure" + ], + "text/html": [ + "\n", + "
\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
idfilmgenre
01toy storyAdventure
\n", + "
\n", + " \n", + " \n", + " \n", + "\n", + " \n", + "
\n", + "
\n", + " " + ] + }, + "metadata": {}, + "execution_count": 36 + } + ], + "source": [ + "# Check genre of toy story\n", + "data[data.film.eq('toy story')]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 520 + }, + "id": "B_hq1_O7BepY", + "outputId": "d7c80c38-6ea1-4737-aaec-bbeba37bd22e" + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " film genre\n", + "0 The Pirates Adventure\n", + "1 Toy Story 3 Adventure\n", + "2 Descent: Part 2, The Adventure\n", + "3 The Black Rose Adventure\n", + "4 Young Winston Adventure\n", + "5 St Trinian'S 2: The Legend Of Fritton'S Gold Adventure\n", + "6 Sky Crawlers, The (Sukai Kurora) Adventure\n", + "7 Shrek Forever After (A.K.A. Shrek: The Final C... Adventure\n", + "8 Percy Jackson & The Olympians: The Lightning T... Adventure\n", + "9 B.N.B. (Bunty Aur Babli) Adventure\n", + "10 Agora Adventure\n", + "11 When Dinosaurs Ruled The Earth Adventure\n", + "12 North Face (Nordwand) Adventure\n", + "13 Men Who Tread On The Tiger'S Tail, The (Tora N... Adventure\n", + "14 How To Train Your Dragon Adventure" + ], + "text/html": [ + "\n", + "
\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
filmgenre
0The PiratesAdventure
1Toy Story 3Adventure
2Descent: Part 2, TheAdventure
3The Black RoseAdventure
4Young WinstonAdventure
5St Trinian'S 2: The Legend Of Fritton'S GoldAdventure
6Sky Crawlers, The (Sukai Kurora)Adventure
7Shrek Forever After (A.K.A. Shrek: The Final C...Adventure
8Percy Jackson & The Olympians: The Lightning T...Adventure
9B.N.B. (Bunty Aur Babli)Adventure
10AgoraAdventure
11When Dinosaurs Ruled The EarthAdventure
12North Face (Nordwand)Adventure
13Men Who Tread On The Tiger'S Tail, The (Tora N...Adventure
14How To Train Your DragonAdventure
\n", + "
\n", + " \n", + " \n", + " \n", + "\n", + " \n", + "
\n", + "
\n", + " " + ] + }, + "metadata": {}, + "execution_count": 37 + } + ], + "source": [ + "# Get recommendations\n", + "film_recommendations('toy story', 15)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7pbOlhvDDv-H" + }, + "source": [ + "### *2. Collaborative Filtering*" + ] + }, + { + "cell_type": "code", + "source": [ + "dataset = data.copy()" + ], + "metadata": { + "id": "fLaPor3LZe9l" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "TmdckyZFLpwi" + }, + "outputs": [], + "source": [ + "def highRatedMovie(user: int=user_id,\n", + "                   n: int=5,\n", + "                   data_raw: pd.DataFrame=dataset,\n", + "                   data_rate: pd.DataFrame=ratings):\n", + "    '''\n", + "    Show the N highest-rated movies of a user\n", + "\n", + "    Parameter\n", + "    ---\n", + "    user: int\n", + "        Id of the user. Must be a member of the dataset.\n", + "    n: int\n", + "        Number of high-rated movies to show. Default 5\n", + "    data_raw: pd.DataFrame\n", + "        Dataset of movies. Must contain id, film, and genre. Default\n", + "        dataset\n", + "    data_rate: pd.DataFrame\n", + "        Dataset of ratings. 
Must contain rating, movieId, and userId.\n", + "        Default ratings\n", + "\n", + "    Return\n", + "    ---\n", + "    Prints the N highest-rated movies of the user\n", + "    '''\n", + "    # Print the result header\n", + "    print('===' * 13)\n", + "    print(f'Movie with high ratings from user {user}')\n", + "    print('===' * 13)\n", + "\n", + "    # Filter on data_rate (the argument), not the global ratings dataframe\n", + "    watched_movie_by_user = data_rate[data_rate.userId == user]\n", + "    top_movie_user = (watched_movie_by_user.sort_values(\n", + "                          by = 'rating',\n", + "                          ascending=False\n", + "                      )\n", + "                      .head(n)\n", + "                      .movieId.values\n", + "    )\n", + "\n", + "    dataset_rows = data_raw[data_raw['id'].isin(top_movie_user)]\n", + "    for row in dataset_rows.itertuples():\n", + "        print(' •', (row.film).title(), ':', row.genre)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "4n2-0tlMODeI", + "outputId": "4baaebcd-8180-4863-9923-6d4e7f95929b" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "=======================================\n", + "Movie with high ratings from user 546\n", + "=======================================\n", + " • Dr. 
Strangelove Or: How I Learned To Stop Worrying And Love The Bomb : Comedy\n", + " • Godfather, The : Crime\n", + " • William Shakespeare'S Romeo + Juliet : Drama\n", + " • Reservoir Dogs : Crime\n", + " • Ice Storm, The : Drama\n", + " • American Beauty : Comedy\n", + " • Ghost Dog: The Way Of The Samurai : Crime\n", + " • City Of God (Cidade De Deus) : Action\n", + " • Lost In Translation : Comedy\n", + " • Garden State : Comedy\n" + ] + } + ], + "source": [ + "highRatedMovie(546,10)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "g5WKFOkcGMJ2" + }, + "outputs": [], + "source": [ + "def movieRecommendation(user: int=user_id,\n", + "                        n: int=5,\n", + "                        data_raw: pd.DataFrame=dataset,\n", + "                        data_rate: pd.DataFrame=ratings,\n", + "                        movieIdx: dict=movie_to_index,\n", + "                        userIdx: dict=user_to_index):\n", + "    '''\n", + "    Show the top N movie recommendations for a user\n", + "\n", + "    Parameter\n", + "    ---\n", + "    user: int\n", + "        Id of the user. Must be a member of the dataset.\n", + "    n: int\n", + "        Number of recommended movies to show. Default 5\n", + "    data_raw: pd.DataFrame\n", + "        Dataset of movies. Must contain id, film, and genre. Default\n", + "        dataset\n", + "    data_rate: pd.DataFrame\n", + "        Dataset of ratings. Must contain rating, movieId, and userId.\n", + "        Default ratings\n", + "    movieIdx: dict\n", + "        Mapping from movieId to encoded movie index. Default movie_to_index\n", + "    userIdx: dict\n", + "        Mapping from userId to encoded user index. 
Default user_to_index\n", + "\n", + "    Return\n", + "    ---\n", + "    Prints the top N movie recommendations for the user\n", + "    '''\n", + "    # Keep only the movies this user has not watched\n", + "    watched_movie_by_user = data_rate[data_rate.userId == user]\n", + "\n", + "    movie_not_watched = data_raw[~data_raw['id'].isin(watched_movie_by_user.movieId.values)]['id']\n", + "    movie_not_watched = list(\n", + "        set(movie_not_watched)\n", + "        .intersection(set(movieIdx.keys()))\n", + "    )\n", + "    movie_not_watched = [[movieIdx.get(x)] for x in movie_not_watched]\n", + "\n", + "    # Pair the encoded user with every unwatched movie and predict ratings\n", + "    user_encoder = userIdx.get(user)\n", + "    user_movie_array = np.hstack(\n", + "        ([[user_encoder]] * len(movie_not_watched), movie_not_watched)\n", + "    )\n", + "    recommendation = model.predict(user_movie_array).flatten()\n", + "    top_ratings_indices = recommendation.argsort()[-n:][::-1]\n", + "\n", + "    recommended_movie_ids = [\n", + "        index_to_movie.get(movie_not_watched[x][0]) for x in top_ratings_indices\n", + "    ]\n", + "\n", + "    # Print the result\n", + "    print('====' * 11)\n", + "    print(f'Top {n} movie recommendation for user {user}')\n", + "    print('====' * 11)\n", + "\n", + "    recommended_movies = data_raw[data_raw['id'].isin(recommended_movie_ids)]\n", + "    for row in recommended_movies.itertuples():\n", + "        print(' •', (row.film).title(), ':', row.genre)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "-HJBNHpTLsB8", + "outputId": "aeb03bea-0550-4fe1-ea53-6cb96fa70014" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "============================================\n", + "Top 10 movie recommendation for user 546\n", + "============================================\n", + " • Vertigo : Drama\n", + " • Rear Window : Mystery\n", + " • It Happened One Night : Comedy\n", + " • Sunset Blvd. (A.K.A. 
Sunset Boulevard) : Drama\n", + " • 12 Angry Men : Drama\n", + " • Best Years Of Our Lives, The : Drama\n", + " • On The Waterfront : Crime\n", + " • 400 Blows, The (Les Quatre Cents Coups) : Crime\n", + " • Rashomon (Rashômon) : Crime\n", + " • Secret In Their Eyes, The (El Secreto De Sus Ojos) : Crime\n" + ] + } + ], + "source": [ + "movieRecommendation(546, 10)" + ] + } + ], + "metadata": { + "colab": { + "collapsed_sections": [ + "h_7HniFhz08I", + "5_FM0R6A-V88", + "V4bj8rD0-PdR", + "80_TMAaBl6sm", + "znMufTthmIyf", + "xXMgxzzPXqsA", + "z0RYJwW2R6W2", + "VEtZKK2YR8gI", + "inLs88GJSKXV", + "a-qeCNoQqx8l", + "IjXngGc7vJhn", + "hwE9MmNyXDuo", + "0SLgSmWLfwzp", + "QuF4iwhf0RzL", + "7rcvfAibTuI6", + "tGr1gqR5omsa" + ], + "provenance": [], + "authorship_tag": "ABX9TyO2ZIy/QtQU/4ia5H4B9vy2", + "include_colab_link": true + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + }, + "language_info": { + "name": "python" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} \ No newline at end of file