From an anime dataset, we build a recommendation system that uses clustering techniques.
Recommended anime are extracted from the characteristics of each cluster.
Users are segmented by their anime rating history.
- Preprocessing
- Visualization
- K-Mean clustering
- Characteristic of each cluster
Import all libraries we need for data mining.
# Basic libraries
from random import randint
from chernoff_faces import cface
# Import numpy, pandas and matplot libraries
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
from wordcloud import WordCloud
%matplotlib inline
# Machine learning libraries
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# Finding spark in jupyter notebook
import findspark
findspark.init()
# Frequent pattern libraries
from pyspark.sql import SparkSession
from pyspark.ml.fpm import FPGrowth
from pyspark.sql.functions import *
# Import HTML helpers to change the output area's flex direction
from IPython.display import HTML
from IPython.display import display
HTML('<style>.output{flex-direction:row;flex-wrap:wrap}</style>')
# Setup style of charts
plt.style.use('seaborn')
%config InlineBackend.figure_formats = {'png', 'retina'}
In this section, we pre-process the data. We first clean it and replace missing values with valid ones (the attribute mean over all samples belonging to the same class), then remove extra columns, combine the datasets, and finally sample the data to reduce its volume.
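As a minimal sketch of the class-mean imputation idea on a hypothetical toy frame (the notebook's own implementation appears further below):
import pandas as pd
# Toy frame: missing `score` values are filled with the mean of rows sharing the same `type`
toy = pd.DataFrame({"type": ["TV", "TV", "Movie", "Movie"], "score": [8.0, None, 6.0, None]})
toy["score"] = toy["score"].fillna(toy.groupby("type")["score"].transform("mean"))
print(toy)  # the TV NaN becomes 8.0, the Movie NaN becomes 6.0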
# Read anime dataset from csv file
anime = pd.read_csv("../data/anime.csv")
# Show the first 3 records
anime.head(3)
MAL_ID | Name | Score | Genres | English name | Japanese name | Type | Episodes | Aired | Premiered | ... | Score-10 | Score-9 | Score-8 | Score-7 | Score-6 | Score-5 | Score-4 | Score-3 | Score-2 | Score-1 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Cowboy Bebop | 8.78 | Action, Adventure, Comedy, Drama, Sci-Fi, Space | Cowboy Bebop | カウボーイビバップ | TV | 26 | Apr 3, 1998 to Apr 24, 1999 | Spring 1998 | ... | 229170.0 | 182126.0 | 131625.0 | 62330.0 | 20688.0 | 8904.0 | 3184.0 | 1357.0 | 741.0 | 1580.0 |
1 | 5 | Cowboy Bebop: Tengoku no Tobira | 8.39 | Action, Drama, Mystery, Sci-Fi, Space | Cowboy Bebop:The Movie | カウボーイビバップ 天国の扉 | Movie | 1 | Sep 1, 2001 | Unknown | ... | 30043.0 | 49201.0 | 49505.0 | 22632.0 | 5805.0 | 1877.0 | 577.0 | 221.0 | 109.0 | 379.0 |
2 | 6 | Trigun | 8.24 | Action, Sci-Fi, Adventure, Comedy, Drama, Shounen | Trigun | トライガン | TV | 26 | Apr 1, 1998 to Sep 30, 1998 | Spring 1998 | ... | 50229.0 | 75651.0 | 86142.0 | 49432.0 | 15376.0 | 5838.0 | 1965.0 | 664.0 | 316.0 | 533.0 |
3 rows × 35 columns
# Count of row and column
anime.shape
(17562, 35)
# Remove extra columns from the anime dataframe, then rename `MAL_ID` to `anime_id`
# and set `anime_id` as the index of the dataframe
anime = (
anime[
[
"MAL_ID",
"Ranked",
"Popularity",
"Name",
"Genres",
"Type",
"Source",
"Rating",
"Episodes",
"Score",
"Members",
"Favorites",
]
]
.rename(
{
"MAL_ID": "anime_id",
"Ranked": "ranked",
"Popularity": "popularity",
"Name": "name",
"Genres": "genres",
"Type": "type",
"Source": "source",
"Rating": "rating",
"Episodes": "episodes",
"Score": "score",
"Members": "members",
"Favorites": "favorites",
},
axis=1,
)
.set_index("anime_id")
)
# Remove invalid rows with `Unknown` ranked
anime = anime[(anime["ranked"] != "Unknown") & (anime["ranked"] != "0.0")]
# Replace missing `Score` and `Episodes` with zero
anime["score"].replace("Unknown", 0.0, inplace=True)
anime["episodes"].replace("Unknown", 0, inplace=True)
# Change the `Ranked`, `Episodes` and `Score` columns to numeric for math operations,
# as well as sort the table by `Ranked`
anime = (
anime.astype({"ranked": "float"})
.astype({"ranked": "int", "episodes": "int", "score": "float"})
.sort_values("ranked")
)
# Calculate mean of `Score` and `Episodes` for each `Type`
group_by_type = anime.groupby("type")
print("✓ Mean of score for each type")
display(mean_scores := group_by_type["score"].mean().round(2))
print("\n✓ Mean of episodes for each type")
display(mean_episodes := group_by_type["episodes"].mean().round().astype(int))
✓ Mean of score for each type
type
Movie 4.42
Music 2.93
ONA 3.58
OVA 4.27
Special 5.14
TV 5.50
Name: score, dtype: float64
✓ Mean of episodes for each type
type
Movie 1
Music 1
ONA 9
OVA 2
Special 2
TV 33
Name: episodes, dtype: int32
# Replace zero `Score` values with the mean of their own type
for index in mean_scores.index:
    anime["score"].mask(
        (anime["type"] == index) & (anime["score"] == 0.0),
        mean_scores[index],
        inplace=True,
    )
# Replace zero `Episodes` values with the mean of their own type
for index in mean_episodes.index:
    anime["episodes"].mask(
        (anime["type"] == index) & (anime["episodes"] == 0),
        mean_episodes[index],
        inplace=True,
    )
# Write anime dataset to csv file
anime.to_csv("../data/anime_reduce.csv")
anime = pd.read_csv("../data/anime_reduce.csv")
anime.head()
anime_id | ranked | popularity | name | genres | type | source | rating | episodes | score | members | favorites | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 5114 | 1 | 3 | Fullmetal Alchemist: Brotherhood | Action, Military, Adventure, Comedy, Drama, Ma... | TV | Manga | R - 17+ (violence & profanity) | 64 | 9.19 | 2248456 | 183914 |
1 | 40028 | 2 | 119 | Shingeki no Kyojin: The Final Season | Action, Military, Mystery, Super Power, Drama,... | TV | Manga | R - 17+ (violence & profanity) | 16 | 9.17 | 733260 | 44862 |
2 | 9253 | 3 | 9 | Steins;Gate | Thriller, Sci-Fi | TV | Visual novel | PG-13 - Teens 13 or older | 24 | 9.11 | 1771162 | 148452 |
3 | 38524 | 4 | 63 | Shingeki no Kyojin Season 3 Part 2 | Action, Drama, Fantasy, Military, Mystery, Sho... | TV | Manga | R - 17+ (violence & profanity) | 10 | 9.10 | 1073626 | 40985 |
4 | 28977 | 5 | 329 | Gintama° | Action, Comedy, Historical, Parody, Samurai, S... | TV | Manga | PG-13 - Teens 13 or older | 51 | 9.10 | 404121 | 11868 |
anime.shape
(15798, 12)
anime.describe().round(2)
anime_id | ranked | popularity | episodes | score | members | favorites | |
---|---|---|---|---|---|---|---|
count | 15798.00 | 15798.00 | 15798.00 | 15798.00 | 15798.00 | 15798.00 | 15798.00 |
mean | 21601.96 | 7896.21 | 8884.46 | 12.46 | 5.88 | 37752.95 | 503.86 |
std | 14671.07 | 4556.74 | 5229.80 | 49.12 | 1.36 | 131649.10 | 4281.61 |
min | 1.00 | 1.00 | 1.00 | 1.00 | 1.85 | 25.00 | 0.00 |
25% | 6514.00 | 3946.50 | 4084.00 | 1.00 | 5.14 | 298.00 | 0.00 |
50% | 23350.00 | 7896.00 | 9195.50 | 2.00 | 6.08 | 1737.00 | 2.00 |
75% | 35454.75 | 11845.75 | 13518.75 | 12.00 | 6.92 | 15706.00 | 31.00 |
max | 48480.00 | 15780.00 | 17560.00 | 3057.00 | 9.19 | 2589552.00 | 183914.00 |
rating = pd.read_csv("../data/rating.csv").rename({"rating": "user_rating"}, axis=1)
rating.head(10)
user_id | anime_id | user_rating | |
---|---|---|---|
0 | 0 | 430 | 9 |
1 | 0 | 1004 | 5 |
2 | 0 | 3010 | 7 |
3 | 0 | 570 | 7 |
4 | 0 | 2762 | 9 |
5 | 0 | 431 | 8 |
6 | 0 | 578 | 10 |
7 | 0 | 433 | 6 |
8 | 0 | 1571 | 10 |
9 | 0 | 121 | 9 |
rating.shape
(57633278, 3)
rating["user_rating"].describe().round(2)
count 57633278.00
mean 7.51
std 1.70
min 1.00
25% 7.00
50% 8.00
75% 9.00
max 10.00
Name: user_rating, dtype: float64
Because there are many users, there are many different criteria for rating anime. We therefore compute the mean rating of each user; an anime rated above the user's own mean is marked as liked:
rating['user_rating'] > rating['mean_rating'] => the user likes this anime
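A vectorized sketch of this rule (equivalent in spirit to the merge performed below; not part of the notebook's actual flow):
# Broadcast each user's mean rating back onto every one of their rows
rating["mean_rating"] = rating.groupby("user_id")["user_rating"].transform("mean")
# A user likes an anime when they rated it at or above their own mean
liked = rating[rating["user_rating"] >= rating["mean_rating"]]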
# User 922 has a low rating mean
rating[rating["user_id"] == 922]["user_rating"].mean().round(3)
1.0
# User 105 has a middle rating mean
rating[rating["user_id"] == 105]["user_rating"].mean().round(3)
5.0
# User 12 has a high rating mean
rating[rating["user_id"] == 12]["user_rating"].mean().round(3)
10.0
mean_rating_per_user = (
    rating.groupby("user_id")["user_rating"].mean().rename("mean_rating").reset_index()
)
mean_rating_per_user.head(10)
user_id | mean_rating | |
---|---|---|
0 | 0 | 7.400000 |
1 | 1 | 8.058252 |
2 | 2 | 8.333333 |
3 | 3 | 7.603175 |
4 | 4 | 7.652542 |
5 | 5 | 8.162791 |
6 | 6 | 7.073955 |
7 | 7 | 7.908046 |
8 | 8 | 7.611111 |
9 | 10 | 7.750000 |
rating = pd.merge(rating, mean_rating_per_user, on="user_id")
rating.head()
user_id | anime_id | user_rating | mean_rating | |
---|---|---|---|---|
0 | 0 | 430 | 9 | 7.4 |
1 | 0 | 1004 | 5 | 7.4 |
2 | 0 | 3010 | 7 | 7.4 |
3 | 0 | 570 | 7 | 7.4 |
4 | 0 | 2762 | 9 | 7.4 |
# Drop anime the user didn't like (rated below their own mean)
rating = rating.drop(rating[rating["user_rating"] < rating["mean_rating"]].index)
# User 922 likes only one anime
rating[rating["user_id"] == 922].head()
user_id | anime_id | user_rating | mean_rating | |
---|---|---|---|---|
149818 | 922 | 16870 | 1 | 1.0 |
# User 105 likes only one anime
rating[rating["user_id"] == 105].head()
user_id | anime_id | user_rating | mean_rating | |
---|---|---|---|---|
13008 | 105 | 249 | 5 | 5.0 |
rating[rating["user_id"] == 12].head()
user_id | anime_id | user_rating | mean_rating | |
---|---|---|---|---|
1246 | 12 | 31964 | 10 | 10.0 |
1247 | 12 | 16335 | 10 | 10.0 |
1248 | 12 | 11021 | 10 | 10.0 |
1249 | 12 | 35062 | 10 | 10.0 |
1250 | 12 | 20785 | 10 | 10.0 |
rating.shape
(30706635, 4)
# Number of users
len(rating["user_id"].unique())
310059
In this section, we reduce the size of the dataset because of running time and memory constraints.
# Merge anime and rating data frame
mergedata = pd.merge(anime, rating, on="anime_id")
# Keep records with user_id <= 25000; this cutoff is an arbitrary choice
mergedata = mergedata[mergedata["user_id"] <= 25000]
mergedata.head()
anime_id | ranked | popularity | name | genres | type | source | rating | episodes | score | members | favorites | user_id | user_rating | mean_rating | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 5114 | 1 | 3 | Fullmetal Alchemist: Brotherhood | Action, Military, Adventure, Comedy, Drama, Ma... | TV | Manga | R - 17+ (violence & profanity) | 64 | 9.19 | 2248456 | 183914 | 1 | 10 | 8.058252 |
1 | 5114 | 1 | 3 | Fullmetal Alchemist: Brotherhood | Action, Military, Adventure, Comedy, Drama, Ma... | TV | Manga | R - 17+ (violence & profanity) | 64 | 9.19 | 2248456 | 183914 | 6 | 10 | 7.073955 |
2 | 5114 | 1 | 3 | Fullmetal Alchemist: Brotherhood | Action, Military, Adventure, Comedy, Drama, Ma... | TV | Manga | R - 17+ (violence & profanity) | 64 | 9.19 | 2248456 | 183914 | 7 | 10 | 7.908046 |
3 | 5114 | 1 | 3 | Fullmetal Alchemist: Brotherhood | Action, Military, Adventure, Comedy, Drama, Ma... | TV | Manga | R - 17+ (violence & profanity) | 64 | 9.19 | 2248456 | 183914 | 11 | 10 | 8.503106 |
4 | 5114 | 1 | 3 | Fullmetal Alchemist: Brotherhood | Action, Military, Adventure, Comedy, Drama, Ma... | TV | Manga | R - 17+ (violence & profanity) | 64 | 9.19 | 2248456 | 183914 | 12 | 10 | 10.000000 |
# Count of anime in mergedata
len(mergedata["anime_id"].unique())
11411
# Count of anime in actual dataset
len(anime["anime_id"].unique())
15798
mergedata.to_csv('../data/mergedata.csv')
Show the details of the anime each user likes as a user × anime matrix (1 = liked).
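A toy sketch of what `pd.crosstab` produces here (hypothetical user IDs and anime names; one row per user, one column per anime):
import pandas as pd
toy = pd.DataFrame({"user_id": [1, 1, 2], "name": ["A", "B", "A"]})
print(pd.crosstab(toy["user_id"], toy["name"]))
# name     A  B
# user_id
# 1        1  1
# 2        1  0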
mergedata = pd.read_csv("../data/mergedata.csv")
user_anime = pd.crosstab(mergedata["user_id"], mergedata["name"])
user_anime.columns.name = None
user_anime.index.name = None
user_anime.head(10)
"0" | "Bungaku Shoujo" Kyou no Oyatsu: Hatsukoi | "Bungaku Shoujo" Memoire | "Bungaku Shoujo" Movie | "Calpis" Hakkou Monogatari | "Eiji" | "Eiyuu" Kaitai | "Kiss Dekiru Gyoza" x Mameshiba Movie | "Parade" de Satie | "R100" x Mameshiba Original Manners | ... | s.CRY.ed Alteration I: Tao | s.CRY.ed Alteration II: Quan | the FLY BanD! | xxxHOLiC | xxxHOLiC Kei | xxxHOLiC Movie: Manatsu no Yoru no Yume | xxxHOLiC Rou | xxxHOLiC Shunmuki | ēlDLIVE | ◯ | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 |
5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
10 rows × 11409 columns
user_anime.shape
(21804, 11409)
Principal Component Analysis converts our original variables into a new set of variables that are linear combinations of the originals. The main goal is to reduce the dimensionality of the data for clustering and visualization.
pca = PCA(n_components=3)
pca.fit(user_anime)
pca_samples = pca.transform(user_anime)
ps = pd.DataFrame(pca_samples)
ps.head()
0 | 1 | 2 | |
---|---|---|---|
0 | -2.461742 | 0.477328 | -0.081578 |
1 | -1.083951 | -1.401628 | -1.014477 |
2 | -1.693449 | -0.630825 | -0.298041 |
3 | 3.778695 | 0.226784 | 1.138751 |
4 | -1.802398 | 1.464769 | 0.436394 |
ps.shape
(21804, 3)
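Because three components are a drastic reduction from more than 11,000 columns, it is worth checking how much variance they retain; a quick check against the fitted `pca` above (the exact numbers depend on the data):
# Fraction of total variance captured by each principal component
print(pca.explained_variance_ratio_)
# Cumulative share retained by the 3-component projection
print(pca.explained_variance_ratio_.sum())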
In this section, we use a variety of graphs to display the data so that we can understand it better.
to_cluster = pd.DataFrame(ps[[0, 1, 2]])
fig = plt.figure(figsize=(9, 9))
ax = fig.add_subplot(projection="3d")
ax.scatter(to_cluster[0], to_cluster[2], to_cluster[1], s=24)
plt.title("Data points in 3D PCA axis", fontsize=18)
plt.savefig("../charts/Data_points_in_3D_PCA_axis.png")
plt.show()
plt.scatter(to_cluster[1], to_cluster[0], s=24)
plt.xlabel("x_values")
plt.ylabel("y_values")
plt.title("Data points in 2D PCA axis", fontsize=18)
plt.savefig("../charts/Data_points_in_2D_PCA_axis.png")
plt.show()
Chernoff faces display multivariate data in the shape of a human face. The individual parts, such as the eyes, ears, mouth, and nose, represent values of the variables by their shape, size, placement, and orientation (Wikipedia).
# Reduce dimension for face attributes
pca = PCA(n_components=17)
pca.fit(user_anime)
pca_samples = pca.transform(user_anime)
ps = pd.DataFrame(pca_samples)
ps.head()
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | -2.461742 | 0.477327 | -0.081578 | 0.064732 | 0.634798 | 0.142396 | -0.051592 | 0.269147 | -0.324508 | -0.142589 | 0.113462 | -0.245301 | -0.565451 | 0.133959 | -0.538926 | -0.205612 | 0.021235 |
1 | -1.083951 | -1.401628 | -1.014476 | 0.235317 | -0.484953 | 0.098983 | -1.004047 | 0.046567 | 0.459869 | 0.146468 | 0.718675 | 0.356861 | 0.807967 | -0.057999 | -0.275366 | 0.364260 | 0.360791 |
2 | -1.693449 | -0.630825 | -0.298040 | 0.511194 | -0.784856 | 0.568464 | 0.385865 | -0.590062 | 0.612352 | -0.411691 | -0.189001 | 0.241942 | 0.175324 | -0.141344 | 0.211314 | 0.080413 | 0.156759 |
3 | 3.778695 | 0.226784 | 1.138750 | -1.648948 | -0.991642 | -1.601837 | 1.068610 | -0.782450 | -1.796181 | -0.444752 | -0.452496 | -0.774258 | 0.586723 | -0.709391 | 1.154184 | -0.719425 | -0.100087 |
4 | -1.802398 | 1.464768 | 0.436394 | 0.679526 | 0.561298 | -0.313808 | -1.640556 | -0.554039 | -0.152285 | 0.861153 | -0.331548 | -0.356287 | -0.692186 | 0.140269 | 0.834920 | -0.020511 | 0.596688 |
# Change size of figure
fig = plt.figure(figsize=(11, 11))
# Create sixteen faces from randomly selected users
for i in range(16):
    ax = fig.add_subplot(4, 4, i + 1, aspect="equal")
    cface(ax, 0.9, *ps.iloc[randint(0, len(ps) - 1), :])
    ax.axis([-1.2, 1.2, -1.2, 1.2])
    ax.set_xticks([])
    ax.set_yticks([])
fig.subplots_adjust(hspace=0, wspace=0)
plt.savefig("../charts/Chernoff_faces.png", bbox_inches="tight")
plt.show()
fig, axes = plt.subplots(1, 3, figsize=(12, 7))
fig.suptitle("Box plots", fontsize=18)
anime["score"].plot.box(ax=axes[0])
anime["popularity"].plot.box(ax=axes[1])
rating["user_rating"].plot.box(ax=axes[2])
plt.savefig("../charts/Box_plots.png")
plt.show()
data = anime["score"].values
bins = np.arange(1, 11)
plt.hist(data, bins, histtype="bar", rwidth=0.95)
plt.title("Score histogram", fontsize=18)
plt.xticks([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
plt.xlabel("score")
plt.ylabel("value_count")
plt.savefig("../charts/Score_histogram.png")
plt.show()
Pixel plots are a representation of a two-dimensional data set. In these plots, each pixel refers to a different value in the data set.
# Select the first thousand rows
head_anime = anime.head(1000)
columns = ["ranked", "popularity", "episodes", "score", "members", "favorites"]
# Creating a plot
fig, axes = plt.subplots(1, 6, figsize=(20, 8))
index = 0
for ax in axes:
    # Plot one column as a 50x20 pixel grid
    data = head_anime[columns[index]].values.reshape((50, 20))
    ax.grid(False)
    ax.pcolor(data, cmap="Blues")
    # Customize the subplot
    ax.set_xticks([])
    ax.set_yticks([])
    ax.set_xlabel(columns[index], fontsize=18, labelpad=20)
    index += 1
# Save a plot
plt.savefig("../charts/Pixel_oriented.png")
# Show plot
plt.show()
In statistics, a Q–Q (quantile-quantile) plot is a probability plot, which is a graphical method for comparing two probability distributions by plotting their quantiles against each other.
data = anime["score"]
fig, ax = plt.subplots(1, 1)
# Create quantile plot
stats.probplot(data, dist="norm", plot=ax)
# Calculate quantiles
median = data.median()
percent_25 = data.quantile(0.25)
percent_75 = data.quantile(0.75)
# Guide lines
plt.plot([-4, 4], [percent_25, percent_25], linestyle="dashed", label="Quartile1")
plt.plot([-4, 4], [median, median], linestyle="dashed", label="Median")
plt.plot([-4, 4], [percent_75, percent_75], linestyle="dashed", label="Quartile3")
# Customizing plot
plt.title("Score quantile plot", fontsize=18)
plt.legend()
plt.savefig("../charts/Score_quantile_plot.png")
plt.show()
Scatter plots provide a first look at the data, revealing clusters of points, outliers, and so on.
fig, axes = plt.subplots(2, 2, sharex=True, figsize=(12, 8))
fig.suptitle("Scatter plot", fontsize=18)
fig.supxlabel("Ranked")
group_setting = {
# Type marker color alpha
"Movie": ["X", "#E72626", 0.4],
"Music": ["^", "#48DA4B", 0.2],
"ONA": [".", "#E820D7", 0.2],
"OVA": ["*", "#FFD323", 0.2],
"Special": [",", "#E4337A", 0.2],
"TV": ["+", "#3719CC", 0.4],
}
# Plot each anime type with its own color and transparency
for name in group_by_type.groups:
    data = group_by_type.get_group(name)
    color = group_setting[name][1]
    alpha = group_setting[name][2]
    axes[0, 0].scatter(data["ranked"], data["popularity"], marker=".", color=color, alpha=alpha, label=name)
    axes[0, 1].scatter(data["ranked"], data["episodes"], marker=".", color=color, alpha=alpha)
    axes[1, 0].scatter(data["ranked"], data["favorites"], marker=".", color=color, alpha=alpha)
    axes[1, 1].scatter(data["ranked"], data["members"], marker=".", color=color, alpha=alpha)
# Customizing plot
axes[0, 0].set_ylabel("Popularity")
axes[0, 1].set_ylabel("Episodes")
axes[1, 0].set_ylabel("Favorites")
axes[1, 1].set_ylabel("Members")
fig.legend()
plt.savefig("../charts/Scatter_plot_matrices.png")
plt.show()
fig, axes = plt.subplots(1, 2, figsize=(12, 8))
fig.suptitle("Anime types", fontsize=18)
# Bar chart
data = anime["type"].value_counts()
for i in data.index:
    axes[0].bar(i, data[i])
axes[0].set_ylabel("Count")
# Pie chart
axes[1].pie(data, explode=(0, 0.1, 0, 0, 0, 0), autopct="%1.1f%%")
plt.savefig("../charts/Anime_types.png")
plt.show()
fig, axes = plt.subplots(1, 2, figsize=(12, 8))
fig.suptitle("Anime sources", fontsize=18)
# Bar chart
data = anime["source"].value_counts()
for i in data.index:
    axes[0].bar(i, data[i])
axes[0].tick_params(axis="x", rotation=90)
axes[0].set_ylabel("Count")
# Pie chart
axes[1].pie(data, autopct="%1.1f%%")
plt.savefig("../charts/Anime_sources.png")
plt.show()
fig, axes = plt.subplots(1, 2, figsize=(12, 8))
fig.suptitle("Anime ratings", fontsize=18)
# Bar chart
data = anime["rating"].value_counts()
for i in data.index:
    axes[0].bar(i, data[i])
axes[0].tick_params(axis="x", rotation=90)
axes[0].set_ylabel("Count")
# Pie chart
axes[1].pie(data, autopct="%1.1f%%")
plt.savefig("../charts/Anime_ratings.png")
plt.show()
Now, to find users' recurring favorite anime, we can mine frequent patterns and use them to predict which anime each user may watch. Here we use the Spark library and the FPGrowth algorithm, whose main parameters are as follows (a toy sketch follows the list):
- minSupport: the minimum support for an itemset to be identified as frequent. For example, if an item appears in 3 out of 5 transactions, it has a support of 3/5 = 0.6.
- minConfidence: the minimum confidence for generating an association rule. Confidence indicates how often an association rule has been found to be true. For example, if itemset X appears 4 times in the transactions and X and Y co-occur only 2 times, the confidence for the rule X => Y is 2/4 = 0.5. This parameter does not affect the mining of frequent itemsets, but it specifies the minimum confidence for generating association rules from them.
- numPartitions: the number of partitions used to distribute the work. By default the parameter is not set, and the number of partitions of the input dataset is used.
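To see how minSupport and minConfidence behave in practice, here is a minimal, self-contained sketch on a toy dataset (independent of the anime itemsets built below; the app name is illustrative only):
from pyspark.sql import SparkSession
from pyspark.ml.fpm import FPGrowth
spark = SparkSession.builder.master("local").appName("fpgrowth_toy").getOrCreate()
# Three baskets of item IDs
toy = spark.createDataFrame([(0, [1, 2, 5]), (1, [1, 2, 3, 5]), (2, [1, 2])], ["id", "items"])
fp = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6)
model = fp.fit(toy)
model.freqItemsets.show()      # itemsets appearing in at least 50% of the baskets
model.associationRules.show()  # rules with confidence of at least 0.6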
spark = (
SparkSession.builder.master("local")
.appName("anime_recommendation_system")
.getOrCreate()
)
sc = spark.sparkContext
# Create a map file
map_file = anime[["anime_id", "name"]]
map_file.set_index("anime_id", inplace=True)
map_file.to_csv("../data/frequent-pattern/map_file.csv")
We create an itemset for each user: a basket containing that user's favorite anime.
# Group rows with user_id and then convert anime_id to list
df = mergedata.groupby("user_id")["anime_id"].apply(list)
# Write to csv file
df.to_csv("../data/frequent-pattern/itemset.csv")
df
user_id
0 [199, 164, 431, 578, 2236, 121, 2034, 2762, 15...
1 [5114, 9253, 11061, 28851, 32281, 199, 19, 232...
2 [9253, 11061, 2904, 263, 1575, 1535, 30276, 32...
3 [9253, 32281, 2904, 1, 17074, 23273, 1575, 103...
4 [2904, 1575, 1535, 1698, 2685, 1142, 3091, 422...
...
24996 [3470]
24997 [2904, 199, 1, 1575, 164, 431, 1535, 32, 5, 30...
24998 [5114, 38524, 11061, 32935, 37510, 263, 34599,...
24999 [5114, 9253, 11061, 32281, 199, 1, 164, 245, 4...
25000 [32281, 16782, 19111, 1689, 31953, 15051, 3537...
Name: anime_id, Length: 21804, dtype: object
Then read the itemsets back and strip the double quotes and brackets from the list column.
# Read csv file
itemset = (
spark.read.option("header", True)
.option("inferSchema", True)
.csv("../data/frequent-pattern/itemset.csv")
)
# Remove double quotes, brackets and cast to integer array
itemset = itemset.withColumn("anime_id", regexp_replace(col("anime_id"), r'"|\[|\]', ""))
itemset = itemset.withColumn("anime_id", split(col("anime_id"), ",").cast("array<int>"))
itemset.printSchema()
itemset.show(5)
root
|-- user_id: integer (nullable = true)
|-- anime_id: array (nullable = true)
| |-- element: integer (containsNull = true)
+-------+--------------------+
|user_id| anime_id|
+-------+--------------------+
| 0|[199, 164, 431, 5...|
| 1|[5114, 9253, 1106...|
| 2|[9253, 11061, 290...|
| 3|[9253, 32281, 290...|
| 4|[2904, 1575, 1535...|
+-------+--------------------+
only showing top 5 rows
Read map_file from disk
# Read csv file
map_file = (
spark.read.option("header", True)
.option("inferSchema", True)
.csv("../data/frequent-pattern/map_file.csv")
)
map_file.show(5)
+--------+--------------------+
|anime_id| name|
+--------+--------------------+
| 5114|Fullmetal Alchemi...|
| 40028|Shingeki no Kyoji...|
| 9253| Steins;Gate|
| 38524|Shingeki no Kyoji...|
| 28977| Gintama°|
+--------+--------------------+
only showing top 5 rows
Create an FPGrowth model from the itemsets
fpGrowth = FPGrowth(itemsCol="anime_id", minSupport=0.1, minConfidence=0.8)
model = fpGrowth.fit(itemset)
Display frequent itemsets
freqItemsets = model.freqItemsets.withColumn("item_id", monotonically_increasing_id())
freqItemsets.show(5)
+--------------+----+-------+
| items|freq|item_id|
+--------------+----+-------+
| [223]|2679| 0|
| [21881]|2560| 1|
|[21881, 11757]|2188| 2|
| [37510]|3319| 3|
| [37510, 5114]|2350| 4|
+--------------+----+-------+
only showing top 5 rows
print("Number of frequent item set :", freqItemsets.count())
Number of frequent item set : 2477
Convert item IDs to anime names and save them
# Merge freqItemsets and map_file to get the anime names
freqItemsets = (
freqItemsets.select("item_id", explode("items").alias("anime_id"))
.join(map_file, "anime_id")
.groupBy("item_id")
.agg(collect_list(struct("name")).alias("items"))
.join(freqItemsets.drop("items"), "item_id")
.drop("item_id")
)
# Convert Spark DataFrame to Pandas DataFrame for saving in csv file
freqItemsets.toPandas().to_csv("../data/frequent-pattern/freqItemsets.csv")
freqItemsets.show(5)
+--------------------+----+
| items|freq|
+--------------------+----+
| [{Dragon Ball}]|2679|
|[{Sword Art Onlin...|2560|
|[{Sword Art Onlin...|2188|
|[{Mob Psycho 100 ...|3319|
|[{Mob Psycho 100 ...|2350|
+--------------------+----+
only showing top 5 rows
Display association rules
# Display generated association rules.
associationRules = model.associationRules.withColumn("item_id", monotonically_increasing_id())
associationRules.show()
+--------------------+----------+------------------+------------------+-------------------+-------+
| antecedent|consequent| confidence| lift| support|item_id|
+--------------------+----------+------------------+------------------+-------------------+-------+
| [11061, 2904, 1575]| [5114]|0.8343217197924389| 2.042846802734906| 0.1032379379930288| 0|
|[2904, 30276, 511...| [1575]|0.9611158072696534| 2.532467560327193| 0.1042927903137039| 1|
| [28851, 9253]| [32281]|0.8306863301191152|2.3595993671075024|0.13433314988075581| 2|
|[38524, 38000, 16...| [35760]|0.8943056124539124| 4.506457031186758|0.10011924417538066| 3|
|[30276, 1575, 16498]| [1535]|0.8014911463187325| 1.643844695168248|0.11832691249312052| 4|
|[30276, 1575, 16498]| [2904]|0.8993476234855545|2.5873301995618196|0.13277380297193175| 5|
| [31240, 1535]| [16498]|0.8017391304347826|1.8257044386422978|0.12685745734727574| 6|
| [6547, 1575, 16498]| [2904]|0.9199406968124537| 2.646574344016195|0.11383232434415703| 7|
| [2904, 16498]| [1575]|0.9522244137628753|2.5090394099922335|0.19927536231884058| 8|
| [13601, 2904, 9253]| [1575]|0.9561328790459966|2.5193379208119526|0.10296275912676574| 9|
| [523]| [199]|0.8210137275607181| 2.363218919568831|0.14263437901302514| 10|
| [22535, 31240]| [30276]|0.8005865102639296|2.1534651208727755|0.10016510731975785| 11|
| [22535, 31240]| [16498]|0.8203812316715543| 1.868155861657083|0.10264171711612548| 12|
| [10087, 5114]| [11741]|0.8808364365511315| 3.964855008786307| 0.1410291689598239| 13|
| [32937, 1535]| [30831]| 0.887374749498998| 4.030899799599198| 0.1015410016510732| 14|
| [11741, 5114, 1535]| [10087]|0.9165668662674651| 3.914754936747465|0.10530177949000183| 15|
| [32935, 20583]| [28891]| 0.963907284768212| 5.529343445694842|0.13350761328196661| 16|
| [19815, 1575]| [2904]|0.9044759825327511|2.6020839587206894|0.15199046046596953| 17|
| [31964, 2904]| [30276]|0.8052631578947368|2.1660446452919864|0.10525591634562466| 18|
| [31964, 2904]| [1575]|0.9540350877192982|2.5138103991095564|0.12470188956154835| 19|
+--------------------+----------+------------------+------------------+-------------------+-------+
only showing top 20 rows
print("Number of association rules :", associationRules.count())
Number of association rules : 903
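A lift above 1 means the consequent becomes more likely when the antecedent is present, so sorting by lift surfaces the strongest rules; a quick sketch using the Spark DataFrame API (with `col` from the functions imported above):
# Show the five rules whose antecedent raises the consequent's likelihood the most
associationRules.orderBy(col("lift").desc()).show(5)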
Convert item IDs to anime names and save them
# Merge associationRules and map_file on the antecedent column to get the anime names
associationRules = (
associationRules.select("item_id", explode("antecedent").alias("anime_id"))
.join(map_file, "anime_id")
.groupBy("item_id")
.agg(collect_list(struct("name")).alias("antecedent"))
.join(associationRules.drop("antecedent"), "item_id")
)
# Merge associationRules and map_file on the consequent column to get the anime names
associationRules = (
associationRules.select("item_id", explode("consequent").alias("anime_id"))
.join(map_file, "anime_id")
.groupBy("item_id")
.agg(collect_list(struct("name")).alias("consequent"))
.join(associationRules.drop("consequent"), "item_id")
.drop("item_id")
)
associationRules.toPandas().to_csv("../data/frequent-pattern/associationRules.csv")
associationRules.show()
+--------------------+--------------------+------------------+------------------+-------------------+
| consequent| antecedent| confidence| lift| support|
+--------------------+--------------------+------------------+------------------+-------------------+
|[{Fullmetal Alche...|[{Hunter x Hunter...|0.8343217197924389| 2.042846802734906| 0.1032379379930288|
|[{Code Geass: Han...|[{Code Geass: Han...|0.9611158072696534| 2.532467560327193| 0.1042927903137039|
| [{Kimi no Na wa.}]|[{Koe no Katachi}...|0.8306863301191152|2.3595993671075024|0.13433314988075581|
|[{Shingeki no Kyo...|[{Shingeki no Kyo...|0.8943056124539124| 4.506457031186758|0.10011924417538066|
| [{Death Note}]|[{One Punch Man},...|0.8014911463187325| 1.643844695168248|0.11832691249312052|
|[{Code Geass: Han...|[{One Punch Man},...|0.8993476234855545|2.5873301995618196|0.13277380297193175|
|[{Shingeki no Kyo...|[{Re:Zero kara Ha...|0.8017391304347826|1.8257044386422978|0.12685745734727574|
|[{Code Geass: Han...|[{Angel Beats!}, ...|0.9199406968124537| 2.646574344016195|0.11383232434415703|
|[{Code Geass: Han...|[{Code Geass: Han...|0.9522244137628753|2.5090394099922335|0.19927536231884058|
|[{Code Geass: Han...|[{Psycho-Pass}, {...|0.9561328790459966|2.5193379208119526|0.10296275912676574|
|[{Sen to Chihiro ...|[{Tonari no Totoro}]|0.8210137275607181| 2.363218919568831|0.14263437901302514|
| [{One Punch Man}]|[{Kiseijuu: Sei n...|0.8005865102639296|2.1534651208727755|0.10016510731975785|
|[{Shingeki no Kyo...|[{Kiseijuu: Sei n...|0.8203812316715543| 1.868155861657083|0.10264171711612548|
|[{Fate/Zero 2nd S...|[{Fate/Zero}, {Fu...|0.8808364365511315| 3.964855008786307| 0.1410291689598239|
|[{Kono Subarashii...|[{Kono Subarashii...| 0.887374749498998| 4.030899799599198| 0.1015410016510732|
| [{Fate/Zero}]|[{Fate/Zero 2nd S...|0.9165668662674651| 3.914754936747465|0.10530177949000183|
|[{Haikyuu!! Secon...|[{Haikyuu!!: Kara...| 0.963907284768212| 5.529343445694842|0.13350761328196661|
|[{Code Geass: Han...|[{No Game No Life...|0.9044759825327511|2.6020839587206894|0.15199046046596953|
| [{One Punch Man}]|[{Boku no Hero Ac...|0.8052631578947368|2.1660446452919864|0.10525591634562466|
|[{Code Geass: Han...|[{Boku no Hero Ac...|0.9540350877192982|2.5138103991095564|0.12470188956154835|
+--------------------+--------------------+------------------+------------------+-------------------+
only showing top 20 rows
Display the transform predictions
# transform examines the input items against all the association rules and summarizes
# the consequents as predictions
transform = model.transform(itemset)
transform.show(10)
+-------+--------------------+--------------------+
|user_id| anime_id| prediction|
+-------+--------------------+--------------------+
| 0|[199, 164, 431, 5...| []|
| 1|[5114, 9253, 1106...|[38524, 2904, 302...|
| 2|[9253, 11061, 290...| [5114, 31964]|
| 3|[9253, 32281, 290...|[16498, 4181, 153...|
| 4|[2904, 1575, 1535...| []|
| 5|[199, 877, 4224, ...| []|
| 6|[5114, 4181, 2904...| []|
| 7|[5114, 4181, 199,...| []|
| 8|[4181, 578, 10408...| []|
| 10| [1889, 934, 3652]| []|
+-------+--------------------+--------------------+
only showing top 10 rows
print("Number of transform :", transform.count())
Number of transform : 21804
Convert item IDs to anime names and save them
# Merge transform and map_file on the prediction column to get the anime names
transform = (
transform.select("user_id", explode("prediction").alias("anime_id"))
.join(map_file, "anime_id")
.groupBy("user_id")
.agg(collect_list(struct("name")).alias("prediction"))
.join(transform.drop("prediction"), "user_id")
)
transform.toPandas().to_csv("../data/frequent-pattern/transform.csv")
transform.show(10)
+-------+--------------------+--------------------+
|user_id| prediction| anime_id|
+-------+--------------------+--------------------+
| 1|[{Shingeki no Kyo...|[5114, 9253, 1106...|
| 2|[{Fullmetal Alche...|[9253, 11061, 290...|
| 3|[{Shingeki no Kyo...|[9253, 32281, 290...|
| 12|[{Code Geass: Han...|[5114, 199, 1575,...|
| 13|[{Code Geass: Han...|[1575, 486, 30, 2...|
| 14|[{Fullmetal Alche...|[9253, 38524, 110...|
| 16|[{JoJo no Kimyou ...|[5114, 9253, 3228...|
| 17|[{One Punch Man},...|[5114, 9253, 2897...|
| 19|[{Death Note}, {B...|[5114, 9253, 3852...|
| 21|[{Clannad: After ...|[9253, 28851, 322...|
+-------+--------------------+--------------------+
only showing top 10 rows
Now we want to group users with similar anime interests into the same clusters.
We estimate the number of clusters with the silhouette score. Silhouette refers to a method of interpretation and validation of consistency within clusters of data. The technique provides a succinct graphical representation of how well each object has been classified. The silhouette value is a measure of how similar an object is to its own cluster compared to other clusters (Wikipedia).
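As a toy illustration of how the score reads (values near 1 indicate tight, well-separated clusters; values near 0 indicate overlapping ones), assuming the scikit-learn imports above:
# Two well-separated blobs: the silhouette score should be close to 1
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 10])
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)
print(silhouette_score(X, labels))  # typically > 0.9 for this layout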
scores = []
inertia_list = np.zeros(10)  # indices 0-1 stay zero; only k = 2..9 are fitted below
K = range(2, 10)
for k in K:
    k_means = KMeans(n_clusters=k)
    k_means.fit(to_cluster)
    inertia_list[k] = k_means.inertia_
    scores.append(silhouette_score(to_cluster, k_means.labels_))
Elbow Method
plt.plot(K, inertia_list[2:], "-X")
plt.title("Elbow Method")
plt.xticks(np.arange(10))
plt.xlabel("Number of cluster")
plt.ylabel("Inertia")
# Draw a vertical line at x = 4
plt.axvline(x=4, color="blue", linestyle="--")
plt.savefig("../charts/Elbow_Method.png")
plt.show()
Results KMeans
plt.plot(K, scores)
plt.title("Results KMeans")
plt.xticks(np.arange(10))
plt.xlabel("Number of cluster")
plt.ylabel("Silhouette Score")
plt.axvline(x=4, color="blue", linestyle="--")
plt.savefig("../charts/Results_KMeans.png")
plt.show()
Now that we have the number of clusters, we can use the KMeans algorithm.
clusterer = KMeans(n_clusters=4, random_state=30).fit(to_cluster)
centers = clusterer.cluster_centers_
c_preds = clusterer.predict(to_cluster)
centers
array([[ 1.6859081 , -1.67711621, -0.39729844],
[-1.72324023, 0.12476707, 0.15343777],
[ 8.06274357, 0.13311179, 1.11656004],
[ 1.87270589, 2.52512675, -0.56430143]])
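A quick sanity check on the assignment before profiling the clusters (a sketch using the `c_preds` array above):
# How many users fell into each of the four clusters
unique, counts = np.unique(c_preds, return_counts=True)
print(dict(zip(unique, counts)))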
Data points in 3D PCA axis - clustered
fig = plt.figure(figsize=(9, 9))
ax = fig.add_subplot(projection="3d")
ax.scatter(
to_cluster[0],
to_cluster[2],
to_cluster[1],
c=c_preds, # clusters
cmap="viridis", # change color
alpha=0.7, # Opacity
s=24,
)
plt.title("Data points in 3D PCA axis - clustered", fontsize=18)
plt.savefig("../charts/Data_points_in_3D_PCA_axis_clustered.png")
plt.show()
Data points in 2D PCA axis - clustered
plt.scatter(to_cluster[1], to_cluster[0], c=c_preds, cmap="viridis", alpha=0.7, s=24)
for c in centers:
    plt.plot(c[1], c[0], "X", markersize=8, color="red", alpha=1)
plt.title("Data points in 2D PCA axis - clustered", fontsize=18)
plt.xlabel("x_values")
plt.ylabel("y_values")
plt.savefig("../charts/Data_points_in_2D_PCA_axis_clustered.png")
plt.show()
Add cluster to user_anime DataFrame
user_anime['cluster'] = c_preds
user_anime.head(10)
"0" | "Bungaku Shoujo" Kyou no Oyatsu: Hatsukoi | "Bungaku Shoujo" Memoire | "Bungaku Shoujo" Movie | "Calpis" Hakkou Monogatari | "Eiji" | "Eiyuu" Kaitai | "Kiss Dekiru Gyoza" x Mameshiba Movie | "Parade" de Satie | "R100" x Mameshiba Original Manners | ... | s.CRY.ed Alteration II: Quan | the FLY BanD! | xxxHOLiC | xxxHOLiC Kei | xxxHOLiC Movie: Manatsu no Yoru no Yume | xxxHOLiC Rou | xxxHOLiC Shunmuki | ēlDLIVE | ◯ | cluster | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 |
7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
10 rows × 11410 columns
user_anime.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 21804 entries, 0 to 25000
Columns: 11410 entries, "0" to cluster
dtypes: int32(1), int64(11409)
memory usage: 1.9 GB
In this section, we calculate information for each cluster and print statistical data. Because `user_anime` is a 0/1 like matrix, the per-cluster mean of each column is the share of users in that cluster who like that anime.
c0 = user_anime[user_anime["cluster"] == 0].drop("cluster", axis=1).mean()
c1 = user_anime[user_anime["cluster"] == 1].drop("cluster", axis=1).mean()
c2 = user_anime[user_anime["cluster"] == 2].drop("cluster", axis=1).mean()
c3 = user_anime[user_anime["cluster"] == 3].drop("cluster", axis=1).mean()
Create anime information list
def createAnimeInfoList(animelist):
    genre_list = list()
    episode_list = list()
    score_list = list()
    member_list = list()
    popularity_list = list()
    favorites_list = list()
    for x in anime["name"]:
        if x in animelist:
            for y in anime[anime["name"] == x].genres.values:
                genre_list.append(y)
            episode_list.append(anime[anime["name"] == x].episodes.values.astype(int))
            score_list.append(anime[anime["name"] == x].score.values.astype(float))
            member_list.append(anime[anime["name"] == x].members.values.astype(int))
            popularity_list.append(anime[anime["name"] == x].popularity.values.astype(int))
            favorites_list.append(anime[anime["name"] == x].favorites.values.astype(int))
    # Return the data as pandas Series to prevent the
    # "Length of values does not match length of index" error
    return (
        pd.Series(genre_list),
        pd.Series(episode_list),
        pd.Series(score_list),
        pd.Series(member_list),
        pd.Series(popularity_list),
        pd.Series(favorites_list),
    )
Count word
def count_word(df, ref_col, liste):
    keyword_count = dict()
    for s in liste:
        keyword_count[s] = 0
    for liste_keywords in df[ref_col].str.split(","):
        if type(liste_keywords) == float and pd.isnull(liste_keywords):
            continue
        for s in [s for s in liste_keywords if s in liste]:
            if pd.notnull(s):
                keyword_count[s] += 1
    # Convert the dictionary to a list to sort the keywords by frequency
    keyword_occurences = []
    for k, v in keyword_count.items():
        keyword_occurences.append([k, v])
    keyword_occurences.sort(key=lambda x: x[1], reverse=True)
    return keyword_occurences, keyword_count
Make cloud graph
def makeCloud(Dict, name, color, isSave=True):
    words = dict()
    for s in Dict:
        words[s[0]] = s[1]
    wordcloud = WordCloud(
        width=1500,
        height=500,
        background_color=color,
        max_words=20,
        max_font_size=500,
        normalize_plurals=False,
    )
    wordcloud.generate_from_frequencies(words)
    fig = plt.figure(figsize=(20, 8))
    plt.title(name, fontsize=18)
    plt.imshow(wordcloud)
    plt.axis("off")
    if isSave:
        plt.savefig(f"../charts/{name}.png")
    plt.show()
animelist = list(c0.index)
data = pd.DataFrame()
data["genre"] = createAnimeInfoList(animelist)[0]
set_keywords = set()
for liste_keywords in data["genre"].str.split(",").values:
    if isinstance(liste_keywords, float):
        continue  # only happens if liste_keywords is NaN
    set_keywords = set_keywords.union(liste_keywords)
Top 15 anime that explain the characteristics of this cluster
c0.sort_values(ascending=False)[0:15]
Shingeki no Kyojin 0.751535
One Punch Man 0.732706
Kimi no Na wa. 0.707532
Death Note 0.640196
Koe no Katachi 0.634056
Boku no Hero Academia 2nd Season 0.623823
No Game No Life 0.622391
Re:Zero kara Hajimeru Isekai Seikatsu 0.615227
Fullmetal Alchemist: Brotherhood 0.614409
Boku no Hero Academia 0.609906
Shingeki no Kyojin Season 2 0.609087
Steins;Gate 0.591281
Shigatsu wa Kimi no Uso 0.544003
Kimetsu no Yaiba 0.543185
Boku dake ga Inai Machi 0.528244
dtype: float64
Favorite genres for this cluster
c0_animelist = list(c0.sort_values(ascending=False)[0:30].index)
c0_data = pd.DataFrame()
(
c0_data["genre"],
c0_data["episode"],
c0_data["score"],
c0_data["member"],
c0_data["popularity"],
c0_data["favorites"],
) = createAnimeInfoList(c0_animelist)
keyword_occurences, dum = count_word(c0_data, "genre", set_keywords)
makeCloud(keyword_occurences[0:10], "Cluster_0", "lemonchiffon")
keyword_occurences[0:5]
[['Action', 17],
[' Shounen', 16],
[' Comedy', 12],
[' Drama', 11],
[' Super Power', 11]]
Average of each attribute for the anime that users in this cluster like
avg_episodes = int(c0_data["episode"].mean()[0].round())
avg_score = c0_data["score"].mean()[0].round(2)
avg_popularity = int(c0_data["popularity"].mean()[0].round())
avg_member = int(c0_data["member"].mean()[0].round())
avg_favorites = int(c0_data["favorites"].mean()[0].round())
print(f"Cluster 0\nAVG episode : {avg_episodes}\nAVG score : {avg_score}\nAVG popularity : {avg_popularity}\nAVG member : {avg_member}\nAVG favorites : {avg_favorites}\n")
Cluster 0
AVG episode : 23
AVG score : 8.55
AVG popularity : 25
AVG member : 1562339
AVG favorites : 59740
Top 15 anime that explain the characteristics of this cluster
c1.sort_values(ascending=False)[0:15]
Death Note 0.371123
Shingeki no Kyojin 0.265868
Fullmetal Alchemist: Brotherhood 0.251081
Sen to Chihiro no Kamikakushi 0.243270
Code Geass: Hangyaku no Lelouch 0.229013
Code Geass: Hangyaku no Lelouch R2 0.202017
Steins;Gate 0.196557
Toradora! 0.192386
One Punch Man 0.187912
Kimi no Na wa. 0.184652
Angel Beats! 0.184197
Howl no Ugoku Shiro 0.165314
Fullmetal Alchemist 0.160158
Sword Art Online 0.158641
Elfen Lied 0.151513
dtype: float64
Favorite genres for this cluster
c1_animelist = list(c1.sort_values(ascending=False)[0:30].index)
c1_data = pd.DataFrame()
(
c1_data["genre"],
c1_data["episode"],
c1_data["score"],
c1_data["member"],
c1_data["popularity"],
c1_data["favorites"],
) = createAnimeInfoList(c1_animelist)
keyword_occurences, dum = count_word(c1_data, "genre", set_keywords)
makeCloud(keyword_occurences[0:10], "Cluster_1", "lemonchiffon")
keyword_occurences[0:5]
[[' Drama', 17],
['Action', 16],
[' Supernatural', 12],
[' Comedy', 10],
[' Adventure', 8]]
Average of each attribute for the anime that users in this cluster like
avg_episodes = int(c1_data["episode"].mean()[0].round())
avg_score = c1_data["score"].mean()[0].round(2)
avg_popularity = int(c1_data["popularity"].mean()[0].round())
avg_member = int(c1_data["member"].mean()[0].round())
avg_favorites = int(c1_data["favorites"].mean()[0].round())
print(f"Cluster 1\nAVG episode : {avg_episodes}\nAVG score : {avg_score}\nAVG popularity : {avg_popularity}\nAVG member : {avg_member}\nAVG favorites : {avg_favorites}\n")
Cluster 1
AVG episode : 27
AVG score : 8.44
AVG popularity : 35
AVG member : 1498743
AVG favorites : 62681
Top 15 anime that explain the characteristics of this cluster
c2.sort_values(ascending=False)[0:15]
No Game No Life 0.863184
Shingeki no Kyojin 0.846600
One Punch Man 0.844942
Steins;Gate 0.843284
Angel Beats! 0.825871
Toradora! 0.812604
Re:Zero kara Hajimeru Isekai Seikatsu 0.796849
Code Geass: Hangyaku no Lelouch 0.792703
Fullmetal Alchemist: Brotherhood 0.786070
Code Geass: Hangyaku no Lelouch R2 0.770315
Kimi no Na wa. 0.764511
Death Note 0.762852
Hataraku Maou-sama! 0.762023
Boku dake ga Inai Machi 0.758706
Shokugeki no Souma 0.753731
dtype: float64
Favorite genres for this cluster
c2_animelist = list(c2.sort_values(ascending=False)[0:30].index)
c2_data = pd.DataFrame()
(
c2_data["genre"],
c2_data["episode"],
c2_data["score"],
c2_data["member"],
c2_data["popularity"],
c2_data["favorites"],
) = createAnimeInfoList(c2_animelist)
keyword_occurences, dum = count_word(c2_data, "genre", set_keywords)
makeCloud(keyword_occurences[0:10], "Cluster_2", "lemonchiffon")
keyword_occurences[0:5]
[[' Supernatural', 14],
['Action', 13],
[' Comedy', 12],
[' Drama', 11],
[' School', 11]]
Average of each attribute for the anime that users in this cluster like
avg_episodes = int(c2_data["episode"].mean()[0].round())
avg_score = c2_data["score"].mean()[0].round(2)
avg_popularity = int(c2_data["popularity"].mean()[0].round())
avg_member = int(c2_data["member"].mean()[0].round())
avg_favorites = int(c2_data["favorites"].mean()[0].round())
print(f"Cluster 2\nAVG episode : {avg_episodes}\nAVG score : {avg_score}\nAVG popularity : {avg_popularity}\nAVG member : {avg_member}\nAVG favorites : {avg_favorites}\n")
Cluster 2
AVG episode : 19
AVG score : 8.44
AVG popularity : 33
AVG member : 1481909
AVG favorites : 55440
Top 15 anime that explain the characteristics of this cluster
c3.sort_values(ascending=False)[0:15]
Code Geass: Hangyaku no Lelouch 0.692277
Death Note 0.668911
Sen to Chihiro no Kamikakushi 0.659406
Fullmetal Alchemist: Brotherhood 0.651089
Code Geass: Hangyaku no Lelouch R2 0.629703
Steins;Gate 0.627723
Tengen Toppa Gurren Lagann 0.584158
Toradora! 0.580198
Bakemonogatari 0.567129
Mononoke Hime 0.560396
Shingeki no Kyojin 0.544950
Suzumiya Haruhi no Yuuutsu 0.539802
Cowboy Bebop 0.539406
Mahou Shoujo Madoka★Magica 0.538218
Toki wo Kakeru Shoujo 0.535842
dtype: float64
Favorite genres for this cluster
c3_animelist = list(c3.sort_values(ascending=False)[0:30].index)
c3_data = pd.DataFrame()
(
c3_data["genre"],
c3_data["episode"],
c3_data["score"],
c3_data["member"],
c3_data["popularity"],
c3_data["favorites"],
) = createAnimeInfoList(c3_animelist)
keyword_occurences, dum = count_word(c3_data, "genre", set_keywords)
makeCloud(keyword_occurences[0:10], "Cluster_3", "lemonchiffon")
keyword_occurences[0:5]
[[' Drama', 16],
['Action', 16],
[' Sci-Fi', 10],
[' Supernatural', 10],
[' Comedy', 9]]
Average of each attribute for the anime that users in this cluster like
avg_episodes = int(c3_data["episode"].mean()[0].round())
avg_score = c3_data["score"].mean()[0].round(2)
avg_popularity = int(c3_data["popularity"].mean()[0].round())
avg_member = int(c3_data["member"].mean()[0].round())
avg_favorites = int(c3_data["favorites"].mean()[0].round())
print(f"Cluster 3\nAVG episode : {avg_episodes}\nAVG score : {avg_score}\nAVG popularity : {avg_popularity}\nAVG member : {avg_member}\nAVG favorites : {avg_favorites}\n")
Cluster 3
AVG episode : 21
AVG score : 8.48
AVG popularity : 66
AVG member : 1214717
AVG favorites : 52597