From an anime dataset, we build a recommendation system that uses clustering techniques.
Recommended anime are extracted from the characteristics of each cluster.
Users are segmented by their anime rating history.
- Preprocessing
- Visualization
- K-Mean clustering
- Characteristic of each cluster
Import all libraries we need for data mining.
# Basic libraries
from random import randint
from chernoff_faces import cface
# Import numpy, pandas and matplot libraries
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
from wordcloud import WordCloud
%matplotlib inline
# Machine learning libraries
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# Finding spark in jupyter notebook
import findspark
findspark.init()
# Frequent pattern libraries
from pyspark.sql import SparkSession
from pyspark.ml.fpm import FPGrowth
from pyspark.sql.functions import *
# Import HTML helpers to change the output area's flex direction
from IPython.display import HTML
from IPython.display import display
HTML('<style>.output{flex-direction:row;flex-wrap:wrap}</style>')
# Setup style of charts
plt.style.use('seaborn')
%config InlineBackend.figure_formats = {'png', 'retina'}
In this section, we pre-process the data. We first clean it and replace missing values with valid ones (the attribute mean over all samples belonging to the same class), then remove extra columns, combine the datasets, and finally sample the data to reduce its volume.
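As a minimal sketch of the class-mean imputation idea on a hypothetical toy frame (the notebook's own implementation appears further below):
import pandas as pd
# Toy frame: missing `score` values are filled with the mean of rows sharing the same `type`
toy = pd.DataFrame({"type": ["TV", "TV", "Movie", "Movie"], "score": [8.0, None, 6.0, None]})
toy["score"] = toy["score"].fillna(toy.groupby("type")["score"].transform("mean"))
print(toy)  # the TV NaN becomes 8.0, the Movie NaN becomes 6.0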
# Read anime dataset from csv file
anime = pd.read_csv("../data/anime.csv")
# Show the first 3 records
anime.head(3)
MAL_ID | Name | Score | Genres | English name | Japanese name | Type | Episodes | Aired | Premiered | ... | Score-10 | Score-9 | Score-8 | Score-7 | Score-6 | Score-5 | Score-4 | Score-3 | Score-2 | Score-1 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Cowboy Bebop | 8.78 | Action, Adventure, Comedy, Drama, Sci-Fi, Space | Cowboy Bebop | カウボーイビバップ | TV | 26 | Apr 3, 1998 to Apr 24, 1999 | Spring 1998 | ... | 229170.0 | 182126.0 | 131625.0 | 62330.0 | 20688.0 | 8904.0 | 3184.0 | 1357.0 | 741.0 | 1580.0 |
1 | 5 | Cowboy Bebop: Tengoku no Tobira | 8.39 | Action, Drama, Mystery, Sci-Fi, Space | Cowboy Bebop:The Movie | カウボーイビバップ 天国の扉 | Movie | 1 | Sep 1, 2001 | Unknown | ... | 30043.0 | 49201.0 | 49505.0 | 22632.0 | 5805.0 | 1877.0 | 577.0 | 221.0 | 109.0 | 379.0 |
2 | 6 | Trigun | 8.24 | Action, Sci-Fi, Adventure, Comedy, Drama, Shounen | Trigun | トライガン | TV | 26 | Apr 1, 1998 to Sep 30, 1998 | Spring 1998 | ... | 50229.0 | 75651.0 | 86142.0 | 49432.0 | 15376.0 | 5838.0 | 1965.0 | 664.0 | 316.0 | 533.0 |
3 rows × 35 columns
# Count of row and column
anime.shape
(17562, 35)
# Remove extra columns from the anime dataframe, then rename `MAL_ID` to `anime_id`
# and set `anime_id` as the index of the dataframe
anime = (
anime[
[
"MAL_ID",
"Ranked",
"Popularity",
"Name",
"Genres",
"Type",
"Source",
"Rating",
"Episodes",
"Score",
"Members",
"Favorites",
]
]
.rename(
{
"MAL_ID": "anime_id",
"Ranked": "ranked",
"Popularity": "popularity",
"Name": "name",
"Genres": "genres",
"Type": "type",
"Source": "source",
"Rating": "rating",
"Episodes": "episodes",
"Score": "score",
"Members": "members",
"Favorites": "favorites",
},
axis=1,
)
.set_index("anime_id")
)
# Remove invalid rows with `Unknown` ranked
anime = anime[(anime["ranked"] != "Unknown") & (anime["ranked"] != "0.0")]
# Replace missing `Score` and `Episodes` with zero
anime["score"].replace("Unknown", 0.0, inplace=True)
anime["episodes"].replace("Unknown", 0, inplace=True)
# Change the `Ranked`, `Episodes` and `Score` columns to numeric for math operations,
# as well as sort the table by `Ranked`
anime = (
anime.astype({"ranked": "float"})
.astype({"ranked": "int", "episodes": "int", "score": "float"})
.sort_values("ranked")
)
# Calculate mean of `Score` and `Episodes` for each `Type`
group_by_type = anime.groupby("type")
print("✓ Mean of score for each type")
display(mean_scores := group_by_type["score"].mean().round(2))
print("\n✓ Mean of episodes for each type")
display(mean_episodes := group_by_type["episodes"].mean().round().astype(int))
✓ Mean of score for each type
type
Movie 4.42
Music 2.93
ONA 3.58
OVA 4.27
Special 5.14
TV 5.50
Name: score, dtype: float64
✓ Mean of episodes for each type
type
Movie 1
Music 1
ONA 9
OVA 2
Special 2
TV 33
Name: episodes, dtype: int32
# Replace zero `Score` values with the mean of their own type
for index in mean_scores.index:
    anime["score"].mask(
        (anime["type"] == index) & (anime["score"] == 0.0),
        mean_scores[index],
        inplace=True,
    )
# Replace zero `Episodes` values with the mean of their own type
for index in mean_episodes.index:
    anime["episodes"].mask(
        (anime["type"] == index) & (anime["episodes"] == 0),
        mean_episodes[index],
        inplace=True,
    )
# Write anime dataset to csv file
anime.to_csv("../data/anime_reduce.csv")
anime = pd.read_csv("../data/anime_reduce.csv")
anime.head()
anime_id | ranked | popularity | name | genres | type | source | rating | episodes | score | members | favorites | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 5114 | 1 | 3 | Fullmetal Alchemist: Brotherhood | Action, Military, Adventure, Comedy, Drama, Ma... | TV | Manga | R - 17+ (violence & profanity) | 64 | 9.19 | 2248456 | 183914 |
1 | 40028 | 2 | 119 | Shingeki no Kyojin: The Final Season | Action, Military, Mystery, Super Power, Drama,... | TV | Manga | R - 17+ (violence & profanity) | 16 | 9.17 | 733260 | 44862 |
2 | 9253 | 3 | 9 | Steins;Gate | Thriller, Sci-Fi | TV | Visual novel | PG-13 - Teens 13 or older | 24 | 9.11 | 1771162 | 148452 |
3 | 38524 | 4 | 63 | Shingeki no Kyojin Season 3 Part 2 | Action, Drama, Fantasy, Military, Mystery, Sho... | TV | Manga | R - 17+ (violence & profanity) | 10 | 9.10 | 1073626 | 40985 |
4 | 28977 | 5 | 329 | Gintama° | Action, Comedy, Historical, Parody, Samurai, S... | TV | Manga | PG-13 - Teens 13 or older | 51 | 9.10 | 404121 | 11868 |
anime.shape
(15798, 12)
anime.describe().round(2)
anime_id | ranked | popularity | episodes | score | members | favorites | |
---|---|---|---|---|---|---|---|
count | 15798.00 | 15798.00 | 15798.00 | 15798.00 | 15798.00 | 15798.00 | 15798.00 |
mean | 21601.96 | 7896.21 | 8884.46 | 12.46 | 5.88 | 37752.95 | 503.86 |
std | 14671.07 | 4556.74 | 5229.80 | 49.12 | 1.36 | 131649.10 | 4281.61 |
min | 1.00 | 1.00 | 1.00 | 1.00 | 1.85 | 25.00 | 0.00 |
25% | 6514.00 | 3946.50 | 4084.00 | 1.00 | 5.14 | 298.00 | 0.00 |
50% | 23350.00 | 7896.00 | 9195.50 | 2.00 | 6.08 | 1737.00 | 2.00 |
75% | 35454.75 | 11845.75 | 13518.75 | 12.00 | 6.92 | 15706.00 | 31.00 |
max | 48480.00 | 15780.00 | 17560.00 | 3057.00 | 9.19 | 2589552.00 | 183914.00 |
rating = pd.read_csv("../data/rating.csv").rename({"rating": "user_rating"}, axis=1)
rating.head(10)
user_id | anime_id | user_rating | |
---|---|---|---|
0 | 0 | 430 | 9 |
1 | 0 | 1004 | 5 |
2 | 0 | 3010 | 7 |
3 | 0 | 570 | 7 |
4 | 0 | 2762 | 9 |
5 | 0 | 431 | 8 |
6 | 0 | 578 | 10 |
7 | 0 | 433 | 6 |
8 | 0 | 1571 | 10 |
9 | 0 | 121 | 9 |
rating.shape
(57633278, 3)
rating["user_rating"].describe().round(2)
count 57633278.00
mean 7.51
std 1.70
min 1.00
25% 7.00
50% 8.00
75% 9.00
max 10.00
Name: user_rating, dtype: float64
Because there are many users, there are many different criteria for rating anime. We therefore compute the mean rating of each user; an anime rated above the user's own mean is marked as liked:
rating['user_rating'] > rating['mean_rating'] => the user likes this anime
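A vectorized sketch of this rule (equivalent in spirit to the merge performed below; not part of the notebook's actual flow):
# Broadcast each user's mean rating back onto every one of their rows
rating["mean_rating"] = rating.groupby("user_id")["user_rating"].transform("mean")
# A user likes an anime when they rated it at or above their own mean
liked = rating[rating["user_rating"] >= rating["mean_rating"]]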
# User 922 has a low rating mean
rating[rating["user_id"] == 922]["user_rating"].mean().round(3)
1.0
# User 105 has a middle rating mean
rating[rating["user_id"] == 105]["user_rating"].mean().round(3)
5.0
# User 12 has a high rating mean
rating[rating["user_id"] == 12]["user_rating"].mean().round(3)
10.0
mean_rating_per_user = (
    rating.groupby("user_id")["user_rating"].mean().rename("mean_rating").reset_index()
)
mean_rating_per_user.head(10)
user_id | mean_rating | |
---|---|---|
0 | 0 | 7.400000 |
1 | 1 | 8.058252 |
2 | 2 | 8.333333 |
3 | 3 | 7.603175 |
4 | 4 | 7.652542 |
5 | 5 | 8.162791 |
6 | 6 | 7.073955 |
7 | 7 | 7.908046 |
8 | 8 | 7.611111 |
9 | 10 | 7.750000 |
rating = pd.merge(rating, mean_rating_per_user, on="user_id")
rating.head()
user_id | anime_id | user_rating | mean_rating | |
---|---|---|---|---|
0 | 0 | 430 | 9 | 7.4 |
1 | 0 | 1004 | 5 | 7.4 |
2 | 0 | 3010 | 7 | 7.4 |
3 | 0 | 570 | 7 | 7.4 |
4 | 0 | 2762 | 9 | 7.4 |
# Drop anime the user didn't like (rated below their own mean)
rating = rating.drop(rating[rating["user_rating"] < rating["mean_rating"]].index)
# User 922 likes only one anime
rating[rating["user_id"] == 922].head()
user_id | anime_id | user_rating | mean_rating | |
---|---|---|---|---|
149818 | 922 | 16870 | 1 | 1.0 |
# User 105 likes only one anime
rating[rating["user_id"] == 105].head()
user_id | anime_id | user_rating | mean_rating | |
---|---|---|---|---|
13008 | 105 | 249 | 5 | 5.0 |
rating[rating["user_id"] == 12].head()
user_id | anime_id | user_rating | mean_rating | |
---|---|---|---|---|
1246 | 12 | 31964 | 10 | 10.0 |
1247 | 12 | 16335 | 10 | 10.0 |
1248 | 12 | 11021 | 10 | 10.0 |
1249 | 12 | 35062 | 10 | 10.0 |
1250 | 12 | 20785 | 10 | 10.0 |
rating.shape
(30706635, 4)
# Number of users
len(rating["user_id"].unique())
310059
In this section, we reduce the size of the dataset because of running time and memory constraints.
# Merge anime and rating data frame
mergedata = pd.merge(anime, rating, on="anime_id")
# Keep records with user_id <= 25000; this cutoff is an arbitrary choice
mergedata = mergedata[mergedata["user_id"] <= 25000]
mergedata.head()
anime_id | ranked | popularity | name | genres | type | source | rating | episodes | score | members | favorites | user_id | user_rating | mean_rating | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 5114 | 1 | 3 | Fullmetal Alchemist: Brotherhood | Action, Military, Adventure, Comedy, Drama, Ma... | TV | Manga | R - 17+ (violence & profanity) | 64 | 9.19 | 2248456 | 183914 | 1 | 10 | 8.058252 |
1 | 5114 | 1 | 3 | Fullmetal Alchemist: Brotherhood | Action, Military, Adventure, Comedy, Drama, Ma... | TV | Manga | R - 17+ (violence & profanity) | 64 | 9.19 | 2248456 | 183914 | 6 | 10 | 7.073955 |
2 | 5114 | 1 | 3 | Fullmetal Alchemist: Brotherhood | Action, Military, Adventure, Comedy, Drama, Ma... | TV | Manga | R - 17+ (violence & profanity) | 64 | 9.19 | 2248456 | 183914 | 7 | 10 | 7.908046 |
3 | 5114 | 1 | 3 | Fullmetal Alchemist: Brotherhood | Action, Military, Adventure, Comedy, Drama, Ma... | TV | Manga | R - 17+ (violence & profanity) | 64 | 9.19 | 2248456 | 183914 | 11 | 10 | 8.503106 |
4 | 5114 | 1 | 3 | Fullmetal Alchemist: Brotherhood | Action, Military, Adventure, Comedy, Drama, Ma... | TV | Manga | R - 17+ (violence & profanity) | 64 | 9.19 | 2248456 | 183914 | 12 | 10 | 10.000000 |
# Count of anime in mergedata
len(mergedata["anime_id"].unique())
11411
# Count of anime in actual dataset
len(anime["anime_id"].unique())
15798
mergedata.to_csv('../data/mergedata.csv')
Show the details of the anime each user likes as a user × anime matrix (1 = liked).
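A toy sketch of what `pd.crosstab` produces here (hypothetical user IDs and anime names; one row per user, one column per anime):
import pandas as pd
toy = pd.DataFrame({"user_id": [1, 1, 2], "name": ["A", "B", "A"]})
print(pd.crosstab(toy["user_id"], toy["name"]))
# name     A  B
# user_id
# 1        1  1
# 2        1  0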
mergedata = pd.read_csv("../data/mergedata.csv")
user_anime = pd.crosstab(mergedata["user_id"], mergedata["name"])
user_anime.columns.name = None
user_anime.index.name = None
user_anime.head(10)
"0" | "Bungaku Shoujo" Kyou no Oyatsu: Hatsukoi | "Bungaku Shoujo" Memoire | "Bungaku Shoujo" Movie | "Calpis" Hakkou Monogatari | "Eiji" | "Eiyuu" Kaitai | "Kiss Dekiru Gyoza" x Mameshiba Movie | "Parade" de Satie | "R100" x Mameshiba Original Manners | ... | s.CRY.ed Alteration I: Tao | s.CRY.ed Alteration II: Quan | the FLY BanD! | xxxHOLiC | xxxHOLiC Kei | xxxHOLiC Movie: Manatsu no Yoru no Yume | xxxHOLiC Rou | xxxHOLiC Shunmuki | ēlDLIVE | ◯ | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 |
5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
10 rows × 11409 columns
user_anime.shape
(21804, 11409)
Principal Component Analysis converts our original variables into a new set of variables that are linear combinations of the originals. The main goal is to reduce the dimensionality of the data for clustering and visualization.
pca = PCA(n_components=3)
pca.fit(user_anime)
pca_samples = pca.transform(user_anime)
ps = pd.DataFrame(pca_samples)
ps.head()
0 | 1 | 2 | |
---|---|---|---|
0 | -2.461742 | 0.477328 | -0.081578 |
1 | -1.083951 | -1.401628 | -1.014477 |
2 | -1.693449 | -0.630825 | -0.298041 |
3 | 3.778695 | 0.226784 | 1.138751 |
4 | -1.802398 | 1.464769 | 0.436394 |
ps.shape
(21804, 3)
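Because three components are a drastic reduction from more than 11,000 columns, it is worth checking how much variance they retain; a quick check against the fitted `pca` above (the exact numbers depend on the data):
# Fraction of total variance captured by each principal component
print(pca.explained_variance_ratio_)
# Cumulative share retained by the 3-component projection
print(pca.explained_variance_ratio_.sum())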
In this section, we use a variety of graphs to display the data so that we can understand it better.
to_cluster = pd.DataFrame(ps[[0, 1, 2]])
fig = plt.figure(figsize=(9, 9))
ax = fig.add_subplot(projection="3d")
ax.scatter(to_cluster[0], to_cluster[2], to_cluster[1], s=24)
plt.title("Data points in 3D PCA axis", fontsize=18)
plt.savefig("../charts/Data_points_in_3D_PCA_axis.png")
plt.show()
plt.scatter(to_cluster[1], to_cluster[0], s=24)
plt.xlabel("x_values")
plt.ylabel("y_values")
plt.title("Data points in 2D PCA axis", fontsize=18)
plt.savefig("../charts/Data_points_in_2D_PCA_axis.png")
plt.show()
Chernoff faces display multivariate data in the shape of a human face. The individual parts, such as the eyes, ears, mouth, and nose, represent values of the variables by their shape, size, placement, and orientation (Wikipedia).
# Reduce dimension for face attributes
pca = PCA(n_components=17)
pca.fit(user_anime)
pca_samples = pca.transform(user_anime)
ps = pd.DataFrame(pca_samples)
ps.head()
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | -2.461742 | 0.477327 | -0.081578 | 0.064732 | 0.634798 | 0.142396 | -0.051592 | 0.269147 | -0.324508 | -0.142589 | 0.113462 | -0.245301 | -0.565451 | 0.133959 | -0.538926 | -0.205612 | 0.021235 |
1 | -1.083951 | -1.401628 | -1.014476 | 0.235317 | -0.484953 | 0.098983 | -1.004047 | 0.046567 | 0.459869 | 0.146468 | 0.718675 | 0.356861 | 0.807967 | -0.057999 | -0.275366 | 0.364260 | 0.360791 |
2 | -1.693449 | -0.630825 | -0.298040 | 0.511194 | -0.784856 | 0.568464 | 0.385865 | -0.590062 | 0.612352 | -0.411691 | -0.189001 | 0.241942 | 0.175324 | -0.141344 | 0.211314 | 0.080413 | 0.156759 |
3 | 3.778695 | 0.226784 | 1.138750 | -1.648948 | -0.991642 | -1.601837 | 1.068610 | -0.782450 | -1.796181 | -0.444752 | -0.452496 | -0.774258 | 0.586723 | -0.709391 | 1.154184 | -0.719425 | -0.100087 |
4 | -1.802398 | 1.464768 | 0.436394 | 0.679526 | 0.561298 | -0.313808 | -1.640556 | -0.554039 | -0.152285 | 0.861153 | -0.331548 | -0.356287 | -0.692186 | 0.140269 | 0.834920 | -0.020511 | 0.596688 |
# Change size of figure
fig = plt.figure(figsize=(11, 11))
# Create sixteen faces from randomly selected users
for i in range(16):
    ax = fig.add_subplot(4, 4, i + 1, aspect="equal")
    cface(ax, 0.9, *ps.iloc[randint(0, len(ps) - 1), :])
    ax.axis([-1.2, 1.2, -1.2, 1.2])
    ax.set_xticks([])
    ax.set_yticks([])
fig.subplots_adjust(hspace=0, wspace=0)
plt.savefig("../charts/Chernoff_faces.png", bbox_inches="tight")
plt.show()
fig, axes = plt.subplots(1, 3, figsize=(12, 7))
fig.suptitle("Box plots", fontsize=18)
anime["score"].plot.box(ax=axes[0])
anime["popularity"].plot.box(ax=axes[1])
rating["user_rating"].plot.box(ax=axes[2])
plt.savefig("../charts/Box_plots.png")
plt.show()
data = anime["score"].values
bins = np.arange(1, 11)
plt.hist(data, bins, histtype="bar", rwidth=0.95)
plt.title("Score histogram", fontsize=18)
plt.xticks([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
plt.xlabel("score")
plt.ylabel("value_count")
plt.savefig("../charts/Score_histogram.png")
plt.show()
Pixel plots are a representation of a two-dimensional data set. In these plots, each pixel refers to a different value in the data set.
# Select the first thousand rows
head_anime = anime.head(1000)
columns = ["ranked", "popularity", "episodes", "score", "members", "favorites"]
# Creating a plot
fig, axes = plt.subplots(1, 6, figsize=(20, 8))
index = 0
for ax in axes:
    # Plot one column as a 50x20 pixel grid
    data = head_anime[columns[index]].values.reshape((50, 20))
    ax.grid(False)
    ax.pcolor(data, cmap="Blues")
    # Customize the subplot
    ax.set_xticks([])
    ax.set_yticks([])
    ax.set_xlabel(columns[index], fontsize=18, labelpad=20)
    index += 1
# Save a plot
plt.savefig("../charts/Pixel_oriented.png")
# Show plot
plt.show()
In statistics, a Q–Q (quantile-quantile) plot is a probability plot, which is a graphical method for comparing two probability distributions by plotting their quantiles against each other.
data = anime["score"]
fig, ax = plt.subplots(1, 1)
# Create quantile plot
stats.probplot(data, dist="norm", plot=ax)
# Calculate quantiles
median = data.median()
percent_25 = data.quantile(0.25)
percent_75 = data.quantile(0.75)
# Guide lines
plt.plot([-4, 4], [percent_25, percent_25], linestyle="dashed", label="Quartile1")
plt.plot([-4, 4], [median, median], linestyle="dashed", label="Median")
plt.plot([-4, 4], [percent_75, percent_75], linestyle="dashed", label="Quartile3")
# Customizing plot
plt.title("Score quantile plot", fontsize=18)
plt.legend()
plt.savefig("../charts/Score_quantile_plot.png")
plt.show()
Scatter plots provide a first look at the data, revealing clusters of points, outliers, and so on.
fig, axes = plt.subplots(2, 2, sharex=True, figsize=(12, 8))
fig.suptitle("Scatter plot", fontsize=18)
fig.supxlabel("Ranked")
group_setting = {
# Type marker color alpha
"Movie": ["X", "#E72626", 0.4],
"Music": ["^", "#48DA4B", 0.2],
"ONA": [".", "#E820D7", 0.2],
"OVA": ["*", "#FFD323", 0.2],
"Special": [",", "#E4337A", 0.2],
"TV": ["+", "#3719CC", 0.4],
}
# Plot each anime type with its own color and transparency
for name in group_by_type.groups:
    data = group_by_type.get_group(name)
    color = group_setting[name][1]
    alpha = group_setting[name][2]
    axes[0, 0].scatter(data["ranked"], data["popularity"], marker=".", color=color, alpha=alpha, label=name)
    axes[0, 1].scatter(data["ranked"], data["episodes"], marker=".", color=color, alpha=alpha)
    axes[1, 0].scatter(data["ranked"], data["favorites"], marker=".", color=color, alpha=alpha)
    axes[1, 1].scatter(data["ranked"], data["members"], marker=".", color=color, alpha=alpha)
# Customizing plot
axes[0, 0].set_ylabel("Popularity")
axes[0, 1].set_ylabel("Episodes")
axes[1, 0].set_ylabel("Favorites")
axes[1, 1].set_ylabel("Members")
fig.legend()
plt.savefig("../charts/Scatter_plot_matrices.png")
plt.show()
fig, axes = plt.subplots(1, 2, figsize=(12, 8))
fig.suptitle("Anime types", fontsize=18)
# Bar chart
data = anime["type"].value_counts()
for i in data.index:
    axes[0].bar(i, data[i])
axes[0].set_ylabel("Count")
# Pie chart
axes[1].pie(data, explode=(0, 0.1, 0, 0, 0, 0), autopct="%1.1f%%")
plt.savefig("../charts/Anime_types.png")
plt.show()
fig, axes = plt.subplots(1, 2, figsize=(12, 8))
fig.suptitle("Anime sources", fontsize=18)
# Bar chart
data = anime["source"].value_counts()
for i in data.index:
    axes[0].bar(i, data[i])
axes[0].tick_params(axis="x", rotation=90)
axes[0].set_ylabel("Count")
# Pie chart
axes[1].pie(data, autopct="%1.1f%%")
plt.savefig("../charts/Anime_sources.png")
plt.show()
fig, axes = plt.subplots(1, 2, figsize=(12, 8))
fig.suptitle("Anime ratings", fontsize=18)
# Bar chart
data = anime["rating"].value_counts()
for i in data.index:
    axes[0].bar(i, data[i])
axes[0].tick_params(axis="x", rotation=90)
axes[0].set_ylabel("Count")
# Pie chart
axes[1].pie(data, autopct="%1.1f%%")
plt.savefig("../charts/Anime_ratings.png")
plt.show()
Now, to find users' recurring favorite anime, we can mine frequent patterns and use them to predict which anime each user may watch. Here we use the Spark library and the FPGrowth algorithm, whose main parameters are as follows (a toy sketch follows the list):
- minSupport: the minimum support for an itemset to be identified as frequent. For example, if an item appears in 3 out of 5 transactions, it has a support of 3/5 = 0.6.
- minConfidence: the minimum confidence for generating an association rule. Confidence indicates how often an association rule has been found to be true. For example, if itemset X appears 4 times in the transactions and X and Y co-occur only 2 times, the confidence for the rule X => Y is 2/4 = 0.5. This parameter does not affect the mining of frequent itemsets, but it specifies the minimum confidence for generating association rules from them.
- numPartitions: the number of partitions used to distribute the work. By default the parameter is not set, and the number of partitions of the input dataset is used.
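To see how minSupport and minConfidence behave in practice, here is a minimal, self-contained sketch on a toy dataset (independent of the anime itemsets built below; the app name is illustrative only):
from pyspark.sql import SparkSession
from pyspark.ml.fpm import FPGrowth
spark = SparkSession.builder.master("local").appName("fpgrowth_toy").getOrCreate()
# Three baskets of item IDs
toy = spark.createDataFrame([(0, [1, 2, 5]), (1, [1, 2, 3, 5]), (2, [1, 2])], ["id", "items"])
fp = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6)
model = fp.fit(toy)
model.freqItemsets.show()      # itemsets appearing in at least 50% of the baskets
model.associationRules.show()  # rules with confidence of at least 0.6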
spark = (
SparkSession.builder.master("local")
.appName("anime_recommendation_system")
.getOrCreate()
)
sc = spark.sparkContext
# Create a map file
map_file = anime[["anime_id", "name"]]
map_file.set_index("anime_id", inplace=True)
map_file.to_csv("../data/frequent-pattern/map_file.csv")
We create an itemset for each user: a basket containing that user's favorite anime.
# Group rows with user_id and then convert anime_id to list
df = mergedata.groupby("user_id")["anime_id"].apply(list)
# Write to csv file
df.to_csv("../data/frequent-pattern/itemset.csv")
df
user_id
0 [199, 164, 431, 578, 2236, 121, 2034, 2762, 15...
1 [5114, 9253, 11061, 28851, 32281, 199, 19, 232...
2 [9253, 11061, 2904, 263, 1575, 1535, 30276, 32...
3 [9253, 32281, 2904, 1, 17074, 23273, 1575, 103...
4 [2904, 1575, 1535, 1698, 2685, 1142, 3091, 422...
...
24996 [3470]
24997 [2904, 199, 1, 1575, 164, 431, 1535, 32, 5, 30...
24998 [5114, 38524, 11061, 32935, 37510, 263, 34599,...
24999 [5114, 9253, 11061, 32281, 199, 1, 164, 245, 4...
25000 [32281, 16782, 19111, 1689, 31953, 15051, 3537...
Name: anime_id, Length: 21804, dtype: object
Then read the itemsets back and strip the double quotes and brackets from the list column.
# Read csv file
itemset = (
spark.read.option("header", True)
.option("inferSchema", True)
.csv("../data/frequent-pattern/itemset.csv")
)
# Remove double quotes, brackets and cast to integer array
itemset = itemset.withColumn("anime_id", regexp_replace(col("anime_id"), r'"|\[|\]', ""))
itemset = itemset.withColumn("anime_id", split(col("anime_id"), ",").cast("array<int>"))
itemset.printSchema()
itemset.show(5)
root
|-- user_id: integer (nullable = true)
|-- anime_id: array (nullable = true)
| |-- element: integer (containsNull = true)
+-------+--------------------+
|user_id| anime_id|
+-------+--------------------+
| 0|[199, 164, 431, 5...|
| 1|[5114, 9253, 1106...|
| 2|[9253, 11061, 290...|
| 3|[9253, 32281, 290...|
| 4|[2904, 1575, 1535...|
+-------+--------------------+
only showing top 5 rows
Read map_file from disk
# Read csv file
map_file = (
spark.read.option("header", True)
.option("inferSchema", True)
.csv("../data/frequent-pattern/map_file.csv")
)
map_file.show(5)
+--------+--------------------+
|anime_id| name|
+--------+--------------------+
| 5114|Fullmetal Alchemi...|
| 40028|Shingeki no Kyoji...|
| 9253| Steins;Gate|
| 38524|Shingeki no Kyoji...|
| 28977| Gintama°|
+--------+--------------------+
only showing top 5 rows
Create an FPGrowth model from the itemsets
fpGrowth = FPGrowth(itemsCol="anime_id", minSupport=0.1, minConfidence=0.8)
model = fpGrowth.fit(itemset)
Display frequent itemsets
freqItemsets = model.freqItemsets.withColumn("item_id", monotonically_increasing_id())
freqItemsets.show(5)
+--------------+----+-------+
| items|freq|item_id|
+--------------+----+-------+
| [223]|2679| 0|
| [21881]|2560| 1|
|[21881, 11757]|2188| 2|
| [37510]|3319| 3|
| [37510, 5114]|2350| 4|
+--------------+----+-------+
only showing top 5 rows
print("Number of frequent item set :", freqItemsets.count())
Number of frequent item set : 2477
Convert item IDs to anime names and save them
# Merge freqItemsets and map_file to get the anime names
freqItemsets = (
freqItemsets.select("item_id", explode("items").alias("anime_id"))
.join(map_file, "anime_id")
.groupBy("item_id")
.agg(collect_list(struct("name")).alias("items"))
.join(freqItemsets.drop("items"), "item_id")
.drop("item_id")
)
# Convert Spark DataFrame to Pandas DataFrame for saving in csv file
freqItemsets.toPandas().to_csv("../data/frequent-pattern/freqItemsets.csv")
freqItemsets.show(5)
+--------------------+----+
| items|freq|
+--------------------+----+
| [{Dragon Ball}]|2679|
|[{Sword Art Onlin...|2560|
|[{Sword Art Onlin...|2188|
|[{Mob Psycho 100 ...|3319|
|[{Mob Psycho 100 ...|2350|
+--------------------+----+
only showing top 5 rows
Display association rules
# Display generated association rules.
associationRules = model.associationRules.withColumn("item_id", monotonically_increasing_id())
associationRules.show()
+--------------------+----------+------------------+------------------+-------------------+-------+
| antecedent|consequent| confidence| lift| support|item_id|
+--------------------+----------+------------------+------------------+-------------------+-------+
| [11061, 2904, 1575]| [5114]|0.8343217197924389| 2.042846802734906| 0.1032379379930288| 0|
|[2904, 30276, 511...| [1575]|0.9611158072696534| 2.532467560327193| 0.1042927903137039| 1|
| [28851, 9253]| [32281]|0.8306863301191152|2.3595993671075024|0.13433314988075581| 2|
|[38524, 38000, 16...| [35760]|0.8943056124539124| 4.506457031186758|0.10011924417538066| 3|
|[30276, 1575, 16498]| [1535]|0.8014911463187325| 1.643844695168248|0.11832691249312052| 4|
|[30276, 1575, 16498]| [2904]|0.8993476234855545|2.5873301995618196|0.13277380297193175| 5|
| [31240, 1535]| [16498]|0.8017391304347826|1.8257044386422978|0.12685745734727574| 6|
| [6547, 1575, 16498]| [2904]|0.9199406968124537| 2.646574344016195|0.11383232434415703| 7|
| [2904, 16498]| [1575]|0.9522244137628753|2.5090394099922335|0.19927536231884058| 8|
| [13601, 2904, 9253]| [1575]|0.9561328790459966|2.5193379208119526|0.10296275912676574| 9|
| [523]| [199]|0.8210137275607181| 2.363218919568831|0.14263437901302514| 10|
| [22535, 31240]| [30276]|0.8005865102639296|2.1534651208727755|0.10016510731975785| 11|
| [22535, 31240]| [16498]|0.8203812316715543| 1.868155861657083|0.10264171711612548| 12|
| [10087, 5114]| [11741]|0.8808364365511315| 3.964855008786307| 0.1410291689598239| 13|
| [32937, 1535]| [30831]| 0.887374749498998| 4.030899799599198| 0.1015410016510732| 14|
| [11741, 5114, 1535]| [10087]|0.9165668662674651| 3.914754936747465|0.10530177949000183| 15|
| [32935, 20583]| [28891]| 0.963907284768212| 5.529343445694842|0.13350761328196661| 16|
| [19815, 1575]| [2904]|0.9044759825327511|2.6020839587206894|0.15199046046596953| 17|
| [31964, 2904]| [30276]|0.8052631578947368|2.1660446452919864|0.10525591634562466| 18|
| [31964, 2904]| [1575]|0.9540350877192982|2.5138103991095564|0.12470188956154835| 19|
+--------------------+----------+------------------+------------------+-------------------+-------+
only showing top 20 rows
print("Number of association rules :", associationRules.count())
Number of association rules : 903
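A lift above 1 means the consequent becomes more likely when the antecedent is present, so sorting by lift surfaces the strongest rules; a quick sketch using the Spark DataFrame API (with `col` from the functions imported above):
# Show the five rules whose antecedent raises the consequent's likelihood the most
associationRules.orderBy(col("lift").desc()).show(5)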
Convert item IDs to anime names and save them
# Merge associationRules and map_file on the antecedent column to get the anime names
associationRules = (
associationRules.select("item_id", explode("antecedent").alias("anime_id"))
.join(map_file, "anime_id")
.groupBy("item_id")
.agg(collect_list(struct("name")).alias("antecedent"))
.join(associationRules.drop("antecedent"), "item_id")
)
# Merge associationRules and map_file on the consequent column to get the anime names
associationRules = (
associationRules.select("item_id", explode("consequent").alias("anime_id"))
.join(map_file, "anime_id")
.groupBy("item_id")
.agg(collect_list(struct("name")).alias("consequent"))
.join(associationRules.drop("consequent"), "item_id")
.drop("item_id")
)
associationRules.toPandas().to_csv("../data/frequent-pattern/associationRules.csv")
associationRules.show()
+--------------------+--------------------+------------------+------------------+-------------------+
| consequent| antecedent| confidence| lift| support|
+--------------------+--------------------+------------------+------------------+-------------------+
|[{Fullmetal Alche...|[{Hunter x Hunter...|0.8343217197924389| 2.042846802734906| 0.1032379379930288|
|[{Code Geass: Han...|[{Code Geass: Han...|0.9611158072696534| 2.532467560327193| 0.1042927903137039|
| [{Kimi no Na wa.}]|[{Koe no Katachi}...|0.8306863301191152|2.3595993671075024|0.13433314988075581|
|[{Shingeki no Kyo...|[{Shingeki no Kyo...|0.8943056124539124| 4.506457031186758|0.10011924417538066|
| [{Death Note}]|[{One Punch Man},...|0.8014911463187325| 1.643844695168248|0.11832691249312052|
|[{Code Geass: Han...|[{One Punch Man},...|0.8993476234855545|2.5873301995618196|0.13277380297193175|
|[{Shingeki no Kyo...|[{Re:Zero kara Ha...|0.8017391304347826|1.8257044386422978|0.12685745734727574|
|[{Code Geass: Han...|[{Angel Beats!}, ...|0.9199406968124537| 2.646574344016195|0.11383232434415703|
|[{Code Geass: Han...|[{Code Geass: Han...|0.9522244137628753|2.5090394099922335|0.19927536231884058|
|[{Code Geass: Han...|[{Psycho-Pass}, {...|0.9561328790459966|2.5193379208119526|0.10296275912676574|
|[{Sen to Chihiro ...|[{Tonari no Totoro}]|0.8210137275607181| 2.363218919568831|0.14263437901302514|
| [{One Punch Man}]|[{Kiseijuu: Sei n...|0.8005865102639296|2.1534651208727755|0.10016510731975785|
|[{Shingeki no Kyo...|[{Kiseijuu: Sei n...|0.8203812316715543| 1.868155861657083|0.10264171711612548|
|[{Fate/Zero 2nd S...|[{Fate/Zero}, {Fu...|0.8808364365511315| 3.964855008786307| 0.1410291689598239|
|[{Kono Subarashii...|[{Kono Subarashii...| 0.887374749498998| 4.030899799599198| 0.1015410016510732|
| [{Fate/Zero}]|[{Fate/Zero 2nd S...|0.9165668662674651| 3.914754936747465|0.10530177949000183|
|[{Haikyuu!! Secon...|[{Haikyuu!!: Kara...| 0.963907284768212| 5.529343445694842|0.13350761328196661|
|[{Code Geass: Han...|[{No Game No Life...|0.9044759825327511|2.6020839587206894|0.15199046046596953|
| [{One Punch Man}]|[{Boku no Hero Ac...|0.8052631578947368|2.1660446452919864|0.10525591634562466|
|[{Code Geass: Han...|[{Boku no Hero Ac...|0.9540350877192982|2.5138103991095564|0.12470188956154835|
+--------------------+--------------------+------------------+------------------+-------------------+
only showing top 20 rows
Display the transform predictions
# transform examines the input items against all the association rules and summarizes
# the consequents as predictions
transform = model.transform(itemset)
transform.show(10)
+-------+--------------------+--------------------+
|user_id| anime_id| prediction|
+-------+--------------------+--------------------+
| 0|[199, 164, 431, 5...| []|
| 1|[5114, 9253, 1106...|[38524, 2904, 302...|
| 2|[9253, 11061, 290...| [5114, 31964]|
| 3|[9253, 32281, 290...|[16498, 4181, 153...|
| 4|[2904, 1575, 1535...| []|
| 5|[199, 877, 4224, ...| []|
| 6|[5114, 4181, 2904...| []|
| 7|[5114, 4181, 199,...| []|
| 8|[4181, 578, 10408...| []|
| 10| [1889, 934, 3652]| []|
+-------+--------------------+--------------------+
only showing top 10 rows
print("Number of transform :", transform.count())
Number of transform : 21804
Convert item IDs to anime names and save them
# Merge transform and map_file on the prediction column to get the anime names
transform = (
transform.select("user_id", explode("prediction").alias("anime_id"))
.join(map_file, "anime_id")
.groupBy("user_id")
.agg(collect_list(struct("name")).alias("prediction"))
.join(transform.drop("prediction"), "user_id")
)
transform.toPandas().to_csv("../data/frequent-pattern/transform.csv")
transform.show(10)
+-------+--------------------+--------------------+
|user_id| prediction| anime_id|
+-------+--------------------+--------------------+
| 1|[{Shingeki no Kyo...|[5114, 9253, 1106...|
| 2|[{Fullmetal Alche...|[9253, 11061, 290...|
| 3|[{Shingeki no Kyo...|[9253, 32281, 290...|
| 12|[{Code Geass: Han...|[5114, 199, 1575,...|
| 13|[{Code Geass: Han...|[1575, 486, 30, 2...|
| 14|[{Fullmetal Alche...|[9253, 38524, 110...|
| 16|[{JoJo no Kimyou ...|[5114, 9253, 3228...|
| 17|[{One Punch Man},...|[5114, 9253, 2897...|
| 19|[{Death Note}, {B...|[5114, 9253, 3852...|
| 21|[{Clannad: After ...|[9253, 28851, 322...|
+-------+--------------------+--------------------+
only showing top 10 rows
Now we want to group users with similar anime interests into the same clusters.
We estimate the number of clusters with the silhouette score. Silhouette refers to a method of interpretation and validation of consistency within clusters of data. The technique provides a succinct graphical representation of how well each object has been classified. The silhouette value is a measure of how similar an object is to its own cluster compared to other clusters (Wikipedia).
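As a toy illustration of how the score reads (values near 1 indicate tight, well-separated clusters; values near 0 indicate overlapping ones), assuming the scikit-learn imports above:
# Two well-separated blobs: the silhouette score should be close to 1
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 10])
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)
print(silhouette_score(X, labels))  # typically > 0.9 for this layout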
scores = []
inertia_list = np.zeros(10)  # indices 0-1 stay zero; only k = 2..9 are fitted below
K = range(2, 10)
for k in K:
    k_means = KMeans(n_clusters=k)
    k_means.fit(to_cluster)
    inertia_list[k] = k_means.inertia_
    scores.append(silhouette_score(to_cluster, k_means.labels_))
Elbow Method
plt.plot(K, inertia_list[2:], "-X")
plt.title("Elbow Method")
plt.xticks(np.arange(10))
plt.xlabel("Number of cluster")
plt.ylabel("Inertia")
# Draw a vertical line at x = 4
plt.axvline(x=4, color="blue", linestyle="--")
plt.savefig("../charts/Elbow_Method.png")
plt.show()
Results KMeans
plt.plot(K, scores)
plt.title("Results KMeans")
plt.xticks(np.arange(10))
plt.xlabel("Number of cluster")
plt.ylabel("Silhouette Score")
plt.axvline(x=4, color="blue", linestyle="--")
plt.savefig("../charts/Results_KMeans.png")
plt.show()
Now that we have the number of clusters, we can use the KMeans algorithm.
clusterer = KMeans(n_clusters=4, random_state=30).fit(to_cluster)
centers = clusterer.cluster_centers_
c_preds = clusterer.predict(to_cluster)
centers
array([[ 1.6859081 , -1.67711621, -0.39729844],
[-1.72324023, 0.12476707, 0.15343777],
[ 8.06274357, 0.13311179, 1.11656004],
[ 1.87270589, 2.52512675, -0.56430143]])
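A quick sanity check on the assignment before profiling the clusters (a sketch using the `c_preds` array above):
# How many users fell into each of the four clusters
unique, counts = np.unique(c_preds, return_counts=True)
print(dict(zip(unique, counts)))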
Data points in 3D PCA axis - clustered
fig = plt.figure(figsize=(9, 9))
ax = fig.add_subplot(projection="3d")
ax.scatter(
to_cluster[0],
to_cluster[2],
to_cluster[1],
c=c_preds, # clusters
cmap="viridis", # change color
alpha=0.7, # Opacity
s=24,
)
plt.title("Data points in 3D PCA axis - clustered", fontsize=18)
plt.savefig("../charts/Data_points_in_3D_PCA_axis_clustered.png")
plt.show()
Data points in 2D PCA axis - clustered
plt.scatter(to_cluster[1], to_cluster[0], c=c_preds, cmap="viridis", alpha=0.7, s=24)
for c in centers:
    plt.plot(c[1], c[0], "X", markersize=8, color="red", alpha=1)
plt.title("Data points in 2D PCA axis - clustered", fontsize=18)
plt.xlabel("x_values")
plt.ylabel("y_values")
plt.savefig("../charts/Data_points_in_2D_PCA_axis_clustered.png")
plt.show()
Add cluster to user_anime DataFrame
user_anime['cluster'] = c_preds
user_anime.head(10)
"0" | "Bungaku Shoujo" Kyou no Oyatsu: Hatsukoi | "Bungaku Shoujo" Memoire | "Bungaku Shoujo" Movie | "Calpis" Hakkou Monogatari | "Eiji" | "Eiyuu" Kaitai | "Kiss Dekiru Gyoza" x Mameshiba Movie | "Parade" de Satie | "R100" x Mameshiba Original Manners | ... | s.CRY.ed Alteration II: Quan | the FLY BanD! | xxxHOLiC | xxxHOLiC Kei | xxxHOLiC Movie: Manatsu no Yoru no Yume | xxxHOLiC Rou | xxxHOLiC Shunmuki | ēlDLIVE | ◯ | cluster | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 |
7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
10 rows × 11410 columns
user_anime.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 21804 entries, 0 to 25000
Columns: 11410 entries, "0" to cluster
dtypes: int32(1), int64(11409)
memory usage: 1.9 GB
In this section, we calculate information for each cluster and print statistical data. Because `user_anime` is a 0/1 like matrix, the per-cluster mean of each column is the share of users in that cluster who like that anime.
c0 = user_anime[user_anime["cluster"] == 0].drop("cluster", axis=1).mean()
c1 = user_anime[user_anime["cluster"] == 1].drop("cluster", axis=1).mean()
c2 = user_anime[user_anime["cluster"] == 2].drop("cluster", axis=1).mean()
c3 = user_anime[user_anime["cluster"] == 3].drop("cluster", axis=1).mean()
Create anime information list
def createAnimeInfoList(animelist):
    genre_list = list()
    episode_list = list()
    score_list = list()
    member_list = list()
    popularity_list = list()
    favorites_list = list()
    for x in anime["name"]:
        if x in animelist:
            for y in anime[anime["name"] == x].genres.values:
                genre_list.append(y)
            episode_list.append(anime[anime["name"] == x].episodes.values.astype(int))
            score_list.append(anime[anime["name"] == x].score.values.astype(float))
            member_list.append(anime[anime["name"] == x].members.values.astype(int))
            popularity_list.append(anime[anime["name"] == x].popularity.values.astype(int))
            favorites_list.append(anime[anime["name"] == x].favorites.values.astype(int))
    # Return the data as pandas Series to prevent the
    # "Length of values does not match length of index" error
    return (
        pd.Series(genre_list),
        pd.Series(episode_list),
        pd.Series(score_list),
        pd.Series(member_list),
        pd.Series(popularity_list),
        pd.Series(favorites_list),
    )
Count word
def count_word(df, ref_col, liste):
    keyword_count = dict()
    for s in liste:
        keyword_count[s] = 0
    for liste_keywords in df[ref_col].str.split(","):
        if type(liste_keywords) == float and pd.isnull(liste_keywords):
            continue
        for s in [s for s in liste_keywords if s in liste]:
            if pd.notnull(s):
                keyword_count[s] += 1
    # Convert the dictionary to a list to sort the keywords by frequency
    keyword_occurences = []
    for k, v in keyword_count.items():
        keyword_occurences.append([k, v])
    keyword_occurences.sort(key=lambda x: x[1], reverse=True)
    return keyword_occurences, keyword_count
Make cloud graph
def makeCloud(Dict, name, color, isSave=True):
    words = dict()
    for s in Dict:
        words[s[0]] = s[1]
    wordcloud = WordCloud(
        width=1500,
        height=500,
        background_color=color,
        max_words=20,
        max_font_size=500,
        normalize_plurals=False,
    )
    wordcloud.generate_from_frequencies(words)
    fig = plt.figure(figsize=(20, 8))
    plt.title(name, fontsize=18)
    plt.imshow(wordcloud)
    plt.axis("off")
    if isSave:
        plt.savefig(f"../charts/{name}.png")
    plt.show()
animelist = list(c0.index)
data = pd.DataFrame()
data["genre"] = createAnimeInfoList(animelist)[0]
set_keywords = set()
for liste_keywords in data["genre"].str.split(",").values:
    if isinstance(liste_keywords, float):
        continue  # only happens if liste_keywords is NaN
    set_keywords = set_keywords.union(liste_keywords)
Top 15 anime that explain the characteristics of this cluster
c0.sort_values(ascending=False)[0:15]
Shingeki no Kyojin 0.751535
One Punch Man 0.732706
Kimi no Na wa. 0.707532
Death Note 0.640196
Koe no Katachi 0.634056
Boku no Hero Academia 2nd Season 0.623823
No Game No Life 0.622391
Re:Zero kara Hajimeru Isekai Seikatsu 0.615227
Fullmetal Alchemist: Brotherhood 0.614409
Boku no Hero Academia 0.609906
Shingeki no Kyojin Season 2 0.609087
Steins;Gate 0.591281
Shigatsu wa Kimi no Uso 0.544003
Kimetsu no Yaiba 0.543185
Boku dake ga Inai Machi 0.528244
dtype: float64
Favorite genres for this cluster
c0_animelist = list(c0.sort_values(ascending=False)[0:30].index)
c0_data = pd.DataFrame()
(
c0_data["genre"],
c0_data["episode"],
c0_data["score"],
c0_data["member"],
c0_data["popularity"],
c0_data["favorites"],
) = createAnimeInfoList(c0_animelist)
keyword_occurences, dum = count_word(c0_data, "genre", set_keywords)
makeCloud(keyword_occurences[0:10], "Cluster_0", "lemonchiffon")
keyword_occurences[0:5]
[['Action', 17],
[' Shounen', 16],
[' Comedy', 12],
[' Drama', 11],
[' Super Power', 11]]
Average of each attribute for the anime that users in this cluster like
avg_episodes = int(c0_data["episode"].mean()[0].round())
avg_score = c0_data["score"].mean()[0].round(2)
avg_popularity = int(c0_data["popularity"].mean()[0].round())
avg_member = int(c0_data["member"].mean()[0].round())
avg_favorites = int(c0_data["favorites"].mean()[0].round())
print(f"Cluster 0\nAVG episode : {avg_episodes}\nAVG score : {avg_score}\nAVG popularity : {avg_popularity}\nAVG member : {avg_member}\nAVG favorites : {avg_favorites}\n")
Cluster 0
AVG episode : 23
AVG score : 8.55
AVG popularity : 25
AVG member : 1562339
AVG favorites : 59740
Top 15 anime that explain the characteristics of this cluster
c1.sort_values(ascending=False)[0:15]
Death Note 0.371123
Shingeki no Kyojin 0.265868
Fullmetal Alchemist: Brotherhood 0.251081
Sen to Chihiro no Kamikakushi 0.243270
Code Geass: Hangyaku no Lelouch 0.229013
Code Geass: Hangyaku no Lelouch R2 0.202017
Steins;Gate 0.196557
Toradora! 0.192386
One Punch Man 0.187912
Kimi no Na wa. 0.184652
Angel Beats! 0.184197
Howl no Ugoku Shiro 0.165314
Fullmetal Alchemist 0.160158
Sword Art Online 0.158641
Elfen Lied 0.151513
dtype: float64
Favorite genres for this cluster
c1_animelist = list(c1.sort_values(ascending=False)[0:30].index)
c1_data = pd.DataFrame()
(
c1_data["genre"],
c1_data["episode"],
c1_data["score"],
c1_data["member"],
c1_data["popularity"],
c1_data["favorites"],
) = createAnimeInfoList(c1_animelist)
keyword_occurences, dum = count_word(c1_data, "genre", set_keywords)
makeCloud(keyword_occurences[0:10], "Cluster_1", "lemonchiffon")
keyword_occurences[0:5]
[[' Drama', 17],
['Action', 16],
[' Supernatural', 12],
[' Comedy', 10],
[' Adventure', 8]]
Average of each attribute for the anime that users in this cluster like
avg_episodes = int(c1_data["episode"].mean()[0].round())
avg_score = c1_data["score"].mean()[0].round(2)
avg_popularity = int(c1_data["popularity"].mean()[0].round())
avg_member = int(c1_data["member"].mean()[0].round())
avg_favorites = int(c1_data["favorites"].mean()[0].round())
print(f"Cluster 1\nAVG episode : {avg_episodes}\nAVG score : {avg_score}\nAVG popularity : {avg_popularity}\nAVG member : {avg_member}\nAVG favorites : {avg_favorites}\n")
Cluster 1
AVG episode : 27
AVG score : 8.44
AVG popularity : 35
AVG member : 1498743
AVG favorites : 62681
Top 15 anime that explain the characteristics of this cluster
c2.sort_values(ascending=False)[0:15]
No Game No Life 0.863184
Shingeki no Kyojin 0.846600
One Punch Man 0.844942
Steins;Gate 0.843284
Angel Beats! 0.825871
Toradora! 0.812604
Re:Zero kara Hajimeru Isekai Seikatsu 0.796849
Code Geass: Hangyaku no Lelouch 0.792703
Fullmetal Alchemist: Brotherhood 0.786070
Code Geass: Hangyaku no Lelouch R2 0.770315
Kimi no Na wa. 0.764511
Death Note 0.762852
Hataraku Maou-sama! 0.762023
Boku dake ga Inai Machi 0.758706
Shokugeki no Souma 0.753731
dtype: float64
Favorite genres for this cluster
c2_animelist = list(c2.sort_values(ascending=False)[0:30].index)
c2_data = pd.DataFrame()
(
c2_data["genre"],
c2_data["episode"],
c2_data["score"],
c2_data["member"],
c2_data["popularity"],
c2_data["favorites"],
) = createAnimeInfoList(c2_animelist)
keyword_occurences, dum = count_word(c2_data, "genre", set_keywords)
makeCloud(keyword_occurences[0:10], "Cluster_2", "lemonchiffon")
keyword_occurences[0:5]
[[' Supernatural', 14],
['Action', 13],
[' Comedy', 12],
[' Drama', 11],
[' School', 11]]
Average of each attribute for the anime that users in this cluster like
avg_episodes = int(c2_data["episode"].mean()[0].round())
avg_score = c2_data["score"].mean()[0].round(2)
avg_popularity = int(c2_data["popularity"].mean()[0].round())
avg_member = int(c2_data["member"].mean()[0].round())
avg_favorites = int(c2_data["favorites"].mean()[0].round())
print(f"Cluster 2\nAVG episode : {avg_episodes}\nAVG score : {avg_score}\nAVG popularity : {avg_popularity}\nAVG member : {avg_member}\nAVG favorites : {avg_favorites}\n")
Cluster 2
AVG episode : 19
AVG score : 8.44
AVG popularity : 33
AVG member : 1481909
AVG favorites : 55440
Top 15 anime that explain the characteristics of this cluster
c3.sort_values(ascending=False)[0:15]
Code Geass: Hangyaku no Lelouch 0.692277
Death Note 0.668911
Sen to Chihiro no Kamikakushi 0.659406
Fullmetal Alchemist: Brotherhood 0.651089
Code Geass: Hangyaku no Lelouch R2 0.629703
Steins;Gate 0.627723
Tengen Toppa Gurren Lagann 0.584158
Toradora! 0.580198
Bakemonogatari 0.567129
Mononoke Hime 0.560396
Shingeki no Kyojin 0.544950
Suzumiya Haruhi no Yuuutsu 0.539802
Cowboy Bebop 0.539406
Mahou Shoujo Madoka★Magica 0.538218
Toki wo Kakeru Shoujo 0.535842
dtype: float64
Favorite genres for this cluster
c3_animelist = list(c3.sort_values(ascending=False)[0:30].index)
c3_data = pd.DataFrame()
(
c3_data["genre"],
c3_data["episode"],
c3_data["score"],
c3_data["member"],
c3_data["popularity"],
c3_data["favorites"],
) = createAnimeInfoList(c3_animelist)
keyword_occurences, dum = count_word(c3_data, "genre", set_keywords)
makeCloud(keyword_occurences[0:10], "Cluster_3", "lemonchiffon")
keyword_occurences[0:5]
[[' Drama', 16],
['Action', 16],
[' Sci-Fi', 10],
[' Supernatural', 10],
[' Comedy', 9]]
Average of each attribute for the anime that users in this cluster like
avg_episodes = int(c3_data["episode"].mean()[0].round())
avg_score = c3_data["score"].mean()[0].round(2)
avg_popularity = int(c3_data["popularity"].mean()[0].round())
avg_member = int(c3_data["member"].mean()[0].round())
avg_favorites = int(c3_data["favorites"].mean()[0].round())
print(f"Cluster 3\nAVG episode : {avg_episodes}\nAVG score : {avg_score}\nAVG popularity : {avg_popularity}\nAVG member : {avg_member}\nAVG favorites : {avg_favorites}\n")
Cluster 3
AVG episode : 21
AVG score : 8.48
AVG popularity : 66
AVG member : 1214717
AVG favorites : 52597