- PatternMining
- Table of Contents
- Overview
- Setup
- Project Structure
- Frequent Patterns & Association Rules
PatternMining is a Python and Jupyter-based project that analyzes gaming data (CSV format) to uncover frequent patterns, perform clustering, and conduct classification. It applies data mining techniques to derive association rules, compare several clustering methods, and categorize data with different classification techniques. The project is still in development, and features are being added continuously.
Ensure you have Python installed on your system. This project requires the following Python packages:
- pandas
- mlxtend
- matplotlib
- scikit-learn
- seaborn
- numpy
- warnings (part of the Python standard library; no installation needed)
- Clone the repository or download the project files.
- Create and activate a virtual environment (optional but recommended).
- Install the required packages:
pip install -r requirements.txt
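After installation, a quick check can confirm that the packages are importable. The snippet below is a minimal sketch and not part of the project; `REQUIRED` and `missing_packages` are hypothetical names, and the list uses import names (scikit-learn is imported as `sklearn`).

```python
from importlib.util import find_spec

# Import names of the packages listed above (scikit-learn imports as sklearn).
REQUIRED = ["pandas", "mlxtend", "matplotlib", "sklearn", "seaborn", "numpy"]


def missing_packages(names):
    """Return the import names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
print("Missing packages:", ", ".join(missing) if missing else "none")
```

If anything is reported as missing, re-running the install step inside the activated virtual environment is usually the first thing to try.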
PatternMining/
│
├── .git/
│
├── .ipynb_checkpoints/
│
├── best_rules/
│
├── datasets/
│
├── .gitignore
│
├── git.ipynb
│
├── main.ipynb
│
├── README.md
│
├── requirements.txt
Description
- .git/: Contains the git version control system files.
- .ipynb_checkpoints/: Stores the checkpoints of the Jupyter notebooks, which help in recovering unsaved work.
- best_rules/: Directory for storing the best association rules identified during the analysis.
- datasets/: Directory for storing dataset files used in the project.
- .gitignore: Specifies files and directories to be ignored by git.
- git.ipynb: Jupyter notebook for git-related operations.
- main.ipynb: Main Jupyter notebook for the project where data analysis and pattern mining tasks are performed.
- README.md: Provides an overview of the project, setup instructions, and other relevant information.
- requirements.txt: Lists the Python dependencies needed for the project.
This section explains the steps involved in finding frequent patterns and association rules from gaming data. The process involves loading the dataset, preprocessing the data, and applying pattern mining techniques to extract meaningful insights.
- Loading Dataset: The dataset is loaded using the `pandas` library, which reads the CSV file into a DataFrame for further processing.
data = pd.read_csv(f'{project_path}/datasets/data_processed.csv')
- Dropping Unnecessary Columns: To focus on relevant data, unnecessary columns are dropped from the DataFrame. This reduces noise and keeps irrelevant attributes out of the mined rules.
data.drop(columns=['img', 'title', 'Processed_title', 'Stemmed_title', 'Lemmatized_title', 'Processed_genres', 'Stemmed_genres', 'Lemmatized_genres', 'score'], inplace=True)
- Making Numerical Columns Categorical: Numerical columns that should be treated as categorical are binned with `pd.cut` into three equal-width intervals labeled low, medium, and high. Pattern mining works on discrete items, so this step lets the algorithms interpret these columns correctly.
data[column] = pd.cut(data[column], bins=3, labels=['low', 'medium', 'high'])
- Encoding Values: Categorical values are one-hot encoded with `pd.get_dummies`, which turns each category into its own indicator column. This is the tabular format the frequent-itemset step takes as input.
data_one_hot = pd.get_dummies(data)
- Finding Frequent Items: The Apriori algorithm is applied to find frequent itemsets, that is, sets of items that appear together in at least a `min_support_threshold` fraction of the rows.
frequent_itemsets = apriori(data_one_hot, min_support=min_support_threshold, use_colnames=True)
- Association Rules: Association rules are generated from the frequent itemsets, keeping only rules whose confidence is at least `min_confidence_threshold`. These rules help identify interesting relationships between items in the dataset.
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=min_confidence_threshold)
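The steps above can be sketched end to end on a toy table. This is a minimal illustration under stated assumptions, not the notebook's code: the column values are invented, `frequent_itemsets` and `confident_rules` are hypothetical helpers, and the mining step counts support by brute force for itemsets of at most two items as a stand-in for Apriori. The project itself calls `apriori` and `association_rules`, presumably from `mlxtend.frequent_patterns` given the calls shown above.

```python
from itertools import combinations

import pandas as pd

# Toy stand-in for datasets/data_processed.csv; the values are invented.
data = pd.DataFrame({
    "na_sales": [0.1, 0.2, 5.0, 5.2, 9.8, 0.3],
    "pal_sales": [0.2, 0.1, 4.8, 5.1, 9.9, 0.2],
})

# Bin each numeric column into three equal-width intervals.
for column in data.select_dtypes("number").columns:
    data[column] = pd.cut(data[column], bins=3, labels=["low", "medium", "high"])

# One-hot encode: one boolean column per (column, label) pair.
data_one_hot = pd.get_dummies(data).astype(bool)


def frequent_itemsets(one_hot, min_support, max_len=2):
    """Map each itemset of up to max_len items to its support, if frequent."""
    found = {}
    for size in range(1, max_len + 1):
        for items in combinations(one_hot.columns, size):
            # Support = share of rows in which every item of the set is present.
            support = float(one_hot[list(items)].all(axis=1).mean())
            if support >= min_support:
                found[frozenset(items)] = support
    return found


def confident_rules(itemsets, min_confidence):
    """Derive single-antecedent rules from the frequent pairs."""
    rules = []
    for itemset, support in itemsets.items():
        if len(itemset) != 2:
            continue
        for antecedent in itemset:
            (consequent,) = itemset - {antecedent}
            # Confidence = support(A and C) / support(A).
            confidence = support / itemsets[frozenset([antecedent])]
            if confidence >= min_confidence:
                rules.append((antecedent, consequent, support, confidence))
    return rules


itemsets = frequent_itemsets(data_one_hot, min_support=0.3)
rules = confident_rules(itemsets, min_confidence=0.9)
for antecedent, consequent, support, confidence in rules:
    print(f"{antecedent} -> {consequent}: support={support:.2f}, confidence={confidence:.2f}")
```

Brute-force counting is only practical for a handful of columns. Apriori avoids most of that work by extending only itemsets whose subsets are already frequent, which is why a library implementation is the sensible choice on the real dataset.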
This section compares the results obtained from different methods for finding frequent patterns and association rules, where each method is a combination of minimum support and minimum confidence thresholds. The comparison includes tables and visualizations for better understanding.
Method | Min support | Min confidence | Avg lift | Avg leverage | Avg conviction | Avg Zhang's metric | Rules count |
---|---|---|---|---|---|---|---|
Method 1 | 0.005 | 0.50 | 1.91 | 0.003 | 6.40 | 0.280 | 2321850 |
Method 2 | 0.005 | 0.70 | 1.62 | 0.003 | 9.09 | 0.267 | 1695209 |
Method 3 | 0.01 | 0.60 | 1.52 | 0.007 | 10.22 | 0.265 | 813235 |
Method 4 | 0.01 | 0.80 | 1.47 | 0.007 | 15.73 | 0.260 | 567158 |
Method 5 | 0.02 | 0.70 | 1.47 | 0.014 | 20.58 | 0.268 | 231592 |
Method 6 | 0.02 | 0.90 | 1.44 | 0.014 | 29.87 | 0.266 | 174757 |
Method 7 | 0.05 | 0.80 | 1.43 | 0.030 | 36.12 | 0.290 | 63002 |
Method 8 | 0.05 | 0.90 | 1.42 | 0.030 | 42.59 | 0.286 | 55019 |
Method 9 | 0.08 | 0.80 | 1.41 | 0.038 | 40.30 | 0.303 | 41045 |
Method 10 | 0.10 | 0.90 | 1.36 | 0.048 | 55.62 | 0.307 | 21811 |
Method 11 | 0.20 | 0.90 | 1.34 | 0.072 | 36.71 | 0.360 | 7556 |
Method 12 | 0.30 | 0.90 | 1.24 | 0.091 | 31.33 | 0.423 | 2931 |
Method 13 | 0.50 | 0.90 | 1.23 | 0.115 | 30.64 | 0.565 | 1259 |
Method 14 | 0.70 | 0.90 | 1.08 | 0.045 | 4.55 | 0.314 | 88 |
Method 15 | 0.80 | 0.90 | 1.00 | 0.000 | 0.94 | 0.412 | 16 |
Method 16 | 0.90 | 0.90 | 1.00 | 0.000 | 1.00 | 0.583 | 11 |
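Each row of a comparison like this can be derived by aggregating the rules produced at one pair of thresholds. The following is a minimal sketch, assuming a rules table with `lift`, `leverage`, `conviction`, and `zhangs_metric` columns; the `summarize` helper and the numbers in the toy table are hypothetical and do not come from the project's results.

```python
import pandas as pd

# Hypothetical rules table; in practice this would be the output of the
# rule-generation step for one (min support, min confidence) pair.
rules = pd.DataFrame({
    "lift": [1.2, 1.8, 1.5],
    "leverage": [0.01, 0.03, 0.02],
    "conviction": [1.5, 4.0, 2.5],
    "zhangs_metric": [0.2, 0.4, 0.3],
})


def summarize(rules, min_support, min_confidence):
    """Build one comparison row: thresholds, metric averages, rule count."""
    row = {"min_support": min_support, "min_confidence": min_confidence}
    for column in ["lift", "leverage", "conviction", "zhangs_metric"]:
        # Note: a rule with confidence exactly 1 has infinite conviction,
        # which would make this mean infinite unless handled beforehand.
        row[f"avg_{column}"] = float(rules[column].mean())
    row["rules_count"] = len(rules)
    return row


print(summarize(rules, min_support=0.01, min_confidence=0.6))
```

Collecting one such row per threshold pair into a DataFrame would give a table of the same shape as the one above.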
This section lists the best association rules identified during the analysis. These rules highlight the most significant relationships found in the gaming data based on support, confidence, lift, leverage, conviction, and Zhang's metric.
Top Association Rules Table
Antecedents | Consequents | Support | Confidence | Lift | Leverage | Conviction | Zhang's Metric |
---|---|---|---|---|---|---|---|
pal_sales_medium, na_sales_medium | total_sales_medium | 0.68 | 0.99 | 0.01 | 0.95 | 0.50 | 0.90 |
pal_sales_medium, user ratings count_low, na_sales_medium | total_sales_medium | 0.68 | 0.99 | 0.01 | 0.95 | 0.50 | 0.90 |
pal_sales_medium, na_sales_medium | user ratings count_low, total score_low | 0.68 | 0.99 | 0.89 | 0.93 | 0.33 | 0.87 |
total score_low | metascore_count_low | 0.93 | 0.93 | 0.00 | 0.00 | 0.00 | 1.0 |
user ratings count_low, na_sales_medium | total_sales_medium, total score_low | 0.71 | 0.97 | 1.0 | 1.0 | 0.92 | 0.93 |
total_sales_medium | na_sales_medium, total score_low | 0.71 | 0.97 | 0.99 | 0.99 | 0.99 | 0.93 |
The table above includes the antecedents and consequents of the top association rules along with their respective support, confidence, lift, leverage, conviction, and Zhang's metric. These metrics are defined as follows:
- Support: The proportion of transactions in the dataset that contain both the antecedent and the consequent.
- Confidence: The proportion of transactions containing the antecedent that also contain the consequent.
- Lift: The ratio of the observed support to that expected if the antecedent and consequent were independent. Values above 1 indicate a positive association, and a value of 1 indicates independence.
- Leverage: The difference between the observed support of a rule and the support expected if the antecedent and consequent were independent. A value of 0 indicates independence.
- Conviction: The ratio of how often the antecedent would occur without the consequent if the two were independent to how often it actually does. Higher values mean the rule makes incorrect predictions less often than chance would.
- Zhang's Metric: A measure of association ranging from -1 (perfect dissociation) to 1 (perfect association), with 0 indicating independence.
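The definitions above can be made concrete with a small calculation. The following is a minimal sketch, assuming raw transaction counts are available and the antecedent and consequent each occur at least once; `rule_metrics` and the counts in the usage line are hypothetical and are not taken from the project's data.

```python
def rule_metrics(n_total, n_antecedent, n_consequent, n_both):
    """Metrics for a rule A -> C, computed from transaction counts."""
    supp_a = n_antecedent / n_total   # share of rows containing A
    supp_c = n_consequent / n_total   # share of rows containing C
    supp_ac = n_both / n_total        # share of rows containing A and C

    confidence = supp_ac / supp_a
    lift = confidence / supp_c
    leverage = supp_ac - supp_a * supp_c
    # Conviction is infinite when the rule never fails (every A row has C).
    if n_both == n_antecedent:
        conviction = float("inf")
    else:
        conviction = (1 - supp_c) / (1 - confidence)
    # Zhang's metric rescales leverage into the range [-1, 1].
    denominator = max(supp_ac * (1 - supp_a), supp_a * (supp_c - supp_ac))
    zhang = leverage / denominator if denominator else 0.0

    return {
        "support": supp_ac,
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
        "zhangs_metric": zhang,
    }


metrics = rule_metrics(n_total=100, n_antecedent=40, n_consequent=50, n_both=30)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

With these counts the rule has support 0.30, confidence 0.75, lift 1.50, leverage 0.10, conviction 2.00, and a Zhang's metric of roughly 0.56. All of these point to a positive association, since independence would give a lift of 1 and a leverage of 0.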
Explanation of Top Rules
- Rule 1: `{pal_sales_medium, na_sales_medium} -> {total_sales_medium}`
  - Support: 0.68
  - Confidence: 0.99
  - Lift: 0.01
  - Leverage: 0.95
  - Conviction: 0.50
  - Zhang's Metric: 0.90
  - Explanation: This rule indicates that when `pal_sales_medium` and `na_sales_medium` are present, `total_sales_medium` is almost always present, with a confidence of 99%. The high leverage value of 0.95 suggests a strong association between these variables.
- Rule 2: `{pal_sales_medium, user ratings count_low, na_sales_medium} -> {total_sales_medium}`
  - Support: 0.68
  - Confidence: 0.99
  - Lift: 0.01
  - Leverage: 0.95
  - Conviction: 0.50
  - Zhang's Metric: 0.90
  - Explanation: This rule shows that the presence of `pal_sales_medium`, `user ratings count_low`, and `na_sales_medium` strongly indicates the presence of `total_sales_medium`.
- Rule 3: `{pal_sales_medium, na_sales_medium} -> {user ratings count_low, total score_low}`
  - Support: 0.68
  - Confidence: 0.99
  - Lift: 0.89
  - Leverage: 0.93
  - Conviction: 0.33
  - Zhang's Metric: 0.87
  - Explanation: When `pal_sales_medium` and `na_sales_medium` are present, `user ratings count_low` and `total score_low` are also likely to be present, with a confidence of 99%.
- Rule 4: `{total score_low} -> {metascore_count_low}`
  - Support: 0.93
  - Confidence: 0.93
  - Lift: 0.00
  - Leverage: 0.00
  - Conviction: 0.00
  - Zhang's Metric: 1.0
  - Explanation: This rule shows that `total score_low` is almost always associated with `metascore_count_low`, with a high support and confidence of 93%.
- Rule 5: `{user ratings count_low, na_sales_medium} -> {total_sales_medium, total score_low}`
  - Support: 0.71
  - Confidence: 0.97
  - Lift: 1.0
  - Leverage: 1.0
  - Conviction: 0.92
  - Zhang's Metric: 0.93
  - Explanation: This rule indicates a strong association between the antecedents `user ratings count_low` and `na_sales_medium` and the consequents `total_sales_medium` and `total score_low`.
- Rule 6: `{total_sales_medium} -> {na_sales_medium, total score_low}`
  - Support: 0.71
  - Confidence: 0.97
  - Lift: 0.99
  - Leverage: 0.99
  - Conviction: 0.99
  - Zhang's Metric: 0.93
  - Explanation: When `total_sales_medium` is present, `na_sales_medium` and `total score_low` are also very likely to be present, with a high confidence of 97%.