Week 11
In Week 11, our primary focus was on finalizing the models, preparing for the presentation with the creation of Dashboards and slides, and initiating the calculation of Return on Investment (ROI).
Model Evaluation Metrics:
- We completed the evaluation of all three models (regression, K-means clustering, random forest) using appropriate metrics: precision, recall, and F1 for the categorical targets, and R2, MSE, and RMSE for the numerical targets.
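For reference, the metrics named above can be computed as follows. This is a pure-Python sketch with made-up labels and predictions, not our actual results:

```python
import math

def precision_recall_f1(y_true, y_pred, positive=1):
    """Classification metrics for a chosen positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

def regression_metrics(y_true, y_pred):
    """R2, MSE, and RMSE for a numerical target."""
    n = len(y_true)
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
    rmse = math.sqrt(mse)
    mean_t = sum(y_true) / n
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    r2 = 1 - sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / ss_tot
    return r2, mse, rmse

# Toy labels, purely illustrative:
p, r, f = precision_recall_f1([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
r2, mse, rmse = regression_metrics([3.0, 5.0, 7.0], [2.5, 5.5, 7.0])
```

In practice scikit-learn's `precision_score`, `recall_score`, `f1_score`, `r2_score`, and `mean_squared_error` compute the same quantities.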
Model Selection:
- After a thorough comparison of model performance, we identified the model that performed best across our chosen metrics. Details of this process can be found in the week 10 update.
Dashboards:
- We have begun building dashboards to visually represent key findings and insights from the models. These will provide a clear and concise overview for stakeholders.
Slides:
- Initial slides for the presentation have been drafted. These will serve as a foundation for conveying our approach, methodology, and the impact of our models.
Initiating ROI Analysis:
- We have started the ROI calculation process, considering the potential impact of our models on business outcomes. The goal is to quantify the return on investment, taking into account the expected benefits and costs associated with model implementation.
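A minimal sketch of the ROI formula we are working from (the formula's form and every number below are assumptions for illustration, not our actual figures):

```python
def roi(incremental_revenue, cost_savings, implementation_cost):
    """ROI = (total benefit - cost) / cost."""
    benefit = incremental_revenue + cost_savings
    return (benefit - implementation_cost) / implementation_cost

# Hypothetical inputs: $120k extra revenue, $30k savings, $100k cost.
value = roi(120_000, 30_000, 100_000)  # 0.5, i.e., a 50% return
```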
Complete ROI Analysis:
- Finalize the ROI calculation, incorporating all relevant factors such as increased revenue, cost of implementation, and potential cost savings.
Refine Dashboards and Slides:
- Continue refining and enhancing the dashboards and presentation slides for a polished and informative delivery.
Prepare for Stakeholder Meeting:
- Anticipate any questions or feedback from stakeholders and be prepared to address them during the upcoming meeting.
Our progress is on track, and we are excited to present the comprehensive results of our analysis in the upcoming weeks.
Week 10
In week 10, we started preprocessing, model building, and model selection.
- Model Building:
- Preprocessing: SMOTE and one-hot encoding
- Build the models: we successfully built three models: 1. regression, 2. K-means clustering, 3. random forest
- Evaluate model performance with precision/recall/F1 (categorical) or R2/MSE/RMSE (numerical)
- Compare the models and choose the one with the best performance
- Details here: week 10 update
- Finalize all the models with evaluation metrics
- Start building the dashboards and making slides
- Calculate ROI
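As a quick illustration of the one-hot encoding step in the preprocessing above, here is a pure-Python sketch (the column values are made up; on the real table this is what `pandas.get_dummies` or scikit-learn's `OneHotEncoder` does):

```python
def one_hot(values):
    """Turn a list of category labels into 0/1 indicator rows."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values], categories

rows, cats = one_hot(["red", "blue", "red"])
# cats == ["blue", "red"]; rows == [[0, 1], [1, 0], [0, 1]]
```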
Week 9
In week 9, we started feature engineering and model selection.
- Feature Engineering:
- Deal with outliers
- We decided to select the following features, which can be divided into two categories:
- Categorical (grouped so each has at most tens of categories): style / standardized_color / standardized_size / vendor / brand
- Numerical (standardized for further model building): retail price / packsize
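The standardization mentioned for the numerical features can be sketched as a z-score transform (illustrative numbers only; scikit-learn's `StandardScaler` does this on the real columns):

```python
import math

def standardize(xs):
    """Z-score: subtract the mean, divide by the population std."""
    mean = sum(xs) / len(xs)
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))
    return [(x - mean) / std for x in xs]

z = standardize([10.0, 20.0, 30.0])  # mean 20, std ~8.165
```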
- Details here: week 9 update
- Deal with the imbalanced dataset using SMOTE (Synthetic Minority Over-sampling Technique)
- Build the models: we plan to build three models: 1. regression, 2. random forest, 3. K-means clustering
- Compare the model metrics and derive insights from the models about which product features lead to purchases or returns
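The core idea behind SMOTE, mentioned above, can be sketched in pure Python: a synthetic minority sample is created by interpolating between a minority point and one of its minority-class neighbors. (In practice we use imbalanced-learn's `SMOTE`; this toy version only shows the interpolation step.)

```python
import random

def smote_sample(point, neighbor, rng):
    """Place a synthetic point at a random position along the segment
    between a minority sample and a minority neighbor."""
    gap = rng.random()  # uniform in [0, 1)
    return [p + gap * (n - p) for p, n in zip(point, neighbor)]

rng = random.Random(0)
synthetic = smote_sample([1.0, 2.0], [3.0, 4.0], rng)
# each coordinate lies between the two parent points
```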
Week 8
In week 8, we have finished data cleaning and are in the middle of EDA (stage 3).
- EDA:
- We looked into three categorical features (color, style, size) from skuinfo in the joined table. They all have many categories, and some categories are hard to interpret from their names alone. Therefore, we generalized those categories either by marking them as the "other" group or by dropping them directly.
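The "other"-group generalization can be sketched as a frequency threshold; the threshold and category values below are made up for illustration:

```python
from collections import Counter

def collapse_rare(values, min_count=2):
    """Replace categories rarer than min_count with 'other'."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else "other" for v in values]

out = collapse_rare(["red", "red", "plaid", "blue", "blue"])
# → ["red", "red", "other", "blue", "blue"]
```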
- Details here: week 8 update
- Continue with stage 3: Continue to work on categorical and numerical features.
- We have outlined our strategy for a supervised learning problem that predicts whether a product is purchased, returned, or not purchased. We construct X by grouping the data by SKUID and aggregating the profit for each product, computed as (retail price - cost) * quantity, i.e., the profit each product could generate for Dillard's. We define Y as a categorical variable indicating whether a product is purchased, returned, or not purchased.
- Work more on feature engineering and the prediction model.
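The X construction described above (group by SKUID, sum profit) can be sketched with toy rows; the field order and values here are hypothetical:

```python
from collections import defaultdict

def profit_by_sku(rows):
    """rows: (skuid, retail_price, cost, quantity).
    Sum profit = (retail - cost) * quantity per SKU."""
    totals = defaultdict(float)
    for skuid, retail, cost, qty in rows:
        totals[skuid] += (retail - cost) * qty
    return dict(totals)

x = profit_by_sku([
    ("A", 10.0, 6.0, 3),   # +12
    ("A", 10.0, 6.0, -1),  # a return: -4
    ("B", 5.0, 2.0, 2),    # +6
])
# x == {"A": 8.0, "B": 6.0}
```

On the real table this is a `groupby("SKUID")` aggregation in pandas or a `GROUP BY` in SQL.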
Week 7
In week 7, we are at the end of stage 2 (Data Cleaning) and the start of stage 3 (EDA).
Data Cleaning:
- Select a subset of data and inner join the three tables trnsact + skuinfo + skstinfo; details here: JOINED_TABLE
- Check for duplicates, outliers, and null values: drop N/A values, then check outliers and duplicates.
- Browse data and do some simple EDA.
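The three-table inner join can be sketched in Python with toy in-memory rows keyed on SKU (the real join ran in SQL, and the actual tables also involve a store key; the field names below are hypothetical):

```python
def inner_join(trnsact, skuinfo, skstinfo):
    """Keep only transactions whose SKU appears in both lookup tables,
    merging the matching attributes into each row."""
    joined = []
    for row in trnsact:
        sku = row["sku"]
        if sku in skuinfo and sku in skstinfo:
            joined.append({**row, **skuinfo[sku], **skstinfo[sku]})
    return joined

rows = inner_join(
    [{"sku": 1, "qty": 2}, {"sku": 2, "qty": 1}],
    {1: {"brand": "X"}},                 # sku 2 missing -> dropped
    {1: {"cost": 4.0}, 2: {"cost": 1.0}},
)
# rows == [{"sku": 1, "qty": 2, "brand": "X", "cost": 4.0}]
```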
EDA:
- Details here: week 7 update
- Continue with stage 2: Data Cleaning. Continue to work on outliers.
- We planned a supervised learning problem that tries to predict whether a product would be purchased, returned, or possibly not purchased. X is built by grouping by SKUID and summing the profit of each product, computed as (retail price - cost) * quantity, i.e., the profit each product could bring to Dillard's. We planned to set Y as a categorical variable indicating whether a product is purchased, returned, or not purchased.
- Work towards feature engineering and the prediction model
Week 6
In week 6, we are at stage 2 (Data Cleaning) and chose the general direction to work towards.
We chose our general research direction: select a subset of Black Friday sales data (saledate='2004-11-24') and investigate the best-selling and worst-selling products and their features through EDA. Then we will work towards a classification model that identifies which features made products sell well, especially on Black Friday, so that Dillard's could recommend products with these features to customers on future Black Fridays to increase sales. We will also expand our model selection in the following weeks, including additional models and model validation, but building a classification model is our first step.
Data Cleaning:
- Export a Subset: select a subset of Black Friday sales data and export the file for us to do some EDA (select * from group_2.trnsact where saledate='2004-11-24';). Please visit: ESD_Dataset_Black_Friday_subset. Meanwhile, we update this new table in our database as well.
- Deal with null values and do some basic EDA on the TRNSACT table; details here: week 6 update
- Continue with stage 2: Data Cleaning. Focus more on removing null values, standardizing data types, EDA, and data visualization.
- Inner join TRNSACT with the skstinfo table to find the retail price, so we can decide whether to drop or keep the transactions that have orgprice=0
- Work towards feature selection and build the classification model
Week 5
In week 5, we are at stage 2 (Data Cleaning).
We changed the data type of each column in the five tables to make sure every column's type is correct.
For the skuinfo table, some values in the packsize column do not make sense, including "G", "N/A", "Bizarre", "Promo test", and so on. To keep these SKUs on record, we replaced the strange values with the mode, which is 1. For the detailed script, please visit: here.
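The packsize cleanup can be sketched as mode imputation; the toy values and the digits-only validity check below are simplifying assumptions, not the exact rule we applied:

```python
from collections import Counter

def impute_with_mode(values, is_valid):
    """Replace invalid entries with the mode of the valid ones."""
    valid = [v for v in values if is_valid(v)]
    mode = Counter(valid).most_common(1)[0][0]
    return [v if is_valid(v) else mode for v in values]

cleaned = impute_with_mode(
    ["1", "2", "G", "N/A", "1"],
    lambda v: v.isdigit(),
)
# → ["1", "2", "1", "1", "1"]
```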
This week, we spent a lot of time exploring the project research topic we want to explore:
One direction is to build a classification model based on products' features (e.g., color, style, vendor, brand), so that we could use these classifications to predict the preferences of different customers.
Another possible direction is to help Dillard's increase sales by identifying which feature combinations are most and least popular, then applying different discounts accordingly to increase sales revenue. This could be done with machine learning models such as a random forest.
Continue with stage 2: Data Cleaning. Determine the project research direction and clean the datasets accordingly (e.g. drop unnecessary columns, select a subset of data to work with).
After filtering the data we need, check it thoroughly for bizarre and null values that don't make sense; drop those rows, or replace them with the mean/mode depending on the variable type and the distribution of the data.
Choose the ML/clustering model we are going to work with.
Start EDA: analyze and investigate the datasets and summarize their main characteristics, supported by data visualization.
Week 4
This week, we performed basic data cleaning and imported the dataset into the PostgreSQL server.
We also made some summary statistics about the dataset. See more details here.
We will continue working on data cleaning and understanding the data, including a basic EDA process.
After we have a decent understanding of the dataset, we can brainstorm interesting machine-learning questions to work on for the rest of the weeks.
This project is to investigate a machine learning related question with the Dillard's point-of-sale dataset.
Suggested process to undertake:
- Understand the data
- Perform data exploration (number of SKUs, number of items per basket, number of stores, most frequently purchased items, busiest stores, etc.)
- Find a machine learning related question to address
- Feature selection and engineering
- Modeling
- Dashboards and story telling
- ROI – make appropriate assumptions (support the numbers you use with sources from the web)
Every Friday by 5 pm, each team has to post a weekly update of a few paragraphs. (The paragraphs should cover the work from the previous week, not a summary from the beginning. They must also include the tasks for next week with clear goals.) If you miss it, 2% of the total grade is subtracted for each team member. You must post it on GitHub as part of a single document with a clearly visible date for each update. The README file on GitHub must have the project description with the tasks (what is already on Canvas pertaining to your project). The first update is due Oct 13 at 5 pm.
Project deliverables posted on GitHub:
- Deck of slides for a 15-minute presentation (audience: chief data scientists, i.e., someone who is a manager but can also be technically dangerous); assume that the audience does not know anything about the project.
- Final report not to exceed 5 pages (it can have an appendix). Code should not be part of the report. Audience: Chief data scientists
- An ROI analysis must be included in the report.
- Scripts developed
- Technical merit 30%
- Data visualizations 25%
- Final presentation 20%
- Final report (structure, content) 25%
Everything is due on Friday, December 5 at 5 pm. (At that time the repository will be cloned.)