This project aims to build and test a predictive model for car crashes using consumer complaints data from Honda. The analysis leverages SAS Enterprise Miner (SAS EM) and Python to create a decision tree model that predicts the likelihood of car crashes. The key features considered include car model, manufacturing year, presence of anti-brake system, cruise control, miles per hour, overall mileage, complaint topics, and sentiment.
This project was submitted as part of a class project I submitted for STAT 656 : Applied Analytics taught by Edward Jones at Texas A&M University - College Station in 2018.
Note: SAS Enterprise Miner has paid subscription which was provided through my school. If you are interested in replicating the same SAS results, please reach out and I can provide the ".sas7bdat" files used in this project
The dataset consists of 5330 complaints, with approximately 10.7% reporting a crash. Key attributes include car make, model year, anti-brake system presence, cruise control, miles per hour, and overall mileage. Complaints are categorized into seven topics based on description, and sentiment analysis is performed to determine the average sentiment score for each complaint.
-
Topic Analysis : Data preprocessing involved handling outliers and imputing missing values. Text mining identified seven major topics from customer complaints: Airbag, Brake and Acceleration, Headlight, Mileage, SRS Light, Tire, and Transmission.
-
Text Sentiment: Sentiment analysis was conducted using a predefined sentiment-words list, yielding an average sentiment score of -1.24, indicating predominantly negative feedback.
-
10-Fold Cross Validation: Various decision tree depths were tested to find the optimal configuration. For the data merged with the Text Topic node, a depth of 15 was optimal, while a depth of 14 was best for the Text Cluster result.
-
Final Model Using 70/30 Cross Validation: The best-performing model was the Non-HP decision tree using the Text Topic approach with the Entropy configuration for nominal targets.
-
Topic Analysis: NLTK was used for text analytics. The complaints were preprocessed to handle synonyms and tokenized. TF-IDF was used to obtain the term-frequency matrix, and TruncatedSVD defined seven topics.
-
Sentiment Analysis: Sentiment scores were calculated without removing stop words to retain sentiment expressions.
-
Decision Tree Model Analysis: Attributes were encoded, and 10-fold cross-validation was used to select the best tree depth. The decision tree with a depth of 8 had the best performance.
Python was used to scrape news articles related to Takata, identifying significant negative sentiment and public discourse due to safety issues with Takata airbags.
-
SAS EM Decision Tree: The Non-HP decision tree using the Entropy configuration and Text Topic approach had the best performance, with low misclassification rate and high specificity, precision, and accuracy, but lower recall.
-
Python Decision Tree: The decision tree with a depth of 8 showed optimal performance with favorable metrics.
-
The Airbag topic was found to be the most significant predictor of crashes.
-
Given the negative press surrounding Takata, it is recommended that Honda reassesses its relationship with Takata and enforces stricter quality assurance standards to protect its brand image.
-
This project demonstrates the effectiveness of using both SAS EM and Python for predictive modeling and text analytics in identifying key factors contributing to car crashes.
Supporting images and additional documentation are included in the APPENDICES - HONDA COMPLAINTS.doc file to provide further insights into the analysis methods and results.
Feel free to reach out if you have any questions/suggestions or ideas for a project. Thanks for reading!