An approach to kickstart EDA on a dataset that has many text, numerical, categorical, and datetime features, such as TED Talks, when domain knowledge is limited. The idea, approach, and code are generic, so they should apply to almost any dataset.
Contents:
- Text Preprocessing
- Creation of 200+ features, mostly on text columns, using basic NLP such as character/token counts, POS and NER tags, and sentiment
- Understanding relations among columns by
  - Visualizing correlation as interactive graphs (currently unweighted)
  - Feature clustering based on correlation
- n-grams and Keyphrase extraction
- A Talks Recommendation Engine
- Topic Modelling and Text Clustering
Please find other input/intermediate files here
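
As a rough illustration of the simplest text features and the n-gram step listed above, here is a minimal sketch using only the Python standard library. The function names (`tokenize`, `basic_text_features`, `top_ngrams`), the regex tokenizer, and the sample strings are assumptions made for this example; they are not taken from the project's code, which may use different tooling.

```python
import re
from collections import Counter

# Assumption: a simple regex tokenizer; the project may use a real NLP tokenizer.
TOKEN_RE = re.compile(r"[A-Za-z0-9']+")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())


def basic_text_features(text):
    """Return a few count-based features for one text value."""
    tokens = tokenize(text)
    return {
        "char_count": len(text),
        "token_count": len(tokens),
        "unique_token_count": len(set(tokens)),
        "avg_token_length": (
            sum(len(t) for t in tokens) / len(tokens) if tokens else 0.0
        ),
    }


def top_ngrams(texts, n=2, k=3):
    """Count n-grams across all texts and return the k most frequent."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        # zip over n shifted copies of the token list yields consecutive n-grams
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts.most_common(k)


if __name__ == "__main__":
    # Hypothetical sample descriptions, not rows from the TED Talks data.
    sample = [
        "The power of vulnerability",
        "The power of introverts",
    ]
    for row in sample:
        print(basic_text_features(row))
    print(top_ngrams(sample, n=2, k=2))
```

Applying `basic_text_features` to every text column and prefixing each key with the column name is one way such a feature set can grow quickly; POS, NER, and sentiment features would need an NLP library and are left out of this sketch.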