This repository is my final project submission for the General Assembly Data Science Essentials course. I used natural language processing (NLP) and KMeans clustering to identify topics within a Twitter dataset.
The repository comprises two Jupyter notebooks and two output files:
- eda-rea-v-liv-2018.ipynb - This contains the project brief (an outline of the project and its aspirations) along with some exploratory data analysis that provides insights into the dataset I selected.
- nlp-rea-v-liv-2018.ipynb - This contains the final project report, which includes the NLP and KMeans work. The two .txt files are outputs of code in this notebook; I used them to identify parameters that resulted in strong, clear clustering. That combination of parameters was then explored further and visualised in the notebook.
- 2020-09-03 1913-TF-IDF.txt - Contains the results of clustering data with TF-IDF features using KMeans and various parameters.
- 2020-09-04 0651-COUNT-VEC.txt - Contains the results of clustering data with count vectorized features using KMeans and various parameters.
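
For readers who want a feel for the kind of parameter sweep behind the two .txt files, here is a minimal sketch. It is not the code from the notebook. It assumes scikit-learn's `TfidfVectorizer`, `CountVectorizer` and `KMeans`, and it uses the silhouette score as the measure of how clear the clustering is, which is also an assumption. The `tweets` list, the `sweep` function and the values of `k` are hypothetical placeholders.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import silhouette_score

# Made-up placeholder texts; the real tweets are loaded in the notebooks.
tweets = [
    "best coffee I have had all week",
    "this coffee shop never disappoints",
    "espresso and a pastry to start the morning",
    "my train is delayed again this morning",
    "train cancelled so I am walking to work",
    "another delayed train and a packed platform",
    "new album on repeat all day",
    "that album gets better with every listen",
]


def sweep(vectorizer, texts, k_values, seed=0):
    """Cluster `texts` once per value of k and record two quality measures.

    Returns a list of (k, inertia, silhouette) tuples. Using the silhouette
    score is an assumption; the notebook may judge clusters differently.
    """
    features = vectorizer.fit_transform(texts)
    results = []
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed)
        labels = model.fit_predict(features)
        results.append((k, model.inertia_, silhouette_score(features, labels)))
    return results


# Run the same sweep over both feature types, mirroring the two output files.
tfidf_results = sweep(TfidfVectorizer(stop_words="english"), tweets, k_values=(2, 3))
count_results = sweep(CountVectorizer(stop_words="english"), tweets, k_values=(2, 3))

for name, results in (("TF-IDF", tfidf_results), ("COUNT-VEC", count_results)):
    for k, inertia, silhouette in results:
        print(f"{name} k={k} inertia={inertia:.3f} silhouette={silhouette:.3f}")
```

A higher silhouette score (closer to 1) suggests better separated clusters, so comparing rows across the two feature types gives a rough way to pick a combination worth visualising.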