This is more of a cheat sheet than a serious project with ambitious goals. The data is one year of Reddit posts with the #wot hashtag (posts_Reddit_wot_en.csv). Text vectorization was done with four main methods: BoW, TF-IDF, PV-DM, PV-DBOW.
The clustering method is always K-means++, because I believe that changing it has little impact compared to changing the vectorization technique. Visualization is performed via MDS and PCA.
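The vectorize, cluster, visualize flow described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it assumes scikit-learn is installed, and the toy `posts` list, the `cluster_posts` helper, and the choice of two clusters are all hypothetical stand-ins for the real dataset and settings.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy posts; the real data lives in posts_Reddit_wot_en.csv.
posts = [
    "new tank line announced for the next patch",
    "which tier 8 premium tank is worth buying",
    "heavy tank armor feels too weak after the patch",
    "server lag and disconnects during clan battles",
    "ping spikes on the EU server again tonight",
    "matchmaking queue times are long on the server",
]


def cluster_posts(texts, n_clusters=2, seed=0):
    """Vectorize with TF-IDF, cluster with K-means++, project to 2-D with PCA."""
    # Sparse TF-IDF matrix, one row per post.
    vectors = TfidfVectorizer(stop_words="english").fit_transform(texts)
    # init="k-means++" selects the K-means++ seeding strategy.
    kmeans = KMeans(n_clusters=n_clusters, init="k-means++", n_init=10,
                    random_state=seed)
    labels = kmeans.fit_predict(vectors)
    # PCA expects a dense array; acceptable for a small dataset.
    coords = PCA(n_components=2, random_state=seed).fit_transform(
        vectors.toarray())
    return labels, coords


labels, coords = cluster_posts(posts)
print(labels)        # one cluster id per post
print(coords.shape)  # one 2-D point per post, ready for a scatter plot
```

Swapping `TfidfVectorizer` for `CountVectorizer` would give the BoW variant, and replacing `PCA` with `sklearn.manifold.MDS` would give the MDS view, so the different combinations can be compared with the clustering step left unchanged.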
git clone git@github.com:bluella/Text-clusterization-overview.git
cd Text-clusterization-overview
virtualenv -p /usr/bin/python3.7 tco_env
source ./tco_env/bin/activate
pip install -r requirements.txt
You are good to go!
TF-IDF showed the best results among the vectorization methods. BoW is a bit less accurate. PV-DM and PV-DBOW deliver really weird results, perhaps because the dataset is too small to train the models properly. PCA visualization seems to agree with the real outcome better than MDS.
- Use a pretrained model for PV-DM, with the help of fastText or similar
This project is licensed under the MIT License - see the LICENSE.md file for details
Much of the code was taken from the following resources: