This is a step-by-step tutorial for text analysts who want an easy start with basic and common techniques in NLP, Text Analysis, Machine Learning, Topic Modelling and Corpus Linguistics. The tutorial is part of the "Visualise My Corpus" UCREL and DSG Seminar and Tutorial, as well as the "Data Visualisation Workshop for Critical Computational Discourse" at the Data Science Institute at Lancaster University, UK.
You can find the Arabic-customised version of this tutorial here: https://github.com/drelhaj/NLP_ML_Visualization_Tutorial/tree/master/Arabic_Tutorial
Dr Mahmoud El-Haj https://www.lancaster.ac.uk/staff/elhaj
If you attended the 'Visualise My Corpus' talk, here are the introductory slides: https://www.lancaster.ac.uk/staff/elhaj/docs/visualise_my%20_corpus.pdf
A step by step presentation of the tutorials: https://youtu.be/g6tUQxIVesA
The repository is made up of 6 tutorials as follows:
- 1- Visualisation using SpaCy: a basic introduction to using SpaCy to visualise part-of-speech tagging and named entity recognition.
- 2- Topic Modelling: using LDA and LDAvis to display an interactive topic model.
- 3- Word Clouds: an introduction to creating word clouds, starting from basic word frequency and moving on to clouds that focus on particular part-of-speech tags.
- 4- Machine Learning: a basic introduction to SVM and Naive Bayes. This is a simple classifier, and the results are shown in a confusion matrix.
- 5- Word Usage: shows word usage in terms of frequency over a period of time.
- 6- Word Embeddings: a gentle start to word embeddings using gensim, visualising the vectors using t-SNE and PCA.
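As a taste of the "basic word frequency" idea behind the Word Clouds and Word Usage tutorials, here is a minimal sketch using only the Python standard library. It is not taken from the notebooks: the function name `word_frequencies`, the sample sentence and the tiny stopword set are illustrative assumptions, and the tokenisation is deliberately simple.

```python
import re
from collections import Counter


def word_frequencies(text, stopwords=frozenset()):
    """Count lowercase word tokens in `text`, skipping any in `stopwords`.

    Hypothetical helper for illustration; the notebooks may tokenise differently.
    """
    # Very simple tokenisation: runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t not in stopwords)


# Made-up sample text and stopword list (assumptions, not tutorial data).
sample = "The cat sat on the mat. The dog sat on the log."
freqs = word_frequencies(sample, stopwords={"the", "on"})

# Print the three most frequent remaining words with their counts.
print(freqs.most_common(3))
```

Counts like these are the raw input a word cloud scales its font sizes by. Computing them separately for each time slice of a corpus gives the frequency-over-time view described in tutorial 5.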
You need Jupyter (https://jupyter.org/) to run the notebooks. Check 0_Visualisation_Setup.ipynb for the required Python packages: https://github.com/drelhaj/NLP_ML_Visualization_Tutorial/blob/master/0_Visualisation_Setup.ipynb
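Before opening the notebooks, it can help to check which packages are already importable. The sketch below uses only the standard library. The `PACKAGES` list is an assumption that covers just the two import names mentioned in this README (`spacy` and `gensim`); the setup notebook remains the authoritative list.

```python
from importlib.util import find_spec

# Assumed, partial list: only packages named in this README.
# See 0_Visualisation_Setup.ipynb for the full set of requirements.
PACKAGES = ["spacy", "gensim"]


def missing_packages(names):
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(PACKAGES)
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All listed packages are installed.")
```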