HiddenMarkovModel
Project 2: Hidden Markov Model
- Set up a new git repository in your GitHub account
- Pick a text corpus dataset such as https://www.kaggle.com/kingburrito666/shakespeare-plays or from https://github.com/niderhoff/nlp-datasets
- Choose a programming language (Python, C/C++, Java)
- Formulate ideas on how machine learning can be used to learn word correlations and distributions within the dataset
- Build a Hidden Markov Model to be able to programmatically
- Generate new text from the text corpus
- Perform text prediction given a sequence of words
- Document your process and results
- Commit your source code, documentation and other supporting files to the git repository in GitHub

GRAPHICAL MODELS
- link to download data: https://www.kaggle.com/therohk/examine-the-examiner?select=examiner-date-tokens.csv
- The Examiner - Spam Clickbait Catalog
- 6 Years of Crowd Sourced Journalism
- This dataset contains the headlines of 3.08 million articles written by ~21,000 authors over six years. While The Examiner was never praised for its quality, it consistently churned out thousands of articles per day over several years. At its height in 2011, The Examiner was ranked highly in search results and had enormous shares on social media. At one point, it was the tenth-largest site on mobile and was attracting twenty million unique visitors a month. As a platform driven towards advertising revenue, most of its content was rushed, unsourced and factually sparse. It still manages to paint a colourful picture of the trending topics over a long period of time.
- We remove stop words, a collection of very common English words. Because they repeat in most sentences, they carry high probabilities but little meaning when we generate new text, and they also create a risk of infinite loops during generation, such as this output I encountered:
King of King of King of King of
- We need to remove all the special characters
- Convert to lower case
- Then we remove the stop words from the headlines
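The three preprocessing steps above can be sketched as follows. The function name `clean_headline` and the small stop-word set are my own illustrative choices (a real run would use a full stop-word list, e.g. NLTK's):

```python
import re

# Hypothetical, deliberately tiny stop-word set for illustration only;
# replace with a complete list (e.g. NLTK's) in practice.
STOP_WORDS = {"the", "of", "a", "an", "and", "in", "to", "is", "it"}

def clean_headline(line):
    """Remove special characters, convert to lower case, drop stop words."""
    line = re.sub(r"[^a-zA-Z0-9\s]", "", line)  # strip special characters
    words = line.lower().split()                # lower-case and tokenize
    return [w for w in words if w not in STOP_WORDS]
```

For example, `clean_headline("King of King!")` drops the stop word `of`, which is exactly the kind of token that caused the looping output shown above.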
- We can use this function to calculate the frequencies of the immediately next word, or of the second next word
- We can use this function to calculate the probabilities of the immediately next word, or of the second next word
- Using the above methods, we compute both the next-word probabilities and the second-next-word probabilities
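A minimal sketch of the two steps above, with an `offset` parameter selecting the next word (`offset=1`) or the second next word (`offset=2`); the function names `word_freqs` and `word_probs` are assumptions, not the project's actual identifiers:

```python
from collections import defaultdict

def word_freqs(headlines, offset=1):
    """Count how often each word is followed, at the given offset,
    by each other word. headlines is a list of token lists."""
    freqs = defaultdict(lambda: defaultdict(int))
    for words in headlines:
        for i in range(len(words) - offset):
            freqs[words[i]][words[i + offset]] += 1
    return freqs

def word_probs(freqs):
    """Normalize the counts into conditional probabilities P(follower | word)."""
    probs = {}
    for word, followers in freqs.items():
        total = sum(followers.values())
        probs[word] = {f: c / total for f, c in followers.items()}
    return probs
```

Calling `word_freqs(headlines, offset=1)` then `word_probs(...)` gives the next-word probabilities; the same pair with `offset=2` gives the second-next-word probabilities.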
- It takes a line of at least two words as input; using the last word and the penultimate word, we predict the following word from the next-word probabilities and the second-next-word probabilities respectively. This continues until we reach the limit on the number of words to be predicted, which is passed as a parameter
- Input: 10 free
  Output: 10 free food fun wine events weekend week april
- Input: top pop
  Output: top pop music awards 2011 2010 season 2 episode 2 recap spoilers
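The prediction step described above could look like the sketch below. How the two probability tables are combined is an assumption on my part (here the scores from the last word and the penultimate word are simply summed and the best-scoring candidate is taken); the name `predict` is also illustrative:

```python
def predict(line, n_words, next_probs, second_probs):
    """Extend a line of at least two words by n_words predicted words.

    Candidates are scored from the last word via next-word probabilities
    and from the penultimate word via second-next-word probabilities.
    """
    words = line.split()
    for _ in range(n_words):
        last, penult = words[-1], words[-2]
        candidates = dict(next_probs.get(last, {}))
        for w, p in second_probs.get(penult, {}).items():
            candidates[w] = candidates.get(w, 0.0) + p
        if not candidates:
            break  # no known follower for this context
        words.append(max(candidates, key=candidates.get))
    return " ".join(words)
```

With tables learned from the headline corpus, `predict("10 free", 7, next_probs, second_probs)` would greedily extend the seed phrase as in the examples above.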
We can generate some good predictions; although parts of them don't make sense, the sentence formation is still good
- If we specify how many headlines we need along with how many words each, it randomly generates new headlines for us
- Input: 4 lines and 7 words each
- Output:
incomplete untruthful shares trade secrets new jersey city book
62nd annual national primetime day park show weekend 2
disagreements says report may 1 2010 part 2 3
wwii pilots wins first national chicago round 1 2
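The random generation step above can be sketched as sampling a random start word, then repeatedly sampling each next word from the learned next-word distribution. The function name `generate_headlines` and the use of only the next-word table (rather than both tables) are assumptions for illustration:

```python
import random

def generate_headlines(n_lines, n_words, next_probs):
    """Generate n_lines headlines of up to n_words words each by
    sampling from a {word: {follower: probability}} table."""
    starts = list(next_probs)
    lines = []
    for _ in range(n_lines):
        words = [random.choice(starts)]  # random start word
        for _ in range(n_words - 1):
            dist = next_probs.get(words[-1])
            if not dist:
                break  # dead end: current word has no known follower
            words.append(random.choices(list(dist),
                                        weights=list(dist.values()))[0])
        lines.append(" ".join(words))
    return lines
```

Each call produces different headlines, since both the start word and every follower are drawn at random.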