HiddenMarkovModel
Project 2: Hidden Markov Model
- Set up a new git repository in your GitHub account
- Pick a text corpus dataset such as https://www.kaggle.com/kingburrito666/shakespeare-plays or from https://github.com/niderhoff/nlp-datasets
- Choose a programming language (Python, C/C++, Java)
- Formulate ideas on how machine learning can be used to learn word correlations and distributions within the dataset
- Build a Hidden Markov Model to be able to programmatically
- Generate new text from the text corpus
- Perform text prediction given a sequence of words
- Document your process and results
- Commit your source code, documentation and other supporting files to the git repository in GitHub

GRAPHICAL MODELS
- link to download data: https://www.kaggle.com/therohk/examine-the-examiner?select=examiner-date-tokens.csv
- The Examiner - Spam Clickbait Catalog
- 6 Years of Crowd Sourced Journalism
- This dataset contains the headlines of 3.08 million articles written by ~21,000 authors over six years. While The Examiner was never praised for its quality, it consistently churned out thousands of articles per day over several years. At its height in 2011, The Examiner was ranked highly in search results and had enormous shares on social media. At one point, it was the tenth-largest site on mobile and was attracting twenty million unique visitors a month. As a platform driven towards advertising revenue, most of its content was rushed, unsourced and factually sparse. It still manages to paint a colourful picture of the trending topics over a long period of time.
- We remove stop words, a collection of very common English words. Because they repeat in most sentences, they carry high probabilities but little meaning when we generate new text, and they also create a risk of infinite loops during generation, such as this output I encountered:
King of King of King of King of
- We need to remove all the special characters
- Convert to lower case
- Then we remove the stop words from the headlines
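The three preprocessing steps above can be sketched as follows. The function name `clean_headline` and the small stop-word set are my own illustrative choices (a real run would use a full stop-word list, e.g. NLTK's):

```python
import re

# Hypothetical, deliberately tiny stop-word set for illustration only;
# replace with a complete list (e.g. NLTK's) in practice.
STOP_WORDS = {"the", "of", "a", "an", "and", "in", "to", "is", "it"}

def clean_headline(line):
    """Remove special characters, convert to lower case, drop stop words."""
    line = re.sub(r"[^a-zA-Z0-9\s]", "", line)  # strip special characters
    words = line.lower().split()                # lower-case and tokenize
    return [w for w in words if w not in STOP_WORDS]
```

For example, `clean_headline("King of King!")` drops the stop word `of`, which is exactly the kind of token that caused the looping output shown above.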
- We can use this function to calculate the frequencies of the immediately next word, or of the second next word
- We can use this function to calculate the probabilities of the immediately next word, or of the second next word
- Using the above methods, we compute both the next-word probabilities and the second-next-word probabilities
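A minimal sketch of the two steps above, with an `offset` parameter selecting the next word (`offset=1`) or the second next word (`offset=2`); the function names `word_freqs` and `word_probs` are assumptions, not the project's actual identifiers:

```python
from collections import defaultdict

def word_freqs(headlines, offset=1):
    """Count how often each word is followed, at the given offset,
    by each other word. headlines is a list of token lists."""
    freqs = defaultdict(lambda: defaultdict(int))
    for words in headlines:
        for i in range(len(words) - offset):
            freqs[words[i]][words[i + offset]] += 1
    return freqs

def word_probs(freqs):
    """Normalize the counts into conditional probabilities P(follower | word)."""
    probs = {}
    for word, followers in freqs.items():
        total = sum(followers.values())
        probs[word] = {f: c / total for f, c in followers.items()}
    return probs
```

Calling `word_freqs(headlines, offset=1)` then `word_probs(...)` gives the next-word probabilities; the same pair with `offset=2` gives the second-next-word probabilities.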
- It takes a line of at least two words as input; using the last word and the penultimate word, we predict the following word from the next-word probabilities and the second-next-word probabilities respectively. This continues until we reach the limit on the number of words to be predicted, which is passed as a parameter
- Input: 10 free
  Output: 10 free food fun wine events weekend week april
- Input: top pop
  Output: top pop music awards 2011 2010 season 2 episode 2 recap spoilers
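The prediction step described above could look like the sketch below. How the two probability tables are combined is an assumption on my part (here the scores from the last word and the penultimate word are simply summed and the best-scoring candidate is taken); the name `predict` is also illustrative:

```python
def predict(line, n_words, next_probs, second_probs):
    """Extend a line of at least two words by n_words predicted words.

    Candidates are scored from the last word via next-word probabilities
    and from the penultimate word via second-next-word probabilities.
    """
    words = line.split()
    for _ in range(n_words):
        last, penult = words[-1], words[-2]
        candidates = dict(next_probs.get(last, {}))
        for w, p in second_probs.get(penult, {}).items():
            candidates[w] = candidates.get(w, 0.0) + p
        if not candidates:
            break  # no known follower for this context
        words.append(max(candidates, key=candidates.get))
    return " ".join(words)
```

With tables learned from the headline corpus, `predict("10 free", 7, next_probs, second_probs)` would greedily extend the seed phrase as in the examples above.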
We can generate some good predictions; although parts of them don't make sense, the sentence formation is still good
- If we specify how many headlines we need along with how many words each, it randomly generates new headlines for us
- Input: 4 lines and 7 words each
- Output:
incomplete untruthful shares trade secrets new jersey city book
62nd annual national primetime day park show weekend 2
disagreements says report may 1 2010 part 2 3
wwii pilots wins first national chicago round 1 2
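The random generation step above can be sketched as sampling a random start word, then repeatedly sampling each next word from the learned next-word distribution. The function name `generate_headlines` and the use of only the next-word table (rather than both tables) are assumptions for illustration:

```python
import random

def generate_headlines(n_lines, n_words, next_probs):
    """Generate n_lines headlines of up to n_words words each by
    sampling from a {word: {follower: probability}} table."""
    starts = list(next_probs)
    lines = []
    for _ in range(n_lines):
        words = [random.choice(starts)]  # random start word
        for _ in range(n_words - 1):
            dist = next_probs.get(words[-1])
            if not dist:
                break  # dead end: current word has no known follower
            words.append(random.choices(list(dist),
                                        weights=list(dist.values()))[0])
        lines.append(" ".join(words))
    return lines
```

Each call produces different headlines, since both the start word and every follower are drawn at random.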