This is a simple POS Tagging sample.
Implemented network architectures:
- A Bidirectional LSTM layer and a Dense/Fully-Connected layer on top
- Using a GRU instead of the LSTM
- Adding an additional LSTM layer
- Adding an additional dense layer
Dataset used: dependency_treebank
In file PreprocessData.py, the corpus files are read sentence by sentence and word by word. After reading, all the data are pickled and stored to the SSD as pickled_data.pkl. In this step we store all of the data and do some simple organization ourselves, including:
- Get the total word set
- Get the total token set
- Get the training data X
- Get the training tokens Y
- Get the validation data X
- Get the validation tokens Y
- Get the test data X
- Get the test tokens Y
- Replace "&" to ",", since we find in training token set there's a "&" but in validation and test token set there's not.
In file PickleGloVe.py, we pickle the GloVe dictionary and store it to the SSD so that the GloVe data can be read back much faster later. GloVe file chosen: glove-wiki-gigaword-300.txt
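A minimal sketch of this step, assuming a plain-text GloVe file with one "word v1 ... v300" entry per line; the output file name is an illustrative assumption.

```python
# Illustrative sketch of PickleGloVe.py.
import pickle
import numpy as np

glove = {}
with open("glove-wiki-gigaword-300.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        glove[parts[0]] = np.asarray(parts[1:], dtype="float32")   # word -> 300-d vector

with open("pickled_glove.pkl", "wb") as f:                         # file name is an assumption
    pickle.dump(glove, f)
```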
In file PaddingSample.py, all the data we need are padded and stored to the SSD.
The padding length is fixed at 100. We first tokenize the dictionaries Total_words and Total_tokens and construct 4 maps:
- word2int: e.g. word2int["joy"] = 1
- int2word: e.g. int2word[1] = "joy"
- token2int: e.g. token2int["NN"] = 2
- int2token: e.g. int2token[2] = "NN"
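A minimal sketch of how these 4 maps can be built from Total_words and Total_tokens; reserving index 0 for padding is an assumption here, not necessarily what PaddingSample.py does.

```python
# Illustrative construction of the four lookup maps (index 0 reserved for padding by assumption).
word2int = {w: i for i, w in enumerate(sorted(Total_words), start=1)}
int2word = {i: w for w, i in word2int.items()}
token2int = {t: i for i, t in enumerate(sorted(Total_tokens), start=1)}
int2token = {i: t for t, i in token2int.items()}
```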
After we have these 4 maps, we can easily encode all of the data we need. The data we need are:
- encoded_train_X: samples for training
- encoded_train_Y: tags for training
- encoded_validation_X: samples for validation
- encoded_validation_Y: tags for validation
- encoded_test_X: samples for test
- encoded_test_Y: tags for test
With all of this data, we store it to the SSD as padded_samples:
padded_samples = {"train_X": train_X, "train_Y": train_Y, "validation_X": validation_X, "validation_Y": validation_Y, "test_X": test_X, "test_Y": test_Y, "MAX_SEQ_LENGTH": 100, "int2token": int2token, "int2word": int2word}
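A minimal sketch of the encoding, padding, and pickling, assuming Keras' pad_sequences and the raw word/tag lists produced during preprocessing; post-padding and the output file name are assumptions.

```python
# Illustrative sketch of PaddingSample.py (post-padding and file name are assumptions).
import pickle
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_SEQ_LENGTH = 100

def encode(sequences, mapping):
    """Map each word/tag to its integer id and pad every sequence to MAX_SEQ_LENGTH."""
    ids = [[mapping[item] for item in seq] for seq in sequences]
    return pad_sequences(ids, maxlen=MAX_SEQ_LENGTH, padding="post")

encoded_train_X = encode(train_X, word2int)
encoded_train_Y = encode(train_Y, token2int)
encoded_validation_X = encode(validation_X, word2int)
encoded_validation_Y = encode(validation_Y, token2int)
encoded_test_X = encode(test_X, word2int)
encoded_test_Y = encode(test_Y, token2int)

padded_samples = {"train_X": encoded_train_X, "train_Y": encoded_train_Y,
                  "validation_X": encoded_validation_X, "validation_Y": encoded_validation_Y,
                  "test_X": encoded_test_X, "test_Y": encoded_test_Y,
                  "MAX_SEQ_LENGTH": MAX_SEQ_LENGTH,
                  "int2token": int2token, "int2word": int2word}

with open("padded_samples.pkl", "wb") as f:
    pickle.dump(padded_samples, f)
```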
We embed the words with 300-dimensional GloVe vectors.
Read all the dictionaries we need from the SSD; these are the data we stored during preprocessing:
- train_words
- validation_words
- test_words
The embedding vocabulary is built as follows:
- We have V1 = the GloVe dictionary.
- Find the words that exist in train_words but not in V1; call this set OOV1.
- Build a new dictionary V2 = OOV1 + the part of V1 whose words exist in both train_words and V1.
- Find the words that exist in validation_words but not in V2; call this set OOV2.
- Build a new dictionary V3 = OOV2 + the part of V2 whose words exist in both validation_words and V2.
- Find the words that exist in test_words but not in V3; call this set OOV3.
- Build a new dictionary V4 = OOV3 + the part of V3 whose words exist in both test_words and V3.
- After getting the final dictionary, we turn it into matrices to obtain the embedding weights.
The steps in this embedding pipeline may look a little strange. However, OOV words are simply given uniformly random vectors via np.random.uniform(size=(1, EMBEDDING_DIMENSION)).
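A minimal sketch of building the embedding weight matrix, reusing the glove dictionary and word2int map from the sketches above: words found in GloVe keep their 300-d vector, OOV words get the random vector described above. The extra row for the padding index is an assumption.

```python
# Illustrative embedding-matrix construction (padding row at index 0 is an assumption).
import numpy as np

EMBEDDING_DIMENSION = 300
VOCABULARY_SIZE = len(word2int) + 1   # +1 for the padding index

embedding_weights = np.zeros((VOCABULARY_SIZE, EMBEDDING_DIMENSION))
for word, idx in word2int.items():
    if word in glove:
        embedding_weights[idx] = glove[word]                                                  # known GloVe vector
    else:
        embedding_weights[idx] = np.random.uniform(size=(1, EMBEDDING_DIMENSION)).flatten()   # OOV word
```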
In file RnnModel.py, several networks can be built:
- Bidirectional_LSTM_Model
- Gru_Model
- LSTM_Model
- Bidirectional_LSTM_Model with 2 dense layers
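For reference, a hedged Keras (tf.keras 2.x) sketch of the first variant, a Bidirectional LSTM with a dense layer on top; the layer sizes, frozen embeddings, and sparse loss are illustrative assumptions, not necessarily what RnnModel.py uses.

```python
# Illustrative Bidirectional LSTM tagger (sizes, frozen GloVe weights, and loss are assumptions).
from tensorflow.keras.initializers import Constant
from tensorflow.keras.layers import Bidirectional, Dense, Embedding, LSTM, TimeDistributed
from tensorflow.keras.models import Sequential

def build_bidirectional_lstm(vocabulary_size, embedding_size, max_seq_length,
                             n_tokens, embedding_weights):
    model = Sequential([
        Embedding(input_dim=vocabulary_size, output_dim=embedding_size,
                  input_length=max_seq_length,
                  embeddings_initializer=Constant(embedding_weights),
                  trainable=False),                       # keep the GloVe vectors fixed
        Bidirectional(LSTM(64, return_sequences=True)),   # one output per time step
        TimeDistributed(Dense(n_tokens, activation="softmax")),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",  # tags are integer-encoded above
                  metrics=["accuracy"])
    model.summary()
    return model
```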
Create an instance of POSTaggingModel and pass in the general parameters.
POSTaggingModel takes: N_TOKENS, VOCABULARY_SIZE, EMBEDDING_SIZE, MAX_SEQ_LENGTH, train_X, train_Y, validation_X, validation_Y, embedding_weights, batch_size, epoch, name
Use POSTaggingModel.buildModel(type) to build a model. The allowed values of type are fixed in a private list: ["GRU Model", "Bidirectional LSTM Model", "LSTM Model", "Bidirectional LSTM Model with 2 Dense", "2 LSTM Model"]. Only a type from this list will build a model.
POSTaggingModel.buildModel(type) also compiles the model and prints its summary.
After this, use model.fit() to train the model.
After training, use model.evaluateModel() to evaluate the model on the test set and save the report.
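An end-to-end usage sketch based only on the interface described above; the concrete values and the exact call signatures (e.g., whether fit() and evaluateModel() take arguments) are assumptions.

```python
# Illustrative usage of POSTaggingModel; values and call signatures are assumptions.
from RnnModel import POSTaggingModel

model = POSTaggingModel(N_TOKENS=len(int2token) + 1,
                        VOCABULARY_SIZE=len(int2word) + 1,
                        EMBEDDING_SIZE=300,
                        MAX_SEQ_LENGTH=100,
                        train_X=encoded_train_X, train_Y=encoded_train_Y,
                        validation_X=encoded_validation_X, validation_Y=encoded_validation_Y,
                        embedding_weights=embedding_weights,
                        batch_size=128, epoch=10,
                        name="bidirectional_lstm_tagger")

model.buildModel("Bidirectional LSTM Model")   # must be one of the type strings listed above
model.fit()                                    # train on the training split, validate on the validation split
model.evaluateModel()                          # evaluate on the test split and save the report
```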