By Claire Donovan and Barry Plunkett
Our constrained task was to produce a named entity recognition model that could achieve as high an F-score as possible on the Spanish data.
The model identified four kinds of named entities separately:
- PER: Person
- LOC: Location
- ORG: Organization
- MISC: Miscellaneous

For each type of entity, we also attempted to distinguish between the beginning of an entity (e.g. B-PER) and the "inside" (e.g. I-PER), meaning any token included in the named entity except the first. The model assigned an "O" tag to non-entity tokens.
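For illustration (this example is our own and not drawn from the data set), a short Spanish phrase would be tagged like this under the scheme above:

```python
# Illustrative only: BIO tags for "El Banco de España anunció ayer".
tokens = ["El", "Banco", "de",    "España", "anunció", "ayer"]
tags   = ["O",  "B-ORG", "I-ORG", "I-ORG",  "O",       "O"]
```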
Below, we highlight some features we implemented, which we believed could carry latent predictive signal about named-entity status, and we describe our intuition for why each feature could be important (a sketch of how these features could be assembled in code follows the list):
- Word: Simply the unprocessed word given by each token. The intuition for this feature's importance is straightforward: some words are probably more likely than others to be named entities. For example, "Pablo" is probably more likely to be a named entity than "truck". We included this feature for offset words in a symmetric two-token window around the token in question. For this feature and for others, we believed including information about surrounding tokens could help classify ambiguous words that don't necessarily refer to named entities. For example, "Los", meaning "the" in Spanish, is probably not part of a named entity unless the next word is "Angeles" or something similar. A similar argument can be made for looking at past words for tokens like "de", meaning "of", since it's probably not part of a named entity unless the previous word was a name.
- Lemma: The dictionary form of a token; in other words, the token with inflectional endings and prefixes removed. Lemmatization also removes the conjugations of verbs: for example, the lemma of "went" is "go." We determined lemmas for each word by referencing a list of 400,000 Spanish word-lemma pairs available here. We included lemmas in a two-word symmetric window around the token in question for the same reason we included words: some lemmas are probably more likely to be named entities than others. We didn't think lemmas would be redundant with words because we thought they might help the model generalize patterns across different tokens that share a lemma. Additionally, we were concerned that having two sources of variance (syntax and semantics) might undercut the value of the word feature; we hoped including lemmas, which are primarily influenced by semantics, would help the model distinguish these signals. Based on the same intuition that generalizing across morphologies could help identify patterns among named entities, we also extracted prefixes and suffixes from each target word as features.
- Part of Speech (POS): The part of speech of each word in a two-word symmetric window around the target token. We simply extracted POS tags from the NLTK corpus. We included POS because we believed words with certain parts of speech would be more likely than others to be named entities. For example, proper nouns are probably included in named entities more often than prepositions or conjunctions. We included this feature for a symmetric window around the current token because, like our word and lemma heuristics, the information encoded by part of speech is likely context dependent. For instance, if a preposition is surrounded by two proper nouns, it's probably more likely to be in a named entity phrase than if the same preposition were surrounded by a verb and a noun.
- First Letter Capitalized: This feature is pretty self-explanatory. We encoded whether the first letter of a word is capitalized as a simple 0-1 indicator feature for each word. We thought this would be useful because named entities tend to have the first letters of their words capitalized: names, places, and famous events, for example, usually contain capitalized words. Of course, we thought the value of this feature might be undermined by words that are capitalized due to grammatical convention even though they're not part of a named entity. In particular, we were concerned about words occurring at the start of sentences, so we also considered:
- An indicator feature for whether a word was at the start of a sentence
- A numeric feature for the index of a token in a sentence
- Entire Word Capitalized: Again, self-explanatory. We encoded whether the entire token was capitalized with an indicator feature. We thought this would be valuable because acronyms, which tend to be assigned to government entities, firms, stock tickers, and other official organizations, are usually fully capitalized and refer to named entities. It's rare for words to be fully capitalized and not be named entities (unless they're one-letter words at the start of a sentence).
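Concretely, here is a minimal sketch of how the per-token feature dictionaries described above could be assembled. The helper names `lemma_of` and `pos_of` are hypothetical stand-ins for our lemma-list lookup and the NLTK-based POS lookup, and the three-character affixes and feature names are illustrative choices rather than our exact implementation.

```python
def token_features(sent, i, lemma_of, pos_of, window=2):
    """Build a feature dict for the i-th token of a tokenized sentence."""
    word = sent[i]
    feats = {
        "word": word,
        "lemma": lemma_of(word),          # lemma-list lookup (hypothetical helper)
        "pos": pos_of(sent, i),           # NLTK-based POS lookup (hypothetical helper)
        "prefix3": word[:3],              # affix length of 3 is an illustrative choice
        "suffix3": word[-3:],
        "first_cap": word[:1].isupper(),  # first letter capitalized
        "all_cap": word.isupper(),        # entire word capitalized
        "first_word": i == 0,             # indicator: start of sentence
        "sent_index": i,                  # numeric index of token in sentence
    }
    # Offset copies of word / lemma / POS for a symmetric window around i.
    for off in range(-window, window + 1):
        j = i + off
        if off == 0 or not 0 <= j < len(sent):
            continue
        feats["word[%d]" % off] = sent[j]
        feats["lemma[%d]" % off] = lemma_of(sent[j])
        feats["pos[%d]" % off] = pos_of(sent, j)
    return feats
```

The resulting dicts for every token in the corpus can then be vectorized (e.g. with scikit-learn's `DictVectorizer`) into the highly sparse design matrix referred to later.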
Below, we highlight several of the algorithms we used for our classifier and describe why we chose them (a sketch of how these models could be instantiated follows the list):
- Unregularized Perceptron: A simple baseline to compare more complex non-linear models to. Unregularized perceptron models efficiently find boundaries that separate linearly separable data, so we thought this would function as a valuable baseline.
- Multi-Layer Perceptron: A more complex model capable of learning non-linear functions. We wanted to include an MLP in our tested models because, with such complex, high-dimensional data, we expected to observe some non-linearities in the mappings from tokens to labels. Depending on depth, MLPs can have far more weights than an ordinary perceptron, so they require more training data to produce an accurate model. Their capacity to learn non-linearities also makes them vulnerable to overfitting, but including a regularization parameter selected by cross-validation mitigates this risk. (Spoiler: we never managed to successfully implement an MLP due to sparsity.)
- Logistic Regression Classifier: Logistic regression assumes class probabilities of instances can be represented by a squashed general linear function and selects the linear model that minimizes log loss. Since logistic regression learns a linear model, we hypothesized that its results would be similar to the unregularized perceptron's. There is one important distinction: we included $L_2$ regularization in the logistic regression objective function to reduce variance. We tested several values of inverse regularization strength (.01, .1, 1, 10, 100) with cross-validation.
- Adaboost: The Adaboost algorithm learns a strong classifier from numerous weak classifiers. We thought this could be useful because, in our mental model of the problem, each of our features could function as a weak decision rule, which, when combined with others, could be quite powerful. In confirmation of this hypothesis, we found that one of the papers from the shared task used a multi-class version of Adaboost. For our weak classifiers, we simply used sets of decision stumps. We tried training 100, 300, and 600 decision stumps.
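For concreteness, here is a sketch of how these model families could be instantiated with scikit-learn's standard estimators (the MLP is sketched later, in the neural-net section). The cross-validation settings shown (5 folds, macro-averaged F1 for scoring) are illustrative assumptions, not necessarily the exact ones we used.

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.model_selection import GridSearchCV

# Unregularized perceptron baseline.
perceptron = Perceptron(penalty=None)

# L2-regularized logistic regression, with the inverse regularization
# strength C chosen by cross-validation over the values we tested.
# (Fold count and scoring metric here are illustrative choices.)
logreg = GridSearchCV(
    LogisticRegression(penalty="l2"),
    param_grid={"C": [0.01, 0.1, 1, 10, 100]},
    scoring="f1_macro",
    cv=5,
)

# Adaboost over decision stumps; scikit-learn's default weak learner is
# already a depth-1 decision tree, so only n_estimators needs to vary.
adaboost = {n: AdaBoostClassifier(n_estimators=n) for n in (100, 300, 600)}
```

Each estimator exposes the same `fit`/`predict` interface, so swapping models in the experiments below only requires changing which object is trained on the vectorized features.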
First, we used the baseline perceptron and logistic regression classifiers to compare feature combinations and select a few optimal ones for the remaining classifiers. Here's a key for the following table (all values are F-scores):
- C represents the inverse strength of the regularizer
- All: word, lemma, prefix, suffix, first-cap, all-cap, sentence index, first-word-of-sentence, part of speech, symmetric offset width of 2
- No_Sent: All without sentence index feature
- No_Affixes: All without affix features
- No_cap: All without capitalization-related features
- No_lemma: All features without lemmas
- All_narrow: extracts offset features for only a 1-token window around the word
- All_wide: extracts offset features for a 3-token window around the word
Feature Set | LogReg C = .01 | LogReg C = .1 | LogReg C = 1 | LogReg C = 10 | LogReg C = 100 | Perceptron |
---|---|---|---|---|---|---|
All | 48.17 | 57.86 | 66.30 | 67.25 | 66.02 | 43.40 |
No_Sent | 48.31 | 58.00 | 66.05 | 67.33 | 65.84 | 52.67 |
No_lemma | 46.45 | 54.58 | 63.85 | 67.43 | 66.26 | 55.23 |
No_Affixes | 47.51 | 57.28 | 65.61 | 66.72 | 65.33 | 56.99 |
All_narrow | 42.71 | 53.98 | 62.06 | 64.85 | 64.03 | 53.03 |
All_wide | 49.02 | 59.21 | 66.24 | 68.01 | 67.29 | 57.12 |
In general, some regularization seems to improve model performance. This is intuitive given the size of the data set in comparison to the sample space: with only 8000 sentences, we can't expect our sample to generalize well to other similarly small samples when the syntactic and semantic variance of sentences is enormous. Removing the sentence index feature also seems to improve the performance of the unregularized model, which suggests most of the information encoded by sentence index is fairly noisy. As a result, we removed sentence index for the remaining feature sets. Removing affixes or lemmas doesn't seem to have any major effect on the regularized models' performance. Notably, though, the unregularized model improves significantly when they are removed, which is consistent with the model overfitting noise in the affix and lemma dimensions. Finally, shrinking the context window for each word appeared to harm all models, and growing the context window appeared to improve all models.
After completing these experiments, we decided to include all of the features except token index in sentence, because every other feature marginally improved or at least did not harm our models, while removing the sentence index slightly improved our best model. Of course, the same could be said for removing lemmas, but we chose to include this feature in our later models anyway because 1) the change was almost negligible and 2) our hypothesis for the value of lemmas seemed strong. Additionally, since growing the context window to three words improved all of our models by at least 0.5, we decided to include a wider context window in our final model.
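For reference, the feature-set sweep summarized in the table above could be scripted roughly as follows. Here `vectorize` and `conll_f` are hypothetical helpers standing in for our feature extraction/vectorization code and the shared-task F-score evaluation.

```python
from sklearn.linear_model import LogisticRegression, Perceptron

# Rough sketch of the feature-set sweep (helper names are hypothetical).
feature_sets = ["All", "No_Sent", "No_lemma", "No_Affixes", "All_narrow", "All_wide"]
models = {"Perceptron": Perceptron(penalty=None)}
models.update({"LogReg C=%s" % C: LogisticRegression(penalty="l2", C=C)
               for C in (0.01, 0.1, 1, 10, 100)})

scores = {}
for fs in feature_sets:
    # vectorize(fs) is assumed to return sparse train/dev matrices and labels
    # for the named feature set.
    X_train, y_train, X_dev, y_dev = vectorize(fs)
    for name, model in models.items():
        pred = model.fit(X_train, y_train).predict(X_dev)
        scores[(fs, name)] = conll_f(y_dev, pred)  # shared-task F-score (hypothetical helper)
```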
We trained three Adaboost classifiers on these features. For each classifier, we used decision stumps (decision trees with depth = 1), since these are the conventional choice for Adaboost's weak learners. Unfortunately and to our surprise, none of these models even beat the baseline. Initially, we believed we simply weren't allowing the ensemble enough weak learners:
Number of Decision Stumps | F-Score |
---|---|
100 | 19.24 |
300 | 22.14 |
600 | 19.10 |
After Adaboost with 100 decision stumps failed in spectacular fashion, we hypothesized that we simply needed more decision stumps to improve the final classifier, which is a weighted linear combination of the stumps. The slight improvement with 300 stumps strengthened this hypothesis, but when 600 stumps resulted in worse performance, we were stumped ourselves. In reflection, we believe it's possible the weighted final classifier was significantly overfitting the training data, or that decision stumps couldn't produce an adequately strong classifier with such sparse input data.
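One way to distinguish these two explanations, which we sketch here but did not run, would be scikit-learn's `staged_predict`, which scores the boosted ensemble after every round: a dev-set curve that peaks and then declines would point to overfitting rather than weak stumps. The scoring metric and variable names below are illustrative.

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import f1_score

# Fit the largest ensemble once, then score the partial ensembles on the dev
# set after each boosting round (X_*/y_* are the vectorized matrices from earlier).
ada = AdaBoostClassifier(n_estimators=600).fit(X_train, y_train)
dev_curve = [f1_score(y_dev, pred, average="macro")
             for pred in ada.staged_predict(X_dev)]
best_round = max(range(len(dev_curve)), key=dev_curve.__getitem__) + 1
```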
Training a neural net with one hidden layer of logistic sigmoid units turned out to be extremely computationally intensive. Since the input data had over 22,000 highly sparse features, we could only afford to train a neural net with very few hidden units (n = 5). Even with this extremely small number of hidden units, the model took more than 4 hours to train on Biglab, and unsurprisingly, with 5 hidden units it was a less than stellar classifier (F = 14.12). To reduce the computational cost, we attempted two dimensionality reduction techniques implemented by scikit-learn: principal component analysis and truncated singular value decomposition. Unfortunately, the scikit-learn implementations of both threw errors when we tried using them because our data was too sparse. As a result, we weren't able to use MLPs to uncover potential non-linearities, and we report the results of logistic regression with C = 10 as our submission model.
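For completeness, the configuration described above corresponds roughly to the following scikit-learn setup; this is a sketch of the architecture (one hidden layer of five logistic units), not an exact reproduction of our training script.

```python
from sklearn.neural_network import MLPClassifier

# One hidden layer with 5 logistic-sigmoid units over the ~22,000-dimensional
# sparse input.  This single fit is the step that took over four hours on Biglab.
mlp = MLPClassifier(hidden_layer_sizes=(5,), activation="logistic")
mlp.fit(X_train, y_train)
```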
This model seems to struggle with any named entities that contain three or more words; it detects very few of them. Fortunately, these are rare relative to two- and one-word entities, but they are still a significant source of error. The most obvious way to rectify this failure would be to implement a gazetteer of particularly long Spanish named entities.
Finally, we also attempted a 2-pass model. In the first pass, we trained a logistic regression model with all of our aforementioned features on the training data, then used this model to predict named entity tags for the test data. Afterward, we trained a second model with the same features as the first plus features for the gold-standard named entity tags of the previous two words. Since we couldn't use the gold standard of the test data as a feature, we substituted in the predictions made by the first model to populate this feature for the test data, then used the gold-standard-trained model to classify each token. This approach was motivated by the textbook's suggestion to make multiple passes over the data and by the CONLL paper linked in the feature description section. Unfortunately, this approach tanked the F-score for every classifier we tried, including logistic regression with C = 10 (50.12), logistic regression with C = 1 (42.14), Adaboost with 100 decision stumps (21.45), and Adaboost with 600 decision stumps (20.12). Our hypothesis for this failure is that we dramatically overfit the noise in all of our features by training the first and second passes on the same set of features, apart from the tag history. Since the predicted tags were produced from the same features as the rest of the second-pass model, they tended to make the same mistakes as the other features, so combining both in the second-pass model only served to emphasize these failures instead of counteracting them. We hoped that using gold-standard labels when training would prevent this sort of feature "self-cannibalization", but evidently it did not. A more sophisticated approach would entail selecting some subset of our features to include in each model, but we didn't have time for this sort of analysis.
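A rough sketch of the 2-pass pipeline follows; all helper names are hypothetical. `vectorize` builds the first-pass feature matrices described earlier, and `vectorize_with_history` additionally appends tag features for the previous two tokens, taken from whatever tag sequence is supplied.

```python
from sklearn.linear_model import LogisticRegression

# Pass 1: ordinary model with no tag-history features.
X_train, y_train = vectorize(train_sents), train_gold_tags
X_test = vectorize(test_sents)
first = LogisticRegression(penalty="l2", C=10).fit(X_train, y_train)
predicted_test_tags = first.predict(X_test)

# Pass 2: same features plus the tags of the previous two tokens.  Training
# uses gold tags for the history; at test time we substitute the first pass's
# predictions, which is where its mistakes get reinforced.
X_train2 = vectorize_with_history(train_sents, train_gold_tags)
X_test2 = vectorize_with_history(test_sents, predicted_test_tags)
second = LogisticRegression(penalty="l2", C=10).fit(X_train2, y_train)
final_tags = second.predict(X_test2)
```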