Skip to content

Spellcorrect exercise part2

jasonbaldridge edited this page Feb 18, 2013 · 7 revisions

If you haven't completed it already, do SpellCorrect-Exercise.

After you go through all the steps yourself, have a look at a solution that covers all of the steps

Step one: rank candidates using cosine similarity

For each spelling error, compute the cosine between it and all the candidates returned from the inverted index. Then output the top 20 candidates based on the values obtained.

Example output:

> run-main appliednlp.spell.SpellingCorrector "This Facebook app shows that she is there favorite acress in tonw" /usr/share/dict/words /tmp/masc_vocab.txt
[info] Running appliednlp.spell.SpellingCorrector This Facebook app shows that she is there favorite acress in tonw /usr/share/dict/words /tmp/masc_vocab.txt
Detecting spelling errors in: This Facebook app shows that she is there favorite acress in tonw
ERROR: acress
  Candidates: professed acoin apocrenic adherescence accouple acronym crumblingness gleesomeness unvarnishedness trest
  Top 20 (cress,0.7302967433402214) (acres,0.7302967433402214) (Paracress,0.6804138174397718) (actress,0.6172133998483676) (acre,0.6123724356957946) (acupress,0.5773502691896258) (Press,0.5477225575051661) (tress,0.5477225575051661) (Dress,0.5477225575051661) (press,0.5477225575051661) (dress,0.5477225575051661) (acred,0.5477225575051661) (crestless,0.5443310539518174) (creatress,0.5443310539518174) (acridness,0.5443310539518174) (acrestaff,0.5443310539518174) (acceptress,0.5163977794943223) (restress,0.5163977794943223) (sacredness,0.5163977794943223) (pennycress,0.5163977794943223)
ERROR: tonw
  Candidates: Teutondom belton metapeptone tokened atoningly histon toploftical Eciton toxolysis toxicosis
  Top 20 (ton,0.5773502691896258) (tonk,0.5) (tone,0.5) (tons,0.5) (tony,0.5) (tong,0.5) (tongs,0.4472135954999579) (toned,0.4472135954999579) (tonal,0.4472135954999579) (tonga,0.4472135954999579) (tonic,0.4472135954999579) (tones,0.4472135954999579) (tonus,0.4472135954999579) (toner,0.4472135954999579) (tonsor,0.4082482904638631) (tonish,0.4082482904638631) (tonkin,0.4082482904638631) (tongue,0.4082482904638631) (tonjon,0.4082482904638631) (tonous,0.4082482904638631)
[success] Total time: 14 s, completed Feb 4, 2013 4:20:29 PM

Step two: Refactor the code

The code is now become interesting enough that we can start to think about organizing it with some classes, traits, and objects. Add the following trait to your code:

trait CandidateGenerator {
  def apply(typo: String): Set[String]
}

Next, write a class VectorSpaceCandidateGenerator that extends CandidateGenerator. It must take an argument in the constructor that specifies how many candidates should be generated. Otherwise, it can take whichever arguments you feel are necessary in the constructor for it to do its job in the apply method. All the code you need for this is already done in the previous problem, so this is a matter of organization.

Tip: the vector space candidate generator will need the inverted index, so you must do one of the following:

  • build it in the main method and pass it in as an argument
  • pass in the vocabulary to the constructor of VectorSpaceCandidateGenerator and build it in the body of the VectorSpaceCandidateGenerator class
  • create a companion object called VectorSpaceCandidateGenerator that has an apply method that gets the vocabulary as an argument and then builds up the inverted index and creates an instance of VectorSpaceCandidateGenerator directly

My preference is for the last option: it keeps the code required for the vector space stuff in an object that is related to the class that uses it (thus not polluting the main method as the first option would), and it keeps the VectorSpaceCandidateGenerator class free of setup code (thus the code in the class definition is exclusively focused on what that class does, not how it is created).

Step three: Add edit distance candidates

Another way to generate candidates is simple to take all strings that are a single edit from the typo and then filter them through the vocabulary. Peter Norvig wrote a very nice blog post about spelling correction with a beautifully concise way to do this: see How to Write a Spelling Corrector. Read that post, and then add a new class EditDistanceCandidateGenerator that extends CandidateGenerator. In its constructor, it should take an argument distance that indicates how many edits to use in collecting candidates: either one or two. You could do this with an Int, but let's make it so that ONLY one OR two are possible values. We could do this with a Boolean, of course, but this is a good opportunity to learn about sealed traits, which are traits that have a fixed set of implementing classes or objects.

sealed trait NumEdits
object OneEdit extends NumEdits
object TwoEdits extends NumEdits

There are now two objects (note that they are not classes) that can be used where a NumEdits argument is required. That means that in the apply method for the EditDistanceCandidateGenerator class, you can match on the distance argument and then do the right thing.

Hint: To get all the alphabetic characters easily, try 'A' to 'Z' and 'a' to 'z' in the REPL.

Note: You are welcome to look at and adapt the Scala implementation of Norvig's spelling corrector (just the portion of it that gets the edit candidates). It's quite elegant. It uses expressions that will be unfamiliar to many, but it can be teased apart with a bit of effort that we'll discuss in class.

Modify the main method so that it gets the vector space candidates (top 20) and the candidates that are one and two edits away from the typo. The output should now look something like the following.

> run-main appliednlp.spell.SpellingCorrector "This Facebook app shows that she is there favorite acress in tonw" /usr/share/dict/words /tmp/masc_vocab.txt
[info] Running appliednlp.spell.SpellingCorrector This Facebook app shows that she is there favorite acress in tonw /usr/share/dict/words /tmp/masc_vocab.txt
Detecting spelling errors in: This Facebook app shows that she is there favorite acress in tonw
ERROR: acress
  VS: Dress Paracress Press acceptress acre acred acres acrestaff acridness actress acupress creatress cress crestless dress pennycress press restress sacredness tress
  ED1: access acres across actress caress cress
  ED2: Access Acres Across Dress Press Stress abbess access acre acreak acream acred acres acrose across actless actress acupress acyesis address adless adpress afresh aggress agrees airless areas arrest arses assess atresy aureus awless ayless cares caress carest carless cess chess crass cress crest crews criss cross dress duress egress hatress ogress press siress stress tress
ERROR: tonw
  VS: ton tonal tone toned toner tones tong tonga tongs tongue tonic tonish tonjon tonk tonkin tonous tons tonsor tonus tony
  ED1: ton tone tong tonk tons tony tow town
  ED2: Bon Boni Con Conn Cons Don Donn Dont Dow Down Fon Fong Gona Gond Gone Gow Hon Hong How Ione Ioni Jon Jong Joni Know Kong Lone Long Lonk Low Mon Mona Mono Mont Nona None Now Ron Rong Snow Son Song Sony Stone Tong Tony Town Won Wong Wow Zone atone atony bon bond bone bong bonk bony bouw bow con cond cone conk conn cony cow don done dong donk dont dow down enow eon fono fons font fow gon gone gong gony gown hone hong honk how ion jong jow know kon kona lone long low lown mon mona mone mong monk mono mow mown non none now on ona one ons oons oont ow own pon pond pone pong pont pony pow rond rone row snow son sond song sonk sons sow sown stond stone stong stony stow tan tana tane tang tanh tank taw tawn ten tend teng tens tent tew thaw thew thon thone thong thow tin tind tine ting tink tint tiny to toa toad toat tobe toby tock toco tod tode tody toe toed toes toff toft tofu tog toga togs togt toho toi toil toit toke toko tol told tole toll tolt tolu tom tomb tome ton tonal tone toned toner tones tong tonga tongs tonic tonk tons tonus tony too took tool toom toon toop toot top tope toph topi topo tops tor tora torc tore torn toro tort toru tory tos tosh toss tost tosy tot tote toto toty tou toug toup tour tout tow towan towd town towns towny towy tox toxa toy toys toze tron trona tronc trone trow tun tuna tund tune tung tunk tuno tunu tuny tw two tynd vow won wone wong wont wow yon yond yont yow zone
[success] Total time: 16 s, completed Feb 6, 2013 12:17:51 PM

Step four: Generate API documentation

One of the advantages of using a build system like SBT is that it is easy to produce browsable API documentation. On the SBT command line type doc, and then point a browser to the target/api/index.html file. E.g. for me it is:

file:///Users/jbaldrid/devel/applied-nlp/target/api/index.html

Once you have done this, go back to your code and change the CandidateGenerator trait to include documentation that can be picked up by ScalaDoc and added to the browsable documentation.

/**
 * Candidate generators produce valid words from the vocabulary that
 * are close (by some measure) to the typo.
 */
trait CandidateGenerator {

  /**
   * Produce a set of candidates for the typo.
   *
   * @param typo the typo that we need candidates for
   * @return the set of candidates as determined by this candidate generator
   */
  def apply(typo: String): Set[String]
}

"Recompile" the documentation using the doc command in SBT and then reload the index.html file and click on the CandidateGenerator link on the left hand side. You'll see the "Candidate generators produce..." line at the top of the page. Next click on the link for the apply method, and you'll see the documentation about the parameters and the return values.

This is an easy way to document your code. You'll be thanked not only by anyone using your software, but also by your future self when you come back to it after a year.

Step five: Creating and using a unigram language model

We have lots of candidates, but we want to pick the best one. As we discussed (and also discussed in Norvig's article), this involves two components: an error model P(typo|candidate) and a language model P(candidate). Let's first focus on the latter. To get a unigram language model, we simply need to read in a text file (the bigger the better) and count the number of tokens of each word type. The probability of a word is then its count divided by the number of tokens in the entire text. Let's use the all-written.txt file from the MASC corpus that I showed how to create in the tutorial about Unix pipelines. Put that file in the /tmp directory.

Now, modify the main method in SpellingCorrector so that it computes a unigram language model from all-written.txt. Then, in the part that suggests candidates, have it output the word that has the highest probability of all the candidates.

Here's how it should look.

[success] Total time: 25 s, completed Feb 6, 2013 2:08:04 PM
> run-main appliednlp.spell.SpellingCorrector "This Facebook app shows that she is there favorite acress in tonw" /usr/share/dict/words /tmp/masc_vocab.txt /tmp/all-written.txt
[info] Running appliednlp.spell.SpellingCorrector This Facebook app shows that she is there favorite acress in tonw /usr/share/dict/words /tmp/masc_vocab.txt /tmp/all-written.txt
Detecting spelling errors in: This Facebook app shows that she is there favorite acress in tonw
ERROR: acress
  VS: Dress Paracress Press acceptress acre acred acres acrestaff acridness actress acupress creatress cress crestless dress pennycress press restress sacredness tress
  ED1: access acres across actress caress cress
  ED2: Access Acres Across Dress Press Stress abbess access acre acreak acream acred acres acrose across actless actress acupress acyesis address adless adpress afresh aggress agrees airless areas arrest arses assess atresy aureus awless ayless cares caress carest carless cess chess crass cress crest crews criss cross dress duress egress hatress ogress press siress stress tress
  Best: tress
ERROR: tonw
  VS: ton tonal tone toned toner tones tong tonga tongs tongue tonic tonish tonjon tonk tonkin tonous tons tonsor tonus tony
  ED1: ton tone tong tonk tons tony tow town
  ED2: Bon Boni Con Conn Cons Don Donn Dont Dow Down Fon Fong Gona Gond Gone Gow Hon Hong How Ione Ioni Jon Jong Joni Know Kong Lone Long Lonk Low Mon Mona Mono Mont Nona None Now Ron Rong Snow Son Song Sony Stone Tong Tony Town Won Wong Wow Zone atone atony bon bond bone bong bonk bony bouw bow con cond cone conk conn cony cow don done dong donk dont dow down enow eon fono fons font fow gon gone gong gony gown hone hong honk how ion jong jow know kon kona lone long low lown mon mona mone mong monk mono mow mown non none now on ona one ons oons oont ow own pon pond pone pong pont pony pow rond rone row snow son sond song sonk sons sow sown stond stone stong stony stow tan tana tane tang tanh tank taw tawn ten tend teng tens tent tew thaw thew thon thone thong thow tin tind tine ting tink tint tiny to toa toad toat tobe toby tock toco tod tode tody toe toed toes toff toft tofu tog toga togs togt toho toi toil toit toke toko tol told tole toll tolt tolu tom tomb tome ton tonal tone toned toner tones tong tonga tongs tonic tonk tons tonus tony too took tool toom toon toop toot top tope toph topi topo tops tor tora torc tore torn toro tort toru tory tos tosh toss tost tosy tot tote toto toty tou toug toup tour tout tow towan towd town towns towny towy tox toxa toy toys toze tron trona tronc trone trow tun tuna tund tune tung tunk tuno tunu tuny tw two tynd vow won wone wong wont wow yon yond yont yow zone
  Best: town

Step six: Use a bigram language model

Add a bigram language model and use that to predict the correct candidate rather than the unigram model. Output predictions using both the unigram language model and the bigram one. Use the unigram model as a backoff model when the bigram language model doesn't have any entries for the previous word.

Here's what your output should look like.

> run-main appliednlp.spell.SpellingCorrector "This Facebook app shows that she is there favorite acress in tonw" /usr/share/dict/words /tmp/masc_vocab.txt /tmp/all-written.txt
[info] Running appliednlp.spell.SpellingCorrector This Facebook app shows that she is there favorite acress in tonw /usr/share/dict/words /tmp/masc_vocab.txt /tmp/all-written.txt
Detecting spelling errors in: This Facebook app shows that she is there favorite acress in tonw
ERROR: acress
  VS: Dress Paracress Press acceptress acre acred acres acrestaff acridness actress acupress creatress cress crestless dress pennycress press restress sacredness tress
  ED1: access acres across actress caress cress
  ED2: Access Acres Across Dress Press Stress abbess access acre acreak acream acred acres acrose across actless actress acupress acyesis address adless adpress afresh aggress agrees airless areas arrest arses assess atresy aureus awless ayless cares caress carest carless cess chess crass cress crest crews criss cross dress duress egress hatress ogress press siress stress tress
  Best based on unigram LM: across
ERROR: tonw
  VS: ton tonal tone toned toner tones tong tonga tongs tongue tonic tonish tonjon tonk tonkin tonous tons tonsor tonus tony
  ED1: ton tone tong tonk tons tony tow town
  ED2: Bon Boni Con Conn Cons Don Donn Dont Dow Down Fon Fong Gona Gond Gone Gow Hon Hong How Ione Ioni Jon Jong Joni Know Kong Lone Long Lonk Low Mon Mona Mono Mont Nona None Now Ron Rong Snow Son Song Sony Stone Tong Tony Town Won Wong Wow Zone atone atony bon bond bone bong bonk bony bouw bow con cond cone conk conn cony cow don done dong donk dont dow down enow eon fono fons font fow gon gone gong gony gown hone hong honk how ion jong jow know kon kona lone long low lown mon mona mone mong monk mono mow mown non none now on ona one ons oons oont ow own pon pond pone pong pont pony pow rond rone row snow son sond song sonk sons sow sown stond stone stong stony stow tan tana tane tang tanh tank taw tawn ten tend teng tens tent tew thaw thew thon thone thong thow tin tind tine ting tink tint tiny to toa toad toat tobe toby tock toco tod tode tody toe toed toes toff toft tofu tog toga togs togt toho toi toil toit toke toko tol told tole toll tolt tolu tom tomb tome ton tonal tone toned toner tones tong tonga tongs tonic tonk tons tonus tony too took tool toom toon toop toot top tope toph topi topo tops tor tora torc tore torn toro tort toru tory tos tosh toss tost tosy tot tote toto toty tou toug toup tour tout tow towan towd town towns towny towy tox toxa toy toys toze tron trona tronc trone trow tun tuna tund tune tung tunk tuno tunu tuny tw two tynd vow won wone wong wont wow yon yond yont yow zone
  Best based on unigram LM: town
ERROR: acress
  VS: Dress Paracress Press acceptress acre acred acres acrestaff acridness actress acupress creatress cress crestless dress pennycress press restress sacredness tress
  ED1: access acres across actress caress cress
  ED2: Access Acres Across Dress Press Stress abbess access acre acreak acream acred acres acrose across actless actress acupress acyesis address adless adpress afresh aggress agrees airless areas arrest arses assess atresy aureus awless ayless cares caress carest carless cess chess crass cress crest crews criss cross dress duress egress hatress ogress press siress stress tress
  Best based on bigram LM: access
ERROR: tonw
  VS: ton tonal tone toned toner tones tong tonga tongs tongue tonic tonish tonjon tonk tonkin tonous tons tonsor tonus tony
  ED1: ton tone tong tonk tons tony tow town
  ED2: Bon Boni Con Conn Cons Don Donn Dont Dow Down Fon Fong Gona Gond Gone Gow Hon Hong How Ione Ioni Jon Jong Joni Know Kong Lone Long Lonk Low Mon Mona Mono Mont Nona None Now Ron Rong Snow Son Song Sony Stone Tong Tony Town Won Wong Wow Zone atone atony bon bond bone bong bonk bony bouw bow con cond cone conk conn cony cow don done dong donk dont dow down enow eon fono fons font fow gon gone gong gony gown hone hong honk how ion jong jow know kon kona lone long low lown mon mona mone mong monk mono mow mown non none now on ona one ons oons oont ow own pon pond pone pong pont pony pow rond rone row snow son sond song sonk sons sow sown stond stone stong stony stow tan tana tane tang tanh tank taw tawn ten tend teng tens tent tew thaw thew thon thone thong thow tin tind tine ting tink tint tiny to toa toad toat tobe toby tock toco tod tode tody toe toed toes toff toft tofu tog toga togs togt toho toi toil toit toke toko tol told tole toll tolt tolu tom tomb tome ton tonal tone toned toner tones tong tonga tongs tonic tonk tons tonus tony too took tool toom toon toop toot top tope toph topi topo tops tor tora torc tore torn toro tort toru tory tos tosh toss tost tosy tot tote toto toty tou toug toup tour tout tow towan towd town towns towny towy tox toxa toy toys toze tron trona tronc trone trow tun tuna tund tune tung tunk tuno tunu tuny tw two tynd vow won wone wong wont wow yon yond yont yow zone
  Best based on bigram LM: town
Clone this wiki locally