
TopicModel Exercise

jasonbaldridge edited this page Mar 4, 2013 · 2 revisions

For this exercise, we'll compute topics for a dataset using Mallet as a command line tool, and then as an API. The main points of the exercise are to:

  • show how easy it is to get topics for a corpus
  • explore some parameterizations and see their effect on the topics that are found
  • integrate a tool like Mallet, whose API is somewhat poorly documented, into your own code

Step one: find and prepare a data set

You need a data set that is either already in a directory with one text document per file, or that you can process into that form. For example:
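As one illustration of the "process into that form" case, here is a minimal Scala sketch. It assumes, purely for the sake of example, that your raw corpus is a single UTF-8 text file with one document per line. The object name `SplitCorpus`, the `split` method, and the `docNNNNN.txt` naming scheme are all hypothetical; adapt them to whatever shape your data actually has.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path, Paths}
import scala.io.Source

object SplitCorpus {

  /** Writes each non-empty line of `input` to its own file in `outDir`
    * and returns the number of files written.
    * Assumption: one document per line, UTF-8 encoded.
    */
  def split(input: Path, outDir: Path): Int = {
    Files.createDirectories(outDir)
    val source = Source.fromFile(input.toFile, "UTF-8")
    try {
      // Materialize the documents so we can both write them and count them.
      val docs = source.getLines().map(_.trim).filter(_.nonEmpty).toVector
      docs.zipWithIndex.foreach { case (doc, i) =>
        // Zero-padded names keep the files in corpus order when listed.
        val target = outDir.resolve(f"doc$i%05d.txt")
        Files.write(target, doc.getBytes(StandardCharsets.UTF_8))
      }
      docs.size
    } finally {
      source.close()
    }
  }

  def main(args: Array[String]): Unit = {
    val n = split(Paths.get(args(0)), Paths.get(args(1)))
    println(s"Wrote $n documents to ${args(1)}")
  }
}
```

Run it with the input file and the output directory as its two arguments. The resulting directory is in the one-document-per-file layout described above.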

Step two: install Mallet and run it from the command line

Compute topics by following the instructions for command-line usage for computing topic models with Mallet. Explore some of the different parameterizations (especially the number of topics) to see how the topics vary.

Step three: use Mallet as an API

Create a project that uses Mallet as a dependency and compute topics using Mallet as an API rather than from the command line. This means that you can have a topic model as a first-class object in your application, rather than having to obtain it indirectly by computing it on the command line, reading in its output, etc.
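If you build with sbt, declaring the dependency might look like the fragment below. This is a sketch under assumptions: the project name is made up, and the coordinates `cc.mallet` / `mallet` / `2.0.7` are the ones Mallet has been published under on Maven Central, so check which version is current before relying on them.

```scala
// build.sbt (sketch; the project name is hypothetical)
name := "topic-model-exercise"

// Assumed Maven Central coordinates for Mallet; verify the version you want.
libraryDependencies += "cc.mallet" % "mallet" % "2.0.7"
```

Alternatively, you can drop the Mallet jar into the project's `lib` directory, which sbt treats as an unmanaged dependency.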

This is not hard, but it requires you to do some of the sleuthing that is necessary for working with code like this in the real world.

Also, you can just wimp out and look at Mallet's page on Topic Modeling for Java Developers and convert the code given there to Scala.
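For comparison once you have had a go yourself, here is a rough Scala sketch of what such a conversion might look like. It is not the official recipe. The object name `TopicSketch`, the token regex, the number of topics, the hyperparameters (`alphaSum = 1.0`, `beta = 0.01`), and the thread and iteration counts are all assumptions chosen for illustration. It also assumes the Mallet jar is on your classpath and that the directory passed as the first argument holds one plain-text document per file.

```scala
import java.io.{File, FileFilter}
import java.util.regex.Pattern

import cc.mallet.pipe._
import cc.mallet.pipe.iterator.FileIterator
import cc.mallet.topics.ParallelTopicModel
import cc.mallet.types.InstanceList

object TopicSketch {
  def main(args: Array[String]): Unit = {
    val corpusDir = new File(args(0)) // directory with one document per file
    val numTopics = 20                // assumption: try several values

    // Preprocessing pipeline: file -> text -> lowercase -> tokens
    // -> stopwords removed -> feature sequence (what the topic model expects).
    val pipes = Array[Pipe](
      new Input2CharSequence("UTF-8"),
      new CharSequenceLowercase(),
      new CharSequence2TokenSequence(Pattern.compile("\\p{L}[\\p{L}\\p{P}]+\\p{L}")),
      new TokenSequenceRemoveStopwords(false, false), // default English stoplist
      new TokenSequence2FeatureSequence()
    )

    // Accept every regular file that the iterator finds under corpusDir.
    val anyFile = new FileFilter {
      def accept(f: File): Boolean = f.isFile
    }

    val instances = new InstanceList(new SerialPipes(pipes))
    instances.addThruPipe(
      new FileIterator(Array(corpusDir), anyFile, FileIterator.LAST_DIRECTORY))

    // alphaSum and beta are illustrative values, not tuned recommendations.
    val model = new ParallelTopicModel(numTopics, 1.0, 0.01)
    model.addInstances(instances)
    model.setNumThreads(2)
    model.setNumIterations(500)
    model.estimate()

    // Print the ten highest-weighted words for each topic.
    model.getTopWords(10).zipWithIndex.foreach { case (words, k) =>
      println(s"$k\t${words.mkString(" ")}")
    }
  }
}
```

The payoff of doing it this way is that `model` is an ordinary object in your program: after `estimate()` returns you can query it directly instead of parsing files written by the command-line tool.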