IMPORTANT: No longer in active development.
Idio's Spotlight Model Editor allows you to manually tweak DBpedia Spotlight's models. It lets you:
- Add new surface forms
- Add new topics
- Create associations between surface forms and DBpedia URIs
- Remove associations between surface forms and DBpedia URIs
- Make surface forms spottable
- Make surface forms unspottable
- Modify the context vectors
- The default branches (Development/Master) work on Spotlight 0.6 models. If you downloaded your model from http://spotlight.sztaki.hu/downloads/version-0.1/ then it is a 0.6 model.
- The branch feature/code-clean-up-0-7 works on Spotlight 0.7 models. If you downloaded your model from http://spotlight.sztaki.hu/downloads/ then it is a 0.7 model.
In order to use the Model Editor, you will need:
- (Oracle) Java 1.7
- Scala 2.9.x (if you want to run it interactively from a terminal)
- Spotlight Model Editor (this tool), compiled from source (see below)
- A pre-computed language model (downloaded from here)
We also recommend using IntelliJ for editing the code. See below for instructions on how to set up a project.
We assume that you have the correct versions of Java and Maven (mvn) on your system.
The language models consume a lot of computational resources, so in these instructions we use the model for Turkish (located in the tr folder). Feel free to play with other languages if you have a big machine.
1. Clone this repo.
2. Go to the repo's folder.
3. Run:
mvn package appassembler:assemble
4. Call:
sh target/bin/model-editor explore path-to-model/en/model/ 20
It should print the stats for 20 surface forms.
Step 3 generates a jar with all the dependencies in the target folder, and then a script that calls the jar with default values, including a 15g heap. To override the heap size, either (i) modify the appassembler-maven-plugin settings in the pom, or (ii) call the jar directly with java -Xmx.. -jar ... followed by the commands shown in this readme.
- Get IntelliJ.
- Go to File > Import Project > Select POM Project.
- Give the project enough RAM: go to Preferences > Compiler and add '-Xmx5G' to 'Additional VM options'.
- Navigate to the SpotlightModelReader class, right click Main and select run scala console.
- Enjoy.
Start by freeing as much RAM as possible.
Each command described below is run by calling the script or the jar as follows.
Using the generated script:
sh target/bin/model-editor <command> <subcommand> arg1 arg2
Using the generated jar:
java -Xmx15g -Xms15g -jar target/idio-spotlight-model-0.1.0-jar-with-dependencies.jar <command> <subcommand> arg1 arg2
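The two invocation styles above can also be assembled programmatically, which makes it easy to override the heap size when calling the jar directly. The following is only a sketch: the function names `script_command` and `jar_command` are hypothetical helpers, not part of the Model Editor, and the script and jar paths are copied from the commands above.

```python
from typing import List

# Paths taken from the commands shown in this readme.
SCRIPT = "target/bin/model-editor"
JAR = "target/idio-spotlight-model-0.1.0-jar-with-dependencies.jar"


def script_command(*args: str) -> List[str]:
    """Build the argument list for the generated script (default heap)."""
    return ["sh", SCRIPT, *args]


def jar_command(*args: str, heap: str = "15g") -> List[str]:
    """Build the argument list for calling the jar with an explicit heap size.

    The same value is used for -Xmx and -Xms, mirroring the example above.
    """
    return ["java", f"-Xmx{heap}", f"-Xms{heap}", "-jar", JAR, *args]
```

The resulting list can be passed to `subprocess.run(..., check=True)`, for example `subprocess.run(jar_command("explore", "path-to-turkish/tr/model/", "40", heap="8g"), check=True)`.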
- command: explore
- arg1: path to the DBpedia Spotlight model, e.g. /mnt/share/spotlight/en/model
- arg2: number of surface forms
- result: outputs arg2 surface forms with their respective candidates, priors and statistics
Example:
sh target/bin/model-editor explore path-to-turkish/tr/model/ 40
All topic related actions are carried out using the topic command followed by one of the following subcommands:
- search: checks whether a topic is in the stores
- check-context: prints the context of a topic
- clean-set-context: cleans and sets the context of a topic
- command: topic
- subcommand: search
- arg1: path to the DBpedia Spotlight model, e.g. /mnt/share/spotlight/en/model
- arg2: dbpediaURI
- result: looks for the given DbpediaId in the model and reports whether that topic exists
Example:
sh target/bin/model-editor topic search path/to/model Michael_Schumacher
- command: topic
- subcommand: check-context
- arg1: path to the DBpedia Spotlight model, e.g. /mnt/share/spotlight/en/model
- arg2: pipe-separated list of dbpediaURIs
Example:
sh target/bin/model-editor topic check-context /mnt/share/spotlight/en/model Barack_Obama\|United_States
- command: topic
- subcommand: clean-set-context
- arg1: pathToSpotlightModel/model
- arg2: pathToFile
- result: the context words and counts for the topics in the file are cleared. The specified context words are then stemmed and added, with their respective counts, to the context vector of the given topics.
Each line of the input file should look like:
dbpediaUri <tab> contextWordsSeparatedByPipe <tab> countsSeparatedByPipe
contextWordsSeparatedByPipe and countsSeparatedByPipe must contain the same number of entries.
Example:
sh target/bin/model-editor topic clean-set-context /mnt/share/spotlight/en/model folder/fileWithContextChanges
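As an illustration of the input format for clean-set-context, the sketch below builds one line from a mapping of context words to counts. Using a mapping guarantees that the two pipe-separated fields have the same number of entries. The function name `context_line` and the example words are hypothetical and not part of the Model Editor.

```python
from typing import Dict


def context_line(dbpedia_uri: str, word_counts: Dict[str, int]) -> str:
    """Format one line: dbpediaUri <tab> words|... <tab> counts|...

    Words and counts come from the same mapping, so both fields always
    have the same length.
    """
    for word in word_counts:
        # A tab or pipe inside a word would corrupt the line layout.
        if "|" in word or "\t" in word:
            raise ValueError(f"context word contains a separator: {word!r}")
    words = "|".join(word_counts)
    counts = "|".join(str(count) for count in word_counts.values())
    return f"{dbpedia_uri}\t{words}\t{counts}"
```

Writing one such line per topic into a file gives an input that can be passed as arg2.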
All surface form related actions are carried out using the surfaceform command followed by one of the following subcommands:
- stats: prints the stats of a surface form
- candidates: prints the list of candidates of a surface form
- make-spottable: makes surface forms spottable
- make-unspottable: makes surface forms unspottable
- copy-candidates: adds to a surfaceFormA all the candidates of a surfaceFormB
- command: surfaceform
- subcommand: stats
- arg1: path to the DBpedia Spotlight model, e.g. /mnt/share/spotlight/en/model
- arg2: surfaceForm
- result: outputs statistics of the given surfaceForm
Example:
sh target/bin/model-editor surfaceform stats ~/Downloads/tr/model/ evrimleri
This outputs statistics for the surface form evrimleri.
- command: surfaceform
- subcommand: candidates
- arg1: path to the DBpedia Spotlight model, e.g. /mnt/share/spotlight/en/model
- arg2: surfaceForm
- result: outputs the candidate topics of a surface form
Example:
sh target/bin/model-editor surfaceform candidates ~/Downloads/tr/model/ evrimleri
This prints the candidate topics for the surface form evrimleri.
- command: surfaceform
- subcommand: make-unspottable
- arg1: path to the DBpedia Spotlight model, e.g. /mnt/share/spotlight/en/model
- arg2: either
  - a list of surface forms separated by |, e.g. how\|How\|Hello\ World
  - a file containing one surface form per line (if the -f option is passed)
- result: each surface form will no longer be spottable
Example:
sh target/bin/model-editor surfaceform make-unspottable path/to/model surfaceForm1\|surfaceForm2\|
sh target/bin/model-editor surfaceform make-unspottable path/to/model pathTo/File/withSF -f
- command: surfaceform
- subcommand: copy-candidates
- arg1: path to the DBpedia Spotlight model, e.g. /mnt/share/spotlight/en/model
- arg2: path to a file containing pairs of surface forms. Each line should be:
<originSurfaceForm> <tab> <destinySurfaceForm>
- result: copies the candidate topics of each originSurfaceForm as candidate topics of the corresponding destinySurfaceForm
Example:
sh target/bin/model-editor surfaceform copy-candidates path/to/model pathToFile
- command: surfaceform
- subcommand: make-spottable
- arg1: path to the DBpedia Spotlight model, e.g. /mnt/share/spotlight/en/model
- arg2: either
  - a list of surface forms separated by |, e.g. how\|How\|Hello\ World
  - a file containing one surface form per line (if the -f option is passed)
- result: each surface form will be spottable
Example:
sh target/bin/model-editor surfaceform make-spottable path/to/model surfaceForm1\|surfaceForm2\|
sh target/bin/model-editor surfaceform make-spottable path/to/model pathTo/File/withSF -f
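The backslashes in examples such as how\|How\|Hello\ World are shell escapes: the tool itself receives a single argument with plain | separators. The sketch below joins a list of surface forms and lets Python's `shlex.quote` produce an equivalent shell-safe command line. The helper names `surface_form_arg` and `shell_command` are hypothetical and not part of the Model Editor.

```python
import shlex
from typing import List


def surface_form_arg(surface_forms: List[str]) -> str:
    """Join surface forms with | into the single argument the tool receives."""
    for surface_form in surface_forms:
        # A literal | inside a surface form would be read as a separator.
        if "|" in surface_form:
            raise ValueError(f"surface form contains '|': {surface_form!r}")
    return "|".join(surface_forms)


def shell_command(subcommand: str, model_path: str, surface_forms: List[str]) -> str:
    """Return a shell-safe command line for make-spottable / make-unspottable."""
    argv = [
        "sh", "target/bin/model-editor", "surfaceform",
        subcommand, model_path, surface_form_arg(surface_forms),
    ]
    # shlex.quote wraps any argument containing | or spaces in single quotes.
    return " ".join(shlex.quote(arg) for arg in argv)
```

Single-quoting the whole argument is equivalent to escaping each | and space with a backslash, and is usually easier to read for long lists.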
All association related actions are carried out using the association command followed by one of the following subcommands:
- remove: removes associations between surface forms and topics
- command: association
- subcommand: remove
- arg1: pathToSpotlightModel/model
- arg2: pathToInputFile
- result: all associations between SFs and topics listed in the input file are deleted from the model
Every line in the input file describes one association to delete and should follow the format:
dbpediaURI <tab> Surface Form
Example:
sh target/bin/model-editor association remove /mnt/share/spotlight/en/model /path/to/file/file_with_associations
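Since the input file needs one line per association, a topic with several surface forms has to be expanded into several lines. A minimal sketch of that expansion follows. The function name `association_lines` and the example surface forms are hypothetical and not part of the Model Editor.

```python
from typing import Dict, List


def association_lines(associations: Dict[str, List[str]]) -> List[str]:
    """Expand {dbpediaURI: [surface forms]} into 'dbpediaURI<tab>Surface Form' lines."""
    lines = []
    for dbpedia_uri, surface_forms in associations.items():
        for surface_form in surface_forms:
            # A tab inside either field would break the two-column layout.
            if "\t" in dbpedia_uri or "\t" in surface_form:
                raise ValueError("fields must not contain tab characters")
            lines.append(f"{dbpedia_uri}\t{surface_form}")
    return lines
```

Joining the returned lines with newlines and writing them to a file gives an input that can be passed as arg2.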
When updating the model with lots of SFs, topics and context words, it is best to do it from a file.
Each line of the file should follow the format:
dbpedia_id <tab> surfaceForm1|surfaceForm2... <tab> contextW1|contextW2... <tab> contextW1Counts|ContextW2Counts
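Before running any file-update command, it can help to sanity-check the file's layout. The sketch below parses one line of the four-column format above. It assumes, as for clean-set-context, that context words and counts must have the same number of entries and that counts are integers; the readme does not state this explicitly for file-update. The names `ModelChange` and `parse_change_line` are hypothetical and not part of the Model Editor.

```python
from typing import List, NamedTuple


class ModelChange(NamedTuple):
    dbpedia_id: str
    surface_forms: List[str]
    context_words: List[str]
    context_counts: List[int]


def parse_change_line(line: str) -> ModelChange:
    """Parse 'dbpedia_id<tab>sf1|sf2<tab>word1|word2<tab>count1|count2'."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 4:
        raise ValueError(f"expected 4 tab-separated fields, got {len(fields)}")
    dbpedia_id, surface_forms, words, counts = fields
    context_words = words.split("|")
    # int() raises ValueError on a non-numeric count.
    context_counts = [int(count) for count in counts.split("|")]
    # Assumption: every context word needs exactly one count.
    if len(context_words) != len(context_counts):
        raise ValueError("context words and counts differ in length")
    return ModelChange(dbpedia_id, surface_forms.split("|"), context_words, context_counts)
```

Running this over every line of the file catches malformed rows before the models are loaded into memory.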
Before making actual changes to the model, it can be useful to see how many SFs, DBpedia topics and links between the two are missing:
sh target/bin/model-editor file-update check path/to/en/model path_to_file/with/model/changes
Make sure you have enough RAM to hold all the models, which should be around 15g. Then run:
sh target/bin/model-editor file-update all path/to/en/model path_to_file/with/model/changes
If you don't have enough RAM, you can update the SFs and DBpedia topics in one step and the context words in another. This requires less memory.
1. Go to the model folder and rename context.mem to context2.mem. This prevents the jar from loading the context store.
2. Update the surface form store, resource store and candidate store by calling:
sh target/bin/model-editor file-update all path/to/en/model path_to_file/with/model/changes
3. The previous command generates a new file, path_to_file/with/model/changes_just_context. This file maps dbpediaIds (the model's internal indexes) to context words, and it is processed in the following steps.
4. Rename context2.mem back to context.mem, and rename every other file in the model folder to something else. If this is not done, the stores will be loaded and they will consume all your RAM.
5. Update the context store by calling:
sh target/bin/model-editor file-update context-only path/to/en/model path_to_file/with/model/changes_just_context
6. Rename all files back to their usual names and enjoy a freshly baked model.
Steps 1-4 can be applied on their own, skipping steps 5 and 6, when you want to:
- add SFs
- link SFs with already existing DBpedia topics
Steps 5-6 can be applied on their own, skipping steps 1-4, when you want to:
- add context words to a DBpedia topic
Important:
- Steps 1-4 will only add SFs and DBpedia topics if they don't exist.
- Steps 1-4 will make all specified SFs spottable.
- Steps 5-6 only ADD context words to the context of a DBpedia topic.
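The renaming in the low-memory procedure (steps 1, 4 and 6) is easy to get wrong by hand, so it can be scripted. The following is a sketch under assumptions: only context.mem and context2.mem are names taken from this readme, the `.hidden` suffix is an arbitrary choice, and the function names are hypothetical helpers, not part of the Model Editor.

```python
from pathlib import Path

HIDDEN_SUFFIX = ".hidden"  # arbitrary suffix, chosen for this sketch


def hide_context_store(model_dir: Path) -> None:
    """Step 1: rename context.mem so the context store is not loaded."""
    (model_dir / "context.mem").rename(model_dir / "context2.mem")


def hide_other_stores(model_dir: Path) -> None:
    """Step 4: restore context.mem and rename every other file."""
    (model_dir / "context2.mem").rename(model_dir / "context.mem")
    # sorted() materialises the listing before any file is renamed.
    for path in sorted(model_dir.iterdir()):
        if path.is_file() and path.name != "context.mem":
            path.rename(path.with_name(path.name + HIDDEN_SUFFIX))


def restore_all(model_dir: Path) -> None:
    """Step 6: strip the suffix so all files get their usual names back."""
    for path in sorted(model_dir.iterdir()):
        if path.is_file() and path.name.endswith(HIDDEN_SUFFIX):
            path.rename(path.with_name(path.name[: -len(HIDDEN_SUFFIX)]))
```

The file-update commands from steps 2 and 5 would be run between these calls. It is worth trying the helpers on a copy of the model folder first.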
The best way to play with the models and modify them is to use the Scala console.
- Make sure your Scala version is 2.9.x.
- Start a Scala console by running:
JAVA_OPTS="-Xmx15000M -Xms15000M" scala
Once you have started a Scala console, you can use it like ipython to create instances of the Scala classes we provide, load the models, check whether DBpedia ids exist, add new DBpedia ids, add new surface forms, etc.
To load the classes inside the model editor, run:
:cp pathTo/ModelEditor.jar
After that you should be able to play with the classes inside the jar.
Example:
var spotlightModel = org.idio.dbpedia.spotlight.Main.getSpotlightModel( "/Users/dav009/Downloads/tr/model/")
spotlightModel.showSomeSurfaceForms()
spotlightModel.getStatsForSurfaceForm("evrimleri")
spotlightModel.searchForDBpediaResource("ikimono_gakari_dbpedia_uri")
spotlightModel.addNew("ikimono_gakari_sf","ikimono_gakari_dbpedia_uri",1,Array())
spotlightModel.exportModels("/new/path/of/folder/model/")
tools/explore.scala contains a script which can be preloaded into the Scala console. It imports the classes and stores needed to play with the model at a low level.
In order to use it:
- Run:
JAVA_OPTS="-Xmx9000M -Xms9000M" scala
Note: adjust the Java heap options to your needs. If you are using all the stores, use around 15g.
- Once you are in the Scala console, run:
:load tools/explore.scala
This will preload the following objects:
- resStore: resource store
- sfStore: surface form store
- candidateMap: candidate store
- tokenStore: token type store
- contextStore: context token store
If you are interested in Knowledge Mining, NLP or Software Engineering you should take a look at our jobs page. We're always on the lookout for awesome people to join our team.
Copyright 2014 Idio
Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0