Indra modules responsible for pre-processing text, and indexing and loading semantic vector models.

INDRA INDEXER

INDRA Indexer is divided into two modules:

  1. indra-preprocessing
  2. indra-index

indra-preprocessing

The corpus pre-processor is responsible for defining the tokenisation strategy and the tokens’ subsequent transformations. It defines, for example, whether “United States of America” corresponds to a unique token or to multiple tokens. Stemming and lowercasing are two other popular transformations supported by the pre-processor; the full list of options is shown in the table below.

Parameter                      Description/Options
input format                   Wikipedia-dump format or plain text from one or multiple files.
language                       14 supported languages.
set of stopwords               a set of tokens to be removed.
set of multi-word expressions  a set of token sequences that should be treated as a single token.
apply lowercase                lowercases the tokens.
apply stemmer                  applies the Porter stemmer to the tokens.
remove accents                 removes accents from words.
replace numbers                replaces numbers with a placeholder.
min                            sets a minimum acceptable token size.
max                            sets a maximum acceptable token size.
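Taken together, these options form a pipeline applied to every token stream. A minimal Python sketch of such a pipeline (the function and option names here are illustrative, not indra-preprocessing’s actual API, and `<number>` is a made-up placeholder token):

```python
import unicodedata

def preprocess(tokens, stopwords=frozenset(), mwes=(), lowercase=True,
               remove_accents=True, replace_numbers=True,
               min_size=1, max_size=100):
    """Illustrative pre-processing pipeline: merge multi-word expressions,
    then transform and filter individual tokens."""
    # Merge multi-word expressions so that e.g. ("United", "States", "of",
    # "America") becomes the single token "United States of America".
    merged, i = [], 0
    while i < len(tokens):
        for mwe in mwes:
            if tuple(tokens[i:i + len(mwe)]) == tuple(mwe):
                merged.append(" ".join(mwe))
                i += len(mwe)
                break
        else:
            merged.append(tokens[i])
            i += 1

    result = []
    for tok in merged:
        if lowercase:
            tok = tok.lower()
        if remove_accents:
            # Decompose to NFD and drop the combining marks.
            tok = "".join(c for c in unicodedata.normalize("NFD", tok)
                          if not unicodedata.combining(c))
        if replace_numbers and tok.replace(".", "", 1).isdigit():
            tok = "<number>"  # made-up placeholder token
        if tok in stopwords or not (min_size <= len(tok) <= max_size):
            continue
        result.append(tok)
    return result
```

For example, `preprocess(["The", "United", "States", "of", "America"], stopwords={"the"}, mwes=[("United", "States", "of", "America")])` yields `["united states of america"]`.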

indra-index

The indra-index module is responsible for generating word embedding models and loading them into the Indra data sources. It defines a unified interface to generate both predictive models (e.g. Skip-gram and GloVe) and count-based models (e.g. LSA and ESA), whose implementations come from the DeepLearning4J and S-Space libraries, respectively. In addition to unifying the interface, indra-index integrates the corpus pre-processor module.
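The unified interface amounts to a single entry point that dispatches to a predictive or count-based backend depending on the model type. The following toy Python sketch illustrates the shape of such a dispatch; the backends here are stand-ins, not the real DeepLearning4J or S-Space implementations:

```python
from collections import Counter

# Model families and the libraries Indra uses for them (per the text above).
PREDICTIVE = {"skip-gram", "glove"}   # dense vectors, DeepLearning4J
COUNT_BASED = {"lsa", "esa"}          # sparse vectors, S-Space

def count_based_vectors(sentences):
    """Toy count-based model: each word's sparse vector is its
    co-occurrence count with the other words in the same sentence."""
    vectors = {}
    for sent in sentences:
        for word in sent:
            vectors.setdefault(word, Counter()).update(
                w for w in sent if w != word)
    return vectors

def predictive_vectors(sentences, dim=4):
    """Toy stand-in for a trained dense embedding (no real training):
    a fixed-dimension vector derived deterministically from the word."""
    vocab = {w for s in sentences for w in s}
    return {w: [(hash(w) >> i) % 10 / 10 for i in range(dim)] for w in vocab}

def generate_model(model_type, sentences):
    """Single entry point dispatching on model type, mirroring how
    indra-index unifies its two families of backends."""
    if model_type in COUNT_BASED:
        return {"kind": "sparse", "vectors": count_based_vectors(sentences)}
    if model_type in PREDICTIVE:
        return {"kind": "dense", "vectors": predictive_vectors(sentences)}
    raise ValueError(f"unknown model type: {model_type}")
```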

The final generated model stores the set of applied transformations as metadata. At consumption time, Indra applies the same set of options to guarantee consistency: for instance, if a given model was generated by stemming and lowercasing the tokens, the same transformations are applied to query terms before lookup.
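That consistency mechanism can be sketched as follows (an illustrative structure, not Indra’s actual storage format): the model bundles its transformation options as metadata, and every lookup re-applies those options to the query term.

```python
def apply_options(token, lowercase=False, stem=False):
    """Apply the stored transformation options to a single token.
    The 'stemmer' here is a toy plural-stripper, not the Porter stemmer."""
    if lowercase:
        token = token.lower()
    if stem and token.endswith("s"):
        token = token[:-1]
    return token

def lookup(model, term):
    """Re-apply the model's own pre-processing options to the query term,
    so it matches the form the tokens had at indexing time."""
    key = apply_options(term, **model["metadata"])
    return model["vectors"].get(key)

# A model built with lowercasing and stemming records those options:
model = {
    "metadata": {"lowercase": True, "stem": True},
    "vectors": {"embedding": [0.1, 0.2, 0.3]},
}
```

Here `lookup(model, "Embeddings")` finds the vector stored under `"embedding"`, because the query term is lowercased and stemmed exactly as the corpus tokens were.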

Indra loads the generated models into three types of data source: Annoy indexes (for dense vector models), Lucene indexes (for sparse vector models), or MongoDB indexes (deprecated).

Citing Indra

Please cite Indra if you use it in your experiments or projects.

@InProceedings{indra,
  author    = {Sales, Juliano Efson and Souza, Leonardo and Barzegar, Siamak and Davis, Brian and Freitas, Andr{\'e} and Handschuh, Siegfried},
  title     = {Indra: A Word Embedding and Semantic Relatedness Server},
  booktitle = {Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)},
  month     = {May},
  year      = {2018},
  address   = {Miyazaki, Japan},
  publisher = {European Language Resources Association (ELRA)},
}

Contributors (alphabetical order)

  • Andre Freitas
  • Brian Davis
  • Juliano Sales
  • Leonardo Souza
  • Siamak Barzegar
  • Siegfried Handschuh
