KorpInstallation

eaxelson edited this page Dec 19, 2018 · 1 revision

Getting a corpus to Korp

This page describes how to install Korp on your own machine and how to convert a corpus to a format understood by the Korp tools.

You probably need:

  • Kielipankki-konversio: for converting your corpus to VRT format (if it is not in VRT already) and for converting the VRT corpus to the format used by the Korp backend
  • Korp backend: for performing searches in the corpus
  • Korp frontend: for a graphical user interface that communicates with the Korp backend

Converting corpora to korp format with Kielipankki-konversio tools

See Kielipankki's and Språkbanken's instructions. Also see Kielipankki's technical instructions.

The Kielipankki tools can be fetched from their GitHub repo. They depend on the CWB tools (including cwb-perl), which must be installed first and made visible to the Kielipankki tools with

export CWB_BINDIR=insert_cwb_installation_dir_here
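Before running the conversion, it can help to check that CWB_BINDIR really points at a CWB installation. The following is a minimal sketch: the function name check_cwb_bindir is hypothetical, and the list of tools checked (cqp, cwb-encode, cwb-makeall) is an assumption about what the conversion needs, not something taken from the Kielipankki scripts.

```shell
# Hypothetical helper: verify that a directory contains the CWB executables.
# Assumption: cqp, cwb-encode and cwb-makeall are the tools that matter here.
check_cwb_bindir() {
    dir=${1:-${CWB_BINDIR:-}}
    for tool in cqp cwb-encode cwb-makeall; do
        if [ ! -x "$dir/$tool" ]; then
            echo "missing or not executable: $dir/$tool" >&2
            return 1
        fi
    done
    return 0
}
```

Calling check_cwb_bindir without arguments checks the directory in CWB_BINDIR; a non-zero exit status means at least one tool was not found.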

When compiling and installing CWB, set variable PLATFORM in config.mk to unix.

See the CWB pages for how to install cwb-perl as well. In the directory /korp/cwb-perl/CWB, run

perl Makefile.PL --config /usr/local/cwb-3.4.12/bin/cwb-config
make
make install # specify installation directory '/usr/local/cwb-3.4.12/bin/' ?

If perl complains about a missing HTML::Entities module, run cpan (as superuser) and execute

install HTML::Entities

NOTE: spaces in attributes can cause problems, so avoid using them. The characters & and < can also cause problems, so they must be escaped as described in Kielipankki's technical instructions.
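As an illustration of the escaping step, here is a rough sketch using sed. It assumes that the required escapes are the XML entities &amp; and &lt;, that every line starting with < is a structural tag that must stay untouched, and that the input has not been escaped already (otherwise entities would be escaped twice). The function name escape_vrt_tokens is hypothetical; check Kielipankki's technical instructions for the authoritative rules.

```shell
# Hypothetical helper: escape "&" and "<" on token lines of a VRT stream.
# Assumption: lines beginning with "<" are structural tags and are skipped.
# "&" is replaced first so that the "&" in "&lt;" is not escaped again.
escape_vrt_tokens() {
    sed '/^</!{s/&/\&amp;/g;s/</\&lt;/g;}'
}

# Small demonstration on an inline sample with two token lines.
printf '<sentence>\nAT&T\tAT&T\na<b\ta<b\n</sentence>\n' | escape_vrt_tokens
```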

TODO: the following warnings should be handled, although they are not that dangerous:

korp-make-corpus-package.sh: Warning: Korp frontend directory not found
korp-make-corpus-package.sh: Warning: No readme file included
korp-make-corpus-package.sh: Warning: No documentation included
korp-make-corpus-package.sh: Warning: No conversion scripts included

Package is created in pkgs/CORPUSNAME directory.

An example:

# process vrt files, something like:
KORP_MAKE="/full/path/to/Kielipankki-konversio/scripts/korp-make"
export CWB_BINDIR=/usr/local/cwb-3.4.12/bin/
# mkdir registry
${KORP_MAKE} --corpus-root=${location_of_vrt_files} --log-file=log --no-lemgrams --no-logging --verbose --input-attributes "${empty_or_space_separated_attribute_names_without_word}" ${name_of_the_corpus} ${vrt_files}

# attributes for eduskunta ("word" is the first attribute by default):
"aid tref1 tref2 tv1 tv2 ref lemma pos msd dephead deprel nertag"

# fix paths, something like (corpusdir is for example "/usr/lib/cgi-bin/corpora");
# "|" is used as the delimiter because the paths themselves contain slashes:
perl -i -pe 's|^HOME .*|HOME '"${corpusdir}"'/data/'"${name_of_the_corpus}"'|;' registry/${name_of_the_corpus}
perl -i -pe 's|^INFO .*|INFO '"${corpusdir}"'/data/'"${name_of_the_corpus}"'/.info|;' registry/${name_of_the_corpus}

# copy generated files, something like: 
sudo cp registry/* /usr/lib/cgi-bin/corpora/registry/
sudo cp -R data/* /usr/lib/cgi-bin/corpora/data/

Korp backend

Both the Korp backend and the frontend are based on Språkbanken's Korp tools. Fetch the backend from Kielipankki's GitHub repo (private repository).

For dependencies, see Språkbanken's documentation. Note that the CWB dependencies are probably already met if you installed them for Kielipankki-konversio tools.

In korp_config.py, you probably have to modify at least the following variables (possibly also in auth.cgi and korp_download.cgi):

CQP_EXECUTABLE
CWB_SCAN_EXECUTABLE
CWB_REGISTRY
AUTH_SERVER
CACHE_DIR (can be empty string)
RESTRICTED_SENTENCES_CORPORA_FILE (can be empty string)
LOG_FILE
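To see what these variables are currently set to, the relevant lines can be pulled out of the configuration file. This is a sketch that assumes each variable is assigned at the start of a line in the form NAME = value; the function name show_korp_settings is hypothetical.

```shell
# Hypothetical helper: print the assignments of the variables listed above.
# Assumption: each assignment starts at the beginning of a line.
show_korp_settings() {
    grep -E '^(CQP_EXECUTABLE|CWB_SCAN_EXECUTABLE|CWB_REGISTRY|AUTH_SERVER|CACHE_DIR|RESTRICTED_SENTENCES_CORPORA_FILE|LOG_FILE)[[:space:]]*=' "$1"
}
```

For example, show_korp_settings korp_config.py prints one line per variable that is set in the file.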

To make the Korp backend available via a web browser (TODO: at which address?), you must start Apache. Before starting Apache, run a2enmod cgi to enable CGI scripts. This will also make /cgi-bin/ point to /usr/lib/cgi-bin/ (TODO: this assumes that Korp is installed in this directory, but it could and probably should be elsewhere). Also modify Apache's configuration file (probably located at /etc/apache2/apache2.conf) so that it only accepts connections from localhost. This is done by changing Require all granted to Require local for the directories /usr/share/ and /var/www/:

  <Directory /usr/share>
        AllowOverride None
        Require local
  </Directory>

  <Directory /var/www/>
        Options Indexes FollowSymLinks
        AllowOverride None
        Require local
  </Directory>

Also make sure that korp.cgi has rights to write to the log directory and file.
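One way to test the write permissions is a small check like the one below. It is a sketch: the function name can_write_log is hypothetical, and the check only reflects the user who runs it, so it should be run as the user Apache runs under (often www-data on Debian and Ubuntu, which is an assumption about your system).

```shell
# Hypothetical helper: succeed if the current user could write the log file.
# If the file does not exist yet, its directory must be writable instead.
can_write_log() {
    log_file=$1
    if [ -e "$log_file" ]; then
        [ -w "$log_file" ]
    else
        [ -w "$(dirname "$log_file")" ]
    fi
}
```

Pass the path that LOG_FILE is set to in korp_config.py; a non-zero exit status means the log cannot be written by the current user.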

Make sure that the HOME and INFO paths are correct in the file corpora/registry/CORPUSNAME after you have run korp-make:

# path to binary data files
HOME ...
# optional info file (displayed by "info;" command in CQP)
INFO ...
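The two paths can also be checked with a short script. The sketch below assumes that the registry file contains one HOME line and one INFO line, each with the path after whitespace and without quotes; the function name check_registry_paths is hypothetical.

```shell
# Hypothetical helper: check that HOME and INFO in a registry file exist.
# Assumption: paths are unquoted and follow the keyword after whitespace.
check_registry_paths() {
    registry_file=$1
    home=$(sed -n 's/^HOME[[:space:]][[:space:]]*//p' "$registry_file" | head -n 1)
    info=$(sed -n 's/^INFO[[:space:]][[:space:]]*//p' "$registry_file" | head -n 1)
    status=0
    [ -d "$home" ] || { echo "HOME directory not found: $home" >&2; status=1; }
    [ -f "$info" ] || { echo "INFO file not found: $info" >&2; status=1; }
    return $status
}
```

For example, check_registry_paths corpora/registry/CORPUSNAME reports each path that does not exist and returns a non-zero status if either is missing.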

Examples of Korp backend searches:

korp.cgi?command=query&cqp=[word=%22korpusar%22]&corpus=TESTCORPUS&start=0&end=0&defaultcontext=1%20sentence&indent=2

./korp.cgi command="query&defaultcontext=1+sentence&show=sentence&show_struct=text_url&cache=true&start=0&end=9&corpus=TEST_ASD_FI%7CTEST_ASD_SV&incremental=true&cqp=%5Bword+%3D+%22Lipponen%22%5D&defaultwithin=sentence&loginfo=lang%3Dfi+search%3Dsimple"
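The cqp parameter in these examples is percent-encoded. The following sketch shows one way to produce that encoding in plain shell and build a query string like the first example. The function name urlencode is hypothetical, it handles ASCII input only, and it encodes spaces as %20 rather than +.

```shell
# Hypothetical helper: percent-encode an ASCII string for use in a URL.
# Letters, digits and . ~ _ - are kept; everything else becomes %XX.
urlencode() {
    s=$1
    out=
    while [ -n "$s" ]; do
        c=${s%"${s#?}"}    # first character of the remaining string
        s=${s#?}           # drop that character
        case $c in
            [a-zA-Z0-9.~_-]) out=$out$c ;;
            *) out=$out$(printf '%%%02X' "'$c") ;;
        esac
    done
    printf '%s\n' "$out"
}

# Build a query string for the corpus used in the first example above.
cqp=$(urlencode '[word="korpusar"]')
echo "korp.cgi?command=query&cqp=${cqp}&corpus=TESTCORPUS&start=0&end=0"
```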

Korp frontend

Get the Korp frontend from CSC's GitHub repo, which is forked from Språkbanken's repo. See the instructions and dependencies in Språkbanken's repo. Note that in 'Local setup for Ubuntu', the command npm install must be run as superuser (in the Korp frontend directory), and the command sudo gem install compass must be run last so that compass is found.

In app/config.js, you must set the URLs of the CGI scripts (/cgi-bin/korp/ by default) and locally_available_corpora (most of the corpora listed there will not be available). Also modify settings.corporafolders, settings.corpora and locally_available_corpora if you wish to add corpora.

Run grunt serve and go to http://localhost:9000.