KorpInstallation
This page explains how to install Korp on your own machine and how to convert a corpus into a format understood by the Korp tools.
You probably need:
- Kielipankki-konversio: (for converting your corpus to VRT format and) for converting a corpus in VRT format to a format usable by the Korp backend
- Korp backend: for performing searches in the corpus
- Korp frontend: for a graphical user interface that communicates with the Korp backend
See Kielipankki's and Språkbanken's instructions. Also see Kielipankki's technical instructions.
Kielipankki tools can be fetched from their GitHub repo.
They depend on the CWB tools (including cwb-perl), which must be installed first and made visible to the Kielipankki tools with
export CWB_BINDIR=insert_cwb_installation_dir_here
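Before running the Kielipankki tools, it can help to confirm that the directory named in CWB_BINDIR really contains the CWB programs. The following is a minimal sketch, not part of either toolset: the function name check_cwb_bindir is hypothetical, and the list of programs checked (cqp, cwb-encode, cwb-makeall) is an assumption about which CWB binaries matter most here.

```shell
# Hypothetical helper: report which expected CWB programs are missing
# from a directory (defaults to $CWB_BINDIR).
check_cwb_bindir() {
    dir="${1:-$CWB_BINDIR}"
    missing=0
    # Assumed list of CWB programs; extend it if your workflow needs more.
    for tool in cqp cwb-encode cwb-makeall; do
        if [ ! -x "$dir/$tool" ]; then
            echo "missing: $dir/$tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage:
#   check_cwb_bindir "$CWB_BINDIR" && echo "CWB tools found"
```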
When compiling and installing CWB, set the variable PLATFORM in config.mk to unix.
See the CWB pages for how to install cwb-perl as well. In the directory /korp/cwb-perl/CWB, run
perl Makefile.PL --config /usr/local/cwb-3.4.12/bin/cwb-config
make
make install # specify installation directory '/usr/local/cwb-3.4.12/bin/' ?
If perl complains about a missing HTML::Entities module, run cpan (as superuser) and execute
install HTML::Entities
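To check whether the module is already available before (or after) running cpan, you can ask perl to load it. This is a small sketch; the function name have_perl_module is hypothetical, while the perl -M and -e options are standard.

```shell
# Hypothetical helper: succeed if perl can load the given module.
have_perl_module() {
    # -M loads the module; -e 1 runs a trivial program that does nothing.
    perl -M"$1" -e 1 2>/dev/null
}

# Usage:
#   have_perl_module HTML::Entities || echo "HTML::Entities is not installed"
```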
NOTE: spaces in attributes can be problematic, so avoid using them. The & and < signs can also be problematic, so they must be escaped as described in Kielipankki's technical instructions.
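As an illustration of the escaping step, the sketch below replaces the two characters with the usual XML entities. It assumes that &amp; and &lt; are the escapes that Kielipankki's instructions call for, so check those instructions before relying on it; the function name escape_vrt_value is hypothetical.

```shell
# Hypothetical helper: escape & and < in a single token or attribute value.
escape_vrt_value() {
    # & is replaced first so that the & introduced by &lt; is not escaped again.
    # In a sed replacement, \& stands for a literal ampersand.
    printf '%s\n' "$1" | sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g'
}

# Usage:
#   escape_vrt_value 'R&D <draft'
```

Apply it to token and attribute values only, not to whole VRT lines, because the structural tags themselves start with <.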
TODO: the following warnings should be handled, although they are not that dangerous:
korp-make-corpus-package.sh: Warning: Korp frontend directory not found
korp-make-corpus-package.sh: Warning: No readme file included
korp-make-corpus-package.sh: Warning: No documentation included
korp-make-corpus-package.sh: Warning: No conversion scripts included
The package is created in the pkgs/CORPUSNAME directory.
An example:
# process vrt files, something like:
KORP_MAKE="/full/path/to/Kielipankki-konversio/scripts/korp-make"
export CWB_BINDIR=/usr/local/cwb-3.4.12/bin/
# mkdir registry
${KORP_MAKE} --corpus-root=${location_of_vrt_files} --log-file=log --no-lemgrams --no-logging --verbose --input-attributes "${empty_or_space_separated_attribute_names_without_word}" ${name_of_the_corpus} ${vrt_files}
# attributes for eduskunta ("word" is the first attribute by default):
"aid tref1 tref2 tv1 tv2 ref lemma pos msd dephead deprel nertag"
# fix paths, something like (corpusdir is for example "/usr/lib/cgi-bin/corpora"):
perl -i -pe 's|^HOME .*|HOME '"${corpusdir}"'/data/'"${name_of_the_corpus}"'|;' registry/${name_of_the_corpus} # something like
perl -i -pe 's|^INFO .*|INFO '"${corpusdir}"'/data/'"${name_of_the_corpus}"'/.info|;' registry/${name_of_the_corpus} # something like
# copy generated files, something like:
sudo cp registry/* /usr/lib/cgi-bin/corpora/registry/
sudo cp -R data/* /usr/lib/cgi-bin/corpora/data/
Both the Korp backend and frontend are based on Språkbanken's Korp tools. Fetch the backend from Kielipankki's GitHub repo (private repository).
For dependencies, see Språkbanken's documentation. Note that the CWB dependencies are probably already met if you installed them for Kielipankki-konversio tools.
In korp_config.py, you probably have to modify at least the following variables (possibly also in auth.cgi and korp_download.cgi?):
CQP_EXECUTABLE
CWB_SCAN_EXECUTABLE
CWB_REGISTRY
AUTH_SERVER
CACHE_DIR (can be empty string)
RESTRICTED_SENTENCES_CORPORA_FILE (can be empty string)
LOG_FILE
To make the Korp backend available via a web browser (TODO: at which address?), you must start Apache. Before starting Apache, run a2enmod cgi to allow CGI scripts.
The command will also symlink /cgi-bin/ to /usr/lib/cgi-bin/ (TODO: this assumes that you have Korp in this directory, but it could and probably should be elsewhere).
Also modify Apache's configuration file (probably located at /etc/apache2/apache2.conf) so that it only accepts connections from localhost. This is done by changing Require all granted to Require local for the directories /usr/share/ and /var/www/:
<Directory /usr/share>
AllowOverride None
Require local
</Directory>
<Directory /var/www/>
Options Indexes FollowSymLinks
AllowOverride None
Require local
</Directory>
Also make sure that korp.cgi has permission to write to the log directory and log file.
Make sure that the HOME and INFO paths are correct in the file corpora/registry/CORPUSNAME after you have run korp-make:
# path to binary data files
HOME ...
# optional info file (displayed by "info;" command in CQP)
INFO ...
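The check can be automated with a short script that reads the two entries from the registry file and tests whether they exist on disk. This is a sketch under stated assumptions: the function names registry_path and check_registry_paths are hypothetical, and the paths in the registry file are assumed to be unquoted and free of spaces.

```shell
# Hypothetical helper: print the value of a registry entry (HOME or INFO).
registry_path() {
    key="$1"
    file="$2"
    # Strip the key and the whitespace after it; comment lines do not match.
    sed -n "s/^${key}[[:space:]][[:space:]]*//p" "$file" | head -n 1
}

# Hypothetical helper: fail if HOME is not a directory; warn if INFO is
# set but does not point to an existing file (INFO is optional).
check_registry_paths() {
    file="$1"
    home=$(registry_path HOME "$file")
    info=$(registry_path INFO "$file")
    status=0
    if [ ! -d "$home" ]; then
        echo "HOME is not a directory: $home" >&2
        status=1
    fi
    if [ -n "$info" ] && [ ! -f "$info" ]; then
        echo "warning: INFO file not found: $info" >&2
    fi
    return "$status"
}

# Usage:
#   check_registry_paths /usr/lib/cgi-bin/corpora/registry/CORPUSNAME
```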
Example Korp backend searches:
korp.cgi?command=query&cqp=[word=%22korpusar%22]&corpus=TESTCORPUS&start=0&end=0&defaultcontext=1%20sentence&indent=2
./korp.cgi command="query&defaultcontext=1+sentence&show=sentence&show_struct=text_url&cache=true&start=0&end=9&corpus=TEST_ASD_FI%7CTEST_ASD_SV&incremental=true&cqp=%5Bword+%3D+%22Lipponen%22%5D&defaultwithin=sentence&loginfo=lang%3Dfi+search%3Dsimple"
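Once Apache is serving the backend, the same kind of query can be sent with curl, which takes care of the URL encoding. This is only a sketch: the function name korp_query is hypothetical, and the base URL in the usage line assumes that korp.cgi is reachable under /cgi-bin/korp/ on localhost. The query parameters are taken from the examples above.

```shell
# Hypothetical helper: send a CQP query to a Korp backend and print the JSON reply.
#   $1 = URL of korp.cgi, $2 = corpus id(s), $3 = CQP expression
korp_query() {
    base_url="$1"
    corpus="$2"
    cqp="$3"
    # --get appends the data to the URL as a query string;
    # --data-urlencode percent-encodes each value.
    curl --silent --get "$base_url" \
        --data-urlencode "command=query" \
        --data-urlencode "corpus=$corpus" \
        --data-urlencode "cqp=$cqp" \
        --data-urlencode "start=0" \
        --data-urlencode "end=9" \
        --data-urlencode "defaultcontext=1 sentence" \
        --data-urlencode "indent=2"
}

# Usage (the URL is an assumption about your Apache setup):
#   korp_query "http://localhost/cgi-bin/korp/korp.cgi" TESTCORPUS '[word="korpusar"]'
```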
Get the Korp frontend from CSC's GitHub repo. It is forked from Språkbanken's repo.
See the instructions and dependencies in Språkbanken's repo. Note that in 'Local setup for Ubuntu', the command npm install must be run as superuser (in the Korp frontend directory), and the command sudo gem install compass must be run last so that compass is found.
In app/config.js, you must set the URLs of the CGI scripts (/cgi-bin/korp/ by default) and locally_available_corpora (most corpora listed here will not be available).
Also modify settings.corporafolders, settings.corpora and locally_available_corpora if you wish to add corpora.
Run grunt serve and go to http://localhost:9000.