Clojure wrapper for the Japanese Morphological Analyzer MeCab.
A minimal wrapper around the SWIG-generated Java bindings for MeCab. Currently tested with all varieties of UniDic and IPAdic; support for other dictionaries is planned.
clj-mecab requires MeCab (0.996) to be installed and on your path; the mecab-config binary is used to find your MeCab configuration.
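A quick way to confirm this requirement is met is to ask mecab-config for its version and dictionary directory. The following is a minimal sketch; the function name check_mecab is made up for illustration and is not part of clj-mecab or MeCab.

```shell
# check_mecab is a hypothetical helper name, not something shipped with MeCab.
check_mecab() {
  # Fail early if mecab-config cannot be found on PATH.
  if ! command -v mecab-config >/dev/null 2>&1; then
    echo "mecab-config not found on PATH" >&2
    return 1
  fi
  # Report the installed version and where dictionaries are looked up.
  echo "MeCab version: $(mecab-config --version)"
  echo "Dictionary dir: $(mecab-config --dicdir)"
}
```

Running check_mecab after installation should print the version (expected to be 0.996 here) and the dictionary directory; a non-zero exit status means mecab-config is not on your path.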
On Debian:
apt-get install mecab mecab-utils libmecab-java libmecab-jni unidic-mecab
On macOS:
brew install mecab mecab-unidic
Note that you will need to manually install the Maven dependencies on macOS (see next section).
You also need the Java JNI (SWIG) bindings for your installed version of MeCab in your local Maven repository (~/.m2).
This can be accomplished by:
mvn install:install-file -DgroupId=org.chasen -DartifactId=mecab -Dpackaging=jar -Dversion=0.996 -Dfile=/usr/share/java/mecab/MeCab.jar -DgeneratePom=true
where /usr/share/java/mecab/MeCab.jar should point to the generated jar on your system.
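If you are unsure where the jar ended up, a search along these lines may help. This is only a sketch: find_mecab_jar is a hypothetical helper, and the directories searched are guesses that will differ between systems and package managers.

```shell
# find_mecab_jar is a hypothetical helper: it prints the first MeCab.jar
# found under the directories passed as arguments (nothing if none is found).
find_mecab_jar() {
  find "$@" -type f -name 'MeCab.jar' 2>/dev/null | head -n 1
}

# The search roots below are assumptions; adjust them for your system.
jar=$(find_mecab_jar /usr/share/java /usr/local/share/java || true)
if [ -n "$jar" ]; then
  echo "Pass this to Maven: -Dfile=$jar"
else
  echo "MeCab.jar not found in the searched directories" >&2
fi
```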
You will also need to manually download cmecab-java and install it into your local Maven repo:
wget https://github.com/takscape/cmecab-java/releases/download/2.1.0/cmecab-java-2.1.0.tar.gz
tar xzf cmecab-java-2.1.0.tar.gz
mvn install:install-file -DgroupId=net.moraleboost.cmecab-java -DartifactId=cmecab-java -Dpackaging=jar -Dversion=2.1.0 -Dfile=cmecab-java-2.1.0/cmecab-java-2.1.0.jar -DgeneratePom=true
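To check that both artifacts landed where Maven will look for them, you can compute their expected paths from the coordinates used above. The sketch below assumes the default local repository location (~/.m2/repository); m2_jar_path and the M2_REPO override variable are hypothetical names introduced here for illustration.

```shell
# m2_jar_path GROUP ARTIFACT VERSION
# Prints the path a jar with these coordinates has in a local Maven repository.
# M2_REPO is a hypothetical override; by default ~/.m2/repository is assumed.
m2_jar_path() {
  group_path=$(printf '%s' "$1" | tr . /)   # dots in the groupId become directories
  printf '%s\n' "${M2_REPO:-$HOME/.m2/repository}/$group_path/$2/$3/$2-$3.jar"
}

# Report whether each of the two jars installed above is present.
for coords in "org.chasen mecab 0.996" "net.moraleboost.cmecab-java cmecab-java 2.1.0"; do
  set -- $coords
  jar=$(m2_jar_path "$1" "$2" "$3")
  if [ -f "$jar" ]; then
    echo "found:   $jar"
  else
    echo "missing: $jar"
  fi
done
```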
To build MeCab from source, first install CRF++, which MeCab depends on.
wget http://crfpp.googlecode.com/files/CRF%2B%2B-0.58.tar.gz
tar xzf CRF++-0.58.tar.gz
cd CRF++-0.58 && ./configure && make -j4 && make install && cd ..
Next, install MeCab.
wget http://mecab.googlecode.com/files/mecab-0.996.tar.gz
tar xzf mecab-0.996.tar.gz
cd mecab-0.996 && ./configure --with-charset=utf8 --enable-utf8-only && make -j4 && make install && cd ..
And at least one dictionary:
- IPAdic:
wget http://mecab.googlecode.com/files/mecab-ipadic-2.7.0-20070801.tar.gz
tar xzf mecab-ipadic-2.7.0-20070801.tar.gz
cd mecab-ipadic-2.7.0-20070801 && ./configure --with-charset=utf8 && make -j4 && make install && cd ..
- UniDic:
curl -O https://unidic.ninjal.ac.jp/unidic_archive/cwj/2.3.0/unidic-cwj-2.3.0.zip
unzip -x unidic-cwj-2.3.0.zip
cd unidic-cwj-2.3.0 && install -d $(mecab-config --dicdir)/unidic-cwj && install -m 644 dicrc *.bin *.dic $(mecab-config --dicdir)/unidic-cwj && cd ..
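Once a dictionary is installed, it can be useful to see which dictionaries MeCab's dictionary directory actually contains. The sketch below lists every subdirectory of $(mecab-config --dicdir) that holds a dicrc file; list_mecab_dicts is a hypothetical helper name, and using dicrc as the marker of a dictionary directory is an assumption based on the install commands above.

```shell
# list_mecab_dicts is a hypothetical helper: it prints the name of each
# subdirectory of MeCab's dictionary directory that contains a dicrc file.
list_mecab_dicts() {
  dicdir=$(mecab-config --dicdir) || return 1
  for d in "$dicdir"/*/; do
    [ -f "${d}dicrc" ] && basename "$d"
  done
  return 0
}

# A listed dictionary can then be tried from the command line, for example:
#   echo "こんにちは" | mecab -d "$(mecab-config --dicdir)/unidic-cwj"
```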
Interactive use:
(require '[clj-mecab.parse :as mecab])
(mecab/parse-sentence "こんにちは、世界!")
[{:orth "こんにちは", :f-type "*", :i-type "*", ...} {:orth "、", :f-type "*", :i-type "*", ...} {:orth "世界", :f-type "*", :i-type "*", ...} ...]
- For an as-yet-unknown reason, calling .getSurface on a Node object returns an empty string the first time and only works on the second call. Currently this means that :orth is not generated when using IPAdic. UniDic provides the surface form in the features array and is unaffected. This is probably the same issue as taku910/mecab#26.
Copyright © 2013-2020 Bor Hodošček
Distributed under the Eclipse Public License, the same as Clojure, as well as the 3-clause BSD license.