The aim of this project is to make analysing the contents of a Japanese ebook easy and to streamline the process for non-technical users. You can analyse an ebook and see information including, but not limited to:
- The length of the book in words/characters
- The number of unique words/characters
- The number of unique words/characters only used once
- The percentage of unique words/characters only used once
- Occurrence information for all words/characters that appear in the book
- Frequency distribution histograms for words
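Under the hood, all of these statistics reduce to simple frequency counts over a list of tokens. Here is a rough, illustrative sketch (not the project's actual code) of how they can be computed:

```python
from collections import Counter

def compute_stats(tokens: list[str]) -> dict:
    """Basic corpus statistics for a list of tokens (words or characters)."""
    counts = Counter(tokens)
    unique = len(counts)
    # Tokens that appear exactly once in the book ("hapax legomena")
    used_once = sum(1 for n in counts.values() if n == 1)
    return {
        "length": len(tokens),
        "unique": unique,
        "unique_used_once": used_once,
        "percent_used_once": 100 * used_once / unique if unique else 0.0,
        "occurrences": counts.most_common(),  # per-token occurrence info
    }

# Character-level stats can reuse the same helper on the raw text:
# compute_stats(list("日本語のテキスト"))
```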
For text processing, we use MeCab and its python3 bindings, mecab-python3 (https://pypi.org/project/mecab-python3/). We also use the mecab-ipadic-neologd neologism dictionary so that neologisms, i.e. new words that often originate on the internet, are recognised.
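As a minimal illustration of what that looks like in code (the dictionary path below is an assumption and will depend on where mecab-ipadic-neologd is installed on your system):

```python
# Minimal tokenisation sketch using mecab-python3 with mecab-ipadic-neologd.
# The dictionary path is an assumption; adjust it to your own installation.
import MeCab

NEOLOGD_DIR = "/usr/lib/mecab/dic/mecab-ipadic-neologd"  # assumed install path
tagger = MeCab.Tagger(f"-d {NEOLOGD_DIR}")

text = "彼はツイッターで新しいスラングを覚えた。"
node = tagger.parseToNode(text)
words = []
while node:
    if node.surface:  # skip the empty BOS/EOS nodes
        pos = node.feature.split(",")[0]  # part of speech, e.g. 名詞
        words.append((node.surface, pos))
    node = node.next

print(words)
```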
- Clone the repository: `git clone https://github.com/christofferaakre/japanese-ebook-analysis.git`
- Make sure you have `mecab` and `mecab-ipadic-neologd` set up on your system. If you're on Linux or Windows WSL with the `apt` package manager, just run the `setup.sh` script (`./setup.sh`) and then restart your shell. Otherwise, for Linux or Mac, see http://www.robfahey.co.uk/blog/japanese-text-analysis-in-python/ for a good guide on how to set it up. If installing on Mac, you will currently need to change the installation path to `usr/lib/mecab/dic/mecab-ipadic-neologd` using the `-p` flag. Trying to set up the development environment on Windows without WSL is not recommended.
- Install python dependencies: `pip install -r requirements.txt`
- Install other dependencies (these all need to be in your system path): `pandoc` (see the sketch after this list)
- Run `./app.py` (unix) or `python3 app.py` to start the Flask dev server
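For context, `pandoc` is presumably what turns uploaded `.epub` files into plain text before analysis; that is an assumption about the app's internals, but such a conversion could look roughly like this:

```python
# Sketch of converting an .epub to plain UTF-8 text with pandoc.
# Assumes pandoc is on the system path, as required above.
import subprocess

def epub_to_text(epub_path: str) -> str:
    result = subprocess.run(
        ["pandoc", epub_path, "--to", "plain"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8")

# text = epub_to_text("book.epub")
```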
Currently, the project is not deployed anywhere, so to use the service you will first need to follow the steps in the development section above to get the server running.
- Upload a file containing Japanese text to the server. Currently allowed file formats: `.epub` and `.txt` (with UTF-8 encoding). If you have a `.txt` file with a different encoding, there are many ways to change it; I use https://github.com/Frederick-S/aozora-bunko-utf8. There is a public-domain Japanese text in this repository that you can use for testing purposes: `data/sejain.txt`. (For a scripted upload instead of the browser, see the sketch after this list.)
- The server will redirect you to a page showing information about the ebook. You can then click the 'See more details' button to see all the generated data, including a list of every word used in the book together with how many occurrences it has, and the same for the characters.
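If you would rather exercise the server from a script than from the browser, something like the following should work; note that the route and form field name below are guesses, so check `app.py` for the real ones:

```python
# Illustrative only: upload a book to the local dev server programmatically.
# The endpoint path and form field name are assumptions; check app.py.
import requests

with open("data/sejain.txt", "rb") as f:
    response = requests.post(
        "http://127.0.0.1:5000/",           # default Flask dev server address
        files={"file": ("sejain.txt", f)},  # assumed form field name
    )

print(response.status_code)
```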
I'm very happy for any contributions! Before contributing, please have a look at CONTRIBUTING.md. To see what needs work, have a look at the repo's Issues and Pull requests. Feel free to submit your own issue or pull request about a new feature or anything else. When submitting a pull request, don't be afraid to modify any of the files; I'm not very attached to the coding style used in the repo.
Please report bugs in the Issues section of the repo with the 'bug' label. Make your bug report as descriptive as possible so that it is easy for others to track down the cause of the issue and a potential fix.