Live survey of off-the-shelf language identification tools for Python
./datasets/tatoeba-sentences-2021-06-05/download
Available benchmarks:
- fasttext
- fasttext-compressed
- gcld3
- langdetect
- langid
- pycld2
Available datasets:
- tatoeba-sentences-2021-06-05
- tatoeba-sentences-2021-06-05-common-48
- open-subtitles-v2018-100k-per-lang
On the host machine:
python run.py <benchmark_name>
In Docker:
docker build -t bench .
docker run -v `pwd`:/src -t -i bench python /src/run.py <benchmark_name>
python analyze.py --correctness
python analyze.py --timings
python get_memory_usage.py <benchmark_names>
# e.g. python get_memory_usage.py fasttext
# e.g. python get_memory_usage.py fasttext-compressed
The script prints memory usage in MB, computed as bytes / 1024 / 1024.
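The unit conversion stated above can be sketched as follows. This is only an illustration of the bytes / 1024 / 1024 rule: `bytes_to_mb` is a hypothetical helper name, not a function taken from get_memory_usage.py, and the sample value is arbitrary rather than a measured result.

```python
def bytes_to_mb(num_bytes: int) -> float:
    """Convert a byte count to MB using the bytes / 1024 / 1024 rule."""
    return num_bytes / 1024 / 1024


if __name__ == "__main__":
    # 512 * 1024 * 1024 bytes is an arbitrary sample value, not a benchmark result.
    print(f"{bytes_to_mb(512 * 1024 * 1024):.1f} MB")
```

Because the divisor is 1024 * 1024 rather than 1,000,000, the reported figure is slightly smaller than a decimal-megabyte count of the same number of bytes.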