A tiny no-strings Python package for translating massive CSV / JSONL files, with native support for pre-annotated fixed spans that the translator leaves unchanged.
The out-of-the-box features of bulk-translate are:
- ✅ Support of spans for annotation / optional translation.
- ✅ Native implementation of two translation modes:
  - fast-mode: exploits extra chars that can be used to group all the text parts into a single batch, which is deconstructed after translation.
  - accurate: translates each text part individually.
- ✅ No strings: you're free to adopt any LM / LLM backend.
  - Supports googletrans by default.
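The difference between the two modes can be pictured with a minimal sketch. This is not the package's implementation: the separator character, the function names, and the fallback to per-part translation are all assumptions made for illustration, and `fake_translate` is a stand-in for a real backend.

```python
from typing import Callable, List

# Hypothetical separator, chosen for illustration only;
# not necessarily the extra char that bulk-translate uses.
SEPARATOR = "\n"


def translate_accurate(parts: List[str], translate: Callable[[str], str]) -> List[str]:
    """Accurate mode (sketch): one backend call per text part."""
    return [translate(part) for part in parts]


def translate_fast(parts: List[str], translate: Callable[[str], str]) -> List[str]:
    """Fast mode (sketch): join all parts, translate once, split back."""
    batch = SEPARATOR.join(parts)
    translated = translate(batch).split(SEPARATOR)
    if len(translated) != len(parts):
        # Assumed safeguard: if the backend altered the separators,
        # fall back to translating each part on its own.
        return translate_accurate(parts, translate)
    return translated


calls: List[str] = []


def fake_translate(text: str) -> str:
    """Stand-in backend: records each call and upper-cases the text."""
    calls.append(text)
    return text.upper()


parts = ["hello", "big world"]
fast_result = translate_fast(parts, fake_translate)
fast_calls = len(calls)
accurate_result = translate_accurate(parts, fake_translate)
accurate_calls = len(calls) - fast_calls
```

In this sketch the fast path makes a single backend call for the whole batch, while the accurate path makes one call per part, which is the trade-off the two modes describe.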
From PyPI:
pip install bulk-translate
or the latest version from the repository:
pip install git+https://github.com/nicolay-r/bulk-translate
NOTE: Spans are supported only in the JSON-lines format.
NOTE: Requires the source_iter package to be installed.
For the test.tsv example data, in which annotated entities are enclosed in square brackets, run:
python -m bulk_translate.translate \
--src "test/data/test.tsv" \
--prompt "{text}" \
--adapter "dynamic:models/googletrans_310a.py:GoogleTranslateModel" \
--output "test-translated.jsonl" \
%%m \
--src "auto" \
--dest "ru"
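The square-bracket convention for annotated entities can be illustrated with a small sketch that splits a text into translatable fragments and fixed spans, then translates only the fragments. The helper names `split_spans` and `translate_keeping_spans` are hypothetical and are not part of the bulk-translate API; `str.upper` stands in for a real translation backend.

```python
import re
from typing import Callable, List, Tuple

# Matches one non-nested span enclosed in square brackets.
SPAN_PATTERN = re.compile(r"\[([^\[\]]*)\]")


def split_spans(text: str) -> List[Tuple[str, bool]]:
    """Return (fragment, is_span) pairs in their original order."""
    pieces: List[Tuple[str, bool]] = []
    cursor = 0
    for match in SPAN_PATTERN.finditer(text):
        if match.start() > cursor:
            # Plain text between the previous span and this one.
            pieces.append((text[cursor:match.start()], False))
        pieces.append((match.group(1), True))
        cursor = match.end()
    if cursor < len(text):
        pieces.append((text[cursor:], False))
    return pieces


def translate_keeping_spans(text: str, translate: Callable[[str], str]) -> str:
    """Translate plain fragments and re-insert the spans unchanged."""
    return "".join(
        f"[{fragment}]" if is_span else translate(fragment)
        for fragment, is_span in split_spans(text)
    )


example = "[Alice] met [Bob] in Paris"
pieces = split_spans(example)
result = translate_keeping_spans(example, str.upper)
```

Here the bracketed entities pass through untouched while the surrounding text goes to the backend; whether a span is kept as-is or optionally translated is a policy the sketch does not model.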
The pipeline construction components were taken from AREkit [github]