translate rich-text documents between human languages, online or offline
this tool tries to reconcile two conflicting goals:
- preserve the original structure of the document, including whitespace, newlines and indentation
- preserve sentences across structure boundaries, for example a sentence that is interrupted by markup tags:
<p><b>hello</b> world.</p>
use cases:
- translate text documents: html, epub, odt, docx, pdf, rtf, latex
- translate video subtitles: srt, vtt
sentences may be broken by:
- newlines in the source text
- markup tags
this is a problem: when we feed sentence-parts to a translator, the result has worse quality than when we feed it full sentences.
our solution uses two translations:
- a "splitted" translation
- a "joined" translation
the "splitted" translation serves as a "sourcemap": it keeps the correct positions of the sentence-parts, but its quality is worse, because the sentences are broken into sentence-parts.
the "joined" translation provides the translated sentences with better quality than the "splitted" translation, but the positions of the sentence-parts are lost.
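to make the two variants concrete, here is a minimal sketch of how the two translator inputs could be derived from the html example above. it is only an illustration: the variable names are hypothetical, it assumes bash and GNU sed, and stripping tags with a regex is a simplification that a real implementation would replace with a proper parser.

```shell
#!/usr/bin/env bash
# hypothetical example input, taken from the html snippet above
html='<p><b>hello</b> world.</p>'

# "splitted" input: replace every tag with a newline, then drop empty lines.
# each remaining line is one sentence-part.
# assumption: GNU sed, which accepts \n in the replacement text.
splitted_input=$(printf '%s\n' "$html" | sed -E 's/<[^>]+>/\n/g' | sed '/^$/d')

# "joined" input: remove the tags, so the translator sees the full sentence
joined_input=$(printf '%s\n' "$html" | sed -E 's/<[^>]+>//g')

printf 'splitted:\n%s\n' "$splitted_input"
printf 'joined:\n%s\n' "$joined_input"
```

the "splitted" input has two lines ("hello" and " world."), so the positions of the parts are known. the "joined" input has the single line "hello world.", which gives the translator the full sentence.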
currently, we align the two translations with a "character diff":
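# text found only in the "splitted" file is colored green: drop it,
# then strip the remaining color codes and the 5-line diff header
git diff --word-diff=color --word-diff-regex=. --no-index \
"$(readlink -f translation.joined.txt)" \
"$(readlink -f translation.splitted.txt)" |
sed -E $'s/\e\[32m[^\e]*\e\[m//g; s/\e\\[[0-9;:]*[a-zA-Z]//g' |
tail -n +6 >translation.aligned.txt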
- produce sourcemap of translation argos-translate#372
- Prohibit the translation of pieces of text in Google Translate
- argos-translate - Open-source offline translation library written in Python
- argos-translate-files
- argos-translate-html - too simple, no merging of "splitted" and "joined" translations
- subtitlestranslator.com