py3TranslateLLM.py uses Artificial Intelligence (AI) to translate files.
The focus is on producing the highest quality local Large Language Model (LLM) translations possible, but there is also support for batches using Neural Machine Translation (NMT) models and certain cloud translation APIs.
More specifically, this Python program is a CLI wrapper for these translation engines:
- DeepL API (Free).
- DeepL API (Pro).
- Sugoi Offline Translator from Sugoi Toolkit v4-9.
- Generic OpenAI compatible web servers.
    - Example: https://github.com/vllm-project/vllm
- Certain cloud based NMT translation engines: Google Translate, Google Cloud NMT, Bing Translate, Microsoft Azure NMT, Yandex, etc.

And provides interoperability for these file formats:
- Comma separated value (.csv).
- Open Office XML (.xlsx).
- Microsoft Excel 97/2000/XP/2003 (.xls).
- OpenDocument spreadsheet (.ods).
- Plain text.
Not Planned:
- Microsoft Excel 95 (.xls).
- This might end up being supported anyway.
- Sugoi's DeepL, Sugoi Translator Premium, Sugoi Papago.
- Parsing arbitrary file types. Only spreadsheets and plain text files are natively supported.
- To support arbitrary input (.doc, .srt, .epub, .ks, .json), see Regarding Scope for help converting arbitrary data types.
Undetermined:
- Which cloud based LLMs py3TranslateLLM should incorporate.
- Which ones usable for translation are expected to be long lived and to have unlimited-use APIs?
- Or is web hooking any of them worthwhile?
- OpenAI's GPT. For now, consider:
    - DazedMTL - Supports OpenAI's models like GPT-3.5 Turbo, GPT-4.0 Turbo, and GPT-4o.
- py3TranslateLLM should unofficially work on older Python versions like 3.4.
    - Older than 3.7 is tricky because dictionaries became ordered in 3.7 and the order might be important for cache, especially cache.rebuildCache().
    - Older than 3.4 might be tricky because:
        - `pathlib`, which contains `Path` that py3TranslateLLM uses to create folders, was not included in the Python standard library before 3.4.
        - The `exist_ok` parameter was not added to `pathlib.Path().mkdir()` until 3.5.
        - Same with `pip`.
        - 3.4 already requires using an older `openpyxl` version. Using even older versions might incorporate even more already-fixed bugs.
        - It is unlikely any `deepl-python` version that supports 3.4 still works with DeepL's contemporary API.
        - Minor: The `chardet` library requires 3.7+.
            - Minor because that library is optional and text encodings should always be in utf-8 or manually specified anyway.
- https://artificialanalysis.ai
- Guide: huggingface's chatbot-arena-leaderboard. Examples:
- Mixtral 8x7b v0.1.
- Yi-34B-Chat.
- Tulu-2-DPO.
- Google's Gemma.
- Random: Japanese StableLM Instruct Gamma 7B.
- Notes:
    - The size of the model also indicates the RAM required to load it into memory.
    - The more of the model that fits into GPU memory (VRAM), the better the performance.
    - Not all models listed on the leaderboard are compatible with KoboldCpp. See KoboldCpp's documentation for compatible model formats.
Current version: 2024.10.03-alpha
Warning: py3TranslateLLM is currently undergoing active development. The project is in the alpha stage. Alpha means core functionality is still under development.
- Install Python 3.7+. For Windows 7, use this repository.
    - Make sure the Python version matches the architecture of the host machine:
        - For 64-bit Windows, download the 64-bit (amd64) installer.
        - For 32-bit Windows, download the 32-bit installer.
    - During installation, make sure `Add to Path` is selected.
- Open a command prompt.
    - `python --version` #Check to make sure Python 3.7+ is installed.
    - `python -m pip install --upgrade pip` #Optional. Update pip, Python's package manager program.
- Download py3TranslateLLM using one of the following methods:
    - Download the latest project archive:
        - Click on the green `< > Code` button at the top -> Download ZIP.
    - Git: #Requires `git` to be installed.
        - Open an administrative command prompt.
        - Navigate to a directory that supports downloading and arbitrary file execution.
        - `git clone https://github.com/gdiaz384/py3TranslateLLM`
    - Download from the last stable release:
        - Click on "Releases" at the side (desktop), or bottom (mobile), or here.
        - Download either of the archive formats (.zip or .tar.gz).
- If applicable, extract py3TranslateLLM to a directory that supports arbitrary file execution.
- Open an administrative command prompt.
    - `cd /d py3TranslateLLM` #Change directory to enter the `py3TranslateLLM` folder.
    - `pip install -r resources/requirements.txt`
    - `pip install -r resources/optional.txt`
    - `python py3TranslateLLM.py --help`
Install/configure these other projects as needed:
- DeepL:
    - DeepL API support is implemented using their Python library.
        - It can be installed with `pip install -r resources/requirements.txt` or separately with `pip install deepl`.
    - Usage of the DeepL API, both Free and Pro, requires an account and credit card verification.
    - For the DeepL API, an API key is needed. It must be in one of the following places:
        - TODO: Put stuff here.
    - DeepL Web and DeepL's native clients do not seem to have usage limits, and the Windows client at least does not require an account. They might do IP bans after a while.
    - All usage of DeepL's translation services is governed by their Terms of Use.
- LLM support is currently implemented using KoboldCpp's API, which requires KoboldCpp:
    - CPU/Nvidia GPUs: KoboldCpp, FAQ.
    - AMD GPUs: KoboldCpp-ROCM.
    - Developed and tested using `KoboldCpp_nocuda` v1.65 which implements KoboldCpp API v1.
- fairseq is a library released by Facebook/Meta for training sequence models.
    - To install and use fairseq outside of Jpn->Eng translation, refer to fairseq's documentation and obtain an appropriately trained model.
- Sugoi NMT only does Jpn->Eng translation. It requires the Sugoi Offline Translator model which is part of the Sugoi Toolkit.
    - Sugoi NMT is a wrapper for fairseq that comes preconfigured with a Japanese->English dictionary.
    - DL: here or here.
    - Recommended: Remove some of the included spyware.
        - Open: `Sugoi-Translator-Toolkit\Code\backendServer\Program-Backend\Sugoi-Japanese-Translator\main.js`
        - Comment out (`/* */`) or delete the analytics.
    - Tested using Sugoi Offline Translator 4.0 which is part of Sugoi Toolkit 6.0-8.0. Versions 5-8 of the Sugoi Toolkit contain the v4 model in the native PyTorch format. v9+ contain the v4 model in CTranslate2 format.
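LLM translations are submitted to a running KoboldCpp instance over its HTTP API. As a quick sanity check that the server is reachable before starting a long translation run, something like the following can be used. This is a sketch only: it assumes the standard KoboldAI API endpoint `/api/v1/model` that KoboldCpp implements, and the address/port values mirror the examples used elsewhere in this document.

```python
import requests

# Sanity check: ask a running KoboldCpp instance which model it has loaded.
address = 'http://192.168.1.100'  # Same form as --address.
port = 5001                       # KoboldCpp's default port.
response = requests.get( address + ':' + str( port ) + '/api/v1/model' )
print( response.json() )          # Typically something like {'result': 'koboldcpp/[modelName]'}
```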
TODO:
Syntax for executable file (.exe):
py3TranslateLLM.exe [options]
Syntax for Python3 script (.py):
python py3TranslateLLM.py [options]
python3 py3TranslateLLM.py [options]
Usage:
py3TranslateLLM --help
py3TranslateLLM -h
py3TranslateLLM koboldcpp --address=http://192.168.1.100 --port=5001
py3TranslateLLM deepl_api_free [options]
py3TranslateLLM deepl_api_pro [options]
py3TranslateLLM deepl_web [options]
py3TranslateLLM sugoi [options] #sugoi is an alias for fairseq
py3TranslateLLM fairseq [options]
TODO: This section.
Parameter | Description | Example(s) |
---|---|---|
`-te`, `--translationEngine` | The engine used for translation. | `cacheOnly`, `koboldcpp`, `deepl-api-free`, `deepl-api-pro`, `py3translationserver`, `sugoi` |
`-a`, `--address` | A valid network address including the protocol but not the port number. | `--address=http://192.168.1.100`, `-a=http://localhost` |
`-port`, `--port` | The port number associated with the `--address` listed above. | `--port=5001`, `--port=8080`, `--port=443` |
Variable name | Description | Examples |
---|---|---|
`fileToTranslate` | The file to translate. Should be a spreadsheet or a plain text (.txt) file. | `myFile.txt`, `mySpreadsheet.xlsx` |
`languageCodesFile` | Contains the list of supported languages. | `resources/languageCodes.csv` |
Variable name | Description | Examples |
---|---|---|
`py3TranslateLLM.ini` | This file may be used instead of the CLI to specify input options. Keys in the key=value pairs are case sensitive. | `py3TranslateLLM.ini`, `renamedBinary.ini` |
`outputFile` | The name and path of the file to use as output. Defaults to the input name with 'translated.xlsx' appended if not specified. Specify a spreadsheet to output all data. Specify a .txt file to dump only the preferred translation as raw output. | `None`, `output.csv`, `myFolder/output.xlsx` |
`promptFile` | This file has the prompt for the LLM. Only needed if using an LLM. | `resources/templates/prompt.Mixtral8x7b.example.txt` |
`revertAfterTranslationDictionary` | Entries will be submitted to the translation engine but then replaced back to the original text after translation. | `resources/templates/characterNamesDictionary_example.csv` |
`preTranslationDictionary` | Entries will be replaced prior to submission to the translation engine. | `preTranslationDictionary.csv` |
`postTranslationDictionary` | Entries will be replaced after translation. | `postTranslationDictionary.csv` |
`postWritingToFileDictionary` | After the translated text has been written back to a text file, the file will be opened again to perform these replacements. | `postWritingToFileDictionary.csv` |
`sceneSummaryPrompt` | Experimental feature. This file has the prompt used to generate summaries using an LLM. Using this disables translation. | `sceneSummaryPrompt.txt` |
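The dictionary files above are .csv files. As a purely hypothetical illustration, assuming a simple searchText,replacementText layout with one pair per row, a character names dictionary might look like the snippet below. Consult the templates under `resources/templates/` for the authoritative formats.

```csv
クロエ,Chloe
お兄ちゃん,Onii-chan
```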
- LLMs support many different sources of information when translating text. However, since much of this information is dynamic, known only at runtime, or constantly changing, py3TranslateLLM supports replacing certain {keywords} in the instructions to the LLM at runtime. This should ensure the highest quality translation possible while also supporting a very large degree of automation by making it possible to use the same set of LLM instructions, `prompt.txt`, `memory.txt`, and `sceneSummary.txt`, for an entire dataset.
- LLM translations always require a `prompt.txt`. `prompt.txt` should include the main instructions to the LLM, e.g. translate this text, as well as examples to help the LLM understand how to format the desired output. See `resources\templates\*` for examples.
    - Guides:
        - Microsoft's prompt engineering guide.
        - Nvidia's developer blog LLM introduction.
    - Examples:
- To optionally improve translation quality, it is also recommended to always use `memory.txt`. `memory.txt` should include background information that may or may not be directly relevant to the immediate translation. Examples include a description of the source content (e.g. story, dialogue, novel, subtitles, game), translations for character names, information about the characters and their relationships to one another, and a description of what is happening in the scene currently being translated.
- Experimental Feature: To optionally improve translation quality, consider using `sceneSummary.txt`. `sceneSummary.txt` is a `prompt.txt` with instructions to generate a summary of the current scene being translated so that the summary can be inserted into `prompt.txt` and/or `memory.txt` prior to translating individual lines.
- Keywords:
Variable | Scope | Description |
---|---|---|
`{untranslatedText}` | prompt.txt | The current line prior to translation. |
`{sourceLanguage}` | All | The source language specified at the command prompt. The literal text is the first entry in `languageCodes.csv`. |
`{targetLanguage}` | All | The target language specified at the command prompt. The literal text is the first entry in `languageCodes.csv`. |
`{history}` | prompt.txt | The rolling history buffer of previously untranslated/translated entry pairs. This gets formatted according to the LLM instruction type: chat, instruct, autocomplete. |
`{scene}` | sceneSummary.txt | The current batch of untranslated lines to use when generating a summary. |
`{scene}` | prompt.txt, memory.txt | The summary generated from the current untranslated lines. |

Note: If the data inserted in the above variables is not formatted properly for a given model, especially `{history}`, then update the code at engine.py appropriately or [open an issue] to request support for a specific LLM model.
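To illustrate the mechanism, here is a minimal sketch of how such {keyword} replacement could work. This is not py3TranslateLLM's actual implementation; the function name and template text are illustrative only.

```python
# Illustrative only: a naive {keyword} substitution pass over an LLM prompt template.
def fillTemplate( template, untranslatedText, sourceLanguage, targetLanguage, history ):
    return ( template
        .replace( '{untranslatedText}', untranslatedText )
        .replace( '{sourceLanguage}', sourceLanguage )
        .replace( '{targetLanguage}', targetLanguage )
        .replace( '{history}', '\n'.join( history ) ) )

# Hypothetical prompt.txt contents.
template = 'Translate the following {sourceLanguage} text to {targetLanguage}.\n{history}\n{untranslatedText}'
print( fillTemplate( template, '今日は晴れです。', 'Japanese', 'English', [] ) )
```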
- This xkcd is my life.
- Backups of the imported data are written to `backups/[date]/*` prior to data processing. Use `--backups`, `-bk` to disable this feature.
- If interrupted, translated entries are still available in the local cache. Running the same command as-is will therefore skip previously translated data starting from the last time the cache was written to disk.
    - Alternatively, use one of the backup files created under `backups/[date]/*` to continue with minimal loss of translated data. Resuming from save data in this folder after interruptions is not automatic. Technically `--resume` (`-r`) exists for this reason, but only backup files with today's and yesterday's dates are checked.
- The second column in the spreadsheets is reserved for the speakerName of the current line. If present, the speakerName is automatically used for LLM translations.
- By default, backups of fileToTranslate are made at most once every 9 minutes. To alter this behavior, change `defaultMinimumSaveIntervalForMainSpreadsheet` in `py3TranslateLLM.py`.
- By default, cache is written at most once every 5 minutes. To alter this behavior, change `defaultMinimumSaveIntervalForCache` in `py3TranslateLLM.py`.
- By default, sceneSummaryCache is written at most once every 5 minutes. To alter this behavior, change `defaultMinimumSaveIntervalForSceneSummaryCache` in `py3TranslateLLM.py`.
- Settings can be specified during runtime from the command prompt/terminal/CLI and/or by using `py3TranslateLLM.ini`. See Regarding Settings Files for more information.
- Aside: LLaMA stands for Large Language Model Meta AI. Wiki.
    - Therefore Local LLaMA is about running AI on a local PC.
- Many features have not been implemented yet.
- Most features have not been tested yet.
- The design concept behind py3TranslateLLM is to produce the highest quality machine and AI translations possible for dialogue and narration by providing NMT and LLM models the information they need to translate to the best of their ability. This includes but is not limited to:
    - Bundling the untranslated language strings into paragraphs to increase the context of the translated text.
- For LLMs and DeepL, providing them with the history of previously translated text to ensure proper flow of dialogue.
- For LLMs and DeepL, identifying any speakers by name, sex and optionally other metrics like age and occupation.
- For LLMs and DeepL, providing other arbitrary bits of information in the prompt.
- Supporting dictionaries that allow removing and/or substituting strings that should not be translated prior to forming paragraphs and prior to submitting text for translation. Examples of in-line text that should be removed or altered: [r] [repage] [heart].
- This should help the LLM/NMT understand the submitted text as contiguous 'paragraphs' better.
- Tip: To automate this, use escapeLibrary.py during parsing.
- Other translation techniques omit some or all of the above. Providing this information should dramatically increase translation quality for languages that are heavily sensitive to context, like Japanese, where much or most of the meaning is found not in the spoken or written words but in the surrounding context in which the words are spoken.
- Aside: For Japanese in particular, context is very important as it is often the only way to identify who is speaking and whom they are talking about.
- If translating from context light languages, like English where most of the meaning of the language is found within the language itself, then there should not be any or only small differences in translation quality. For such languages, use a translation engine that supports batch translations for the maximum possible speed.
- In addition, substitution dictionaries are supported at every step of the translation workflow to fine tune input and output and deal with common mistakes. This should result in a further boost in translation quality.
- The intent is to increase the productivity of translators by cutting down the time required for the most time consuming aspect of creating quality dialogue translations, the editing phase. py3TranslateLLM does this by providing the highest quality MTL baseline possible from which to start editing, and by providing multiple translation engines for easy cross referencing.
- This program was written as part of a workflow meant to complement other automated parsing and script extraction programs, meaning that compatibility with such programs, and the openness required to adjust workflows as needed, are part of the core design concept.
- This program focuses on translating dialogue that has been input into spreadsheets.
- Other programs can be used to find and parse small bits of untranslated text in text files and images and handle how to reinsert them.
- While it is not the emphasis of this program, submitting translations in batches to some NMT models is also supported.
- While it is not the emphasis of this program, there is some code to help translate plain.txt files. This only works in line-by-line mode.
- For more complicated input, parse the output using a parsing program that can convert it to a spreadsheet format, like .csv, before using py3TranslateLLM. Examples:
- py3AnyText2Spreadsheet. Supports parsing many common formats via templates but is intended more for DIY non-regex parsing.
- SExtractor. Supports regex.
- fileTranslate. Supports regex.
- Consider writing a parser yourself. Assuming plain text files, it should not take more than an afternoon to write a parser in Python due to Python's very large standard library, [available templates], and the very large number of third party libraries readily available on PyPi.org.
- Tip: After parsing, use pyexcel, documentation or [chocolate.py] to export the data to and out of spreadsheets easily.
- pyexcel and chocolate are wrapper libraries for openpyxl and other libraries that focus on providing i/o and data structure manipulation for the various spreadsheet formats.
- Note that pyexcel has a plugin system for various formats and requires those plugins to also be installed in addition to the implemented base libraries. See their installation section for a lack of guidance on how to install them.
- py3TranslateLLM uses spreadsheets for its internal data structures. LibreOffice and other spreadsheet manipulation programs can be used to read/write them directly. For more information, see: "Regarding Open Office XML".
- For the spreadsheet formats, .csv, .xlsx, .xls, .ods, the following apply:
- The first row, 1st, is reserved for headers and is always ignored for data processing otherwise.
- The first column is reserved for the source content for translation.
- Multiple lines within a cell for the first column, called 'paragraphs,' are allowed.
- New lines will not be preserved in the output cell. If this behavior is desired, regenerate them dynamically when writing to the output files as needed. Word wrap is outside the scope of this project.
- The second column, 2nd, is reserved for the character speaking.
- Feel free to add the speaker if a speaker could not be automatically determined during parsing.
- The third column, 3rd, and columns after it are used for metadata or the translation engines. Currently that is KoboldCpp, DeepL API Free, DeepL API Pro, py3translationServer, Sugoi.
    - Label the first cell in a column to reserve that column for that translation engine. One translation engine per column. An illustrative layout is shown after this list.
    - To reserve it for metadata, call it `metadata` or similar.
    - If the current translation engine does not exist as a column, then it will be added dynamically as needed.
    - KoboldCpp translation engine columns are in the format `koboldcpp/[modelName]`, therefore changing the model mid-translation will result in a completely new column because different models produce different output.
    - py3translationServer columns are in the format `py3translationServer/[modelName]`, which can result in a completely new column when changing models because different models produce different output.
        - For CTranslate2, the model name sometimes takes the form `py3translationServer/model.bin` if the model.bin file was explicitly used when invoking py3translationServer. This behavior can result in collisions between different models and potentially different languages as well. As a workaround, if using py3translationServer + CTranslate2, specify the model by using the folder name of the model instead, and name the folder in a descriptive way.
            - Example: `b100_model/model.bin` and invoke using `b100_model` only.
        - This behavior of py3translationServer may change in the future.
- The order of the translation engine columns (4+) only matters when writing back to files (.txt). The column furthest to the right will be preferred.
- .csv files (a programmatic example is shown after this list):
    - Must use a comma `,` as a delimiter.
    - Entries containing:
        - new line character(s) `\n`, `\r\n`
        - comma(s) `,`
        - must be quoted using two double quotes `"`. Example: `"Hello, world!"`
    - Single quotes `'` are not good enough. Use double quotes `"`.
    - Entries containing more than one double quote `"` within the entry must escape those quotes:
        - Use a backslash `\` like: `"\"Hello, world!\""`
        - Or use double quotes `""` like: `"""Hello, world!"""`
            - TODO: Test this.
    - Whitespace is ignored for `languageCodes.csv` and for .csv's that contain the untranslated text.
    - Whitespace is preserved for all of the dictionaries.
- .xls is quite old and supports a maximum of ~65,000 rows, which is relatively small. Source. Consider using any other format.
- Microsoft's documentation.
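Putting the row and column rules above together, a purely illustrative `mainSpreadsheet` might look like this (the header labels, speaker, and translation are examples only, not required values):

Untranslated text | speaker | metadata | koboldcpp/[modelName] |
---|---|---|---|
今日は晴れです。 | クロエ | scene1 | It is sunny today. |

And when writing .csv files programmatically, Python's standard `csv` module applies the quoting rules above automatically, which is easier than quoting by hand. A minimal sketch, with an illustrative file name:

```python
import csv

# The csv module quotes entries containing commas, quotes, or new lines automatically.
rows = [
    [ 'untranslatedText', 'speaker' ],                     # Row 1 is reserved for headers.
    [ '"Hello, world!"\nNice weather today.', 'クロエ' ],  # Gets quoted/escaped on write.
]
with open( 'myFile.csv', 'w', newline='', encoding='utf-8' ) as myFile:
    csv.writer( myFile, delimiter=',' ).writerows( rows )
```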
- Settings read from the command prompt take priority over what is specified in the `.ini` text file.
- By default, the name of the program without an extension + `.ini` is used to determine the name of the settings.ini file from which to read settings. Example: `py3TranslateLLM.ini`
- This file can also be specified manually at the CLI during runtime by using `--settingsFile`, `-sf`.
- Values in the settings.ini file are designated using the following syntax: `commandLineOption=value`
- The 'None' keyword for an option indicates no value. Example: `preTranslationDictionary=None`
- Whitespace is ignored.
- Lines with only whitespace are ignored.
- Lines starting with `#` are ignored. In other words, `#` means a comment.
- Keys in the key=value pairs are case sensitive.
- Keys in the key=value pairs must match the command line options exactly or they will be interpreted as user-defined custom values.
- See: `py3TranslateLLM --help` and the Parameters enumeration below for valid values.
- The text formats used for settings.ini (.ini .txt) have their own syntax:
    - `#` indicates that line is a comment.
    - Values are specified by using `item=value`. Example: `paragraphDelimiter=emptyLine`
    - Empty lines are ignored.
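Putting those rules together, a small example settings file might look like the following. The keys mirror the command line options documented in this README, and the values are illustrative only; see `py3TranslateLLM --help` for the authoritative option names.

```ini
# Example py3TranslateLLM.ini
# Keys are case sensitive and must match the command line options exactly.
translationEngine=koboldcpp
address=http://localhost
port=5001

# 'None' indicates no value.
preTranslationDictionary=None
```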
- TODO: Update this part.
- For the text formats used for input (.txt, .ks, .ts), the inbuilt parser will use the user-provided settings file to parse the file.
    - A settings file is required when parsing such raw text files.
    - Examples of text file parsing templates can be found under `resources/templates/`.
- There are a lot of dictionary.csv files involved. Understanding the overall flow of the program should clarify how to use them:
    1. All input files besides `fileToTranslate` are read and parsed.
    2. The data structure that holds both the untranslated and translated text while the program is working is called `mainSpreadsheet`. How it is created is handled differently depending upon whether `fileToTranslate` is a spreadsheet or a text file.
        - If `fileToTranslate` is a spreadsheet (.csv, .xlsx, .xls, .ods, .tsv), it is converted to `mainSpreadsheet` as-is. See above for the formatting guidelines.
        - If `fileToTranslate` is not a spreadsheet, it is treated as a text file:
            - The lines in the text file are read in as-is line-by-line without any parsing logic.
            - To parse the text file in a more complicated way, use py3AnyText2Spreadsheet.
    3. The process to translate the first column, 'A', in `mainSpreadsheet` using a particular translation engine begins. Examples: koboldcpp, deepl_api_free, deepl_api_pro, deepl_web, py3translationServer, sugoi.
        a. If the paragraph is present in `cache.xlsx`, it is translated using the cache file and the translation process skips to step g.
        b. If present, `revertAfterTranslationDictionary` is considered and replacements are performed.
        c. If present, `preTranslationDictionary` is considered and replacements are performed.
        d. The paragraph is submitted to the translation engine.
            - If context history is enabled, the translated paragraph is added to context history for subsequent translations.
        e. If present, `revertAfterTranslationDictionary` is considered to revert certain changes.
        f. The untranslated line and the translated line are added to the `cache.xlsx` file as a pair.
        g. If present, `postTranslationDictionary` is considered to alter the translation further.
        h. The translated paragraph is written to `mainSpreadsheet` in the column for the current translation engine.
        i. Periodically as entries are translated, a backup.xlsx is made under `backups/[date]/`.
    4. The spreadsheet file, .xlsx, is written to output.
        - For text file and .csv output only, `postWritingToFileDictionary` is considered. This file is intended to fix encoding errors when doing baseEncoding -> unicode -> baseEncoding conversions since codec conversions are not lossless.
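As a mental model only, here is a sketch of the per-paragraph portion of step 3 above. The function and dictionary names are illustrative, and `engine.translate()` is a stand-in for whatever the selected translation engine does; this is not py3TranslateLLM's actual code.

```python
# Illustrative only: the per-paragraph translation flow described above.
def translateParagraph( paragraph, cache, revertDict, preDict, postDict, engine ):
    if paragraph in cache:                                 # step a: cache hit skips to step g.
        translation = cache[ paragraph ]
    else:
        workingCopy = paragraph
        for key, value in revertDict.items():              # step b
            workingCopy = workingCopy.replace( key, value )
        for key, value in preDict.items():                 # step c
            workingCopy = workingCopy.replace( key, value )
        translation = engine.translate( workingCopy )      # step d
        for key, value in revertDict.items():              # step e: revert certain changes.
            translation = translation.replace( value, key )
        cache[ paragraph ] = translation                   # step f
    for key, value in postDict.items():                    # step g
        translation = translation.replace( key, value )
    return translation                                     # step h: written to mainSpreadsheet.
```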
- DeepL has quite a few quirks:
- DeepL's support page.
- Certain languages, like Chinese, English, and Portuguese, have regional variants.
- DeepL is picky about the target English dialect based upon the source language.
    - Yet language dictionaries can be used with any dialect of that language (TODO: double-check this).
- DeepL's API Free vs Pro plans.
- The formal vs informal feature is only available to Pro users, so it is not available for the deepl-api-free or deepl-web translation engines. About-the-formal-informal-feature.
- If translating to Japanese, not from, then read DeepL's plain vs polite feature.
- The default list of supported languages can be found at `resources/languageCodes.csv`.
- If using an LLM for translation and utilizing a language not listed in `languageCodes.csv`, then add that language as a new row to make py3TranslateLLM aware of it. This file is subject to change without notice.
- The default supported languages list is based on DeepL's supported languages list and their openapi.yaml specification.
- py3TranslateLLM uses mappings based upon this table and supports any of the following to specify a language:
    - The full language name in English: `English`, `German`, `Spanish`, `Russian`.
    - The 2 letter language code: `en-us`, `de`, `es`, `ru`.
    - The 3 letter language code: `eng`, `deu`, `spa`, `rus`.
    - Entries are case insensitive. Both `lav` and `LAV` will work.
- DeepL only supports 2 letter language codes, which creates some ambiguity regarding conversion to 3 letter language codes.
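For illustration only: given the mapping rules above, a single row of `languageCodes.csv` plausibly groups a language's full name with its 2 and 3 letter codes. The exact column layout is whatever ships in `resources/languageCodes.csv`; this row is hypothetical:

```csv
English (American),En-US,Eng
```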
- Note these quirks and 3 letter language code collisions:
    - `English` has a collision in the 3 letter code `Eng` between `English (American)` and `English (British)`.
        - Selecting the three letter language code of `Eng` will default to `English (American)`, `En-US`.
        - To select `English (British)` as a 3 letter language code, use `Eng-GB`.
        - `English` is also mapped to `English (American)` by default.
            - To select `English (British)`, enter `English (British)` or a language code.
        - The above distinction between the two dialects only applies to selecting English as the target language. If selecting English as a source language, `English` is sufficient and will be used regardless.
    - `Chinese (traditional)`:
        - This is not yet supported by DeepL as a target language.
        - There is a collision when using `Chinese (traditional)` with the `Chinese (simplified)` 2 letter language code.
            - If using the 2 letter language code `ZH`, `Chinese (simplified)` will be selected.
            - To select `Chinese (traditional)` as a 2 letter language code, use `ZH-ZH`.
        - The above distinction between the two dialects only applies to selecting Chinese as the target language. If selecting Chinese as a source language, `Chinese` is sufficient and will be used regardless.
            - DeepL supports both `Chinese (traditional)` and `Chinese (simplified)` as the source language using automatic character detection.
    - `Czech` might be CZE (B) or CES (T). It is unclear which DeepL supports. The 3 letter language code of `CES` is used.
    - `Dutch` might be DUT (B) or NLD (T). It is unclear which DeepL supports. The 3 letter language code of `NLD` is used.
    - `French` can be FRE (B) or FRA (T). It is unclear which DeepL supports. The more common `FRA` is used.
        - Aside: French also has some very old unused dialects from the middle ages: FRM, FRO.
    - `German` has GER (B) and DEU (T). It is unclear which DeepL supports. The 3 letter language code of `DEU` is used.
        - Aside: German has some archaic ones too.
    - Modern `Greek` has GRE (B) and ELL (T). It is unclear which DeepL supports. The 3 letter language code of `ELL` is used.
    - Portuguese has a collision in the 3 letter code `POR` between `Portuguese (European)` and `Portuguese (Brazilian)`.
        - Selecting the three letter language code of `POR` will default to `Portuguese (European)`, `PT-PT`.
        - To select `Portuguese (Brazilian)` as a 3 letter language code, use `POR-BRA`.
        - The above distinction between the two dialects only applies to selecting Portuguese as the target language. If selecting Portuguese as a source language, `Portuguese` is sufficient and will be used regardless.
    - `Romanian` has both RUM (B) and RON (T). It is unclear which DeepL supports. The 3 letter language code of `RON` is used.
    - `Spanish` has an alias of `Castilian`.
- Open Office XML (OOXML), .xlsx, is the native format used in py3TranslateLLM to store data internally during processing and should be the most convenient way to edit translated entries and the cache directly without any unnecessary conversions that could introduce formatting bugs.
- Here are some free and open source software (FOSS) office suites that can read and write Open Office XML and the other spreadsheet formats (.csv, .xls, .ods):
- LibreOffice. License and source.
- OnlyOffice is AGPL v3. Source.
- Apache OpenOffice. License and source. Note: Can read but not write to .xlsx.
- OpenPyXL, the library used in the core data structure for this program, follows the Open Office XML standard closely, and will not load documents that do not follow the same standard closely.
- In other words, Microsoft Office will probably not work. If using Microsoft Excel, then export as .ods, .xls, or .csv instead. For Excel .csv files, specify the option... TODO: this part.
- See Microsoft's documentation for why their software does not work correctly.
- Read the Text Encoding wiki entry.
- After reading the above wiki entry, the rest of this section should make more sense.
- Tip: Use `py3TranslateLLM.ini` to specify the encoding for text files used with `py3TranslateLLM.py`.
- For compatibility reasons, data gets converted to binary strings for stdout, which can result in the console sometimes showing utf-8 hexadecimal (hex) encoded unicode characters, like `\xe3\x82\xaf\xe3\x83\xad\xe3\x82\xa8`, especially with `debug` enabled. To convert them back to non-ascii characters, like `クロエ`, dump them into a hex to unicode converter.
    - Example: www.coderstool.com/unicode-text-converter
    - Example: If the local console or Python IDE supports utf-8, then the text can also be displayed properly after decoding the string in Python:
        - Start a command prompt or terminal.
        - `python`
        - `string=b'\xe3\x82\xaf\xe3\x83\xad\xe3\x82\xa8'`
        - `string.decode('utf-8')`
        - `ctrl + z`
- Some character encodings cannot be converted to other encodings. When such errors occur, use the following error handling options:
    - docs.python.org/3.7/library/codecs.html#error-handlers, and More Examples -> Run example.
    - The default error handler for input files is `strict`, which means 'crash the program if the encoding specified does not match the file perfectly'.
    - On Python >= 3.5, the default error handler for the output file is `namereplace`. This obnoxious error handler:
        - Makes it obvious that there were conversion errors.
        - Does not crash the program catastrophically.
        - Makes it easy to do ctrl+f replacements to fix any problems.
            - Tip: Use `postWritingToFileDictionary` or py3stringReplace to automate these ctrl+f replacements.
    - If there are more than one or two such conversion errors per file, then the chosen file encoding settings are probably incorrect.
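To see the difference between the two error handlers described above, here is a small self-contained demonstration using Python's built-in codec error handlers; the sample string matches the hex example from earlier:

```python
text = 'クロエ'

# 'strict' (the default for input files) crashes on any unencodable character.
try:
    text.encode( 'ascii', errors='strict' )
except UnicodeEncodeError as error:
    print( error )

# 'namereplace' (the default for output files on Python >= 3.5) substitutes
# \N{...} escape sequences instead, which are easy to find with ctrl+f.
print( text.encode( 'ascii', errors='namereplace' ) )
# b'\\N{KATAKANA LETTER KU}\\N{KATAKANA LETTER RO}\\N{KATAKANA LETTER E}'
```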
- If the chardet, charamel, or charset-normalizer libraries are available, they will be used to try to detect the character encoding of files via heuristics. While heuristics are an imperfect solution and obviously very error prone, it is still better than nothing.
    - To make the above libraries available, install at least one using `pip`:
        - `pip install chardet`
        - `pip install charamel`
        - `pip install charset-normalizer`
    - Priority is chardet > charamel > charset-normalizer.
    - If none of the above are available, then everything is assumed to be `utf-8` unless otherwise specified.
    - Note that support for `charamel` and `charset-normalizer` has not actually been implemented yet (2024-07-09).
- Recommended: If you do not want to deal with this, then use a binary file from the releases page instead.
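For reference, a minimal sketch of what heuristic detection with chardet looks like. `chardet.detect()` is chardet's standard API; the file name is illustrative:

```python
import chardet

# Illustrative only: guess the encoding of a text file via heuristics.
with open( 'myFile.txt', 'rb' ) as myFile:
    rawBytes = myFile.read()
guess = chardet.detect( rawBytes )  # e.g. {'encoding': 'SHIFT_JIS', 'confidence': 0.99, ...}
print( guess[ 'encoding' ], guess[ 'confidence' ] )
```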
- py3TranslateLLM was developed on Python 3.7.
- deepl-python is going to start requiring Python 3.8+ in 2024 because ???.
- It is not necessarily clear what versions work with what other versions, in part due to the shenanigans of some developers creating deliberate incompatibilities, so just install whatever and hope it works.
- In addition to the libraries listed below, py3TranslateLLM also uses several libraries from the Python standard library. See source code for an enumeration of those.
Library name | Required, Recommended, or Optional | Description | Install command | Version used to develop py3TranslateLLM |
---|---|---|---|---|
openpyxl | Required. | Used for main data structure and Open Office XML (.xlsx) support. | `pip install openpyxl` | 3.1.2 |
chocolate | Required. | Implements `openpyxl`. Has various functions to manage using it as a data structure. Also implements other spreadsheet libraries. | Included with py3TranslateLLM. | See source. |
functions | Required. | Has various helper functions used in the main program. | Included with py3TranslateLLM. | See source. |
dealWithEncoding | Required. | Handles text codecs. Implements text codec detection libraries. | Included with py3TranslateLLM. | See source. |
translationEngines/* | Required. | Handles logic for translation services. | Included with py3TranslateLLM. | See source. |
requests | Required. | Used for HTTP get/post requests. Required by both py3TranslateLLM and DeepL. | `pip install requests` | 2.31.0 |
chardet | Recommended. | Detects text codecs. | `pip install chardet` | 5.2.0 |
charamel | Recommended. | Detects text codecs. | `pip install charamel` | 1.0.0 |
charset-normalizer | Recommended. | Detects text codecs. | `pip install charset-normalizer` | 3.3.2 |
deepl-python | Optional. | Used for DeepL NMT via their API. Optional otherwise. | `pip install deepl` | 1.16.1 |
xlrd | Optional. | Provides reading from Microsoft Excel Document (.xls). | `pip install xlrd` | 2.0.1 |
xlwt | Optional. | Provides writing to Microsoft Excel Document (.xls). | `pip install xlwt` | 1.3.0 |
odfpy | Optional. | Provides interoperability for Open Document Spreadsheet (.ods). | `pip install odfpy` | 1.4.1 |
tdqm | Optional. | Adds a pretty progress bar to the CLI. | `pip install tdqm` | 0.0.1 |
pykakasi | Optional. | Fast, simple, and lightweight JPN->Romaji dictionary based on Kakasi. | `pip install pykakasi` | 2.2.1 |
cutlet | Optional. | Accurate JPN->Romaji dictionary with MeCab support. | `pip install cutlet` | 0.4.0 |
Libraries can also require other libraries.
- deepl-python requires: `requests`, `charset-normalizer`, `idna`, `urllib3`, `certifi`.
- openpyxl can optionally use: `defusedxml`.
- odfpy requires: `defusedxml`.
- cutlet requires fugashi to tokenize contents based upon the MeCab tokenizer using a dictionary like unidic-py, unidic-lite, ipadic-py, jumandic-py.
- Alternative MeCab wrappers:
- https://github.com/SamuraiT/mecab-python3
- https://github.com/WorksApplications/sudachi.rs and https://github.com/WorksApplications/SudachiPy
- Korean versions:
- https://github.com/NoUnique/pymecab-ko
- https://konlpy.org/en/latest/
- Python standard library's license. For source code, open the Python installation directory on the local system.
- openpyxl's license and source code.
- chardet's license is LGPL v2+. Source code.
- charamel's license and source code.
- charset-normalizer's license and source code.
- xlrd's license and source code.
- xlwt's license and source code.
- odfpy's license is GPL v2. Source code.
- tdqm's license and source code.
- KoboldCPP is AGPL v3. The GGML library and llama.cpp parts of KoboldCPP have this license.
- DeepL's various plans and Terms of Use. DeepL's python library for their API has this license and source code.
- fairseq and license.
- Sugoi and source code.
- The pretrained model used in Sugoi NMT has its own non-commercial license.
- 'Sugoi NMT' is a wrapper for fairseq which, along with the pretrained model, does the heavy lifting for 'Sugoi NMT'.
- Sugoi NMT is one part of the 'Sugoi Translator Toolkit' which is itself part of the free-as-in-free-beer distributed 'Sugoi Toolkit' which contains other projects like manga translation and upscaling.
- The use of Github to post source code for Sugoi Toolkit suggests intent to keep the wrapper code under a permissive license. A more concrete license may be available on discord.
- py3TranslateLLM.py and the associated libraries under `resources/` are GNU Affero GPL v3.
    - You are free to use the software as long as you do not infringe on the freedoms of other people.
    - Summary: Feel free to use it, modify it, and distribute it to an unlimited extent, but if you distribute binary files of this program outside of your organization, then please make the source code for those binaries available.
    - The imperative to make source code available also applies if using this program as part of a server if that server can be accessed by people outside of your organization. For additional details, consult the license text.