- This repo contains examples of how to fine-tune Large Language Models (LLMs) and apply them to a real task: news translation.
- It also provides a news parser, so you can easily parse any news web page you want (for example, CNN or BBC News) and test how a pre-trained LLM translates real news articles.
Facebook: M2M100 (1.2B parameters) is a multilingual encoder-decoder (seq-to-seq) model primarily intended for translation tasks, covering 100 languages.
All available languages: Afrikaans, Amharic, Arabic, Asturian, Azerbaijani, Bashkir, Belarusian, Bulgarian, Bengali, Breton, Bosnian, Catalan; Valencian, Cebuano, Czech, Welsh, Danish, German, Greek, English, Spanish, Estonian, Persian, Fulah, Finnish, French, Western Frisian, Irish, Gaelic; Scottish Gaelic, Galician, Gujarati, Hausa, Hebrew, Hindi, Croatian, Haitian; Haitian Creole, Hungarian, Armenian, Indonesian, Igbo, Iloko, Icelandic, Italian, Japanese, Javanese, Georgian, Kazakh, Central Khmer, Kannada, Korean, Luxembourgish; Letzeburgesch, Ganda, Lingala, Lao, Lithuanian, Latvian, Malagasy, Macedonian, Malayalam, Mongolian, Marathi, Malay, Burmese, Nepali, Dutch; Flemish, Norwegian, Northern Sotho, Occitan (post 1500), Oriya, Panjabi; Punjabi, Polish, Pushto; Pashto, Portuguese, Romanian; Moldavian; Moldovan, Russian, Sindhi, Sinhala; Sinhalese, Slovak, Slovenian, Somali, Albanian, Serbian, Swati, Sundanese, Swedish, Swahili, Tamil, Thai, Tagalog, Tswana, Turkish, Ukrainian, Urdu, Uzbek, Vietnamese, Wolof, Xhosa, Yiddish, Yoruba, Chinese, Zulu.
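
As an illustration of how M2M100 can be used for translation, the sketch below goes through the Hugging Face `transformers` library. It assumes `transformers`, `torch` and `sentencepiece` are installed and that the `facebook/m2m100_1.2B` checkpoint is the one meant above; `LANG_CODES`, `lang_code` and `translate` are hypothetical helper names, not part of this repo. The tokenizer expects short language codes rather than the full names listed above, so only a small subset is mapped here.

```python
# Subset of the short language codes the M2M100 tokenizer expects.
LANG_CODES = {
    "English": "en",
    "French": "fr",
    "German": "de",
    "Spanish": "es",
    "Russian": "ru",
    "Chinese": "zh",
}


def lang_code(name):
    """Map a language name such as 'english' to its short code."""
    key = name.strip().title()
    if key not in LANG_CODES:
        raise ValueError(f"No code registered for language: {name!r}")
    return LANG_CODES[key]


def translate(text, src, tgt, model_name="facebook/m2m100_1.2B"):
    """Translate `text` from language `src` to language `tgt`."""
    # Imported lazily so the helpers above work without the library.
    from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

    tokenizer = M2M100Tokenizer.from_pretrained(model_name)
    model = M2M100ForConditionalGeneration.from_pretrained(model_name)

    # Tell the tokenizer which language the input is written in.
    tokenizer.src_lang = lang_code(src)
    encoded = tokenizer(text, return_tensors="pt")

    # Forcing the first generated token to the target-language id
    # selects the output language.
    generated = model.generate(
        **encoded,
        forced_bos_token_id=tokenizer.get_lang_id(lang_code(tgt)),
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
```

Calling `translate("Life is like a box of chocolates.", "English", "French")` downloads the checkpoint on first use, which takes several gigabytes of disk space. For repeated calls it would be better to load the tokenizer and model once and reuse them.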
Google: mT5 (1.2B parameters) is a multilingual text-to-text model pretrained on the mC4 corpus, covering 101 languages.
All available languages: Afrikaans, Albanian, Amharic, Arabic, Armenian, Azerbaijani, Basque, Belarusian, Bengali, Bulgarian, Burmese, Catalan, Cebuano, Chichewa, Chinese, Corsican, Czech, Danish, Dutch, English, Esperanto, Estonian, Filipino, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Haitian Creole, Hausa, Hawaiian, Hebrew, Hindi, Hmong, Hungarian, Icelandic, Igbo, Indonesian, Irish, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Kurdish, Kyrgyz, Lao, Latin, Latvian, Lithuanian, Luxembourgish, Macedonian, Malagasy, Malay, Malayalam, Maltese, Maori, Marathi, Mongolian, Nepali, Norwegian, Pashto, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Samoan, Scottish Gaelic, Serbian, Shona, Sindhi, Sinhala, Slovak, Slovenian, Somali, Sotho, Spanish, Sundanese, Swahili, Swedish, Tajik, Tamil, Telugu, Thai, Turkish, Ukrainian, Urdu, Uzbek, Vietnamese, Welsh, West Frisian, Xhosa, Yiddish, Yoruba, Zulu.
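
Because mT5 is pretrained on mC4 without any supervised translation objective, it has to be fine-tuned before it can translate. One common way to prepare parallel sentences for a text-to-text model is to prepend a task prefix to the source sentence. The sketch below shows that formatting step only; the prefix wording and the `build_example` helper are assumptions for illustration, not necessarily what this repo's fine-tuning scripts use.

```python
def build_example(source_text, target_text, src_lang, tgt_lang):
    """Format one parallel sentence pair as a text-to-text training example."""
    # The prefix tells the model which translation direction is requested.
    prefix = f"translate {src_lang} to {tgt_lang}: "
    return {
        "input_text": prefix + source_text.strip(),
        "target_text": target_text.strip(),
    }


def build_dataset(pairs, src_lang, tgt_lang):
    """Apply build_example to a list of (source, target) tuples."""
    return [build_example(s, t, src_lang, tgt_lang) for s, t in pairs]
```

The `input_text` and `target_text` strings would then be tokenized and fed to the model as encoder input and decoder labels respectively.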