Large Multi-Language Models for News Translation

  • This repo contains examples of how to fine-tune Large Language Models (LLMs) and apply them to a real task: news translation.
  • It also provides a news parser, so you can parse any news web page (for example, CNN or BBC News) and test how a pre-trained LLM translates real news.
[Screenshot, 2023-12-18 14:48:37]
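To give a feel for the parsing step, here is a minimal sketch that pulls paragraph text out of a news page using only the Python standard library. It is not the parser shipped in this repo: the names `ParagraphExtractor`, `extract_paragraphs`, and `fetch_html` are hypothetical, and it assumes the article body lives in plain `<p>` tags, which will not hold for every site.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class ParagraphExtractor(HTMLParser):
    """Collects the text of every <p> element, ignoring <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None    # text pieces of the <p> being read, or None
        self._skip_depth = 0   # > 0 while inside <script> or <style>

    def _flush(self):
        # Store the current paragraph with whitespace collapsed.
        if self._buffer is not None:
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
        self._buffer = None

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "p":
            self._flush()      # an unclosed <p> ends where the next one starts
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._buffer is not None and self._skip_depth == 0:
            self._buffer.append(data)


def extract_paragraphs(html):
    """Return the list of paragraph strings found in an HTML document."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()
    return parser.paragraphs


def fetch_html(url, timeout=10):
    """Download a page; the User-Agent value is an arbitrary placeholder."""
    request = Request(url, headers={"User-Agent": "news-parser-sketch"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling `extract_paragraphs(fetch_html(url))` on an article URL would return a list of strings that can be passed to a translation model one paragraph at a time. Real news sites often need site-specific selectors, so treat this as a starting point only.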

1. Facebook: M2M100

M2M100 (1.2B parameters) from Facebook is a multilingual encoder-decoder (seq-to-seq) model intended primarily for translation tasks. It covers 100 languages.

All available languages: Afrikaans, Amharic, Arabic, Asturian, Azerbaijani, Bashkir, Belarusian, Bulgarian, Bengali, Breton, Bosnian, Catalan; Valencian, Cebuano, Czech, Welsh, Danish, German, Greek, English, Spanish, Estonian, Persian, Fulah, Finnish, French, Western Frisian, Irish, Gaelic; Scottish Gaelic, Galician, Gujarati, Hausa, Hebrew, Hindi, Croatian, Haitian; Haitian Creole, Hungarian, Armenian, Indonesian, Igbo, Iloko, Icelandic, Italian, Japanese, Javanese, Georgian, Kazakh, Central Khmer, Kannada, Korean, Luxembourgish; Letzeburgesch, Ganda, Lingala, Lao, Lithuanian, Latvian, Malagasy, Macedonian, Malayalam, Mongolian, Marathi, Malay, Burmese, Nepali, Dutch; Flemish, Norwegian, Northern Sotho, Occitan (post 1500), Oriya, Panjabi; Punjabi, Polish, Pushto; Pashto, Portuguese, Romanian; Moldavian; Moldovan, Russian, Sindhi, Sinhala; Sinhalese, Slovak, Slovenian, Somali, Albanian, Serbian, Swati, Sundanese, Swedish, Swahili, Tamil, Thai, Tagalog, Tswana, Turkish, Ukrainian, Urdu, Uzbek, Vietnamese, Wolof, Xhosa, Yiddish, Yoruba, Chinese, Zulu.
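As a rough illustration of how M2M100 can be called for translation, the sketch below uses the Hugging Face `transformers` classes for this model. It assumes `transformers`, `torch`, and `sentencepiece` are installed and that the `facebook/m2m100_1.2B` checkpoint is the one meant above. The `LANG_CODES` table, `lang_code`, and `translate` are hypothetical helpers, and the table lists only a few of the supported languages.

```python
# Hypothetical helper table: language name -> M2M100 language code (partial).
LANG_CODES = {
    "English": "en",
    "French": "fr",
    "German": "de",
    "Russian": "ru",
    "Spanish": "es",
    "Chinese": "zh",
}


def lang_code(name):
    """Look up the language code for a language name from the table above."""
    try:
        return LANG_CODES[name]
    except KeyError:
        raise ValueError(f"no code registered for language {name!r}") from None


def translate(text, src, tgt, model_name="facebook/m2m100_1.2B"):
    """Translate `text` from language `src` to language `tgt` (names, not codes)."""
    # Imported lazily so the helpers above work without transformers installed.
    from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

    tokenizer = M2M100Tokenizer.from_pretrained(model_name)
    model = M2M100ForConditionalGeneration.from_pretrained(model_name)

    # The tokenizer marks the input with the source language ...
    tokenizer.src_lang = lang_code(src)
    encoded = tokenizer(text, return_tensors="pt")

    # ... and the target language is forced as the first generated token.
    generated = model.generate(
        **encoded,
        forced_bos_token_id=tokenizer.get_lang_id(lang_code(tgt)),
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
```

For example, `translate("Stocks rallied on Monday.", "English", "French")` would download the checkpoint on first use and return the French translation. Loading the model on every call is wasteful; in practice the tokenizer and model would be created once and reused.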

2. Google: mT5

mT5 (1.2B parameters) from Google is a multilingual variant of T5 pretrained on the mC4 corpus, which covers 101 languages.

All available languages: Afrikaans, Albanian, Amharic, Arabic, Armenian, Azerbaijani, Basque, Belarusian, Bengali, Bulgarian, Burmese, Catalan, Cebuano, Chichewa, Chinese, Corsican, Czech, Danish, Dutch, English, Esperanto, Estonian, Filipino, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Haitian Creole, Hausa, Hawaiian, Hebrew, Hindi, Hmong, Hungarian, Icelandic, Igbo, Indonesian, Irish, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Kurdish, Kyrgyz, Lao, Latin, Latvian, Lithuanian, Luxembourgish, Macedonian, Malagasy, Malay, Malayalam, Maltese, Maori, Marathi, Mongolian, Nepali, Norwegian, Pashto, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Samoan, Scottish Gaelic, Serbian, Shona, Sindhi, Sinhala, Slovak, Slovenian, Somali, Sotho, Spanish, Sundanese, Swahili, Swedish, Tajik, Tamil, Telugu, Thai, Turkish, Ukrainian, Urdu, Uzbek, Vietnamese, Welsh, West Frisian, Xhosa, Yiddish, Yoruba, Zulu.
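Because mT5 is released as a pretrained checkpoint, it generally has to be fine-tuned on parallel text before it produces useful translations. The sketch below shows what a single fine-tuning step might look like with Hugging Face `transformers`. It is not the training code from this repo: `build_example` and `training_loss` are hypothetical names, the `translate X to Y:` prefix is a convention borrowed from T5 rather than something mT5 requires, and `google/mt5-large` is assumed to be the 1.2B checkpoint referred to above.

```python
def build_example(source_text, target_text, src_lang, tgt_lang):
    """Format one sentence pair as a text-to-text training example."""
    prompt = f"translate {src_lang} to {tgt_lang}: {source_text}"
    return {"input": prompt, "target": target_text}


def training_loss(examples, model_name="google/mt5-large", max_length=256):
    """Compute the seq-to-seq loss for a batch of examples (one forward pass)."""
    # Imported lazily so build_example works without transformers installed.
    from transformers import AutoTokenizer, MT5ForConditionalGeneration

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = MT5ForConditionalGeneration.from_pretrained(model_name)

    # Tokenize inputs and targets together; targets end up under "labels".
    batch = tokenizer(
        [e["input"] for e in examples],
        text_target=[e["target"] for e in examples],
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors="pt",
    )

    # Padding positions in the labels are set to -100 so the loss ignores them.
    labels = batch["labels"]
    labels[labels == tokenizer.pad_token_id] = -100

    return model(**batch).loss
```

In a real training loop the returned loss would be backpropagated (`loss.backward()`) and followed by an optimizer step, with the model and tokenizer loaded once outside the loop.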

[Screenshot, 2023-12-19 11:54:35]