ALBERT-Persian:
A Lite BERT for Self-supervised Learning of Language Representations for the Persian Language
You can call it little_berty.
ALBERT-Persian is the first attempt at ALBERT for the Persian language. The model was trained based on Google's ALBERT BASE Version 2.0 on a corpus covering various writing styles and subjects (e.g., scientific, novels, news), with more than 3.9M documents, 73M sentences, and 1.3B words, following the same procedure used for ParsBERT.
Table of Contents:
- Goals
- Introduction
- Results
- How to use
- Models
- NLP Tasks Tutorial 🤗
- Participants
- Cite
- Questions?
- Releases
The training objective results after 140K steps are as follows:
```
***** Eval results *****
global_step = 140000
loss = 2.0080082
masked_lm_accuracy = 0.6141017
masked_lm_loss = 1.9963315
sentence_order_accuracy = 0.985
sentence_order_loss = 0.06908702
```
ALBERT-Persian was trained on a massive amount of public corpora (Persian Wikidumps, MirasText) and six other manually crawled text sources from various types of websites:

- BigBang Page (scientific)
- Chetor (lifestyle)
- Eligasht (itinerary)
- Digikala (digital magazine)
- Ted Talks (general conversational)
- Books (novels, storybooks, and short stories from the old to the contemporary era)
The following tables summarize the F1 scores obtained by ALBERT-Persian as compared to other models and architectures.

Sentiment analysis (F1 score):

Dataset | ALBERT-fa-base-v2 | ParsBERT-v1 | mBERT | DeepSentiPers |
---|---|---|---|---|
Digikala User Comments | 81.12 | 81.74 | 80.74 | - |
SnappFood User Comments | 85.79 | 88.12 | 87.87 | - |
SentiPers (Multi Class) | 66.12 | 71.11 | - | 69.33 |
SentiPers (Binary Class) | 91.09 | 92.13 | - | 91.98 |

Text classification (F1 score):

Dataset | ALBERT-fa-base-v2 | ParsBERT-v1 | mBERT |
---|---|---|---|
Digikala Magazine | 92.33 | 93.59 | 90.72 |
Persian News | 97.01 | 97.19 | 95.79 |

Named entity recognition (F1 score):

Dataset | ALBERT-fa-base-v2 | ParsBERT-v1 | mBERT | MorphoBERT | Beheshti-NER | LSTM-CRF | Rule-Based CRF | BiLSTM-CRF |
---|---|---|---|---|---|---|---|---|
PEYMA | 88.99 | 93.10 | 86.64 | - | 90.59 | - | 84.00 | - |
ARMAN | 97.43 | 98.79 | 95.89 | 89.9 | 84.03 | 86.55 | - | 77.45 |
If you have tested ALBERT-Persian on a public dataset and want to add your results to the tables above, open a pull request or contact us. Please also make sure your code is available online so we can add it as a reference.
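As a rough illustration of such an evaluation, the sketch below loads one of the released sentiment checkpoints and computes a macro-F1 score. It is not the official evaluation script; the `texts` and `gold_labels` placeholders stand in for your own test split, and the gold labels must match the label names used by the chosen checkpoint.

```python
# A minimal evaluation sketch (not the official script), assuming you supply
# your own test sentences and gold labels for one of the released checkpoints.
from sklearn.metrics import f1_score
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="m3hrdadfi/albert-fa-base-v2-sentiment-snappfood",
)

texts = ["..."]        # placeholder: your Persian test sentences
gold_labels = ["..."]  # placeholder: gold labels matching the model's label names

predictions = [output["label"] for output in classifier(texts)]
print(f1_score(gold_labels, predictions, average="macro"))
```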
- To use any ALBERT model, you first have to install sentencepiece.
- Run this in your notebook:

```bash
!pip install -q sentencepiece
```
```python
from transformers import AutoConfig, AutoTokenizer
from transformers import AutoModelForMaskedLM    # for PyTorch
# from transformers import TFAutoModelForMaskedLM  # for TensorFlow (requires TensorFlow to be installed)

config = AutoConfig.from_pretrained("HooshvareLab/albert-fa-zwnj-base-v2")
tokenizer = AutoTokenizer.from_pretrained("HooshvareLab/albert-fa-zwnj-base-v2")

# PyTorch model
model = AutoModelForMaskedLM.from_pretrained("HooshvareLab/albert-fa-zwnj-base-v2")

# TensorFlow model
# model = TFAutoModelForMaskedLM.from_pretrained("HooshvareLab/albert-fa-zwnj-base-v2")

text = "ما در هوشواره معتقدیم با انتقال صحیح دانش و آگاهی، همه افراد میتوانند از ابزارهای هوشمند استفاده کنند. شعار ما هوش مصنوعی برای همه است."
tokenizer.tokenize(text)
```
```
>>> Tokenized:
▁ما
▁در
▁هوش
واره
▁معتقدیم
▁با
▁انتقال
▁صحیح
▁دانش
▁و
▁
ا
گاهی
،
▁همه
▁افراد
▁می
[ZWNJ]
توانند
▁از
▁ابزارهای
▁هوشمند
▁استفاده
▁کنند
.
▁شعار
▁ما
▁هوش
▁مصنوعی
▁برای
▁همه
▁است
.
```
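For a quick sanity check of the pretrained (not fine-tuned) checkpoint, a fill-mask sketch such as the following can be used. The masked sentence is just the example above with one word replaced by the tokenizer's own mask token; it is illustrative, not part of any benchmark.

```python
# A minimal fill-mask sketch with the pretrained checkpoint.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="HooshvareLab/albert-fa-zwnj-base-v2")

# Build the input with the tokenizer's own mask token.
masked_text = (
    "ما در هوشواره معتقدیم با انتقال صحیح دانش و "
    f"{fill_mask.tokenizer.mask_token}"
    "، همه افراد می‌توانند از ابزارهای هوشمند استفاده کنند."
)

for prediction in fill_mask(masked_text):
    print(prediction["token_str"], round(prediction["score"], 4))
```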
Fine-tuned ALBERT-Persian models:

- m3hrdadfi/albert-fa-base-v2-sentiment-digikala
- m3hrdadfi/albert-fa-base-v2-sentiment-snappfood
- m3hrdadfi/albert-fa-base-v2-sentiment-deepsentipers-binary
- m3hrdadfi/albert-fa-base-v2-sentiment-deepsentipers-multi
- m3hrdadfi/albert-fa-base-v2-sentiment-binary
- m3hrdadfi/albert-fa-base-v2-sentiment-multi
- m3hrdadfi/albert-fa-base-v2-ner
- m3hrdadfi/albert-fa-base-v2-ner-arman
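The following is a minimal sketch of running one of the fine-tuned checkpoints from the list above through the standard Hugging Face token-classification pipeline; the input sentence is illustrative, and the aggregation option depends on your `transformers` version.

```python
# A minimal sketch of running one of the released NER checkpoints.
from transformers import pipeline

ner = pipeline(
    "ner",
    model="m3hrdadfi/albert-fa-base-v2-ner-arman",
    aggregation_strategy="simple",  # needs a reasonably recent transformers release
)

# Illustrative sentence: "Hooshvare operates in Tehran."
for entity in ner("هوشواره در تهران فعالیت می‌کند."):
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
```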
NLP tasks tutorial notebooks:

Notebook | Description | Link |
---|---|---|
Text Classification | ... | soon |
Sentiment Analysis | ... | soon |
Named Entity Recognition | ... | soon |
Text Generation | ... | soon |
See also the list of contributors who participated in this project.
We have not published a paper about this work yet. Please cite it in your publications as follows:
```bibtex
@misc{ALBERTPersian,
  author = {Hooshvare Team},
  title = {ALBERT-Persian: A Lite BERT for Self-supervised Learning of Language Representations for the Persian Language},
  year = {2021},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/m3hrdadfi/albert-persian}},
}
```
Post a GitHub issue on the ALBERT-Persian repo.
- This version is able to handle the zero-width non-joiner (ZWNJ) character used in Persian writing.
- This is the first version of ALBERT-Persian Base!