Bangla NLP toolkit. This toolkit is built entirely from our own datasets and pretrained models. This is version 2.0 (a summarizer and a paper will come in the next version). You can use it now.
This repository was made public on 29 January 2020.
- Bangla preprocessing system
- Bangla text punctuation removal
- Bangla stop-word removal
- Bangla dust removal (everything except Bangla characters is removed)
- Bangla word normalization (a less error-prone form)
- Bangla word to English equivalent word conversion
- Bangla word sort according to the English alphabet
- Bangla word sort according to the Bangla alphabet
- Bangla basic word tokenizer
- Bangla normalized word tokenizer
- Bangla basic sentence tokenizer
- Bangla normalized sentence tokenizer
- Bangla word existence check (is the word in the dictionary?)
- Bangla word stemmer with high accuracy
- Bangla word2vec embedding: 700,000+ vocabulary, 100 dimensions, highly accurate, pretrained at Pipilika
- Bangla sentence-to-sentence embedding/similarity built from word2vec
- Bangla POS tagger
- Bangla database-backed NER
Requirements:
- numpy
- scipy
- Bangla word count (615,621++)
- Bangla root word count (83,665)
- Bangla stop words (356++)
- Bangla suffixes (100++)
- Bangla root word POS-tag count (133,973++)
- Bangla word2vec embedding (725,061)
- Bangla NER tags (408,837++)
The '++' sign means the data will grow over time.
You must download the Word2Vec model from Google Drive, or the toolkit will raise an error.
Remove punctuation from Bangla text.
from bn_nlp.preprocessing import ban_processing
bp=ban_processing()
text="āĻ¸ā§āĻā§āĻ° âāĻāĻžāĻ°āĻŖā§â āĻŦā§āĻšāĻ¸ā§āĻĒāĻ¤āĻŋāĻŦāĻžāĻ° āĻĻā§āĻāĻž āĻā§āĻ˛ āĻĒā§āĻ°ā§ āĻāĻ˛āĻžāĻāĻž âāĻšāĻžāĻŦā§āĻĄā§āĻŦā§â āĻāĻžāĻā§āĻā§ āĻāĻĨā§ āĻĒāĻžāĻ¨āĻŋāĻ¤ā§āĨ¤"
print(bp.punctuation_remove(text))
output
āĻ¸ā§āĻā§āĻ° āĻāĻžāĻ°āĻŖā§ āĻŦā§āĻšāĻ¸ā§āĻĒāĻ¤āĻŋāĻŦāĻžāĻ° āĻĻā§āĻāĻž āĻā§āĻ˛ āĻĒā§āĻ°ā§ āĻāĻ˛āĻžāĻāĻž āĻšāĻžāĻŦā§āĻĄā§āĻŦā§ āĻāĻžāĻā§āĻā§ āĻāĻĨā§ āĻĒāĻžāĻ¨āĻŋāĻ¤ā§
Removes common stop words from a sentence. You can find the word list in 'stop_word.txt'.
from bn_nlp.preprocessing import ban_processing
bp=ban_processing()
text="āĻ¸ā§āĻā§āĻ° âāĻāĻžāĻ°āĻŖā§â āĻŦā§āĻšāĻ¸ā§āĻĒāĻ¤āĻŋāĻŦāĻžāĻ° āĻĻā§āĻāĻž āĻā§āĻ˛ āĻĒā§āĻ°ā§ āĻāĻ˛āĻžāĻāĻž âāĻšāĻžāĻŦā§āĻĄā§āĻŦā§â āĻāĻžāĻā§āĻā§ āĻāĻĨā§ āĻĒāĻžāĻ¨āĻŋāĻ¤ā§āĨ¤"
print(bp.stop_word_remove(text))
output
āĻ¸ā§āĻā§āĻ° âāĻāĻžāĻ°āĻŖā§â āĻŦā§āĻšāĻ¸ā§āĻĒāĻ¤āĻŋāĻŦāĻžāĻ° āĻĒā§āĻ°ā§ āĻāĻ˛āĻžāĻāĻž âāĻšāĻžāĻŦā§āĻĄā§āĻŦā§â āĻāĻžāĻā§āĻā§ āĻāĻĨā§ āĻĒāĻžāĻ¨āĻŋāĻ¤ā§āĨ¤
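Conceptually, stop-word removal is just a token filter against the list in 'stop_word.txt'. A minimal sketch of that idea, with a hypothetical in-memory stop list and Latin placeholder tokens rather than the toolkit's real data:

```python
def stop_word_remove(text, stop_words):
    """Drop every token that appears in the stop-word set."""
    return " ".join(tok for tok in text.split() if tok not in stop_words)

# Hypothetical stop list; the toolkit loads its own from stop_word.txt.
stops = {"o", "ebong", "kintu"}
print(stop_word_remove("ami bhat ebong dal khai", stops))  # -> "ami bhat dal khai"
```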
Add a word to the stop-word list.
from bn_nlp.preprocessing import ban_processing
bp=ban_processing()
text="āĻ¸ā§āĻā§āĻ° âāĻāĻžāĻ°āĻŖā§â āĻŦā§āĻšāĻ¸ā§āĻĒāĻ¤āĻŋāĻŦāĻžāĻ° āĻĻā§āĻāĻž āĻā§āĻ˛ āĻĒā§āĻ°ā§ āĻāĻ˛āĻžāĻāĻž âāĻšāĻžāĻŦā§āĻĄā§āĻŦā§â āĻāĻžāĻā§āĻā§ āĻāĻĨā§ āĻĒāĻžāĻ¨āĻŋāĻ¤ā§āĨ¤"
bp.add_stopword('āĻ¸ā§āĻā§āĻ°')
print(bp.stop_word_remove(text))
output
âāĻāĻžāĻ°āĻŖā§â āĻŦā§āĻšāĻ¸ā§āĻĒāĻ¤āĻŋāĻŦāĻžāĻ° āĻĒā§āĻ°ā§ āĻāĻ˛āĻžāĻāĻž âāĻšāĻžāĻŦā§āĻĄā§āĻŦā§â āĻāĻžāĻā§āĻā§ āĻāĻĨā§ āĻĒāĻžāĻ¨āĻŋāĻ¤ā§āĨ¤
Everything except Bangla characters will be removed from the word.
from bn_nlp.preprocessing import ban_processing
bp=ban_processing()
text="āĻ¸ā§āĻā§āĻ°12A'--,.:BāĻāĻžāĻ°āĻŖā§"
print(bp.dust_removal(text))
output
āĻ¸ā§āĻā§āĻ°āĻāĻžāĻ°āĻŖā§
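Dust removal boils down to deleting every character outside the Bangla Unicode block (U+0980–U+09FF). A rough sketch of that idea, not the toolkit's actual implementation:

```python
import re

def dust_removal(word):
    """Keep only characters from the Bangla Unicode block (U+0980-U+09FF)."""
    return re.sub(r"[^\u0980-\u09FF]", "", word)

# Bangla characters survive; digits, Latin letters, and punctuation are dropped.
print(dust_removal("\u09b8\u09c112A'--,.:B\u0995"))
```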
Similar vowels are mapped to the same character for better accuracy.
from bn_nlp.preprocessing import ban_processing
bp=ban_processing()
text="āĻāĻ¸āĻšāĻ¨ā§ā§ āĻāĻžāĻ°ā§ āĻŦāĻ°ā§āĻˇāĻŖā§"
print(bp.word_normalize(text))
output
āĻāĻ¸āĻšāĻ¨āĻŋā§ āĻāĻžāĻ°āĻŋ āĻŦāĻ°ā§āĻˇāĻŖā§
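Normalization maps interchangeable vowel signs to one canonical form, e.g. the long ii-kar (U+09C0) to the short i-kar (U+09BF). A minimal sketch of the idea using `str.translate`, with an illustrative two-entry mapping; the toolkit's real table is larger:

```python
# Map long vowel signs to their short counterparts (illustrative subset only).
NORMALIZE_MAP = str.maketrans({
    "\u09c0": "\u09bf",  # ii-kar -> i-kar
    "\u09c2": "\u09c1",  # uu-kar -> u-kar
})

def word_normalize(text):
    """Replace each long vowel sign with its canonical short form."""
    return text.translate(NORMALIZE_MAP)

print(word_normalize("\u09ac\u09be\u09dc\u09c0"))  # the trailing ii-kar becomes i-kar
```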
Convert a Bangla word to its English (romanized) equivalent.
from bn_nlp.preprocessing import ban_processing
bp=ban_processing()
text="āĻ°āĻžāĻāĻ§āĻžāĻ¨ā§"
print(bp.bn2enCon(text))
output
rajadhani
Sort Bangla words according to the English alphabet.
from bn_nlp.preprocessing import ban_processing
bp=ban_processing()
vec=['ā§§', 'āĻāĻŖā§āĻāĻžāĻ°', 'āĻāĻžāĻ°ā§' ,'āĻŦāĻ°ā§āĻˇāĻŖā§', 'āĻ¸ā§āĻŽāĻŦāĻžāĻ°', 'āĻ°āĻžāĻāĻ§āĻžāĻ¨ā§āĻ°', 'āĻŦāĻŋāĻāĻŋāĻ¨ā§āĻ¨', 'āĻāĻ˛āĻžāĻāĻžā§', 'āĻāĻ˛āĻžāĻŦāĻĻā§āĻ§āĻ¤āĻž', 'āĻĻā§āĻāĻž', 'āĻĻā§ā§']
print(bp.bn_word_sort(vec))
output
['ā§§', 'āĻāĻžāĻ°ā§', 'āĻŦāĻŋāĻāĻŋāĻ¨ā§āĻ¨', 'āĻŦāĻ°ā§āĻˇāĻŖā§', 'āĻĻā§āĻāĻž', 'āĻĻā§ā§', 'āĻāĻ˛āĻžāĻāĻžā§', 'āĻāĻŖā§āĻāĻžāĻ°', 'āĻāĻ˛āĻžāĻŦāĻĻā§āĻ§āĻ¤āĻž', 'āĻ°āĻžāĻāĻ§āĻžāĻ¨ā§āĻ°', 'āĻ¸ā§āĻŽāĻŦāĻžāĻ°']
Sort Bangla words according to the Bangla alphabet.
from bn_nlp.preprocessing import ban_processing
bp=ban_processing()
vec=['ā§§', 'āĻāĻŖā§āĻāĻžāĻ°', 'āĻāĻžāĻ°ā§' ,'āĻŦāĻ°ā§āĻˇāĻŖā§', 'āĻ¸ā§āĻŽāĻŦāĻžāĻ°', 'āĻ°āĻžāĻāĻ§āĻžāĻ¨ā§āĻ°', 'āĻŦāĻŋāĻāĻŋāĻ¨ā§āĻ¨', 'āĻāĻ˛āĻžāĻāĻžā§', 'āĻāĻ˛āĻžāĻŦāĻĻā§āĻ§āĻ¤āĻž', 'āĻĻā§āĻāĻž', 'āĻĻā§ā§']
print(bp.bn_word_sort_bn_sys(vec))
output
['āĻāĻ˛āĻžāĻāĻžā§', 'āĻāĻŖā§āĻāĻžāĻ°', 'āĻāĻ˛āĻžāĻŦāĻĻā§āĻ§āĻ¤āĻž', 'āĻĻā§āĻāĻž', 'āĻĻā§ā§', 'āĻŦāĻŋāĻāĻŋāĻ¨ā§āĻ¨', 'āĻŦāĻ°ā§āĻˇāĻŖā§', 'āĻāĻžāĻ°ā§', 'āĻ°āĻžāĻāĻ§āĻžāĻ¨ā§āĻ°', 'āĻ¸ā§āĻŽāĻŦāĻžāĻ°', 'ā§§']
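Because the Bangla Unicode block is laid out roughly in dictionary order, a plain codepoint sort already approximates Bangla alphabetical order; the library presumably refines this with its own collation rules. A sketch of the approximation:

```python
# Three Bangla words: dui (U+09A6...), kak (U+0995...), am (U+0986...).
words = ["\u09a6\u09c1\u0987", "\u0995\u09be\u0995", "\u0986\u09ae"]

# Default string comparison sorts by codepoint, which here yields
# the Bangla alphabetical order: U+0986 < U+0995 < U+09A6.
print(sorted(words))
```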
Basic word tokenizer: splits a sentence into words.
from bn_nlp.tokenizer import wordTokenizer
wordtoken=wordTokenizer()
text="ā§§ āĻāĻŖā§āĻāĻžāĻ° āĻāĻžāĻ°ā§ āĻŦāĻ°ā§āĻˇāĻŖā§ āĻ¸ā§āĻŽāĻŦāĻžāĻ° āĻ°āĻžāĻāĻ§āĻžāĻ¨ā§āĻ° āĻŦāĻŋāĻāĻŋāĻ¨ā§āĻ¨ āĻāĻ˛āĻžāĻāĻžā§ āĻāĻ˛āĻžāĻŦāĻĻā§āĻ§āĻ¤āĻž āĻĻā§āĻāĻž āĻĻā§ā§"
print(wordtoken.basic_tokenizer(text))
output
['ā§§', 'āĻāĻŖā§āĻāĻžāĻ°', 'āĻāĻžāĻ°ā§', 'āĻŦāĻ°ā§āĻˇāĻŖā§', 'āĻ¸ā§āĻŽāĻŦāĻžāĻ°', 'āĻ°āĻžāĻāĻ§āĻžāĻ¨ā§āĻ°', 'āĻŦāĻŋāĻāĻŋāĻ¨ā§āĻ¨', 'āĻāĻ˛āĻžāĻāĻžā§', 'āĻāĻ˛āĻžāĻŦāĻĻā§āĻ§āĻ¤āĻž', 'āĻĻā§āĻāĻž', 'āĻĻā§ā§']
Normalized word tokenizer: tokenizes and normalizes each word.
from bn_nlp.tokenizer import wordTokenizer
wordtoken=wordTokenizer()
text="ā§§ āĻāĻŖā§āĻāĻžāĻ° āĻāĻžāĻ°ā§ āĻŦāĻ°ā§āĻˇāĻŖā§ āĻ¸ā§āĻŽāĻŦāĻžāĻ° āĻ°āĻžāĻāĻ§āĻžāĻ¨ā§āĻ° āĻŦāĻŋāĻāĻŋāĻ¨ā§āĻ¨ āĻāĻ˛āĻžāĻāĻžā§ āĻāĻ˛āĻžāĻŦāĻĻā§āĻ§āĻ¤āĻž āĻĻā§āĻāĻž āĻĻā§ā§"
print(wordtoken.normalize_tokenizer(text))
output
['ā§§', 'āĻāĻŖā§āĻāĻžāĻ°', 'āĻāĻžāĻ°āĻŋ', 'āĻŦāĻ°ā§āĻˇāĻŖā§', 'āĻ¸ā§āĻŽāĻŦāĻžāĻ°', 'āĻ°āĻžāĻāĻ§āĻžāĻ¨āĻŋāĻ°', 'āĻŦāĻŋāĻāĻŋāĻ¨ā§āĻ¨', 'āĻāĻ˛āĻžāĻāĻžā§', 'āĻāĻ˛āĻžāĻŦāĻĻā§āĻ§āĻ¤āĻž', 'āĻĻā§āĻāĻž', 'āĻĻā§ā§']
Basic sentence tokenizer: splits text into sentences.
from bn_nlp.tokenizer import sentenceTokenizer
senttoken=sentenceTokenizer()
text="āĻā§āĻāĻžāĻ¨ā§āĻ¤āĻŋāĻ¤ā§ āĻĒā§ā§āĻ¨ āĻ¨āĻāĻ°āĻŦāĻžāĻ¸ā§āĨ¤ āĻŦā§āĻ¯āĻžāĻšāĻ¤ āĻšā§ āĻ¯āĻžāĻ¨ āĻāĻ˛āĻžāĻāĻ˛āĨ¤ āĻāĻ¤āĻāĻžāĻ˛ āĻ¸āĻāĻžāĻ˛āĻŦā§āĻ˛āĻž āĻāĻŋāĻ˛ āĻāĻ¸āĻšāĻ¨ā§ā§ āĻāĻ°āĻŽāĨ¤"
print(senttoken.basic_tokenizer(text))
output
['āĻā§āĻāĻžāĻ¨ā§āĻ¤āĻŋāĻ¤ā§ āĻĒā§ā§āĻ¨ āĻ¨āĻāĻ°āĻŦāĻžāĻ¸ā§', ' āĻŦā§āĻ¯āĻžāĻšāĻ¤ āĻšā§ āĻ¯āĻžāĻ¨ āĻāĻ˛āĻžāĻāĻ˛', ' āĻāĻ¤āĻāĻžāĻ˛ āĻ¸āĻāĻžāĻ˛āĻŦā§āĻ˛āĻž āĻāĻŋāĻ˛ āĻāĻ¸āĻšāĻ¨ā§ā§ āĻāĻ°āĻŽ']
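A basic Bangla sentence tokenizer can simply split on the dari (।, U+0964), much as an English tokenizer splits on the full stop. A sketch of that idea, not the toolkit's code:

```python
def basic_sentence_tokenizer(text):
    """Split on the dari (U+0964) and drop empty fragments."""
    return [s.strip() for s in text.split("\u0964") if s.strip()]

# Two short Bangla sentences separated by daris.
print(basic_sentence_tokenizer(
    "\u0986\u09ae\u09bf \u09ad\u09be\u09b2\u09cb\u0964 "
    "\u09a4\u09c1\u09ae\u09bf \u0995\u09c7\u09ae\u09a8\u0964"))
```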
Normalized sentence tokenizer: no dust, no punctuation, normalized words.
from bn_nlp.tokenizer import sentenceTokenizer
senttoken=sentenceTokenizer()
text="āĻā§āĻāĻžāĻ¨ā§āĻ¤āĻŋāĻ¤ā§ āĻĒā§ā§āĻ¨ āĻ¨āĻāĻ°āĻŦāĻžāĻ¸ā§āĨ¤ āĻŦā§āĻ¯āĻžāĻšāĻ¤ āĻšā§ āĻ¯āĻžāĻ¨ āĻāĻ˛āĻžāĻāĻ˛āĨ¤ āĻāĻ¤āĻāĻžāĻ˛ āĻ¸āĻāĻžāĻ˛āĻŦā§āĻ˛āĻž āĻāĻŋāĻ˛ āĻāĻ¸āĻšāĻ¨ā§ā§ āĻāĻ°āĻŽāĨ¤"
print(senttoken.normalize_tokenizer(text))
output
['āĻā§āĻāĻžāĻ¨ā§āĻ¤āĻŋāĻ¤ā§ āĻĒā§ā§āĻ¨ āĻ¨āĻāĻ°āĻŦāĻžāĻ¸āĻŋ', 'āĻŦā§āĻ¯āĻžāĻšāĻ¤ āĻšā§ āĻ¯āĻžāĻ¨ āĻāĻ˛āĻžāĻāĻ˛', 'āĻāĻ¤āĻāĻžāĻ˛ āĻ¸āĻāĻžāĻ˛āĻŦā§āĻ˛āĻž āĻāĻŋāĻ˛ āĻāĻ¸āĻšāĻ¨āĻŋā§ āĻāĻ°āĻŽ']
Does this word exist in the Bangla dictionary?
from bn_nlp.Stemmer import stemmerOP
stemmer=stemmerOP()
text="āĻā§āĻāĻžāĻ¨ā§āĻ¤āĻŋāĻ¤ā§"
print(stemmer.search(text))
output
True
Finding the root word.
from bn_nlp.Stemmer import stemmerOP
stemmer=stemmerOP()
text="āĻā§āĻāĻžāĻ¨ā§āĻ¤āĻŋāĻ¤ā§"
print(stemmer.stem(text))
text="āĻā§āĻāĻžāĻ¨ā§āĻ¤āĻŋāĻ¤ā§ āĻĒā§ā§āĻ¨ āĻ¨āĻāĻ°āĻŦāĻžāĻ¸ā§"
print(stemmer.stemSent(text))
output
āĻā§āĻāĻžāĻ¨ā§āĻ¤āĻŋ
āĻā§āĻāĻžāĻ¨ā§āĻ¤āĻŋ āĻĒā§ āĻ¨āĻāĻ°āĻŦāĻžāĻ¸āĻŋ
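Suffix stripping is the usual core of a Bangla stemmer: try the longest known suffix first and strip it when the remainder is a known root. A simplified sketch with a hypothetical suffix list and root dictionary (the real toolkit ships 100++ suffixes and an 83,665-word root list):

```python
def stem(word, suffixes, dictionary):
    """Strip the longest matching suffix whose remainder is a known root word."""
    for suf in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suf) and word[: -len(suf)] in dictionary:
            return word[: -len(suf)]
    return word  # no known suffix matched; assume the word is already a root

# Hypothetical data: the suffixes -te and -r, and the root "Dhaka".
suffixes = {"\u09a4\u09c7", "\u09b0"}
dictionary = {"\u09a2\u09be\u0995\u09be"}
print(stem("\u09a2\u09be\u0995\u09be\u09a4\u09c7", suffixes, dictionary))  # -> the root "Dhaka"
```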
Pretrained word2vec embedding download link:
After downloading, place the file in the bn_nlp directory.
from bn_nlp.word2vec_embedding import word2vec
w2v=word2vec()
text="āĻŦāĻ°ā§āĻˇāĻŖā§"
print(w2v.closure_word(text,5))
text2="āĻŦā§āĻˇā§āĻāĻŋ"
print(w2v.dist(text,text2))
# you can get embedding vector by calling 'w2v.embedding_vec'
output
['āĻŦāĻ°ā§āĻˇāĻŖā§', 'āĻŦā§āĻˇā§āĻāĻŋāĻĒāĻžāĻ¤ā§', 'āĻŦā§āĻˇā§āĻāĻŋāĻ¤ā§', 'āĻāĻžāĻ˛āĻŦā§āĻļāĻžāĻā§', 'āĻāĻ˛ā§āĻā§āĻā§āĻŦāĻžāĻ¸ā§']
26.64097023010254
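The dist value is a distance between the two words' 100-dimensional embedding vectors, where smaller means more similar. Assuming a Euclidean metric (the README does not state which metric the toolkit uses), the computation reduces to:

```python
import numpy as np

def dist(vec_a, vec_b):
    """Euclidean distance between two embedding vectors."""
    return float(np.linalg.norm(np.asarray(vec_a) - np.asarray(vec_b)))

# Toy 3-d vectors standing in for the real 100-d embeddings.
print(dist([1.0, 2.0, 2.0], [1.0, 0.0, 2.0]))  # -> 2.0
```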
A smaller value means closer similarity. Built on top of word2vec: you can derive an embedding vector from the similarity, but dist is implemented directly because the distance is usually what is needed.
from bn_nlp.sent2sent_embedding import sent2sent
s2s=sent2sent()
text1="āĻāĻŽāĻŋ āĻāĻžāĻ¤ āĻāĻžāĻ"
text2="āĻāĻŽāĻŋ āĻĒāĻžāĻ¸ā§āĻ¤āĻž āĻā§āĻ¤ā§ āĻāĻžāĻ"
print(s2s.dist(text1,text2))
# the 'sent2sent_dist' function takes a vector of sentences and returns a 2D array of every sentence-to-sentence distance
output
37.503074645996094
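One common way to get a sentence-level distance from word vectors is to average each sentence's word embeddings and measure the distance between the two means. A sketch under that assumption; the toolkit's exact composition method may differ:

```python
import numpy as np

def sent_vector(tokens, embeddings):
    """Mean of the word vectors for all tokens found in the vocabulary."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0)

def sent_dist(sent_a, sent_b, embeddings):
    """Euclidean distance between the two sentences' mean vectors."""
    return float(np.linalg.norm(sent_vector(sent_a.split(), embeddings)
                                - sent_vector(sent_b.split(), embeddings)))

# Toy 2-d embeddings with Latin placeholder tokens, in place of the 100-d model.
emb = {"ami": np.array([1.0, 0.0]),
       "bhat": np.array([0.0, 1.0]),
       "ruti": np.array([0.0, 3.0])}
print(sent_dist("ami bhat", "ami ruti", emb))  # -> 1.0
```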
Tag each word of a Bangla sentence with its part of speech.
from bn_nlp.posTag import postag
tagger=postag()
text="āĻ¸ā§āĻā§āĻ° âāĻāĻžāĻ°āĻŖā§â āĻŦā§āĻšāĻ¸ā§āĻĒāĻ¤āĻŋāĻŦāĻžāĻ° āĻĻā§āĻāĻž āĻā§āĻ˛ āĻĒā§āĻ°ā§ āĻāĻ˛āĻžāĻāĻž âāĻšāĻžāĻŦā§āĻĄā§āĻŦā§â āĻāĻžāĻā§āĻā§ āĻāĻĨā§ āĻĒāĻžāĻ¨āĻŋāĻ¤ā§āĨ¤"
print(tagger.tag(text))
output
[('āĻ¸ā§āĻ', 'noun'), ('āĻāĻžāĻ°āĻŖā§', 'preposition'), ('āĻŦā§āĻšāĻ¸ā§āĻĒāĻ¤āĻŋāĻŦāĻžāĻ°', 'noun'), ('āĻĻā§āĻāĻž', 'verb'), ('āĻā§āĻ˛', 'verb'), ('āĻĒā§āĻ°ā§', 'verb'), ('āĻāĻ˛āĻžāĻāĻž', 'noun'), ('āĻšāĻžāĻŦā§āĻĄā§āĻŦā§', 'noun'), ('āĻāĻžāĻā§āĻā§', 'verb'), ('āĻāĻĨā§', 'adverb'), ('āĻĒāĻžāĻ¨āĻŋ', 'noun')]
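A dictionary-backed tagger looks each root word up in a tag table (the toolkit ships 133,973++ root-word POS tags) and falls back to a default class for unknown words. A toy sketch with a hypothetical table and Latin placeholder tokens:

```python
def pos_tag(tokens, tag_table, default="noun"):
    """Tag each token from a lookup table, defaulting unknowns to 'noun'."""
    return [(tok, tag_table.get(tok, default)) for tok in tokens]

# Hypothetical miniature tag table.
table = {"ami": "pronoun", "khai": "verb"}
print(pos_tag(["ami", "bhat", "khai"], table))
# -> [('ami', 'pronoun'), ('bhat', 'noun'), ('khai', 'verb')]
```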
Good accuracy for single entities.
from bn_nlp.NER import UncustomizeNER
ner=UncustomizeNER()
text="āĻāĻ°ā§āĻā§āĻ¨ā§āĻāĻŋāĻ¨āĻž āĻĻāĻā§āĻˇāĻŋāĻŖ āĻāĻŽā§āĻ°āĻŋāĻāĻžāĻ° āĻāĻāĻāĻŋ āĻ°āĻžāĻˇā§āĻā§āĻ°āĨ¤ āĻŦā§āĻ¯āĻŧā§āĻ¨ā§āĻ¸ āĻāĻāĻ°ā§āĻ¸ āĻĻā§āĻļāĻāĻŋāĻ° āĻŦā§āĻšāĻ¤ā§āĻ¤āĻŽ āĻļāĻšāĻ° āĻ āĻ°āĻžāĻāĻ§āĻžāĻ¨ā§āĨ¤"
print(ner.NER(text))
output
{'āĻāĻ°ā§āĻā§āĻ¨ā§āĻāĻŋāĻ¨āĻž': 'LOC', 'āĻĻāĻā§āĻˇāĻŋāĻŖ āĻāĻŽā§āĻ°āĻŋāĻāĻžāĻ°': 'LOC', 'āĻ°āĻžāĻˇā§āĻā§āĻ°': 'LOC', 'āĻŦā§āĻ¯āĻŧā§āĻ¨ā§āĻ¸ āĻāĻāĻ°ā§āĻ¸': 'PER', 'āĻĻā§āĻļāĻāĻŋāĻ°': 'LOC', 'āĻŦā§āĻšāĻ¤ā§āĻ¤āĻŽ āĻļāĻšāĻ°': 'LOC'}
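Database-backed NER is essentially a gazetteer lookup: known entity strings are matched in the text, longest first, and tagged with their stored class. A minimal sketch with a hypothetical gazetteer and Latin placeholder names:

```python
def gazetteer_ner(text, gazetteer):
    """Tag every gazetteer entry found in the text, trying longer entries first."""
    found = {}
    for entity in sorted(gazetteer, key=len, reverse=True):
        if entity in text:
            found[entity] = gazetteer[entity]
    return found

# Hypothetical gazetteer; the toolkit's database holds 408,837++ tagged entities.
gaz = {"Buenos Aires": "LOC", "Argentina": "LOC", "Messi": "PER"}
print(gazetteer_ner("Messi lives in Buenos Aires", gaz))
# -> {'Buenos Aires': 'LOC', 'Messi': 'PER'}
```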
Thank you
Let's make better resources for Bangla