- Cho Moon Gi @lnyxzdevk
- https://github.com/kocohub/korean-hate-speech: labeled data, unlabeled data
- https://github.com/2runo/Curse-detection-data: data
- https://aihub.or.kr/opendata/keti-data/recognition-laguage/KETI-02-007
- Data crawled from various Internet communities (Dcinside, Fmkorea, etc.)
- TensorFlow
- scikit-learn
- Mecab(KoNLPy)
- gensim == 3.8.3
- pandas
- matplotlib
- tweepy == 3.8.0
- eunjeon
- tkinter
DataAnalysis_CapstoneDesign
├── Main
│ ├── MainProgram.py
│ └── utils.py
├── image
└── README.md
- FastText - gensim library
- BiLSTM
- RNN
- GRU
- Attention
- 1D-CNN
- With the recent rise of broadcast-style SNS such as YouTube and TikTok, driven by the spread of smartphones, people who use smartphones and the Internet are easily and indiscriminately exposed to new kinds of toxic words.
- Toxic words picked up this way are then likely to be used, under the cover of anonymity, in news comments, in-game chat, and anonymous communities.
- So when someone posts such abusive language on the Internet, we need a system that automatically filters the sentence.
- The goal is to build a model that, given a short sentence of two to three lines (e.g., Korean in-game chat or Internet community comments), distinguishes as accurately as possible whether the sentence is abusive or not.
- The program then automatically filters the toxic words by masking them with *.
- Text preprocessing of the unlabeled data - Mecab (KoNLPy) (sketch below)
- Build word embedding vectors using FastText - gensim (sketch below)
- To balance the labels, augment the toxic data using FastText's most_similar method (Synonym Replacement) (sketch below)
- Vectorize and pad the train and test datasets - TensorFlow (sketch below)
- Train models - BiLSTM, RNN, GRU, 1D-CNN, Attention, BERT, KoBERT, etc. (sketch below)
- Predict whether a given sentence is toxic
- Mask toxic words with * by predicting the toxic probability of each word in the sentence (sketch below)
- Implement the GUI program with tkinter (sketch below)
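A minimal preprocessing sketch, assuming a simple Hangul-only regex filter and a Mecab POS filter; the exact pattern and filter used in the project may differ:

```python
import re
from konlpy.tag import Mecab  # the eunjeon package exposes a compatible Mecab class on Windows

mecab = Mecab()

def clean_text(sentence):
    # "Regexed Text" step: keep only Hangul, digits, and whitespace (illustrative pattern).
    return re.sub(r"[^가-힣0-9\s]", " ", sentence).strip()

def tokenize(sentence):
    # Morpheme analysis with Mecab, dropping particles (J*), endings (E*),
    # symbols (S*), copulas (VC*), and affixes (X*); the project's POS filter may differ.
    return [word for word, tag in mecab.pos(clean_text(sentence))
            if not tag.startswith(("J", "E", "S", "VC", "X"))]

print(tokenize("이 프로그램이 우리 계획의 시발점이다."))
# roughly: ['이', '프로그램', '우리', '계획', '시발점']
```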
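Embeddings are then trained on the tokenized unlabeled corpus with gensim 3.8.3's FastText. A sketch, assuming a variable `unlabeled_sentences` holding the crawled sentences; the hyperparameters are illustrative:

```python
from gensim.models import FastText

# `unlabeled_sentences` is assumed to hold the crawled raw sentences.
tokenized_corpus = [tokenize(s) for s in unlabeled_sentences]

ft_model = FastText(
    sentences=tokenized_corpus,
    size=100,      # embedding dimension ("vector_size" in gensim >= 4.0)
    window=5,
    min_count=2,
    sg=1,          # skip-gram
    iter=10,       # training epochs ("epochs" in gensim >= 4.0)
)
ft_model.save("fasttext_ko.model")

# Subword n-grams let the model handle misspelled or disguised curse words as well.
print(ft_model.wv.most_similar("씨발", topn=5))
```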
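To balance the labels, toxic sentences are oversampled via Synonym Replacement with FastText's most_similar. A sketch, assuming `train_data` is a list of (tokens, label) pairs; the replacement count and neighbour count are assumptions:

```python
import random

def synonym_replacement(tokens, ft_model, n_replace=1, topn=5):
    # Swap up to n_replace tokens for a randomly chosen FastText nearest neighbour.
    new_tokens = tokens[:]
    positions = list(range(len(new_tokens)))
    random.shuffle(positions)
    for i in positions[:n_replace]:
        word = new_tokens[i]
        if word not in ft_model.wv:
            continue
        neighbours = [w for w, _ in ft_model.wv.most_similar(word, topn=topn)]
        if neighbours:
            new_tokens[i] = random.choice(neighbours)
    return new_tokens

# Oversample the toxic class (label == 1) until the two labels are roughly balanced.
augmented_toxic = [(synonym_replacement(toks, ft_model), 1)
                   for toks, label in train_data if label == 1]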
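The token lists are mapped to integer sequences and padded to a fixed length with tf.keras, and the FastText vectors seed the Embedding layer. `MAX_LEN`, `train_tokens`, and `test_tokens` are assumed names:

```python
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_LEN = 30  # assumed maximum sentence length in tokens

keras_tokenizer = Tokenizer(oov_token="<OOV>")
keras_tokenizer.fit_on_texts(train_tokens)   # train_tokens: list of Mecab token lists

X_train = pad_sequences(keras_tokenizer.texts_to_sequences(train_tokens),
                        maxlen=MAX_LEN, padding="post")
X_test = pad_sequences(keras_tokenizer.texts_to_sequences(test_tokens),
                       maxlen=MAX_LEN, padding="post")

# Build an embedding matrix from the FastText vectors for the Keras Embedding layer.
vocab_size = len(keras_tokenizer.word_index) + 1
embedding_matrix = np.zeros((vocab_size, ft_model.vector_size))
for word, idx in keras_tokenizer.word_index.items():
    if word in ft_model.wv:
        embedding_matrix[idx] = ft_model.wv[word]
```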
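Of the models in the results table below, the BiLSTM variant is the simplest to sketch. Layer widths and dropout are assumptions; batch size 100 and 20 epochs match the reported setting, and `y_train`/`y_test` are the assumed label arrays:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Embedding(vocab_size, embedding_matrix.shape[1],
                     weights=[embedding_matrix],
                     input_length=MAX_LEN, trainable=False),
    layers.Bidirectional(layers.LSTM(64, dropout=0.2)),
    layers.Dense(32, activation="relu"),
    layers.Dense(1, activation="sigmoid"),   # P(sentence is toxic)
])

model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy",
                       tf.keras.metrics.Precision(),
                       tf.keras.metrics.Recall()])

model.fit(X_train, y_train, validation_split=0.1, batch_size=100, epochs=20)
model.evaluate(X_test, y_test)
```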
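For filtering, the sentence-level probability comes straight from the classifier. The per-word probabilities shown in the examples below can be approximated by scoring each token on its own; both that scheme and the 0.5 masking threshold are assumptions about the project's method:

```python
THRESHOLD = 0.5  # assumed masking threshold

def predict_sentence(text):
    tokens = tokenize(text)
    seq = pad_sequences(keras_tokenizer.texts_to_sequences([tokens]),
                        maxlen=MAX_LEN, padding="post")
    return float(model.predict(seq)[0][0]), tokens

def mask_toxic(text):
    sentence_prob, tokens = predict_sentence(text)
    masked = text
    for tok in tokens:
        tok_seq = pad_sequences(keras_tokenizer.texts_to_sequences([[tok]]),
                                maxlen=MAX_LEN, padding="post")
        tok_prob = float(model.predict(tok_seq)[0][0])
        if tok_prob >= THRESHOLD:
            masked = masked.replace(tok, "*" * len(tok))  # e.g. 씨발 -> **, 좆같네 -> **네
    return sentence_prob, masked

prob, masked = mask_toxic("아 씨발 진짜 개 좆같네")
print(f"{prob:.2%} probability of being a profane sentence")
print("Masked Text:", masked)
```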
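A minimal tkinter front end wired to the masking function above; the actual MainProgram.py UI is likely more elaborate, so treat the widget layout as illustrative:

```python
import tkinter as tk

def on_check():
    # Run the classifier on the entered sentence and show the masked result.
    prob, masked = mask_toxic(entry.get())
    result_var.set(f"Toxic probability: {prob:.2%}\nMasked: {masked}")

root = tk.Tk()
root.title("Toxic Sentence Filter")

entry = tk.Entry(root, width=50)
entry.pack(padx=10, pady=5)

tk.Button(root, text="Check", command=on_check).pack(pady=5)

result_var = tk.StringVar()
tk.Label(root, textvariable=result_var, justify="left").pack(padx=10, pady=5)

root.mainloop()
```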
Model | Precision | Recall | Test Accuracy |
---|---|---|---|
1D-CNN | 0.83 | 0.96 | 0.89 |
BiLSTM | 0.91 | 0.91 | 0.91 |
Double-BiLSTM | 0.94 | 0.89 | 0.92 |
Double-1D-CNN | 0.85 | 0.96 | 0.89 |
GRU | 0.92 | 0.91 | 0.92 |
Attention+BiLSTM+GRU | 0.91 | 0.93 | 0.92 |
BERT | 0.75 | 0.76 | 0.89 |
KoBERT | 0.71 | 0.75 | 0.90 |
Attention+BiLSTM+LSTM+GRU | 0.86 | 0.96 | 0.90 |
Deeper Attention | 0.79 | 0.98 | 0.86 |
Node Change using best Attention | 0.82 | 0.97 | 0.88 |
Attention Refine | 0.92 | 0.95 | 0.94 |
(Batch size = 100, epochs = 20; * Attention Refine was trained for 30 epochs)
- Normal Sentence: the program reports a 0.0% sentence-level profanity probability, lists the per-word probabilities, and leaves the text unmasked; note that 시발점 ("starting point") contains the string of a curse word but is correctly not masked.
Regexed Text: 이 프로그램이 우리 계획의 시발점이다
Tokenized Text: [['이', '프로그램', '우리', '계획', '시발점']]
0.0% 확률로 욕설 문장입니다.
----------------------------------------
욕설 부분 분석
이 : 18.57% 확률로 욕설 부분
프로그램 : 0.02% 확률로 욕설 부분
우리 : 0.11% 확률로 욕설 부분
계획 : 0.01% 확률로 욕설 부분
시발점 : 0.08% 확률로 욕설 부분
Original Text: 이 프로그램이 우리 계획의 시발점이다.
Masked Text: 이 프로그램이 우리 계획의 시발점이다.
- Toxic Sentence: the sentence is flagged as profane with 99.75% probability, and the high-probability morphemes are masked with * in the original text.
Regexed Text: 아 씨발 진짜 개 좆같네
Tokenized Text: [['아', '씨발', '진짜', '개', '좆같']]
99.75% 확률로 욕설 문장입니다.
----------------------------------------
욕설 부분 분석
아 : 2.61% 확률로 욕설 부분
씨발 : 99.62% 확률로 욕설 부분
진짜 : 4.43% 확률로 욕설 부분
개 : 81.81% 확률로 욕설 부분
좆같 : 90.25% 확률로 욕설 부분
Original Text: 아 씨발 진짜 개 좆같네
Masked Text: 아 ** 진짜 * **네