KsponSpeech

Pre-processing KsponSpeech corpus provided by AI Hub

It's been a while since KsponSpeech was released, but it's hard to compare performance because there's no established preprocessing method. So we're revealing the pre-processing method we used in the KoSpeech project. This project provides processing in characters, subwords, and grapheme units.

Intro

KsponSpeech-preprocess is repository for pre-processing KsponSpeech corpus provided by AI Hub.
KsponSpeech corpus is a 1000h Korean speech data corpus provided by AI Hub in Korea.
Anyone can download this dataset just by applying. The transcription rules can see here.

You can pre-process in various output-units, such as character, subword, grapheme
We will explain the details in the Output-Unit part below.

Prerequisites

Pandas: pip install pandas (Refer here for problem installing Pandas)
Sentencepiece: pip install sentencepiece (Refer here for problem installing Sentencepiece)

Usage

Set options in run.sh

python main.py --dataset_path $DATASET_PATH --vocab_dest $VOCAB_DEST --output_unit $OUTPUT_UNIT --preprocess_mode $PREPROCESS_MODE --vocab_size $VOCAB_SIZE

Run run.sh

$ ./run.sh

Leave the computer running for hours.

Preprocess

You can choose between phonetic transcription and spelling transcription to preprocess.

Raw data

b/ (70%)/(칠 십 퍼센트) 확률이라니 아/ (뭐+ 뭔)/(모+ 몬) 소리야 진짜 (100%)(백 프로)가 왜 안돼? n/

Delete noise labels, such as b/, n/, / ..

(70%)/(칠 십 퍼센트) 확률이라니 아/ (뭐+ 뭔)/(모+ 몬) 소리야 진짜 (100%)(백 프로)가 왜 안돼?

Delete labels such as '/', '*', '+', etc. (used for interjection representation)

(70%)/(칠 십 퍼센트) 확률이라니 아 (뭐 뭔)/(모 몬) 소리야 진짜 (100%)(백 프로)가 왜 안돼?

Option1 : phonetic transcript

칠 십 퍼센트 확률이라니 아 모 몬 소리야 진짜 백 프로가 왜 안돼?

Option2 : spelling transcript

70% 확률이라니 아 뭐 뭔 소리야 진짜 100%가 왜 안돼?

Output-Unit

This project provides processing in characters, subwords, and grapheme units.

Character-Unit

아 모 몬 소리야 칠 십 퍼센트 확률이라니

Subword-Unit

▁아 ▁모 ▁ 몬 ▁소리 야 ▁ 칠 ▁ 십 ▁퍼 센트 ▁확 률 이라 니

Grapheme-Unit

ㅇㅏ ㅁㅗ ㅁㅗㄴ ㅅㅗㄹㅣㅇㅑ ㅊㅣㄹ ㅅㅣㅂ ㅍㅓㅅㅔㄴㅌㅡ ㅎㅘㄱㄹㅠㄹㅇㅣㄹㅏㄴㅣ

Conversion to Numeric id

transcript

아 모 몬 소리야 칠 십 퍼센트 확률이라니

conversion

7 3 106 3 730 3 173 32 26 3 319 3 120 3 490 552 157 3 315 747 5 33 22

Troubleshoots and Contributing

If you have any questions, bug reports, and feature requests, please open an issue on Github.
For live discussions, please go to our gitter or Contacts sh951011@gmail.com please.

I appreciate any kind of feedback or contribution. Feel free to proceed with small issues like bug fixes, documentation improvement. For major contributions and new features, please discuss with the collaborators in corresponding issues.

Author

Soohwan Kim, Seyoung Bae, Cheolhwang Won, Soyoung Cho, Jeongwon Kwak
Contacts: sh951011@gmail.com

Name		Name	Last commit message	Last commit date
Latest commit History 119 Commits
preprocess		preprocess
LICENSE		LICENSE
README.md		README.md
main.py		main.py
requirements.txt		requirements.txt
run.sh		run.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

KsponSpeech

Pre-processing KsponSpeech corpus provided by AI Hub

Intro

Prerequisites

Usage

Preprocess

Output-Unit

Conversion to Numeric id

Troubleshoots and Contributing

Author

About

Releases

Packages

Contributors 5

Languages

License

sooftware/ksponspeech

Folders and files

Latest commit

History

Repository files navigation

KsponSpeech

Pre-processing KsponSpeech corpus provided by AI Hub

Intro

Prerequisites

Usage

Preprocess

Output-Unit

Conversion to Numeric id

Troubleshoots and Contributing

Author

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 5

Languages

Packages