Zpoken is a Ukrainian IT company; one of its major divisions focuses on speech recognition technologies for English and Slavic (Ukrainian, Russian) languages.
We are happy to present our Russian speech dataset: Zpoken Dataset [RU].
The dataset currently consists of 5 source parts: radio_source_1, radio_source_2, radio_source_3, radio_source_5, Ru-films.
All data is stored in .opus format and was converted to mono, 16 kHz sampling rate, 16-bit.
Part name | Duration (h) | Samples num. | Average duration (s) | Characters per second | Characters per sample |
---|---|---|---|---|---|
radio_source_1 | 16 424.82 | 7 887 042 | 7.50 | 14.12 | 105.84 |
radio_source_2 | 2 308.46 | 955 904 | 8.69 | 13.53 | 117.62 |
radio_source_3 | 500.14 | 165 584 | 10.87 | 13.90 | 151.16 |
radio_source_5 | 655.88 | 216 101 | 10.93 | 16.63 | 181.66 |
Ru-films | 850.88 | 203 972 | 15.02 | 8.76 | 131.57 |
Total / Average | 20 740.18 | 9 428 603 | 7.91 | 13.95 | 106.17 |
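As a quick sanity check on the table, the per-part average durations can be recomputed from the other two columns. This is a minimal sketch that assumes the average is simply total duration divided by sample count; the source does not state how the column was derived, and the helper name `average_duration_s` is hypothetical.

```python
# Figures copied from the table above: part name -> (duration in hours, number of samples).
PARTS = {
    "radio_source_1": (16424.82, 7_887_042),
    "radio_source_2": (2308.46, 955_904),
    "radio_source_3": (500.14, 165_584),
    "radio_source_5": (655.88, 216_101),
    "Ru-films": (850.88, 203_972),
}


def average_duration_s(hours: float, samples: int) -> float:
    """Mean sample length in seconds, ASSUMING it equals total duration / sample count."""
    return hours * 3600 / samples


# Print the recomputed average for each part, rounded to two decimals like the table.
for name, (hours, samples) in PARTS.items():
    print(f"{name}: {average_duration_s(hours, samples):.2f} s")
```

Under that assumption the recomputed per-part values agree with the published column to two decimal places.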
All parts were scraped from open sources. The raw material consisted of long audio files and transcriptions without timestamps, so one of the challenges we solved was aligning the original transcription to each short audio sample. You will be able to read more about this problem in our future paper.
We provide free-to-use demo subsets totaling 150 hours (50 hours for radio_source_1 and 25 hours for each of the other parts). Each demo is a random sample of the corresponding dataset part.
Part name | Duration (h) | Samples num. | Size (MB) | Link to download |
---|---|---|---|---|
radio_source_1 | 50 | 34 356 | 837 | Radio1_50h.zip |
radio_source_2 | 25 | 16 041 | 430 | Radio2_25h.zip |
radio_source_3 | 25 | 8 933 | 418 | Radio3_25h.zip |
radio_source_5 | 25 | 10 786 | 441 | Radio5_25h.zip |
Ru-films | 25 | 7 358 | 380 | Ru_films_25h.zip |
Total | 150 | 77 474 | 2 506 | |
The archives are hosted on Google Drive, so we provide ./download.sh to fetch them easily.
The script requires gdown:
pip install gdown
Then run bash download.sh on your Linux machine.
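Once the archives are downloaded, they still need to be unpacked. The following is a minimal sketch using only the Python standard library; the function name `extract_archives` and the idea of keeping all archives in one folder are assumptions for illustration, not part of download.sh.

```python
import zipfile
from pathlib import Path
from typing import List


def extract_archives(archive_dir: str, target_dir: str) -> List[str]:
    """Extract every .zip found in archive_dir into target_dir.

    Returns the names of the archives that were extracted, in sorted order.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    extracted = []
    for archive in sorted(Path(archive_dir).glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(target)  # keeps the folder layout stored in the archive
        extracted.append(archive.name)
    return extracted
```

For example, `extract_archives(".", "zpoken_demo")` would unpack all archives from the current directory into a hypothetical zpoken_demo folder.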
After unzipping each archive, you will find the following directory structure:
+---<DatasetPartName>
| +---data
| | +---subfolder1 (optional)
| | | +---speech_file1.opus
| | | +...
| | | \---speech_file[N].opus
| | +...
| | +---subfolder[N] (optional)
| | | +---speech_file1.opus
| | | +...
| | | \---speech_file[N].opus
| +---transcription.csv
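The layout above can be traversed with a few lines of standard-library Python. This is a sketch under stated assumptions: the helper names are hypothetical, and because the column layout of transcription.csv is not documented here, the file is read as raw rows, assuming UTF-8 text and a comma delimiter.

```python
import csv
from pathlib import Path
from typing import List


def list_audio_files(part_dir: str) -> List[Path]:
    """Return every .opus file under <part_dir>/data, with or without subfolders."""
    return sorted((Path(part_dir) / "data").rglob("*.opus"))


def read_transcription_rows(part_dir: str) -> List[List[str]]:
    """Read <part_dir>/transcription.csv as raw rows.

    ASSUMPTIONS: UTF-8 encoding and comma-separated values; no column
    names or order are assumed, since the source does not specify them.
    """
    csv_path = Path(part_dir) / "transcription.csv"
    with open(csv_path, newline="", encoding="utf-8") as f:
        return list(csv.reader(f))
```

Comparing `len(list_audio_files(part))` with the sample counts in the demo table is a simple way to confirm that an archive was extracted completely.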
If you are interested in the full version of the dataset, feel free to contact us via this form. We usually reply within one working day.
- release more hours
- optimize archive storage (Gdrive is too annoying)
CC-BY-4.0
Zpoken Dataset [RU] is licensed under a Creative Commons Attribution 4.0 International License.