diff --git a/README.md b/README.md index 222f2b9..20ed27f 100644 --- a/README.md +++ b/README.md @@ -14,6 +14,8 @@ [Installation](#installation) | [Get Started](#get-started) +English | [中文](README_CN.md) + ## Introduction @@ -21,12 +23,12 @@ MindAudio is a toolbox of audio models and algorithms based on [MindSpore](https://www.mindspore.cn/). It provides a series of API for common audio data processing,data enhancement,feature extraction, so that users can preprocess data conveniently. Also provides examples to show how to build audio deep learning models with mindaudio. The following is the corresponding `mindaudio` versions and supported `mindspore` versions. -| mindaudio | mindspore | -| :--: | :--: | -| master | master | -| 0.4 | 2.3.0 | -| 0.3 | 2.2.10 | -| 0.1.x | 1.8&1.9 | + +| `mindspore` | `mindaudio` | `tested hardware` | +|--------------|-------------|------------------------------| +| `master` | `master` | `ascend 910*` | +| `2.3.0` | `0.4` | `ascend 910*` | +| `2.2.10` | `0.3` | `ascend 910` & `ascend 910*` | ### data processing @@ -64,7 +66,7 @@ python setup.py install ### -mindaudio provides a series of commonly used audio data processing apis, which can be easily invoked for data analysis and feature extraction. +MindAudio provides a series of commonly used audio data processing apis, which can be easily invoked for data analysis and feature extraction. ```python >>> import mindaudio.data.io as io diff --git a/README_CN.md b/README_CN.md new file mode 100644 index 0000000..1e59fe5 --- /dev/null +++ b/README_CN.md @@ -0,0 +1,114 @@ +
+ + +# MindAudio + +[![GitHub Workflow Status](https://img.shields.io/github/actions/workflow/status/mindspore-lab/mindaudio/ut_test.yaml) +![GitHub issues](https://img.shields.io/github/issues/mindspore-lab/mindaudio) +![GitHub pull requests](https://img.shields.io/github/issues-pr/mindspore-lab/mindaudio) +![GitHub](https://img.shields.io/github/license/mindspore-lab/mindaudio)](GitHub) +[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black) +[![Imports: isort](https://img.shields.io/badge/%20imports-isort-%231674b1?style=flat&labelColor=ef8336)](https://pycqa.github.io/isort/) + +[Introduction](#introduction) | +[Installation](#installation) | +[Get Started](#get-started) + +[English](README.md) | 中文 +
+ +## 介绍 +MindAudio 是基于 [MindSpore](https://www.mindspore.cn/) 的音频模型和算法工具箱。它提供了一系列用于常见音频数据处理、数据增强、特征提取的 API,方便用户对数据进行预处理。此外,它还提供了一些示例,展示如何利用 mindaudio 建立音频深度学习模型。 + +下表显示了相应的 `mindaudio` 版本和支持的 `mindspore` 版本。 + +| `mindspore` | `mindaudio` | `tested hardware` | +|--------------|-------------|------------------------------| +| `master` | `master` | `ascend 910*` | +| `2.3.0` | `0.4` | `ascend 910*` | +| `2.2.10` | `0.3` | `ascend 910` & `ascend 910*` | + +### 数据处理 + + +```python +# read audio +>>> import mindaudio.data.io as io +>>> audio_data, sr = io.read(data_file) +# feature extraction +>>> import mindaudio.data.features as features +>>> feats = features.fbanks(audio_data) +``` + +## 安装 + +### Pypi安装 + +MindAudio的发布版本可以通过`PyPI`安装: + +```shell +pip install mindaudio +``` + +### 源码安装 +最新版本的 MindAudio 可以通过如下方式安装: + +```shell +git clone https://github.com/mindspore-lab/mindaudio.git +cd mindaudio +pip install -r requirements/requirements.txt +python setup.py install +``` + +## 快速入门音频数据分析 + +### + +MindAudio 提供了一系列常用的音频数据处理 APIs,可以轻松调用这些 APIs 进行数据分析和特征提取。 + +```python +>>> import mindaudio.data.io as io +>>> import mindaudio.data.spectrum as spectrum +>>> import numpy as np +>>> import matplotlib.pyplot as plt +# read audio +>>> audio_data, sr = io.read("./tests/samples/ASR/BAC009S0002W0122.wav") +# feature extraction +>>> n_fft = 512 +>>> matrix = spectrum.stft(audio_data, n_fft=n_fft) +>>> magnitude, _ = spectrum.magphase(matrix, 1) +# display +>>> x = [i for i in range(0, 256*750, 256)] +>>> f = [i/n_fft * sr for i in range(0, int(n_fft/2+1))] +>>> plt.pcolormesh(x,f,magnitude, shading='gouraud', vmin=0, vmax=np.percentile(magnitude, 98)) +>>> plt.title('STFT Magnitude') +>>> plt.ylabel('Frequency [Hz]') +>>> plt.xlabel('Time [sec]') +>>> plt.show() +``` + +结果如图: + +![image-20230310165349460](https://raw.githubusercontent.com/mindspore-lab/mindaudio/main/tests/result/stft_magnitude.png) + + +## 贡献方式 +我们感谢开发者用户的所有贡献,一起让 MindAudio 变得更好。 +贡献指南请参考[CONTRIBUTING.md](CONTRIBUTING.md) 。 + +## 许可证 + +MindAudio 遵循[Apache License 2.0](LICENSE)开源协议. + +## 引用 + +如果你觉得 MindAudio 对你的项目有帮助,请考虑引用: + +```latex +@misc{MindSpore Audio 2022, + title={{MindSpore Audio}:MindSpore Audio Toolbox and Benchmark}, + author={MindSpore Audio Contributors}, + howpublished = {\url{https://github.com/mindspore-lab/mindaudio}}, + year={2022} +} +``` diff --git a/examples/ECAPA-TDNN/readme.md b/examples/ECAPA-TDNN/readme.md index 2b101ff..85261e0 100644 --- a/examples/ECAPA-TDNN/readme.md +++ b/examples/ECAPA-TDNN/readme.md @@ -118,9 +118,9 @@ python speaker_verification_cosine.py --need_generate_data=False ## **性能表现** - - tested on ascend 910 with 8 cards. + - tested on ascend 910 with 8 cards. 
- total training time : 24hours -| model | eer with s-norm | eer s-norm | config| weights| -| :-: | :-: | :-: | :-: | :-:| -| ECAPA-TDNN | 1.50% | 1.70% | [yaml](https://github.com/mindsporelab/mindaudio/blob/main/example/ECAPA-TDNN/ecapatdnn.yaml) | [weights](https://download.mindspore.cn/toolkits/mindaudio/ecapatdnn/ecapatdnn_vox12.ckpt) | +| model | eer with s-norm | eer s-norm | config | weights | +|:----------:|:---------------:|:----------:|:---------------------------------------------------------------------------------------------:|:------------------------------------------------------------------------------------------:| +| ECAPA-TDNN | 1.50% | 1.70% | [yaml](https://github.com/mindsporelab/mindaudio/blob/main/example/ECAPA-TDNN/ecapatdnn.yaml) | [weights](https://download.mindspore.cn/toolkits/mindaudio/ecapatdnn/ecapatdnn_vox12.ckpt) | diff --git a/examples/ECAPA-TDNN/voxceleb_prepare.py b/examples/ECAPA-TDNN/voxceleb_prepare.py index 22a3255..7d27571 100644 --- a/examples/ECAPA-TDNN/voxceleb_prepare.py +++ b/examples/ECAPA-TDNN/voxceleb_prepare.py @@ -86,7 +86,7 @@ def prepare_voxceleb( ): """ Prepares the csv files for the Voxceleb1 or Voxceleb2 datasets. - Please follow the instructions in the README.md file for + Please follow the instructions in the readme.md file for preparing Voxceleb2. """ diff --git a/examples/conformer/readme.md b/examples/conformer/readme.md index 2752462..f4a75b3 100644 --- a/examples/conformer/readme.md +++ b/examples/conformer/readme.md @@ -1,40 +1,41 @@ -# 使用conformer进行语音识别 +# Using Conformer for Speech Recognition +> [Conformer: Convolution-augmented Transformer for Speech Recognition](https://arxiv.org/abs/2005.08100) +## Introduction -## 介绍 +Conformer is a model that combines transformers and CNNs to model both local and global dependencies in audio sequences. Currently, models based on transformers and convolutional neural networks (CNNs) have achieved good results in automatic speech recognition (ASR). Transformers can capture long-sequence dependencies and global interactions based on content, while CNNs can effectively utilize local features. Therefore, a convolution-enhanced transformer model called Conformer has been proposed for speech recognition, showing performance superior to both transformers and CNNs. The current version supports using the Conformer model for training/testing and inference on the AISHELL-1 dataset on ascend NPU and GPU. -conformer是将一种transformer和cnn结合起来,对音频序列进行局部和全局依赖都进行建模的模型。目前基于transformer和卷积神经网络cnn的模型在ASR上已经达到了较好的效果,Transformer能够捕获长序列的依赖和基于内容的全局交互信息,CNN则能够有效利用局部特征,因此针对语音识别问题提出了卷积增强的transformer模型,称为conformer,模型性能优于transformer和cnn。目前提供版本支持在NPU和GPU上使用[conformer](https://arxiv.org/pdf/2102.06657v1.pdf)模型在aishell-1数据集上进行训练/测试和推理。 +### Model Structure -### 模型结构 +The overall structure of Conformer includes SpecAug, ConvolutionSubsampling, Linear, Dropout, and ConformerBlocks×N, as shown in the structure diagram below. -Conformer整体结构包括:SpecAug、ConvolutionSubsampling、Linear、Dropout、ConformerBlocks×N,可见如下结构图。 +- ConformerBlock Structure (N of this structure): Feed Forward Module, Multi-Head Self Attention Module, Convolution Module, Feed Forward Module, Layernorm. Each module is preceded by a Layernorm and followed by a Dropout, with residual connections linking the input data directly. 
-- ConformerBlock结构(N个该结构):Feed Forward Module、Multi-Head Self Attention Module、Convolution Module、Feed Forward Module、Layernorm。其中每个Module都是前接一个Layernorm后接一个Dropout,且都有残差链连接,残差数据为输入数据本身。
-
-- 马卡龙结构:可以看到ConformerBlock神似马卡龙结构,即两个一样的Feed Forward Module中间夹了Multi-Head Self Attention Module和Convolution。
+- Macaron Structure: The ConformerBlock resembles a macaron structure, with a Multi-Head Self Attention Module and Convolution Module sandwiched between two identical Feed Forward Modules.

  ![image-20230310165349460](https://raw.githubusercontent.com/mindspore-lab/mindaudio/main/tests/result/conformer.png)

+## Usage Steps

-### 1. 数据集准备
+### 1. Dataset Preparation

-以aishell数据集为例,mindaudio提供下载、生成统计信息的脚本(包含wav文件地址信息以及对应中文信息),执行此脚本会生成train.csv、dev.csv、test.csv三个文件。
+Take the AISHELL dataset as an example. MindAudio provides a script to download the data and generate the data lists (containing the wav file paths and the corresponding Chinese transcripts). Executing this script will generate three files: train.csv, dev.csv, and test.csv.

```shell
-# data_path为存放数据的地址
+# data_path is the path where the data is stored
python mindaudio/data/aishell.py --data_path "/data" --download False
```

-如需下载数据, --download True
+To download the data, set the --download parameter to True.

-### 2. 数据预处理
+### 2. Data Preprocessing

-#### 文字部分
+#### Text Part

-根据aishell提供的aishell_transcript_v0.8.txt,生成逐字的编码文件,每个字对应一个id,输出包含编码信息的文件:lang_char.txt。
+Based on the aishell_transcript_v0.8.txt provided by AISHELL, generate a character-by-character encoding file in which each character corresponds to an ID, and output a file containing the encoding information: lang_char.txt.

```shell
cd mindaudio/utils
@@ -42,45 +43,46 @@ python text2token.py -s 1 -n 1 "data_path/data_aishell/transcript/aishell_transc
 | sort | uniq | grep -a -v -e '^\s*$' | awk '{print $0 " " NR+1}' >> ${/data_path/lang_char.txt}
```

-#### 音频部分
+#### Audio Part

-本模型使用了全局cmvn,为提高模型训练效率,在训练前会对数据的特征进行统计,生成包含统计信息的文件:global_cmvn.json。
+This model uses global CMVN. To improve training efficiency, feature statistics are computed over the data before training and saved to a file: global_cmvn.json.

```shell
cd examples/conformer
python compute_cmvn_stats.py --num_workers 16 --train_config conformer.yaml --in_scp data_path/train.csv --out_cmvn data_path/global_cmvn
```

-注意:--num_workers可根据训练设备的核数进行调整
+Note: --num_workers can be adjusted according to the number of cores on the training device.

-### 3. 开始训练(默认使用Ascend 910)
+### 3. Training

-#### 单卡
+#### Single-Card Training (by default using Ascend 910)
```shell
cd examples/conformer
# Standalone training
python train.py --config_path ./conformer.yaml
```
-注意:
-
-#### 8卡训练
+Note: An Ascend device is used by default.

-需配置is_distributed参数为True
+#### Multi-Card Training on Ascend
+This example uses 8 Ascend NPUs.
```shell
-# Distribute_training
-mpirun -n 8 python train.py --config_path ./conformer.yaml --is_distributed True
+# Distributed training
+mpirun -n 8 python train.py --config_path ./conformer.yaml
```
-
-如果脚本是由root用户执行的,必须在mpirun中添加——allow-run-as-root参数,如下所示:
+Note:
+For multi-card training, is_distributed must be set to True, either by modifying the YAML file or by passing the parameter on the command line, as shown below.
```shell
-mpirun --allow-run-as-root -n 8 python train.py --config_path ./conformer.yaml
+# Distributed training
+mpirun -n 8 python train.py --config_path ./conformer.yaml --is_distributed True
```
+If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.

-启动训练前,可更改环境变量设置,更改线程数以提高运行速度。如下所示:
+Before starting training, you can set environment variables to adjust the number of threads for faster execution, as shown below:

```shell
export OPENBLAS_NUM_THREADS=1
@@ -89,28 +91,47 @@ export MKL_NUM_THREADS=1
```


-### 4.评估
+### 4. Evaluation

-我们提供ctc greedy search、ctc prefix beam search、attention decoder、attention rescoring四种解码方式,可在yaml配置文件中对解码方式进行修改。
-
-执行脚本后将生成包含预测结果的文件为result.txt
+Four decoding methods are provided: CTC greedy search, CTC prefix beam search, attention decoder, and attention rescoring. The decoding method can be modified in the YAML configuration file.
+Executing the script will generate a file containing the prediction results: result.txt.

```shell
+# by default using ctc greedy search decoder
python predict.py --config_path ./conformer.yaml
+
+# using ctc prefix beam search decoder
+python predict.py --config_path ./conformer.yaml --decode_mode ctc_prefix_beam_search
+
+# using attention decoder
+python predict.py --config_path ./conformer.yaml --decode_mode attention
+
+# using attention rescoring decoder
+python predict.py --config_path ./conformer.yaml --decode_mode attention_rescoring
```

-### **性能表现**
+## Model Performance
+The training config can be found in [conformer.yaml](https://github.com/mindspore-lab/mindaudio/blob/main/examples/conformer/conformer.yaml).
+
+Performance tested on ascend 910 (8p) in graph mode:
+
+| model | decoding mode | CER |
+|-----------|------------------------|--------------|
+| conformer | ctc greedy search | 5.35 |
+| conformer | ctc prefix beam search | 5.36 |
+| conformer | attention decoder | coming soon |
+| conformer | attention rescoring | 4.95 |
+- [weights](https://download-mindspore.osinfra.cn/toolkits/mindaudio/conformer/conformer_avg_30-548ee31b.ckpt) can be downloaded here.

-* Feature info: using fbank feature, cmvn, online speed perturb
-* Training info: lr 0.001, acc_grad 1, 240 epochs, ascend 910*8
-* Decoding info: ctc_weight 0.3, average_num 30
-* Performance result: total_time 11h17min, 8p, using hccl_tools.
+---
+Performance tested on ascend 910* (8p) in graph mode:

-| model | decoding mode | CER |
-| --------- | ---------------------- | ---- |
-| conformer | ctc greedy search | 5.05 |
-| conformer | ctc prefix beam search | 5.05 |
-| conformer | attention decoder | 5.00 |
-| conformer | attention rescoring | 4.73 |
+| model | decoding mode | CER |
+|-----------|------------------------|--------------|
+| conformer | ctc greedy search | 5.62 |
+| conformer | ctc prefix beam search | 5.62 |
+| conformer | attention decoder | coming soon |
+| conformer | attention rescoring | 5.12 |
+- [weights](https://download-mindspore.osinfra.cn/toolkits/mindaudio/conformer/conformer_avg_30-692d57b3-910v2.ckpt) can be downloaded here.
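+
+For reference, CER in the tables above is the character error rate: the character-level edit distance between the predicted transcript and the reference transcript, divided by the reference length. The snippet below is a minimal, illustrative sketch of that calculation in plain Python; it is not the evaluation code used by this repository, and the function name is chosen only for illustration.
+
+```python
+def character_error_rate(reference: str, hypothesis: str) -> float:
+    """Character-level Levenshtein distance divided by the reference length."""
+    prev = list(range(len(hypothesis) + 1))
+    for i, ref_char in enumerate(reference, start=1):
+        curr = [i]
+        for j, hyp_char in enumerate(hypothesis, start=1):
+            cost = 0 if ref_char == hyp_char else 1
+            curr.append(min(prev[j] + 1,          # deletion
+                            curr[j - 1] + 1,      # insertion
+                            prev[j - 1] + cost))  # substitution
+        prev = curr
+    return prev[-1] / max(len(reference), 1)
+
+
+# One substituted character out of six gives a CER of about 0.167.
+print(round(character_error_rate("今天天气很好", "今天天汽很好"), 3))
+```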
diff --git a/examples/conformer/readme_cn.md b/examples/conformer/readme_cn.md new file mode 100644 index 0000000..e198491 --- /dev/null +++ b/examples/conformer/readme_cn.md @@ -0,0 +1,133 @@ +# 使用conformer进行语音识别 + +> [Conformer: Convolution-augmented Transformer for Speech Recognition](https://arxiv.org/abs/2005.08100) + +## 介绍 + +conformer是将一种transformer和cnn结合起来,对音频序列进行局部和全局依赖都进行建模的模型。目前基于transformer和卷积神经网络cnn的模型在ASR上已经达到了较好的效果,Transformer能够捕获长序列的依赖和基于内容的全局交互信息,CNN则能够有效利用局部特征,因此针对语音识别问题提出了卷积增强的transformer模型,称为conformer,模型性能优于transformer和cnn。目前提供版本支持在NPU和GPU上使用[conformer](https://arxiv.org/pdf/2102.06657v1.pdf)模型在aishell-1数据集上进行训练/测试和推理。 + +### 模型结构 + +Conformer整体结构包括:SpecAug、ConvolutionSubsampling、Linear、Dropout、ConformerBlocks×N,可见如下结构图。 + +- ConformerBlock结构(N个该结构):Feed Forward Module、Multi-Head Self Attention Module、Convolution Module、Feed Forward Module、Layernorm。其中每个Module都是前接一个Layernorm后接一个Dropout,且都有残差链连接,残差数据为输入数据本身。 + +- 马卡龙结构:可以看到ConformerBlock神似马卡龙结构,即两个一样的Feed Forward Module中间夹了Multi-Head Self Attention Module和Convolution。 + + ![image-20230310165349460](https://raw.githubusercontent.com/mindspore-lab/mindaudio/main/tests/result/conformer.png) + + +## 使用步骤 + +### 1. 数据集准备 + +以aishell数据集为例,mindaudio提供下载、生成统计信息的脚本(包含wav文件地址信息以及对应中文信息),执行此脚本会生成train.csv、dev.csv、test.csv三个文件。 + +```shell +# data_path为存放数据的地址 +python mindaudio/data/aishell.py --data_path "/data" --download False +``` + +如需下载数据, --download True + +### 2. 数据预处理 + +#### 文字部分 + +根据aishell提供的aishell_transcript_v0.8.txt,生成逐字的编码文件,每个字对应一个id,输出包含编码信息的文件:lang_char.txt。 + +```shell +cd mindaudio/utils +python text2token.py -s 1 -n 1 "data_path/data_aishell/transcript/aishell_transcript_v0.8.txt" | cut -f 2- -d" " | tr " " "\n" \ + | sort | uniq | grep -a -v -e '^\s*$' | awk '{print $0 " " NR+1}' >> ${/data_path/lang_char.txt} +``` + +#### 音频部分 + +本模型使用了全局cmvn,为提高模型训练效率,在训练前会对数据的特征进行统计,生成包含统计信息的文件:global_cmvn.json。 + +```shell +cd examples/conformer +python compute_cmvn_stats.py --num_workers 16 --train_config conformer.yaml --in_scp data_path/train.csv --out_cmvn data_path/global_cmvn +``` + +注意:--num_workers可根据训练设备的核数进行调整 + +### 3. 
开始训练(默认使用Ascend 910) + +#### 单卡 +```shell +cd examples/conformer +# Standalone training +python train.py --config_path ./conformer.yaml +``` + +注意: + +#### Ascend上进行多卡训练 + +需配置is_distributed参数为True + +```shell +# Distribute_training +mpirun -n 8 python train.py --config_path ./conformer.yaml --is_distributed True +``` + +如果脚本是由root用户执行的,必须在mpirun中添加——allow-run-as-root参数,如下所示: + +```shell +mpirun --allow-run-as-root -n 8 python train.py --config_path ./conformer.yaml +``` + + +启动训练前,可更改环境变量设置,更改线程数以提高运行速度。如下所示: + +```shell +export OPENBLAS_NUM_THREADS=1 +export MKL_NUM_THREADS=1 +``` + + + +### 4.评估 + +我们提供ctc greedy search、ctc prefix beam search、attention decoder、attention rescoring四种解码方式,可在yaml配置文件中对解码方式进行修改。 + +执行脚本后将生成包含预测结果的文件为result.txt + +```shell +# by default using ctc greedy search decoder +python predict.py --config_path ./conformer.yaml + +# using ctc prefix beam search decoder +python predict.py --config_path ./conformer.yaml --decode_mode ctc_prefix_beam_search + +# using attention decoder +python predict.py --config_path ./conformer.yaml --decode_mode attention + +# using attention rescoring decoder +python predict.py --config_path ./conformer.yaml --decode_mode attention_rescoring +``` + +## **模型表现** +训练的配置文件见 [conformer.yaml](https://github.com/mindspore-lab/mindaudio/blob/main/examples/conformer/conformer.yaml)。 + +在 ascend 910(8p) 图模式上的测试性能: + +| model | decoding mode | CER | +| --------- | ---------------------- |--------------| +| conformer | ctc greedy search | 5.35 | +| conformer | ctc prefix beam search | 5.36 | +| conformer | attention decoder | comming soon | +| conformer | attention rescoring | 4.95 | +- 训练好的 [weights](https://download-mindspore.osinfra.cn/toolkits/mindaudio/conformer/conformer_avg_30-548ee31b.ckpt) 可以在此处下载。 +--- +在 ascend 910*(8p) 图模式上的测试性能: + +| model | decoding mode | CER | +| --------- | ---------------------- |--------------| +| conformer | ctc greedy search | 5.62 | +| conformer | ctc prefix beam search | 5.62 | +| conformer | attention decoder | comming soon | +| conformer | attention rescoring | 5.12 | +- 训练好的 [weights](https://download-mindspore.osinfra.cn/toolkits/mindaudio/conformer/conformer_avg_30-692d57b3-910v2.ckpt) 可以在此处下载。 diff --git a/examples/conv_tasnet/readme.md b/examples/conv_tasnet/readme.md index fffc317..ee87602 100644 --- a/examples/conv_tasnet/readme.md +++ b/examples/conv_tasnet/readme.md @@ -61,9 +61,8 @@ python eval.py -c "conv_tasnet.yaml" ``` - ## **模型表现** -| 模型 | 机器 | SI-SNR | 参数 | -| ----------- | -------- | ------ | ------------------------------------------------------------ | +| 模型 | 机器 | SI-SNR | 参数 | +| ----------- | -------- | ------ |--------------------------------------------------------------------------------------------------| | conv_tasnet | D910x8-G | 12.59 | [yaml](https://github.com/mindsporelab/mindaudio/blob/main/example/conv_tasnet/conv_tasnet.yaml) | diff --git a/examples/deepspeech2/readme.md b/examples/deepspeech2/readme.md index 725de44..ce92ad5 100644 --- a/examples/deepspeech2/readme.md +++ b/examples/deepspeech2/readme.md @@ -1,52 +1,49 @@ -# 使用DeepSpeech2进行语音识别 +# Using DeepSpeech2 for Speech Recognition +> [Deep Speech 2: End-to-End Speech Recognition in English and Mandarin](https://arxiv.org/abs/1512.02595) +## Introduction +DeepSpeech2 is a speech recognition model trained using CTC loss. It replaces the entire manually designed component pipeline with neural networks and can handle a variety of speech, including noisy environments, accents, and different languages. 
The currently provided version supports using the [DeepSpeech2](http://arxiv.org/pdf/1512.02595v1.pdf) model for training/testing and inference on the librispeech dataset on NPU and GPU. -## 介绍 +### Model Architecture -DeepSpeech2是一种采用CTC损失训练的语音识别模型。它用神经网络取代了整个手工设计组件的管道,可以处理各种各样的语音,包括嘈杂的环境、口音和不同的语言。目前提供版本支持在NPU和GPU上使用[DeepSpeech2](http://arxiv.org/pdf/1512.02595v1.pdf)模型在librispeech数据集上进行训练/测试和推理。 +The current reproduced model includes: -### 模型结构 +- Two convolutional layers: + - Number of channels: 32, kernel size: 41, 11, stride: 2, 2 + - Number of channels: 32, kernel size: 41, 11, stride: 2, 1 +- Five bidirectional LSTM layers (size 1024) +- A projection layer [size equal to the number of characters plus 1 (for the CTC blank symbol), 28] -目前的复现的模型包括: +### Data Processing -- 两个卷积层: - - 通道数为 32,内核大小为 41, 11 ,步长为 2, 2 - - 通道数为 32,内核大小为 41, 11 ,步长为 2, 1 -- 五个双向 LSTM 层(大小为 1024) -- 一个投影层【大小为字符数加 1(为CTC空白符号),28】 +- Audio: + 1. Feature extraction: log power spectrum. + 2. Data augmentation: not used yet. -### 数据处理 +- Text: + - Text encoding uses labels for English alphabet conversion; users can replace this with a tokenization model. -- 音频: +## Usage Steps - 1.特征提取:采用log功率谱。 - - 2.数据增强:暂未使用。 - -- 文字: - -​ 文字编码使用labels进行英文字母转换,用户可使用分词模型进行替换。 - - -### 1. 数据集准备 -如为未下载数据集,可使用提供的脚本进行一键下载以及数据准备,如下所示: +### 1. Preparing the Dataset +If the dataset is not downloaded, you can use the provided script to download and prepare the data with one command, as shown below: ```shell -# Download and creat json +# Download and create json python mindaudio/data/librispeech.py --root_path "your_data_path" ``` -如已下载好压缩文件,请按如下命令操作: +If you have already downloaded the compressed files, operate with the following command: ```shell -# creat json -python mindaudio/data/librispeech.py --root_path "your_data_path" --data_ready True +# Create json +python mindaudio/data/librispeech.py --root_path "your_data_path" --data_ready True ``` -LibriSpeech存储flac音频格式的文件。要在MindAudio中使用它们,须将所有flac文件转换为wav文件,用户可以使用[ffmpeg](https://gist.github.com/seungwonpark/4f273739beef2691cd53b5c39629d830)或[sox](https://sourceforge.net/projects/sox/)进行转换。 +LibriSpeech stores files in flac audio format. To use them in MindAudio, all flac files need to be converted to wav files. Users can use [ffmpeg](https://gist.github.com/seungwonpark/4f273739beef2691cd53b5c39629d830) or [sox](https://sourceforge.net/projects/sox/) for the conversion. -处理后,数据集目录结构如下所示: +After processing, the dataset directory structure is as follows: ``` ├─ LibriSpeech_dataset @@ -68,42 +65,37 @@ LibriSpeech存储flac音频格式的文件。要在MindAudio中使用它们, │ │ └─ txt ``` -4个**.json文件存储了相应数据的绝对路径,在后续进行模型训练以及验证中,请将yaml配置文件中的xx_manifest改为对应libri_xx_manifest.json的存放地址。 +The four **.json files store the absolute paths of the corresponding data. For subsequent model training and validation, update the xx_manifest in the yaml configuration file to the location of the corresponding libri_xx_manifest.json file. -### 2. 训练 -#### 单卡 -由于数据集较大,不推荐使用此种训练方式 +### 2. Training +#### Single-Card Training +Due to the large dataset, this training method is not recommended. ```shell # Standalone training python train.py -c "./deepspeech2.yaml" ``` +Note: The default is to use Ascend machines. - -#### 多卡 - - +#### Multi-Card Training on Ascend +This example uses 8 NPUs. If you want to change the number of NPUs, you can modify the number of cards after -n in the command below. 
```shell -# Distribute_training +# Distributed training mpirun -n 8 python train.py -c "./deepspeech2.yaml" ``` -注意:如果脚本是由root用户执行的,必须在mpirun中添加——allow-run-as-root参数,如下所示: +Note: If the script is executed by the root user, you must add the --allow-run-as-root parameter in mpirun, as shown below: ```shell mpirun --allow-run-as-root -n 8 python train.py -c "./deepspeech2.yaml" ``` - -### 3.评估 - -将训好的权重地址更新在deepspeech2.yaml配置文件Pretrained_model中,执行以下命令 +### 3. Evaluating the Model +Update the path to the trained weights in the Pretrained_model section of the deepspeech2.yaml configuration file and execute the following command: ```shell # Validate a trained model python eval.py -c "./deepspeech2.yaml" ``` +## **Model Performance** - -## **性能表现** - -| model | LM | test clean cer| test clean wer | config | weights| -| ----------- | ---- | -------------- | -------------- |--------------------------------------------------------------------------------------------------| ------------------------------------------------------------ | -| deepspeech2 | No | 3.461 | 10.24 | [yaml](https://github.com/mindsporelab/mindaudio/blob/main/example/deepspeech2/deepspeech2.yaml) | [weights](https://download.mindspore.cn/toolkits/mindaudio/deepspeech2/deepspeech2.ckpt) | +| Model | Machine | LM | Test Clean CER | Test Clean WER | Parameters | Weights | +|--------------|-----------|------|----------------|----------------|----------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------| +| DeepSpeech2 | D910x8-G | No | 3.461 | 10.24 | [yaml](https://github.com/mindsporelab/mindaudio/blob/main/example/deepspeech2/deepspeech2.yaml) | [weights](https://download.mindspore.cn/toolkits/mindaudio/deepspeech2/deepspeech2.ckpt) | diff --git a/examples/deepspeech2/readme_cn.md b/examples/deepspeech2/readme_cn.md new file mode 100644 index 0000000..0b0b871 --- /dev/null +++ b/examples/deepspeech2/readme_cn.md @@ -0,0 +1,109 @@ +# 使用DeepSpeech2进行语音识别 +> [Deep Speech 2: End-to-End Speech Recognition in English and Mandarin](https://arxiv.org/abs/1512.02595) + + +## 介绍 + +DeepSpeech2是一种采用CTC损失训练的语音识别模型。它用神经网络取代了整个手工设计组件的管道,可以处理各种各样的语音,包括嘈杂的环境、口音和不同的语言。目前提供版本支持在NPU和GPU上使用[DeepSpeech2](http://arxiv.org/pdf/1512.02595v1.pdf)模型在librispeech数据集上进行训练/测试和推理。 + +### 模型结构 + +目前的复现的模型包括: + +- 两个卷积层: + - 通道数为 32,内核大小为 41, 11 ,步长为 2, 2 + - 通道数为 32,内核大小为 41, 11 ,步长为 2, 1 +- 五个双向 LSTM 层(大小为 1024) +- 一个投影层【大小为字符数加 1(为CTC空白符号),28】 + +### 数据处理 + +- 音频: + + 1.特征提取:采用log功率谱。 + + 2.数据增强:暂未使用。 + +- 文字: + +​ 文字编码使用labels进行英文字母转换,用户可使用分词模型进行替换。 + + +### 1. 
数据集准备 +如为未下载数据集,可使用提供的脚本进行一键下载以及数据准备,如下所示: + +```shell +# Download and creat json +python mindaudio/data/librispeech.py --root_path "your_data_path" +``` + +如已下载好压缩文件,请按如下命令操作: + +```shell +# creat json +python mindaudio/data/librispeech.py --root_path "your_data_path" --data_ready True +``` + +LibriSpeech存储flac音频格式的文件。要在MindAudio中使用它们,须将所有flac文件转换为wav文件,用户可以使用[ffmpeg](https://gist.github.com/seungwonpark/4f273739beef2691cd53b5c39629d830)或[sox](https://sourceforge.net/projects/sox/)进行转换。 + +处理后,数据集目录结构如下所示: + +``` + ├─ LibriSpeech_dataset + │ ├── train + │ │ ├─libri_test_clean_manifest.json + │ │ ├─ wav + │ │ └─ txt + │ ├── val + │ │ ├─libri_test_clean_manifest.json + │ │ ├─ wav + │ │ └─ txt + │ ├── test_clean + │ │ ├─libri_test_clean_manifest.json + │ │ ├─ wav + │ │ └─ txt + │ └── test_other + │ │ ├─libri_test_clean_manifest.json + │ │ ├─ wav + │ │ └─ txt +``` + +4个**.json文件存储了相应数据的绝对路径,在后续进行模型训练以及验证中,请将yaml配置文件中的xx_manifest改为对应libri_xx_manifest.json的存放地址。 + +### 2. 训练 +#### 单卡 +由于数据集较大,不推荐使用此种训练方式 +```shell +# Standalone training +python train.py -c "./deepspeech2.yaml" +``` + + +#### 多卡 + + +```shell +# Distribute_training +mpirun -n 8 python train.py -c "./deepspeech2.yaml" +``` +注意:如果脚本是由root用户执行的,必须在mpirun中添加——allow-run-as-root参数,如下所示: +```shell +mpirun --allow-run-as-root -n 8 python train.py -c "./deepspeech2.yaml" +``` + + +### 3.评估 + +将训好的权重地址更新在deepspeech2.yaml配置文件Pretrained_model中,执行以下命令 +```shell +# Validate a trained model +python eval.py -c "./deepspeech2.yaml" +``` + + + +## **性能表现** + +| model | LM | test clean cer| test clean wer | config | weights| +| ----------- | ---- | -------------- | -------------- |--------------------------------------------------------------------------------------------------| ------------------------------------------------------------ | +| deepspeech2 | No | 3.461 | 10.24 | [yaml](https://github.com/mindsporelab/mindaudio/blob/main/example/deepspeech2/deepspeech2.yaml) | [weights](https://download.mindspore.cn/toolkits/mindaudio/deepspeech2/deepspeech2.ckpt) | diff --git a/examples/wavegrad/README.md b/examples/wavegrad/readme.md similarity index 100% rename from examples/wavegrad/README.md rename to examples/wavegrad/readme.md diff --git a/mindaudio/data/voxceleb.py b/mindaudio/data/voxceleb.py index 3facac0..cad01a7 100644 --- a/mindaudio/data/voxceleb.py +++ b/mindaudio/data/voxceleb.py @@ -88,7 +88,7 @@ def prepare_voxceleb( ): """ Prepares the csv files for the Voxceleb1 or Voxceleb2 datasets. - Please follow the instructions in the README.md file for + Please follow the instructions in the readme.md file for preparing Voxceleb2. """