update readme of conformer
PingqiLi committed Jul 15, 2024
1 parent 4e06a00 commit dca3bd9
Showing 3 changed files with 281 additions and 11 deletions.
148 changes: 148 additions & 0 deletions examples/conformer/README.md
# Using Conformer for Speech Recognition

> [Conformer: Convolution-augmented Transformer for Speech Recognition](https://arxiv.org/abs/2005.08100)
## Introduction

Models based on transformers and convolutional neural networks (CNNs) have both achieved strong results in automatic speech recognition (ASR). Transformers capture content-based global interactions and long-range dependencies, while CNNs exploit local features effectively. Conformer is a convolution-augmented transformer that combines the two to model both local and global dependencies in audio sequences, and it outperforms both pure transformers and pure CNNs. The current version supports training/testing and inference with the Conformer model on the AISHELL-1 dataset on Ascend NPUs and GPUs.

### Model Structure

The overall structure of Conformer includes SpecAug, ConvolutionSubsampling, Linear, Dropout, and ConformerBlocks×N, as shown in the structure diagram below.

- ConformerBlock structure (repeated N times): Feed Forward Module, Multi-Head Self-Attention Module, Convolution Module, Feed Forward Module, LayerNorm. Each module is preceded by a LayerNorm and followed by Dropout, and a residual connection adds each module's input to its output.

- Macaron structure: the ConformerBlock resembles a macaron, with the Multi-Head Self-Attention Module and the Convolution Module sandwiched between two Feed Forward Modules that each contribute a half-step residual.

![image-20230310165349460](https://raw.githubusercontent.com/mindspore-lab/mindaudio/main/tests/result/conformer.png)
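The macaron composition can be written in a few lines. The sketch below is a minimal, framework-free illustration of the residual pattern described above; the `ffn1`, `ffn2`, `mhsa`, and `conv_module` callables are hypothetical placeholders, not the MindAudio implementation.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the feature dimension.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def conformer_block(x, ffn1, mhsa, conv_module, ffn2):
    """Macaron-style residual composition of one ConformerBlock.

    Each sub-module is a callable taking and returning an array of shape
    (time, dim); Dropout is omitted for brevity.
    """
    x = x + 0.5 * ffn1(layer_norm(x))    # half-step feed-forward
    x = x + mhsa(layer_norm(x))          # multi-head self-attention
    x = x + conv_module(layer_norm(x))   # convolution module
    x = x + 0.5 * ffn2(layer_norm(x))    # second half-step feed-forward
    return layer_norm(x)                 # final LayerNorm

# Toy check with identity sub-modules: the output keeps the (time, dim) shape.
identity = lambda t: t
x = np.random.randn(100, 256)
print(conformer_block(x, identity, identity, identity, identity).shape)  # (100, 256)
```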

### Data Processing

- Audio:

1. Feature extraction using fbank.
2. Data augmentation using online speed perturbation.

- Text:
Text is encoded character by character (one ID per Chinese character). Users can replace this with a segmentation (tokenization) model.

## Usage Steps

### 1. Dataset Preparation

Take the AISHELL dataset as an example. MindAudio provides a script to download the data and generate metadata files (containing the paths of the wav files and the corresponding Chinese transcripts). Executing this script produces three files: train.csv, dev.csv, and test.csv.

```shell
# data_path is the path where the data is stored
python mindaudio/data/aishell.py --data_path "/data" --download False
```

To download the data, set the --download parameter to True.

### 2. Data Preprocessing

#### Text Part

Based on the aishell_transcript_v0.8.txt file provided by AISHELL, generate a character-level vocabulary in which each character is mapped to an ID. The output is a file containing the encoding information: lang_char.txt.

```shell
cd mindaudio/utils
python text2token.py -s 1 -n 1 "data_path/data_aishell/transcript/aishell_transcript_v0.8.txt" | cut -f 2- -d" " | tr " " "\n" \
| sort | uniq | grep -a -v -e '^\s*$' | awk '{print $0 " " NR+1}' > data_path/lang_char.txt
```
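The pipeline above splits every transcript into single characters, deduplicates and sorts them, and numbers them starting from 2 (the `NR+1` in the awk command leaves the lowest IDs free for reserved symbols). A rough Python equivalent, shown only to clarify what lang_char.txt contains (the paths are placeholders):

```python
# Build lang_char.txt: one "<char> <id>" line per unique character in the transcripts.
chars = set()
with open("data_path/data_aishell/transcript/aishell_transcript_v0.8.txt", encoding="utf-8") as f:
    for line in f:
        fields = line.split()
        # The first field is the utterance ID; the rest is the transcript text.
        chars.update("".join(fields[1:]))

with open("data_path/lang_char.txt", "w", encoding="utf-8") as out:
    for idx, ch in enumerate(sorted(chars), start=2):  # IDs start at 2, matching awk's NR+1
        out.write(f"{ch} {idx}\n")
```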

#### Audio Part

This model uses global CMVN. To improve training efficiency, the statistics of the training features are computed before training, producing a file with the statistical information: global_cmvn.json.

```shell
cd examples/conformer
python compute_cmvn_stats.py --num_workers 16 --train_config conformer.yaml --in_scp data_path/train.csv --out_cmvn data_path/global_cmvn
```

Note: --num_workers can be adjusted according to the number of cores on the training device.
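Conceptually, the script accumulates per-dimension sums and squared sums over all training frames, from which the global mean and variance are recovered at training time. A minimal sketch of that idea follows; the feature source and the JSON field names are assumptions, not necessarily the exact format written by compute_cmvn_stats.py.

```python
import json
import numpy as np

def compute_global_cmvn(feature_batches, out_path="global_cmvn.json"):
    """Accumulate per-dimension statistics over an iterable of (frames, dim) arrays."""
    total, total_sq, n_frames = None, None, 0
    for feats in feature_batches:
        if total is None:
            total = np.zeros(feats.shape[1])
            total_sq = np.zeros(feats.shape[1])
        total += feats.sum(axis=0)
        total_sq += (feats ** 2).sum(axis=0)
        n_frames += feats.shape[0]
    stats = {"mean_stat": total.tolist(),
             "var_stat": total_sq.tolist(),
             "frame_num": n_frames}
    with open(out_path, "w") as f:
        json.dump(stats, f)
    return stats

# Toy usage with random "fbank" features of dimension 80.
stats = compute_global_cmvn([np.random.randn(100, 80) for _ in range(3)])
mean = np.array(stats["mean_stat"]) / stats["frame_num"]
std = np.sqrt(np.array(stats["var_stat"]) / stats["frame_num"] - mean ** 2)
print(mean.shape, std.shape)  # (80,) (80,)
```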

### 3. Training

#### Single-Card Training
```shell
cd examples/conformer
# Standalone training
python train.py --config_path ./conformer.yaml
```

Note: The Ascend device is used by default.

#### Multi-Card Training on Ascend

This example uses 8 Ascend NPUs.
```shell
# Distributed training
mpirun -n 8 python train.py --config_path ./conformer.yaml
```
Note:

1. When using multi-card training, ensure that `is_distributed` in the YAML file is set to True. This can be configured by editing the YAML file or by passing the parameter on the command line, e.g.:

```shell
# Distributed training
mpirun -n 8 python train.py --config_path ./conformer.yaml --is_distributed True
```

> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.

To train on GPU, modify the corresponding configuration in the YAML file.

2. Before starting training, you can set environment variables to limit the number of threads for faster execution, as shown below:

```shell
export OPENBLAS_NUM_THREADS=1
export MKL_NUM_THREADS=1
```



### 4. Model Evaluation

Four decoding methods are provided: CTC greedy search, CTC prefix beam search, attention decoder, and attention rescoring. The decoding method can be modified in the YAML configuration file.

Executing the script will generate a file containing the prediction results: result.txt.
```shell
# by default using ctc greedy search decoder
python predict.py --config_path ./conformer.yaml

# using ctc prefix beam search decoder
python predict.py --config_path ./conformer.yaml --decode_mode ctc_prefix_beam_search

# using attention decoder
python predict.py --config_path ./conformer.yaml --decode_mode attention

# using attention rescoring decoder
python predict.py --config_path ./conformer.yaml --decode_mode attention_rescoring
```
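For reference, CTC greedy search (the default decoder above) takes the most likely token at every frame, merges consecutive repeats, and removes the blank symbol. The following is a minimal sketch of that decoding rule, not the MindAudio implementation:

```python
import numpy as np

def ctc_greedy_search(log_probs, blank_id=0):
    """Decode a (time, vocab) matrix of per-frame log-probabilities.

    1. Pick the argmax token at each frame.
    2. Merge consecutive duplicates.
    3. Remove blanks.
    """
    best_path = log_probs.argmax(axis=-1)
    decoded, prev = [], None
    for tok in best_path:
        if tok != prev and tok != blank_id:
            decoded.append(int(tok))
        prev = tok
    return decoded

# Toy example: 6 frames over a 4-symbol vocabulary (0 is the blank).
log_probs = np.log(np.array([
    [0.6, 0.2, 0.1, 0.1],   # blank
    [0.1, 0.7, 0.1, 0.1],   # 1
    [0.1, 0.7, 0.1, 0.1],   # 1 (repeat, merged)
    [0.6, 0.2, 0.1, 0.1],   # blank
    [0.1, 0.1, 0.1, 0.7],   # 3
    [0.1, 0.1, 0.7, 0.1],   # 2
]))
print(ctc_greedy_search(log_probs))  # [1, 3, 2]
```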



## Model Performance
The training config can be found in [conformer.yaml](https://github.com/mindspore-lab/mindaudio/blob/main/examples/conformer/conformer.yaml).

Performance tested on Ascend 910 (8p) in graph mode:

| model | decoding mode | CER |
| --------- | ---------------------- |--------------|
| conformer | ctc greedy search | 5.35 |
| conformer | ctc prefix beam search | 5.36 |
| conformer | attention decoder      | coming soon  |
| conformer | attention rescoring | 4.95 |
- The trained [weights](https://download-mindspore.osinfra.cn/toolkits/mindaudio/conformer/conformer_avg_30-548ee31b.ckpt) can be downloaded here.

---
Performance tested on Ascend 910* (8p) in graph mode:

| model | decoding mode | CER |
| --------- | ---------------------- |--------------|
| conformer | ctc greedy search | 5.62 |
| conformer | ctc prefix beam search | 5.62 |
| conformer | attention decoder      | coming soon  |
| conformer | attention rescoring | 5.12 |
- The trained [weights](https://download-mindspore.osinfra.cn/toolkits/mindaudio/conformer/conformer_avg_30-692d57b3-910v2.ckpt) can be downloaded here.
44 changes: 33 additions & 11 deletions examples/conformer/README_CN.md
Executing the script will generate a file containing the prediction results: result.txt.

```shell
# by default using ctc greedy search decoder
python predict.py --config_path ./conformer.yaml

# using ctc prefix beam search decoder
python predict.py --config_path ./conformer.yaml --decode_mode ctc_prefix_beam_search

# using attention decoder
python predict.py --config_path ./conformer.yaml --decode_mode attention

# using attention rescoring decoder
python predict.py --config_path ./conformer.yaml --decode_mode attention_rescoring
```



## Model Performance
The training config can be found in [conformer.yaml](https://github.com/mindspore-lab/mindaudio/blob/main/examples/conformer/conformer.yaml).

Performance tested on Ascend 910 (8p) in graph mode:

| model     | decoding mode          | CER          |
| --------- | ---------------------- |--------------|
| conformer | ctc greedy search      | 5.35         |
| conformer | ctc prefix beam search | 5.36         |
| conformer | attention decoder      | coming soon  |
| conformer | attention rescoring    | coming soon  |
- The trained [weights](https://download-mindspore.osinfra.cn/toolkits/mindaudio/conformer/conformer_avg_30-548ee31b.ckpt) can be downloaded here.
---
Performance tested on Ascend 910* (8p) in graph mode:


| model     | decoding mode          | CER          |
| --------- | ---------------------- |--------------|
| conformer | ctc greedy search      | 5.62         |
| conformer | ctc prefix beam search | 5.62         |
| conformer | attention decoder      | coming soon  |
| conformer | attention rescoring    | 5.12         |
- The trained [weights](https://download-mindspore.osinfra.cn/toolkits/mindaudio/conformer/conformer_avg_30-692d57b3-910v2.ckpt) can be downloaded here.
100 changes: 100 additions & 0 deletions examples/deepspeech2/README.md
# Using DeepSpeech2 for Speech Recognition
> [Deep Speech 2: End-to-End Speech Recognition in English and Mandarin](http://arxiv.org/pdf/1512.02595v1.pdf)
## Introduction

DeepSpeech2 is a speech recognition model trained with CTC loss. It replaces the pipeline of hand-engineered components with neural networks and can handle a variety of speech, including noisy environments, accents, and different languages. The currently provided version supports using the [DeepSpeech2](http://arxiv.org/pdf/1512.02595v1.pdf) model for training/testing and inference on the LibriSpeech dataset on Ascend NPUs and GPUs.

### Model Architecture

The current reproduced model includes:

- Two convolutional layers:
  - channels: 32, kernel size: (41, 11), stride: (2, 2)
  - channels: 32, kernel size: (41, 11), stride: (2, 1)
- Five bidirectional LSTM layers (hidden size 1024)
- A projection layer whose size equals the number of characters plus 1 for the CTC blank symbol (28 in total)
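As a rough illustration of how these hyperparameters fit together, the sketch below only walks through the tensor shapes produced by the two convolutional layers on a spectrogram input; the input size and the padding values are assumptions made for the example, and this is not the MindAudio model definition.

```python
import math

def conv_out(size, kernel, stride, pad):
    """Standard convolution output-size formula."""
    return math.floor((size + 2 * pad - kernel) / stride) + 1

# Assumed input: 161 spectrogram frequency bins x 200 time frames (illustrative values).
freq, time = 161, 200

# Conv layer 1: 32 channels, kernel (41, 11), stride (2, 2); padding (20, 5) is an assumption.
freq = conv_out(freq, 41, 2, 20)   # -> 81
time = conv_out(time, 11, 2, 5)    # -> 100

# Conv layer 2: 32 channels, kernel (41, 11), stride (2, 1); padding (20, 5) is an assumption.
freq = conv_out(freq, 41, 2, 20)   # -> 41
time = conv_out(time, 11, 1, 5)    # -> 100

# The 32 x freq features per frame feed the 5 bidirectional LSTM layers (hidden size 1024),
# and the projection layer maps each frame to 28 outputs (characters plus the CTC blank).
rnn_input_dim = 32 * freq
print(freq, time, rnn_input_dim)   # 41 100 1312
```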

### Data Processing

- Audio:
1. Feature extraction: log power spectrum.
2. Data augmentation: not used yet.

- Text:
- Text encoding uses one label per character of the English alphabet; users can replace this with a tokenization model.
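For illustration, character labeling for English amounts to a fixed map from each symbol to an index, with one index reserved for the CTC blank so that the label count matches the 28-way projection layer above. The exact label set below (space plus 26 lowercase letters) is an assumption chosen for the sketch, not necessarily the repository's ordering.

```python
import string

# Index 0 is the CTC blank; the remaining 27 labels are an assumed character set
# (space + 26 lowercase letters) chosen to match the 28-way projection above.
labels = ["<blank>", " "] + list(string.ascii_lowercase)
char2id = {c: i for i, c in enumerate(labels)}

def encode(text):
    # Map each known character to its label index; unknown characters are dropped.
    return [char2id[c] for c in text.lower() if c in char2id]

print(len(labels))            # 28
print(encode("hello world"))  # [9, 6, 13, 13, 16, 1, 24, 16, 19, 13, 5]
```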

## Usage Steps

### 1. Preparing the Dataset
If the dataset is not downloaded, you can use the provided script to download and prepare the data with one command, as shown below:

```shell
# Download and create json
python mindaudio/data/librispeech.py --root_path "your_data_path"
```

If you have already downloaded the compressed files, run the following command instead:

```shell
# Create json
python mindaudio/data/librispeech.py --root_path "your_data_path" --data_ready True
```

LibriSpeech stores files in flac audio format. To use them in MindAudio, all flac files need to be converted to wav files. Users can use [ffmpeg](https://gist.github.com/seungwonpark/4f273739beef2691cd53b5c39629d830) or [sox](https://sourceforge.net/projects/sox/) for the conversion.
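If you prefer to run the conversion from Python instead, a minimal sketch using the `soundfile` package (an extra dependency, not something this example requires) could look like this:

```python
from pathlib import Path

import soundfile as sf  # requires the "soundfile" package with FLAC support

def flac_to_wav(root):
    """Convert every .flac file under `root` to a .wav file alongside it."""
    for flac in Path(root).rglob("*.flac"):
        data, sample_rate = sf.read(str(flac))
        sf.write(str(flac.with_suffix(".wav")), data, sample_rate)

flac_to_wav("your_data_path/LibriSpeech")
```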

After processing, the dataset directory structure is as follows:

```
├─ LibriSpeech_dataset
│  ├── train
│  │   ├─ libri_train_manifest.json
│  │   ├─ wav
│  │   └─ txt
│  ├── val
│  │   ├─ libri_val_manifest.json
│  │   ├─ wav
│  │   └─ txt
│  ├── test_clean
│  │   ├─ libri_test_clean_manifest.json
│  │   ├─ wav
│  │   └─ txt
│  └── test_other
│      ├─ libri_test_other_manifest.json
│      ├─ wav
│      └─ txt
```

The four *.json manifest files store the absolute paths of the corresponding data. For subsequent model training and validation, set each xx_manifest entry in the YAML configuration file to the location of the corresponding libri_xx_manifest.json file.

### 2. Training
#### Single-Card Training
Because the dataset is large, single-card training is not recommended.
```shell
# Standalone training
python train.py -c "./deepspeech2.yaml"
```
Note: Ascend devices are used by default.

#### Multi-Card Training on Ascend
This example uses 8 NPUs. To change the number of NPUs, modify the number after -n in the command below.
```shell
# Distributed training
mpirun -n 8 python train.py -c "./deepspeech2.yaml"
```
Note: If the script is executed by the root user, you must add the --allow-run-as-root parameter in mpirun, as shown below:
```shell
mpirun --allow-run-as-root -n 8 python train.py -c "./deepspeech2.yaml"
```

### 3. Evaluating the Model
Update the path to the trained weights in the Pretrained_model section of the deepspeech2.yaml configuration file and execute the following command:
```shell
# Validate a trained model
python eval.py -c "./deepspeech2.yaml"
```

## **Model Performance**

| Model       | Machine  | LM | Test Clean CER | Test Clean WER | Parameters | Weights |
|-------------|----------|----|----------------|----------------|------------|---------|
| DeepSpeech2 | D910x8-G | No | 3.461          | 10.24          | [yaml](https://github.com/mindspore-lab/mindaudio/blob/main/examples/deepspeech2/deepspeech2.yaml) | [weights](https://download.mindspore.cn/toolkits/mindaudio/deepspeech2/deepspeech2.ckpt) |
