update readme of conformer
PingqiLi committed Jul 15, 2024
1 parent 4e06a00 commit dca3bd9
Showing 3 changed files with 281 additions and 11 deletions.
148 changes: 148 additions & 0 deletions examples/conformer/README.md
# Using Conformer for Speech Recognition

> [Conformer: Convolution-augmented Transformer for Speech Recognition](https://arxiv.org/abs/2005.08100)
## Introduction

Models based on transformers and convolutional neural networks (CNNs) have both achieved strong results in automatic speech recognition (ASR). Transformers capture content-based global interactions and long-range dependencies, while CNNs exploit local features effectively. Conformer is a convolution-augmented transformer that combines the two to model both local and global dependencies in audio sequences, and it outperforms both pure transformers and pure CNNs. The current version supports training/testing and inference with the Conformer model on the AISHELL-1 dataset on Ascend NPUs and GPUs.

### Model Structure

The overall structure of Conformer includes SpecAug, ConvolutionSubsampling, Linear, Dropout, and ConformerBlocks×N, as shown in the structure diagram below.

- ConformerBlock structure (repeated N times): Feed Forward Module, Multi-Head Self-Attention Module, Convolution Module, Feed Forward Module, LayerNorm. Each module is preceded by a LayerNorm and followed by Dropout, and a residual connection adds each module's input to its output.

- Macaron structure: the ConformerBlock resembles a macaron, with the Multi-Head Self-Attention Module and the Convolution Module sandwiched between two Feed Forward Modules that each contribute a half-step residual.

![image-20230310165349460](https://raw.githubusercontent.com/mindspore-lab/mindaudio/main/tests/result/conformer.png)
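The macaron composition can be written in a few lines. The sketch below is a minimal, framework-free illustration of the residual pattern described above; the `ffn1`, `ffn2`, `mhsa`, and `conv_module` callables are hypothetical placeholders, not the MindAudio implementation.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the feature dimension.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def conformer_block(x, ffn1, mhsa, conv_module, ffn2):
    """Macaron-style residual composition of one ConformerBlock.

    Each sub-module is a callable taking and returning an array of shape
    (time, dim); Dropout is omitted for brevity.
    """
    x = x + 0.5 * ffn1(layer_norm(x))    # half-step feed-forward
    x = x + mhsa(layer_norm(x))          # multi-head self-attention
    x = x + conv_module(layer_norm(x))   # convolution module
    x = x + 0.5 * ffn2(layer_norm(x))    # second half-step feed-forward
    return layer_norm(x)                 # final LayerNorm

# Toy check with identity sub-modules: the output keeps the (time, dim) shape.
identity = lambda t: t
x = np.random.randn(100, 256)
print(conformer_block(x, identity, identity, identity, identity).shape)  # (100, 256)
```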

### Data Processing

- Audio:

1. Feature extraction using fbank.
2. Data augmentation using online speed perturbation.

- Text:
Text is encoded character by character (one ID per Chinese character). Users can replace this with a segmentation (tokenization) model.

## Usage Steps

### 1. Dataset Preparation

Take the AISHELL dataset as an example. MindAudio provides a script to download the data and generate metadata files (containing the paths of the wav files and the corresponding Chinese transcripts). Executing this script produces three files: train.csv, dev.csv, and test.csv.

```shell
# data_path is the path where the data is stored
python mindaudio/data/aishell.py --data_path "/data" --download False
```

To download the data, set the --download parameter to True.

### 2. Data Preprocessing

#### Text Part

Based on the aishell_transcript_v0.8.txt file provided by AISHELL, generate a character-level vocabulary in which each character is mapped to an ID. The output is a file containing the encoding information: lang_char.txt.

```shell
cd mindaudio/utils
python text2token.py -s 1 -n 1 "data_path/data_aishell/transcript/aishell_transcript_v0.8.txt" | cut -f 2- -d" " | tr " " "\n" \
| sort | uniq | grep -a -v -e '^\s*$' | awk '{print $0 " " NR+1}' > data_path/lang_char.txt
```
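The pipeline above splits every transcript into single characters, deduplicates and sorts them, and numbers them starting from 2 (the `NR+1` in the awk command leaves the lowest IDs free for reserved symbols). A rough Python equivalent, shown only to clarify what lang_char.txt contains (the paths are placeholders):

```python
# Build lang_char.txt: one "<char> <id>" line per unique character in the transcripts.
chars = set()
with open("data_path/data_aishell/transcript/aishell_transcript_v0.8.txt", encoding="utf-8") as f:
    for line in f:
        fields = line.split()
        # The first field is the utterance ID; the rest is the transcript text.
        chars.update("".join(fields[1:]))

with open("data_path/lang_char.txt", "w", encoding="utf-8") as out:
    for idx, ch in enumerate(sorted(chars), start=2):  # IDs start at 2, matching awk's NR+1
        out.write(f"{ch} {idx}\n")
```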

#### Audio Part

This model uses global CMVN. To improve training efficiency, the statistics of the training features are computed before training, producing a file with the statistical information: global_cmvn.json.

```shell
cd examples/conformer
python compute_cmvn_stats.py --num_workers 16 --train_config conformer.yaml --in_scp data_path/train.csv --out_cmvn data_path/global_cmvn
```

Note: --num_workers can be adjusted according to the number of cores on the training device.
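Conceptually, the script accumulates per-dimension sums and squared sums over all training frames, from which the global mean and variance are recovered at training time. A minimal sketch of that idea follows; the feature source and the JSON field names are assumptions, not necessarily the exact format written by compute_cmvn_stats.py.

```python
import json
import numpy as np

def compute_global_cmvn(feature_batches, out_path="global_cmvn.json"):
    """Accumulate per-dimension statistics over an iterable of (frames, dim) arrays."""
    total, total_sq, n_frames = None, None, 0
    for feats in feature_batches:
        if total is None:
            total = np.zeros(feats.shape[1])
            total_sq = np.zeros(feats.shape[1])
        total += feats.sum(axis=0)
        total_sq += (feats ** 2).sum(axis=0)
        n_frames += feats.shape[0]
    stats = {"mean_stat": total.tolist(),
             "var_stat": total_sq.tolist(),
             "frame_num": n_frames}
    with open(out_path, "w") as f:
        json.dump(stats, f)
    return stats

# Toy usage with random "fbank" features of dimension 80.
stats = compute_global_cmvn([np.random.randn(100, 80) for _ in range(3)])
mean = np.array(stats["mean_stat"]) / stats["frame_num"]
std = np.sqrt(np.array(stats["var_stat"]) / stats["frame_num"] - mean ** 2)
print(mean.shape, std.shape)  # (80,) (80,)
```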

### 3. Training

#### Single-Card Training
```shell
cd examples/conformer
# Standalone training
python train.py --config_path ./conformer.yaml
```

Note: The Ascend device is used by default.

#### Multi-Card Training on Ascend

This example uses 8 Ascend NPUs.
```shell
# Distributed training
mpirun -n 8 python train.py --config_path ./conformer.yaml
```
Note:

1. When using multi-card training, ensure that `is_distributed` in the YAML file is set to True. This can be configured by editing the YAML file or by passing the parameter on the command line, e.g.:

```shell
# Distributed training
mpirun -n 8 python train.py --config_path ./conformer.yaml --is_distributed True
```

> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.

To train on GPU, modify the corresponding configuration in the YAML file.

2. Before starting training, you can set environment variables to limit the number of threads for faster execution, as shown below:

```shell
export OPENBLAS_NUM_THREADS=1
export MKL_NUM_THREADS=1
```



### 4. Model Evaluation

Four decoding methods are provided: CTC greedy search, CTC prefix beam search, attention decoder, and attention rescoring. The decoding method can be modified in the YAML configuration file.

Executing the script will generate a file containing the prediction results: result.txt.
```shell
# by default using ctc greedy search decoder
python predict.py --config_path ./conformer.yaml

# using ctc prefix beam search decoder
python predict.py --config_path ./conformer.yaml --decode_mode ctc_prefix_beam_search

# using attention decoder
python predict.py --config_path ./conformer.yaml --decode_mode attention

# using attention rescoring decoder
python predict.py --config_path ./conformer.yaml --decode_mode attention_rescoring
```
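For reference, CTC greedy search (the default decoder above) takes the most likely token at every frame, merges consecutive repeats, and removes the blank symbol. The following is a minimal sketch of that decoding rule, not the MindAudio implementation:

```python
import numpy as np

def ctc_greedy_search(log_probs, blank_id=0):
    """Decode a (time, vocab) matrix of per-frame log-probabilities.

    1. Pick the argmax token at each frame.
    2. Merge consecutive duplicates.
    3. Remove blanks.
    """
    best_path = log_probs.argmax(axis=-1)
    decoded, prev = [], None
    for tok in best_path:
        if tok != prev and tok != blank_id:
            decoded.append(int(tok))
        prev = tok
    return decoded

# Toy example: 6 frames over a 4-symbol vocabulary (0 is the blank).
log_probs = np.log(np.array([
    [0.6, 0.2, 0.1, 0.1],   # blank
    [0.1, 0.7, 0.1, 0.1],   # 1
    [0.1, 0.7, 0.1, 0.1],   # 1 (repeat, merged)
    [0.6, 0.2, 0.1, 0.1],   # blank
    [0.1, 0.1, 0.1, 0.7],   # 3
    [0.1, 0.1, 0.7, 0.1],   # 2
]))
print(ctc_greedy_search(log_probs))  # [1, 3, 2]
```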



## Model Performance
The training config can be found in [conformer.yaml](https://github.com/mindspore-lab/mindaudio/blob/main/examples/conformer/conformer.yaml).

Performance tested on Ascend 910 (8p) in graph mode:

| model | decoding mode | CER |
| --------- | ---------------------- |--------------|
| conformer | ctc greedy search | 5.35 |
| conformer | ctc prefix beam search | 5.36 |
| conformer | attention decoder      | coming soon  |
| conformer | attention rescoring | 4.95 |
- The trained [weights](https://download-mindspore.osinfra.cn/toolkits/mindaudio/conformer/conformer_avg_30-548ee31b.ckpt) can be downloaded here.

---
Performance tested on Ascend 910* (8p) in graph mode:

| model | decoding mode | CER |
| --------- | ---------------------- |--------------|
| conformer | ctc greedy search | 5.62 |
| conformer | ctc prefix beam search | 5.62 |
| conformer | attention decoder      | coming soon  |
| conformer | attention rescoring | 5.12 |
- The trained [weights](https://download-mindspore.osinfra.cn/toolkits/mindaudio/conformer/conformer_avg_30-692d57b3-910v2.ckpt) can be downloaded here.
44 changes: 33 additions & 11 deletions examples/conformer/README_CN.md
Executing the script will generate a file containing the prediction results: result.txt.

```shell
# by default using ctc greedy search decoder
python predict.py --config_path ./conformer.yaml

# using ctc prefix beam search decoder
python predict.py --config_path ./conformer.yaml --decode_mode ctc_prefix_beam_search

# using attention decoder
python predict.py --config_path ./conformer.yaml --decode_mode attention

# using attention rescoring decoder
python predict.py --config_path ./conformer.yaml --decode_mode attention_rescoring
```



## Model Performance
The training config can be found in [conformer.yaml](https://github.com/mindspore-lab/mindaudio/blob/main/examples/conformer/conformer.yaml).

Performance tested on Ascend 910 (8p) in graph mode:

| model     | decoding mode          | CER          |
| --------- | ---------------------- |--------------|
| conformer | ctc greedy search      | 5.35         |
| conformer | ctc prefix beam search | 5.36         |
| conformer | attention decoder      | coming soon  |
| conformer | attention rescoring    | coming soon  |
- The trained [weights](https://download-mindspore.osinfra.cn/toolkits/mindaudio/conformer/conformer_avg_30-548ee31b.ckpt) can be downloaded here.
---
Performance tested on Ascend 910* (8p) in graph mode:


| model     | decoding mode          | CER          |
| --------- | ---------------------- |--------------|
| conformer | ctc greedy search      | 5.62         |
| conformer | ctc prefix beam search | 5.62         |
| conformer | attention decoder      | coming soon  |
| conformer | attention rescoring    | 5.12         |
- The trained [weights](https://download-mindspore.osinfra.cn/toolkits/mindaudio/conformer/conformer_avg_30-692d57b3-910v2.ckpt) can be downloaded here.
100 changes: 100 additions & 0 deletions examples/deepspeech2/README.md
# Using DeepSpeech2 for Speech Recognition
> [Deep Speech 2: End-to-End Speech Recognition in English and Mandarin](http://arxiv.org/pdf/1512.02595v1.pdf)
## Introduction

DeepSpeech2 is a speech recognition model trained with CTC loss. It replaces the pipeline of hand-engineered components with neural networks and can handle a variety of speech, including noisy environments, accents, and different languages. The currently provided version supports using the [DeepSpeech2](http://arxiv.org/pdf/1512.02595v1.pdf) model for training/testing and inference on the LibriSpeech dataset on Ascend NPUs and GPUs.

### Model Architecture

The current reproduced model includes:

- Two convolutional layers:
  - channels: 32, kernel size: (41, 11), stride: (2, 2)
  - channels: 32, kernel size: (41, 11), stride: (2, 1)
- Five bidirectional LSTM layers (hidden size 1024)
- A projection layer whose size equals the number of characters plus 1 for the CTC blank symbol (28 in total)
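As a rough illustration of how these hyperparameters fit together, the sketch below only walks through the tensor shapes produced by the two convolutional layers on a spectrogram input; the input size and the padding values are assumptions made for the example, and this is not the MindAudio model definition.

```python
import math

def conv_out(size, kernel, stride, pad):
    """Standard convolution output-size formula."""
    return math.floor((size + 2 * pad - kernel) / stride) + 1

# Assumed input: 161 spectrogram frequency bins x 200 time frames (illustrative values).
freq, time = 161, 200

# Conv layer 1: 32 channels, kernel (41, 11), stride (2, 2); padding (20, 5) is an assumption.
freq = conv_out(freq, 41, 2, 20)   # -> 81
time = conv_out(time, 11, 2, 5)    # -> 100

# Conv layer 2: 32 channels, kernel (41, 11), stride (2, 1); padding (20, 5) is an assumption.
freq = conv_out(freq, 41, 2, 20)   # -> 41
time = conv_out(time, 11, 1, 5)    # -> 100

# The 32 x freq features per frame feed the 5 bidirectional LSTM layers (hidden size 1024),
# and the projection layer maps each frame to 28 outputs (characters plus the CTC blank).
rnn_input_dim = 32 * freq
print(freq, time, rnn_input_dim)   # 41 100 1312
```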

### Data Processing

- Audio:
1. Feature extraction: log power spectrum.
2. Data augmentation: not used yet.

- Text:
- Text encoding uses one label per character of the English alphabet; users can replace this with a tokenization model.
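For illustration, character labeling for English amounts to a fixed map from each symbol to an index, with one index reserved for the CTC blank so that the label count matches the 28-way projection layer above. The exact label set below (space plus 26 lowercase letters) is an assumption chosen for the sketch, not necessarily the repository's ordering.

```python
import string

# Index 0 is the CTC blank; the remaining 27 labels are an assumed character set
# (space + 26 lowercase letters) chosen to match the 28-way projection above.
labels = ["<blank>", " "] + list(string.ascii_lowercase)
char2id = {c: i for i, c in enumerate(labels)}

def encode(text):
    # Map each known character to its label index; unknown characters are dropped.
    return [char2id[c] for c in text.lower() if c in char2id]

print(len(labels))            # 28
print(encode("hello world"))  # [9, 6, 13, 13, 16, 1, 24, 16, 19, 13, 5]
```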

## Usage Steps

### 1. Preparing the Dataset
If the dataset is not downloaded, you can use the provided script to download and prepare the data with one command, as shown below:

```shell
# Download and create json
python mindaudio/data/librispeech.py --root_path "your_data_path"
```

If you have already downloaded the compressed files, run the following command instead:

```shell
# Create json
python mindaudio/data/librispeech.py --root_path "your_data_path" --data_ready True
```

LibriSpeech stores files in flac audio format. To use them in MindAudio, all flac files need to be converted to wav files. Users can use [ffmpeg](https://gist.github.com/seungwonpark/4f273739beef2691cd53b5c39629d830) or [sox](https://sourceforge.net/projects/sox/) for the conversion.
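If you prefer to run the conversion from Python instead, a minimal sketch using the `soundfile` package (an extra dependency, not something this example requires) could look like this:

```python
from pathlib import Path

import soundfile as sf  # requires the "soundfile" package with FLAC support

def flac_to_wav(root):
    """Convert every .flac file under `root` to a .wav file alongside it."""
    for flac in Path(root).rglob("*.flac"):
        data, sample_rate = sf.read(str(flac))
        sf.write(str(flac.with_suffix(".wav")), data, sample_rate)

flac_to_wav("your_data_path/LibriSpeech")
```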

After processing, the dataset directory structure is as follows:

```
├─ LibriSpeech_dataset
│  ├── train
│  │   ├─ libri_train_manifest.json
│  │   ├─ wav
│  │   └─ txt
│  ├── val
│  │   ├─ libri_val_manifest.json
│  │   ├─ wav
│  │   └─ txt
│  ├── test_clean
│  │   ├─ libri_test_clean_manifest.json
│  │   ├─ wav
│  │   └─ txt
│  └── test_other
│      ├─ libri_test_other_manifest.json
│      ├─ wav
│      └─ txt
```

The four *.json manifest files store the absolute paths of the corresponding data. For subsequent model training and validation, set each xx_manifest entry in the YAML configuration file to the location of the corresponding libri_xx_manifest.json file.

### 2. Training
#### Single-Card Training
Because the dataset is large, single-card training is not recommended.
```shell
# Standalone training
python train.py -c "./deepspeech2.yaml"
```
Note: Ascend devices are used by default.

#### Multi-Card Training on Ascend
This example uses 8 NPUs. To change the number of NPUs, modify the number after -n in the command below.
```shell
# Distributed training
mpirun -n 8 python train.py -c "./deepspeech2.yaml"
```
Note: If the script is executed by the root user, you must add the --allow-run-as-root parameter in mpirun, as shown below:
```shell
mpirun --allow-run-as-root -n 8 python train.py -c "./deepspeech2.yaml"
```

### 3. Evaluating the Model
Update the path to the trained weights in the Pretrained_model section of the deepspeech2.yaml configuration file and execute the following command:
```shell
# Validate a trained model
python eval.py -c "./deepspeech2.yaml"
```

## **Model Performance**

| Model       | Machine  | LM | Test Clean CER | Test Clean WER | Parameters | Weights |
|-------------|----------|----|----------------|----------------|------------|---------|
| DeepSpeech2 | D910x8-G | No | 3.461          | 10.24          | [yaml](https://github.com/mindspore-lab/mindaudio/blob/main/examples/deepspeech2/deepspeech2.yaml) | [weights](https://download.mindspore.cn/toolkits/mindaudio/deepspeech2/deepspeech2.ckpt) |
