[Doc] Add data_prepare.md docs (#82)
* add prepare

* Update dataset_prepare.md

* Update dataset_prepare.md

* modify default data path

* Update dataset_prepare.md

* fix pre-commit

* move docs to user_guide

* move zh docs to user_guide

* add zh docs

* fix typo

* Update dataset_prepare.md
LZHgrla authored Aug 31, 2023
1 parent f73ad48 commit f1ae90d
Showing 38 changed files with 152 additions and 50 deletions.
6 changes: 3 additions & 3 deletions README.md
@@ -150,11 +150,11 @@ XTuner provides tools to chat with pretrained / fine-tuned LLMs.
 xtuner chat hf meta-llama/Llama-2-7b-hf --adapter xtuner/Llama-2-7b-qlora-moss-003-sft --bot-name Llama2 --prompt-template moss_sft --with-plugins calculate solve search --command-stop-word "<eoc>" --answer-stop-word "<eom>" --no-streamer
 ```

-For more examples, please see [chat.md](./docs/en/chat.md).
+For more examples, please see [chat.md](./docs/en/user_guides/chat.md).

 ### Fine-tune [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1QAEZVBfQ7LZURkMUtaq0b-5nEQII9G9Z?usp=sharing)

-XTuner supports the efficient fine-tune (*e.g.*, QLoRA) for LLMs.
+XTuner supports the efficient fine-tune (*e.g.*, QLoRA) for LLMs. Dataset prepare guides can be found on [dataset_prepare.md](./docs/en/user_guides/dataset_prepare.md).

 - **Step 0**, prepare the config. XTuner provides many ready-to-use configs and we can view all configs by

@@ -178,7 +178,7 @@ XTuner supports the efficient fine-tune (*e.g.*, QLoRA) for LLMs.
 (SLURM) srun ${SRUN_ARGS} xtuner train internlm_7b_qlora_oasst1_e3 --launcher slurm
 ```

-For more examples, please see [finetune.md](./docs/en/finetune.md).
+For more examples, please see [finetune.md](./docs/en/user_guides/finetune.md).

 ### Deployment

6 changes: 3 additions & 3 deletions README_zh-CN.md
@@ -150,11 +150,11 @@ XTuner provides tools for chatting with large language models.
 xtuner chat hf meta-llama/Llama-2-7b-hf --adapter xtuner/Llama-2-7b-qlora-moss-003-sft --bot-name Llama2 --prompt-template moss_sft --with-plugins calculate solve search --command-stop-word "<eoc>" --answer-stop-word "<eom>" --no-streamer
 ```

-For more examples, please see the [docs](./docs/zh_cn/chat.md).
+For more examples, please see the [docs](./docs/zh_cn/user_guides/chat.md).

 ### Fine-tune [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1QAEZVBfQ7LZURkMUtaq0b-5nEQII9G9Z?usp=sharing)

-XTuner supports fine-tuning large language models.
+XTuner supports fine-tuning large language models. For the dataset preprocessing guide, please see the [docs](./docs/zh_cn/user_guides/dataset_prepare.md).

 - **Step 0**, prepare the config file. XTuner provides many ready-to-use config files, which can be listed with the following command:

@@ -177,7 +177,7 @@ XTuner supports fine-tuning large language models.
 NPROC_PER_NODE=${GPU_NUM} xtuner train internlm_7b_qlora_oasst1_e3
 ```

-For more examples, please see the [docs](./docs/zh_cn/finetune.md).
+For more examples, please see the [docs](./docs/zh_cn/user_guides/finetune.md).

 ### Deployment

File renamed without changes.
File renamed without changes.
51 changes: 51 additions & 0 deletions docs/en/user_guides/dataset_prepare.md
@@ -0,0 +1,51 @@
# Dataset Preparation

## HuggingFace datasets

Datasets hosted on HuggingFace Hub, such as [alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca), can be used directly. For more details, please refer to [single_turn_conversation.md](./single_turn_conversation.md) and [multi_turn_conversation.md](./multi_turn_conversation.md).

## Others

### Arxiv Gentitle

The Arxiv dataset is not released on HuggingFace Hub, but you can download it from Kaggle.

**Step 0**, download the raw data from https://kaggle.com/datasets/Cornell-University/arxiv.

**Step 1**, process the data with `xtuner preprocess arxiv ${DOWNLOADED_DATA} ${SAVE_DATA_PATH} [optional arguments]`.

For example, to get all `cs.AI`, `cs.CL`, and `cs.CV` papers from `2020-01-01` onward:

```shell
xtuner preprocess arxiv ${DOWNLOADED_DATA} ${SAVE_DATA_PATH} --categories cs.AI cs.CL cs.CV --start-date 2020-01-01
```
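
The kind of filtering this command performs can be pictured with the sketch below. It is not XTuner's implementation: it assumes the Kaggle snapshot is a JSON-lines file whose records carry a space-separated `categories` string and an `update_date` in `YYYY-MM-DD` form, and `filter_arxiv` is a hypothetical helper name used only for illustration.

```python
import json
from datetime import datetime


def filter_arxiv(lines, categories, start_date):
    """Keep records in any of `categories` updated on or after `start_date`.

    Assumed record layout (not confirmed by this guide): one JSON object per
    line with a space-separated `categories` string and an `update_date`
    formatted as YYYY-MM-DD.
    """
    wanted = set(categories)
    cutoff = datetime.strptime(start_date, '%Y-%m-%d')
    kept = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        in_category = wanted & set(record['categories'].split())
        updated = datetime.strptime(record['update_date'], '%Y-%m-%d')
        if in_category and updated >= cutoff:
            kept.append(record)
    return kept


# Two made-up records, for illustration only.
sample = [
    '{"title": "A", "categories": "cs.CL cs.LG", "update_date": "2021-03-01"}',
    '{"title": "B", "categories": "math.CO", "update_date": "2022-01-01"}',
]
papers = filter_arxiv(sample, ['cs.AI', 'cs.CL', 'cs.CV'], '2020-01-01')
print([p['title'] for p in papers])
```

Only the first made-up record matches, because the second belongs to none of the requested categories.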

**Step 2**, all Arxiv Gentitle configs assume the dataset path to be `./data/arxiv_data.json`. You can move and rename your data, or change the path in these configs.

### MOSS-003-SFT

The MOSS-003-SFT dataset can be downloaded from https://huggingface.co/datasets/fnlp/moss-003-sft-data.

**Step 0**, download data.

```shell
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/datasets/fnlp/moss-003-sft-data
```

**Step 1**, unzip.

```shell
cd moss-003-sft-data
unzip moss-003-sft-no-tools.jsonl.zip
unzip moss-003-sft-with-tools-no-text2image.zip
```

**Step 2**, all moss-003-sft configs assume the dataset paths to be `./data/moss-003-sft-no-tools.jsonl` and `./data/conversations_with_tools_with_inner_instruction_no_text2image_train_all_random_meta0.5_0.1_0.01_moss_0709.jsonl`. You can move and rename your data, or change the paths in these configs.
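
As an alternative to `unzip` followed by a manual move, the archives could be extracted straight into `./data` with Python's standard `zipfile` module. This is a minimal sketch: `extract_into` is a hypothetical helper, and whether each archive unpacks to exactly the file names the configs expect is an assumption you should verify after extraction.

```python
import zipfile
from pathlib import Path


def extract_into(archive, target_dir='./data'):
    """Extract every member of `archive` into `target_dir`.

    Returns the member names so the caller can compare them with the
    paths the configs expect.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)  # create ./data if missing
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target)
        return zf.namelist()
```

For instance, calling `extract_into('moss-003-sft-data/moss-003-sft-no-tools.jsonl.zip')` from the directory where the repository was cloned would place the extracted members under `./data` and return their names.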

### Chinese Lawyer

The Chinese Lawyer dataset has two sub-datasets, which can be downloaded from https://github.com/LiuHC0428/LAW-GPT.

All lawyer configs assume the dataset paths to be `./data/CrimeKgAssitant清洗后_52k.json` and `./data/训练数据_带法律依据_92k.json`. You can move and rename your data, or change the paths in these configs.
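
Before launching training, it may help to confirm that the files sit where the configs look for them. The sketch below only restates the default paths quoted in this guide; `EXPECTED` and `missing_files` are hypothetical names, not part of XTuner.

```python
from pathlib import Path

# Default dataset paths quoted in this guide, grouped by config family.
EXPECTED = {
    'arxiv': ['./data/arxiv_data.json'],
    'moss-003-sft': [
        './data/moss-003-sft-no-tools.jsonl',
        './data/conversations_with_tools_with_inner_instruction_'
        'no_text2image_train_all_random_meta0.5_0.1_0.01_moss_0709.jsonl',
    ],
    'lawyer': [
        './data/CrimeKgAssitant清洗后_52k.json',
        './data/训练数据_带法律依据_92k.json',
    ],
}


def missing_files(paths, root='.'):
    """Return the entries of `paths` that are not regular files under `root`."""
    return [p for p in paths if not (Path(root) / p).is_file()]
```

Running `missing_files(EXPECTED['lawyer'])` from the working directory used for training returns an empty list when both lawyer files are in place.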
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
51 changes: 51 additions & 0 deletions docs/zh_cn/user_guides/dataset_prepare.md
@@ -0,0 +1,51 @@
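# Dataset Preparation

## HuggingFace Datasets

Datasets hosted on HuggingFace Hub, such as [alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca), can be used directly. For more usage guidance, see the [single-turn conversation doc](./single_turn_conversation.md) and the [multi-turn conversation doc](./multi_turn_conversation.md).

## Others

### Arxiv Gentitle (Title Generation)

The Arxiv dataset is not published on HuggingFace Hub, but it can be downloaded from Kaggle.

**Step 0**, download the raw data from https://kaggle.com/datasets/Cornell-University/arxiv.

**Step 1**, process the data with `xtuner preprocess arxiv ${DOWNLOADED_DATA} ${SAVE_DATA_PATH} [optional arguments]`.

For example, to extract all `cs.AI`, `cs.CL`, and `cs.CV` papers from `2020-01-01` onward: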

```shell
xtuner preprocess arxiv ${DOWNLOADED_DATA} ${SAVE_DATA_PATH} --categories cs.AI cs.CL cs.CV --start-date 2020-01-01
```

**Step 2**, all Arxiv Gentitle configs assume the dataset path to be `./data/arxiv_data.json`. You can move and rename your data, or reset the data path in the configs.

### MOSS-003-SFT

The MOSS-003-SFT dataset can be downloaded from https://huggingface.co/datasets/fnlp/moss-003-sft-data.

**Step 0**, download the data.

```shell
# Make sure git-lfs is installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/datasets/fnlp/moss-003-sft-data
```

**Step 1**, unzip.

```shell
cd moss-003-sft-data
unzip moss-003-sft-no-tools.jsonl.zip
unzip moss-003-sft-with-tools-no-text2image.zip
```

**Step 2**, all moss-003-sft configs assume the dataset paths to be `./data/moss-003-sft-no-tools.jsonl` and `./data/conversations_with_tools_with_inner_instruction_no_text2image_train_all_random_meta0.5_0.1_0.01_moss_0709.jsonl`. You can move and rename your data, or reset the data paths in the configs.

### Chinese Lawyer

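The Chinese Lawyer dataset has two sub-datasets, which can be downloaded from https://github.com/LiuHC0428/LAW-GPT.

All Chinese Lawyer configs assume the dataset paths to be `./data/CrimeKgAssitant清洗后_52k.json` and `./data/训练数据_带法律依据_92k.json`. You can move and rename your data, or reset the data paths in the configs.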
File renamed without changes.
@@ -25,8 +25,8 @@

 # Data
 # 1. Download data from https://kaggle.com/datasets/Cornell-University/arxiv
-# 2. Process data with `./tools/data_preprocess/arxiv.py`
-data_path = './data/arxiv_postprocess_csAIcsCLcsCV_20200101.json'
+# 2. Process data by `xtuner preprocess arxiv ${DOWNLOADED_DATA} ./data/arxiv_data.json [optional arguments]` # noqa: E501
+data_path = './data/arxiv_data.json'
 prompt_template = PROMPT_TEMPLATE.title
 max_length = 2048
 pack_to_max_length = True
@@ -27,8 +27,8 @@

 # Data
 # download data from https://github.com/LiuHC0428/LAW-GPT
-crime_kg_assitant_path = './data/law/CrimeKgAssitant清洗后_52k.json'
-law_reference_data_path = './data/law/训练数据_带法律依据_92k.json'
+crime_kg_assitant_path = './data/CrimeKgAssitant清洗后_52k.json'
+law_reference_data_path = './data/训练数据_带法律依据_92k.json'
 prompt_template = PROMPT_TEMPLATE.lawyer
 max_length = 2048
 pack_to_max_length = True
@@ -25,8 +25,8 @@

 # Data
 # 1. Download data from https://kaggle.com/datasets/Cornell-University/arxiv
-# 2. Process data with `./tools/data_preprocess/arxiv.py`
-data_path = './data/arxiv_postprocess_csAIcsCLcsCV_20200101.json'
+# 2. Process data by `xtuner preprocess arxiv ${DOWNLOADED_DATA} ./data/arxiv_data.json [optional arguments]` # noqa: E501
+data_path = './data/arxiv_data.json'
 prompt_template = PROMPT_TEMPLATE.baichuan_chat
 max_length = 2048
 pack_to_max_length = True
@@ -27,8 +27,8 @@

 # Data
 # download data from https://github.com/LiuHC0428/LAW-GPT
-crime_kg_assitant_path = './data/law/CrimeKgAssitant清洗后_52k.json'
-law_reference_data_path = './data/law/训练数据_带法律依据_92k.json'
+crime_kg_assitant_path = './data/CrimeKgAssitant清洗后_52k.json'
+law_reference_data_path = './data/训练数据_带法律依据_92k.json'
 prompt_template = PROMPT_TEMPLATE.baichuan_chat
 max_length = 2048
 pack_to_max_length = True
@@ -25,8 +25,8 @@

 # Data
 # 1. Download data from https://kaggle.com/datasets/Cornell-University/arxiv
-# 2. Process data with `./tools/data_preprocess/arxiv.py`
-data_path = './data/arxiv_postprocess_csAIcsCLcsCV_20200101.json'
+# 2. Process data by `xtuner preprocess arxiv ${DOWNLOADED_DATA} ./data/arxiv_data.json [optional arguments]` # noqa: E501
+data_path = './data/arxiv_data.json'
 prompt_template = PROMPT_TEMPLATE.title
 max_length = 2048
 pack_to_max_length = True
@@ -27,8 +27,8 @@

 # Data
 # download data from https://github.com/LiuHC0428/LAW-GPT
-crime_kg_assitant_path = './data/law/CrimeKgAssitant清洗后_52k.json'
-law_reference_data_path = './data/law/训练数据_带法律依据_92k.json'
+crime_kg_assitant_path = './data/CrimeKgAssitant清洗后_52k.json'
+law_reference_data_path = './data/训练数据_带法律依据_92k.json'
 prompt_template = PROMPT_TEMPLATE.lawyer
 max_length = 2048
 pack_to_max_length = True
@@ -25,8 +25,8 @@

 # Data
 # 1. Download data from https://kaggle.com/datasets/Cornell-University/arxiv
-# 2. Process data with `./tools/data_preprocess/arxiv.py`
-data_path = './data/arxiv_postprocess_csAIcsCLcsCV_20200101.json'
+# 2. Process data by `xtuner preprocess arxiv ${DOWNLOADED_DATA} ./data/arxiv_data.json [optional arguments]` # noqa: E501
+data_path = './data/arxiv_data.json'
 prompt_template = PROMPT_TEMPLATE.chatglm
 max_length = 2048
 pack_to_max_length = True
@@ -27,8 +27,8 @@

 # Data
 # download data from https://github.com/LiuHC0428/LAW-GPT
-crime_kg_assitant_path = './data/law/CrimeKgAssitant清洗后_52k.json'
-law_reference_data_path = './data/law/训练数据_带法律依据_92k.json'
+crime_kg_assitant_path = './data/CrimeKgAssitant清洗后_52k.json'
+law_reference_data_path = './data/训练数据_带法律依据_92k.json'
 prompt_template = PROMPT_TEMPLATE.chatglm
 max_length = 2048
 pack_to_max_length = True
@@ -25,8 +25,8 @@

 # Data
 # 1. Download data from https://kaggle.com/datasets/Cornell-University/arxiv
-# 2. Process data with `./tools/data_preprocess/arxiv.py`
-data_path = './data/arxiv_postprocess_csAIcsCLcsCV_20200101.json'
+# 2. Process data by `xtuner preprocess arxiv ${DOWNLOADED_DATA} ./data/arxiv_data.json [optional arguments]` # noqa: E501
+data_path = './data/arxiv_data.json'
 prompt_template = PROMPT_TEMPLATE.title
 max_length = 2048
 pack_to_max_length = True
@@ -27,8 +27,8 @@

 # Data
 # download data from https://github.com/LiuHC0428/LAW-GPT
-crime_kg_assitant_path = './data/law/CrimeKgAssitant清洗后_52k.json'
-law_reference_data_path = './data/law/训练数据_带法律依据_92k.json'
+crime_kg_assitant_path = './data/CrimeKgAssitant清洗后_52k.json'
+law_reference_data_path = './data/训练数据_带法律依据_92k.json'
 prompt_template = PROMPT_TEMPLATE.lawyer
 max_length = 2048
 pack_to_max_length = True
@@ -25,8 +25,8 @@

 # Data
 # 1. Download data from https://kaggle.com/datasets/Cornell-University/arxiv
-# 2. Process data with `./tools/data_preprocess/arxiv.py`
-data_path = './data/arxiv_postprocess_csAIcsCLcsCV_20200101.json'
+# 2. Process data by `xtuner preprocess arxiv ${DOWNLOADED_DATA} ./data/arxiv_data.json [optional arguments]` # noqa: E501
+data_path = './data/arxiv_data.json'
 prompt_template = PROMPT_TEMPLATE.internlm_chat
 max_length = 2048
 pack_to_max_length = True
@@ -27,8 +27,8 @@

 # Data
 # download data from https://github.com/LiuHC0428/LAW-GPT
-crime_kg_assitant_path = './data/law/CrimeKgAssitant清洗后_52k.json'
-law_reference_data_path = './data/law/训练数据_带法律依据_92k.json'
+crime_kg_assitant_path = './data/CrimeKgAssitant清洗后_52k.json'
+law_reference_data_path = './data/训练数据_带法律依据_92k.json'
 prompt_template = PROMPT_TEMPLATE.internlm_chat
 max_length = 2048
 pack_to_max_length = True
@@ -25,8 +25,8 @@

 # Data
 # 1. Download data from https://kaggle.com/datasets/Cornell-University/arxiv
-# 2. Process data with `./tools/data_preprocess/arxiv.py`
-data_path = './data/arxiv_postprocess_csAIcsCLcsCV_20200101.json'
+# 2. Process data by `xtuner preprocess arxiv ${DOWNLOADED_DATA} ./data/arxiv_data.json [optional arguments]` # noqa: E501
+data_path = './data/arxiv_data.json'
 prompt_template = PROMPT_TEMPLATE.title
 max_length = 2048
 pack_to_max_length = True
4 changes: 2 additions & 2 deletions xtuner/configs/llama/llama2_7b/llama2_7b_qlora_lawyer_e3.py
@@ -27,8 +27,8 @@

 # Data
 # download data from https://github.com/LiuHC0428/LAW-GPT
-crime_kg_assitant_path = './data/law/CrimeKgAssitant清洗后_52k.json'
-law_reference_data_path = './data/law/训练数据_带法律依据_92k.json'
+crime_kg_assitant_path = './data/CrimeKgAssitant清洗后_52k.json'
+law_reference_data_path = './data/训练数据_带法律依据_92k.json'
 prompt_template = PROMPT_TEMPLATE.lawyer
 max_length = 2048
 pack_to_max_length = True
@@ -25,8 +25,8 @@

 # Data
 # 1. Download data from https://kaggle.com/datasets/Cornell-University/arxiv
-# 2. Process data with `./tools/data_preprocess/arxiv.py`
-data_path = './data/arxiv_postprocess_csAIcsCLcsCV_20200101.json'
+# 2. Process data by `xtuner preprocess arxiv ${DOWNLOADED_DATA} ./data/arxiv_data.json [optional arguments]` # noqa: E501
+data_path = './data/arxiv_data.json'
 prompt_template = PROMPT_TEMPLATE.llama_2_chat
 max_length = 2048
 pack_to_max_length = True
@@ -27,8 +27,8 @@

 # Data
 # download data from https://github.com/LiuHC0428/LAW-GPT
-crime_kg_assitant_path = './data/law/CrimeKgAssitant清洗后_52k.json'
-law_reference_data_path = './data/law/训练数据_带法律依据_92k.json'
+crime_kg_assitant_path = './data/CrimeKgAssitant清洗后_52k.json'
+law_reference_data_path = './data/训练数据_带法律依据_92k.json'
 prompt_template = PROMPT_TEMPLATE.llama_2_chat
 max_length = 2048
 pack_to_max_length = True
@@ -25,8 +25,8 @@

 # Data
 # 1. Download data from https://kaggle.com/datasets/Cornell-University/arxiv
-# 2. Process data with `./tools/data_preprocess/arxiv.py`
-data_path = './data/arxiv_postprocess_csAIcsCLcsCV_20200101.json'
+# 2. Process data by `xtuner preprocess arxiv ${DOWNLOADED_DATA} ./data/arxiv_data.json [optional arguments]` # noqa: E501
+data_path = './data/arxiv_data.json'
 prompt_template = PROMPT_TEMPLATE.title
 max_length = 2048
 pack_to_max_length = True
4 changes: 2 additions & 2 deletions xtuner/configs/llama/llama_7b/llama_7b_qlora_lawyer_e3.py
@@ -27,8 +27,8 @@

 # Data
 # download data from https://github.com/LiuHC0428/LAW-GPT
-crime_kg_assitant_path = './data/law/CrimeKgAssitant清洗后_52k.json'
-law_reference_data_path = './data/law/训练数据_带法律依据_92k.json'
+crime_kg_assitant_path = './data/CrimeKgAssitant清洗后_52k.json'
+law_reference_data_path = './data/训练数据_带法律依据_92k.json'
 prompt_template = PROMPT_TEMPLATE.lawyer
 max_length = 2048
 pack_to_max_length = True
@@ -25,8 +25,8 @@

 # Data
 # 1. Download data from https://kaggle.com/datasets/Cornell-University/arxiv
-# 2. Process data with `./tools/data_preprocess/arxiv.py`
-data_path = './data/arxiv_postprocess_csAIcsCLcsCV_20200101.json'
+# 2. Process data by `xtuner preprocess arxiv ${DOWNLOADED_DATA} ./data/arxiv_data.json [optional arguments]` # noqa: E501
+data_path = './data/arxiv_data.json'
 prompt_template = PROMPT_TEMPLATE.title
 max_length = 2048
 pack_to_max_length = True
4 changes: 2 additions & 2 deletions xtuner/configs/qwen/qwen_7b/qwen_7b_qlora_lawyer_e3.py
@@ -27,8 +27,8 @@

 # Data
 # download data from https://github.com/LiuHC0428/LAW-GPT
-crime_kg_assitant_path = './data/law/CrimeKgAssitant清洗后_52k.json'
-law_reference_data_path = './data/law/训练数据_带法律依据_92k.json'
+crime_kg_assitant_path = './data/CrimeKgAssitant清洗后_52k.json'
+law_reference_data_path = './data/训练数据_带法律依据_92k.json'
 prompt_template = PROMPT_TEMPLATE.lawyer
 max_length = 2048
 pack_to_max_length = True
@@ -25,8 +25,8 @@

 # Data
 # 1. Download data from https://kaggle.com/datasets/Cornell-University/arxiv
-# 2. Process data with `./tools/data_preprocess/arxiv.py`
-data_path = './data/arxiv_postprocess_csAIcsCLcsCV_20200101.json'
+# 2. Process data by `xtuner preprocess arxiv ${DOWNLOADED_DATA} ./data/arxiv_data.json [optional arguments]` # noqa: E501
+data_path = './data/arxiv_data.json'
 prompt_template = PROMPT_TEMPLATE.qwen_chat
 max_length = 2048
 pack_to_max_length = True
@@ -27,8 +27,8 @@

 # Data
 # download data from https://github.com/LiuHC0428/LAW-GPT
-crime_kg_assitant_path = './data/law/CrimeKgAssitant清洗后_52k.json'
-law_reference_data_path = './data/law/训练数据_带法律依据_92k.json'
+crime_kg_assitant_path = './data/CrimeKgAssitant清洗后_52k.json'
+law_reference_data_path = './data/训练数据_带法律依据_92k.json'
 prompt_template = PROMPT_TEMPLATE.qwen_chat
 max_length = 2048
 pack_to_max_length = True
