[Feature] Support log processed dataset & Fix doc (#101)
* fix dataset docs

* support log processed dataset

* fix bugs when importing python functions into config files

* Verify the correctness of the config file for the custom dataset

* rename

* fix entry point

* fix docs

* fix bugs when importing python functions into config files

* update toy datasets
HIT-cwh authored Sep 8, 2023
1 parent abd9de1 commit 2ced34e
Showing 17 changed files with 535 additions and 21 deletions.
1 change: 0 additions & 1 deletion .gitignore
Original file line number Diff line number Diff line change
@@ -104,7 +104,6 @@ venv.bak/
.mypy_cache/

# custom
/data
.vscode
.idea
.DS_Store
16 changes: 16 additions & 0 deletions data/toy_custom_incremental_data.json
@@ -0,0 +1,16 @@
[{
"conversation":[
{
"input": "",
"output": "I am an artificial intelligence (AI) assistant named Puyu. I was created by the Shanghai AI Laboratory and my purpose is to assist users with various tasks through natural language processing technology."
}
]
},
{
"conversation":[
{
"input": "",
"output": "I am an artificial intelligence programmed to assist with various types of tasks, including answering questions, providing information, and performing automated processes."
}
]
}]
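Before wiring a file like this into a config, it can help to sanity-check the schema. The snippet below is a minimal, hypothetical validator (not part of XTuner) for the incremental format shown above, using only the standard library:

```python
# Hypothetical sanity check (not part of XTuner) for the incremental
# pre-training format shown above: a JSON list of records, each holding a
# "conversation" list whose turns carry "input" and "output" strings.
import json

def validate_incremental(records):
    """Return True if `records` matches the toy incremental schema."""
    if not isinstance(records, list):
        return False
    for record in records:
        if not isinstance(record, dict):
            return False
        turns = record.get("conversation")
        if not isinstance(turns, list):
            return False
        for turn in turns:
            if not isinstance(turn, dict):
                return False
            if not isinstance(turn.get("input"), str):
                return False
            if not isinstance(turn.get("output"), str):
                return False
    return True

toy = json.loads(
    '[{"conversation": [{"input": "", "output": "I am an AI assistant."}]}]'
)
print(validate_incremental(toy))  # True
```

The `xtuner check-custom-dataset` entry point added in this commit performs the official check against a full config; the snippet above is only a quick local schema test.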
32 changes: 32 additions & 0 deletions data/toy_custom_multi_turn_data.json
@@ -0,0 +1,32 @@
[{
"conversation":[
{
"input": "Hello?",
"output": "Hello! How can I help you?"
},
{
"input": "What's the date today?",
"output": "Today is Monday, August 14, 2023."
},
{
"input": "Thank you!",
"output": "You are welcome."
}
]
},
{
"conversation":[
{
"input": "Hello?",
"output": "Hello! How can I help you?"
},
{
"input": "How's the weather today in Rosso?",
"output": "The weather in Rosso on Wednesday, August 16th, is going to be cloudy for most of the day, together with moderate rain around noon."
},
{
"input": "Thank you!",
"output": "You are welcome."
}
]
}]
18 changes: 18 additions & 0 deletions data/toy_custom_single_turn_data.json
@@ -0,0 +1,18 @@
[{
"conversation":
[
{
"input": "Give three tips for staying healthy.",
"output": "1.Eat a balanced diet. 2. Exercise regularly. 3. Get enough sleep."
}
]
},
{
"conversation":
[
{
"input": "How to study English?",
"output": "1. Set clear goals. 2. Create a study plan. 3. Build vocabulary. 4. Practice speaking."
}
]
}]
62 changes: 58 additions & 4 deletions docs/en/user_guides/incremental_pretraining.md
@@ -95,7 +95,9 @@ The following modifications need to be made to the config file copied in Step 3:
from xtuner.dataset import process_hf_dataset
from datasets import load_dataset
- from xtuner.dataset.map_fns import oasst1_map_fn, template_map_fn_factory
+ from map_fn import oasst1_incremental_map_fn
+ from mmengine.config import read_base
+ with read_base():
+ from .map_fn import oasst1_incremental_map_fn
...
#######################################################################
# PART 1 Settings #
@@ -127,8 +129,33 @@ train_dataloader = dict(
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn))
...
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
- instruction=prompt_template.INSTRUCTION_START)
+ )
]
...
```
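The `with read_base(): from .map_fn import oasst1_incremental_map_fn` pattern above assumes a `map_fn.py` placed next to the config file. As an illustration only — the `text` field name and the returned structure are inferred from the oasst1 format and the toy incremental data above, not copied from XTuner's implementation — such a file could look like:

```python
# map_fn.py -- hypothetical sketch of the map function imported above.
# For incremental pre-training each sample is treated as raw text to
# continue, so the whole oasst1 `text` field becomes the output while the
# input stays empty, mirroring the toy incremental data format.
def oasst1_incremental_map_fn(example):
    return {"conversation": [{"input": "", "output": example["text"]}]}

sample = {"text": "### Human: Hi### Assistant: Hello!"}
print(oasst1_incremental_map_fn(sample))
```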

#### Step 5, Log Processed Dataset (Optional)

After modifying the config file, you can print the first sample of the processed dataset to verify that it has been constructed correctly.

```bash
xtuner log-dataset $CONFIG
```

`$CONFIG` represents the file path of the modified configuration file in Step 4.

### Using Custom Datasets

When using custom datasets for incremental pre-training, we recommend constructing the dataset according to the [incremental pre-training data format](./dataset_format.md#incremental-pre-training-dataset-format) defined by XTuner. If the custom dataset is in other formats such as oasst1, refer to the section on [Using Dataset in HuggingFace Hub](#using-dataset-in-huggingface-hub).
@@ -191,18 +218,20 @@ from datasets import load_dataset
#######################################################################
- data_path = 'timdettmers/openassistant-guanaco'
- prompt_template = PROMPT_TEMPLATE.openassistant
+ data_path = 'path/to/your/data'
+ data_path = 'path/to/your/json/data'
...
#######################################################################
# STEP 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
- dataset=dict(type=load_dataset, path=data_path),
+ dataset=dict(
+ type=load_dataset, path='json', data_files=dict(train=data_path)),
tokenizer=tokenizer,
max_length=max_length,
- dataset_map_fn=oasst1_map_fn,
+ dataset_map_fn=oasst1_incremental_map_fn,
+ dataset_map_fn=None,
- template_map_fn=dict(
- type=template_map_fn_factory, template=prompt_template),
+ template_map_fn=None,
@@ -217,4 +246,29 @@ train_dataloader = dict(
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn))
...
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
- instruction=prompt_template.INSTRUCTION_START)
+ )
]
...
```

#### Step 5, Check Custom Dataset (Optional)

After modifying the config file, you can run the `xtuner/tools/check_custom_dataset.py` script to verify that the dataset is constructed correctly.

```bash
xtuner check-custom-dataset $CONFIG
```

`$CONFIG` represents the file path of the modified configuration file in Step 4.
30 changes: 27 additions & 3 deletions docs/en/user_guides/multi_turn_conversation.md
@@ -155,7 +155,9 @@ from xtuner.dataset import process_hf_dataset
from datasets import load_dataset
- from xtuner.dataset.map_fns import oasst1_map_fn, template_map_fn_factory
+ from xtuner.dataset.map_fns import template_map_fn_factory
+ from .map_fn import oasst1_multi_turns_map_fn
+ from mmengine.config import read_base
+ with read_base():
+ from .map_fn import oasst1_multi_turns_map_fn
...
#######################################################################
# PART 1 Settings #
@@ -189,6 +191,16 @@ train_dataloader = dict(
...
```
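For reference, a hedged sketch of what a multi-turn map function over the openassistant-guanaco `text` field might do. Splitting on the `### Human:` / `### Assistant:` delimiters is an assumption about the data layout, and XTuner's actual `oasst1_multi_turns_map_fn` may differ in detail:

```python
# Hypothetical sketch of a multi-turn map function over the
# openassistant-guanaco `text` field. Splitting on the "### Human:" /
# "### Assistant:" delimiters is an assumption about the data layout;
# XTuner's actual oasst1_multi_turns_map_fn may differ.
def oasst1_multi_turns_map_fn(example):
    turns = []
    # Each "### Human:" block is expected to contain one assistant reply.
    for block in example["text"].split("### Human:")[1:]:
        human, _, assistant = block.partition("### Assistant:")
        turns.append({"input": human.strip(), "output": assistant.strip()})
    return {"conversation": turns}

sample = {"text": "### Human: Hello?### Assistant: Hi! How can I help?"}
print(oasst1_multi_turns_map_fn(sample))
```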

#### Step 6, Log Processed Dataset (Optional)

After modifying the config file, you can print the first sample of the processed dataset to verify that it has been constructed correctly.

```bash
xtuner log-dataset $CONFIG
```

`$CONFIG` represents the file path of the modified configuration file in Step 5.

## Using Custom Datasets

When using a custom multi-turn dialogue dataset for instruction fine-tuning, we recommend constructing the dataset in the [multi-turn dialogue data format](./dataset_format.md#multi-turn-dialogue-dataset-format) defined by XTuner. If the custom dataset is in another format such as oasst1, refer to the section on [Using Datasets in HuggingFace Hub](#using-dataset-in-huggingface-hub).
@@ -260,7 +272,7 @@ from datasets import load_dataset
# PART 1 Settings #
#######################################################################
- data_path = 'timdettmers/openassistant-guanaco'
+ data_path = 'path/to/your/data'
+ data_path = 'path/to/your/json/data'

+ prompt_template = PROMPT_TEMPLATE.openassistant
...
@@ -269,7 +281,9 @@ from datasets import load_dataset
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
- dataset=dict(type=load_dataset, path=data_path),
+ dataset=dict(
+ type=load_dataset, path='json', data_files=dict(train=data_path)),
tokenizer=tokenizer,
max_length=max_length,
+ dataset_map_fn=None,
@@ -287,3 +301,13 @@ train_dataloader = dict(
collate_fn=dict(type=default_collate_fn))
...
```

#### Step 6, Check Processed Dataset (Optional)

After modifying the config file, you can run the `xtuner/tools/check_custom_dataset.py` script to verify that the dataset is constructed correctly.

```bash
xtuner check-custom-dataset $CONFIG
```

`$CONFIG` represents the file path of the modified configuration file in Step 5.
30 changes: 27 additions & 3 deletions docs/en/user_guides/single_turn_conversation.md
@@ -129,7 +129,9 @@ from xtuner.dataset import process_hf_dataset
from datasets import load_dataset
- from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
+ from xtuner.dataset.map_fns import template_map_fn_factory
+ from .map_fn import alpaca_map_fn
+ from mmengine.config import read_base
+ with read_base():
+ from .map_fn import alpaca_map_fn
...
#######################################################################
# PART 1 Settings #
@@ -164,6 +166,16 @@ train_dataloader = dict(
...
```
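For reference, a hedged sketch of mapping the alpaca schema (`instruction` / `input` / `output`) into the single-turn `conversation` structure. The exact concatenation rule in XTuner's `alpaca_map_fn` may differ; the fold-the-input-into-the-prompt behavior below is an assumption:

```python
# Hypothetical sketch of mapping the alpaca schema (instruction / input /
# output) into the single-turn `conversation` structure shown earlier.
# The exact concatenation rule in XTuner's alpaca_map_fn may differ.
def alpaca_map_fn(example):
    prompt = example["instruction"]
    # Fold the optional `input` field into the prompt when it is non-empty.
    if example.get("input"):
        prompt = f"{prompt}\n{example['input']}"
    return {"conversation": [{"input": prompt, "output": example["output"]}]}

sample = {
    "instruction": "Give three tips for staying healthy.",
    "input": "",
    "output": "1. Eat a balanced diet. 2. Exercise regularly. 3. Get enough sleep.",
}
print(alpaca_map_fn(sample))
```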

#### Step 6, Log Processed Dataset (Optional)

After modifying the config file, you can print the first sample of the processed dataset to verify that it has been constructed correctly.

```bash
xtuner log-dataset $CONFIG
```

`$CONFIG` represents the file path of the modified configuration file in Step 5.

## Using Custom Datasets

When using a custom single-turn dialogue dataset for instruction fine-tuning, we recommend constructing the dataset in the [single-turn dialogue data format](./dataset_format.md#single-turn-dialogue-dataset-format) defined by XTuner. If the custom dataset is in another format such as oasst1, refer to the section on [Using Datasets in HuggingFace Hub](#using-dataset-in-huggingface-hub).
@@ -228,15 +240,17 @@ from datasets import load_dataset
#######################################################################
- alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
- alpaca_en_path = 'tatsu-lab/alpaca'
+ data_path = 'path/to/your/data'
+ data_path = 'path/to/your/json/data'

+ prompt_template = PROMPT_TEMPLATE.alpaca
#######################################################################
# STEP 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
- dataset=dict(type=load_dataset, path=data_path),
+ dataset=dict(
+ type=load_dataset, path='json', data_files=dict(train=data_path)),
tokenizer=tokenizer,
max_length=max_length,
+ dataset_map_fn=None,
@@ -254,3 +268,13 @@ train_dataloader = dict(
collate_fn=dict(type=default_collate_fn))
...
```

#### Step 6, Check Processed Dataset (Optional)

After modifying the config file, you can run the `xtuner/tools/check_custom_dataset.py` script to verify that the dataset is constructed correctly.

```bash
xtuner check-custom-dataset $CONFIG
```

`$CONFIG` represents the file path of the modified configuration file in Step 5.
56 changes: 54 additions & 2 deletions docs/zh_cn/user_guides/incremental_pretraining.md
@@ -95,7 +95,9 @@ xtuner copy-cfg internlm_7b_qlora_oasst1_e3 .
from xtuner.dataset import process_hf_dataset
from datasets import load_dataset
- from xtuner.dataset.map_fns import oasst1_map_fn, template_map_fn_factory
+ from map_fn import oasst1_incremental_map_fn
+ from mmengine.config import read_base
+ with read_base():
+ from .map_fn import oasst1_incremental_map_fn
...
#######################################################################
# PART 1 Settings #
@@ -127,8 +129,33 @@ train_dataloader = dict(
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn))
...
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
- instruction=prompt_template.INSTRUCTION_START)
+ )
]
...
```

#### Step 5, Log Processed Dataset (Optional)

After modifying the config file, you can print the first sample of the processed dataset to verify that it has been constructed correctly.

```bash
xtuner log-dataset $CONFIG
```

`$CONFIG` is the file path of the config modified in Step 4.

### Using Custom Datasets

When using a custom dataset for incremental pre-training, we recommend constructing the dataset according to the [incremental pre-training data format](./dataset_format.md#增量预训练数据集格式) defined by XTuner. If the custom dataset is in another format such as `oasst1`, refer to the section on [Using Datasets in HuggingFace Hub](#使用huggingface-hub数据集).
@@ -204,7 +231,7 @@ train_dataset = dict(
tokenizer=tokenizer,
max_length=max_length,
- dataset_map_fn=oasst1_map_fn,
+ dataset_map_fn=oasst1_incremental_map_fn,
+ dataset_map_fn=None,
- template_map_fn=dict(
- type=template_map_fn_factory, template=prompt_template),
+ template_map_fn=None,
@@ -219,4 +246,29 @@ train_dataloader = dict(
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn))
...
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
- instruction=prompt_template.INSTRUCTION_START)
+ )
]
...
```

#### Step 5, Check Custom Dataset (Optional)

After modifying the config file, you can run the `xtuner/tools/check_custom_dataset.py` script to verify that the dataset is constructed correctly.

```bash
xtuner check-custom-dataset $CONFIG
```

`$CONFIG` is the file path of the config modified in Step 4.