[Feature] Support log processed dataset & Fix doc (#101)
* fix dataset docs

* support log processed dataset

* fix bugs when importing python functions into config files

* Verify the correctness of the config file for the custom dataset

* rename

* fix entry point

* fix docs

* fix bugs when importing python functions into config files

* update toy datasets
HIT-cwh authored Sep 8, 2023
1 parent abd9de1 commit 2ced34e
Showing 17 changed files with 535 additions and 21 deletions.
1 change: 0 additions & 1 deletion .gitignore
Original file line number Diff line number Diff line change
@@ -104,7 +104,6 @@ venv.bak/
.mypy_cache/

# custom
/data
.vscode
.idea
.DS_Store
16 changes: 16 additions & 0 deletions data/toy_custom_incremental_data.json
@@ -0,0 +1,16 @@
[{
"conversation":[
{
"input": "",
"output": "I am an artificial intelligence (AI) assistant named Puyu. I was created by the Shanghai AI Laboratory and my purpose is to assist users with various tasks through natural language processing technology."
}
]
},
{
"conversation":[
{
"input": "",
"output": "I am an artificial intelligence programmed to assist with various types of tasks, including answering questions, providing information, and performing automated processes."
}
]
}]
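Before wiring a file like this into a config, it can help to sanity-check the schema. The snippet below is a minimal, hypothetical validator (not part of XTuner) for the incremental format shown above, using only the standard library:

```python
# Hypothetical sanity check (not part of XTuner) for the incremental
# pre-training format shown above: a JSON list of records, each holding a
# "conversation" list whose turns carry "input" and "output" strings.
import json

def validate_incremental(records):
    """Return True if `records` matches the toy incremental schema."""
    if not isinstance(records, list):
        return False
    for record in records:
        if not isinstance(record, dict):
            return False
        turns = record.get("conversation")
        if not isinstance(turns, list):
            return False
        for turn in turns:
            if not isinstance(turn, dict):
                return False
            if not isinstance(turn.get("input"), str):
                return False
            if not isinstance(turn.get("output"), str):
                return False
    return True

toy = json.loads(
    '[{"conversation": [{"input": "", "output": "I am an AI assistant."}]}]'
)
print(validate_incremental(toy))  # True
```

The `xtuner check-custom-dataset` entry point added in this commit performs the official check against a full config; the snippet above is only a quick local schema test.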
32 changes: 32 additions & 0 deletions data/toy_custom_multi_turn_data.json
@@ -0,0 +1,32 @@
[{
"conversation":[
{
"input": "Hello?",
"output": "Hello! How can I help you?"
},
{
"input": "What's the date today?",
"output": "Today is Monday, August 14, 2023."
},
{
"input": "Thank you!",
"output": "You are welcome."
}
]
},
{
"conversation":[
{
"input": "Hello?",
"output": "Hello! How can I help you?"
},
{
"input": "How's the weather today in Rosso?",
"output": "The weather in Rosso on Wednesday, August 16th, is going to be cloudy for most of the day, together with moderate rain around noon."
},
{
"input": "Thank you!",
"output": "You are welcome."
}
]
}]
18 changes: 18 additions & 0 deletions data/toy_custom_single_turn_data.json
@@ -0,0 +1,18 @@
[{
"conversation":
[
{
"input": "Give three tips for staying healthy.",
"output": "1.Eat a balanced diet. 2. Exercise regularly. 3. Get enough sleep."
}
]
},
{
"conversation":
[
{
"input": "How to study English?",
"output": "1. Set clear goals. 2. Create a study plan. 3. Build vocabulary. 4. Practice speaking."
}
]
}]
62 changes: 58 additions & 4 deletions docs/en/user_guides/incremental_pretraining.md
@@ -95,7 +95,9 @@ The following modifications need to be made to the config file copied in Step 3:
from xtuner.dataset import process_hf_dataset
from datasets import load_dataset
- from xtuner.dataset.map_fns import oasst1_map_fn, template_map_fn_factory
+ from map_fn import oasst1_incremental_map_fn
+ from mmengine.config import read_base
+ with read_base():
+ from .map_fn import oasst1_incremental_map_fn
...
#######################################################################
# PART 1 Settings #
@@ -127,8 +129,33 @@ train_dataloader = dict(
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn))
...
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
- instruction=prompt_template.INSTRUCTION_START)
+ )
]
...
```
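The `with read_base(): from .map_fn import oasst1_incremental_map_fn` pattern above assumes a `map_fn.py` placed next to the config file. As an illustration only — the `text` field name and the returned structure are inferred from the oasst1 format and the toy incremental data above, not copied from XTuner's implementation — such a file could look like:

```python
# map_fn.py -- hypothetical sketch of the map function imported above.
# For incremental pre-training each sample is treated as raw text to
# continue, so the whole oasst1 `text` field becomes the output while the
# input stays empty, mirroring the toy incremental data format.
def oasst1_incremental_map_fn(example):
    return {"conversation": [{"input": "", "output": example["text"]}]}

sample = {"text": "### Human: Hi### Assistant: Hello!"}
print(oasst1_incremental_map_fn(sample))
```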

#### Step 5, Log Processed Dataset (Optional)

After modifying the config file, you can print the first sample of the processed dataset to verify that it has been constructed correctly.

```bash
xtuner log-dataset $CONFIG
```

`$CONFIG` represents the file path of the modified configuration file in Step 4.

### Using Custom Datasets

When using custom datasets for incremental pre-training, we recommend constructing the dataset according to the [incremental pre-training data format](./dataset_format.md#incremental-pre-training-dataset-format) defined by XTuner. If the custom dataset is in other formats such as oasst1, refer to the section on [Using Dataset in HuggingFace Hub](#using-dataset-in-huggingface-hub).
@@ -191,18 +218,20 @@ from datasets import load_dataset
#######################################################################
- data_path = 'timdettmers/openassistant-guanaco'
- prompt_template = PROMPT_TEMPLATE.openassistant
+ data_path = 'path/to/your/data'
+ data_path = 'path/to/your/json/data'
...
#######################################################################
# STEP 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
- dataset=dict(type=load_dataset, path=data_path),
+ dataset=dict(
+ type=load_dataset, path='json', data_files=dict(train=data_path)),
tokenizer=tokenizer,
max_length=max_length,
- dataset_map_fn=oasst1_map_fn,
+ dataset_map_fn=oasst1_incremental_map_fn,
+ dataset_map_fn=None,
- template_map_fn=dict(
- type=template_map_fn_factory, template=prompt_template),
+ template_map_fn=None,
@@ -217,4 +246,29 @@ train_dataloader = dict(
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn))
...
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
- instruction=prompt_template.INSTRUCTION_START)
+ )
]
...
```

#### Step 5, Check Custom Dataset (Optional)

After modifying the config file, you can run the `xtuner/tools/check_custom_dataset.py` script to verify that the dataset is constructed correctly.

```bash
xtuner check-custom-dataset $CONFIG
```

`$CONFIG` represents the file path of the modified configuration file in Step 4.
30 changes: 27 additions & 3 deletions docs/en/user_guides/multi_turn_conversation.md
@@ -155,7 +155,9 @@ from xtuner.dataset import process_hf_dataset
from datasets import load_dataset
- from xtuner.dataset.map_fns import oasst1_map_fn, template_map_fn_factory
+ from xtuner.dataset.map_fns import template_map_fn_factory
+ from .map_fn import oasst1_multi_turns_map_fn
+ from mmengine.config import read_base
+ with read_base():
+ from .map_fn import oasst1_multi_turns_map_fn
...
#######################################################################
# PART 1 Settings #
@@ -189,6 +191,16 @@ train_dataloader = dict(
...
```
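For reference, a hedged sketch of what a multi-turn map function over the openassistant-guanaco `text` field might do. Splitting on the `### Human:` / `### Assistant:` delimiters is an assumption about the data layout, and XTuner's actual `oasst1_multi_turns_map_fn` may differ in detail:

```python
# Hypothetical sketch of a multi-turn map function over the
# openassistant-guanaco `text` field. Splitting on the "### Human:" /
# "### Assistant:" delimiters is an assumption about the data layout;
# XTuner's actual oasst1_multi_turns_map_fn may differ.
def oasst1_multi_turns_map_fn(example):
    turns = []
    # Each "### Human:" block is expected to contain one assistant reply.
    for block in example["text"].split("### Human:")[1:]:
        human, _, assistant = block.partition("### Assistant:")
        turns.append({"input": human.strip(), "output": assistant.strip()})
    return {"conversation": turns}

sample = {"text": "### Human: Hello?### Assistant: Hi! How can I help?"}
print(oasst1_multi_turns_map_fn(sample))
```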

#### Step 6, Log Processed Dataset (Optional)

After modifying the config file, you can print the first sample of the processed dataset to verify that it has been constructed correctly.

```bash
xtuner log-dataset $CONFIG
```

`$CONFIG` represents the file path of the modified configuration file in Step 5.

## Using Custom Datasets

When using a custom multi-turn dialogue dataset for instruction fine-tuning, we recommend constructing the dataset in the [multi-turn dialogue data format](./dataset_format.md#multi-turn-dialogue-dataset-format) defined by XTuner. If the custom dataset is in another format such as oasst1, refer to the section on [Using Datasets in HuggingFace Hub](#using-dataset-in-huggingface-hub).
@@ -260,7 +272,7 @@ from datasets import load_dataset
# PART 1 Settings #
#######################################################################
- data_path = 'timdettmers/openassistant-guanaco'
+ data_path = 'path/to/your/data'
+ data_path = 'path/to/your/json/data'

+ prompt_template = PROMPT_TEMPLATE.openassistant
...
@@ -269,7 +281,9 @@ from datasets import load_dataset
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
- dataset=dict(type=load_dataset, path=data_path),
+ dataset=dict(
+ type=load_dataset, path='json', data_files=dict(train=data_path)),
tokenizer=tokenizer,
max_length=max_length,
+ dataset_map_fn=None,
@@ -287,3 +301,13 @@ train_dataloader = dict(
collate_fn=dict(type=default_collate_fn))
...
```

#### Step 6, Check Processed Dataset (Optional)

After modifying the config file, you can run the `xtuner/tools/check_custom_dataset.py` script to verify that the dataset is constructed correctly.

```bash
xtuner check-custom-dataset $CONFIG
```

`$CONFIG` represents the file path of the modified configuration file in Step 5.
30 changes: 27 additions & 3 deletions docs/en/user_guides/single_turn_conversation.md
@@ -129,7 +129,9 @@ from xtuner.dataset import process_hf_dataset
from datasets import load_dataset
- from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
+ from xtuner.dataset.map_fns import template_map_fn_factory
+ from .map_fn import alpaca_map_fn
+ from mmengine.config import read_base
+ with read_base():
+ from .map_fn import alpaca_map_fn
...
#######################################################################
# PART 1 Settings #
@@ -164,6 +166,16 @@ train_dataloader = dict(
...
```
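For reference, a hedged sketch of mapping the alpaca schema (`instruction` / `input` / `output`) into the single-turn `conversation` structure. The exact concatenation rule in XTuner's `alpaca_map_fn` may differ; the fold-the-input-into-the-prompt behavior below is an assumption:

```python
# Hypothetical sketch of mapping the alpaca schema (instruction / input /
# output) into the single-turn `conversation` structure shown earlier.
# The exact concatenation rule in XTuner's alpaca_map_fn may differ.
def alpaca_map_fn(example):
    prompt = example["instruction"]
    # Fold the optional `input` field into the prompt when it is non-empty.
    if example.get("input"):
        prompt = f"{prompt}\n{example['input']}"
    return {"conversation": [{"input": prompt, "output": example["output"]}]}

sample = {
    "instruction": "Give three tips for staying healthy.",
    "input": "",
    "output": "1. Eat a balanced diet. 2. Exercise regularly. 3. Get enough sleep.",
}
print(alpaca_map_fn(sample))
```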

#### Step 6, Log Processed Dataset (Optional)

After modifying the config file, you can print the first sample of the processed dataset to verify that it has been constructed correctly.

```bash
xtuner log-dataset $CONFIG
```

`$CONFIG` represents the file path of the modified configuration file in Step 5.

## Using Custom Datasets

When using a custom single-turn dialogue dataset for instruction fine-tuning, we recommend constructing the dataset in the [single-turn dialogue data format](./dataset_format.md#single-turn-dialogue-dataset-format) defined by XTuner. If the custom dataset is in another format such as oasst1, refer to the section on [Using Datasets in HuggingFace Hub](#using-dataset-in-huggingface-hub).
@@ -228,15 +240,17 @@ from datasets import load_dataset
#######################################################################
- alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
- alpaca_en_path = 'tatsu-lab/alpaca'
+ data_path = 'path/to/your/data'
+ data_path = 'path/to/your/json/data'

+ prompt_template = PROMPT_TEMPLATE.alpaca
#######################################################################
# STEP 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
- dataset=dict(type=load_dataset, path=data_path),
+ dataset=dict(
+ type=load_dataset, path='json', data_files=dict(train=data_path)),
tokenizer=tokenizer,
max_length=max_length,
+ dataset_map_fn=None,
@@ -254,3 +268,13 @@ train_dataloader = dict(
collate_fn=dict(type=default_collate_fn))
...
```

#### Step 6, Check Processed Dataset (Optional)

After modifying the config file, you can run the `xtuner/tools/check_custom_dataset.py` script to verify that the dataset is constructed correctly.

```bash
xtuner check-custom-dataset $CONFIG
```

`$CONFIG` represents the file path of the modified configuration file in Step 5.
56 changes: 54 additions & 2 deletions docs/zh_cn/user_guides/incremental_pretraining.md
@@ -95,7 +95,9 @@ xtuner copy-cfg internlm_7b_qlora_oasst1_e3 .
from xtuner.dataset import process_hf_dataset
from datasets import load_dataset
- from xtuner.dataset.map_fns import oasst1_map_fn, template_map_fn_factory
+ from map_fn import oasst1_incremental_map_fn
+ from mmengine.config import read_base
+ with read_base():
+ from .map_fn import oasst1_incremental_map_fn
...
#######################################################################
# PART 1 Settings #
@@ -127,8 +129,33 @@ train_dataloader = dict(
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn))
...
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
- instruction=prompt_template.INSTRUCTION_START)
+ )
]
...
```

#### Step 5, Log Processed Dataset (Optional)

After modifying the config file, you can print the first sample of the processed dataset to verify that it has been constructed correctly.

```bash
xtuner log-dataset $CONFIG
```

`$CONFIG` is the file path of the config modified in Step 4.

### Using Custom Datasets

When using a custom dataset for incremental pre-training, we recommend constructing the dataset according to the [incremental pre-training data format](./dataset_format.md#增量预训练数据集格式) defined by XTuner. If the custom dataset is in another format such as `oasst1`, refer to the section on [Using Datasets in HuggingFace Hub](#使用huggingface-hub数据集).
@@ -204,7 +231,7 @@ train_dataset = dict(
tokenizer=tokenizer,
max_length=max_length,
- dataset_map_fn=oasst1_map_fn,
+ dataset_map_fn=oasst1_incremental_map_fn,
+ dataset_map_fn=None,
- template_map_fn=dict(
- type=template_map_fn_factory, template=prompt_template),
+ template_map_fn=None,
@@ -219,4 +246,29 @@ train_dataloader = dict(
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn))
...
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
- instruction=prompt_template.INSTRUCTION_START)
+ )
]
...
```

#### Step 5, Check Custom Dataset (Optional)

After modifying the config file, you can run the `xtuner/tools/check_custom_dataset.py` script to verify that the dataset is constructed correctly.

```bash
xtuner check-custom-dataset $CONFIG
```

`$CONFIG` is the file path of the config modified in Step 4.