[Docs] Add docs/zh_cn/preparation/pretrained_model.md (#462)

* fix pre-commit
* update
* Update pretrained_model.md
* Update pretrained_model.md
* fix pre-commit
* Update pretrained_model.md
* update
* update
* update
* update
* Update pretrained_model.md
InternLM · Mar 26, 2024 · d01b5e6
1 parent c360cb8
commit d01b5e6
Showing 1 changed file with 104 additions and 1 deletion.
diff --git a/docs/zh_cn/preparation/pretrained_model.md b/docs/zh_cn/preparation/pretrained_model.md
@@ -1 +1,104 @@
-# Prepare Pretrained Weights
+# Prepare Pretrained Model Weights
+
+`HuggingFace` and `ModelScope` offer several ways to download pretrained model weights. This section uses internlm2-chat-7b as an example to show how to download them quickly.
+
+> \[!IMPORTANT\]
+> If access to HuggingFace is restricted, prefer downloading from ModelScope.
+
+## \[Recommended\] Method 1: Use `snapshot_download`
+
+### HuggingFace
+
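+`huggingface_hub.snapshot_download` downloads the weights of a given model from the HuggingFace Hub and supports multithreaded downloads. The following code downloads the model weights in parallel: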
+
+```python
+from huggingface_hub import snapshot_download
+
+snapshot_download(repo_id='internlm/internlm2-chat-7b', local_dir='./internlm2-chat-7b', max_workers=20)
+```
+
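+Here, `repo_id` is the model's name on the HuggingFace Hub, `local_dir` is the local path to store the files in, and `max_workers` is the maximum number of parallel downloads.
+
+**Notes**
+
+1. If `local_dir` is not specified, the files are downloaded to HuggingFace's default cache path (`~/.cache/huggingface/hub`). To change the default cache path, set the corresponding environment variable: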
+
+   ```shell
+   export HF_HOME=XXXX  # defaults to `~/.cache/huggingface/`
+   ```
+
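+2. If the download is slow (for example, it does not saturate the available bandwidth), try setting `export HF_HUB_ENABLE_HF_TRANSFER=1` for a higher download speed.
+
+3. For more on these environment variables, see [the HuggingFace documentation](https://huggingface.co/docs/huggingface_hub/main/en/package_reference/environment_variables).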
+
+### ModelScope
+
+`modelscope.snapshot_download` downloads the weights of a specified model. The following code downloads the model:
+
+```python
+from modelscope import snapshot_download
+
+snapshot_download(model_id='Shanghai_AI_Laboratory/internlm2-chat-7b', cache_dir='./internlm2-chat-7b')
+```
+
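+Here, `model_id` is the model's name in the ModelScope model hub, and `cache_dir` is the local path to store the files in.
+
+**Notes**
+
+1. If `cache_dir` is not specified, the files are downloaded to ModelScope's default cache path (`~/.cache/modelscope/hub`).
+
+   To change the default cache path, set the corresponding environment variable: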
+
+   ```shell
+   export MODELSCOPE_CACHE=XXXX  # defaults to ~/.cache/modelscope/hub/
+   ```
+
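+2. `modelscope.snapshot_download` does not support multithreaded parallel downloads.
+
+## Method 2: Use Git LFS
+
+The remote model repositories on HuggingFace and ModelScope are Git repositories managed with Git LFS, so the weights can be downloaded with `git clone`: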
+
+```shell
+git lfs install
+# From HuggingFace
+git clone https://huggingface.co/internlm/internlm2-chat-7b
+# From ModelScope
+git clone https://www.modelscope.cn/Shanghai_AI_Laboratory/internlm2-chat-7b.git
+```
+
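+## Method 3: Use `AutoModelForCausalLM.from_pretrained`
+
+When initializing a model, `AutoModelForCausalLM.from_pretrained` tries to connect to the remote repository and download the model weights automatically, so it can also be used to download weights.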
+
+### HuggingFace
+
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+model = AutoModelForCausalLM.from_pretrained('internlm/internlm2-chat-7b', trust_remote_code=True)
+tokenizer = AutoTokenizer.from_pretrained('internlm/internlm2-chat-7b', trust_remote_code=True)
+```
+
+The model is then downloaded to HuggingFace's cache path (`~/.cache/huggingface/hub` by default).
+
+To change the default storage path, set the corresponding environment variable:
+
+```shell
+export HF_HOME=XXXX   # defaults to `~/.cache/huggingface/`
+```
+
+### ModelScope
+
+```python
+from modelscope import AutoModelForCausalLM, AutoTokenizer
+
+model = AutoModelForCausalLM.from_pretrained('Shanghai_AI_Laboratory/internlm2-chat-7b', trust_remote_code=True)
+tokenizer = AutoTokenizer.from_pretrained('Shanghai_AI_Laboratory/internlm2-chat-7b', trust_remote_code=True)
+```
+
+The model is then downloaded to ModelScope's cache path (`~/.cache/modelscope/hub` by default). To change the default storage path, set the corresponding environment variable:
+
+```shell
+export MODELSCOPE_CACHE=XXXX  # defaults to ~/.cache/modelscope/hub/
+```
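
The cache-path rules described in the added file can be sketched in a few lines of Python. This is an illustrative sketch and not part of the commit. The helper names `hf_hub_cache_dir` and `modelscope_cache_dir` are hypothetical and call neither library. They only mirror the defaults the document states, and they ignore any other overrides the libraries may support.

```python
import os
from pathlib import Path


def hf_hub_cache_dir(env=None):
    """Hypothetical helper: default location of HuggingFace Hub downloads."""
    env = os.environ if env is None else env
    # HF_HOME defaults to ~/.cache/huggingface; the hub cache is its "hub" subfolder.
    hf_home = env.get("HF_HOME", "~/.cache/huggingface")
    return Path(hf_home).expanduser() / "hub"


def modelscope_cache_dir(env=None):
    """Hypothetical helper: ModelScope cache path, per the defaults quoted in the doc."""
    env = os.environ if env is None else env
    # Assumption taken from the doc: MODELSCOPE_CACHE defaults to ~/.cache/modelscope/hub.
    return Path(env.get("MODELSCOPE_CACHE", "~/.cache/modelscope/hub")).expanduser()


# Show where each library would store weights under the current environment.
print(hf_hub_cache_dir())
print(modelscope_cache_dir())
```

Passing a plain dictionary such as `{"HF_HOME": "/data/hf"}` instead of relying on `os.environ` makes it easy to check the effect of an override before exporting the variable in a shell.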