docs: add the 'Text Marking' heading and the 'Text Marking: Judgment Document Text Marking' chapter; update '_toc.yml' to include the new file

quanttide · Nov 9, 2024 · da4b3db
1 parent 0bdd5f8
commit da4b3db

Showing 13 changed files with 123 additions and 41 deletions.
diff --git a/1_example_chapter/1_example_section/1_example_article.md b/1_example_chapter/1_example_section/1_example_article.md
diff --git a/1_example_chapter/1_example_section/1_example_article2.md b/1_example_chapter/1_example_section/1_example_article2.md
diff --git a/1_example_chapter/1_example_section/README.md b/1_example_chapter/1_example_section/README.md
diff --git a/1_example_chapter/2_example_article3.md b/1_example_chapter/2_example_article3.md
diff --git a/1_example_chapter/README.md b/1_example_chapter/README.md
diff --git a/2_example_chapter2/1_example_section2/README.md b/2_example_chapter2/1_example_section2/README.md
diff --git a/2_example_chapter2/README.md b/2_example_chapter2/README.md
diff --git a/3_example_chapter3/README.md b/3_example_chapter3/README.md
diff --git a/_config.yml b/_config.yml
@@ -1,7 +1,7 @@
 # https://jupyterbook.org/en/stable/customize/config.html
-name: quanttide-example-of-documentation
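-title: QuantTide Example Documentation Project
+name: quanttide-usercase-of-data-engineering
+title: QuantTide Data Engineering Use Case Library
 author: 量潮科技
-description: QuantTide documentation project example
+description: QuantTide Data Engineering Use Case Library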
 # Jupyter Book Config
 only_build_toc_files: true
diff --git a/_toc.yml b/_toc.yml
@@ -1,18 +1,6 @@
 format: jb-book
 root: index.md
 parts:
-  - caption: Example Part
-    chapters:
-      - file: 1_example_chapter/README.md
-        sections:
-          - file: 1_example_chapter/1_example_section/README.md
-            sections:
-              - file: 1_example_chapter/1_example_section/1_example_article.md
-              - file: 1_example_chapter/1_example_section/1_example_article2.md
-          - file: 1_example_chapter/2_example_article3.md
-  - caption: Example Part 2
-    chapters:
-      - file: 2_example_chapter2/README.md
-        sections:
-          - file: 2_example_chapter2/1_example_section2/README.md
-      - file: 3_example_chapter3/README.md
+  - caption: Text Marking
+    chapters: 
+      - file: text_marking/a.ipynb
diff --git a/text_cutting/README.md b/text_cutting/README.md
@@ -0,0 +1 @@
+# Text Cutting
diff --git a/text_marking/README.md b/text_marking/README.md
@@ -0,0 +1 @@
+# Text Marking
diff --git a/text_marking/a.ipynb b/text_marking/a.ipynb
@@ -0,0 +1,115 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
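+    "# Judgment Document Text Marking\n",
+    "\n",
+    "## Background\n",
+    "\n",
+    "From the full text of a number of case judgments, extract more than twenty items of information from each document, such as the perpetrator, the accident location, and the number of casualties.\n",
+    "\n",
+    "## Approach\n",
+    "\n",
+    "Use a large language model (LLM) to mark the text, and use its function_call feature to obtain structured data.\n",
+    "\n",
+    "### Problem\n",
+    "\n",
+    "The full text of a judgment is too long, and passing it to the LLM directly degrades marking quality. For example, the field “total fine amount” may be filled with wrong values such as the “compensation” or an “individual fine amount”.\n",
+    "\n",
+    "#### Solution 1\n",
+    "\n",
+    "Split the body into paragraphs or sentences, then summarize those paragraphs and sentences into semi-structured data.\n",
+    "\n",
+    "#### Solution 2\n",
+    "\n",
+    "1. Have the LLM read the full text together with the descriptions of the 20 fields, and pick out the relevant text for each field.\n",
+    "    For example:\n",
+    "    relevant_text: dict[str, list[str]] = {\n",
+    "        \"case type\": [\n",
+    "            \"relevant text 1\",\n",
+    "            \"relevant text 2\"\n",
+    "        ],\n",
+    "        \"perpetrator\": [\n",
+    "            \"relevant text 1\"\n",
+    "        ]\n",
+    "    }\n",
+    "    Prompt: split the original text into segments and, for each segment, have the model choose the most relevant of the 20 fields and tag the segment with them.\n",
+    "    \n",
+    "2. Send each field together with its relevant text to the LLM, which returns the final value for that field.\n",
+    "\n",
+    "Note: this approach is a simplified version of RAG. RAG also finds the text that corresponds to each of the 20 fields, but it performs that retrieval step with vector search (for example, using embeddings to find the best-matching passages) and then sends the retrieved passages to the LLM to compose the answer. RAG works this way because the corpus being searched is large and does not fit in the LLM's context window, so retrieval has to be done by other means. Here, however, the text is short enough for the LLM to read directly, so the LLM replaces the retrieval part: it finds the relevant text, and that text is then sent to the LLM again to get the answer (the shorter the text, the more accurate the result).\n",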
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
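+    "\"\"\"\n",
+    "The code below cannot run as-is and will be completed later; it is shown for illustration only. The final version will contain the runnable core snippets.\n",
+    "\"\"\"\n",
+    "# # Code for Solution 2\n",
+    "\n",
+    "# from dashscope import Generation\n",
+    "\n",
+    "# ## Step 1: the LLM reads the text and the field descriptions and picks out the relevant text for each field\n",
+    "# tools = [\n",
+    "#     {\n",
+    "#         \"type\": \"function\",\n",
+    "#         \"function\": {\n",
+    "#             \"name\": \"associate_field_with_text\",\n",
+    "#             \"description\": \"Save the fields that are associated with the text\",\n",
+    "#             \"parameters\": {\n",
+    "#                 \"type\": \"object\",\n",
+    "#                 \"properties\": {\n",
+    "#                     \"field\": {\n",
+    "#                         # JSON Schema has no list[string] type; use an array of strings\n",
+    "#                         \"type\": \"array\",\n",
+    "#                         \"items\": {\"type\": \"string\"},\n",
+    "#                         \"description\": \"The fields associated with the given text, e.g. ['document type', 'perpetrator']; give the field names only\",\n",
+    "#                     },\n",
+    "#                     \"log\": {\n",
+    "#                         \"type\": \"string\",\n",
+    "#                         \"description\": \"Log: briefly state the basis for your marking, i.e. why you chose these fields\",\n",
+    "#                     },\n",
+    "#                 },\n",
+    "#                 # 'required' belongs inside 'parameters' and must name defined properties\n",
+    "#                 \"required\": [\"field\", \"log\"],\n",
+    "#             },\n",
+    "#         },\n",
+    "#     }\n",
+    "# ]\n",
+    "\n",
+    "# def get_response(messages, tools):\n",
+    "#     response = Generation.call(\n",
+    "#         model='qwen-max',\n",
+    "#         messages=messages,\n",
+    "#         tools=tools,\n",
+    "#         result_format='message'\n",
+    "#     )\n",
+    "#     return response\n",
+    "\n",
+    "# ## Step 2: send each field together with its relevant text to the LLM, which returns the final value\n",
+    "\n",
+    "# question_field = 'case type'\n",
+    "# chat(question_field, relevant_text[question_field], prompt)\n",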
+    "\n"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python",
+   "version": "3.12.7"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
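
Note on Solution 2 in text_marking/a.ipynb: the two-step flow described in the notebook (tag each paragraph with its relevant fields, then query each field using only its own paragraphs) could be sketched roughly as below. This is a minimal sketch under stated assumptions, not the project's implementation. Every name here (`FIELDS`, `split_paragraphs`, `collect_relevant_text`, `extract_values`, `keyword_tagger`, `first_text_answer`) is hypothetical, the sample judgment text is made up, and the two LLM calls are replaced by trivial stand-in functions so that the sketch runs offline.

```python
from collections import defaultdict
from typing import Callable

# Hypothetical field list; the notebook mentions about 20 fields but does not list them.
FIELDS = ["case type", "perpetrator", "accident location"]


def split_paragraphs(text: str) -> list[str]:
    """Split a judgment into non-empty paragraphs, one per line."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def collect_relevant_text(
    paragraphs: list[str],
    fields: list[str],
    tag_paragraph: Callable[[str, list[str]], list[str]],
) -> dict[str, list[str]]:
    """Step 1: ask the tagger which fields each paragraph relates to."""
    relevant: dict[str, list[str]] = defaultdict(list)
    for paragraph in paragraphs:
        for field in tag_paragraph(paragraph, fields):
            if field in fields:  # drop field names the tagger invented
                relevant[field].append(paragraph)
    return dict(relevant)


def extract_values(
    relevant_text: dict[str, list[str]],
    answer: Callable[[str, list[str]], str],
) -> dict[str, str]:
    """Step 2: query each field with only the paragraphs tagged for it."""
    return {field: answer(field, texts) for field, texts in relevant_text.items()}


# Stand-ins for the two LLM calls (assumptions, for offline demonstration only).
KEYWORDS = {
    "case type": "dispute",
    "perpetrator": "defendant",
    "accident location": "occurred",
}


def keyword_tagger(paragraph: str, fields: list[str]) -> list[str]:
    """Pretend tagger: a field matches when its keyword appears in the paragraph."""
    return [f for f in fields if KEYWORDS[f] in paragraph.lower()]


def first_text_answer(field: str, texts: list[str]) -> str:
    """Pretend answerer: return the first relevant paragraph unchanged."""
    return texts[0]


# Made-up sample text, not taken from any real judgment.
judgment = """Traffic accident liability dispute, first-instance civil judgment.

The defendant Zhang drove a truck and hit a pedestrian.
The accident occurred on Renmin Road."""

relevant_text = collect_relevant_text(split_paragraphs(judgment), FIELDS, keyword_tagger)
result = extract_values(relevant_text, first_text_answer)
```

In a real run, `keyword_tagger` would be replaced by a function-calling request such as the `associate_field_with_text` tool from the notebook, and `first_text_answer` by the second LLM query. Filtering the returned names against `fields` guards against the model returning a field that was never defined.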