docs: add the 'Text Marking' heading and the 'Text Marking chapter: Text Marking for Court Judgments' section; update '_toc.yml' to include the new files
xiezipeng05 committed Nov 9, 2024
1 parent 0bdd5f8 commit da4b3db
Showing 13 changed files with 123 additions and 41 deletions.
6 changes: 0 additions & 6 deletions 1_example_chapter/1_example_section/1_example_article.md

This file was deleted.

6 changes: 0 additions & 6 deletions 1_example_chapter/1_example_section/1_example_article2.md

This file was deleted.

1 change: 0 additions & 1 deletion 1_example_chapter/1_example_section/README.md

This file was deleted.

6 changes: 0 additions & 6 deletions 1_example_chapter/2_example_article3.md

This file was deleted.

1 change: 0 additions & 1 deletion 1_example_chapter/README.md

This file was deleted.

1 change: 0 additions & 1 deletion 2_example_chapter2/1_example_section2/README.md

This file was deleted.

1 change: 0 additions & 1 deletion 2_example_chapter2/README.md

This file was deleted.

1 change: 0 additions & 1 deletion 3_example_chapter3/README.md

This file was deleted.

6 changes: 3 additions & 3 deletions _config.yml
@@ -1,7 +1,7 @@
 # https://jupyterbook.org/en/stable/customize/config.html
-name: quanttide-example-of-documentation
-title: QuantTide Example Documentation Project
+name: quanttide-usercase-of-data-engineering
+title: QuantTide Data Engineering Use Case Library
 author: 量潮科技
-description: QuantTide documentation project example
+description: QuantTide Data Engineering Use Case Library
 # Jupyter Book Config
 only_build_toc_files: true
18 changes: 3 additions & 15 deletions _toc.yml
@@ -1,18 +1,6 @@
 format: jb-book
 root: index.md
 parts:
-  - caption: Example Part
-    chapters:
-      - file: 1_example_chapter/README.md
-        sections:
-          - file: 1_example_chapter/1_example_section/README.md
-            sections:
-              - file: 1_example_chapter/1_example_section/1_example_article.md
-              - file: 1_example_chapter/1_example_section/1_example_article2.md
-          - file: 1_example_chapter/2_example_article3.md
-  - caption: Example Part 2
-    chapters:
-      - file: 2_example_chapter2/README.md
-        sections:
-          - file: 2_example_chapter2/1_example_section2/README.md
-      - file: 3_example_chapter3/README.md
+  - caption: Text Marking
+    chapters:
+      - file: text_marking/a.ipynb
1 change: 1 addition & 0 deletions text_cutting/README.md
@@ -0,0 +1 @@
# Text Cutting
1 change: 1 addition & 0 deletions text_marking/README.md
@@ -0,0 +1 @@
# Text Marking
115 changes: 115 additions & 0 deletions text_marking/a.ipynb
@@ -0,0 +1,115 @@
{
"cells": [
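{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Text Marking for Court Judgments\n",
"\n",
"## Background\n",
"\n",
"From the full text of a set of court judgments, extract more than twenty items of information from each judgment, such as the party at fault, the accident location, and the number of casualties.\n",
"\n",
"## Approach\n",
"\n",
"Use a large language model (LLM) to mark the text, and use its function-calling feature (`function_call`) to obtain structured data.\n",
"\n",
"### Problem\n",
"\n",
"The full text of a judgment is too long, and passing it to the model in one piece degrades the marking quality. For example, the \"total fine amount\" field gets filled with wrong values such as the compensation amount or the amount of a single fine.\n",
"\n",
"#### Solution 1\n",
"\n",
"Split the body into paragraphs or sentences, then summarize those paragraphs and sentences into semi-structured data.\n",
"\n",
"#### Solution 2\n",
"\n",
"1. Have the LLM read the entire text together with the descriptions of the 20 fields, and pick out the relevant text for each field. For example:\n",
"\n",
"       relevant_text: dict[str, list[str]] = {\n",
"           \"case type\": [\n",
"               \"relevant text 1\",\n",
"               \"relevant text 2\",\n",
"           ],\n",
"           \"party at fault\": [\n",
"               \"relevant text 1\",\n",
"           ],\n",
"       }\n",
"\n",
"   Prompt: split the original text into segments, and for each segment have the model choose the most relevant of the 20 fields to label it.\n",
"\n",
"2. Send each field together with its relevant text to the LLM, which returns the final value.\n",
"\n",
"Note: this is a simplified version of RAG (retrieval-augmented generation). RAG likewise finds the text relevant to each of the 20 fields, but it performs that retrieval step with vector search (locating the best-matching passages through word embeddings or similar techniques) and then sends the passages to the LLM to assemble the answer. RAG works this way because the text being searched is usually too large for the model's context window, so another retrieval mechanism has to take over. Here the text being searched is short enough for the model to read directly, so the LLM replaces the retrieval part: it finds the relevant text, and that text is then sent back to the LLM with the question (the shorter the text, the more accurate the result).\n",
"\n"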
]
},
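{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\"\"\"\n",
"The code in this cell is an incomplete sketch for demonstration only and will be supplemented later;\n",
"the final version will be a runnable snippet of the core code.\n",
"\"\"\"\n",
"# Code for Solution 2\n",
"\n",
"from dashscope import Generation  # DashScope SDK\n",
"\n",
"# Step 1: the LLM reads a text segment and the field descriptions, and reports the fields related to it\n",
"tools = [\n",
"    {\n",
"        \"type\": \"function\",\n",
"        \"function\": {\n",
"            \"name\": \"associate_field_with_text\",\n",
"            \"description\": \"Save the fields that are related to the text\",\n",
"            \"parameters\": {\n",
"                \"type\": \"object\",\n",
"                \"properties\": {\n",
"                    \"field\": {\n",
"                        # JSON Schema has no list[string] type; use an array of strings\n",
"                        \"type\": \"array\",\n",
"                        \"items\": {\"type\": \"string\"},\n",
"                        \"description\": \"Fields related to the given text, e.g. ['document type', 'party at fault']; give field names only\",\n",
"                    },\n",
"                    \"log\": {\n",
"                        \"type\": \"string\",\n",
"                        \"description\": \"Log: briefly state the basis for your labels, i.e. why you chose these fields\",\n",
"                    },\n",
"                },\n",
"                # every name listed here must be declared under properties\n",
"                \"required\": [\"field\"],\n",
"            },\n",
"        },\n",
"    }\n",
"]\n",
"\n",
"def get_response(messages, tools):\n",
"    \"\"\"Send the messages and tool definitions to qwen-max and return the raw response.\"\"\"\n",
"    response = Generation.call(\n",
"        model='qwen-max',\n",
"        messages=messages,\n",
"        tools=tools,\n",
"        result_format='message'\n",
"    )\n",
"    return response\n",
"\n",
"# Step 2: send each field together with its relevant text to the LLM, which returns the final value.\n",
"# `chat`, `relevant_text` and `prompt` are not defined yet, so this step stays commented out.\n",
"# question_field = 'case type'\n",
"# chat(question_field, relevant_text[question_field], prompt)\n",
"\n"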
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python",
"version": "3.12.7"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
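
The two-step flow described under Solution 2 in the notebook can be sketched as below. This is a minimal sketch under stated assumptions, not the author's implementation: `FIELDS`, `split_segments`, `build_relevant_text`, `build_field_prompt`, and `keyword_tagger` are hypothetical names, the sample text is made up, and a keyword-matching stub stands in for the LLM tagging call so the example runs offline.

```python
from collections import defaultdict
from typing import Callable

# Hypothetical field descriptions; the notebook mentions about 20 fields
# but does not list them, so only two illustrative ones appear here.
FIELDS = {
    "case type": "the kind of case the judgment concerns",
    "party at fault": "the person who caused the accident",
}


def split_segments(text: str) -> list[str]:
    """Split the full text into non-empty segments, one per line."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def build_relevant_text(
    segments: list[str], tagger: Callable[[str], list[str]]
) -> dict[str, list[str]]:
    """Step 1: label every segment and group the segments by field."""
    relevant: dict[str, list[str]] = defaultdict(list)
    for segment in segments:
        for field in tagger(segment):
            if field in FIELDS:  # ignore labels outside the field list
                relevant[field].append(segment)
    return dict(relevant)


def build_field_prompt(field: str, passages: list[str]) -> str:
    """Step 2: build the short prompt that asks for a single field."""
    joined = "\n".join(f"- {p}" for p in passages)
    return (
        f"Field: {field}\n"
        f"Description: {FIELDS[field]}\n"
        f"Relevant text:\n{joined}\n"
        "Answer with the value of this field only."
    )


def keyword_tagger(segment: str) -> list[str]:
    """Stand-in for the LLM tagging call: label by simple keyword match."""
    keywords = {
        "case type": ["traffic accident case"],
        "party at fault": ["defendant"],
    }
    return [f for f, words in keywords.items() if any(w in segment for w in words)]


# Made-up sample text, for illustration only.
sample = """This is a traffic accident case heard in first instance.
The defendant Zhang drove through a red light.
The court fee is borne by the defendant."""

relevant_text = build_relevant_text(split_segments(sample), keyword_tagger)
prompts = {f: build_field_prompt(f, p) for f, p in relevant_text.items()}
```

In a real run, `keyword_tagger` would be replaced by a function that sends the segment to the model with the `tools` definition and reads the returned field list, and each entry in `prompts` would be sent to the model in a second call. Keeping the tagger as a plain callable makes that swap possible without touching the grouping logic.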
