generated from quanttide/quanttide-example-of-documentation
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
docs: add the 'Text Marking' heading and the 'Text Marking: Text Marking of Court Judgments' chapter; update '_toc.yml' to include the new files
1 parent
0bdd5f8
commit da4b3db
Showing 13 changed files with 123 additions and 41 deletions.
8 files were deleted; their diffs are not shown.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,7 +1,7 @@ | ||
# https://jupyterbook.org/en/stable/customize/config.html | ||
name: quanttide-example-of-documentation | ||
title: QuantTide Example Documentation Project | ||
name: quanttide-usercase-of-data-engineering | ||
title: QuantTide Data Engineering Use Case Library | ||
author: 量潮科技 | ||
description: QuantTide documentation project example | ||
description: QuantTide Data Engineering Use Case Library | ||
# Jupyter Book Config | ||
only_build_toc_files: true |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,18 +1,6 @@ | ||
format: jb-book | ||
root: index.md | ||
parts: | ||
- caption: Example Part | ||
chapters: | ||
- file: 1_example_chapter/README.md | ||
sections: | ||
- file: 1_example_chapter/1_example_section/README.md | ||
sections: | ||
- file: 1_example_chapter/1_example_section/1_example_article.md | ||
- file: 1_example_chapter/1_example_section/1_example_article2.md | ||
- file: 1_example_chapter/2_example_article3.md | ||
- caption: Example Part 2 | ||
chapters: | ||
- file: 2_example_chapter2/README.md | ||
sections: | ||
- file: 2_example_chapter2/1_example_section2/README.md | ||
- file: 3_example_chapter3/README.md | ||
- caption: Text Marking | ||
chapters: | ||
- file: text_marking/a.ipynb |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
# Text Splitting |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
# Text Splitting |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,115 @@ | ||
{ | ||
"cells": [ | ||
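{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Text Marking of Court Judgments\n", | ||
"\n", | ||
"## Background\n", | ||
"\n", | ||
"From the full text of a number of court judgments, extract more than twenty items of information from each one, such as the person responsible for the accident, the accident location, and the number of casualties.\n", | ||
"\n", | ||
"## Approach\n", | ||
" \n", | ||
"Use a large language model (LLM) to mark the text, and use its function_call feature to obtain structured data.\n", | ||
"\n", | ||
"### Problem\n", | ||
"\n", | ||
"A full judgment is too long: passing it to the LLM in one piece degrades marking quality. For example, the \"total fine amount\" field gets filled with wrong values such as the \"compensation\" or an \"individual fine amount\".\n", | ||
"\n", | ||
"#### Solution 1\n", | ||
"\n", | ||
"Split the body into paragraphs or sentences, then summarize those paragraphs and sentences into semi-structured data.\n", | ||
"\n", | ||
"#### Solution 2\n", | ||
"\n", | ||
"1. Have the LLM read the whole text together with the descriptions of the 20 fields, and return the relevant passages for each field. \n", | ||
" For example:\n", | ||
" relevant_text: dict[str, list[str]] = {\n", | ||
" \"case type\": [\n", | ||
" \"relevant passage 1\",\n", | ||
" \"relevant passage 2\"\n", | ||
" ],\n", | ||
" \"person responsible\": [\n", | ||
" \"relevant passage 1\"\n", | ||
" ]\n", | ||
" }\n", | ||
" Prompt: split the original text into segments and, for each segment, have the model pick the most relevant of the 20 fields to mark it with.\n", | ||
" \n", | ||
"2. Send each field together with its passages to the LLM, which returns the final value.\n", | ||
"\n", | ||
"Note: this is a simplified version of RAG (retrieval-augmented generation). RAG likewise finds the text relevant to each of the 20 fields, but it performs that retrieval step with vector search (for example, using embeddings to find the best-matching passages) and then sends the results to the LLM to compose the answer. RAG works this way because the corpus being searched is large and does not fit in the LLM's context window, so a separate retrieval mechanism is needed. Here, a judgment is short enough for the LLM to read directly, so the LLM replaces the retrieval step: it finds the relevant passages, and those passages are then sent back to the LLM with the question (the shorter the text, the more accurate the result).\n", | ||
"\n" | ||
] | ||
}, | ||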
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"\"\"\"\n", | ||
"This code does not run as-is; it is only an illustration and will be expanded later. The final version will consist of runnable core snippets.\n", | ||
"\"\"\"\n", | ||
"# # Code for Solution 2\n", | ||
"\n", | ||
"# ## Step 1: the LLM reads the text and the field descriptions, and gives the relevant text for each field\n", | ||
"# tools = [\n", | ||
"#     {\n", | ||
"#         \"type\": \"function\",\n", | ||
"#         \"function\": {\n", | ||
"#             \"name\": \"associate_field_with_text\",\n", | ||
"#             \"description\": \"Save the fields associated with the text\",\n", | ||
"#             \"parameters\": {\n", | ||
"#                 \"type\": \"object\",\n", | ||
"#                 \"properties\": {\n", | ||
"#                     \"field\": {\n", | ||
"#                         \"type\": \"array\",  # JSON Schema has no list[string] type\n", | ||
"#                         \"items\": {\"type\": \"string\"},\n", | ||
"#                         \"description\": \"The fields associated with the given text, e.g. ['document type', 'person responsible']; give field names only\",\n", | ||
"#                     },\n", | ||
"#                     \"log\": {\n", | ||
"#                         \"type\": \"string\",\n", | ||
"#                         \"description\": \"Log: briefly state the basis for your marking, i.e. why you chose these fields\"\n", | ||
"#                     }\n", | ||
"#                 },\n", | ||
"#                 # \"required\" belongs inside \"parameters\" and should name properties declared above\n", | ||
"#                 \"required\": [\"field\", \"log\"]\n", | ||
"#             }\n", | ||
"#         }\n", | ||
"#     }\n", | ||
"# ]\n", | ||
"\n", | ||
"# from dashscope import Generation\n", | ||
"\n", | ||
"# def get_response(messages, tools):  # plain function, so no self parameter\n", | ||
"#     response = Generation.call(\n", | ||
"#         model='qwen-max',\n", | ||
"#         messages=messages,\n", | ||
"#         tools=tools,\n", | ||
"#         result_format='message'\n", | ||
"#     )\n", | ||
"#     return response\n", | ||
"\n", | ||
"# ## Step 2: send each field together with its relevant text to the LLM, which returns the final value\n", | ||
"\n", | ||
"# question_field = 'case type'\n", | ||
"# chat(question_field, relevant_text[question_field], prompt)\n", | ||
"\n" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"name": "python", | ||
"version": "3.12.7" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 2 | ||
} |
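The two steps of Solution 2 in the notebook above can be sketched without any model call. The following is a minimal illustration, not the project's implementation: `FIELDS`, `split_paragraphs`, `collect_relevant_text`, `extract_fields`, and the two stand-in functions are all hypothetical names introduced here, and the keyword-based stand-ins take the place of the real LLM calls so that the data flow (paragraphs → `relevant_text` → per-field answers) can be run and inspected.

```python
from collections import defaultdict
from typing import Callable

# Hypothetical field list for illustration; the notebook mentions about 20 fields.
FIELDS = ["case type", "person responsible", "accident location"]


def split_paragraphs(text: str) -> list[str]:
    """Split a judgment into non-empty, stripped paragraphs (one per line)."""
    return [line.strip() for line in text.split("\n") if line.strip()]


def collect_relevant_text(
    paragraphs: list[str],
    fields: list[str],
    tag_paragraph: Callable[[str, list[str]], list[str]],
) -> dict[str, list[str]]:
    """Step 1: group paragraphs under every field the tagger assigns to them."""
    relevant_text: dict[str, list[str]] = defaultdict(list)
    for paragraph in paragraphs:
        for field in tag_paragraph(paragraph, fields):
            if field in fields:  # drop field names the tagger invented
                relevant_text[field].append(paragraph)
    return dict(relevant_text)


def extract_fields(
    relevant_text: dict[str, list[str]],
    answer_field: Callable[[str, list[str]], str],
) -> dict[str, str]:
    """Step 2: answer each field using only its own short passages."""
    return {
        field: answer_field(field, passages)
        for field, passages in relevant_text.items()
    }


# Stand-ins for the two LLM calls (assumptions, not a real model).
def keyword_tagger(paragraph: str, fields: list[str]) -> list[str]:
    keywords = {
        "case type": "traffic accident",
        "person responsible": "defendant",
        "accident location": "Road",
    }
    return [f for f in fields if keywords.get(f, "\0") in paragraph]


def first_passage(field: str, passages: list[str]) -> str:
    return passages[0]


sample = "This is a traffic accident case.\n\nThe defendant Zhang drove on Renmin Road."
relevant = collect_relevant_text(split_paragraphs(sample), FIELDS, keyword_tagger)
print(extract_fields(relevant, first_passage))
```

In a real pipeline, `keyword_tagger` would presumably be replaced by a call that sends one paragraph plus the field descriptions to the model and reads the `field` argument of the returned tool call, and `first_passage` by a second call that asks for the field's value given only its passages.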