features@add mineru #54

Merged
merged 1 commit into from
Sep 7, 2024
23 changes: 23 additions & 0 deletions docs/mineru.md
@@ -5,6 +5,29 @@

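Supports one-command startup. MinerU is already packaged into the image with the model weights included, and it supports GPU-accelerated inference: per-page parsing on a GPU is tens of times faster than on a CPU.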

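## Key features
- Removes headers, footers, footnotes, page numbers, and similar elements to keep the text semantically coherent
- Outputs text in human reading order for multi-column layouts
- Preserves the structure of the original document, including headings, paragraphs, and lists
- Extracts images, image captions, tables, and table captions
- Automatically recognizes formulas in the document and converts them to LaTeX
- Automatically recognizes tables in the document and converts them to LaTeX
- Automatically detects PDFs with garbled text and enables OCR for them
- Supports both CPU and GPU environments
- Supports Windows, Linux, and macOS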
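
The garbled-PDF check in the list above is not specified in this document. As a rough illustration only, one way such a check could work is to measure how much of a page's extracted text looks like decoding debris and fall back to OCR above a threshold. The function names, the choice of character categories, and the `0.3` threshold are all assumptions made for this sketch, not MinerU's actual logic.

```python
import unicodedata

# Unicode categories treated as debris in this sketch (an assumption):
# Cc = control characters, Co = private use, Cn = unassigned.
SUSPECT_CATEGORIES = {"Cc", "Co", "Cn"}


def garbled_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that look like extraction debris."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        # No extractable text at all: treat the page as unusable.
        return 1.0
    bad = sum(
        1
        for c in chars
        if c == "\ufffd" or unicodedata.category(c) in SUSPECT_CATEGORIES
    )
    return bad / len(chars)


def needs_ocr(text: str, threshold: float = 0.3) -> bool:
    """Hypothetical decision rule: use OCR when too much of the text is debris."""
    return garbled_ratio(text) >= threshold
```

A page whose text layer decodes cleanly would keep the fast text-extraction path, while a page full of replacement characters or private-use glyphs would be routed to OCR.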

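## How it works
See `PDF-Extract-Kit` for details: https://github.com/opendatalab/PDF-Extract-Kit/blob/main/README-zh_CN.md

PDF documents contain a wealth of knowledge, but extracting high-quality content from them is not easy. To address this, the PDF content extraction task is broken down into the following components:

- Layout detection: uses the `LayoutLMv3` model to detect regions such as images, tables, titles, and text;
- Formula detection: uses `YOLOv8` to detect formulas, both inline and display formulas;
- Formula recognition: uses `UniMERNet` to recognize formulas;
- Table recognition: uses `StructEqTable` to recognize tables;
- Optical character recognition: uses `PaddleOCR` to recognize text;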
![](https://i-blog.csdnimg.cn/direct/9fe1344768ab407fba31458492454a2b.png)
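
The decomposition above can be read as a routing table: layout detection labels each region, and each label is handed to the model responsible for that kind of content. The sketch below only illustrates that reading. The region labels, the dictionary, and `plan_page` are hypothetical names introduced here, and none of this is the `PDF-Extract-Kit` API.

```python
from typing import List, Optional, Tuple

# Hypothetical region labels mapped to the model that the list above
# assigns to that kind of content. Labels are assumptions for this sketch.
RECOGNIZER_FOR_REGION = {
    "text": "PaddleOCR",
    "title": "PaddleOCR",
    "table": "StructEqTable",
    "formula": "UniMERNet",
}


def plan_page(region_labels: List[str]) -> List[Tuple[str, Optional[str]]]:
    """Pair each detected region with its recognizer.

    Regions with no recognizer in the table (for example images, which are
    extracted as-is) are paired with None.
    """
    return [(label, RECOGNIZER_FOR_REGION.get(label)) for label in region_labels]
```

For example, `plan_page(["title", "table", "image"])` pairs the title with `PaddleOCR`, the table with `StructEqTable`, and leaves the image without a recognizer.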


## Image address:

> Alibaba Cloud registry: docker pull registry.cn-beijing.aliyuncs.com/quincyqiang/mineru:0.2-models