fix: Various improvements and fixes backported from docling #41

Draft
wants to merge 24 commits into
base: main
24 commits
5aa67df
Fix area method of BoundingBox
cau-git Oct 9, 2024
9e1bc5b
add image placeholder
dolfim-ibm Oct 10, 2024
089cdb9
enable picture label
dolfim-ibm Oct 10, 2024
ba7f063
refactor captions and markdown
dolfim-ibm Oct 10, 2024
2b185e0
add logic to skip repeated caption
dolfim-ibm Oct 10, 2024
ba79b4a
use DocItemLabel
dolfim-ibm Oct 10, 2024
e42a1dd
Extend default export labels, add convenience methods
cau-git Oct 10, 2024
baceeae
Introduce ListItem API, with marker and enumerated properties
cau-git Oct 11, 2024
8223654
add classification and description in PictureData
dolfim-ibm Oct 13, 2024
f2b3afa
add molecule picture data
dolfim-ibm Oct 13, 2024
7322553
Fixes for DoclingDocument and aligned methods on legacy doc
cau-git Oct 14, 2024
cb56fbf
add advanced picture data content
dolfim-ibm Oct 14, 2024
63395bd
Many markdown export fixes, renaming BaseTableData
cau-git Oct 14, 2024
4ddecf8
Merge branch 'cau/improvements' of github.com:DS4SD/docling-core into…
cau-git Oct 14, 2024
6fee533
Rename module paths doc->legacy_doc, experimental->doc
cau-git Oct 15, 2024
7c104d6
feat: imageref with pil_image
dolfim-ibm Oct 15, 2024
0c4d3e1
Small fixes
cau-git Oct 15, 2024
1b30a74
Merge branch 'cau/improvements' of github.com:DS4SD/docling-core into…
cau-git Oct 15, 2024
d26dcf6
docs: remove documentation in markdown to support python 3.13 (#43)
ceberam Oct 15, 2024
c28a040
Merge branch 'cau/improvements' of github.com:DS4SD/docling-core into…
cau-git Oct 15, 2024
cb7e597
Fix TableCell model validator
cau-git Oct 16, 2024
33aa214
store list of classes in classification
dolfim-ibm Oct 16, 2024
18cb9f4
Fixes for DocumentOrigin mimetype validation
cau-git Oct 16, 2024
5fb2f34
Merge branch 'cau/improvements' of github.com:DS4SD/docling-core into…
cau-git Oct 16, 2024
2 changes: 1 addition & 1 deletion .github/workflows/checks.yml
@@ -6,7 +6,7 @@ jobs:
 runs-on: ubuntu-latest
 strategy:
 matrix:
-python-version: ['3.9', '3.10', '3.11', '3.12']
+python-version: ['3.9', '3.10', '3.11', '3.12', '3.13']
 steps:
 - uses: actions/checkout@v3
 - uses: ./.github/actions/setup-poetry
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -52,7 +52,7 @@ repos:
 hooks:
 - id: docs
 name: Docs
-entry: poetry run ds_generate_docs docs
+entry: poetry run generate_docs docs
 pass_filenames: false
 language: system
 files: '\.py$'
16 changes: 8 additions & 8 deletions README.md
@@ -1,7 +1,7 @@
 # Docling Core

 [![PyPI version](https://img.shields.io/pypi/v/docling-core)](https://pypi.org/project/docling-core/)
-![Python](https://img.shields.io/badge/python-3.9%20%7C%203.10%20%7C%203.11%20%7C%203.12-blue)
+![Python](https://img.shields.io/badge/python-3.9%20%7C%203.10%20%7C%20%203.11%20%7C%203.12%20%7C%203.13-blue)
 [![Poetry](https://img.shields.io/endpoint?url=https://python-poetry.org/badge/v0.json)](https://python-poetry.org/)
 [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
 [![Imports: isort](https://img.shields.io/badge/%20imports-isort-%231674b1?style=flat&labelColor=ef8336)](https://pycqa.github.io/isort/)
@@ -21,7 +21,7 @@ pip install docling-core

 ### Development setup

-To develop for Docling Core, you need Python 3.9 / 3.10 / 3.11 / 3.12 and Poetry. You can then install from your local clone's root dir:
+To develop for Docling Core, you need Python 3.9 / 3.10 / 3.11 / 3.12 / 3.13 and Poetry. You can then install from your local clone's root dir:
 ```bash
 poetry install
 ```
@@ -45,14 +45,14 @@ poetry run pytest test
 Document.model_validate_json(data_str)
 ```

-- You can generate the JSON schema of a model with the script `ds_generate_jsonschema`.
+- You can generate the JSON schema of a model with the script `generate_jsonschema`.

 ```py
 # for the `Document` type
-ds_generate_jsonschema Document
+generate_jsonschema Document

 # for the use `Record` type
-ds_generate_jsonschema Record
+generate_jsonschema Record
 ```

 ## Documentation
@@ -61,12 +61,12 @@ Docling supports 3 main data types:

 - **Document** for publications like books, articles, reports, or patents. When Docling converts an unstructured PDF document, the generated JSON follows this schema.
 The Document type also models the metadata that may be attached to the converted document.
-Check [Document](docs/Document.md) for the full JSON schema.
+Check [Document](docs/Document.json) for the full JSON schema.
 - **Record** for structured database records, centered on an entity or _subject_ that is provided with a list of attributes.
 Related to records, the statements can represent annotations on text by Natural Language Processing (NLP) tools.
-Check [Record](docs/Record.md) for the full JSON schema.
+Check [Record](docs/Record.json) for the full JSON schema.
 - **Generic** for any data representation, ensuring minimal configuration and maximum flexibility.
-Check [Generic](docs/Generic.md) for the full JSON schema.
+Check [Generic](docs/Generic.json) for the full JSON schema.

 The data schemas are defined using [pydantic](https://pydantic-docs.helpmanual.io/) models, which provide built-in processes to support the creation of data that adhere to those models.

20 changes: 12 additions & 8 deletions docling_core/types/__init__.py
@@ -5,21 +5,25 @@

 """Define the main types."""

-from docling_core.types.doc.base import BoundingBox # noqa
-from docling_core.types.doc.base import Table # noqa
-from docling_core.types.doc.base import TableCell # noqa
-from docling_core.types.doc.base import ( # noqa
+from docling_core.types.gen.generic import Generic # noqa
+from docling_core.types.legacy_doc.base import BoundingBox # noqa
+from docling_core.types.legacy_doc.base import Table # noqa
+from docling_core.types.legacy_doc.base import TableCell # noqa
+from docling_core.types.legacy_doc.base import ( # noqa
 BaseCell,
 BaseText,
 PageDimensions,
 PageReference,
 Prov,
 Ref,
 )
-from docling_core.types.doc.document import ( # noqa
+from docling_core.types.legacy_doc.document import ( # noqa
 CCSDocumentDescription as DocumentDescription,
 )
-from docling_core.types.doc.document import CCSFileInfoObject as FileInfoObject # noqa
-from docling_core.types.doc.document import ExportedCCSDocument as Document # noqa
-from docling_core.types.gen.generic import Generic # noqa
+from docling_core.types.legacy_doc.document import ( # noqa
+CCSFileInfoObject as FileInfoObject,
+)
+from docling_core.types.legacy_doc.document import ( # noqa
+ExportedCCSDocument as Document,
+)
 from docling_core.types.rec.record import Record # noqa
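
Note for downstream code: the names re-exported from `docling_core.types` are unchanged by this file's diff, so only deep imports of the old `docling_core.types.doc.base` and `docling_core.types.doc.document` paths are affected by the `doc` -> `legacy_doc` rename. The sketch below shows one possible way to probe which layout an installed version uses. It relies only on the standard library; the helper names `module_available` and `first_available` and the constant `LEGACY_BASE_CANDIDATES` are hypothetical and not part of docling-core.

```python
import importlib.util
from typing import Optional, Sequence

# Candidate module paths, newest layout first. Both names are taken from the
# diff above. The order matters: per the commit "Rename module paths
# doc->legacy_doc, experimental->doc", the old path may still resolve after
# the rename, but to a different module.
LEGACY_BASE_CANDIDATES = (
    "docling_core.types.legacy_doc.base",  # layout after this PR
    "docling_core.types.doc.base",         # layout before this PR
)


def module_available(name: str) -> bool:
    """Return True if `name` can be located, without importing `name` itself."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # find_spec imports parent packages and raises ModuleNotFoundError
        # (an ImportError) when a parent is missing.
        return False


def first_available(candidates: Sequence[str]) -> Optional[str]:
    """Return the first locatable module path from `candidates`, or None."""
    for name in candidates:
        if module_available(name):
            return name
    return None


if __name__ == "__main__":
    # Prints a module path, or None if docling-core is not installed.
    print(first_available(LEGACY_BASE_CANDIDATES))
```

Code that only uses `from docling_core.types import BoundingBox` (or the other names listed in this file) should not need such a probe.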
24 changes: 24 additions & 0 deletions docling_core/types/doc/__init__.py
@@ -4,3 +4,27 @@
 #

 """Package for models defined by the Document type."""
+
+from .base import BoundingBox, CoordOrigin, Size
+from .document import (
+DescriptionItem,
+DocItem,
+DoclingDocument,
+DocumentOrigin,
+FloatingItem,
+GroupItem,
+ImageRef,
+KeyValueItem,
+NodeItem,
+PageItem,
+PictureData,
+PictureItem,
+ProvenanceItem,
+RefItem,
+SectionHeaderItem,
+TableCell,
+TableData,
+TableItem,
+TextItem,
+)
+from .labels import DocItemLabel, GroupLabel, TableCellLabel