Improve Document/Chunk ID management #222

Merged 17 commits on Jan 2, 2025
5 changes: 5 additions & 0 deletions CHANGELOG.md
@@ -9,6 +9,11 @@

### Changed
- Updated LLM implementations to handle message history consistently across providers.
- The `id_prefix` parameter in the `LexicalGraphConfig` is deprecated.

### Fixed
- IDs for the Document and Chunk nodes in the lexical graph are now randomly generated and unique across multiple runs, fixing an issue where relationships were created between chunks produced by different pipeline runs.


## 1.3.0
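
As an illustration of the Fixed entry above (a sketch only, not necessarily how the library generates its IDs), attaching a random UUID to each chunk is enough to keep IDs unique across pipeline runs:

import uuid

def make_chunk_id(index: int) -> str:
    # A fresh UUID per call makes the ID unique across pipeline runs,
    # unlike a bare index or a fixed prefix plus index, which repeats on every run.
    return f"{uuid.uuid4()}:{index}"

print(make_chunk_id(0))  # e.g. "3f2b9c1e-...:0", different on every run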

6 changes: 6 additions & 0 deletions docs/source/types.rst
@@ -39,6 +39,12 @@ RagResultModel

.. autoclass:: neo4j_graphrag.generation.types.RagResultModel

DocumentInfo
============

.. autoclass:: neo4j_graphrag.experimental.components.types.DocumentInfo


TextChunk
=========

40 changes: 27 additions & 13 deletions docs/source/user_guide_kg_builder.rst
@@ -672,7 +672,7 @@ Example usage:
from neo4j_graphrag.experimental.components.lexical_graph_builder import LexicalGraphBuilder
from neo4j_graphrag.experimental.components.types import LexicalGraphConfig

lexical_graph_builder = LexicalGraphBuilder(config=LexicalGraphConfig(id_prefix="example"))
lexical_graph_builder = LexicalGraphBuilder(config=LexicalGraphConfig())
graph = await lexical_graph_builder.run(
text_chunks=TextChunks(chunks=[
TextChunk(text="some text", index=0),
@@ -713,7 +713,6 @@ Optionally, the document and chunk node labels can be configured using a `LexicalGraphConfig`
# optionally, define a LexicalGraphConfig object
# shown below with the default values
config = LexicalGraphConfig(
id_prefix="", # used to prefix the chunk and document IDs
chunk_node_label="Chunk",
document_node_label="Document",
chunk_to_document_relationship_type="PART_OF_DOCUMENT",
@@ -998,7 +997,7 @@ without making assumptions about entity similarity. The Entity Resolver
is responsible for refining the created knowledge graph by merging entity
nodes that represent the same real-world object.

In practice, this package implements a single resolver that merges nodes
In practice, this package implements a simple resolver that merges nodes
with the same label and identical "name" property.

.. warning::
@@ -1018,15 +1017,30 @@ It can be used like this:

.. warning::

By default, all nodes with the __Entity__ label will be resolved.
To exclude specific nodes, a filter_query can be added to the query.
For example, if a `:Resolved` label has been applied to already resolved entities
in the graph, these entities can be excluded with the following approach:
By default, all nodes with the `__Entity__` label will be resolved.
This behavior can be controlled using the `filter_query` parameter described below.

.. code:: python
Filter Query Parameter
----------------------

from neo4j_graphrag.experimental.components.resolver import (
SinglePropertyExactMatchResolver,
)
resolver = SinglePropertyExactMatchResolver(driver, filter_query="WHERE not entity:Resolved")
res = await resolver.run()
To exclude specific nodes from the resolution, a `filter_query` can be added to the query.
For example, if a `:Resolved` label has been applied to already resolved entities
in the graph, these entities can be excluded with the following approach:

.. code:: python

from neo4j_graphrag.experimental.components.resolver import (
SinglePropertyExactMatchResolver,
)
filter_query = "WHERE NOT entity:Resolved"
resolver = SinglePropertyExactMatchResolver(driver, filter_query=filter_query)
res = await resolver.run()


A similar approach can be used to exclude entities created by a previous pipeline
run on the same document, assuming a label `OldDocument` has been assigned to the
previously created document node:

.. code:: python

filter_query = "WHERE NOT EXISTS((entity)-[:FROM_DOCUMENT]->(:OldDocument))"
2 changes: 1 addition & 1 deletion examples/build_graph/simple_kg_builder_from_pdf.py
@@ -23,7 +23,7 @@
DATABASE = "neo4j"


root_dir = Path(__file__).parents[4]
root_dir = Path(__file__).parents[1]
file_path = root_dir / "data" / "Harry Potter and the Chamber of Secrets Summary.pdf"
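
For reference, a small sketch of why `parents[1]` is the correct index, assuming the example file lives at examples/build_graph/simple_kg_builder_from_pdf.py and the PDF is stored under examples/data/ (both assumptions, not stated in this diff):

from pathlib import Path

# parents[0] drops the file name, parents[1] drops the build_graph folder,
# landing on the examples/ directory that contains data/.
example = Path("examples/build_graph/simple_kg_builder_from_pdf.py")
assert example.parents[0] == Path("examples/build_graph")
assert example.parents[1] == Path("examples")
print(example.parents[1] / "data" / "Harry Potter and the Chamber of Secrets Summary.pdf")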


@@ -4,8 +4,8 @@
EntityRelationExtractor,
OnError,
)
from neo4j_graphrag.experimental.components.pdf_loader import DocumentInfo
from neo4j_graphrag.experimental.components.types import (
DocumentInfo,
LexicalGraphConfig,
Neo4jGraph,
TextChunks,
@@ -13,7 +13,6 @@ async def main() -> GraphResult:
# optionally, define a LexicalGraphConfig object
# shown below with default values
config = LexicalGraphConfig(
id_prefix="", # used to prefix the chunk and document IDs
chunk_node_label="Chunk",
document_node_label="Document",
chunk_to_document_relationship_type="PART_OF_DOCUMENT",
@@ -3,11 +3,8 @@
from pathlib import Path
from typing import Dict, Optional

from neo4j_graphrag.experimental.components.pdf_loader import (
DataLoader,
DocumentInfo,
PdfDocument,
)
from neo4j_graphrag.experimental.components.pdf_loader import DataLoader
from neo4j_graphrag.experimental.components.types import DocumentInfo, PdfDocument


class MyLoader(DataLoader):
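
For context, a minimal sketch of what a complete custom loader might look like after this change. The `load_file` method name and the `PdfDocument`/`DocumentInfo` field names are assumptions, not taken from this diff:

from pathlib import Path
from typing import Dict, Optional

from neo4j_graphrag.experimental.components.pdf_loader import DataLoader
from neo4j_graphrag.experimental.components.types import DocumentInfo, PdfDocument


class MyLoader(DataLoader):
    # Assumed interface: DataLoader subclasses implement an async loading method
    # that returns a PdfDocument wrapping the extracted text and its DocumentInfo.
    async def load_file(
        self,
        filepath: Path,
        metadata: Optional[Dict[str, str]] = None,
    ) -> PdfDocument:
        text = filepath.read_text()  # replace with real parsing logic
        return PdfDocument(
            text=text,
            document_info=DocumentInfo(path=str(filepath), metadata=metadata),
        )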
@@ -33,7 +33,6 @@ async def main(neo4j_driver: neo4j.Driver) -> PipelineResult:
pipe.add_component(TextChunkEmbedder(embedder=OpenAIEmbeddings()), "chunk_embedder")
# optional: define some custom node labels for the lexical graph:
lexical_graph_config = LexicalGraphConfig(
id_prefix="example",
chunk_node_label="TextPart",
)
pipe.add_component(
@@ -164,7 +164,6 @@ async def define_and_run_pipeline(
async def main(driver: neo4j.Driver) -> PipelineResult:
# optional: define some custom node labels for the lexical graph:
lexical_graph_config = LexicalGraphConfig(
id_prefix="example",
chunk_node_label="TextPart",
document_node_label="Text",
)
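
For reference, a minimal sketch (reusing the `LexicalGraphBuilder` import and usage from the docs diff above) of passing such a customized config to the builder:

from neo4j_graphrag.experimental.components.lexical_graph_builder import LexicalGraphBuilder
from neo4j_graphrag.experimental.components.types import LexicalGraphConfig

# Custom labels for the lexical graph nodes (defaults: "Chunk" and "Document").
lexical_graph_config = LexicalGraphConfig(
    chunk_node_label="TextPart",
    document_node_label="Text",
)
lexical_graph_builder = LexicalGraphBuilder(config=lexical_graph_config)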
@@ -184,7 +184,6 @@ async def read_chunk_and_perform_entity_extraction(
async def main(driver: neo4j.Driver) -> PipelineResult:
# optional: define some custom node labels for the lexical graph:
lexical_graph_config = LexicalGraphConfig(
id_prefix="example",
document_node_label="Book", # default: "Document"
chunk_node_label="Chapter", # default "Chunk"
chunk_text_property="content", # default: "text"