Pipeline creates Chunks with duplicate ids when executed multiple times #221

risafj · 2024-11-28T11:06:41Z

When I run Pipeline() in a loop over multiple documents, each run creates a Chunk node with an id property of ":1" and an index of 1. This causes problems, since the ids are no longer unique.

For example, when the lexical graph gets created, a Chunk node with an id of ":1" has a NEXT_NODE relation to every Chunk node that has an id of ":2".

After running the pipeline with 4 documents, it looks like this:

The same issue is occurring with FROM_CHUNK, where an entity that's supposed to have a relation like (n:Entity)-[:FROM_CHUNK]->(c:Chunk {id: ":1", index: "1"}) actually has that relation to all documents' chunks with an index of 1.

Is there any workaround for this?
I'm guessing this issue would be solved if I could somehow pass a document-specific id_prefix so each chunk gets a unique id?

neo4j-graphrag-python/src/neo4j_graphrag/experimental/components/lexical_graph.py

Lines 78 to 79 in bc6dd9c

    
    def chunk_id(self, chunk_index: int) -> str:
        return f"{self.config.id_prefix}:{chunk_index}"
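
To illustrate why the ids collide, here is a minimal sketch that mimics the formatting above without using the library. `FakeConfig` and `ids_for_run` are hypothetical stand-ins, and the empty default prefix is an assumption inferred from the ":1" ids reported in this issue.

```python
from dataclasses import dataclass


@dataclass
class FakeConfig:
    # Stand-in for the library config (hypothetical). The empty default
    # is an assumption based on the ":1" ids seen in the graph.
    id_prefix: str = ""


def chunk_id(config: FakeConfig, chunk_index: int) -> str:
    # Same string formatting as the library snippet quoted above.
    return f"{config.id_prefix}:{chunk_index}"


def ids_for_run(config: FakeConfig, n_chunks: int) -> list[str]:
    # One pipeline run over one document produces one id per chunk.
    return [chunk_id(config, i) for i in range(n_chunks)]


# Two runs with the same prefix: every id is shared between documents.
run_a = ids_for_run(FakeConfig(), 3)
run_b = ids_for_run(FakeConfig(), 3)
collisions = set(run_a) & set(run_b)

# Two runs with distinct prefixes: no ids are shared.
run_c = ids_for_run(FakeConfig(id_prefix="docA"), 3)
run_d = ids_for_run(FakeConfig(id_prefix="docB"), 3)
no_collisions = set(run_c) & set(run_d)
```

Since the id depends only on the prefix and the chunk index, any two runs that share a prefix produce the same set of ids, which would explain the merged Chunk nodes.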

Additional info:
I use v1.2.0.
I have a standard pipeline setup that has these components.

    pipe = Pipeline()
    # skipping the config code
    pipe.add_component(text_splitter, "splitter")
    pipe.add_component(embedder, "chunk_embedder")
    pipe.add_component(schema_builder, "schema")
    pipe.add_component(extractor, "extractor")
    pipe.add_component(writer, "writer")
    pipe.add_component(resolver, "resolver")


stellasia · 2024-11-28T17:12:08Z

Hi @risafj ,

Indeed, this behavior is quite annoying. We'll take a closer look.

In the meantime, you can control this prefix by setting it in a LexicalGraphConfig, which is a run parameter of the entity and relation extractor.

So your code will look like this:

    from neo4j_graphrag.experimental.components.types import LexicalGraphConfig

    config = LexicalGraphConfig(
        id_prefix="myPrefix",
    )

    await pipe.run(data={
        # ...
        "extractor": {
            # ...
            "lexical_graph_config": config,
        }
    })
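
When looping over several documents, the prefix has to differ per run. A minimal sketch of one way to do that, assuming each document is identified by its file path: `document_prefix` and `extractor_inputs` are hypothetical helpers, not part of neo4j_graphrag, and the hashing scheme is only one possible choice.

```python
import hashlib
from pathlib import Path


def document_prefix(path: str) -> str:
    """Hypothetical helper: build a stable, document-specific id prefix."""
    # Hash the full path so two files with the same name in different
    # folders still get different prefixes.
    digest = hashlib.sha256(path.encode("utf-8")).hexdigest()[:8]
    return f"{Path(path).stem}-{digest}"


def extractor_inputs(path: str, make_config) -> dict:
    """Build the "extractor" run parameters for one document.

    make_config is any callable accepting id_prefix=..., for example
    LexicalGraphConfig from the snippet above.
    """
    return {"lexical_graph_config": make_config(id_prefix=document_prefix(path))}


# Usage sketch (not executed here; needs the pipeline from this thread):
# for path in paths:
#     await pipe.run(data={
#         # ... other component inputs
#         "extractor": extractor_inputs(path, LexicalGraphConfig),
#     })
```

Deriving the prefix from the path keeps it stable across re-runs of the same document, so re-ingesting a file would reuse the same chunk ids instead of creating new ones.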

Let me know if you need more assistance.

stellasia · 2024-11-28T17:38:42Z

Are you using a custom entity and relation extractor?

risafj · 2024-11-29T09:25:52Z

Hi @stellasia ,

Thank you so much for the quick turnaround and helpful response! Your solution worked perfectly!

> Are you using a custom entity and relation extractor?

No, I'm using the one defined in this library:

    from neo4j_graphrag.experimental.components.entity_relation_extractor import (
        LLMEntityRelationExtractor,
        OnError,
    )

    extractor = LLMEntityRelationExtractor(
        llm=llm,
        on_error=OnError.RAISE,
        prompt_template=custom_prompt,
    )

stellasia · 2024-11-29T09:31:51Z

Thank you for raising the issue and for the information. We will investigate this shortly.

stellasia mentioned this issue Nov 29, 2024

Improve Document/Chunk ID management #222

