Pipeline creates Chunks with duplicate ids when executed multiple times #221

Open
risafj opened this issue Nov 28, 2024 · 4 comments

@risafj

risafj commented Nov 28, 2024

When I run the Pipeline() in a loop over multiple documents, each run creates a Chunk node with an id property of ":1" and an index of 1. This causes problems, since the ids are no longer unique across documents.

For example, when the lexical graph gets created, a Chunk node with an id of ":1" has a NEXT_NODE relation to every Chunk node that has an id of ":2".

After running the pipeline with 4 documents, it looks like this:

[Screenshot: Neo4j_Aura]

The same issue is occurring with FROM_CHUNK, where an entity that's supposed to have a relation like (n:Entity)-[:FROM_CHUNK]->(c:Chunk {id: ":1", index: "1"}) actually has that relation to all documents' chunks with an index of 1.

Is there any workaround for this?
I'm guessing this issue would be solved if I could somehow pass a document-specific id_prefix so that each chunk gets a unique id?

    def chunk_id(self, chunk_index: int) -> str:
        return f"{self.config.id_prefix}:{chunk_index}"
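
The collision can be reproduced without the library. The following is a minimal sketch, assuming the default id_prefix is an empty string (an inference from the ":1" ids reported above, not confirmed from the library source). `SketchConfig` and `ids_for_run` are hypothetical stand-ins, and the starting chunk index is arbitrary.

```python
from dataclasses import dataclass


@dataclass
class SketchConfig:
    # Hypothetical stand-in for the library's config object.
    # The empty default prefix is an assumption based on the ":1" ids above.
    id_prefix: str = ""


def chunk_id(config: SketchConfig, chunk_index: int) -> str:
    # Same format string as the method quoted above.
    return f"{config.id_prefix}:{chunk_index}"


def ids_for_run(config: SketchConfig, n_chunks: int) -> list[str]:
    # One pipeline run over one document yields one id per chunk.
    return [chunk_id(config, i) for i in range(n_chunks)]


# Two separate documents, both processed with the same (default) prefix.
run_a = ids_for_run(SketchConfig(), 3)
run_b = ids_for_run(SketchConfig(), 3)

# Every id from the first run reappears in the second run.
shared = set(run_a) & set(run_b)
print(sorted(shared))
```

Because the id depends only on the prefix and the chunk index, any two runs that share a prefix produce the same ids, which is consistent with the merged NEXT_NODE and FROM_CHUNK relations described here.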

Additional info:
I use v1.2.0.
I have a standard pipeline setup that has these components.

    pipe = Pipeline()
    # skipping the config code
    pipe.add_component(text_splitter, "splitter")
    pipe.add_component(embedder, "chunk_embedder")
    pipe.add_component(schema_builder, "schema")
    pipe.add_component(extractor, "extractor")
    pipe.add_component(writer, "writer")
    pipe.add_component(resolver, "resolver")
@stellasia
Contributor

Hi @risafj ,

Indeed, this behavior is quite annoying. We'll take a closer look.

In the meantime, you can control this prefix by setting it in a LexicalGraphConfig, which is a run parameter of the entity and relation extractor.

So your code will look like this:

from neo4j_graphrag.experimental.components.types import LexicalGraphConfig

config = LexicalGraphConfig(
    id_prefix="myPrefix",
)

await pipe.run(data={
   # ...
   "extractor": {
      # ...
      "lexical_graph_config": config,
   }
})

Let me know if you need more assistance.
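
For the loop-over-documents case, one possible way to get a distinct prefix per document is to derive it from something that identifies the document. This is a sketch only: `prefix_for_document`, the `documents` list, and the choice of a truncated SHA-1 digest are illustrative assumptions, not part of the library. The local `chunk_id` mirrors the format string quoted in the issue so that the result can be checked without a database.

```python
import hashlib


def prefix_for_document(doc_key: str) -> str:
    """Derive a short, stable prefix from a document key (path, URL, ...)."""
    return hashlib.sha1(doc_key.encode("utf-8")).hexdigest()[:12]


def chunk_id(id_prefix: str, chunk_index: int) -> str:
    # Mirrors the id format quoted in the issue: "<prefix>:<index>".
    return f"{id_prefix}:{chunk_index}"


# Hypothetical document keys; in practice these would be file paths or URLs.
documents = ["reports/a.pdf", "reports/b.pdf", "reports/c.pdf"]

prefixes = {doc: prefix_for_document(doc) for doc in documents}

# Simulate three chunks per document and collect every resulting id.
all_ids = [chunk_id(prefixes[doc], i) for doc in documents for i in range(3)]

# In the real loop, each prefix would instead be passed as
# LexicalGraphConfig(id_prefix=prefixes[doc]) in the "extractor" run
# parameters, as shown in the snippet above.
print(len(all_ids), len(set(all_ids)))
```

A hash-based prefix stays the same when the same document is processed again, so a re-run would reuse the existing chunk ids. A random value such as `uuid.uuid4().hex` would give fresh ids on every run instead. Which behavior is preferable depends on whether re-ingesting a document should update or duplicate its chunks.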

@stellasia
Contributor

Are you using a custom entity and relation extractor?

@risafj
Author

risafj commented Nov 29, 2024

Hi @stellasia ,

Thank you so much for the quick turnaround and helpful response! Your solution worked perfectly!

> Are you using a custom entity and relation extractor?

No, I'm using the one defined in this library:

from neo4j_graphrag.experimental.components.entity_relation_extractor import (
    LLMEntityRelationExtractor, OnError)

extractor = LLMEntityRelationExtractor(
    llm=llm,
    on_error=OnError.RAISE,
    prompt_template=custom_prompt,
)

@stellasia
Contributor

Thank you for raising the issue and for the additional information. We will investigate this shortly.
