CrateDB vector: Refactor SQLAlchemy data model to provide two storage strategies #20
base: cratedb
Conversation
```diff
@@ -247,6 +308,7 @@ def test_cratedb_with_filter_match() -> None:
     assert output in [
         [(Document(page_content="foo", metadata={"page": "0"}), 2.1307645)],
         [(Document(page_content="foo", metadata={"page": "0"}), 2.3150668)],
+        [(Document(page_content="foo", metadata={"page": "0"}), 2.4458315)],
```
@matriv: That's just been a fluke of the `_score` nondeterminism again ;], so I just added another variant to the list of possible outcomes.
Is there a possibility to use an approximation for the score?
https://stackoverflow.com/questions/8560131/pytest-assert-almost-equal
That is a good idea. I've submitted a corresponding patch. Thank you.
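For reference, a minimal sketch of what such an approximate score assertion could look like with `pytest.approx`; the fixture name, the query, and the tolerance are illustrative assumptions, not the actual patch:

```python
import pytest
from langchain.docstore.document import Document


def test_cratedb_with_filter_match(docsearch) -> None:
    # `docsearch` is assumed to be a fixture providing a populated
    # CrateDBVectorSearch store; illustrative only.
    output = docsearch.similarity_search_with_score("foo", k=1, filter={"page": "0"})
    [(document, score)] = output
    # Compare the document exactly, but the score only approximately,
    # to tolerate CrateDB's nondeterministic `_score` values.
    assert document == Document(page_content="foo", metadata={"page": "0"})
    assert score == pytest.approx(2.3, abs=0.3)
```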
The implementation is based on the `pgvector` adapter, as both PostgreSQL and CrateDB share similar attributes, and can be wrapped well by using the same SQLAlchemy layer on top.
The implementation is based on the generic SQLAlchemy document loader.
The implementation is based on the generic `SQLChatMessageHistory`.
When no embeddings were added upfront, the runtime model factory was not able to derive the vector dimension size, because the SQLAlchemy models had not been initialized correctly.
From now on, _all_ instances of SQLAlchemy model types will be created at runtime through the `ModelFactory` utility. By using `__table_args__ = {"keep_existing": True}` on the ORM entity definitions, this seems to work well, even with multiple invocations of `CrateDBVectorSearch.from_texts()` using different `collection_name` argument values. While at it, this patch also fixes a few linter errors.
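A minimal sketch of the pattern, assuming SQLAlchemy's declarative API; the factory method name and the column layout are illustrative and not the adapter's actual model:

```python
from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class ModelFactory:
    """Create ORM model types at runtime, once table names (and dimensions) are known."""

    def embedding_store(self, table_name: str):
        class EmbeddingStore(Base):
            __tablename__ = table_name
            # `keep_existing=True` lets repeated factory invocations reuse a
            # table that is already registered in the metadata, instead of
            # raising "Table ... is already defined for this MetaData instance".
            __table_args__ = {"keep_existing": True}

            id = Column(Integer, primary_key=True)
            document = Column(String)

        return EmbeddingStore
```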
When deleting a collection, also delete its associated embeddings.
It is a special adapter which provides similarity search across multiple collections. It cannot be used for indexing documents.
The CrateDB adapter works a bit differently than the pgvector adapter it builds upon: the dimensionality of the vector field needs to be specified at table creation time, but it is a runtime parameter in LangChain, so table creation needs to be delayed. In some cases, the tables do not exist yet; this is only relevant when the user requests to pre-delete the collection, using the `pre_delete_collection` argument. So, do the error handling only there, and _not_ in the generic data model utility functions.
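A sketch of that narrowly scoped error handling, assuming plain SQLAlchemy; the function name is hypothetical:

```python
import sqlalchemy as sa


def pre_delete_collection(engine: sa.engine.Engine, embedding_table: sa.Table) -> None:
    # Table creation is delayed until the vector dimension is known, so the
    # embedding table may not exist yet. Tolerate that only here, at the
    # pre-delete call site, not in the generic data model utilities.
    try:
        embedding_table.drop(engine)
    except sa.exc.ProgrammingError:
        pass
```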
The performance gains can be substantial.
The test cases can be written substantially more elegantly.
- `StorageStrategy.LANGCHAIN_PGVECTOR`: Reflects the vanilla way the pgvector adapter manages the data model: there is a single `collection` table and a single `embedding` table.
- `StorageStrategy.EMBEDDING_TABLE_PER_COLLECTION`: Reflects a more advanced way to manage the data model: there is a single `collection` table, and multiple `embedding` tables, one per collection.

The default storage strategy is `LANGCHAIN_PGVECTOR`. To configure an alternative storage strategy, invoke this snippet before doing any other operations using `CrateDBVectorSearch`:

```python
CrateDBVectorSearch.configure(
    storage_strategy=StorageStrategy.EMBEDDING_TABLE_PER_COLLECTION
)
```
When you delete a collection, would it make sense to drop the table of this collection when using this new storage strategy? A use case for why deleting it might be good: someone adds embeddings, but then recognizes that they need different embedding dimensions. I assume they would delete the collection and re-add the new embeddings to a collection with the same name - this is then not possible, as the old embedding table is still there (with the now wrong embedding dimensions). What do you think about this?

Other than that, I did some monkey testing and found the strategy reliable. Querying, deleting, being a little stupid all seem to result in the expected outcome. 👍

```python
CrateDBVectorSearch.configure(
    storage_strategy=StorageStrategy.LANGCHAIN_PGVECTOR
)
vector_store = CrateDBVectorSearch.from_documents(
    pages,
    OpenAIEmbeddings(),
    collection_name="langchain_2023_11_30",
    connection_string=CONNECTION_STRING,
)
# Here I'd suspect that the table for this embedding is dropped instead of
# emptied (but only in this multi storage strategy; for the default strategy
# this does not apply).
vector_store.delete_collection()
```
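A rough sketch of what the suggested behavior could look like; the helper name, the `per_collection` flag, and the column names (`uuid`, `collection_id`) are hypothetical, not the adapter's actual API:

```python
import sqlalchemy as sa


def delete_collection(engine: sa.engine.Engine, collection_table: sa.Table,
                      embedding_table: sa.Table, collection_id: str,
                      per_collection: bool) -> None:
    with engine.begin() as connection:
        # Remove the collection record itself.
        connection.execute(
            sa.delete(collection_table).where(collection_table.c.uuid == collection_id)
        )
        if per_collection:
            # Drop the collection's dedicated embedding table, so a collection
            # of the same name can be re-created with different dimensions.
            embedding_table.drop(connection, checkfirst=True)
        else:
            # Default strategy: the embedding table is shared, so only remove
            # the rows belonging to this collection.
            connection.execute(
                sa.delete(embedding_table).where(
                    embedding_table.c.collection_id == collection_id
                )
            )
```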
Wonderful.
That's a good suggestion, thank you.
About
This patch takes a stab at GH-12, aiming to implement a more advanced data model for storing collections vs. associated embeddings. As suggested by @ckurze, it follows an "embedding-table-per-collection" strategy, instead of storing all embedding vectors in the same database table.
Installation
🚧 The preview version can be installed directly from the branch `embedding-table-per-collection` on GitHub. 🚧
Usage
The default storage strategy is `LANGCHAIN_PGVECTOR`, which is the same as before. To configure an alternative storage strategy, invoke this snippet before doing any other operations on `CrateDBVectorSearch`:
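For example, to select the per-collection strategy:

```python
CrateDBVectorSearch.configure(
    storage_strategy=StorageStrategy.EMBEDDING_TABLE_PER_COLLECTION
)
```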
Details
Those strategies are implemented. The default is `LANGCHAIN_PGVECTOR`.

- `StorageStrategy.LANGCHAIN_PGVECTOR`: Reflects the vanilla way the pgvector adapter manages the data model: there is a single `collection` table and a single `embedding` table.
- `StorageStrategy.EMBEDDING_TABLE_PER_COLLECTION`: Reflects a more advanced way to manage the data model: there is a single `collection` table, and multiple `embedding` tables, one per collection.

Backlog
- `CrateDBVectorSearchMultiCollection` #15 is currently not supported by the new `EMBEDDING_TABLE_PER_COLLECTION` storage strategy.
- Track each collection's embedding table on the `CollectionStore` database entity, in a new column, `embedding_table`.
- Consider not using the `collection` table at all; no index table is needed.
- Consider dropping the `cmetadata` field. It isn't used at all within the other implementations, so it is considered to be legacy.