[NOID] Fixes #4087 #4080 #4231: Add vector info procedures (#4142) and added Milvus and Pinecone support #4264

Draft
wants to merge 3 commits into
base: 4.4
1 change: 1 addition & 0 deletions LICENSES.txt
@@ -3061,6 +3061,7 @@ MIT
jnr-x86asm-1.0.2.jar
jsoup-1.15.3.jar
localstack-1.17.6.jar
milvus-1.19.7.jar
mockito-core-3.12.4.jar
mssql-jdbc-6.2.1.jre7.jar
mysql-1.17.6.jar
1 change: 1 addition & 0 deletions NOTICE.txt
@@ -462,6 +462,7 @@ MIT
jnr-x86asm-1.0.2.jar
jsoup-1.15.3.jar
localstack-1.17.6.jar
milvus-1.19.7.jar
mockito-core-3.12.4.jar
mssql-jdbc-6.2.1.jre7.jar
mysql-1.17.6.jar
@@ -7,6 +7,7 @@ note that the list and the signature procedures are consistent with the others,
[opts=header, cols="1, 3"]
|===
| name | description
| apoc.vectordb.chroma.info(hostOrKey, collection, $config) | Get information about the specified existing collection or throws a 500 error if it does not exist
| apoc.vectordb.chroma.createCollection(hostOrKey, collection, similarity, size, $config) |
Creates a collection, with the name specified in the 2nd parameter, and with the specified `similarity` and `size`.
The default endpoint is `<hostOrKey param>/api/v1/collections`.
@@ -38,6 +39,19 @@ With hostOrKey=null, the default is 'http://localhost:8000'.

=== Examples

.Get collection info (it leverages https://docs.trychroma.com/reference/py-client#get_collection[this API])
[source,cypher]
----
CALL apoc.vectordb.chroma.info(hostOrKey, 'test_collection', {<optional config>})
----

.Example results
[opts="header"]
|===
| value
| {"name": "test_collection", "metadata": {"size": 4, "hnsw:space": "cosine"}, "database": "default_database", "id": "74ebe008-1ccb-4d3d-8c5d-cdd7cfa526c2", "tenant": "default_tenant"}
|===
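
Since the procedure returns a single `value` map, individual fields can be extracted with plain Cypher. The following is a minimal sketch, assuming a Chroma instance reachable at the default host (hence `hostOrKey` is `null`) and a result shaped like the example above; the column aliases are chosen for illustration only.

[source,cypher]
----
// Read selected fields from the returned map.
// The key `hnsw:space` contains a colon, so it is accessed with bracket syntax.
CALL apoc.vectordb.chroma.info(null, 'test_collection', {})
YIELD value
RETURN value.name AS name,
       value.metadata['hnsw:space'] AS similarity,
       value.metadata.size AS size
----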

.Create a collection (it leverages https://docs.trychroma.com/usage-guide#creating-inspecting-and-deleting-collections[this API])
[source,cypher]
----
@@ -49,15 +49,17 @@ See the following pages for more details on specific vector db procedures
- xref:./qdrant.adoc[Qdrant]
- xref:./chroma.adoc[ChromaDB]
- xref:./weaviate.adoc[Weaviate]
- xref:./pinecone.adoc[Pinecone]
- xref:./milvus.adoc[Milvus]


== Store Vector db info (i.e. `apoc.vectordb.configure`)

We can save some info in the System Database to be reused later, namely the host, login credentials, and mapping,
to be used in the `*.get` and `*.query` procedures, except for `apoc.vectordb.custom.get`.

Therefore, to store the vector info, we can execute `CALL apoc.vectordb.configure(vectorName, keyConfig, databaseName, $configMap)`,
where `vectorName` can be "QDRANT", "CHROMA", "PINECONE", "MILVUS" or "WEAVIATE",
which indicates info to be reused respectively by `apoc.vectordb.qdrant.*`, `apoc.vectordb.chroma.*`, `apoc.vectordb.pinecone.*`, `apoc.vectordb.milvus.*` and `apoc.vectordb.weaviate.*`.

Then `keyConfig` is the configuration name, `databaseName` is the database where the config will be set,
@@ -0,0 +1,225 @@

== Pinecone

Here is a list of all available Pinecone procedures:

[opts=header, cols="1, 3"]
|===
| name | description
| apoc.vectordb.pinecone.createCollection(hostOrKey, index, similarity, size, $config) |
Creates an index, with the name specified in the 2nd parameter, and with the specified `similarity` and `size`.
The default endpoint is `<hostOrKey param>/indexes`.
| apoc.vectordb.pinecone.deleteCollection(hostOrKey, index, $config) |
Deletes an index with the name specified in the 2nd parameter.
The default endpoint is `<hostOrKey param>/indexes/<collection param>`.
| apoc.vectordb.pinecone.upsert(hostOrKey, index, vectors, $config) |
Upserts, in the index with the name specified in the 2nd parameter, the vectors [{id: 'id', vector: '<vectorDb>', metadata: '<metadata>'}].
The default endpoint is `<hostOrKey param>/vectors/upsert`.
| apoc.vectordb.pinecone.delete(hostOrKey, index, ids, $config) |
Delete the vectors with the specified `ids`.
The default endpoint is `<hostOrKey param>/indexes/<collection param>`.
| apoc.vectordb.pinecone.get(hostOrKey, index, ids, $config) |
Get the vectors with the specified `ids`.
The default endpoint is `<hostOrKey param>/vectors/fetch`.
| apoc.vectordb.pinecone.getAndUpdate(hostOrKey, index, ids, $config) |
Get the vectors with the specified `ids`, and optionally creates/updates neo4j entities.
The default endpoint is `<hostOrKey param>/vectors/fetch`.
| apoc.vectordb.pinecone.query(hostOrKey, index, vector, filter, limit, $config) |
Retrieves the closest vectors to the defined `vector`, up to `limit` results, in the index with the name specified in the 2nd parameter.
The default endpoint is `<hostOrKey param>/query`.
| apoc.vectordb.pinecone.queryAndUpdate(hostOrKey, index, vector, filter, limit, $config) |
Retrieves the closest vectors to the defined `vector`, up to `limit` results, in the index with the name specified in the 2nd parameter, and optionally creates/updates neo4j entities.
The default endpoint is `<hostOrKey param>/query`.
|===

where the 1st parameter can be a key defined by the apoc config `apoc.pinecone.<key>.host=myHost`.
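
As a minimal sketch of the key-based form, assume the APOC configuration contains `apoc.pinecone.myPinecone.host=<index host shown in the Pinecone dashboard>`, where `myPinecone` is a hypothetical key name chosen for illustration. Under that assumption, the key can be passed in place of the host:

[source,cypher]
----
// 'myPinecone' is an assumed key, resolved via apoc.pinecone.myPinecone.host
CALL apoc.vectordb.pinecone.get('myPinecone', 'test-index', ['1', '2'], {})
----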

[NOTE]
====
The procedures create/drop/handle an index, instead of a collection like the other vectordb procedures,
since in Pinecone a collection is a static and non-queryable copy of an index.

Nevertheless, the create / delete index procedures are named `.createCollection` and `.deleteCollection` to be consistent with the others.
====


The default `hostOrKey` is `"https://api.pinecone.io"`,
so it can generally be null with the `createCollection` and `deleteCollection` procedures.
With the other procedures, it should be the index host name, that is, the one shown in the Pinecone dashboard:

image::pinecone-index.png[width=800]


=== Examples

The following examples assume we want to create and manage an index called `test-index`.

.Create an index (it leverages https://docs.pinecone.io/reference/api/control-plane/create_index[this API])
[source,cypher]
----
CALL apoc.vectordb.pinecone.createCollection(null, 'test-index', 'cosine', 4, {<optional config>})
----


.Delete an index (it leverages https://docs.pinecone.io/reference/api/control-plane/delete_index[this API])
[source,cypher]
----
CALL apoc.vectordb.pinecone.deleteCollection(null, 'test-index', {<optional config>})
----


.Upsert vectors (it leverages https://docs.pinecone.io/reference/api/data-plane/upsert[this API])
[source,cypher]
----
CALL apoc.vectordb.pinecone.upsert('https://test-index-ilx67g5.svc.aped-4627-b74a.pinecone.io',
'test-index',
[
{id: '1', vector: [0.05, 0.61, 0.76, 0.74], metadata: {city: "Berlin", foo: "one"}},
{id: '2', vector: [0.19, 0.81, 0.75, 0.11], metadata: {city: "London", foo: "two"}}
],
{<optional config>})
----


.Get vectors (it leverages https://docs.pinecone.io/reference/api/data-plane/fetch[this API])

[source,cypher]
----
CALL apoc.vectordb.pinecone.get($host, 'test-index', ['1','2'], {<optional config>})
----


.Example results
[opts="header"]
|===
| score | metadata | id | vector | text | entity
| null | {city: "Berlin", foo: "one"} | null | null | null | null
| null | {city: "London", foo: "two"} | null | null | null | null
| ...
|===

.Get vectors with `{allResults: true}`
[source,cypher]
----
CALL apoc.vectordb.pinecone.get($host, 'test-index', ['1','2'], {allResults: true, <optional config>})
----


.Example results
[opts="header"]
|===
| score | metadata | id | vector | text | entity
| null | {city: "Berlin", foo: "one"} | 1 | [...] | null | null
| null | {city: "London", foo: "two"} | 2 | [...] | null | null
| ...
|===

.Query vectors (it leverages https://docs.pinecone.io/reference/api/data-plane/query[this API])
[source,cypher]
----
CALL apoc.vectordb.pinecone.query($host,
'test-index',
[0.2, 0.1, 0.9, 0.7],
{ city: { `$eq`: "London" } },
5,
{allResults: true, <optional config>})
----


.Example results
[opts="header"]
|===
| score | metadata | id | vector | text | entity
| 1 | {city: "Berlin", foo: "one"} | 1 | [...] | null | null
| 0.1 | {city: "London", foo: "two"} | 2 | [...] | null | null
| ...
|===


We can define a mapping, to auto-create one/multiple nodes and relationships, by leveraging the vector metadata.

For example, if we have created 2 vectors with the above upsert procedure,
we can populate some existing nodes (i.e. `(:Test {myId: 'one'})` and `(:Test {myId: 'two'})`):


[source,cypher]
----
CALL apoc.vectordb.pinecone.queryAndUpdate($host, 'test-index',
[0.2, 0.1, 0.9, 0.7],
{},
5,
{ mapping: {
embeddingKey: "vect",
nodeLabel: "Test",
entityKey: "myId",
metadataKey: "foo"
}
})
----

which populates the two nodes as: `(:Test {myId: 'one', city: 'Berlin', vect: [vector1]})` and `(:Test {myId: 'two', city: 'London', vect: [vector2]})`,
which will be returned in the `entity` column result.
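
To check the outcome, the updated nodes can be read back with plain Cypher. This is only a sketch, assuming the `Test` label and the `myId`, `city` and `vect` property names used in the mapping above:

[source,cypher]
----
// Read back the nodes populated by queryAndUpdate
MATCH (n:Test)
WHERE n.myId IN ['one', 'two']
RETURN n.myId AS myId, n.city AS city, n.vect AS vect
----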


Alternatively, we can create a node if it does not exist, via `create: true`:

[source,cypher]
----
CALL apoc.vectordb.pinecone.queryAndUpdate($host, 'test-index',
[0.2, 0.1, 0.9, 0.7],
{},
5,
{ mapping: {
create: true,
embeddingKey: "vect",
nodeLabel: "Test",
entityKey: "myId",
metadataKey: "foo"
}
})
----

which creates 2 new nodes as above.

Or, we can populate existing relationships (i.e. `(:Start)-[:TEST {myId: 'one'}]->(:End)` and `(:Start)-[:TEST {myId: 'two'}]->(:End)`):


[source,cypher]
----
CALL apoc.vectordb.pinecone.queryAndUpdate($host, 'test-index',
[0.2, 0.1, 0.9, 0.7],
{},
5,
{ mapping: {
embeddingKey: "vect",
relType: "TEST",
entityKey: "myId",
metadataKey: "foo"
}
})
----

which populates the two relationships as: `()-[:TEST {myId: 'one', city: 'Berlin', vect: [vector1]}]-()`
and `()-[:TEST {myId: 'two', city: 'London', vect: [vector2]}]-()`,
which will be returned in the `entity` column result.

[NOTE]
====
We can use the mapping with the `apoc.vectordb.pinecone.getAndUpdate` procedure as well.
====
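
For instance, a sketch of the same node mapping applied to `getAndUpdate`, assuming the two vectors upserted above and the `getAndUpdate(hostOrKey, index, ids, $config)` signature listed in the procedure table:

[source,cypher]
----
// Fetch vectors by id and populate the matching (:Test) nodes
CALL apoc.vectordb.pinecone.getAndUpdate($host, 'test-index', ['1','2'],
  { mapping: {
      embeddingKey: "vect",
      nodeLabel: "Test",
      entityKey: "myId",
      metadataKey: "foo"
    }
  })
----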

[NOTE]
====
To optimize performance, we can choose what to `YIELD` with the `apoc.vectordb.pinecone.query*` and the `apoc.vectordb.pinecone.get*` procedures.

For example, by executing `CALL apoc.vectordb.pinecone.query(...) YIELD metadata, score, id`, the REST API request will include {"with_payload": false, "with_vectors": false},
so that we do not return the other values that we do not need.
====
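
A minimal sketch of such a restricted call, reusing the query vector from the examples above; the empty filter and the ordering clause are illustrative choices, not requirements of the procedure:

[source,cypher]
----
// Yield only the columns that are needed
CALL apoc.vectordb.pinecone.query($host, 'test-index', [0.2, 0.1, 0.9, 0.7], {}, 5, {})
YIELD metadata, score, id
RETURN id, score, metadata
ORDER BY score DESC
----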



.Delete vectors (it leverages https://docs.pinecone.io/reference/api/data-plane/delete[this API])
[source,cypher]
----
CALL apoc.vectordb.pinecone.delete($host, 'test-index', ['1','2'], {<optional config>})
----
@@ -6,6 +6,7 @@ note that the list and the signature procedures are consistent with the others,
[opts=header, cols="1, 3"]
|===
| name | description
| apoc.vectordb.qdrant.info(hostOrKey, collection, $config) | Get information about the specified existing collection or throws a FileNotFoundException if it does not exist
| apoc.vectordb.qdrant.createCollection(hostOrKey, collection, similarity, size, $config) |
Creates a collection, with the name specified in the 2nd parameter, and with the specified `similarity` and `size`.
The default endpoint is `<hostOrKey param>/collections/<collection param>`.
@@ -38,6 +39,29 @@ With hostOrKey=null, the default is 'http://localhost:6333'.

=== Examples

.Get collection info (it leverages https://qdrant.github.io/qdrant/redoc/index.html#tag/collections/operation/get_collection[this API])
[source,cypher]
----
CALL apoc.vectordb.qdrant.info(hostOrKey, 'test_collection', {<optional config>})
----

.Example results
[opts="header"]
|===
| value
| {"result": {"optimizer_status": "ok", "points_count": 2, "vectors_count": 2, "segments_count": 8, "indexed_vectors_count": 0,
"config": {"params": {"on_disk_payload": true, "vectors": {"size": 4, "distance": "Cosine"}, "shard_number": 1, "replication_factor": 1, "write_consistency_factor": 1},
"optimizer_config": {"max_optimization_threads": 1, "indexing_threshold": 20000, "deleted_threshold": 0.2, "flush_interval_sec": 5, "memmap_threshold": null, "default_segment_number": 0, "max_segment_size": null, "vacuum_min_vector_number": 1000}, "quantization_config": null,
"hnsw_config": {"max_indexing_threads": 0, "full_scan_threshold": 10000, "ef_construct": 100, "m": 16, "on_disk": false},
"wal_config": {"wal_segments_ahead": 0, "wal_capacity_mb": 32}
},
"status": "green",
"payload_schema": {}
},
"time": 1.2725E-4, "status": "ok"
}
|===
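
The nested `value` map can be unpacked with dot notation. The following sketch assumes a Qdrant instance at the default host (hence `hostOrKey` is `null`) and a response shaped like the example above; the aliases are illustrative.

[source,cypher]
----
// Extract the collection status and vector parameters from the response
CALL apoc.vectordb.qdrant.info(null, 'test_collection', {})
YIELD value
RETURN value.result.status AS status,
       value.result.points_count AS points,
       value.result.config.params.vectors.size AS size,
       value.result.config.params.vectors.distance AS distance
----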

.Create a collection (it leverages https://qdrant.github.io/qdrant/redoc/index.html#tag/collections/operation/create_collection[this API])
[source,cypher]
----
@@ -6,6 +6,7 @@ note that the list and the signature procedures are consistent with the others,
[opts=header, cols="1, 3"]
|===
| name | description
| apoc.vectordb.weaviate.info(hostOrKey, collection, $config) | Get information about the specified existing collection or throws a FileNotFoundException if it does not exist
| apoc.vectordb.weaviate.createCollection(hostOrKey, collection, similarity, size, $config) |
Creates a collection, with the name specified in the 2nd parameter, and with the specified `similarity` and `size`.
The default endpoint is `<hostOrKey param>/schema`.
@@ -39,6 +40,33 @@ With hostOrKey=null, the default is 'http://localhost:8080/v1'.

=== Examples

.Get collection info (it leverages https://weaviate.io/developers/weaviate/api/rest#tag/schema/get/schema/{className}[this API])
[source, cypher]
----
CALL apoc.vectordb.weaviate.info($host, 'test_collection', {<optional config>})
----

.Example results
[opts="header"]
|===
| value
| {"vectorizer": "none",
"invertedIndexConfig": {"bm25": {"b": 0.75, "k1": 1.2}, "stopwords": {"additions": null, "removals": null, "preset": "en"}, "cleanupIntervalSeconds": 60},
"vectorIndexConfig": {"ef": -1, "dynamicEfMin": 100, "pq": {"centroids": 256, "trainingLimit": 100000, "encoder": {"type": "kmeans", "distribution": "log-normal"},
"enabled": false, "bitCompression": false, "segments": 0
},
"distance": "cosine", "skip": false, "dynamicEfFactor": 8, "bq": {"enabled": false},
"vectorCacheMaxObjects": 1000000000000, "cleanupIntervalSeconds": 300, "dynamicEfMax": 500, "efConstruction": 128, "flatSearchCutoff": 40000, "maxConnections": 64},
"multiTenancyConfig": {"enabled": false},
"vectorIndexType": "hnsw", "replicationConfig": {"factor": 1},
"shardingConfig": {"desiredVirtualCount": 128, "desiredCount": 1, "actualCount": 1, "function": "murmur3", "virtualPerPhysical": 128, "strategy": "hash", "actualVirtualCount": 128, "key": "_id"},
"class": "TestCollection",
"properties": [{"name": "city", "description": "This property was generated by Weaviate's auto-schema feature on Wed Jul 10 12:50:18 2024", "indexFilterable": true, "tokenization": "word", "indexSearchable": true, "dataType": ["text"]},
{"name": "foo", "description": "This property was generated by Weaviate's auto-schema feature on Wed Jul 10 12:50:18 2024", "indexFilterable": true, "tokenization": "word", "indexSearchable": true, "dataType": ["text"]}
]
}
|===
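
As a sketch, assuming a response shaped like the example above, the class name, distance metric and property names could be pulled out of the `value` map as follows (the aliases are illustrative):

[source,cypher]
----
// Summarize the schema returned for the collection
CALL apoc.vectordb.weaviate.info($host, 'test_collection', {})
YIELD value
RETURN value['class'] AS className,
       value.vectorIndexConfig.distance AS distance,
       [p IN value.properties | p.name] AS propertyNames
----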

.Create a collection (it leverages https://weaviate.io/developers/weaviate/api/rest#tag/schema/post/schema[this API])
[source,cypher]
----