[NOID] Fixes #4080: Add Pinecone and Milvus support (#4088)
* Fixes #4080: Add Pinecone and Milvus support

* small code refactoring

* changes review

* fix PineconeTest without env vars
vga91 committed Dec 18, 2024
1 parent 6f0da58 commit d394eab
Show file tree
Hide file tree
Showing 29 changed files with 2,067 additions and 126 deletions.
1 change: 1 addition & 0 deletions LICENSES.txt
@@ -3061,6 +3061,7 @@ MIT
jnr-x86asm-1.0.2.jar
jsoup-1.15.3.jar
localstack-1.17.6.jar
milvus-1.19.7.jar
mockito-core-3.12.4.jar
mssql-jdbc-6.2.1.jre7.jar
mysql-1.17.6.jar
1 change: 1 addition & 0 deletions NOTICE.txt
@@ -462,6 +462,7 @@ MIT
jnr-x86asm-1.0.2.jar
jsoup-1.15.3.jar
localstack-1.17.6.jar
milvus-1.19.7.jar
mockito-core-3.12.4.jar
mssql-jdbc-6.2.1.jre7.jar
mysql-1.17.6.jar
@@ -49,15 +49,17 @@ See the following pages for more details on specific vector db procedures
- xref:./qdrant.adoc[Qdrant]
- xref:./chroma.adoc[ChromaDB]
- xref:./weaviate.adoc[Weaviate]
- xref:./pinecone.adoc[Pinecone]
- xref:./milvus.adoc[Milvus]


== Store Vector db info (i.e. `apoc.vectordb.configure`)

We can save some connection info (the host, login credentials, and mapping) in the System Database,
to be reused later by the `*.get` and `*.query` procedures, except for the `apoc.vectordb.custom.get` one.

Therefore, to store the vector info, we can execute the `CALL apoc.vectordb.configure(vectorName, keyConfig, databaseName, $configMap)`,
where `vectorName` can be "QDRANT", "CHROMA" or "WEAVIATE",
where `vectorName` can be "QDRANT", "CHROMA", "PINECONE", "MILVUS" or "WEAVIATE",
that indicates info to be reused respectively by `apoc.vectordb.qdrant.*`, `apoc.vectordb.chroma.*` and `apoc.vectordb.weaviate.*`.

Then `keyConfig` is the configuration name, `databaseName` is the database where the config will be set,
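
For instance, a Pinecone configuration could be stored as follows. This is a minimal sketch: the `host`, `credentials`, and `mapping` config keys and the `pinecone-config` key name are assumptions used for illustration, not taken from this commit.

[source,cypher]
----
CALL apoc.vectordb.configure('PINECONE', 'pinecone-config', 'neo4j', {
    host: 'https://test-index-ilx67g5.svc.aped-4627-b74a.pinecone.io',
    credentials: '<apiKey>',
    mapping: { embeddingKey: "vect", nodeLabel: "Test", entityKey: "myId", metadataKey: "foo" }
})
----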
@@ -0,0 +1,225 @@

== Pinecone

Here is a list of all available Pinecone procedures:

[opts=header, cols="1, 3"]
|===
| name | description
| apoc.vectordb.pinecone.createCollection(hostOrKey, index, similarity, size, $config) |
Creates an index, with the name specified in the 2nd parameter, and with the specified `similarity` and `size`.
The default endpoint is `<hostOrKey param>/indexes`.
| apoc.vectordb.pinecone.deleteCollection(hostOrKey, index, $config) |
Deletes an index with the name specified in the 2nd parameter.
The default endpoint is `<hostOrKey param>/indexes/<collection param>`.
| apoc.vectordb.pinecone.upsert(hostOrKey, index, vectors, $config) |
Upserts the vectors `[{id: 'id', vector: '<vectorDb>', metadata: '<metadata>'}]` into the index with the name specified in the 2nd parameter.
The default endpoint is `<hostOrKey param>/vectors/upsert`.
| apoc.vectordb.pinecone.delete(hostOrKey, index, ids, $config) |
Deletes the vectors with the specified `ids`.
The default endpoint is `<hostOrKey param>/indexes/<collection param>`.
| apoc.vectordb.pinecone.get(hostOrKey, index, ids, $config) |
Gets the vectors with the specified `ids`.
The default endpoint is `<hostOrKey param>/vectors/fetch`.
| apoc.vectordb.pinecone.getAndUpdate(hostOrKey, index, ids, $config) |
Gets the vectors with the specified `ids`, and optionally creates/updates Neo4j entities.
The default endpoint is `<hostOrKey param>/vectors/fetch`.
| apoc.vectordb.pinecone.query(hostOrKey, index, vector, filter, limit, $config) |
Retrieves the vectors closest to the defined `vector`, up to `limit` results, in the index with the name specified in the 2nd parameter.
The default endpoint is `<hostOrKey param>/query`.
| apoc.vectordb.pinecone.queryAndUpdate(hostOrKey, index, vector, filter, limit, $config) |
Retrieves the vectors closest to the defined `vector`, up to `limit` results, in the index with the name specified in the 2nd parameter, and optionally creates/updates Neo4j entities.
The default endpoint is `<hostOrKey param>/query`.
|===

where the 1st parameter can be a key defined via the APOC config `apoc.pinecone.<key>.host=myHost`.
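
For example, with a hypothetical key `pineconeKey` declared in `apoc.conf` as `apoc.pinecone.pineconeKey.host=https://test-index-ilx67g5.svc.aped-4627-b74a.pinecone.io` (the key name is invented for illustration), the other procedures could then be called with the key in place of the host, e.g.:

[source,cypher]
----
CALL apoc.vectordb.pinecone.get('pineconeKey', 'test-index', ['1','2'], {allResults: true})
----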

[NOTE]
====
The procedures create/drop/handle an index, instead of a collection like the other vector db procedures,
since in Pinecone a collection is a static and non-queryable copy of an index.
Nevertheless, the create/delete index procedures are named `.createCollection` and `.deleteCollection` to be consistent with the other vector db procedures.
====


The default `hostOrKey` is `"https://api.pinecone.io"`;
therefore it can generally be null with the `createCollection` and `deleteCollection` procedures,
while with the other procedures it should be equal to the index host name, that is, the one indicated in the Pinecone dashboard:

image::pinecone-index.png[width=800]


=== Examples

The following examples assume we want to create and manage an index called `test-index`.

.Create an index (it leverages https://docs.pinecone.io/reference/api/control-plane/create_index[this API])
[source,cypher]
----
CALL apoc.vectordb.pinecone.createCollection(null, 'test-index', 'cosine', 4, {<optional config>})
----


.Delete an index (it leverages https://docs.pinecone.io/reference/api/control-plane/delete_index[this API])
[source,cypher]
----
CALL apoc.vectordb.pinecone.deleteCollection(null, 'test-index', {<optional config>})
----


.Upsert vectors (it leverages https://docs.pinecone.io/reference/api/data-plane/upsert[this API])
[source,cypher]
----
CALL apoc.vectordb.pinecone.upsert('https://test-index-ilx67g5.svc.aped-4627-b74a.pinecone.io',
'test-index',
[
{id: '1', vector: [0.05, 0.61, 0.76, 0.74], metadata: {city: "Berlin", foo: "one"}},
{id: '2', vector: [0.19, 0.81, 0.75, 0.11], metadata: {city: "London", foo: "two"}}
],
{<optional config>})
----


.Get vectors (it leverages https://docs.pinecone.io/reference/api/data-plane/fetch[this API])

[source,cypher]
----
CALL apoc.vectordb.pinecone.get($host, 'test-index', ['1','2'], {<optional config>})
----


.Example results
[opts="header"]
|===
| score | metadata | id | vector | text | entity
| null | {city: "Berlin", foo: "one"} | null | null | null | null
| null | {city: "London", foo: "two"} | null | null | null | null
| ...
|===

.Get vectors with `{allResults: true}`
[source,cypher]
----
CALL apoc.vectordb.pinecone.get($host, 'test-index', ['1','2'], {allResults: true, <optional config>})
----


.Example results
[opts="header"]
|===
| score | metadata | id | vector | text | entity
| null | {city: "Berlin", foo: "one"} | 1 | [...] | null | null
| null | {city: "London", foo: "two"} | 2 | [...] | null | null
| ...
|===

.Query vectors (it leverages https://docs.pinecone.io/reference/api/data-plane/query[this API])
[source,cypher]
----
CALL apoc.vectordb.pinecone.query($host,
'test-index',
[0.2, 0.1, 0.9, 0.7],
{ city: { `$eq`: "London" } },
5,
{allResults: true, <optional config>})
----


.Example results
[opts="header"]
|===
| score | metadata | id | vector | text | entity
| 1 | {city: "Berlin", foo: "one"} | 1 | [...] | null | null
| 0.1 | {city: "London", foo: "two"} | 2 | [...] | null | null
| ...
|===


We can define a mapping to auto-create one or multiple nodes and relationships, by leveraging the vector metadata.

For example, if we have created 2 vectors with the above upsert procedure,
we can populate some existing nodes (i.e. `(:Test {myId: 'one'})` and `(:Test {myId: 'two'})`):


[source,cypher]
----
CALL apoc.vectordb.pinecone.queryAndUpdate($host, 'test-index',
[0.2, 0.1, 0.9, 0.7],
{},
5,
{ mapping: {
embeddingKey: "vect",
nodeLabel: "Test",
entityKey: "myId",
metadataKey: "foo"
}
})
----

which populates the two nodes as: `(:Test {myId: 'one', city: 'Berlin', vect: [vector1]})` and `(:Test {myId: 'two', city: 'London', vect: [vector2]})`,
which will be returned in the `entity` column result.
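
For reference, the pre-existing nodes assumed above could be created beforehand with a statement like the following (a minimal sketch, not part of the original example):

[source,cypher]
----
CREATE (:Test {myId: 'one'}), (:Test {myId: 'two'})
----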


Alternatively, we can create the node if it does not exist, via `create: true`:

[source,cypher]
----
CALL apoc.vectordb.pinecone.queryAndUpdate($host, 'test-index',
[0.2, 0.1, 0.9, 0.7],
{},
5,
{ mapping: {
create: true,
embeddingKey: "vect",
nodeLabel: "Test",
entityKey: "myId",
metadataKey: "foo"
}
})
----

which creates 2 new nodes as above.
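
To check the outcome, we could then inspect the nodes (a sketch; the property names follow the mapping above):

[source,cypher]
----
MATCH (n:Test) RETURN n.myId, n.city, n.vect
----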

Or, we can populate existing relationships (i.e. `(:Start)-[:TEST {myId: 'one'}]->(:End)` and `(:Start)-[:TEST {myId: 'two'}]->(:End)`):


[source,cypher]
----
CALL apoc.vectordb.pinecone.queryAndUpdate($host, 'test-index',
[0.2, 0.1, 0.9, 0.7],
{},
5,
{ mapping: {
embeddingKey: "vect",
relType: "TEST",
entityKey: "myId",
metadataKey: "foo"
}
})
----

which populates the two relationships as: `()-[:TEST {myId: 'one', city: 'Berlin', vect: [vector1]}]-()`
and `()-[:TEST {myId: 'two', city: 'London', vect: [vector2]}]-()`,
which will be returned in the `entity` column result.
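
Similarly, such pre-existing relationships could be created beforehand with (a minimal sketch, not part of the original example):

[source,cypher]
----
CREATE (:Start)-[:TEST {myId: 'one'}]->(:End), (:Start)-[:TEST {myId: 'two'}]->(:End)
----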

[NOTE]
====
We can use the mapping with the `apoc.vectordb.pinecone.getAndUpdate` procedure as well.
====

[NOTE]
====
To optimize performance, we can choose what to `YIELD` with the `apoc.vectordb.pinecone.query*` and the `apoc.vectordb.pinecone.get*` procedures.
For example, by executing `CALL apoc.vectordb.pinecone.query(...) YIELD metadata, score, id`, the REST API request will include `{"with_payload": false, "with_vectors": false}`,
so that the values we do not need are not returned.
====
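
For example (a sketch based on the query example above):

[source,cypher]
----
CALL apoc.vectordb.pinecone.query($host, 'test-index', [0.2, 0.1, 0.9, 0.7], {}, 5, {})
YIELD metadata, score, id
RETURN metadata, score, id
----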



.Delete vectors (it leverages https://docs.pinecone.io/reference/api/data-plane/delete[this API])
[source,cypher]
----
CALL apoc.vectordb.pinecone.delete($host, 'test-index', ['1','2'], {<optional config>})
----
5 changes: 2 additions & 3 deletions full-it/src/test/java/apoc/full/it/vectordb/ChromaDbTest.java
@@ -16,7 +16,6 @@
import static apoc.vectordb.VectorDbTestUtil.assertBerlinResult;
import static apoc.vectordb.VectorDbTestUtil.assertLondonResult;
import static apoc.vectordb.VectorDbTestUtil.assertNodesCreated;
import static apoc.vectordb.VectorDbTestUtil.assertReadOnlyProcWithMappingResults;
import static apoc.vectordb.VectorDbTestUtil.assertRelsCreated;
import static apoc.vectordb.VectorDbTestUtil.dropAndDeleteAll;
import static apoc.vectordb.VectorDbTestUtil.ragSetup;
@@ -284,8 +283,8 @@ public void queryVectorsWithCreateNode() {
"myId",
METADATA_KEY,
"foo",
CREATE_KEY,
true));
MODE_KEY,
MappingMode.CREATE_IF_MISSING.toString()));

testResult(
db,
