From c4a0b93848650301d1ff569089c3b18840743eac Mon Sep 17 00:00:00 2001
From: Cedrick Lunven
Date: Thu, 25 Apr 2024 17:21:00 +0200
Subject: [PATCH] Documentation page for ASTRADB

---
 .../integrations/embedding-stores/astra-db.md | 127 +++++++++++++++++-
 .../integrations/embedding-stores/index.md | 42 +++---
 .../astradb/AstraDBEmbeddingStore.java | 2 +-
 3 files changed, 144 insertions(+), 27 deletions(-)

diff --git a/docs/docs/integrations/embedding-stores/astra-db.md b/docs/docs/integrations/embedding-stores/astra-db.md
index bcf41811ea..87806d439e 100644
--- a/docs/docs/integrations/embedding-stores/astra-db.md
+++ b/docs/docs/integrations/embedding-stores/astra-db.md
@@ -53,9 +53,9 @@ To get a token click the `[Generate Token]` button on the right. It will generat
 
 > The full documentation regarding AstraDB can be found [here](https://docs.datastax.com/en/astra/astra-db-vector/api-reference/dataapiclient.html).
 
-### 2.1. Setup your Collections
+### 2.1. Connecting to Astra
 
-- Client initialization
+The following code shows how to initialize the client and access the database.
 
 ```java
 String token = System.getenv("ASTRA_DB_APPLICATION_TOKEN");
@@ -69,9 +69,126 @@ DataAPIClient client = new DataAPIClient(token);
 Database db = client.getDatabase(astraApiEndpoint);
 ```
 
-### 2.2. Ingestion
+| Class | Description |
+|-----------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------|
+| **[DataAPIClient](https://datastaxdevs.github.io/astra-db-java/latest/com/datastax/astra/client/DataAPIClient.html)** | The main entry point for the Astra client. It lets you create databases and perform administrative operations. |
+| **[Database](https://datastaxdevs.github.io/astra-db-java/latest/com/datastax/astra/client/Database.html)** | Represents a single database, identified by the API endpoint you copied in the previous step. |
+
+- From a tenant (`DataAPIClient`) you can access one or more databases. The `Database` object is the entry point for interacting with a database.
+
+- From a database (`Database`) you can access one or more (logical) namespaces. The default namespace is `default_keyspace`.
+
+### 2.2. AstraDB Collection
+
+A database can have one or more collections. A collection is a logical grouping of data. It can store different types of data and can contain a `$vector` field. Collections can store any kind of information, not only vectors, so they can also back a ChatMemory or any cache you need.
+
+> AstraDB collections can use different types of identifiers for their documents. The default is UUIDv4 (Java `UUID`), but others can be
+> used, such as ObjectId (MongoDB), UUIDv7 (Snowflake), or other identifier types. For the complete list, consult the [documentation](https://docs.datastax.com/en/astra/astra-db-vector/api-reference/collections.html#the-defaultid-option)
+
+In LangChain4j, `AstraDBEmbeddingStore` is associated with one collection that has a `$vector` field. The `$vector` field stores the embeddings. The following code shows how to create such a collection. There is no dedicated field for the text segment of a chunk; by convention, the store uses the field name `content`.
+```java
+// Create a vector collection
+Collection col = db.createCollection("langchain4j_embedding_store",
+        CollectionOptions
+                .builder()
+                .vectorDimension(1536) // must match your embedding model
+                .vectorSimilarity(SimilarityMetric.COSINE)
+                .indexingDeny("content") // do not index the text segment
+                .build());
+```
+
+If the collection already exists (the usual case), you can access it with the following code:
+```java
+Collection col2 = db
+        .getCollection("langchain4j_embedding_store");
+```
+
+### 2.3. Init EmbeddingStore
+
+To initialize the `AstraDBEmbeddingStore`, simply pass the collection object as the argument.
+
+```java
+EmbeddingStore embeddingStore = new AstraDBEmbeddingStore(col);
+```
+
+A utility method such as the following can help create the store:
+
+```java
+/**
+ * Create an AstraDB Embedding Store.
+ */
+EmbeddingStore createEmbeddingStore(
+        String astraToken, String apiEndpoint,
+        String collectionName, int dimension,
+        SimilarityMetric metric) {
+
+    return new AstraDBEmbeddingStore(
+            // AstraDB Client
+            new DataAPIClient(astraToken)
+                    .getDatabase(apiEndpoint)
+                    .createCollection(collectionName, dimension, metric));
+}
+```
+
+### 2.4. Usage
+
+Here is a simple example of how to use the `AstraDBEmbeddingStore`:
+
+```java
+// Given an embedding model
+EmbeddingModel embeddingModel = ...;
+
+// Given a text file on disk
+Path textFile = ....;
+
+// Get AstraDBEmbeddingStore
+EmbeddingStore embeddingStore = createEmbeddingStore(
+        System.getenv("ASTRA_DB_APPLICATION_TOKEN"),
+        System.getenv("ASTRA_DB_API_ENDPOINT"),
+        "langchain4j_embedding_store",
+        1536,
+        SimilarityMetric.COSINE);
+
+// Ingestion
+EmbeddingStoreIngestor.builder()
+        .documentSplitter(recursive(100, 10, new OpenAiTokenizer(GPT_3_5_TURBO)))
+        .embeddingModel(embeddingModel)
+        .embeddingStore(embeddingStore)
+        .build()
+        .ingest(loadDocument(textFile, new TextDocumentParser()));
+
+// Sample Vector Search
+ContentRetriever contentRetriever = EmbeddingStoreContentRetriever.builder()
+        .embeddingModel(embeddingModel)
+        .embeddingStore(embeddingStore)
+        .maxResults(2)
+        .minScore(0.5)
+        .build();
+
+Assistant ai = AiServices.builder(Assistant.class)
+        .contentRetriever(contentRetriever)
+        .chatLanguageModel(initChatLanguageModelOpenAi())
+        .chatMemory(MessageWindowChatMemory.withMaxMessages(10))
+        .build();
+String response = ai.answer("What vegetable is Happy?");
+
+
+// Meta-Data Filtering
+RetrievalAugmentor retrievalAugmentor = DefaultRetrievalAugmentor
+        .builder()
+        .contentRetriever(contentRetriever)
+        .contentInjector(DefaultContentInjector.builder()
+                .metadataKeysToInclude(asList("document_format", "text"))
+                .build())
+        .build();
+
+// Configure the assistant to use the components created above.
+Assistant aiWithMetaData = AiServices.builder(Assistant.class)
+        .retrievalAugmentor(retrievalAugmentor)
+        .chatLanguageModel(getChatLanguageModelChatBison())
+        .chatMemory(MessageWindowChatMemory.withMaxMessages(10))
+        .build();
+```
 
-### 2.3. Vector Search
 
-### 2.4. Meta-Data Filtering
 
diff --git a/docs/docs/integrations/embedding-stores/index.md b/docs/docs/integrations/embedding-stores/index.md
index 03761789db..4386984d6f 100644
--- a/docs/docs/integrations/embedding-stores/index.md
+++ b/docs/docs/integrations/embedding-stores/index.md
@@ -4,24 +4,24 @@ hide_title: false
 sidebar_position: 0
 ---
 
-| Provider | Storing Metadata | Filtering by Metadata | Local | Cloud |
-|---------------------------------------------------------------------------------------|------------------|-----------------------|-------|-------|
-| [In-memory](/integrations/embedding-stores/in-memory) | ✅ | ✅ | | |
-| [Astra DB](/integrations/embedding-stores/astra-db) | ✅ | | | |
-| [Azure AI Search](/integrations/embedding-stores/azure-ai-search) | ✅ | | | |
-| [Azure CosmosDB Mongo vCore](/integrations/embedding-stores/azure-cosmos-mongo-vcore) | ✅ | | | |
-| [Cassandra](/integrations/embedding-stores/cassandra) | ✅ | | | |
-| [Chroma](/integrations/embedding-stores/chroma) | ✅ | | | |
-| [Elasticsearch](/integrations/embedding-stores/elasticsearch) | ✅ | ✅ | | |
-| [Infinispan](/integrations/embedding-stores/infinispan) | ✅ | | | |
-| [Milvus](/integrations/embedding-stores/milvus) | ✅ | ✅ | | |
-| [MongoDB Atlas](/integrations/embedding-stores/mongodb-atlas) | ✅ | | | |
-| [Neo4j](/integrations/embedding-stores/neo4j) | | | | |
-| [OpenSearch](/integrations/embedding-stores/opensearch) | ✅ | | | |
-| [PGVector](/integrations/embedding-stores/pgvector) | ✅ | | | |
-| [Pinecone](/integrations/embedding-stores/pinecone) | | | | |
-| [Qdrant](/integrations/embedding-stores/qdrant) | ✅ | | | |
-| [Redis](/integrations/embedding-stores/redis) | ✅ | | | |
-| [Vearch](/integrations/embedding-stores/vearch) | ✅ | | | |
-| [Vespa](/integrations/embedding-stores/vespa) | | | | |
-| [Weaviate](/integrations/embedding-stores/weaviate) | | | | |
+| Provider | Storing Metadata | Filtering by Metadata | Local | Cloud |
+|---------------------------------------------------------------------------------------|:------------------:|:---------------------:|:-------:|:-------:|
+| [In-memory](/integrations/embedding-stores/in-memory) | ✅ | ✅ | | |
+| [AstraDB](/integrations/embedding-stores/astra-db) | ✅ | ✅ | | ✅ |
+| [Azure AI Search](/integrations/embedding-stores/azure-ai-search) | ✅ | | | |
+| [Azure CosmosDB Mongo vCore](/integrations/embedding-stores/azure-cosmos-mongo-vcore) | ✅ | | | |
+| [Apache Cassandra™](/integrations/embedding-stores/cassandra) | ✅ | ✅ (partial) | ✅ | ✅ |
+| [Chroma](/integrations/embedding-stores/chroma) | ✅ | | | |
+| [Elasticsearch](/integrations/embedding-stores/elasticsearch) | ✅ | ✅ | | |
+| [Infinispan](/integrations/embedding-stores/infinispan) | ✅ | | | |
+| [Milvus](/integrations/embedding-stores/milvus) | ✅ | ✅ | | |
+| [MongoDB Atlas](/integrations/embedding-stores/mongodb-atlas) | ✅ | | | |
+| [Neo4j](/integrations/embedding-stores/neo4j) | | | | |
+| [OpenSearch](/integrations/embedding-stores/opensearch) | ✅ | | | |
+| [PGVector](/integrations/embedding-stores/pgvector) | ✅ | | | |
+| [Pinecone](/integrations/embedding-stores/pinecone) | | | | |
+| [Qdrant](/integrations/embedding-stores/qdrant) | ✅ | | | |
+| [Redis](/integrations/embedding-stores/redis) | ✅ | | | |
+| [Vearch](/integrations/embedding-stores/vearch) | ✅ | | | |
+| [Vespa](/integrations/embedding-stores/vespa) | | | | |
+| [Weaviate](/integrations/embedding-stores/weaviate) | | | | |
diff --git a/langchain4j-astradb/src/main/java/dev/langchain4j/store/embedding/astradb/AstraDBEmbeddingStore.java b/langchain4j-astradb/src/main/java/dev/langchain4j/store/embedding/astradb/AstraDBEmbeddingStore.java
index c275bb687a..e2965dfe5a 100644
--- a/langchain4j-astradb/src/main/java/dev/langchain4j/store/embedding/astradb/AstraDBEmbeddingStore.java
+++ b/langchain4j-astradb/src/main/java/dev/langchain4j/store/embedding/astradb/AstraDBEmbeddingStore.java
@@ -82,7 +82,7 @@ public AstraDBEmbeddingStore(@NonNull Collection client) {
      * @param concurrentThreads
      *      concurrent threads
      */
-    public AstraDBEmbeddingStore(@NonNull Collection client, int itemsPerChunk, int concurrentThreads) {
+    public AstraDBEmbeddingStore(@NonNull Collection client, int itemsPerChunk, int concurrentThreads) {
         if (itemsPerChunk>20 || itemsPerChunk<1) {
             throw new IllegalArgumentException("'itemsPerChunk' should be in between 1 and 20");
         }