
v0.2.1 Release

@rohanshah18 released this 25 Oct 18:50

Added: Upsert Sparse Vectors

We have added support for upserting (inserting or updating) sparse vectors in the spark-pinecone connector. The basic vector type in Pinecone is a dense vector. Pinecone also supports vectors with sparse and dense values together, which allows users to perform hybrid search on their Pinecone index. Hybrid search combines semantic and keyword search in one query for more relevant results.

Example:

The following example shows how to upsert sparse-dense vectors into Pinecone.

import io.pinecone.spark.pinecone.{COMMON_SCHEMA, PineconeOptions}
import org.apache.spark.SparkConf
import org.apache.spark.sql.{SaveMode, SparkSession}

object MainApp extends App {
  // Replace these placeholders with your Pinecone credentials and index details.
  val apiKey      = "PINECONE_API_KEY"
  val environment = "PINECONE_ENVIRONMENT"
  val projectName = "PINECONE_PROJECT_NAME"
  val indexName   = "PINECONE_INDEX_NAME"

  val conf = new SparkConf()
    .setMaster("local[*]")

  val spark = SparkSession
    .builder()
    .config(conf)
    .getOrCreate()

  // Read the input JSON records using the connector's COMMON_SCHEMA, which
  // defines the id and values fields along with the optional namespace,
  // metadata, and sparse_values fields.
  val df = spark.read
    .option("multiLine", value = true)
    .option("mode", "PERMISSIVE")
    .schema(COMMON_SCHEMA)
    .json("path_to_sample.jsonl")
    .repartition(2)

  val pineconeOptions = Map(
    PineconeOptions.PINECONE_API_KEY_CONF -> apiKey,
    PineconeOptions.PINECONE_ENVIRONMENT_CONF -> environment,
    PineconeOptions.PINECONE_PROJECT_NAME_CONF -> projectName,
    PineconeOptions.PINECONE_INDEX_NAME_CONF -> indexName
  )

  // Upsert the vectors into the configured Pinecone index.
  df.write
    .options(pineconeOptions)
    .format("io.pinecone.spark.pinecone.Pinecone")
    .mode(SaveMode.Append)
    .save()
}
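If you want to confirm that the input parsed against COMMON_SCHEMA as expected before writing, a minimal sketch using standard Spark DataFrame methods (printSchema and show are plain Spark API, not part of the connector) is:

  // Optional sanity check before writing: inspect the schema and a few rows.
  df.printSchema()
  df.show(3, false)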

The Sample.jsonl file used as input in the above Scala code is shown in the next example.

Added: Optional Fields for Input

With this release, the namespace, metadata, and sparse_values fields in the input JSON are optional. Users can now choose not to include these fields; the only mandatory fields are id and values. Please note that if you include the sparse_values field in the input, both indices and values within sparse_values must be present. The following example of the Sample.jsonl file illustrates the possible combinations of input vectors supported by the schema.

[
  {
    "id": "v1",
    "namespace": "default",
    "values": [
      1,
      2,
      3
    ],
    "metadata": {
      "hello": [
        "world",
        "you"
      ],
      "numbers": "or not",
      "actual_number": 5.2,
      "round": 3
    },
    "sparse_values": {
      "indices": [
        0,
        2
      ],
      "values": [ 
        5.5,
        5
      ]
    }
  },
  {
    "id": "v2",
    "values": [
      3,
      2,
      1
    ]
  },
  {
    "id": "v3",
    "values": [
      1,
      4,
      9
    ],
    "namespace": "default"
  }
]
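Since id and values are the only mandatory fields, one way to guard against malformed records is to drop rows missing either field before writing. This is a minimal sketch using standard Spark column functions; the column names follow COMMON_SCHEMA as shown above:

  import org.apache.spark.sql.functions.col

  // Keep only records that have the mandatory id and values fields populated.
  val validDf = df.filter(col("id").isNotNull && col("values").isNotNull)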

Databricks users:

Please import the spark-pinecone assembly jar from S3: s3://pinecone-jars/0.2.1/spark-pinecone-uberjar.jar.

What's Changed

Full Changelog: v0.1.4...v0.2.1