Release v0.2.1 Release · pinecone-io/spark-pinecone

Added: Upsert Sparse Vectors

This release adds support for upserting (inserting or updating) sparse vectors through the spark-pinecone connector. The basic vector type in Pinecone is a dense vector. Pinecone also supports vectors that carry sparse and dense values together, which lets users perform hybrid search on their Pinecone index. Hybrid search combines semantic and keyword search in one query for more relevant results.
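To make the sparse part concrete: a sparse vector stores only its non-zero entries as two parallel lists, the positions (indices) and the numbers at those positions (values). The sketch below is plain Scala with no Spark or Pinecone dependency. The `SparseValues` case class and its `toDense` helper are illustrative names, not part of the connector's API, and the equal-length check is an assumption about what a well-formed sparse vector looks like.

```scala
// Illustrative only: this type is NOT part of the spark-pinecone API.
final case class SparseValues(indices: Seq[Int], values: Seq[Float]) {
  // Assumption: every index must be paired with exactly one value.
  require(indices.length == values.length, "indices and values must have the same length")

  // Expand to a dense sequence of the given dimension; positions not listed are 0.
  def toDense(dimension: Int): Seq[Float] = {
    val lookup = indices.zip(values).toMap
    (0 until dimension).map(i => lookup.getOrElse(i, 0f))
  }
}

object SparseDemo extends App {
  // Same sparse payload as vector "v1" in the sample input: positions 0 and 2 are non-zero.
  val sparse = SparseValues(Seq(0, 2), Seq(5.5f, 5f))
  println(sparse.toDense(4))
}
```

Real sparse vectors usually have a very large dimension with only a handful of non-zero entries, which is why the index/value form is used instead of the dense one.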

Example:

The following example shows how to upsert sparse-dense vectors into Pinecone.

import io.pinecone.spark.pinecone.{COMMON_SCHEMA, PineconeOptions}
import org.apache.spark.SparkConf
import org.apache.spark.sql.{SaveMode, SparkSession}

object MainApp extends App {
  val conf = new SparkConf()
    .setMaster("local[*]")

  val spark = SparkSession
    .builder()
    .config(conf)
    .getOrCreate()

  val df = spark.read
    .option("multiLine", value = true)
    .option("mode", "PERMISSIVE")
    .schema(COMMON_SCHEMA)
    .json("path_to_sample.jsonl") 
    .repartition(2)

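  // Placeholder connection settings; replace each value with your own.
  val apiKey = "YOUR_API_KEY"
  val environment = "YOUR_ENVIRONMENT"
  val projectName = "YOUR_PROJECT_NAME"
  val indexName = "YOUR_INDEX_NAME"

  val pineconeOptions = Map(
    PineconeOptions.PINECONE_API_KEY_CONF -> apiKey,
    PineconeOptions.PINECONE_ENVIRONMENT_CONF -> environment,
    PineconeOptions.PINECONE_PROJECT_NAME_CONF -> projectName,
    PineconeOptions.PINECONE_INDEX_NAME_CONF -> indexName
  )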

  df.write
    .options(pineconeOptions)
    .format("io.pinecone.spark.pinecone.Pinecone")
    .mode(SaveMode.Append)
    .save()
}

The sample.jsonl file used as input in the Scala code above is shown in the next section.

Added: Optional Fields for Input

With this release, the namespace, metadata, and sparse_values fields in the input JSON are optional. Users can omit any of them; the only mandatory fields are id and values. Note that if you include the sparse_values field, both indices and values within sparse_values must be present. The following sample input file illustrates the combinations of input vectors supported by the schema.

[
  {
    "id": "v1",
    "namespace": "default",
    "values": [
      1,
      2,
      3
    ],
    "metadata": {
      "hello": [
        "world",
        "you"
      ],
      "numbers": "or not",
      "actual_number": 5.2,
      "round": 3
    },
    "sparse_values": {
      "indices": [
        0,
        2
      ],
      "values": [ 
        5.5,
        5
      ]
    }
  },
  {
    "id": "v2",
    "values": [
      3,
      2,
      1
    ]
  },
  {
    "id": "v3",
    "values": [
      1,
      4,
      9
    ],
    "namespace": "default"
  }
]
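The field rules above can be sketched as a small pre-flight check that runs before any data reaches Spark. This is plain Scala and does not reflect how the connector itself validates input; `VectorInput`, `SparseInput`, and `InputCheck.validate` are hypothetical names introduced for illustration, and metadata is left out of the model for brevity because it is always optional.

```scala
// Hypothetical model of one input record; field names mirror the JSON above.
final case class SparseInput(indices: Option[Seq[Long]], values: Option[Seq[Float]])

final case class VectorInput(
  id: Option[String],
  values: Option[Seq[Float]],
  namespace: Option[String] = None,
  sparseValues: Option[SparseInput] = None
)

object InputCheck {
  // Returns a list of problems; an empty list means the record satisfies the rules.
  def validate(v: VectorInput): List[String] = {
    // id and values are the only mandatory top-level fields.
    val required = List(
      if (v.id.forall(_.isEmpty)) Some("id is required") else None,
      if (v.values.isEmpty) Some("values is required") else None
    ).flatten

    // sparse_values may be absent, but if present it needs both sub-fields.
    val sparse = v.sparseValues.toList.flatMap { s =>
      List(
        if (s.indices.isEmpty) Some("sparse_values.indices is required") else None,
        if (s.values.isEmpty) Some("sparse_values.values is required") else None
      ).flatten
    }

    required ++ sparse
  }
}

object InputCheckDemo extends App {
  // Mirrors "v2" from the sample: only id and values are set.
  val v2 = VectorInput(id = Some("v2"), values = Some(Seq(3f, 2f, 1f)))
  // A made-up invalid record: sparse_values is present but lacks values.
  val bad = VectorInput(
    id = Some("v4"),
    values = Some(Seq(1f, 2f, 3f)),
    sparseValues = Some(SparseInput(indices = Some(Seq(0L, 2L)), values = None))
  )
  println(InputCheck.validate(v2))
  println(InputCheck.validate(bad))
}
```

Returning a list of messages rather than throwing on the first problem makes it easier to report every issue in a record at once.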

Databricks users:

Please import the spark-pinecone assembly jar from S3: s3://pinecone-jars/0.2.1/spark-pinecone-uberjar.jar.

What's Changed

Reformat sparse vectors by @rohanshah18 in #20
Add support for adding sparse values by @rohanshah18 in #18
Update README by @rohanshah18 in #17

Full Changelog: v0.1.4...v0.2.1
