
v1.0.0 Release

@rohanshah18 released this 15 May 23:30

Added: Support for serverless indexes

Previously, users could only upsert records into pod indexes. With this release, users can upsert records into serverless indexes as well (see the examples below).

Updated: Datatype of sparse indices from signed 32-bit integers to unsigned 32-bit integers

Pinecone's backend API expects sparse indices as unsigned 32-bit integers, while the spark connector used to accept signed 32-bit integers. To address this limitation, sparse indices now accept Long (instead of Int), with an input range of [0, 2^32 - 1]. Any value outside this range throws an IllegalArgumentException.
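
For illustration, here is a minimal sketch of the new range check. The validate helper and object name below are hypothetical, written only to mirror the behavior described above; the connector performs the equivalent validation internally during upsert.

import scala.util.Try

object SparseIndexRange extends App {
  // Largest value representable as an unsigned 32-bit integer: 2^32 - 1
  val MaxUnsignedInt: Long = (1L << 32) - 1

  // Hypothetical check mirroring the connector's validation; Scala's
  // require throws IllegalArgumentException when the condition fails.
  def validate(index: Long): Long = {
    require(index >= 0L && index <= MaxUnsignedInt,
      s"sparse index $index is outside [0, $MaxUnsignedInt]")
    index
  }

  println(validate(0L))              // ok
  println(validate(MaxUnsignedInt))  // ok: 4294967295
  println(Try(validate(-1L)))        // Failure(java.lang.IllegalArgumentException: ...)
}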

Removed: projectName and environment variables

Users are no longer required to supply projectName and environment when upserting records. Endpoint resolution is handled by the underlying Java SDK, so neither variable is needed; apiKey and indexName are the only required parameters.
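
As a result, the connector configuration reduces to two options, as in the examples below:

import io.pinecone.spark.pinecone.PineconeOptions

// v1.0.0: only two connector options are required.
val pineconeOptions = Map(
  PineconeOptions.PINECONE_API_KEY_CONF -> "PINECONE_API_KEY",
  PineconeOptions.PINECONE_INDEX_NAME_CONF -> "PINECONE_INDEX_NAME"
  // The former environment and projectName options have been removed;
  // the Java SDK resolves the endpoint from the API key and index name.
)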

Updated: Pinecone java sdk client v0.7.4 to v1.0.0

The spark-connector relies on the Pinecone Java SDK, and as part of this release we have updated the Java SDK client from v0.7.4 to v1.0.0.
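
If you manage the connector through sbt, the version bump looks like the sketch below. The artifact coordinates here are an assumption; check the project README for the exact group and artifact names.

// build.sbt (assumed coordinates; verify against the README)
libraryDependencies += "io.pinecone" %% "spark-pinecone" % "1.0.0"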

Example:

The following example shows how to upsert records into Pinecone using scala-spark:

import io.pinecone.spark.pinecone.{COMMON_SCHEMA, PineconeOptions}
import org.apache.spark.SparkConf
import org.apache.spark.sql.{SaveMode, SparkSession}

object MainApp extends App {
  // Run Spark locally, using all available cores
  val conf = new SparkConf()
    .setMaster("local[*]")

  val apiKey = "PINECONE_API_KEY"
  val indexName = "PINECONE_INDEX_NAME"

  val spark = SparkSession
    .builder()
    .config(conf)
    .getOrCreate()

  // Read the input file and apply the connector's COMMON_SCHEMA
  val df = spark.read
    .option("multiLine", value = true)
    .option("mode", "PERMISSIVE")
    .schema(COMMON_SCHEMA)
    .json("path_to_sample.jsonl")
    .repartition(2)

  // apiKey and indexName are the only required options in v1.0.0
  val pineconeOptions = Map(
    PineconeOptions.PINECONE_API_KEY_CONF -> apiKey,
    PineconeOptions.PINECONE_INDEX_NAME_CONF -> indexName
  )

  // Upsert the DataFrame into the Pinecone index
  df.write
    .options(pineconeOptions)
    .format("io.pinecone.spark.pinecone.Pinecone")
    .mode(SaveMode.Append)
    .save()
}

The following example shows how to upsert records into Pinecone using python-spark:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, ArrayType, FloatType, StringType, LongType

# Your API key and index name
api_key = "PINECONE_API_KEY"
index_name = "PINECONE_INDEX_NAME"

COMMON_SCHEMA = StructType([
    StructField("id", StringType(), False),
    StructField("namespace", StringType(), True),
    StructField("values", ArrayType(FloatType(), False), False),
    StructField("metadata", StringType(), True),
    StructField("sparse_values", StructType([
        StructField("indices", ArrayType(LongType(), False), False),
        StructField("values", ArrayType(FloatType(), False), False)
    ]), True)
])

# Initialize Spark
spark = SparkSession.builder.getOrCreate()

# Read the file and apply the schema
df = spark.read \
    .option("multiLine", True) \
    .option("mode", "PERMISSIVE") \
    .schema(COMMON_SCHEMA) \
    .json("/FileStore/tables/sample-4.jsonl")

# Verify that the read was successful
df.show()

# Upsert the records into the Pinecone index
df.write \
    .option("pinecone.apiKey", api_key) \
    .option("pinecone.indexName", index_name) \
    .format("io.pinecone.spark.pinecone.Pinecone") \
    .mode("append") \
    .save()

The sample.jsonl file used as input in the scala-spark and python-spark examples above is shown below.

[
  {
    "id": "v1",
    "namespace": "default",
    "values": [
      1,
      2,
      3
    ],
    "metadata": {
      "hello": [
        "world",
        "you"
      ],
      "numbers": "or not",
      "actual_number": 5.2,
      "round": 3
    },
    "sparse_values": {
      "indices": [
        0,
        2
      ],
      "values": [ 
        5.5,
        5
      ]
    }
  },
  {
    "id": "v2",
    "values": [
      3,
      2,
      1
    ]
  },
  {
    "id": "v3",
    "values": [
      1,
      4,
      9
    ],
    "namespace": "default"
  }
]

Databricks users:

Please import the spark-pinecone assembly jar from S3: s3://pinecone-jars/1.0.0/spark-pinecone-uberjar.jar.

What's Changed

  • Add s3 v0.2.2 assembly jar url by @rohanshah18 in #25
  • Update java sdk to v1 to add support for serverless indexes and accept sparse indices within the range of unsigned 32-bit integers by @rohanshah18 in #31
  • Update README for v1 by @rohanshah18 in #32

Full Changelog: v0.2.2...v1.0.0