v0.2.1 Release
Added: Upsert Sparse Vectors
We have added support for inserting or updating sparse vectors through the spark-pinecone connector. The basic vector type in Pinecone is a dense vector. Pinecone also supports vectors that carry sparse and dense values together, which lets users perform hybrid search on their Pinecone index. Hybrid search combines semantic and keyword search in a single query for more relevant results.
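As a rough illustration of the difference between the two representations, the sketch below expands a sparse vector, stored as parallel lists of indices and values, into a dense array. The `SparseSketch` object and its `toDense` helper are hypothetical names used only for this illustration and are not part of the connector; the dimension passed in is likewise chosen only for the example.

```scala
object SparseSketch {
  // Expands parallel (indices, values) lists into a dense array of the given
  // dimension. Positions that are not listed in `indices` stay at zero.
  // Hypothetical helper for illustration; not part of spark-pinecone.
  def toDense(indices: Seq[Int], values: Seq[Float], dimension: Int): Array[Float] = {
    require(indices.length == values.length, "indices and values must have the same length")
    val dense = Array.fill(dimension)(0.0f)
    indices.zip(values).foreach { case (i, v) => dense(i) = v }
    dense
  }
}
```

A sparse vector only stores its non-zero positions, which is why the input format described in this release carries `indices` and `values` side by side.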
Example:
The following example shows how to upsert sparse-dense vectors into Pinecone.
import io.pinecone.spark.pinecone.{COMMON_SCHEMA, PineconeOptions}
import org.apache.spark.SparkConf
import org.apache.spark.sql.{SaveMode, SparkSession}

object MainApp extends App {
  // Placeholder connection settings; replace these with your own values.
  val apiKey = "PINECONE_API_KEY"
  val environment = "PINECONE_ENVIRONMENT"
  val projectName = "PINECONE_PROJECT_NAME"
  val indexName = "PINECONE_INDEX_NAME"

  // Run Spark locally, using all available cores.
  val conf = new SparkConf()
    .setMaster("local[*]")

  val spark = SparkSession
    .builder()
    .config(conf)
    .getOrCreate()

  // Read the input vectors using the schema shipped with the connector.
  val df = spark.read
    .option("multiLine", value = true)
    .option("mode", "PERMISSIVE")
    .schema(COMMON_SCHEMA)
    .json("path_to_sample.jsonl")
    .repartition(2)

  val pineconeOptions = Map(
    PineconeOptions.PINECONE_API_KEY_CONF -> apiKey,
    PineconeOptions.PINECONE_ENVIRONMENT_CONF -> environment,
    PineconeOptions.PINECONE_PROJECT_NAME_CONF -> projectName,
    PineconeOptions.PINECONE_INDEX_NAME_CONF -> indexName
  )

  // Upsert the vectors into the Pinecone index.
  df.write
    .options(pineconeOptions)
    .format("io.pinecone.spark.pinecone.Pinecone")
    .mode(SaveMode.Append)
    .save()
}
The Sample.jsonl file used as input in the Scala code above is shown in the next section.
Added: Optional Fields for Input
With this release, the namespace, metadata, and sparse_values fields in the input JSON are optional. Users can omit these fields; the only mandatory fields are id and values. Please note that if you include the sparse_values field in the input, both indices and values must be present within sparse_values. The following Sample.jsonl file illustrates the combinations of input vectors supported by the schema.
[
  {
    "id": "v1",
    "namespace": "default",
    "values": [1, 2, 3],
    "metadata": {
      "hello": ["world", "you"],
      "numbers": "or not",
      "actual_number": 5.2,
      "round": 3
    },
    "sparse_values": {
      "indices": [0, 2],
      "values": [5.5, 5]
    }
  },
  {
    "id": "v2",
    "values": [3, 2, 1]
  },
  {
    "id": "v3",
    "values": [1, 4, 9],
    "namespace": "default"
  }
]
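The field rules described above can be sketched as a small pre-flight check in plain Scala. This is only an illustration under stated assumptions: `SparseValues`, `InputRecord`, and `InputRecordCheck.validate` are hypothetical names that mirror the fields of the sample file, not types or functions provided by the connector, and metadata is modeled as an opaque string for simplicity.

```scala
// Hypothetical model of one input record; field names follow the sample file.
final case class SparseValues(indices: Option[Seq[Long]], values: Option[Seq[Float]])

final case class InputRecord(
    id: Option[String],
    values: Option[Seq[Float]],
    namespace: Option[String] = None,
    metadata: Option[String] = None,
    sparseValues: Option[SparseValues] = None
)

object InputRecordCheck {
  // Returns a list of problems; an empty list means the record follows the rules:
  // id and values are mandatory, and sparse_values (if present) needs both
  // indices and values.
  def validate(r: InputRecord): List[String] = {
    val required = List(
      if (r.id.isEmpty) Some("id is mandatory") else None,
      if (r.values.isEmpty) Some("values is mandatory") else None
    ).flatten
    val sparse = r.sparseValues match {
      case Some(SparseValues(Some(_), Some(_))) => Nil
      case Some(_) => List("sparse_values needs both indices and values")
      case None => Nil
    }
    required ++ sparse
  }
}
```

For instance, a record shaped like `v2` in the sample (only id and values) passes the check, while a record whose sparse_values contains indices but no values is reported as invalid.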
Databricks users:
Please import the spark-pinecone assembly jar from S3: s3://pinecone-jars/0.2.1/spark-pinecone-uberjar.jar
What's Changed
- Reformat sparse vectors by @rohanshah18 in #20
- Add support for adding sparse values by @rohanshah18 in #18
- Update README by @rohanshah18 in #17
Full Changelog: v0.1.4...v0.2.1