[Spark Connector] Write support for creating Pinot segments (apache#13748)

* Stub for Spark write interface implementation

* Some further scaffolding for testing write api

* Implement basic segment generation logic with auto schema translation

* Fix tar.gz creation path

* Improve test coverage

* Refactor and add more test coverage

* Whitespace

* Add multi value column support

* Separate reader and writer integration tests

* Add support multi-value column support

* Add test coverage for the segment writer

* Add customizable segment name formatter

* Fix default segment name format

* Address review comments

* Add documentation with "experimental" notice

* Update warning box in write docs

* Add TODO for honoring fieldsToRead in PinotBufferedRecordReader

* Add Apache License to the documentation file

* Fix library usage updated by rebase

* Update github test runner config to fix jdk21 encapsulation related issues

* Allow test access to java.net package in JDK21

* Fix gh test config

* Update pom file to account for running tests with jdk21 encapsulation

* Add jdk21 related fixes to gh test runner config

* Add nio package to jdk21 encapsulation exceptions

* Trim jdk21 related encapsulation exceptions in gh config

* Add sun.nio.ch to opens list

* Add exports option for sun.nio.ch for jdk21

* Bring back --add-opens configs to the spark connector pom file
cbalci authored Sep 5, 2024
1 parent 7ddb7a4 commit 50ad070
Showing 18 changed files with 1,253 additions and 6 deletions.
10 changes: 10 additions & 0 deletions .github/workflows/pinot_tests.yml
@@ -123,6 +123,11 @@ jobs:
--add-exports=jdk.compiler/com.sun.tools.javac.parser=ALL-UNNAMED
--add-exports=jdk.compiler/com.sun.tools.javac.tree=ALL-UNNAMED
--add-exports=jdk.compiler/com.sun.tools.javac.util=ALL-UNNAMED
--add-opens=java.base/java.nio=ALL-UNNAMED
--add-opens=java.base/sun.nio.ch=ALL-UNNAMED
--add-opens=java.base/java.lang=ALL-UNNAMED
--add-opens=java.base/java.util=ALL-UNNAMED
--add-opens=java.base/java.lang.reflect=ALL-UNNAMED
run: .github/workflows/scripts/pr-tests/.pinot_tests_build.sh
- name: Unit Test
env:
@@ -140,6 +145,11 @@ jobs:
--add-exports=jdk.compiler/com.sun.tools.javac.parser=ALL-UNNAMED
--add-exports=jdk.compiler/com.sun.tools.javac.tree=ALL-UNNAMED
--add-exports=jdk.compiler/com.sun.tools.javac.util=ALL-UNNAMED
--add-opens=java.base/java.nio=ALL-UNNAMED
--add-opens=java.base/sun.nio.ch=ALL-UNNAMED
--add-opens=java.base/java.lang=ALL-UNNAMED
--add-opens=java.base/java.util=ALL-UNNAMED
--add-opens=java.base/java.lang.reflect=ALL-UNNAMED
run: .github/workflows/scripts/pr-tests/.pinot_tests_unit.sh
- name: Upload coverage to Codecov
uses: codecov/codecov-action@v4
@@ -0,0 +1,54 @@
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
# Write Model

> [!CAUTION]
> This feature is experimental, and the API may change in future releases.

The Spark Connector also has experimental support for writing Pinot segments from Spark DataFrames.
Currently, only append mode is supported, and the schema of the DataFrame must match the schema of the Pinot table.

```scala
// create sample data
val data = Seq(
  ("ORD", "Florida", 1000, true, 1722025994),
  ("ORD", "Florida", 1000, false, 1722025994),
  ("ORD", "Florida", 1000, false, 1722025994),
  ("NYC", "New York", 20, true, 1722025994),
)

val airports = spark.createDataFrame(data)
  .toDF("airport", "state", "distance", "active", "ts")
  .repartition(2)

airports.write.format("pinot")
  .mode("append")
  .option("table", "airlineStats")
  .option("tableType", "OFFLINE")
  .option("segmentNameFormat", "{table}_{partitionId:03}")
  .option("invertedIndexColumns", "airport")
  .option("noDictionaryColumns", "airport,state")
  .option("bloomFilterColumns", "airport")
  .option("timeColumnName", "ts")
  .save("myPath")
```

For more details, refer to the implementation at `org.apache.pinot.connector.spark.v3.datasource.PinotDataWriter`.
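
The `segmentNameFormat` value in the example above, `{table}_{partitionId:03}`, looks like a template with named placeholders and an optional format spec. The following is a minimal, dependency-free sketch of how such a template could be expanded. It is not the connector's code: `SegmentNameFormatSketch` and `expand` are hypothetical names, and the reading of `03` as the printf-style format `%03d` (zero-padded to three digits) is an assumption that this document does not confirm.

```scala
import scala.util.matching.Regex

// Hypothetical illustration only; not part of the Pinot Spark connector.
object SegmentNameFormatSketch {
  // Matches "{name}" or "{name:spec}", e.g. "{table}" or "{partitionId:03}".
  private val Placeholder: Regex = """\{(\w+)(?::(\w+))?\}""".r

  // Expands every placeholder in `format` using the values in `vars`.
  // ASSUMPTION: a spec such as "03" is applied as the printf format "%03d",
  // so the value must parse as an integer when a spec is present.
  def expand(format: String, vars: Map[String, String]): String =
    Placeholder.replaceAllIn(format, m => {
      val name = m.group(1)
      val value = vars.getOrElse(
        name, throw new IllegalArgumentException(s"Unknown variable: $name"))
      val formatted = Option(m.group(2)) match {
        case Some(spec) => ("%" + spec + "d").format(value.toLong)
        case None       => value
      }
      // Escape '$' and '\' so the value is inserted literally.
      Regex.quoteReplacement(formatted)
    })

  def main(args: Array[String]): Unit = {
    val name = expand(
      "{table}_{partitionId:03}",
      Map("table" -> "airlineStats", "partitionId" -> "7"))
    println(name) // prints "airlineStats_007"
  }
}
```

Under that assumption, each Spark partition would produce a distinct, sortable segment name. The authoritative behavior is whatever `PinotDataWriter` implements.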
9 changes: 9 additions & 0 deletions pinot-connectors/pinot-spark-3-connector/pom.xml
@@ -135,6 +135,15 @@
      <plugin>
        <groupId>org.scalatest</groupId>
        <artifactId>scalatest-maven-plugin</artifactId>
        <configuration>
          <argLine>
            --add-opens=java.base/java.nio=ALL-UNNAMED
            --add-opens=java.base/sun.nio.ch=ALL-UNNAMED
            --add-opens=java.base/java.lang=ALL-UNNAMED
            --add-opens=java.base/java.util=ALL-UNNAMED
            --add-opens=java.base/java.lang.reflect=ALL-UNNAMED
          </argLine>
        </configuration>
      </plugin>
    </plugins>
  </build>
@@ -0,0 +1,69 @@
/**
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
package org.apache.pinot.connector.spark.v3.datasource

import org.apache.pinot.spi.data.readers.{GenericRow, RecordReader, RecordReaderConfig}

import java.io.File
import java.util

/**
 * A buffered record reader that stores the records in memory and allows for iteration over them.
 * This is used to satisfy the RecordReader interface in Pinot, as well as allowing Spark executor
 * to write records.
 *
 * TODO: To improve resilience, write records to disk when memory is full.
 */
class PinotBufferedRecordReader extends RecordReader {
  private val recordBuffer = new util.ArrayList[GenericRow]()
  private var readCursor = 0

  def init(dataFile: File, fieldsToRead: util.Set[String], recordReaderConfig: RecordReaderConfig): Unit = {
    // Do nothing.
    // TODO: Honor 'fieldsToRead' parameter to avoid ingesting unwanted fields.
  }

  def write(record: GenericRow): Unit = {
    recordBuffer.add(record)
  }

  def hasNext: Boolean = {
    readCursor < recordBuffer.size()
  }

  def next(): GenericRow = {
    readCursor += 1
    recordBuffer.get(readCursor - 1)
  }

  def next(reuse: GenericRow): GenericRow = {
    readCursor += 1
    reuse.clear()
    reuse.init(recordBuffer.get(readCursor - 1).copy())
    reuse
  }

  def rewind(): Unit = {
    readCursor = 0
  }

  def close(): Unit = {
    recordBuffer.clear()
  }
}
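
The buffer-and-replay pattern used by `PinotBufferedRecordReader` can be shown without any Pinot dependency. The sketch below is a simplified stand-in, not the class from this commit: `ReplayBuffer` and `drain` are hypothetical names, a type parameter `T` replaces `GenericRow`, and the bounds check in `next()` is an addition that the original does not have. It shows that records written once can be read in full, and that `rewind()` resets the cursor so that the same records can be read again.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for illustration; `T` replaces Pinot's GenericRow.
final class ReplayBuffer[T] {
  private val buffer = ArrayBuffer.empty[T]
  private var cursor = 0

  // Append a record to the in-memory buffer.
  def write(record: T): Unit = { buffer += record }

  // True while unread records remain.
  def hasNext: Boolean = cursor < buffer.size

  // Return the record at the cursor and advance.
  // The bounds check is an addition in this sketch.
  def next(): T = {
    if (!hasNext) throw new NoSuchElementException("buffer exhausted")
    val record = buffer(cursor)
    cursor += 1
    record
  }

  // Reset the cursor so that the buffered records can be read again.
  def rewind(): Unit = { cursor = 0 }

  // Drop all buffered records.
  def close(): Unit = buffer.clear()
}

object ReplayBufferDemo {
  // Read every remaining record into a list.
  def drain[T](reader: ReplayBuffer[T]): List[T] = {
    val out = List.newBuilder[T]
    while (reader.hasNext) out += reader.next()
    out.result()
  }

  def main(args: Array[String]): Unit = {
    val reader = new ReplayBuffer[String]
    Seq("ORD", "NYC").foreach(reader.write)

    val firstPass = drain(reader)
    reader.rewind()
    val secondPass = drain(reader)

    // Both passes yield List(ORD, NYC).
    println(firstPass == secondPass) // prints "true"
    reader.close()
  }
}
```

Because everything is held in memory, the buffer grows with the number of records written, which is the limitation that the TODO about spilling to disk refers to.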