[Spark Connector] Write support for creating Pinot segments (apache#13748)

* Stub for Spark write interface implementation
* Some further scaffolding for testing write api
* Implement basic segment generation logic with auto schema translation
* Fix tar.gz creation path
* Improve test coverage
* Refactor and add more test coverage
* Whitespace
* Add multi value column support
* Separate reader and writer integration tests
* Add support multi-value column support
* Add test coverage for the segment writer
* Add customizable segment name formatter
* Fix default segment name format
* Address review comments
* Add documentation with "experimental" notice
* Update warning box in write docs
* Add TODO for honoring fieldsToRead in PinotBufferedRecordReader
* Add Apache License to the documentation file
* Fix library usage updated by rebase
* Update github test runner config to fix jdk21 encapsulation related issues
* Allow test access to java.net package in JDK21
* Fix gh test config
* Update pom file to account for running tests with jdk21 encapsulation
* Add jdk21 related fixes to gh test runner config
* Add nio package to jdk21 encapsulation exceptions
* Trim jdk21 related encapsulation exceptions in gh config
* Add sun.nio.ch to opens list
* Add exports option for sun.nio.ch for jdk21
* Bring back --add-opens configs to the spark connector pom file
deemoliu · Sep 5, 2024 · 50ad070
1 parent 7ddb7a4
commit 50ad070
Showing 18 changed files with 1,253 additions and 6 deletions.
diff --git a/.github/workflows/pinot_tests.yml b/.github/workflows/pinot_tests.yml
@@ -123,6 +123,11 @@ jobs:
             --add-exports=jdk.compiler/com.sun.tools.javac.parser=ALL-UNNAMED
             --add-exports=jdk.compiler/com.sun.tools.javac.tree=ALL-UNNAMED
             --add-exports=jdk.compiler/com.sun.tools.javac.util=ALL-UNNAMED
+            --add-opens=java.base/java.nio=ALL-UNNAMED
+            --add-opens=java.base/sun.nio.ch=ALL-UNNAMED
+            --add-opens=java.base/java.lang=ALL-UNNAMED
+            --add-opens=java.base/java.util=ALL-UNNAMED
+            --add-opens=java.base/java.lang.reflect=ALL-UNNAMED
         run: .github/workflows/scripts/pr-tests/.pinot_tests_build.sh
       - name: Unit Test
         env:
@@ -140,6 +145,11 @@ jobs:
             --add-exports=jdk.compiler/com.sun.tools.javac.parser=ALL-UNNAMED
             --add-exports=jdk.compiler/com.sun.tools.javac.tree=ALL-UNNAMED
             --add-exports=jdk.compiler/com.sun.tools.javac.util=ALL-UNNAMED
+            --add-opens=java.base/java.nio=ALL-UNNAMED
+            --add-opens=java.base/sun.nio.ch=ALL-UNNAMED
+            --add-opens=java.base/java.lang=ALL-UNNAMED
+            --add-opens=java.base/java.util=ALL-UNNAMED
+            --add-opens=java.base/java.lang.reflect=ALL-UNNAMED
         run: .github/workflows/scripts/pr-tests/.pinot_tests_unit.sh
       - name: Upload coverage to Codecov
         uses: codecov/codecov-action@v4

diff --git a/pinot-connectors/pinot-spark-3-connector/documentation/write_model.md b/pinot-connectors/pinot-spark-3-connector/documentation/write_model.md
@@ -0,0 +1,54 @@
+<!--
+
+    Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+    Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+
+-->
+# Write Model
+
+> [!CAUTION]
+> This feature is experimental and the API may change in future releases.
+
+The Spark connector also has experimental support for writing Pinot segments from Spark DataFrames.
+Currently, only append mode is supported, and the schema of the DataFrame must match the schema of the Pinot table.
+
+```scala
+// Create sample data (assumes an active SparkSession named `spark`)
+val data = Seq(
+  ("ORD", "Florida", 1000, true, 1722025994),
+  ("ORD", "Florida", 1000, false, 1722025994),
+  ("ORD", "Florida", 1000, false, 1722025994),
+  ("NYC", "New York", 20, true, 1722025994),
+)
+
+val airports = spark.createDataFrame(data)
+  .toDF("airport", "state", "distance", "active", "ts")
+  .repartition(2)
+
+airports.write.format("pinot")
+  .mode("append")
+  .option("table", "airlineStats")
+  .option("tableType", "OFFLINE")
+  .option("segmentNameFormat", "{table}_{partitionId:03}")
+  .option("invertedIndexColumns", "airport")
+  .option("noDictionaryColumns", "airport,state")
+  .option("bloomFilterColumns", "airport")
+  .option("timeColumnName", "ts")
+  .save("myPath")
+```
+
+For more details, refer to the implementation at `org.apache.pinot.connector.spark.v3.datasource.PinotDataWriter`.
diff --git a/pinot-connectors/pinot-spark-3-connector/pom.xml b/pinot-connectors/pinot-spark-3-connector/pom.xml
@@ -135,6 +135,15 @@
       <plugin>
         <groupId>org.scalatest</groupId>
         <artifactId>scalatest-maven-plugin</artifactId>
+        <configuration>
+          <argLine>
+            --add-opens=java.base/java.nio=ALL-UNNAMED
+            --add-opens=java.base/sun.nio.ch=ALL-UNNAMED
+            --add-opens=java.base/java.lang=ALL-UNNAMED
+            --add-opens=java.base/java.util=ALL-UNNAMED
+            --add-opens=java.base/java.lang.reflect=ALL-UNNAMED
+          </argLine>
+        </configuration>
       </plugin>
     </plugins>
   </build>

diff --git a/...main/scala/org/apache/pinot/connector/spark/v3/datasource/PinotBufferedRecordReader.scala b/...main/scala/org/apache/pinot/connector/spark/v3/datasource/PinotBufferedRecordReader.scala
@@ -0,0 +1,69 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.connector.spark.v3.datasource
+
+import org.apache.pinot.spi.data.readers.{GenericRow, RecordReader, RecordReaderConfig}
+
+import java.io.File
+import java.util
+
+/**
+ * A buffered record reader that stores records in memory and allows iteration over them.
+ * It satisfies Pinot's RecordReader interface while also letting the Spark executor
+ * write records into the buffer.
+ *
+ * TODO: To improve resilience, write records to disk when memory is full.
+ */
+class PinotBufferedRecordReader extends RecordReader {
+  private val recordBuffer = new util.ArrayList[GenericRow]()
+  private var readCursor = 0
+
+  def init(dataFile: File, fieldsToRead: util.Set[String], recordReaderConfig: RecordReaderConfig): Unit = {
+    // Do nothing.
+    // TODO: Honor 'fieldsToRead' parameter to avoid ingesting unwanted fields.
+  }
+
+  def write(record: GenericRow): Unit = {
+    recordBuffer.add(record)
+  }
+
+  def hasNext: Boolean = {
+    readCursor < recordBuffer.size()
+  }
+
+  def next(): GenericRow = {
+    readCursor += 1
+    recordBuffer.get(readCursor - 1)
+  }
+
+  def next(reuse: GenericRow): GenericRow = {
+    readCursor += 1
+    reuse.clear()
+    reuse.init(recordBuffer.get(readCursor - 1).copy())
+    reuse
+  }
+
+  def rewind(): Unit = {
+    readCursor = 0
+  }
+
+  def close(): Unit = {
+    recordBuffer.clear()
+  }
+}