
Releases: StabRise/spark-pdf

0.1.11

06 Dec 07:51

Added cross-building for Spark 3.3.x, 3.4.x, and 3.5.x

Binary packages are available in the Maven Central Repository:

Spark 3.5.x: com.stabrise:spark-pdf-spark35_2.12:0.1.11
Spark 3.4.x: com.stabrise:spark-pdf-spark34_2.12:0.1.11
Spark 3.3.x: com.stabrise:spark-pdf-spark33_2.12:0.1.11
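
As a rough illustration of how these coordinates might be used from PySpark, the sketch below picks the artifact that matches a Spark version and passes it to Spark's standard `spark.jars.packages` setting. The helper names `spark_pdf_coordinate` and `build_session` and the application name are hypothetical and not part of spark-pdf; only the coordinates come from the list above.

```python
# Maven coordinates copied from the 0.1.11 release notes, keyed by Spark minor line.
SPARK_PDF_PACKAGES = {
    "3.5": "com.stabrise:spark-pdf-spark35_2.12:0.1.11",
    "3.4": "com.stabrise:spark-pdf-spark34_2.12:0.1.11",
    "3.3": "com.stabrise:spark-pdf-spark33_2.12:0.1.11",
}


def spark_pdf_coordinate(spark_version: str) -> str:
    """Return the coordinate for a Spark version such as "3.5.1" (hypothetical helper)."""
    major_minor = ".".join(spark_version.split(".")[:2])
    try:
        return SPARK_PDF_PACKAGES[major_minor]
    except KeyError:
        raise ValueError(
            f"No spark-pdf 0.1.11 build listed for Spark {spark_version}"
        ) from None


def build_session(spark_version: str):
    """Create a session that resolves the package from Maven Central (hypothetical helper)."""
    # Imported lazily so the lookup above can be used without PySpark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName("spark-pdf-example")  # assumed name, purely illustrative
        .config("spark.jars.packages", spark_pdf_coordinate(spark_version))
        .getOrCreate()
    )
```

For example, `build_session("3.5.1")` would request the `spark35` artifact. The version passed in should match the Spark installation actually in use.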

Added the ability to specify Tesseract configuration options

ocrConfig: Tesseract OCR configuration. Default: "psm=3". For more information, see the Tesseract OCR parameters documentation.

df = spark.read.format("pdf") \
    .option("imageType", "BINARY") \
    .option("resolution", "200") \
    .option("pagePerPartition", "2") \
    .option("reader", "pdfBox") \
    .option("ocrConfig", "psm=11") \
    .load("path to the pdf file(s)")
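
The same options can also be collected in a dictionary and passed through `DataFrameReader.options`, which makes it easier to vary the page segmentation mode between runs. The following is a minimal sketch under stated assumptions: `pdf_reader_options` and `read_pdf` are hypothetical helpers, not part of spark-pdf, and the 0 to 13 range check reflects Tesseract's own page segmentation modes rather than any validation known to be done by the data source.

```python
def pdf_reader_options(psm: int = 3, resolution: int = 200,
                       pages_per_partition: int = 2) -> dict:
    """Build the option map used in the release-note example (hypothetical helper)."""
    # Tesseract defines page segmentation modes 0 through 13.
    if not 0 <= psm <= 13:
        raise ValueError(f"psm must be between 0 and 13, got {psm}")
    return {
        "imageType": "BINARY",
        "resolution": str(resolution),
        "pagePerPartition": str(pages_per_partition),
        "reader": "pdfBox",
        "ocrConfig": f"psm={psm}",
    }


def read_pdf(spark, path: str, **kwargs):
    """Load PDF file(s) with the options above; `spark` is an existing SparkSession."""
    return spark.read.format("pdf").options(**pdf_reader_options(**kwargs)).load(path)
```

Calling `read_pdf(spark, path, psm=11)` would then be equivalent to the chained `.option(...)` calls above.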


0.1.12

03 Dec 08:23
Pre-release

Build for Spark 3.5.0