Releases: StabRise/spark-pdf
0.1.11
Added cross-building for Spark 3.3.x, 3.4.x, and 3.5.x.
Binary packages are available in the Maven Central Repository:
Spark 3.5.x: com.stabrise:spark-pdf-spark35_2.12:0.1.11
Spark 3.4.x: com.stabrise:spark-pdf-spark34_2.12:0.1.11
Spark 3.3.x: com.stabrise:spark-pdf-spark33_2.12:0.1.11
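The artifact names above follow a regular pattern, so the right coordinate can be picked from the Spark version at runtime. Below is a minimal sketch, assuming only the three builds listed in these notes; `spark_pdf_coordinate` and `SUPPORTED_SUFFIXES` are hypothetical names used for illustration and are not part of spark-pdf.

```python
# Hypothetical helper (not part of spark-pdf): maps a Spark version string to
# the matching Maven coordinate listed in the 0.1.11 release notes.
SUPPORTED_SUFFIXES = {"3.3": "spark33", "3.4": "spark34", "3.5": "spark35"}


def spark_pdf_coordinate(spark_version: str, package_version: str = "0.1.11") -> str:
    """Return the spark-pdf Maven coordinate for a Spark version such as '3.5.1'."""
    # Keep only "major.minor", since the builds are published per minor line.
    major_minor = ".".join(spark_version.split(".")[:2])
    if major_minor not in SUPPORTED_SUFFIXES:
        raise ValueError(f"No spark-pdf build listed for Spark {spark_version}")
    suffix = SUPPORTED_SUFFIXES[major_minor]
    # All listed artifacts are built for Scala 2.12.
    return f"com.stabrise:spark-pdf-{suffix}_2.12:{package_version}"


# Example usage (requires pyspark, so it is left as a comment):
#   from pyspark.sql import SparkSession
#   spark = (SparkSession.builder
#            .config("spark.jars.packages", spark_pdf_coordinate("3.5.1"))
#            .getOrCreate())
```

The returned string is a standard Maven coordinate, so it can also be passed to `spark-submit --packages` instead of being set through `spark.jars.packages`.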
Added the ability to specify Tesseract configuration options:
ocrConfig: Tesseract OCR configuration. Default: "psm=3". For more information, see Tesseract OCR Params.
df = spark.read.format("pdf") \
    .option("imageType", "BINARY") \
    .option("resolution", "200") \
    .option("pagePerPartition", "2") \
    .option("reader", "pdfBox") \
    .option("ocrConfig", "psm=11") \
    .load("path to the pdf file(s)")
0.1.12
Build for Spark 3.5.0