⭐ Star us on GitHub — it motivates us a lot!
Source Code: https://github.com/StabRise/spark-pdf
Quick Start Jupyter Notebook Spark 3.x.x: PdfDataSource.ipynb
Quick Start Jupyter Notebook Spark 4.0.x: PdfDataSourceSpark4.ipynb
The project provides a custom data source for Apache Spark that lets you read PDF files into a Spark DataFrame.
If you find this project useful, please give the repository a star.
- Read PDF documents into a Spark DataFrame
- Read PDF files lazily, page by page
- Handle large files, up to 10k pages
- Handle scanned PDF files by running OCR
- No need to install Tesseract OCR; it's included in the package
- Java 8, 11, 17
- Apache Spark 3.3.2, 3.4.1, 3.5.0, 4.0.0
- Ghostscript 9.50 or later (only for the GhostScript reader)
Spark 4.0.0 is supported in version 0.1.11 and later (requires Java 17 and Scala 2.13).
The binary package is available in the Maven Central Repository.
- Spark 3.5.*: com.stabrise:spark-pdf-spark35_2.12:0.1.11
- Spark 3.4.*: com.stabrise:spark-pdf-spark34_2.12:0.1.11
- Spark 3.3.*: com.stabrise:spark-pdf-spark33_2.12:0.1.11
- Spark 4.0.*: com.stabrise:spark-pdf-spark40_2.13:0.1.11
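The Spark 3.x artifact names above encode the Spark minor version, so the coordinate can be derived from a version string. The following is a minimal sketch, not part of spark-pdf: the helper name `spark_pdf_coordinate` is hypothetical, and it covers only the Scala 2.12 artifacts for Spark 3.3 to 3.5. For Spark 4.0, take the coordinate from the list directly.

```python
# Hypothetical helper (not part of spark-pdf): builds the Maven coordinate
# for the Spark 3.x artifacts listed above.
SUPPORTED_SPARK3_MINORS = {"3.3", "3.4", "3.5"}


def spark_pdf_coordinate(spark_version: str, package_version: str = "0.1.11") -> str:
    """Return the Maven coordinate matching a Spark 3.x version string."""
    parts = spark_version.split(".")
    if len(parts) < 2:
        raise ValueError(f"Unexpected Spark version: {spark_version!r}")
    minor = f"{parts[0]}.{parts[1]}"
    if minor not in SUPPORTED_SPARK3_MINORS:
        raise ValueError(f"No Spark 3.x artifact listed for Spark {spark_version}")
    # Artifact suffix is 'spark' + major + minor digits, built for Scala 2.12.
    return f"com.stabrise:spark-pdf-spark{parts[0]}{parts[1]}_2.12:{package_version}"


print(spark_pdf_coordinate("3.5.0"))
```

In PySpark, `pyspark.__version__` gives the version string to pass in, and the result can be used as the value of `spark.jars.packages`.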
- imageType: Output image type. One of "BINARY", "GREY", "RGB". Default: "RGB".
- resolution: Resolution used to render a PDF page to an image. Default: "300" dpi.
- pagePerPartition: Number of pages per partition in the Spark DataFrame. Default: "5".
- reader: Supported values:
  - pdfBox: based on the PdfBox Java library
  - gs: based on Ghostscript (Ghostscript must be installed on the system)
- ocrConfig: Tesseract OCR configuration. Default: "psm=3". For more information, see Tesseract OCR Params.
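Since every option is passed as a string, a small wrapper can catch typos in option values before the job starts. This is a sketch under stated assumptions: `pdf_options` is a hypothetical helper, not part of spark-pdf, and its `reader` default of "pdfBox" is this sketch's choice because the option list does not state one.

```python
# Hypothetical helper (not part of spark-pdf): validates option values against
# the choices documented above and returns them as strings.
IMAGE_TYPES = {"BINARY", "GREY", "RGB"}
READERS = {"pdfBox", "gs"}


def pdf_options(image_type="RGB", resolution=300, page_per_partition=5,
                reader="pdfBox", ocr_config="psm=3"):
    """Build the option dict for the "pdf" data source."""
    if image_type not in IMAGE_TYPES:
        raise ValueError(f"imageType must be one of {sorted(IMAGE_TYPES)}")
    if reader not in READERS:
        raise ValueError(f"reader must be one of {sorted(READERS)}")
    if resolution <= 0 or page_per_partition <= 0:
        raise ValueError("resolution and pagePerPartition must be positive")
    return {
        "imageType": image_type,
        "resolution": str(resolution),
        "pagePerPartition": str(page_per_partition),
        "reader": reader,  # default here is an assumption of this sketch
        "ocrConfig": ocr_config,
    }


opts = pdf_options(image_type="BINARY", resolution=200, page_per_partition=2)
print(opts)
```

The resulting dict can be handed to the reader in one call with `spark.read.format("pdf").options(**opts)`.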
The DataFrame contains the following columns:
- path: path to the file
- page_number: page number of the document
- text: extracted text from the text layer of the PDF page
- image: image representation of the page
- document: the OCR-extracted text from the rendered image (calls Tesseract OCR)
- partition_number: partition number
Output Schema:
root
 |-- path: string (nullable = true)
 |-- filename: string (nullable = true)
 |-- page_number: integer (nullable = true)
 |-- partition_number: integer (nullable = true)
 |-- text: string (nullable = true)
 |-- image: struct (nullable = true)
 |    |-- path: string (nullable = true)
 |    |-- resolution: integer (nullable = true)
 |    |-- data: binary (nullable = true)
 |    |-- imageType: string (nullable = true)
 |    |-- exception: string (nullable = true)
 |    |-- height: integer (nullable = true)
 |    |-- width: integer (nullable = true)
 |-- document: struct (nullable = true)
 |    |-- path: string (nullable = true)
 |    |-- text: string (nullable = true)
 |    |-- outputType: string (nullable = true)
 |    |-- bBoxes: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- text: string (nullable = true)
 |    |    |    |-- score: float (nullable = true)
 |    |    |    |-- x: integer (nullable = true)
 |    |    |    |-- y: integer (nullable = true)
 |    |    |    |-- width: integer (nullable = true)
 |    |    |    |-- height: integer (nullable = true)
 |    |-- exception: string (nullable = true)
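Each row carries two text sources: the top-level `text` column (text layer) and `document.text` (OCR). A common need is to prefer the text layer and fall back to OCR for scanned pages. The sketch below works on a row converted to a plain dict, for example with `row.asDict(recursive=True)`. The helper `page_text` and the sample path are hypothetical, and treating a non-empty `exception` string as an OCR failure is an assumption about the field's meaning.

```python
from typing import Any, Mapping, Optional


def page_text(row: Mapping[str, Any]) -> Optional[str]:
    """Prefer the text-layer text; fall back to the OCR text in document.text."""
    text = (row.get("text") or "").strip()
    if text:
        return text
    document = row.get("document") or {}
    # Assumption: a non-empty 'exception' string means OCR failed for this page.
    if document.get("exception"):
        return None
    ocr_text = (document.get("text") or "").strip()
    return ocr_text or None


# Hypothetical row for a scanned page with an empty text layer.
sample = {
    "path": "file:/tmp/scan.pdf",
    "page_number": 0,
    "text": "",
    "document": {"text": "Scanned invoice", "exception": ""},
}
print(page_text(sample))  # Scanned invoice
```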
// Scala example
import org.apache.spark.sql.SparkSession

// Start a local session and pull the data source from Maven Central
val spark = SparkSession.builder()
  .appName("Spark PDF Example")
  .master("local[*]")
  .config("spark.jars.packages", "com.stabrise:spark-pdf-spark35_2.12:0.1.11")
  .getOrCreate()

// Read the PDF file(s); each row of the DataFrame is one page
val df = spark.read.format("pdf")
  .option("imageType", "BINARY")
  .option("resolution", "200")
  .option("pagePerPartition", "2")
  .option("reader", "pdfBox")
  .option("ocrConfig", "psm=11")
  .load("path to the pdf file(s)")

// Show the file path and the OCR result for each page
df.select("path", "document").show()
# Python example
from pyspark.sql import SparkSession

# Start a local session and pull the data source from Maven Central
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("SparkPdf") \
    .config("spark.jars.packages", "com.stabrise:spark-pdf-spark35_2.12:0.1.11") \
    .getOrCreate()

# Read the PDF file(s); each row of the DataFrame is one page
df = spark.read.format("pdf") \
    .option("imageType", "BINARY") \
    .option("resolution", "200") \
    .option("pagePerPartition", "2") \
    .option("reader", "pdfBox") \
    .option("ocrConfig", "psm=11") \
    .load("path to the pdf file(s)")

# Show the file path and the OCR result for each page
df.select("path", "document").show()
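Because the DataFrame holds one row per page, getting the full text of a document means stitching its pages back together in page order. For a result small enough to collect to the driver, a plain-Python sketch could look like this. `join_pages` is a hypothetical helper, not part of spark-pdf, and the sample tuples are made up for illustration.

```python
from collections import defaultdict


def join_pages(rows):
    """Group (path, page_number, text) tuples by path and join pages in order."""
    pages = defaultdict(list)
    for path, page_number, text in rows:
        pages[path].append((page_number, text or ""))
    # Sort by page number so the result does not depend on row arrival order.
    return {
        path: "\n".join(text for _, text in sorted(items, key=lambda item: item[0]))
        for path, items in pages.items()
    }


# Made-up rows standing in for collected (path, page_number, text) values.
rows = [("a.pdf", 1, "second"), ("a.pdf", 0, "first"), ("b.pdf", 0, None)]
print(join_pages(rows))
```

PySpark `Row` objects are tuples, so the output of `df.select("path", "page_number", "text").collect()` can be passed in directly. For large inputs, the same grouping is better done in Spark itself rather than on the driver.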
This project is not affiliated with, endorsed by, or connected to the Apache Software Foundation or Apache Spark.