Welcome to the Spark PDF

The project provides a custom data source for the Apache Spark that allows you to read PDF files into the Spark DataFrame.

If you found useful this project, please give a star to the repository.

Key features:

Read PDF documents to the Spark DataFrame
Support read PDF files lazy per page
Support big files, up to 10k pages
Support scanned PDF files (call OCR)
No need to install Tesseract OCR, it's included in the package

Requirements

Java 8, 11, 17
Apache Spark 3.3.2, 3.4.1, 3.5.0, 4.0.0
Ghostscript 9.50 or later (only for the GhostScript reader)

Spark 4.0.0 is supported in the version 0.1.11 and later (need Java 17 and Scala 2.13).

Installation

Binary package is available in the Maven Central Repository.

Spark 3.5.*: com.stabrise:spark-pdf-spark35_2.12:0.1.11
Spark 3.4.*: com.stabrise:spark-pdf-spark34_2.12:0.1.11
Spark 3.3.*: com.stabrise:spark-pdf-spark33_2.12:0.1.11
Spark 4.0.*: com.stabrise:spark-pdf-spark34_2.13:0.1.11

Options for the data source:

imageType: Oputput image type. Can be: "BINARY", "GREY", "RGB". Default: "RGB".
resolution: Resolution for rendering PDF page to the image. Default: "300" dpi.
pagePerPartition: Number pages per partition in Spark DataFrame. Default: "5".
reader: Supports: pdfBox - based on PdfBox java lib, gs - based on GhostScript (need installation GhostScipt to the system)
ocrConfig: Tesseract OCR configuration. Default: "psm=3". For more information see Tesseract OCR Params

Output Columns in the DataFrame:

The DataFrame contains the following columns:

path: path to the file
page_number: page number of the document
text: extracted text from the text layer of the PDF page
image: image representation of the page
document: the OCR-extracted text from the rendered image (calls Tesseract OCR)
partition_number: partition number

Output Schema:

root
 |-- path: string (nullable = true)
 |-- filename: string (nullable = true)
 |-- page_number: integer (nullable = true)
 |-- partition_number: integer (nullable = true)
 |-- text: string (nullable = true)
 |-- image: struct (nullable = true)
 |    |-- path: string (nullable = true)
 |    |-- resolution: integer (nullable = true)
 |    |-- data: binary (nullable = true)
 |    |-- imageType: string (nullable = true)
 |    |-- exception: string (nullable = true)
 |    |-- height: integer (nullable = true)
 |    |-- width: integer (nullable = true)
 |-- document: struct (nullable = true)
 |    |-- path: string (nullable = true)
 |    |-- text: string (nullable = true)
 |    |-- outputType: string (nullable = true)
 |    |-- bBoxes: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- text: string (nullable = true)
 |    |    |    |-- score: float (nullable = true)
 |    |    |    |-- x: integer (nullable = true)
 |    |    |    |-- y: integer (nullable = true)
 |    |    |    |-- width: integer (nullable = true)
 |    |    |    |-- height: integer (nullable = true)
 |    |-- exception: string (nullable = true)

Example of usage

Scala

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Spark PDF Example")
  .master("local[*]")
  .config("spark.jars.packages", "com.stabrise:spark-pdf-spark35_2.12:0.1.11")
  .getOrCreate()
  
val df = spark.read.format("pdf")
  .option("imageType", "BINARY")
  .option("resolution", "200")
  .option("pagePerPartition", "2")
  .option("reader", "pdfBox")
  .option("ocrConfig", "psm=11")
  .load("path to the pdf file(s)")

df.select("path", "document").show()

Python

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("SparkPdf") \
    .config("spark.jars.packages", "com.stabrise:spark-pdf-spark35_2.12:0.1.11") \
    .getOrCreate()

df = spark.read.format("pdf") \
    .option("imageType", "BINARY") \
    .option("resolution", "200") \
    .option("pagePerPartition", "2") \
    .option("reader", "pdfBox") \
    .option("ocrConfig", "psm=11") \
    .load("path to the pdf file(s)")

df.select("path", "document").show()

Disclaimer

This project is not affiliated with, endorsed by, or connected to the Apache Software Foundation or Apache Spark.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Welcome to the Spark PDF

Key features:

Requirements

Installation

Options for the data source:

Output Columns in the DataFrame:

Example of usage

Scala

Python

Disclaimer

Files

README.md

Latest commit

History

README.md

File metadata and controls

Welcome to the Spark PDF

Key features:

Requirements

Installation

Options for the data source:

Output Columns in the DataFrame:

Example of usage

Scala

Python

Disclaimer