GitHub - StabRise/spark-pdf: PDF DataSource for Apache Spark

⭐ Star us on GitHub — it motivates us a lot!

Source Code: https://github.com/StabRise/spark-pdf

Quick Start Jupyter Notebook Spark 3.x.x: PdfDataSource.ipynb

Quick Start Jupyter Notebook Spark 4.0.x: PdfDataSourceSpark4.ipynb

Welcome to Spark PDF

The project provides a custom data source for Apache Spark that lets you read PDF files into a Spark DataFrame.

If you find this project useful, please give the repository a star.

Key features:

Read PDF documents into a Spark DataFrame
Read PDF files lazily, page by page
Support large files, up to 10k pages
Support scanned PDF files (via OCR)
No need to install Tesseract OCR; it is included in the package

Requirements

Java 8, 11, 17
Apache Spark 3.3.2, 3.4.1, 3.5.0, 4.0.0
Ghostscript 9.50 or later (only for the Ghostscript reader)

Spark 4.0.0 is supported in version 0.1.11 and later (requires Java 17 and Scala 2.13).

Installation

The binary package is available in the Maven Central Repository.

Spark 3.5.*: com.stabrise:spark-pdf-spark35_2.12:0.1.11
Spark 3.4.*: com.stabrise:spark-pdf-spark34_2.12:0.1.11
Spark 3.3.*: com.stabrise:spark-pdf-spark33_2.12:0.1.11
Spark 4.0.*: com.stabrise:spark-pdf-spark40_2.13:0.1.11
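
When the same job has to run against more than one Spark 3.x line, the coordinate can be picked from the running Spark version. The following is a minimal sketch under stated assumptions: `coordinate_for`, `ARTIFACTS` and `SPARK_PDF_VERSION` are hypothetical names introduced here and are not part of the package, and the table covers only the Spark 3.x artifacts from the list above.

```python
# Hypothetical helper: map a Spark version string to the matching
# spark-pdf Maven coordinate. Not part of the spark-pdf package.
SPARK_PDF_VERSION = "0.1.11"

# Spark minor line -> artifact name, copied from the list above (Spark 3.x only).
ARTIFACTS = {
    "3.3": "spark-pdf-spark33_2.12",
    "3.4": "spark-pdf-spark34_2.12",
    "3.5": "spark-pdf-spark35_2.12",
}


def coordinate_for(spark_version: str) -> str:
    """Return the Maven coordinate for a Spark version such as '3.5.0'."""
    # Keep only "major.minor", e.g. "3.5.0" -> "3.5".
    minor_line = ".".join(spark_version.split(".")[:2])
    try:
        artifact = ARTIFACTS[minor_line]
    except KeyError:
        raise ValueError(
            f"No artifact in this table for Spark {spark_version}"
        ) from None
    return f"com.stabrise:{artifact}:{SPARK_PDF_VERSION}"
```

The returned string can be used as the value of the `spark.jars.packages` setting when building the session, for example with `coordinate_for(pyspark.__version__)`.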

Options for the data source:

imageType: Output image type. Can be "BINARY", "GREY", or "RGB". Default: "RGB".
resolution: Resolution for rendering a PDF page to an image, in dpi. Default: "300".
pagePerPartition: Number of pages per partition in the Spark DataFrame. Default: "5".
reader: PDF reader to use. Supported values: "pdfBox" (based on the PDFBox Java library) and "gs" (based on Ghostscript, which must be installed on the system).
ocrConfig: Tesseract OCR configuration. Default: "psm=3". For more information, see Tesseract OCR Params.
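
Because all options are passed as strings, a typo in a value may only surface when the job runs. The sketch below checks the values against the list above before they reach Spark. `pdf_options` is a hypothetical helper, not part of the package; its defaults mirror the documented ones, except that the `reader` default of "pdfBox" is this sketch's own choice, since the list does not state one.

```python
# Hypothetical helper: check option values against the list above and
# return them as strings, ready to pass to the "pdf" data source.
IMAGE_TYPES = {"BINARY", "GREY", "RGB"}
READERS = {"pdfBox", "gs"}


def pdf_options(image_type="RGB", resolution=300, page_per_partition=5,
                reader="pdfBox", ocr_config="psm=3"):
    """Validate the options and return them as a dict of strings."""
    if image_type not in IMAGE_TYPES:
        raise ValueError(f"imageType must be one of {sorted(IMAGE_TYPES)}")
    if reader not in READERS:
        raise ValueError(f"reader must be one of {sorted(READERS)}")
    if int(resolution) <= 0 or int(page_per_partition) <= 0:
        raise ValueError("resolution and pagePerPartition must be positive")
    return {
        "imageType": image_type,
        "resolution": str(resolution),
        "pagePerPartition": str(page_per_partition),
        "reader": reader,
        "ocrConfig": ocr_config,
    }
```

In PySpark the result can be unpacked into the reader, for example `spark.read.format("pdf").options(**pdf_options(resolution=200))`.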

Output Columns in the DataFrame:

The DataFrame contains the following columns:

path: path to the file
filename: name of the file
page_number: page number within the document
text: text extracted from the text layer of the PDF page
image: image representation of the page
document: OCR-extracted text from the rendered page image (calls Tesseract OCR)
partition_number: partition number
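
Since `text` comes from the PDF text layer and `document` from OCR, a scanned page may have an empty `text` value while `document` still carries content. A minimal sketch of a fallback between the two, operating on a row that has already been converted to a plain dict (for example with `row.asDict(recursive=True)` in PySpark). `page_text` is a hypothetical helper, not part of the package.

```python
# Hypothetical helper: choose the text for one page from a row given as a
# plain dict with the columns described above.
def page_text(row: dict) -> str:
    """Return the text-layer text, or the OCR text when the layer is empty."""
    layer_text = (row.get("text") or "").strip()
    if layer_text:
        return layer_text
    # Fall back to the OCR result stored in the "document" struct.
    document = row.get("document") or {}
    return (document.get("text") or "").strip()
```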

Output Schema:

root
 |-- path: string (nullable = true)
 |-- filename: string (nullable = true)
 |-- page_number: integer (nullable = true)
 |-- partition_number: integer (nullable = true)
 |-- text: string (nullable = true)
 |-- image: struct (nullable = true)
 |    |-- path: string (nullable = true)
 |    |-- resolution: integer (nullable = true)
 |    |-- data: binary (nullable = true)
 |    |-- imageType: string (nullable = true)
 |    |-- exception: string (nullable = true)
 |    |-- height: integer (nullable = true)
 |    |-- width: integer (nullable = true)
 |-- document: struct (nullable = true)
 |    |-- path: string (nullable = true)
 |    |-- text: string (nullable = true)
 |    |-- outputType: string (nullable = true)
 |    |-- bBoxes: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- text: string (nullable = true)
 |    |    |    |-- score: float (nullable = true)
 |    |    |    |-- x: integer (nullable = true)
 |    |    |    |-- y: integer (nullable = true)
 |    |    |    |-- width: integer (nullable = true)
 |    |    |    |-- height: integer (nullable = true)
 |    |-- exception: string (nullable = true)
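
The `bBoxes` array in the `document` struct holds one entry per recognized text fragment, with its position and a `score`. The sketch below filters those entries on collected data. It assumes that a higher `score` means a more reliable result; the scale of the score is not stated here, so the threshold is left to the caller. `confident_boxes` is a hypothetical helper, not part of the package.

```python
# Hypothetical helper: keep only the OCR boxes at or above a score threshold.
# `document` is the "document" struct of one row, given as a plain dict.
def confident_boxes(document: dict, min_score: float) -> list:
    """Return the bBoxes entries whose score is at least min_score."""
    boxes = document.get("bBoxes") or []
    return [
        box for box in boxes
        if box.get("score") is not None and box["score"] >= min_score
    ]
```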

Example of usage

Scala

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Spark PDF Example")
  .master("local[*]")
  .config("spark.jars.packages", "com.stabrise:spark-pdf-spark35_2.12:0.1.11")
  .getOrCreate()
  
val df = spark.read.format("pdf")
  .option("imageType", "BINARY")
  .option("resolution", "200")
  .option("pagePerPartition", "2")
  .option("reader", "pdfBox")
  .option("ocrConfig", "psm=11")
  .load("path to the pdf file(s)")

df.select("path", "document").show()

Python

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("SparkPdf") \
    .config("spark.jars.packages", "com.stabrise:spark-pdf-spark35_2.12:0.1.11") \
    .getOrCreate()

df = spark.read.format("pdf") \
    .option("imageType", "BINARY") \
    .option("resolution", "200") \
    .option("pagePerPartition", "2") \
    .option("reader", "pdfBox") \
    .option("ocrConfig", "psm=11") \
    .load("path to the pdf file(s)")

df.select("path", "document").show()

Disclaimer

This project is not affiliated with, endorsed by, or connected to the Apache Software Foundation or Apache Spark.
