A Python-based tool for extracting structured data from PDFs using OCR and regex, and exporting it to CSV. Ideal for processing invoices, logs, or scanned documents into organized, usable datasets.

towfique-elahe/pdf-to-structured-csv

Automated PDF to Structured CSV Data Extraction

Description:

A Python tool that extracts text from PDF files with OCR, structures it with regular expressions, and exports the result to CSV. It is particularly suited to scanned PDFs containing structured forms or invoices. pdf2image renders each page of the PDF as an image and pytesseract runs OCR on that image, so the tool also works on scanned PDFs that have no embedded text layer. Regex patterns then pull specific data fields out of the OCR text, and the patterns can be adjusted to match a particular document layout.

Features:

  • PDF-to-Image Conversion: Converts each PDF page to an image using pdf2image for high-fidelity OCR, making it suitable for scanned or image-heavy PDFs.
  • OCR with Pytesseract: Extracts text from the page images using pytesseract. Tesseract can recognize multiple languages when the corresponding language data is installed; accuracy depends on scan quality and image resolution.
  • Regex-Based Data Extraction: Uses Python’s re module to apply regular expressions for capturing specific data fields from extracted text, such as dates, ticket numbers, customer information, weights, prices, and other structured details.
  • Automated CSV Generation: Outputs structured data into a CSV file, with customizable headers, making it easy to analyze or integrate the data with other applications.
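The first two features can be sketched as follows. This is a minimal illustration, not the repository's actual script: the function names `ocr_images` and `ocr_pdf`, the 300 DPI default, and the `"eng"` language code are assumptions chosen for the example. Only `pdf2image.convert_from_path` and `pytesseract.image_to_string` are real library calls.

```python
from typing import Callable, Iterable, List


def ocr_images(images: Iterable, image_to_text: Callable[[object], str]) -> List[str]:
    """Run the given OCR callable on each page image; return one string per page."""
    return [image_to_text(image) for image in images]


def ocr_pdf(pdf_path: str, dpi: int = 300, lang: str = "eng") -> List[str]:
    """Render every page of a PDF to an image and OCR it (hypothetical helper)."""
    # Imported lazily so this module can be loaded without the OCR stack installed.
    from pdf2image import convert_from_path
    import pytesseract

    # convert_from_path returns a list of PIL images, one per page.
    pages = convert_from_path(pdf_path, dpi=dpi)
    return ocr_images(pages, lambda image: pytesseract.image_to_string(image, lang=lang))
```

Keeping the OCR call behind a plain callable makes the page loop easy to exercise with a stand-in function when Tesseract is not available.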

Technical Details:

Dependencies:

  • pdf2image: Converts each page of the PDF into high-resolution images for improved OCR accuracy.
  • pytesseract: Provides OCR capabilities to recognize and extract text from images.
  • re (Regex): Extracts targeted fields from the OCR output text.
  • csv: Saves extracted data to CSV, ensuring structured and organized output.
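The CSV step could look like the sketch below, which uses the standard library's `csv.DictWriter`. The header list and the `write_records` helper are assumptions made for illustration; the actual script may name and order its columns differently.

```python
import csv

# Assumed column order, based on the field names mentioned in this README.
HEADERS = ["Ticket", "Date", "Time", "Customer", "Transporter", "Gross Weight"]


def write_records(records, csv_path, headers=HEADERS, missing="N/A"):
    """Write a list of dicts to a CSV file (hypothetical helper)."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(
            handle,
            fieldnames=headers,
            restval=missing,        # value written for keys absent from a record
            extrasaction="ignore",  # silently drop keys that are not in headers
        )
        writer.writeheader()
        writer.writerows(records)
```

Passing a different `headers` list changes both the header row and the column order, which is one way to keep the headers customizable.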

Data Extraction Example:

  • Extracted fields include Ticket, Date, Time, Customer, Transporter, Gross Weight, and others. Each field is captured by its own regex pattern.
  • Handles missing data by writing "N/A" for any field that is not found, so every CSV row has the same columns.
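A minimal sketch of this per-field extraction with an "N/A" fallback is shown below. The patterns are hypothetical and assume the OCR text contains lines of the form `Label: value`; real documents will likely need patterns tuned to their own layout.

```python
import re

# Hypothetical patterns, one capture group per field. Adjust to the real layout.
FIELD_PATTERNS = {
    "Ticket": r"Ticket\s*(?:No\.?|#)?\s*:?\s*(\d+)",
    "Date": r"Date\s*:?\s*(\d{1,2}/\d{1,2}/\d{2,4})",
    "Time": r"Time\s*:?\s*(\d{1,2}:\d{2}(?::\d{2})?)",
    "Customer": r"Customer[ \t]*:[ \t]*([^\n]+)",
    "Transporter": r"Transporter[ \t]*:[ \t]*([^\n]+)",
    "Gross Weight": r"Gross\s+Weight\s*:?\s*([\d,.]+)",
}


def extract_fields(text, patterns=FIELD_PATTERNS, missing="N/A"):
    """Return a dict with one entry per field, using `missing` when no match is found."""
    record = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, text, flags=re.IGNORECASE)
        record[name] = match.group(1).strip() if match else missing
    return record


sample = (
    "Ticket No: 10452\n"
    "Date: 12/03/2024\n"
    "Time: 14:35\n"
    "Customer: Acme Aggregates\n"
    "Gross Weight: 24,560 kg\n"
)
record = extract_fields(sample)
print(record["Ticket"], record["Transporter"])  # prints: 10452 N/A
```

Because every field name always appears in the returned dict, each page produces a row with the same set of columns even when OCR misses a value.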

Setup and Usage:

  1. Install the Python dependencies: pip install pdf2image pytesseract
  2. Install poppler-utils, which pdf2image needs to render PDF pages, and the Tesseract OCR engine, which pytesseract calls.
  3. Run the script, specifying your PDF file, and view the results in the generated CSV.

This tool is aimed at analysts, developers, and organizations that need to extract and organize structured data from PDF documents such as invoices, logs, or records.
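Before a first run, it can help to confirm that the external programs the Python packages rely on are on the PATH: pdf2image calls Poppler's `pdftoppm`, and pytesseract calls the `tesseract` executable. The helper below is a hypothetical pre-flight check and is not part of the repository.

```python
import shutil

# Executables assumed to be required: Tesseract's CLI and Poppler's pdftoppm.
REQUIRED_BINARIES = ("tesseract", "pdftoppm")


def find_missing_binaries(names=REQUIRED_BINARIES):
    """Return the names from `names` that cannot be found on the PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = find_missing_binaries()
    if missing:
        print("Missing external tools:", ", ".join(missing))
    else:
        print("All external tools found.")
```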
