Skip to content

Latest commit

 

History

History
239 lines (181 loc) · 12.6 KB

README.md

File metadata and controls

239 lines (181 loc) · 12.6 KB
yobix ai logo

Extractous

https://pypi.python.org/pypi/unstructured/ Commits per month Downloads

Extractous offers a fast and efficient solution for extracting content and metadata from various documents types such as PDF, Word, HTML, and many other formats. Our goal is to deliver a fast and efficient comprehensive solution in Rust with bindings for many programming languages.


Demo: showing that Extractous 🚀 is 25x faster than the popular unstructured-io library ($65m in funding and 8.5k GitHub stars). For complete benchmarking details please consult our benchmarking repository

unstructured_vs_extractous * demo running at 5x recoding speed

Why Extractous?

Extractous was born out of frustration with the need to rely on external services or APIs for content extraction from unstructured data. Do we really need to call external APIs or run special servers just for content extraction? Couldn't extraction be performed locally and efficiently?

In our search for solutions, unstructured-io stood out as the popular and widely-used library for parsing unstructured content with in-process parsing. However, we identified several significant limitations:

  • Architecturally, unstructured-io wraps around numerous heavyweight Python libraries, resulting in slow performance and high memory consumption (see our benchmarks for more details).
  • Inefficient in utilizing multiple CPU cores for data processing tasks, which are predominantly CPU-bound. This inefficiency is due to limitations in its dependencies and constraints like the Global Interpreter Lock (GIL), which prevents multiple threads from executing Python bytecode simultaneously.
  • As unstructured-io evolves, it is becoming increasingly complicated, transitioning into more of a complex framework and focusing more offering an external API service for text and metadata extraction.

In contrast, Extractous maintains a dedicated focus on text and metadata extraction. It achieves significantly faster processing speeds and lower memory utilization through native code execution.

  • Built with Rust: The core is developed in Rust, leveraging its high performance, memory safety, multi-threading capabilities, and zero-cost abstractions.
  • Extended format support with Apache Tika: For file formats not natively supported by the Rust core, we compile the well-known Apache Tika into native shared libraries using GraalVM ahead-of-time compilation technology. These shared libraries are then linked to and called from our Rust core. No local servers, no virtual machines, or any garbage collection, just pure native execution.
  • Bindings for many languages: we plan to introduce bindings for many languages. At the moment we offer only Python binding, which is essentially is a wrapper around the Rust core with the potential to circumventing the Python GIL limitation and make efficient use of multi-cores.

With Extractous, the need for external services or APIs is eliminated, making data processing pipelines faster and more efficient.

🌳 Key Features

  • High-performance unstructured data extraction optimized for speed and low memory usage.
  • Clear and simple API for extracting text and metadata content.
  • Automatically identifies document types and extracts content accordingly
  • Supports many file formats (most formats supported by Apache Tika).
  • Extracts text from images and scanned documents with OCR through tesseract-ocr.
  • Core engine written in Rust with bindings for Python and upcoming support for JavaScript/TypeScript.
  • Detailed documentation and examples to help you get started quickly and efficiently.
  • Free for Commercial Use: Apache 2.0 License.

🚀 Quickstart

Extractous provides a simple and easy-to-use API for extracting content from various file formats. Below are quick examples:

Python

  • Extract a file content to a string:
from extractous import Extractor

# Create a new extractor
extractor = Extractor()
extractor = extractor.set_extract_string_max_length(1000)
# if you need an xml
# extractor = extractor.set_xml_output(True)

# Extract text from a file
result, metadata = extractor.extract_file_to_string("README.md")
print(result)
print(metadata)
  • Extracting a file(URL / bytearray) to a buffered stream:
from extractous import Extractor

extractor = Extractor()
# if you need an xml
# extractor = extractor.set_xml_output(True)

# for file
reader, metadata = extractor.extract_file("tests/quarkus.pdf")
# for url
# reader, metadata = extractor.extract_url("https://www.google.com")
# for bytearray
# with open("tests/quarkus.pdf", "rb") as file:
#     buffer = bytearray(file.read())
# reader, metadata = extractor.extract_bytes(buffer)

result = ""
buffer = reader.read(4096)
while len(buffer) > 0:
    result += buffer.decode("utf-8")
    buffer = reader.read(4096)

print(result)
print(metadata)
  • Extracting a file with OCR:

You need to have Tesseract installed with the language pack. For example on debian sudo apt install tesseract-ocr tesseract-ocr-deu

from extractous import Extractor, TesseractOcrConfig

extractor = Extractor().set_ocr_config(TesseractOcrConfig().set_language("deu"))
result, metadata = extractor.extract_file_to_string("../../test_files/documents/eng-ocr.pdf")

print(result)
print(metadata)

Rust

  • Extract a file content to a string:
use extractous::Extractor;

fn main() {
    // Create a new extractor. Note it uses a consuming builder pattern
    let mut extractor = Extractor::new().set_extract_string_max_length(1000);
    // if you need an xml
    // extractor = extractor.set_xml_output(true);

    // Extract text from a file
    let (text, metadata) = extractor.extract_file_to_string("README.md").unwrap();
    println!("{}", text);
    println!("{:?}", metadata);
}
  • Extract a content of a file(URL/ bytes) to a StreamReader and perform buffered reading
use std::io::{BufReader, Read};
// use std::fs::File; use for bytes
use extractous::Extractor;

fn main() {
    // Get the command-line arguments
    let args: Vec<String> = std::env::args().collect();
    let file_path = &args[1];

    // Extract the provided file content to a string
    let extractor = Extractor::new();
    // if you need an xml
    // extractor = extractor.set_xml_output(true);

    let (stream, metadata) = extractor.extract_file(file_path).unwrap();
    // Extract url
    // let (stream, metadata) = extractor.extract_url("https://www.google.com/").unwrap();
    // Extract bytes
    // let mut file = File::open(file_path)?;
    // let mut buffer = Vec::new();
    // file.read_to_end(&mut buffer)?;
    // let (stream, metadata) = extractor.extract_bytes(&file_bytes);

    // Because stream implements std::io::Read trait we can perform buffered reading
    // For example we can use it to create a BufReader
    let mut reader = BufReader::new(stream);
    let mut buffer = Vec::new();
    reader.read_to_end(&mut buffer).unwrap();

    println!("{}", String::from_utf8(buffer).unwrap());
    println!("{:?}", metadata);
}
  • Extract content of PDF with OCR.

You need to have Tesseract installed with the language pack. For example on debian sudo apt install tesseract-ocr tesseract-ocr-deu

use extractous::Extractor;

fn main() {
  let file_path = "../test_files/documents/deu-ocr.pdf";

    let extractor = Extractor::new()
          .set_ocr_config(TesseractOcrConfig::new().set_language("deu"))
          .set_pdf_config(PdfParserConfig::new().set_ocr_strategy(PdfOcrStrategy::OCR_ONLY));
    // extract file with extractor
  let (content, metadata) = extractor.extract_file_to_string(file_path).unwrap();
  println!("{}", content);
  println!("{:?}", metadata);
}

🔥 Performance

  • Extractous is fast, please don't take our word for it, you can run the benchmarks yourself. For example extracting content out of sec10 filings pdf forms, Extractous is on average ~18x faster than unstructured-io:

extractous_speedup_relative_to_unstructured

  • Not just speed it is also memory efficient, Extractous allocates ~11x less memory than unstructured-io:

extractous_memory_efficiency_relative_to_unstructured

  • You might be questioning the quality of the extracted content, gues what we even do better in that regard:

extractous_memory_efficiency_relative_to_unstructured

📄 Supported file formats

Category Supported Formats Notes
Microsoft Office DOC, DOCX, PPT, PPTX, XLS, XLSX, RTF Includes legacy and modern Office file formats
OpenOffice ODT, ODS, ODP OpenDocument formats
PDF PDF Can extracts embedded content and supports OCR
Spreadsheets CSV, TSV Plain text spreadsheet formats
Web Documents HTML, XML Parses and extracts content from web documents
E-Books EPUB EPUB format for electronic books
Text Files TXT, Markdown Plain text formats
Images PNG, JPEG, TIFF, BMP, GIF, ICO, PSD, SVG Extracts embedded text with OCR
E-Mail EML, MSG, MBOX, PST Extracts content, headers, and attachments

🤝 Contributing

Contributions are welcome! Please open an issue or submit a pull request if you have any improvements or new features to propose.

🕮 License

This project is licensed under the Apache License 2.0. See the LICENSE file for details.