A collection of tools for parsing documents, such as PDFs, and extracting structured elements including URLs, citation contexts, tables, formulas, and figures. The toolset relies on AI-based text extraction and classification methods to support a range of document processing tasks.

The toolset includes PDF-to-text converters and parsers. For example, it can classify URI citation contexts by resource type, such as software or datasets, and determine the intent behind software mentions. It also uses structured information from PDFs, such as abstracts and authors, to automate common library tasks, including metadata extraction, topic detection, citation analysis, language detection, and the description of images and media.
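To make the idea of a URI citation context concrete, here is a minimal sketch in plain Python. It is not the toolset's API: the names `extract_url_contexts`, `classify_context`, and `RESOURCE_KEYWORDS` are hypothetical, the input is assumed to be text already extracted from a PDF, and a simple keyword lookup stands in for the AI-based classifier described above.

```python
import re

# Hypothetical helpers for illustration only; they are not part of the toolset.
# Matches http(s) URLs up to the next whitespace, quote, or closing bracket.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")


def extract_url_contexts(text):
    """Return (url, context_sentence) pairs found in already-extracted text."""
    results = []
    # Naive sentence split: break on whitespace that follows ., ! or ?
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        for match in URL_PATTERN.finditer(sentence):
            # Drop trailing punctuation that belongs to the sentence, not the URL.
            url = match.group(0).rstrip(".,;:")
            results.append((url, sentence.strip()))
    return results


# Illustrative keyword lists (an assumption); a trained model would replace this.
RESOURCE_KEYWORDS = {
    "software": ("code", "software", "library", "package", "repository"),
    "dataset": ("data", "dataset", "corpus", "benchmark"),
}


def classify_context(context):
    """Assign a coarse resource type to a citation context by keyword lookup."""
    lowered = context.lower()
    for label, keywords in RESOURCE_KEYWORDS.items():
        if any(word in lowered for word in keywords):
            return label
    return "other"


if __name__ == "__main__":
    sample = (
        "Our code is available at https://github.com/example/tool. "
        "The dataset is hosted at https://example.org/data."
    )
    for url, context in extract_url_contexts(sample):
        print(url, classify_context(context))
```

The keyword lookup is deliberately crude and only shows the shape of the task: each URL is paired with its surrounding sentence, and that sentence is what gets classified.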

License

#LANL O number O4805 © 2024. Triad National Security, LLC. All rights reserved. This program was produced under U.S. Government contract 89233218CNA000001 for Los Alamos National Laboratory (LANL), which is operated by Triad National Security, LLC for the U.S. Department of Energy/National Nuclear Security Administration. All rights in the program are reserved by Triad National Security, LLC, and the U.S. Department of Energy/National Nuclear Security Administration. The Government is granted for itself and others acting on its behalf a nonexclusive, paid-up, irrevocable worldwide license in this material to reproduce, prepare derivative works, distribute copies to the public, perform publicly and display publicly, and to permit others to do so.
