extract-info-from-pdf-paper

This Python script uses pdfminer.six, PyPDF2, and pdf2image to extract information (text and images) from a PDF paper.
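The script itself is not reproduced in this README, so the following is only a minimal sketch of how these three libraries are commonly combined, not the actual contents of extract-text-image.py. The function names, the page-image naming scheme, and the dpi value are assumptions made for illustration. Note that pdf2image renders whole pages as images; it does not pull out individual embedded figures.

```python
from pathlib import Path


def page_image_path(out_dir, stem, page_number):
    """Build the output path for one rendered page (hypothetical naming scheme)."""
    return Path(out_dir) / f"{stem}-page{page_number:03d}.png"


def extract_text_to_file(pdf_path, out_text_path):
    """Extract all text with pdfminer.six and write it to a UTF-8 text file."""
    # Imported here so the module still loads when the package is not installed.
    from pdfminer.high_level import extract_text

    text = extract_text(pdf_path)
    out_path = Path(out_text_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(text, encoding="utf-8")
    return len(text)


def count_pages(pdf_path):
    """Return the number of pages, read with PyPDF2."""
    from PyPDF2 import PdfReader

    return len(PdfReader(pdf_path).pages)


def render_pages(pdf_path, out_dir, dpi=200):
    """Render every page to a PNG with pdf2image (needs poppler installed)."""
    from pdf2image import convert_from_path

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stem = Path(pdf_path).stem
    saved = []
    for number, image in enumerate(convert_from_path(pdf_path, dpi=dpi), start=1):
        target = page_image_path(out_dir, stem, number)
        image.save(target, "PNG")
        saved.append(target)
    return saved
```

With the sample file from this repository, extract_text_to_file("2305.02301.pdf", "output/2305.02301.txt") would write the extracted text, and render_pages("2305.02301.pdf", "output") would write one PNG per page.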

pip3 install -r requirement.txt

python3 extract-text-image.py

See the output folder for the extraction results.

To process a different paper, change the input and output file paths in the Python script:

pdf_path = "2305.02301.pdf"
out_text_path = "output/2305.02301.txt"
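Rather than editing both paths by hand, the output path could be derived from the input path. This is a small sketch of that idea; default_text_path is a hypothetical helper, not something defined in the script, and it assumes the output folder is named "output" as in the example above.

```python
from pathlib import Path


def default_text_path(pdf_path, out_dir="output"):
    """Return out_dir/<PDF file name without extension>.txt."""
    return Path(out_dir) / (Path(pdf_path).stem + ".txt")


pdf_path = "2305.02301.pdf"
out_text_path = default_text_path(pdf_path)
print(out_text_path.as_posix())  # output/2305.02301.txt
```

Only the last extension is stripped, so the dot inside the arXiv identifier is preserved.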

Windows Usage

Download the Windows version of poppler: https://github.com/oschwartz10612/poppler-windows/releases
After extracting the archive, move the "poppler-23.11.0" subfolder to C:\Program Files.
Add "C:\Program Files\poppler-23.11.0\Library\bin" to the PATH environment variable, then save and exit.
Run the script.
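After changing the environment variable, it can help to confirm that poppler is actually visible before running the extraction. The check below is a sketch under the assumption that pdftoppm, one of the poppler tools that pdf2image calls, is the executable to look for; find_poppler is a hypothetical helper name. Open a new terminal first so the updated PATH is picked up.

```python
import shutil


def find_poppler(tool="pdftoppm"):
    """Return the full path of a poppler executable found on PATH, or None."""
    return shutil.which(tool)


location = find_poppler()
if location is None:
    print("pdftoppm was not found on PATH; check the environment variable.")
else:
    print(f"poppler found: {location}")
```

As an alternative to editing PATH, pdf2image's convert_from_path accepts a poppler_path argument, so the bin folder can be passed to it directly.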
