pdfimages_combine creates a script that combines the images and masks extracted from a PDF by pdfimages. Images are copied with their metadata stripped so that the output is deterministic; otherwise each image contains a timestamp from when it was extracted from the PDF, which is annoying.
dnf install poppler-utils cmake gcc-c++ ImageMagick fdupes
# If you are using the tar file, skip the git submodule step: it will fail because the directory is not a git repo
git submodule update --init
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make
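Before building, it can help to confirm that the tools installed above are actually on the PATH. The following is a minimal sketch only: the function name `missing_tools` is hypothetical and not part of this project, and which commands to check is an assumption based on the packages named above.

```shell
# Print each named command that is not found on PATH, one per line.
# Returns 1 if at least one command is missing, 0 otherwise.
# (missing_tools is a hypothetical helper, not part of pdfimages_combine.)
missing_tools() {
    local tool status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            status=1
        fi
    done
    return "$status"
}
```

For example, `missing_tools pdfimages cmake g++ fdupes` prints nothing when all four are installed. ImageMagick is left out of that list because its command name can differ between versions.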
An example of running it would be
mkdir images
cd images
pdfimages -list ../MyPDF.pdf > list.txt
pdfimages -p -all ../MyPDF.pdf mypdf
pdfimages_combine mypdf > script.sh
bash script.sh
Optionally get rid of duplicates
fdupes -N --delete output
Check the generated script.sh before running it, and only run it if you are happy with it. NB: pdfimages_combine creates a subdirectory called output
to put the results in.
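One way to check the generated script without executing anything is to let bash parse it and then look at the first few lines. This is a sketch under assumptions: `preview_script` is a hypothetical helper name, and the 20-line preview length is arbitrary.

```shell
# Syntax-check a script without running it, then show its first lines.
# (preview_script is a hypothetical helper, not part of pdfimages_combine.)
preview_script() {
    local script="$1"
    # bash -n reads and parses the file but executes none of its commands.
    bash -n "$script" || return 1
    echo "$(wc -l < "$script") lines in $script; first 20:"
    head -n 20 "$script"
}
```

Calling `preview_script script.sh` returns a non-zero status if bash reports a syntax error, so it can be used as a guard before `bash script.sh`.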
MuPDF is a very similar tool to pdfimages, and it sometimes does better at image extraction.
dnf install mupdf
Example command is
mutool extract MyPDF.pdf
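Because mutool extract writes its output files into the current working directory, it may be tidier to run it inside a dedicated directory. The sketch below assumes that behaviour; the function name `extract_with_mutool`, the default directory name `mupdf-images`, and the `MUTOOL` override variable are all hypothetical.

```shell
# Run "mutool extract" inside a dedicated output directory.
# (extract_with_mutool, mupdf-images and MUTOOL are hypothetical names.)
extract_with_mutool() {
    local pdf="$1" outdir="${2:-mupdf-images}"
    local dir
    # Resolve the PDF's directory to an absolute path before changing directory.
    dir="$(cd "$(dirname "$pdf")" && pwd)" || return 1
    mkdir -p "$outdir" || return 1
    # The subshell keeps the caller's working directory unchanged.
    ( cd "$outdir" && "${MUTOOL:-mutool}" extract "$dir/$(basename "$pdf")" )
}
```

For example, `extract_with_mutool MyPDF.pdf` puts the extracted files under mupdf-images, and `MUTOOL=echo extract_with_mutool MyPDF.pdf` prints the arguments instead of running mutool.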
Ghostscript is really useful, especially for map pages where you want all the map labels, because you can "print" a PDF to PNGs or JPEGs. You can override the paper size so your images are not letter/A4 sized.
dnf install ghostscript
Some example commands
Convert multi page PDF to PNG files
gs -dNOPAUSE -dBATCH -sDEVICE=png16m -r600 -dGraphicsAlphaBits=4 -sOutputFile="image-%d.png" MyPDF.pdf
gs -dNOPAUSE -dBATCH -sDEVICE=png16m -dFirstPage=3 -dLastPage=4 -sOutputFile="image-%d.png" MyPDF.pdf
Other useful switches are:
-r300 - Save image at 300 DPI
-dGraphicsAlphaBits=4 - Highest quality output
-sPAPERSIZE=a4 - Change the paper size
-sPageList=1,3,5 - List of pages rather than range
More info at: https://www.ghostscript.com/doc/current/Use.htm#PDF_switches
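The switches above can be combined in a small wrapper, sketched here under assumptions: the function name `render_pages`, the `GS` override variable, the `page-%d.png` output name, and the 300 DPI default are hypothetical choices, not something Ghostscript or this project provides.

```shell
# Render a list of pages from a PDF to PNG files with Ghostscript.
# Arguments: PDF path, page list such as 1,3,5, resolution in DPI (default 300).
# (render_pages and the GS override variable are hypothetical names.)
render_pages() {
    local pdf="$1" pages="$2" dpi="${3:-300}"
    "${GS:-gs}" -dNOPAUSE -dBATCH -sDEVICE=png16m "-r$dpi" \
        -dGraphicsAlphaBits=4 "-sPageList=$pages" \
        -sOutputFile="page-%d.png" "$pdf"
}
```

For example, `render_pages MyPDF.pdf 1,3,5 600` renders three pages at 600 DPI, and `GS=echo render_pages MyPDF.pdf 1,3,5` prints the switches instead of running Ghostscript.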