pdf2text2pdf

Scripts to extract the text layer from PDFs and rebuild lighter PDFs

These scripts were developed to:

- extract Unicode text from a PDF (or DjVu) file, with the position and size of every word on every page, and
- rebuild a lighter version of the PDF file that includes ONLY the text layer.

This lets you substantially reduce the size of the PDF file, and potentially implement full-text search.
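As a rough illustration of the full-text search idea, the sketch below builds a small inverted index from word records. It is not part of this repository: the `(page, x, y, text)` tuple layout and the names `build_index` and `search` are assumptions made up for the example.

```python
from collections import defaultdict


def build_index(words):
    """Map each case-folded word to the places where it occurs.

    `words` is an iterable of (page, x, y, text) tuples. This layout is
    hypothetical and is not the format produced by the pdf2text scripts.
    """
    index = defaultdict(list)
    for page, x, y, text in words:
        index[text.casefold()].append((page, x, y))
    return index


def search(index, term):
    """Return every (page, x, y) position of `term`, ignoring case."""
    return index.get(term.casefold(), [])
```

Because each hit carries a page number and coordinates, a viewer could in principle jump to the match and highlight it on the page.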

The first folder, pdf2text, includes a number of Perl and shell scripts to extract the text layer from a PDF or DjVu file, and return a text file with the position and size of every word in the PDF.
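The exact layout of the extracted text file is not described here, so the following is only a sketch of how such a file might be read back. It assumes a hypothetical tab-separated line format (`page`, `x`, `y`, `width`, `height`, `word`); the `Word` type and `parse_word_lines` function are illustrative names, not part of the scripts.

```python
from typing import Iterable, List, NamedTuple


class Word(NamedTuple):
    page: int
    x: float
    y: float
    width: float
    height: float
    text: str


def parse_word_lines(lines: Iterable[str]) -> List[Word]:
    """Parse lines of the assumed form: page<TAB>x<TAB>y<TAB>width<TAB>height<TAB>word.

    This format is an assumption for illustration; the real output of the
    pdf2text scripts may differ.
    """
    words = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        page, x, y, width, height, text = line.split("\t", 5)
        words.append(
            Word(int(page), float(x), float(y), float(width), float(height), text)
        )
    return words
```

To read a real file, open it with `encoding="utf-8"` and pass the file object to `parse_word_lines`, so that non-ASCII words survive intact.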

The second folder, text2pdf, contains a Python script that builds a PDF file from such a text file.
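To give an idea of what rebuilding a text-only PDF involves, here is a minimal sketch that uses only the Python standard library. It is not the text2pdf script. It assumes word records shaped as `(page, x, y, size, text)`, with coordinates in PDF points measured from the bottom-left corner, and it uses the built-in Helvetica font with WinAnsi encoding, so it handles only Western European characters rather than full Unicode.

```python
def escape_pdf_string(text):
    """Escape the characters that are special inside a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def build_text_only_pdf(words, page_width=595, page_height=842):
    """Return the bytes of a PDF that contains only positioned text.

    `words` is an iterable of (page, x, y, size, text) tuples; this layout
    is an assumption for the example, not the format used by text2pdf.
    """
    pages = {}
    for page, x, y, size, text in words:
        pages.setdefault(page, []).append((x, y, size, text))

    # Object numbers: 1 = catalog, 2 = page tree, 3 = font,
    # then one (page, content stream) pair per page.
    bodies = {
        3: b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica"
           b" /Encoding /WinAnsiEncoding >>"
    }
    kids = []
    for i, number in enumerate(sorted(pages)):
        page_id = 4 + 2 * i
        content_id = page_id + 1
        kids.append(f"{page_id} 0 R")
        ops = [
            f"BT /F1 {size:g} Tf 1 0 0 1 {x:g} {y:g} Tm "
            f"({escape_pdf_string(text)}) Tj ET"
            for x, y, size, text in pages[number]
        ]
        # Characters outside cp1252 are replaced with "?" in this sketch.
        stream = "\n".join(ops).encode("cp1252", errors="replace")
        bodies[content_id] = (
            b"<< /Length %d >>\nstream\n" % len(stream) + stream + b"\nendstream"
        )
        bodies[page_id] = (
            f"<< /Type /Page /Parent 2 0 R "
            f"/MediaBox [0 0 {page_width} {page_height}] "
            f"/Resources << /Font << /F1 3 0 R >> >> "
            f"/Contents {content_id} 0 R >>"
        ).encode("ascii")
    bodies[1] = b"<< /Type /Catalog /Pages 2 0 R >>"
    bodies[2] = (
        f"<< /Type /Pages /Kids [{' '.join(kids)}] /Count {len(kids)} >>"
    ).encode("ascii")

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for obj_id in range(1, len(bodies) + 1):
        offsets.append(len(out))  # byte offset of each object, for the xref table
        out += b"%d 0 obj\n" % obj_id + bodies[obj_id] + b"\nendobj\n"

    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(bodies) + 1)
    out += b"0000000000 65535 f \n"
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF\n" % (
        len(bodies) + 1,
        xref_pos,
    )
    return bytes(out)
```

Writing the returned bytes to a file opened in binary mode yields a PDF with no images, which is where the size reduction comes from. Supporting arbitrary scripts would additionally require embedding a Unicode-capable font, which this sketch does not attempt.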

The Python script is still a beta version that needs testing and fixing. In theory it should support any language whose text can be encoded in UTF-8, although we are still far from that.

If you would like to contribute, please send your comments. Thanks!

flppgg/pdf2text2pdf
