This repository contains four Python scripts for document processing:
- JPG to Text Extractor: Extracts text from JPG images using Tesseract OCR.
- DOCX Combiner: Combines multiple .docx files into a single document while preserving formatting.
- JPG/PNG to TIFF Converter: Converts JPG and PNG images to TIFF format.
- PDF to JPG Converter: Converts PDF files to JPG images.
- Image Preprocessing: Adaptive thresholding and denoising using OpenCV.
- Text Extraction: Utilizes Tesseract OCR to extract text from processed images.
- GUI: Easy folder selection for input and output directories.
- Combines Multiple DOCX Files: Merges multiple .docx files into a single document.
- Preserves Formatting: Retains the original formatting of paragraphs and tables from the source documents.
- GUI for File Selection: Utilizes a simple GUI for selecting the input files and output location.
- Image Conversion: Converts JPG and PNG images to TIFF format.
- GUI: Easy folder selection for input and output directories.
- PDF to Image Conversion: Converts PDF files to JPG images.
- GUI: Easy folder selection for input and output directories.
- Python 3.x
- python-docx
- pytesseract
- OpenCV
- Pillow
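- Tkinter (part of the Python standard library; some Linux distributions package it separately, for example as python3-tk)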
- Install Python: Ensure you have Python 3.x installed on your system.
- Install Required Python Packages:
  pip install pytesseract pillow opencv-python-headless python-docx
- Install Tesseract OCR: Download and install Tesseract OCR for your operating system. Then update the tesseract_cmd variable in the JPG to Text Extractor and PDF to JPG Converter scripts to match the installation path of Tesseract OCR on your system:
  pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
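A hard-coded path works on one machine but breaks when the scripts are moved to another. As a minimal sketch (not taken from the scripts themselves), the path could be resolved at startup by checking the system PATH first and falling back to the Windows location shown above. The helper name `find_tesseract` and the constant `WINDOWS_DEFAULT` are hypothetical; only `pytesseract.pytesseract.tesseract_cmd` comes from the text.

```python
import shutil
from pathlib import Path

# Default Windows install location quoted in the installation step above.
WINDOWS_DEFAULT = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def find_tesseract(fallbacks=(WINDOWS_DEFAULT,)):
    """Return a path to the tesseract executable, or None if none is found.

    Hypothetical helper: looks on the system PATH first, then tries each
    fallback location in order.
    """
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    for candidate in fallbacks:
        if Path(candidate).is_file():
            return str(candidate)
    return None


# Usage (requires pytesseract to be installed):
#   import pytesseract
#   cmd = find_tesseract()
#   if cmd is not None:
#       pytesseract.pytesseract.tesseract_cmd = cmd
```

If the function returns None, Tesseract is neither on the PATH nor at any fallback location, and the variable still has to be set by hand.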
- Clone the Repository:
  git clone https://github.com/yourusername/document-processing-tools.git
  cd document-processing-tools
- Run the Script:
  python jpg_to_text_extractor.py
  Replace jpg_to_text_extractor.py with the actual name of your script file.
- Select Folders:
  - Use the GUI to select the input folder containing JPG images.
  - Select the output folder where extracted text files will be saved.
- Start Extraction: Click the "Start Extraction" button in the GUI to begin the process.
- Run the Script:
  python docx_combiner.py
  Replace docx_combiner.py with the actual name of your script file.
- Select Files:
  - Use the GUI to select the .docx files you want to combine.
  - Select the location and name for the output file.
- Combine and Save: The script will combine the selected .docx files into a single document and save it to the specified location.
- Run the Script:
  python jpg_png_to_tiff_converter.py
  Replace jpg_png_to_tiff_converter.py with the actual name of your script file.
- Select Folders:
  - Use the GUI to select the input folder containing JPG and PNG images.
  - Select the output folder where the converted TIFF images will be saved.
- Start Conversion: Click the "Start Conversion" button in the GUI to begin the process.
- Run the Script:
  python pdf_to_jpg_gui.py
  Replace pdf_to_jpg_gui.py with the actual name of your script file.
- Select Folders:
  - Use the GUI to select the input folder containing PDF files.
  - Select the output folder where the converted JPG images will be saved.
- Start Conversion: Click the "Start Conversion" button in the GUI to begin the process.
Preprocesses the image to enhance OCR accuracy by applying adaptive thresholding and denoising.
Processes each JPG image in the input folder, extracts text using Tesseract OCR, and saves the text to the output folder.
Launches a Tkinter-based GUI for selecting input and output folders and starting the text extraction process.
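The preprocessing and per-image extraction steps described above could look roughly like the sketch below. It is an illustration under assumptions, not the repository's code: the function names (`preprocess_image`, `ocr_with_tesseract`, `extract_folder`), the denoising strength of 10, the threshold block size of 31, and the constant 2 are all hypothetical choices. The GUI step is left out, and the OCR function is passed in as a parameter so the folder loop can be tried without Tesseract installed.

```python
from pathlib import Path

JPG_SUFFIXES = {".jpg", ".jpeg"}


def preprocess_image(image_path):
    """Return a denoised, binarized grayscale image as a NumPy array."""
    import cv2  # imported lazily so extract_folder works without OpenCV

    gray = cv2.imread(str(image_path), cv2.IMREAD_GRAYSCALE)
    if gray is None:
        raise ValueError(f"Could not read image: {image_path}")
    # Non-local means denoising; a strength of 10 is an assumed value.
    denoised = cv2.fastNlMeansDenoising(gray, None, 10)
    # Adaptive thresholding copes with uneven lighting across the page.
    # Block size 31 and constant 2 are assumed values; block size must be odd.
    return cv2.adaptiveThreshold(
        denoised, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 31, 2
    )


def ocr_with_tesseract(image_path):
    """Preprocess one image and return the text Tesseract finds in it."""
    import pytesseract

    return pytesseract.image_to_string(preprocess_image(image_path))


def extract_folder(input_dir, output_dir, ocr=ocr_with_tesseract):
    """Write one .txt file per JPG in input_dir and return the written paths."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for image_path in sorted(Path(input_dir).iterdir()):
        if image_path.suffix.lower() not in JPG_SUFFIXES:
            continue
        text_path = output_dir / (image_path.stem + ".txt")
        text_path.write_text(ocr(image_path), encoding="utf-8")
        written.append(text_path)
    return written
```

Calling `extract_folder("scans", "text")` with the default `ocr` argument requires OpenCV, pytesseract, and a working Tesseract installation; the folder names here are placeholders.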
Launches a file dialog to select multiple .docx files for combining.
Appends paragraphs from the source document to the target document, preserving formatting.
Appends tables from the source document to the target document, preserving formatting.
Combines the selected .docx files into a single document and saves it to the specified output path.
The main function that orchestrates file selection, combining, and saving the combined document.
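One common way to append paragraphs and tables with python-docx while keeping their formatting is to deep-copy the underlying XML elements from each source body into the target body. The sketch below shows that approach; it is an assumption about how such a combiner might work, not the script's actual code, and the function name `combine_docx` is hypothetical. Copying raw elements keeps direct formatting, but styles, numbering definitions, and images that exist only in a source file are not carried over.

```python
from copy import deepcopy


def combine_docx(paths, output_path):
    """Append the body content of paths[1:] to paths[0] and save the result."""
    # Imported lazily so this module can be loaded without python-docx.
    from docx import Document
    from docx.oxml.ns import qn

    if not paths:
        raise ValueError("At least one input file is required")

    section_tag = qn("w:sectPr")
    combined = Document(str(paths[0]))
    body = combined.element.body
    for path in paths[1:]:
        for element in Document(str(path)).element.body:
            if element.tag == section_tag:
                continue  # skip the source's section properties
            clone = deepcopy(element)
            sect_pr = body.find(section_tag)
            if sect_pr is not None:
                # Body content has to stay ahead of the section properties.
                sect_pr.addprevious(clone)
            else:
                body.append(clone)
    combined.save(str(output_path))
```

The first file supplies the page setup and style definitions for the combined document, so the order of `paths` matters.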
Converts JPG and PNG images in the input folder to TIFF format and saves them to the output folder.
Launches a Tkinter-based GUI for selecting input and output folders and starting the conversion process.
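The folder conversion described above could be sketched with Pillow as follows. This is an illustrative outline under assumptions, not the script itself: the names `plan_tiff_conversion` and `convert_folder_to_tiff` are hypothetical, and the GUI is omitted. Planning the source and destination pairs separately from the actual conversion makes it easy to preview what will be written.

```python
from pathlib import Path

SOURCE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def plan_tiff_conversion(input_dir, output_dir):
    """Pair each JPG/PNG in input_dir with its .tiff destination path."""
    output_dir = Path(output_dir)
    return [
        (src, output_dir / (src.stem + ".tiff"))
        for src in sorted(Path(input_dir).iterdir())
        if src.is_file() and src.suffix.lower() in SOURCE_SUFFIXES
    ]


def convert_folder_to_tiff(input_dir, output_dir):
    """Convert every planned image to TIFF and return the written paths."""
    from PIL import Image  # Pillow, imported lazily so planning works without it

    Path(output_dir).mkdir(parents=True, exist_ok=True)
    converted = []
    for src, dst in plan_tiff_conversion(input_dir, output_dir):
        with Image.open(src) as image:
            image.save(dst, format="TIFF")
        converted.append(dst)
    return converted
```

Note that in this sketch two inputs sharing a stem, such as photo.jpg and photo.png, map to the same photo.tiff, and the later one overwrites the earlier.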
This project is licensed under the GNU General Public License. See the LICENSE file for details.
Contributions are welcome! Please feel free to submit a pull request or open an issue.