Automates merging DOC, DOCX, and PDF files into a single document and runs a word frequency analysis on the result. It is intended to streamline document consolidation for large-scale projects.
- Converts DOC and DOCX files to PDF format
- Merges all PDF files into a single document
- Converts the merged PDF back to DOCX format
- Performs word frequency analysis on the final document
- Generates a detailed audit log of all operations
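
The first step in this list depends on telling apart the files that need conversion from those that are already PDFs. Here is a minimal sketch of that grouping, using only the standard library. The function name `group_by_type` and the dictionary keys are hypothetical and are not taken from `document_processor.py`; the script may organize this differently.

```python
from pathlib import Path

# Hypothetical helper for illustration; not part of document_processor.py.
WORD_EXTENSIONS = {".doc", ".docx"}


def group_by_type(filenames):
    """Split file names into Word documents (to convert) and PDFs (to merge as-is)."""
    groups = {"to_convert": [], "already_pdf": [], "skipped": []}
    for name in sorted(filenames):
        suffix = Path(name).suffix.lower()  # compare extensions case-insensitively
        if suffix in WORD_EXTENSIONS:
            groups["to_convert"].append(name)
        elif suffix == ".pdf":
            groups["already_pdf"].append(name)
        else:
            groups["skipped"].append(name)
    return groups
```

Sorting the names first keeps the merge order predictable from run to run, which is an assumption about the desired behavior rather than something the script is known to do.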
- Python 3.x
- Required Python packages:
  - PyPDF2
  - pdf2docx
  - pywin32 (provides the win32com module; Windows only)
  - tqdm
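
Before running the script, it can help to confirm that the required modules are importable. The following is a small sketch of such a check; the function name `missing_modules` is hypothetical and the check is not part of the script itself.

```python
from importlib.util import find_spec

# Import names for the required packages; win32com is installed by pywin32.
REQUIRED_MODULES = ["PyPDF2", "pdf2docx", "win32com", "tqdm"]


def missing_modules(module_names=REQUIRED_MODULES):
    """Return the module names that cannot be found in this environment."""
    return [name for name in module_names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are available.")
```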
- Clone this repository:
git clone https://github.com/yourusername/document-merger-analyzer.git
- Navigate to the project directory:
cd document-merger-analyzer
- Install the required packages:
pip install PyPDF2 pdf2docx pywin32 tqdm
- Run the script:
python document_processor.py
- Follow the prompts to:
  - Specify the input folder containing your documents
  - Name the output DOCX file
  - Enter words for frequency analysis
Example interaction:
Enter the path to the folder containing the files: C:\Users\YourName\Documents\InputFolder
Enter the name for the final output DOCX file (e.g., final_document.docx): merged_output.docx
Enter words to search for (one per line). Press Enter on a blank line to finish:
important
critical
urgent
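
The word prompt and the frequency count can be sketched as follows. This is an illustration only: the function names `read_search_words` and `count_words` are hypothetical, and the case-insensitive whole-word matching shown here is an assumption about how the analysis behaves, not a description of the script's actual logic.

```python
import re
from collections import Counter


def read_search_words(lines):
    """Collect one word per line until the first blank line."""
    words = []
    for line in lines:
        word = line.strip()
        if not word:
            break
        words.append(word)
    return words


def count_words(text, search_words):
    """Count case-insensitive whole-word occurrences of each search word."""
    # Assumption: a "word" is a run of letters, digits, or apostrophes.
    tokens = Counter(re.findall(r"[A-Za-z0-9']+", text.lower()))
    return {word: tokens[word.lower()] for word in search_words}
```

For interactive use, `read_search_words(iter(input, ""))` would read words from the keyboard until a blank line is entered.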
The script will process the files and save the merged document and audit log in your Documents folder.
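
An audit log of this kind can be kept as a plain text file with one timestamped line per operation. The sketch below shows one way to do that; the names `append_audit_entry` and `default_log_path`, the line format, and the file name `audit_log.txt` are all assumptions, since the script's actual log format is not described here.

```python
from datetime import datetime
from pathlib import Path


def default_log_path():
    """Return an assumed log location inside the user's Documents folder."""
    # The file name is hypothetical; the script may use a different one.
    return Path.home() / "Documents" / "audit_log.txt"


def append_audit_entry(log_path, message):
    """Append one timestamped line to the audit log, creating the file if needed."""
    timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    with open(log_path, "a", encoding="utf-8") as log_file:
        log_file.write(f"{timestamp} | {message}\n")
```

Opening the file in append mode for each entry means earlier entries survive if a later conversion step fails partway through a large batch.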
This script was created to handle a specific project involving merging hundreds of DOC files with some PDF files mixed in. It may require modifications for different use cases.
Christopher D. van der Kaay, Ph.D.
This project is licensed under the MIT License - see the LICENSE file for details.