This Python program analyzes the text of a set of documents and computes the similarity between every pair of them. It uses basic text processing and cosine similarity over term frequencies, and reports each result as a percentage.
- Input multiple documents interactively through the console.
- Normalize the text by converting it to lowercase and removing non-alphanumeric characters.
- Calculate term frequencies for each document.
- Determine cosine similarity scores between each pair of documents.
- List the most similar document pairs, with the number of pairs chosen by the user.
- Python 3.x
- No external libraries are required. The script uses only the standard library modules `collections`, `itertools`, `math`, and `re`.
No installation is required. Just make sure Python 3.x is installed on your system.
- Run the script in a Python environment.
- Enter the number of documents you want to analyze when prompted.
- Input each document one by one when prompted.
- After all documents are entered, the program will automatically calculate and display the similarity scores.
- When prompted, enter how many of the most similar document pairs you want to see.
- Input Handling: The program starts by asking for the number of documents. Each document is input one by one. Text cleaning is performed to standardize the input.
- Text Preprocessing: Each document's text is converted to lowercase, and all non-alphanumeric characters are removed. Text is then split into words to create a list of terms.
- Similarity Calculation: The program calculates term frequencies and uses these to compute cosine similarities between each pair of documents.
- Output: Similarity scores are presented as percentages, and the program can also list the top document pairs with the highest similarities based on user input.
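
The preprocessing and similarity steps described above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the function names `tokenize` and `cosine_similarity` are hypothetical, and the sketch assumes that whitespace is preserved during cleaning so the text can still be split into words.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text, drop non-alphanumeric characters, split into words."""
    # Assumption: whitespace is kept so that words can still be separated.
    cleaned = re.sub(r"[^a-z0-9\s]", "", text.lower())
    return cleaned.split()


def cosine_similarity(doc_a, doc_b):
    """Return the cosine similarity of two documents' term-frequency vectors."""
    tf_a, tf_b = Counter(tokenize(doc_a)), Counter(tokenize(doc_b))
    # Dot product over the terms of doc_a; a Counter returns 0 for missing terms.
    dot = sum(count * tf_b[term] for term, count in tf_a.items())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty document shares nothing with any other
    return dot / (norm_a * norm_b)


score = cosine_similarity("Hello, world!", "hello there")
print(f"{round(score * 100, 2)} %")
```

The two example documents share one of their two terms, so the score comes out at about one half. Rounding before display hides the small floating-point error that the square roots introduce.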
$ python document_similarity.py
Amount of documents: 3
Document No. 1
Enter your document here:
Hello world
Document No. 1 added successfully
Document No. 2
Enter your document here:
Hello there
Document No. 2 added successfully
Document No. 3
Enter your document here:
Another document
Document No. 3 added successfully
The similarity between Document No: 1 and Document No: 2 is: 50.0 %
The similarity between Document No: 1 and Document No: 3 is: 0.0 %
The similarity between Document No: 2 and Document No: 3 is: 0.0 %
Enter a Number between 1 and 3
Find the top similar documents: 2
1 The 50.0% similarity, come from document No: 1 and Document No: 2
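
The pairwise scores in the session above can be approximated with a short standalone sketch. The names `term_frequencies`, `cosine`, and `rank_pairs` are hypothetical and not taken from the script. The session shows a single line for a request of 2; this sketch does not try to reproduce that behavior and simply returns the first `top_n` pairs, zero scores included.

```python
import math
import re
from collections import Counter
from itertools import combinations


def term_frequencies(text):
    """Count the words left after lowercasing and removing punctuation."""
    words = re.sub(r"[^a-z0-9\s]", "", text.lower()).split()
    return Counter(words)


def cosine(tf_a, tf_b):
    """Cosine similarity of two term-frequency Counters."""
    dot = sum(count * tf_b[term] for term, count in tf_a.items())
    norm = math.sqrt(sum(c * c for c in tf_a.values())) * math.sqrt(
        sum(c * c for c in tf_b.values())
    )
    return dot / norm if norm else 0.0


def rank_pairs(documents, top_n):
    """Return up to top_n (percent, doc_no, doc_no) tuples, highest score first."""
    vectors = [term_frequencies(doc) for doc in documents]
    scored = [
        (round(cosine(vectors[i], vectors[j]) * 100, 2), i + 1, j + 1)
        for i, j in combinations(range(len(vectors)), 2)
    ]
    # sort() is stable, so tied pairs stay in document order.
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_n]


docs = ["Hello world", "Hello there", "Another document"]
for percent, first, second in rank_pairs(docs, top_n=2):
    print(f"{percent}% similarity: document {first} and document {second}")
```

Using `itertools.combinations` visits each unordered pair once, so three documents yield three scores, matching the three similarity lines in the session.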
Contributions to this project are welcome. If you have ideas for improvements or notice any issues, please feel free to fork the repository and submit a pull request with your changes.
This project is licensed under the MIT License. You are permitted to use, modify, and distribute the software, provided that the copyright notice and license text are included in all copies or substantial portions of the software.