Document Similarity Analyzer

Overview

This Python program compares the text of a set of documents and computes a similarity score for every pair. It uses basic text processing and cosine similarity over term frequencies, and reports each score as a percentage.

Features

Input multiple documents interactively through the console.
Normalize and clean the text by converting it to lowercase and removing non-alphanumeric characters.
Calculate term frequencies for each document.
Determine cosine similarity scores between each pair of documents.
List the most similar document pairs, up to a count chosen by the user.

Requirements

Python 3.x
No external libraries are required. The program uses only the standard library modules collections, itertools, math, and re.

Installation

No installation is required. Just make sure Python 3.x is installed on your system.

Usage

Run the script in a Python environment, for example with python document_similarity.py.
Enter the number of documents you want to analyze when prompted.
Input each document one by one when prompted.
After all documents are entered, the program will automatically calculate and display the similarity scores.
Enter how many of the top-scoring pairs you want to see, and the program lists the most similar document pairs.

How It Works

Input Handling: The program first asks how many documents to compare, then prompts for each document in turn. Each document's text is cleaned to standardize it.
Text Preprocessing: Each document's text is converted to lowercase, and punctuation and other non-alphanumeric characters are removed. The text is then split into words to create a list of terms.
Similarity Calculation: The program counts how often each term occurs in each document, treats those term frequencies as a vector, and computes the cosine similarity between every pair of documents.
Output: Similarity scores are shown as percentages. The program can also list the document pairs with the highest scores, with the number of pairs chosen by the user.
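
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the script's actual source: the function names (tokenize, cosine_similarity, pairwise_similarities, top_pairs) are hypothetical, and the sketch assumes that whitespace is kept during cleaning so the text can be split into words, that only ASCII letters and digits count as alphanumeric, and that raw term counts are used without any weighting.

```python
import math
import re
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Lowercase the text, drop non-alphanumeric characters, split into words."""
    # Assumption: whitespace is preserved so that split() can separate words.
    cleaned = re.sub(r"[^a-z0-9\s]", "", text.lower())
    return cleaned.split()


def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors (Counters)."""
    # Only terms present in both documents contribute to the dot product.
    dot = sum(count * tf_b[term] for term, count in tf_a.items() if term in tf_b)
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        # An empty document has no direction, so treat it as sharing nothing.
        return 0.0
    return dot / (norm_a * norm_b)


def pairwise_similarities(documents):
    """Map each (i, j) document pair, numbered from 1, to its similarity."""
    tfs = [Counter(tokenize(doc)) for doc in documents]
    return {
        (i + 1, j + 1): cosine_similarity(tfs[i], tfs[j])
        for i, j in combinations(range(len(tfs)), 2)
    }


def top_pairs(scores, n):
    """Return the n highest-scoring pairs, most similar first."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:n]


if __name__ == "__main__":
    docs = ["Hello world", "Hello there", "Another document"]
    scores = pairwise_similarities(docs)
    for (a, b), score in scores.items():
        print(f"Document {a} vs Document {b}: {round(score * 100, 2)} %")
```

Counter keeps the term frequencies sparse, so the dot product only visits terms that actually occur, and itertools.combinations yields each unordered pair exactly once.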

Example

$ python document_similarity.py
Amount of documents: 3
Document No. 1
Enter your document here:
Hello world
Document No. 1 added successfully

Document No. 2
Enter your document here:
Hello there
Document No. 2 added successfully

Document No. 3
Enter your document here:
Another document
Document No. 3 added successfully

The similarity between Document No: 1 and Document No: 2 is: 50.0 %
The similarity between Document No: 1 and Document No: 3 is: 0.0 %
The similarity between Document No: 2 and Document No: 3 is: 0.0 %

Enter a Number between 1 and 3
Find the top similar documents: 2

1 The 50.0% similarity, come from document No: 1 and Document No: 2
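
The 50.0 % score in this run can be checked by hand, assuming the program computes cosine similarity over raw term counts with no extra weighting. Over the vocabulary (hello, world, there), Document 1 and Document 2 become the vectors a and b below:

```latex
\[
\mathbf{a} = (1, 1, 0), \qquad \mathbf{b} = (1, 0, 1)
\]
\[
\cos\theta
  = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
  = \frac{1 \cdot 1 + 1 \cdot 0 + 0 \cdot 1}{\sqrt{2} \, \sqrt{2}}
  = \frac{1}{2}
  = 50\,\%
\]
```

Document 3 shares no terms with the other two, so both of its dot products are zero and its scores are 0.0 %.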

Contributions

Contributions to this project are welcome. If you have ideas for improvements or notice any issues, please feel free to fork the repository and submit a pull request with your changes.

License

This project is licensed under the MIT License. You are permitted to use, modify, and distribute the software, provided that the copyright notice and license text are included in all copies or substantial portions of the software.
