Skip to content

Implementation of the tf-idf algorithm to calculate the similarity between documents using cosine similarity.

Notifications You must be signed in to change notification settings

abdo-essam/tf-idf-algorithm

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

tf-idf-algorithm

Implementation of the tf-idf algorithm to calculate the similarity between documents using cosine similarity.

Faculty Of Computer And Artificial Intelligence Cairo University FCAI-CU

Information Retrieval Assignment

The TfIdf class contains methods to calculate term frequency (tf) and inverse document frequency (idf). The tfCalculator method takes an array of all the words in a document and a term to check as input and returns the term frequency of the term in the document. The idfCalculator method takes a list of arrays, each containing all the words in a document and a term to check as input and returns the inverse document frequency score of the term across all the documents.

The CosineSimilarity class has a method cosineSimilarity to calculate the cosine similarity between two document vectors. It takes two document vectors as input, calculates the dot product, magnitude of each vector, and returns the cosine similarity score.

The DocumentParser class has methods to read files, tokenize the documents into terms, and create a term frequency-inverse document frequency (tf-idf) vector for each document. The parseFiles method reads all the files in a given folder and stores the terms of each document in an array. The tfIdfCalculator method uses the TfIdf class to calculate the tf-idf score for each term in each document and stores the document vectors in a list. The getCosineSimilarity method calculates the cosine similarity between all pairs of documents and prints the results.

Overall, this code can be used to calculate the similarity between a set of documents based on the words they contain. It is commonly used in information retrieval and text mining tasks.

About

Implementation of the tf-idf algorithm to calculate the similarity between documents using cosine similarity.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages