A set of classes for classifying and analyzing texts
The project covers two areas of text processing:
- preprocessing: removing punctuation and stemming words
- similarity: calculating the Jaccard similarity of texts, both exactly (the classical way) and approximately with the MinHash algorithm
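
The two steps above can be sketched in plain Python. This is a minimal illustration using only the standard library, not the project's own classes: the function names (`preprocess`, `jaccard`, `minhash_signature`, `estimate_jaccard`) are hypothetical, stemming is left out because it relies on NLTK, and the hand-rolled MinHash stands in for the one provided by datasketch.

```python
import hashlib
import string


def preprocess(text):
    """Lowercase the text, strip ASCII punctuation, and return the set of tokens."""
    table = str.maketrans("", "", string.punctuation)
    return set(text.lower().translate(table).split())


def jaccard(a, b):
    """Exact Jaccard similarity: |A intersect B| / |A union B|."""
    if not a and not b:
        return 1.0  # convention chosen for this sketch: two empty sets are identical
    return len(a & b) / len(a | b)


def token_hash(token, seed):
    """64-bit hash of a token; the seed is used as a salt to get a different hash function."""
    salt = seed.to_bytes(8, "big")
    digest = hashlib.blake2b(token.encode("utf8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(tokens, num_perm=128):
    """For each of num_perm hash functions, keep the minimum hash over all tokens.

    Assumes tokens is non-empty (min() of an empty sequence raises ValueError).
    """
    return [min(token_hash(t, seed) for t in tokens) for seed in range(num_perm)]


def estimate_jaccard(sig_a, sig_b):
    """Estimate the Jaccard similarity as the fraction of matching signature positions."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


a = preprocess("The quick brown fox jumps over the lazy dog.")
b = preprocess("The quick brown fox leaps over a lazy dog!")

exact = jaccard(a, b)  # 7 shared tokens out of 10 distinct tokens -> 0.7
approx = estimate_jaccard(minhash_signature(a), minhash_signature(b))
print(exact, approx)
```

The exact value needs both full token sets, while the MinHash estimate only compares two fixed-length signatures, which is why it is useful for large collections. A larger `num_perm` gives a more accurate estimate at the cost of more hashing.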
The project was created in Python 3.7.5. Main libraries:
- Natural Language Toolkit (NLTK)
- datasketch
- unittest