Backend: TF IDF
The tfidf backend implements a baseline algorithm for automated subject indexing. It counts the frequencies of terms (words) in documents about each subject, weights those frequencies with the TF-IDF algorithm so that rare words count for more than frequently occurring ones, and builds an index for matching the term frequencies of new documents against those of specific subjects. The implementation is based on the topic modelling library Gensim.
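The idea described above can be illustrated with a small, self-contained sketch. This is not Annif's actual implementation (which relies on Gensim); it uses only the Python standard library, and every name in it (`tokenize`, `train`, `suggest`, `TRAINING_DOCS`) is hypothetical. The smoothed IDF formula and the cosine-similarity matching are common choices assumed here for illustration, and the toy training data is made up.

```python
import math
import re
from collections import Counter, defaultdict

# Toy training data (invented for illustration): (document text, subjects).
TRAINING_DOCS = [
    ("cats and dogs are pets", ["animals"]),
    ("dogs chase cats", ["animals"]),
    ("stars and planets orbit in space", ["astronomy"]),
    ("telescopes observe distant stars", ["astronomy"]),
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def normalize(vector):
    """Scale a sparse term vector to unit (L2) length."""
    norm = math.sqrt(sum(w * w for w in vector.values()))
    if norm == 0:
        return {}
    return {term: w / norm for term, w in vector.items()}


def train(docs):
    """Build one TF-IDF vector per subject from the labelled documents."""
    # Pool the term counts of all documents about each subject.
    subject_tf = defaultdict(Counter)
    for text, subjects in docs:
        for subject in subjects:
            subject_tf[subject].update(tokenize(text))

    # Document frequency: in how many subject "documents" a term occurs.
    n_subjects = len(subject_tf)
    df = Counter()
    for counts in subject_tf.values():
        df.update(counts.keys())

    # Smoothed IDF (an assumed variant): rare terms get larger weights.
    idf = {t: math.log((1 + n_subjects) / (1 + d)) + 1.0 for t, d in df.items()}

    index = {
        subject: normalize({t: c * idf[t] for t, c in counts.items()})
        for subject, counts in subject_tf.items()
    }
    return idf, index


def suggest(text, idf, index, limit=10):
    """Rank subjects by cosine similarity to the new document."""
    counts = Counter(t for t in tokenize(text) if t in idf)
    query = normalize({t: c * idf[t] for t, c in counts.items()})
    scores = [
        (subject, sum(w * vec.get(t, 0.0) for t, w in query.items()))
        for subject, vec in index.items()
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:limit]


idf, index = train(TRAINING_DOCS)
for subject, score in suggest("planets and stars in the night sky", idf, index):
    print(f"{subject}\t{score:.3f}")
```

In this sketch the word "and" occurs under both subjects and therefore receives the lowest IDF weight, while "planets" and "stars" occur under only one subject and dominate the match.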
Getting started with the TF-IDF backend is easy because it doesn't require any algorithm-specific configuration.
See also the Annif tutorial exercise on the TF-IDF project.
Example configuration:
[tfidf-en]
name=TF-IDF English
language=en
backend=tfidf
analyzer=snowball(english)
limit=100
vocab=yso
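The configuration section above is plain INI syntax: a project identifier in square brackets followed by key=value settings. As a rough sketch, it can be inspected with Python's standard `configparser` module. This only illustrates the structure of the section; it makes no claim about how Annif itself loads its configuration, and the variable names (`CONFIG_TEXT`, `project`) are hypothetical.

```python
import configparser

# The example section from above, embedded as a string for illustration.
CONFIG_TEXT = """
[tfidf-en]
name=TF-IDF English
language=en
backend=tfidf
analyzer=snowball(english)
limit=100
vocab=yso
"""

parser = configparser.ConfigParser()
parser.read_string(CONFIG_TEXT)

# The section name is the project ID used on the command line (tfidf-en).
project = parser["tfidf-en"]
print(project["backend"])       # backend identifier
print(project["analyzer"])      # analyzer specification
print(project.getint("limit"))  # numeric setting parsed as an integer
```

Note that none of the keys is specific to the TF-IDF algorithm apart from the `backend` value itself, which matches the statement that no algorithm-specific configuration is needed.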
Load a vocabulary:
annif load-vocab yso /path/to/Annif-corpora/vocab/yso-skos.ttl
Train the model:
annif train tfidf-en /path/to/Annif-corpora/training/yso-finna-en.tsv.gz
Test the model with a single document:
cat document.txt | annif suggest tfidf-en
Evaluate a directory full of files in fulltext document corpus format:
annif eval tfidf-en /path/to/documents/