Releases: NatLibFi/Annif
Annif 0.57
Training of NN ensemble models can now be performed in parallel (running suggest operations simultaneously for all source projects) on multiple CPUs; this is controlled by using the --jobs parameter of the train command. The compatibility of Annif with DVC is improved by supporting TOML file format for configuring Annif projects. The --force option is added to the loadvoc command that can be used to replace an existing vocabulary instead of updating it. This release includes many small maintenance tasks for the CI/CD pipeline, e.g. migrating Docker image builds to GitHub Actions from the Drone platform.
Omikuji, TensorFlow and Connexion dependencies are upgraded to the latest available versions; retraining of projects should not be necessary.
New features:
#526/#567 Add --force option to loadvoc CLI command
Improvements:
#429/#568 Perform suggest operations in parallel using multiprocessing in nn_ensemble
#547/#560 Support TOML as a configuration file format alongside CFG/INI for DVC compatibility
Maintenance:
#570 Use fulltext corpus in MLLM tests which is much faster
#571 Docker builds on GitHub Actions CI/CD
#572 Update Dockerfile v0.57
#573 Ensure setuptools and wheel are installed & up-to-date for tests in GitHub Actions CI
#574 Avoid running duplicated tests on PRs in GitHub Actions CI
#575 Resolve some Warnings by tests
#576 Enable pip cache in GitHub Actions CI
#577 Improved Project links in PyPI page
#578 Update dependencies v0.57
#581/#582 Add tags trigger to GH Actions CI/CD workflow
Annif 0.56
This release introduces a new spaCy analyzer and takes care of many maintenance tasks. The CLI usage is improved by shortening the startup time of some commands, the Docker images are now easier to customize, there are improvements to the eval command, and minor bugs are fixed.
The spaCy analyzer enables support for some new languages and can improve subject suggestion results. The spaCy analyzer and the language-specific models need to be installed separately. The Docker image distributed via quay.io includes the spaCy analyzer and the English language model, but no other languages.
The maintenance tasks include upgrading many dependencies, notably Omikuji to v0.4. The Omikuji upgrade brings faster training and predictions as well as reduced memory usage, but the Annif projects using the omikuji backend need to be retrained. The projects using other backends should not require retraining, although warnings may be shown in some cases.
Support for Python 3.6 is removed, as required by the dependency upgrades.
This release also removes the Maui and vw_multi (Vowpal Wabbit) backends.
New features:
#374/#527/#563 spaCy analyzer
Improvements:
#514/#544 Optimize startup time using local & lazy imports
#548 Allow selecting installed optional dependencies in Docker build
#545/#558 Select metrics for eval command using an option
#546/#557 Output eval metrics as a JSON file compatible with DVC
Bug fixes:
#552/#554 LMDB can overflow (credit: @mo-fu)
#562 Add missing import of annif.eval in MLLM backend
Maintenance:
#549 Update dependencies for v0.56
#550 Drop Python 3.6 support
#541 Remove Maui and Vowpal Wabbit multi backends
#551 Remove swagger-tester dependency
#542/#555 Add CITATION.cff file
#553 Update Scrutinizer config
#561 Set a 10 minute timeout for GitHub Actions CI jobs
#565 Avoid coverage 6.3 as it causes some tests to hang
Annif 0.55
This release includes a new language filtering feature. This input transform filters out sentences of the input text whose language differs from the project language. Language detection is performed with Compact Language Detector v3 via pycld3. pycld3 is an optional dependency of Annif; see the installation page.
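The idea behind the language filter can be sketched roughly as follows. This is a minimal illustration, not Annif's implementation: the function names, the naive sentence splitting, and the choice to keep sentences with no reliable language guess are all assumptions. The detector is passed in as a callable so the example runs without pycld3; toy_detect is a stand-in that a real setup would replace with a pycld3-based function.

```python
import re
from typing import Callable, Optional


def filter_by_language(
    text: str,
    project_language: str,
    detect: Callable[[str], Optional[str]],
) -> str:
    """Keep only sentences whose detected language matches the project language.

    Sentences for which the detector returns None (no reliable guess)
    are kept; this is an assumption of the sketch.
    """
    # Naive sentence splitting on ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept = [s for s in sentences if s and detect(s) in (None, project_language)]
    return " ".join(kept)


def toy_detect(sentence: str) -> Optional[str]:
    """Stand-in detector based on a few marker words (illustration only)."""
    words = set(re.findall(r"\w+", sentence.lower()))
    if words & {"the", "is", "and"}:
        return "en"
    if words & {"ja", "tämä", "ovat"}:
        return "fi"
    return None


text = "The library is open. Tämä on suomenkielinen lause. Cataloguing and indexing."
print(filter_by_language(text, "en", toy_detect))
```

With the project language set to "en", the Finnish sentence is dropped and the two English sentences are passed on.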
Minor bug fixes and dependency updates are also included.
The Maui and vw_multi (Vowpal Wabbit) backends have been marked as deprecated in this release and they will be removed in the next release, 0.56. The removal is motivated by making the codebase more compact and thus easier to maintain. The MLLM and nn_ensemble backends offer functionality similar to Maui and vw_multi.
Note that the notes for the previous release (Annif 0.54) initially omitted the added support for the input-transform feature.
New features:
#464/#507 Language filtering in input text
Improvements:
#536 Allow rdflib version 6.*
Bug fixes:
#533/#534 Adjust flask and click versions to avoid dependency mismatches
Maintenance:
#530 Add deprecation warning to Maui & vw_multi train commands
#492/#529 Update Docker base image to Debian Bullseye to upgrade Voikko library
Annif 0.54.1
This is a patch release that fixes bugs found after the 0.54.0 release. In particular, installation using pip was not working correctly due to a missing dependency on the dateutil package.
Bugs fixed:
#523 Make Drone builds start on all git tag events
#524 Add MLLM classifier sanity check
#525 Much faster updating of existing large vocabulary
#528 Declare dateutil dependency
Annif 0.54
This release adds a new --jobs parameter for the annif train command, which allows easy control of the number of threads/CPUs when training MLLM, fasttext and Omikuji backends. Many other improvements are included that speed up the MLLM backend, especially in the case of a large vocabulary. Also a few minor bugs have been fixed.
Edit: Also introduces support for adding new text-input transformation operations to Annif. Previously the input-limiting feature was implemented as a backend mechanism (#446, #452), which was set up in a project configuration e.g. with a setting input_limit=5000; now the input-limiting feature is implemented as a more general input-text transform and it can be set up in the project configuration with transform=limit(5000).
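To make the transform=limit(5000) setting more concrete, here is a minimal sketch of how such a specification could be parsed and applied. The registry, the parse_transform helper and the assumption that the limit counts characters are all illustrative; they are not taken from Annif's code.

```python
import re
from typing import Callable, Dict

# Hypothetical registry mapping transform names to factories that
# build a text -> text function from the arguments in the spec.
TRANSFORMS: Dict[str, Callable[..., Callable[[str], str]]] = {
    # Assumption: the limit is a number of characters.
    "limit": lambda n: (lambda text: text[: int(n)]),
}


def parse_transform(spec: str) -> Callable[[str], str]:
    """Turn a spec such as 'limit(5000)' into a text -> text function."""
    match = re.fullmatch(r"(\w+)\((.*)\)", spec.strip())
    if match is None:
        raise ValueError(f"Malformed transform spec: {spec!r}")
    name, raw_args = match.groups()
    if name not in TRANSFORMS:
        raise ValueError(f"Unknown transform: {name!r}")
    args = [a.strip() for a in raw_args.split(",") if a.strip()]
    return TRANSFORMS[name](*args)


limit_5000 = parse_transform("limit(5000)")
print(len(limit_5000("x" * 12000)))  # long input is truncated
print(limit_5000("short text"))      # short input passes through unchanged
```

Treating input limiting as one named transform among others means further operations can be registered the same way without touching the backends.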
New features:
#512 Support jobs parameter in train command
Edit: #496 Support for adding input-transformation operations
Improvements:
#500 Implement custom MeanLayer in nn_ensemble
#511/#483 Process training docs in parallel in MLLM backend
#513/#519 Keep serialized dump of SKOS graph to save parsing time
#518 Use least frequent token as key in TokenSetIndex used by MLLM
#520 Optimize limit_mask creation
Bug fixes:
#510/#502 Use set as container of uris instead of list in DocumentFile
#515/#453 Allow NN ensemble to be used for parallel eval
#517 Skip unimportant subjects in _vector_to_list_suggestion
#522/#521 Allow private projects to be accessed from CLI
Annif 0.53.2
Annif 0.53.1
This patch release fixes a bug which prevented training the SVC backend on fulltext corpus.
Annif 0.53
This release adds two new backends, YAKE and SVC. The YAKE backend is a wrapper around the YAKE library, which performs lexical unsupervised keyword extraction. There is no need for training data. See the YAKE wiki page for more information. In future Annif releases, it may be possible to extend YAKE support so that it can be used to suggest new terms for a vocabulary (the keywords that are not found in the vocabulary).
The SVC backend implements Linear Support Vector Classification. It is well suited for multiclass (but not multilabel) classification, for example classifying documents with the Dewey Decimal Classification or the 20 Newsgroups classification. It requires relatively little training data, and is suitable for classifications of up to around 10,000 classes. See the SVC wiki page for more information.
This release also upgrades many dependencies, which enables all Annif backends to run on Python 3.9 (previously the nn_ensemble backend was available only for 3.6-3.8). The Docker image now uses Python 3.8 instead of 3.7.
Note that nn_ensemble models are not compatible across Python versions: e.g. a model trained on Python 3.7 can be used only on Python 3.7. Training the nn_ensemble models shows a CustomMaskWarning, but it is harmless (caused by a TensorFlow bug) and can be ignored.
Due to the update of scikit-learn, using TFIDF, MLLM or Omikuji models trained on older Annif versions will show warnings about the TfidfVectorizer. To the best of our knowledge, these are harmless and can be ignored. You have to retrain the models to get rid of the warnings.
This release also includes many minor improvements and bug fixes.
New features:
#486 New SVC (support vector classification) backend using scikit-learn
#439/#461 YAKE backend
#490/#494 Make --version option show Annif version
Improvements:
#488 Add support for ngram setting in omikuji backend
Maintenance:
#499 Update dependencies v0.53
#487 Upgrade scikit-learn to 0.24.2
#498 Update Dockerfile
Bug fixes:
#484/#495 Show error when training MLLM on empty corpus
#489 Add Codecov Action to GH workflow for uploading reports
#491 Raise NotSupportedException for attempt to train YAKE
#497 Remove execute permissions of some files
Annif 0.52
This release includes a new MLLM backend, which is a Python implementation of the Maui-like Lexical Matching algorithm. It was inspired by the Maui algorithm (by Alyona Medelyan), but is not a direct reimplementation. It is meant for long full-text documents and, like Maui, it needs to be trained with a relatively small number (hundreds or thousands) of manually indexed documents so that the algorithm can choose the right mix of heuristics that achieves the best results on a particular document collection. See the MLLM Wiki page for more information.
New features include the possibility to configure two project parameters:
min_token_length can be set in the analyzer parameters; e.g. setting the value to 2 allows the word "UK" to pass to a backend, while with the default value (3) the word is filtered out by the analyzer
lr can be set in the neural-network ensemble project configuration to define the learning rate.
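The effect of the minimum token length can be illustrated with a small sketch. The tokenize function below is hypothetical and much simpler than a real analyzer (no stemming or lemmatization); it only shows how a length threshold decides whether a short word such as "UK" reaches the backend.

```python
import re
from typing import List


def tokenize(text: str, min_token_length: int = 3) -> List[str]:
    """Split text into word tokens and drop those shorter than the minimum."""
    words = re.findall(r"\w+", text)
    return [w for w in words if len(w) >= min_token_length]


print(tokenize("UK libraries"))                       # default threshold of 3
print(tokenize("UK libraries", min_token_length=2))   # lowered threshold
```

With the default threshold only "libraries" survives; lowering the threshold to 2 keeps "UK" as well.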
The STWFSA backend has been updated to use a newer version of the stwfsapy library. Old STWFSA models are not compatible with the new version, so any STWFSA projects must be retrained. The release also includes several minor improvements and bug fixes.
New features:
#462 New lexical backend MLLM
#456/#468 Allow configuration of token min length (credit: mo-fu)
#475 Allow configuration of nn ensemble learning rate (credit: mo-fu)
Improvements:
#478/#479 Update stwfsa to 0.2.* (credit: mo-fu)
#472 Cleanup suggestion tests
#480 Optimize check for deprecated subject IDs using a set
Maintenance:
#474 Use GitHub Actions as CI service
Bug fixes:
#470/#471 Make sure suggestion scores are in the range 0.0-1.0
#477 Optimize the optimize command
#481 Backwards compatibility fix for the token_min_length setting
#482 MLLM fix: don't include use_hidden_labels in hyperopt, it won't have any effect
Annif 0.51
This release includes a new STWFSA backend which is a wrapper around STWFSAPY, a lexical algorithm based on finite state automata. It achieves best results with short texts, i.e., titles and author keywords, and is best suited for English language data.
The NN ensemble backend has been improved with better handling of source weights. Retraining NN ensemble models after updating Annif to this version is recommended, since the quality of results can decrease if old models are used. A new option for several CLI commands has been added: the --docs-limit/-d option can be used to limit the number of documents to process, for example to create learning-curve data. Also several bugs have been fixed.
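The document limit can be thought of as truncating the corpus iterator after a fixed number of documents. The sketch below shows that idea with itertools.islice; the limit_documents helper and the generated corpus are hypothetical and only illustrate how increasing limits yield the data points of a learning curve.

```python
from itertools import islice
from typing import Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")


def limit_documents(
    documents: Iterable[T], docs_limit: Optional[int] = None
) -> Iterator[T]:
    """Yield at most docs_limit documents; None means no limit."""
    if docs_limit is None:
        return iter(documents)
    return islice(documents, docs_limit)


def make_corpus() -> Iterator[str]:
    """Hypothetical corpus of 1000 documents, generated lazily."""
    return (f"document {i}" for i in range(1000))


# Process progressively larger subsets of the same corpus.
sizes = [len(list(limit_documents(make_corpus(), n))) for n in (10, 100, None)]
print(sizes)
```

Because the truncation is lazy, documents beyond the limit are never read, which keeps small runs fast even on a large corpus.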
New features:
#438 Lexical STWFSAPY Backend (credit @mo-fu)
#465 Limit document number CLI option
Improvements:
#457/#458 Improved handling of source weights in NN ensemble
Bug fixes:
#454/#455 Address SonarCloud complaints
#459/#460 Pass limit parameter to Maui Server during train
#463 Fix TruncatingCorpus iterator