
dev to master #701

Merged
merged 10 commits into from
Sep 26, 2024
78 changes: 78 additions & 0 deletions docs/customize/vocabularies/names.md
@@ -120,3 +120,81 @@ invenio vocabularies import \
--vocabulary names \
--origin /path/to/ORCID_2021_10_summaries.tar.gz
```

### Using ORCiD Public Data Sync

*Introduced in InvenioRDM v13*

#### Installing Required Dependencies

First, install the required `s3fs` extra by adding the following to the `Pipfile` in your instance:

```toml
[packages]
...
invenio-vocabularies = {extras = ["s3fs"]}
...
```

#### Configuring ORCiD Public Data Sync

InvenioRDM supports loading names using the ORCiD Public Data Sync. To set this up, you need to create a definition file named `names-orcid.yaml` with the following content:

```yaml
names:
readers:
- type: orcid-data-sync
- type: xml
transformers:
- type: orcid
writers:
- type: async
args:
writer:
type: names-service
batch_size: 1000
write_many: true
```
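The definition file above wires readers, a transformer, and a writer into a datastream. As a rough illustration of that pattern (the class and method names below are hypothetical, not the actual invenio-vocabularies implementation), the flow can be sketched as:

```python
# Illustrative sketch of the reader -> transformer -> writer datastream that
# the YAML above configures. All names here are hypothetical stand-ins, not
# the real invenio-vocabularies classes.
class Reader:
    """Yields parsed entries from a source (XML parsing elided)."""
    def read(self, source):
        for given_names, family_name in source:
            yield {"given_names": given_names, "family_name": family_name}

class Transformer:
    """Maps a parsed ORCID entry onto the names vocabulary shape."""
    def transform(self, entry):
        return {"name": f"{entry['family_name']}, {entry['given_names']}"}

class Writer:
    """Collects transformed entries in batches (stands in for names-service)."""
    def __init__(self, batch_size=1000):
        self.batch_size = batch_size
        self.written = []
    def write_many(self, entries):
        self.written.extend(entries)

def run(reader, transformer, writer, source):
    writer.write_many(transformer.transform(e) for e in reader.read(source))

run(Reader(), Transformer(), writer := Writer(), [("Josiah", "Carberry")])
# writer.written == [{"name": "Carberry, Josiah"}]
```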

#### Customizing the Sync Interval

Optionally, you can set the sync interval for the `orcid-data-sync` reader by passing arguments. If not specified, the default sync interval is one day. The supported arguments for defining the interval are:

- `years`
- `months`
- `weeks`
- `days`
- `hours`
- `minutes`
- `seconds`
- `microseconds`

Here is an example of how to set a custom sync interval of 10 days:

```yaml
names:
readers:
- type: orcid-data-sync
args:
since:
days: 10
- type: xml
transformers:
- type: orcid
writers:
- type: async
args:
writer:
type: names-service
batch_size: 1000
write_many: true
```

#### Running the Import Command

To run an import using the `names-orcid.yaml` file, use the `vocabularies import` command as shown below:

```shell
invenio vocabularies import \
--vocabulary names \
--filepath ./names-orcid.yaml
```
102 changes: 102 additions & 0 deletions docs/reference/search.md
@@ -0,0 +1,102 @@
# Searching in InvenioRDM
_Introduced in InvenioRDM v13_

## InvenioRDM Suggest API

The suggest API endpoint (`/api/{resource}?suggest={search_input}`) provides an interface for real-time search suggestions. It leverages OpenSearch's `multi_match` query to search across multiple fields within a specified index, returning relevant suggestions based on user input.

### Endpoint Structure

**URL:** `/api/{resource}?suggest={search_input}`
**Method:** GET

Each index in InvenioRDM can have its own configuration to customize how the suggest API behaves. This includes defining which fields are searchable and other settings provided by the `multi_match` query API.
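Conceptually, a suggest request translates the user's input into an OpenSearch `multi_match` query body. The sketch below builds such a body by hand; the exact options InvenioRDM sets (such as the match `type` or per-index field lists) are configuration-dependent, so this is only an illustration:

```python
def build_suggest_query(search_input, fields):
    """Build a multi_match query body of the kind a suggest request issues.

    Entries in `fields` may carry boost factors, e.g. "name^80".
    """
    return {
        "query": {
            "multi_match": {
                "query": search_input,
                "fields": fields,
            }
        }
    }

body = build_suggest_query("cer", ["name^80", "acronym^40", "title.*^20"])
```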

## How to use the suggest API

InvenioRDM's suggest API provides search suggestions by issuing a `multi_match` query. It can be configured per index using the `SuggestQueryParser` class, which can be imported from the `invenio-records-resources` module. The fields are analyzed at index time using custom analyzers, which apply filters such as `asciifolding` for accent-insensitive search and `edge_ngram` for prefix search.

Check the [official documentation](https://opensearch.org/docs/2.0/opensearch/ux/) and the [reference](#reference) section below for more context on the `edge_ngram` filter and custom analyzers.

### When to Use the Suggest API

- **Typo Tolerance & Auto-completion:** Helps correct typos (using `fuzziness` in the search-time analysis) and completes partial inputs.
- **Large, Diverse Datasets:** Useful for datasets with a wide variety of terms, like names or titles.
- **Pre-query Optimization:** Reduces unnecessary searches by suggesting relevant terms.

### When Not to Use the Suggest API

- **Small or Specific Datasets:** Less beneficial for well-defined datasets.
- **Performance Constraints:** Because the suggest API generates a large number of tokens through the `edge_ngram` filter, it is important to monitor how it affects index size.
- A reasonable trade-off might involve an index size increase of up to 20-30% if it significantly improves search speed and relevance.
- A 10-20% improvement in response times might justify a moderate increase in index size.

For more information check the [official documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-disk-usage.html).

## Key Considerations for Customizing Index Mappings

### Size

- **Field Type Selection:** Use lightweight field types (e.g., `keyword` over `text` where appropriate) to minimize index size.
- **Custom analyzers and filters:** Use them as sparingly as possible to prevent index bloat.

### Speed

- **Search Performance:** Keeping size in mind, apply custom analyzers that include the `edge_ngram` filter to provide quick suggestions, and optimize frequently queried fields to enhance search speed.
- **Analyzer and filter selection:** Configure them only when necessary to improve search time.

## Fine-tuning the search

Boosting affects the relevance score of documents. A higher boost value means a stronger influence on the search ranking. Determine which fields are most critical for your search relevance (e.g., titles, authors, keywords).

- **Relevance Adjustment:** Boost a field using the caret operator **(^)** followed by a number. For example:
    * `name^100` boosts the `name` field by a factor of 100.
    * An asterisk **(\*)** applies the boost to all subfields, e.g. `i18n_titles.*^50`.

- **Balance and Tuning:** Use boosting judiciously to avoid skewing results too heavily towards particular fields. Assign boost factors based on the importance of each field. Higher values increase the influence of matches in that field.

When multiple fields are searched, tuning is essential to ensure that the most relevant results are returned first. Taking the affiliations index as an example, the key fields are `name`, `acronym`, and `title.{subfields}`. Since affiliations are usually searched by name, `name` is given more weight to boost its relevance.

```
"name^80", "acronym^40", "title.*^20"
```
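A boosted field list like the one above is typically wired into an index's search options through the `SuggestQueryParser` factory. The following is a configuration sketch only; verify the import path and the surrounding search-options class against your `invenio-records-resources` version:

```python
# Configuration sketch: assign this as `suggest_parser_cls` on the service's
# SearchOptions class. The import path matches recent invenio-records-resources
# releases, but check it against the version your instance uses.
from invenio_records_resources.services.records.queryparser import SuggestQueryParser

suggest_parser_cls = SuggestQueryParser.factory(
    fields=["name^80", "acronym^40", "title.*^20"],
)
```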

## Reference

### Analyzers

Analyzers allow searches to match documents that are not exact matches: different letter cases, missing accents, partial words, misspellings, and so on. Fundamentally, every analyzer must contain exactly one tokenizer, which splits the search input into parts. Additionally, an analyzer may have one or more character filters and/or token filters.

- A **character filter** is applied first and takes the entire input and adds, removes or changes any characters depending on our needs.
- The [**tokenizer**](https://opensearch.org/docs/latest/analyzers/tokenizers/index/) then splits the input into parts (words).
- Finally, the [**token filter**](https://opensearch.org/docs/latest/analyzers/token-filters/index/) works much like a character filter, but is applied to each token ("word") produced by the tokenizer.

Read more about analyzers on [the OpenSearch official docs](https://opensearch.org/docs/latest/analyzers/).

Analyzers can be applied both to the search input and when the document is indexed. In most cases, we want to apply the same analyzer to the search input and during indexing so that there is no unexpected behaviour.

- [**Normalizers**](https://opensearch.org/docs/latest/analyzers/normalizers/) — Simpler and mainly used to improve the matching of keyword search. The `keyword` type is the simplest way in which data can be stored and by default works as an exact match search. Using a normalizer you can add, remove and alter the input into exactly one other token which is stored and searched for.

### Character filters

Character filters take the stream of characters before tokenization and can add, remove or replace characters according to the rules and type of filter. There are three types: the mapping filter, the pattern replace filter, and the HTML stripping filter.

For our indices in InvenioRDM, we are currently using a custom pattern replace filter that uses a regex to remove special characters:

```
"char_filter": {
"strip_special_chars": {
"type": "pattern_replace",
"pattern": "[\\p{Punct}\\p{S}]",
"replacement": ""
}
}
```
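As a rough illustration of what this filter does: Python's `re` module does not support the Java-style `\p{Punct}`/`\p{S}` classes, so the sketch below uses `string.punctuation` as an ASCII approximation, which is not the exact OpenSearch behaviour:

```python
import string

def strip_special_chars(text):
    """ASCII approximation of the pattern_replace char_filter above:
    drop punctuation characters before tokenization."""
    return text.translate(str.maketrans("", "", string.punctuation))

strip_special_chars("C.E.R.N.!")  # -> "CERN"
```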

### Tokenizers and token filters

We are using the following tokenizers and token filters in some of our indices in InvenioRDM:

- **[ngram and edge_ngram](https://opensearch.org/docs/latest/analyzers/tokenizers/index/#partial-word-tokenizers)** — Both are used for matching parts of words. N-gram creates n-sized chunks ("car" with ngram(1,2) -> "c", "ca", "a", "ar", "r") and edge_ngram creates chunks from the beginning of the word ("dog" with edge_ngram(1,3) -> "d", "do", "dog"). Edge n-gram enables prefix searching and is preferred because it produces fewer tokens. It is also recommended to use these as token filters so that they produce tokens within each word rather than across word boundaries.
- **[uax_url_email](https://opensearch.org/docs/latest/analyzers/tokenizers/index/#word-tokenizers)** — If searches and/or documents are likely to contain URLs or emails, it is better to use this tokenizer. With the standard tokenizer, a URL/email is split on its special characters, which can produce unexpected behaviour (for example, searching for tim@apple.com would also return documents that merely contain apple).
- **[asciifolding](https://opensearch.org/docs/latest/analyzers/token-filters/index/)** — Allows characters to match with many different representations, especially relevant for non-English languages. For example ä -> a, Å -> A, etc.
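Putting these pieces together, a custom autocomplete analyzer might combine a standard tokenizer with `lowercase`, `asciifolding`, and an `edge_ngram` token filter. The fragment below is illustrative only; the names and gram sizes are examples, not the exact InvenioRDM mappings:

```
"analysis": {
  "filter": {
    "autocomplete_filter": {
      "type": "edge_ngram",
      "min_gram": 1,
      "max_gram": 20
    }
  },
  "analyzer": {
    "autocomplete": {
      "type": "custom",
      "tokenizer": "standard",
      "filter": ["lowercase", "asciifolding", "autocomplete_filter"]
    }
  }
}
```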
160 changes: 160 additions & 0 deletions docs/releases/upgrading/upgrade-v13.0.md
@@ -0,0 +1,160 @@
# Upgrading from v12 to v13.0

!!! warning "THIS RECIPE IS A WORK IN PROGRESS"

## Prerequisites

The steps listed in this article require an existing local installation of InvenioRDM v12.

!!! warning "Backup"

Always backup your database and files before you try to perform an upgrade.

!!! info "Older Versions"

In case you have an InvenioRDM installation older than v12, you can gradually upgrade
to v12 and afterwards continue from here.

## Upgrade Steps

Make sure you have the latest `invenio-cli` installed. For InvenioRDM v13, it
should be v1.5.0 or newer.

```bash
$ invenio-cli --version
invenio-cli, version 1.5.0
```

!!! info "Virtual environments"

In case you are not inside a virtual environment, make sure that you prefix each `invenio`
command with `pipenv run`.

**Local development**

Changing the Python version in your development environment depends highly
on your setup, so we won't cover it here.
One way would be to use [PyEnv](https://github.com/pyenv/pyenv).

!!! warning "Risk of losing data"

    Your virtual environment folder, a.k.a. the `venv` folder, may contain uploaded files. If you kept the default
location, it is in `<venv folder>/var/instance/data`. If you need to keep those files,
make sure you copy them over to the new `venv` folder in the same location.
The command `invenio files location list` shows the file upload location.

If you upgraded your Python version, you should recreate your virtual environment before
running the `invenio-cli` or `pipenv` commands below.


### Upgrade InvenioRDM

Python 3.9, 3.11, or 3.12 is required to run InvenioRDM v13.

There are two options to upgrade your system:

#### Upgrade option 1: In-place

This approach upgrades the dependencies in place. Your virtual environment for the
v12 version will be gone afterwards.

```bash
cd <my-site>

# Upgrade to InvenioRDM v13
invenio-cli packages update 13.0.0

# Re-build assets
invenio-cli assets build
```

#### Upgrade option 2: New virtual environment

This approach creates a new virtual environment and leaves the v12 one as-is.
If you are using a Docker image on your production instance, this is the
option to choose.

##### Step 1
- Create a new virtual environment.
- Activate your new virtual environment.
- Install `invenio-cli` with `pip install invenio-cli`.
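
For example (the path and Python version below are placeholders; adapt them to your setup):

```shell
# Create and activate a fresh virtual environment, then install invenio-cli.
# The directory and interpreter are examples only.
python3 -m venv ~/.virtualenvs/my-site-v13
source ~/.virtualenvs/my-site-v13/bin/activate
pip install invenio-cli
```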

##### Step 2
Update the file `<my-site>/Pipfile`.

```diff
[packages]
-invenio-app-rdm = {extras = [...], version = "~=12.0.0"}
+invenio-app-rdm = {extras = [...], version = "~=13.0.0"}
```

##### Step 3
Update the `Pipfile.lock` file:

```bash
invenio-cli packages lock
```

##### Step 4
Install InvenioRDM v13:

```bash
invenio-cli install
```

### Database migration

Execute the database migration:

```bash
invenio alembic upgrade
```

### Data migration


Execute the data migration:

### TODO


### Rebuild search indices

```bash
invenio index destroy --yes-i-know
invenio index init
invenio rdm rebuild-all-indices
```

From v12 onwards, record statistics are stored in search indices rather than the
database. These indices are created through *index templates* rather than being
registered directly in `Invenio-Search`. As such, the search indices for
statistics are not affected by `invenio index destroy --yes-i-know` and remain
fully functional after the rebuild step.

### New roles

### TODO

### New configuration variables

```python
from invenio_app_rdm import __version__
ADMINISTRATION_DISPLAY_VERSIONS = [
("invenio-app-rdm", f"v{__version__}"),
("{{ cookiecutter.project_shortname }}", "v1.0.0"),
]
```

## Big Changes

- feature: invenio jobs module, periodic tasks administration panel
- feature: invenio vocabularies entries deprecation
- improvement: search mappings and analyzers to improve performance

### TODO

## OPEN PROBLEMS


### TODO
3 changes: 3 additions & 0 deletions docs/releases/version-v13.0.0.md
@@ -0,0 +1,3 @@
# InvenioRDM v13.0

Draft
2 changes: 1 addition & 1 deletion docs/releases/versions/version-v10.0.0.md
@@ -18,7 +18,7 @@ In addition to the many bugs fixed, this release introduces custom fields both f

### Custom Fields

You can now add custom fields to [bibliographic records](https://inveniordm.docs.cern.ch/customize/metadata/custom_fields/records/) and [communities](https://inveniordm.docs.cern.ch/customize/metadata/custom_fields/communities/) data models. InvenioRDM supports a wide variety of field types and UI widgets: you can find the full list in the [custom fields](https://inveniordm.docs.cern.ch/customize/custom_fields/records/#reference) and the [UI widgets](https://inveniordm.docs.cern.ch/reference/widgets/) documentation pages.
You can now add custom fields to [bibliographic records](../../customize/metadata/custom_fields/records.md) and [communities](../../customize/metadata/custom_fields/communities.md) data models. InvenioRDM supports a wide variety of field types and UI widgets: you can find the full list in the [custom fields](../../customize/metadata/custom_fields/records.md#reference) and the [UI widgets](../../reference/custom_fields/widgets.md) documentation pages.

You can also extend the default components or implement your owns. To get more information, refer to the [custom fields development section](../../develop/howtos/custom_fields.md) in the documentation.
