- Load different language models
- Load text data from various sources
- For instance:
- (NL) Delpher, ANP, StatenGeneraal
- (EN) COHA
- (Optional) implement text indexing with ElasticSearch for quick lookup of words
- Extract contextualized embeddings for specific words, tokens, or phrases
- (Interactive) embeddings visualization
- Measure semantic shifts across configurable metadata dimensions, including
- time period
- region
- source (e.g. different newspapers)
- Extensive configuration (sketched after this list), for instance:
- which layer(s) to use
- how to measure distances
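The configuration options above are not tied to any particular format; as a rough sketch, they could be grouped in a small Python dataclass. The field names and defaults below are illustrative, not an existing API:

```python
from dataclasses import dataclass


@dataclass
class ShiftConfig:
    """Illustrative grouping of the configurable options; names are placeholders."""
    model_name: str = "bert-base-multilingual-cased"  # any HuggingFace model id
    layers: tuple = (-4, -3, -2, -1)      # which hidden layer(s) to combine
    distance_metric: str = "cosine"       # how to measure distances between embeddings
    slice_by: tuple = ("time_period",)    # metadata dimensions to slice the corpus on
```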
- Load language model
- use the HuggingFace `transformers` library to download models (example below)
- allow for authentication to access private models
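A minimal sketch of the model-loading step with the `transformers` library; the model id is only an example, and the `token` argument for private models is called `use_auth_token` in older library versions:

```python
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-multilingual-cased"  # example model id

# Public models need no authentication.
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)

# For private models, pass a HuggingFace access token
# (e.g. read from the HF_TOKEN environment variable):
# tokenizer = AutoTokenizer.from_pretrained("org/private-model", token="hf_...")
# model = AutoModel.from_pretrained("org/private-model", token="hf_...")
```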
- Load text data (a reader sketch follows this list)
- object oriented design with abstract classes to be implemented for custom formats
- provide methods for (memory-efficient) loading, text search, lazy evaluation
- implement corpus readers for plain text, CSV
- implement corpus slicing on metadata dimensions (e.g. time period, region, source)
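One possible shape for the reader hierarchy described above, assuming documents are exchanged as plain dicts with a `text` field plus metadata; the class and method names are illustrative:

```python
from abc import ABC, abstractmethod
from csv import DictReader
from pathlib import Path
from typing import Iterator


class CorpusReader(ABC):
    """Abstract base class; subclasses lazily yield documents as dicts."""

    @abstractmethod
    def documents(self) -> Iterator[dict]:
        """Yield documents one at a time as {'text': ..., **metadata} dicts."""

    def slice(self, **criteria) -> Iterator[dict]:
        """Keep documents whose metadata matches, e.g. reader.slice(region="NL")."""
        return (doc for doc in self.documents()
                if all(doc.get(key) == value for key, value in criteria.items()))


class PlainTextReader(CorpusReader):
    """Reads every .txt file in a directory; the file name doubles as metadata."""

    def __init__(self, directory: str):
        self.directory = Path(directory)

    def documents(self) -> Iterator[dict]:
        for path in sorted(self.directory.glob("*.txt")):
            yield {"text": path.read_text(encoding="utf-8"), "source": path.stem}


class CsvReader(CorpusReader):
    """Reads one document per row; all other columns are kept as metadata."""

    def __init__(self, path: str, text_column: str = "text"):
        self.path, self.text_column = path, text_column

    def documents(self) -> Iterator[dict]:
        with open(self.path, newline="", encoding="utf-8") as handle:
            for row in DictReader(handle):
                yield {**row, "text": row[self.text_column]}
```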
- Corpus statistics (see the example after this list)
- based on corpus readers
- statistical analyses:
- document properties (e.g. lengths)
- document distribution per metadata dimension
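Building on such a reader, the statistics listed above could be computed in a single pass; the metadata dimension name is a placeholder that depends on the corpus:

```python
from collections import Counter
from statistics import mean, median


def corpus_statistics(reader, dimension: str = "time_period") -> dict:
    """Document-length summary plus the document count per metadata value."""
    lengths, distribution = [], Counter()
    for doc in reader.documents():
        lengths.append(len(doc["text"].split()))          # length in whitespace tokens
        distribution[doc.get(dimension, "unknown")] += 1   # e.g. documents per decade
    return {
        "n_documents": len(lengths),
        "mean_length": mean(lengths) if lengths else 0,
        "median_length": median(lengths) if lengths else 0,
        f"documents_per_{dimension}": dict(distribution),
    }
```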
- Text search (sketched below)
- find texts (e.g. sentences) containing a specific word
- to be implemented by corpus reader
- plain string search or index-based search (TBD, depends on corpus size)
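A plain string search variant could look like the following; the sentence splitting is deliberately naive, and an index-based backend (e.g. ElasticSearch) could replace the inner loop for large corpora:

```python
import re
from typing import Iterator


def find_sentences(reader, word: str) -> Iterator[dict]:
    """Yield sentences containing `word`, together with the document metadata."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    for doc in reader.documents():
        metadata = {key: value for key, value in doc.items() if key != "text"}
        # Naive sentence splitting; a real implementation might use spaCy or NLTK.
        for sentence in re.split(r"(?<=[.!?])\s+", doc["text"]):
            if pattern.search(sentence):
                yield {"sentence": sentence, **metadata}
```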
- Compute contextualized word embeddings with the `transformers` library (see the sketch after this item)
- find texts (e.g. sentences) containing a specific word
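One way to extract an embedding per word occurrence, assuming a fast tokenizer (needed for `return_offsets_mapping`) and the model/tokenizer loaded as above; which layers to sum is one of the configuration options mentioned earlier:

```python
import re

import torch


def word_embeddings(sentence: str, word: str, tokenizer, model, layers=(-4, -3, -2, -1)):
    """Return one contextualized vector per occurrence of `word` in `sentence`."""
    encoded = tokenizer(sentence, return_offsets_mapping=True, return_tensors="pt")
    offsets = encoded.pop("offset_mapping")[0].tolist()  # character span per subword token
    with torch.no_grad():
        hidden_states = model(**encoded, output_hidden_states=True).hidden_states
    # Sum the selected layers -> shape (sequence_length, hidden_size).
    token_vectors = torch.stack([hidden_states[layer][0] for layer in layers]).sum(dim=0)

    vectors = []
    for match in re.finditer(rf"\b{re.escape(word)}\b", sentence, re.IGNORECASE):
        # Subword tokens whose character span overlaps this occurrence of the word.
        mask = torch.tensor([start < match.end() and end > match.start()
                             for start, end in offsets])
        if mask.any():
            vectors.append(token_vectors[mask].mean(dim=0))
    return vectors
```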
- Semantic shift metrics
- Find clusters across corpus: k-means with user-defined `k`
- Find clusters (~meanings) across corpus (e.g. Mean Shift, Hierarchical Clustering)
- Apply clustering per corpus slice
- Measure semantic shift (an illustrative metric is sketched below)
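The list above leaves the exact shift measure open. One illustrative option, assuming scikit-learn and SciPy, is to cluster all occurrence embeddings of a word with k-means and compare how the occurrences from each corpus slice distribute over the clusters, using the Jensen-Shannon distance between consecutive slices:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from sklearn.cluster import KMeans


def semantic_shift(embeddings_by_slice: dict, k: int = 5) -> dict:
    """Illustrative shift score between consecutive corpus slices.

    `embeddings_by_slice` maps a slice label (e.g. a decade) to an array of
    occurrence embeddings with shape (n_occurrences, hidden_size); every slice
    is assumed to contain at least one occurrence.
    """
    labels = sorted(embeddings_by_slice)
    all_vectors = np.vstack([embeddings_by_slice[label] for label in labels])
    clusters = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(all_vectors)

    # Distribution over the k clusters (~word senses) for each slice.
    distributions, start = {}, 0
    for label in labels:
        n = len(embeddings_by_slice[label])
        counts = np.bincount(clusters[start:start + n], minlength=k)
        distributions[label] = counts / counts.sum()
        start += n

    # Jensen-Shannon distance between consecutive slices as the shift score.
    return {(a, b): float(jensenshannon(distributions[a], distributions[b]))
            for a, b in zip(labels, labels[1:])}
```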
Notebooks with example implementations: