Embeddings Analysis Toolset

Workflow

[Figure: The Workflow]

User Requirements

  • Load different language models
  • Load text data from various sources
    • For instance:
      • (NL) Delpher, ANP, StatenGeneraal
      • (EN) COHA
    • (Optional) implement text indexing with Elasticsearch for quick lookup of words
  • Extract contextualized embeddings for specific words, tokens, or phrases
  • (Interactive) embeddings visualization
  • Measure semantic shifts across configurable metadata dimensions, including
    • time period
    • region
    • source (e.g. different newspapers)
  • Extensive configuration, for instance:
    • which layer(s) to use
    • how to measure distances
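
The configuration and shift-measurement requirements above could be sketched roughly as follows. This is only an illustration under assumptions, not the toolset's actual API: `ShiftConfig`, `cosine_distance`, `mean_vector`, and `semantic_shift` are hypothetical names, embeddings are assumed to be already extracted as plain lists of floats, and comparing the mean embedding of consecutive groups is just one possible way to quantify a shift.

```python
from collections import defaultdict
from dataclasses import dataclass
from math import sqrt
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine_distance(a: Vector, b: Vector) -> float:
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


@dataclass
class ShiftConfig:
    """Hypothetical configuration object; all field names are assumptions."""
    # Which model layer(s) to use. Not consumed in this sketch; it would be
    # passed on to the embedding extraction step.
    layers: Tuple[int, ...] = (-1,)
    # Metadata dimension to compare across (e.g. period, region, source).
    group_by: str = "period"
    # How to measure distances between two embeddings.
    distance: Callable[[Vector, Vector], float] = cosine_distance


def mean_vector(vectors: List[Vector]) -> List[float]:
    """Average a non-empty list of equal-length vectors element-wise."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def semantic_shift(
    records: List[Tuple[Dict[str, str], Vector]],
    config: ShiftConfig,
) -> Dict[Tuple[str, str], float]:
    """Distance between the mean embeddings of consecutive metadata groups."""
    groups: Dict[str, List[Vector]] = defaultdict(list)
    for metadata, vector in records:
        groups[metadata[config.group_by]].append(vector)
    keys = sorted(groups)  # groups are compared in sorted key order
    return {
        (earlier, later): config.distance(
            mean_vector(groups[earlier]), mean_vector(groups[later])
        )
        for earlier, later in zip(keys, keys[1:])
    }
```

Passing a different callable as `distance`, or a different metadata key as `group_by`, would change the measurement without touching the rest of the code, which is the kind of configurability the requirements ask for.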

Software Design

Architecture

  • Load language model
    • use the Hugging Face transformers library to download models
      • allow for authentication to access private models
  • Load text data
    • object-oriented design with abstract classes to be implemented for custom formats
    • provide methods for (memory-efficient) loading, text search, lazy evaluation
    • implement corpus readers for plain text, CSV
    • implement corpus slicing on metadata dimensions (e.g. time period, region, source)
  • Corpus statistics
    • based on corpus readers
    • statistical analyses:
      • document properties (e.g. lengths)
      • document distribution per metadata dimension
  • Embeddings visualization
    • find texts (e.g. sentences) containing a specific word
      • to be implemented by corpus reader
      • plain string search or index-based search (TBD, depends on corpus size)
    • compute contextualized word embeddings with the transformers library
  • Semantic shift metrics
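
The corpus-reader part of this architecture might look something like the sketch below. It is an assumption-laden illustration rather than the project's implementation: `Document`, `CorpusReader`, `CsvCorpusReader`, `slice_by`, and `distribution` are hypothetical names, the CSV file is assumed to have a header row with a `text` column, and all remaining columns are treated as metadata.

```python
import csv
from abc import ABC, abstractmethod
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, Iterator


@dataclass
class Document:
    """One text plus its metadata (e.g. period, region, source)."""
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)


class CorpusReader(ABC):
    """Abstract base class; each custom format implements documents()."""

    @abstractmethod
    def documents(self) -> Iterator[Document]:
        """Yield documents one at a time instead of loading them all."""

    def search(self, word: str) -> Iterator[Document]:
        """Case-insensitive substring search (also matches inside words).

        An index-based reader could override this method.
        """
        needle = word.lower()
        return (doc for doc in self.documents() if needle in doc.text.lower())

    def slice_by(self, **criteria: str) -> Iterator[Document]:
        """Lazily keep documents whose metadata equals all given values."""
        return (
            doc
            for doc in self.documents()
            if all(doc.metadata.get(key) == value for key, value in criteria.items())
        )

    def distribution(self, dimension: str) -> Counter:
        """Count documents per value of one metadata dimension."""
        return Counter(
            doc.metadata.get(dimension, "unknown") for doc in self.documents()
        )


class CsvCorpusReader(CorpusReader):
    """Reads one document per CSV row; non-text columns become metadata."""

    def __init__(self, path: str, text_column: str = "text") -> None:
        self.path = path
        self.text_column = text_column

    def documents(self) -> Iterator[Document]:
        with open(self.path, newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                text = row.pop(self.text_column)
                yield Document(text=text, metadata=dict(row))
```

Because `search`, `slice_by`, and `distribution` are defined on the base class in terms of `documents()`, a reader for another format would only need to implement that one method; for example, `CsvCorpusReader("corpus.csv").slice_by(period="1950")` would stream only the rows from that period.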

Notebooks with example implementations: