Skip to content

Latest commit

 

History

History
530 lines (379 loc) · 20.2 KB

arc42.md

File metadata and controls

530 lines (379 loc) · 20.2 KB

DatAasee Architecture Documentation

Version: 0.2

The Metadata-Lake (MDL) DatAasee gathers research metadata and bibliographic data from a multitude of sources, associates them with underlying data-sets, and provides a HTTP API for access, that is utilized by a (prototype) web frontend.

The main goal of the Metadata-Lake is to provision a one-stop shop for research data discovery at university libraries, research libraries, academic libraries, and scientific libraries.

Sections:

  1. Introduction & Goals
  2. Constraints
  3. Context & Scope
  4. Solution Strategy
  5. Building Block View
  6. Runtime View
  7. Deployment View
  8. Crosscutting Concepts
  9. Architectural Decisions
  10. Quality Requirements
  11. Risks & Technical Debt
  12. Glossary

Summary:

  • Data Architecture: Data-Lake with Metadata Catalog
  • Software Architecture: 3-Tier Architecture
    • Data-Tier Model: Graph of wide, denormalized one-big-table vertex
    • Logic-Tier Type: Semantic layer
    • Presentation-Tier Type: HTTP-API

1. Introduction & Goals

1.1 Requirements Overview

Given research and bibliographic (meta)data is maintained in various distributed databases and there is no central access point to browse, search, or locate data-sets. The metadata-lake:

  • ... allows users to search, filter and browse metadata (and data).
  • ... incorporates metadata of research outputs as well as bibliographic metadata.
  • ... cleans, normalizes, and provides metadata.
  • ... facilitates exports of data/metadata bundles to external repositories.
  • ... integrates with other services and processes.

Overview

  • The database is the core component (included)
  • The backend encapsulates the database and spans the API (included)
  • A frontend uses the API (optionally included)
  • Imports of sources to the database via the backend (through the API)
  • Exports to services are triggered externally (through the API)
  • Consumers can interact (through the API)

1.2 Quality Goals

Quality Goal Associated Scenarios
Functional Suitability F0
Transferability T0
Compatibility C0
Operability O0
Maintainability M0, M1

2. Constraints

2.1 Technical Constraints

Constraint Explanation
Cloud Deployability To integrate into existing infrastructure and operation environments, a containered service is required.
Interoperability Data pipelining is required to be compatible to existing systems such as databases.
Extensibility Components such as metadata schemas, data pipelines, and metadata exports are required to be extensible.

2.2 Organisational Constraints

Constraint Explanantion
OAI-PMH Many existing data sources provide a OAI-PMH API which needs to be supported.
S3 File-based ingest has to be also performed via object storage, particularly Ceph's S3 API.
K8 If possible Kubernetes should be supported (in addition to Compose).

2.3 Conventions

Technical

Standard Function
JSON Serialization language for all external messages
JSON:API External message format standardization
JSON Schema External message content validation
YAML Internal processor (and prototype frontend) declaration language
StrictYAML Preferred declaration language dialect
OpenAPI External API definition and documentation format
MD5 Raw metadata checksums
XXH64 Identifier Hashing
Base64URL Identifier Encoding
Compose Deployment and orchestration

Content

Standard Function
DataCite Core metadata vocabulary
FRBR Entity relationships
Fields of Science Scientific classification
SPDX License List Software license names
ISO 8601 Data and time formatting
ISO 639-1 Language name abbreviations
DOI Preferred resource identifier
ORCID Preferred creator identifier

Documentation

Standard Function
divio Software documentation structure
arc42 Software architecture documentation
yasql Database schema documentation

3. Context & Scope

Context

3.1 Business Context

Channel Description
Interact All unpriviledged functionality
Search Query metadata records
Control Monitor, trigger ingests and backups (priviledged)
Forward Send metadata record(s) to service
Import Ingest metadata records from source system

3.2 Technical Context

Channel Description
Interact Unpriviledged HTTP API
Search Requested and responded through HTTP API
Control Priviledged HTTP API
Forward Performed via HTTP
Import Pulled via HTTP

4. Solution Strategy

  • Three-tier architecture:
    • HTTP-API is the primary presentation layer (part of the backend)
    • Web frontend (exclusively using API) is secondary presentation tier
  • Two main components:
    • Database (data tier)
    • Backend (state-less application tier)
  • All components are packaged in containers for:
    • infrastructure compatibility
    • cloud deployability
  • All messaging happens via HTTP APIs:
    • internal between components (containers)
    • external via endpoints (including frontend)
  • Source codes and external messages are in plain text and in standardized formats:
    • External messages are in JSON, formatted as JSON-API, and documented by JSON-Schemas.
    • Declarative sources are in YAML, following StrictYAML.
  • Separate horizontal scaling of database and backend for high availability:
    • Database has replication capability
    • Backend has no state, hence unproblematic
  • Further components are optional:
    • Storage not necessary since only metadata is handled, payload data referenced
    • Web-Frontend uses HTTP API (prototype is included)
  • Declarative realization for high level of abstraction via:
    • Internal Queries: ArcadeDB SQL (external queries may use various query languages)
    • Processes: Configuration-based + Bloblang (data mapping language)

5. Building Block View

Level 0 (Outside View)

Outside View

DatAasee

  • Imports metadata from source systems (DB) via pull
  • Provides API to interact with metadata (endpoints)
  • Exports metadata to other services (triggered via endpoints)

Source Databases (External)

  • Known URLs (ie service or database endpoints) holding metadata
  • Bulk ingested
  • Pollable regularly for updates

Prototype Web-Frontend (Optional)

  • Included prototype frontend
  • External to core system
  • Template and documentation for a production frontend

Level 1 (Inside View)

Inside View

Database

  • Container holding a ArcadeDB database system
  • This core component stores and serves all metadata
  • A system backup saves its database

Backend

  • Container holding a Benthos stream processor
  • This component spans the external API endpoints and translates between data formats as well as between API and database
  • Has no state

Prototype Web-Frontend (Optional)

  • Container holding a Lowdefy web-frontend
  • This optional component renders a web-based user interface
  • Uses API endpoints, (but from the internal network, thus the frontend does not need the external port)

Level 2 (Container View)

Database

Database

  • The native schema is created via SQL (during build)
  • Enumerated types are inserted via SQL (during build)
  • The initialization script loads the schema and preloaded data

Backend

Backend

  • The HTTP API endpoints are setup
  • Custom configurable components (templates) are defined
  • Reusable fixed components (resources) are defined

Prototype Web-Frontend

Frontend

  • Pages are defined via YAML
  • Static assets (images and styles) are loaded
  • Reused template blocks are loaded

6. Runtime View

Processes

/ready Endpoint

Ready Endpoint

See ready endoint docu.


/api Endpoint

API Endpoint

See api endoint docu.


/schema Endpoint

Schema Endpoint

See schema endoint docu.


/attributes Endpoint

Attributes Endpoint

See attributes endoint docu.


/stats Endpoint

Stats Endpoint

See stats endoint docu.


/metadata Endpoint

Metadata Endpoint

See metadata endoint docu.


/insert Endpoint

Insert Endpoint

See insert endoint docu.


/ingest Endpoint

Ingest Endpoint

See ingest endoint docu.


/backup Endpoint

Backup Endpoint

See backup endoint docu.


/health Endpoint

Health Endpoint

See health endoint docu.


7. Deployment View

Overview

Level 0

See compose.yaml for deployment details.


8. Crosscutting Concepts

Internal Concepts

  • All components are separately containerized.
  • All communication between components is performed via HTTP and in JSON.

Security Concepts

  • Read access is granted to every user without limitation.
  • Write access (trigger ingest or backup, insert record) is only granted to the "admin" user.

Development Concepts

  • Container images are multi-stage with a generic base stage and a custom develop and release stage.
  • All images run a health check.

Operational Concepts

  • All components provide (internal) ready endpoints and write logs to the standard output.
  • Secrets are mounted as files.

9. Architectural Decisions

Timestamp Title
Status ...
Decision ...
Consequences ...
2024-07-04 Indirect Processor Dependency Updates
Status Approved
Decision Indirect processor dependency updates do not cause a (minor) version update.
Consequences A release image build (of the current version) can be triggered and processor dependencies are updated in the process.
2024-06-03 API Licensing
Status Approved
Decision The OpenAPI license definition is additionally licensed under CC-BY.
Consequences Easier third-party reimplementation of the DatAasee API.
2024-02-21 Use OAI vs Non-OAI metadata format variants
Status Approved
Decision Non-OAI variants of the DC and DataCite formats are supported.
Consequences More lenient, and less strict ingest of fields.
2024-01-17 Compose-only Deployment
Status Approved
Decision Deployment is solely distributed and initiated by the compose.yaml.
Consequences The compose file and orchestrator have central importance.
2023-11-20 Database Storage
Status Approved
Decision Database uses in-container storage, only backups are stored outside.
Consequences Faster database at the price of fixed savepoints.
2023-08-24 Record Identifier
Status Approved
Decision Use xxhash64 / SHA256 of ingested or inserted raw record.
Consequences Identifier is reproducible but not a URL.
2023-08-08 Ingest Modularity
Status Approved
Decision Ingest sources are passed via API to the backend.
Consequences Sources can be maintained outside and appended during runtime.
2023-05-16 Graph Edges
Status Approved
Decision Graph edges are only set by ingest (or other automatic) processes, not by a user.
Consequences Edge semantics need to be machine-interpretable.
2022-12-07 Frontend Language
Status Approved
Decision Use English language only for frontend and metadata labels and comments.
Consequences Additional translations (German) are not prepared for now.
2022-10-10 Only Virtual Storage
Status Approved
Decision No explicit storage component for data, only metadata is managed.
Consequences No interface or instance ie to Ceph is developed, but URL references (to data storage) are stored.
2022-10-05 API-only Frontend
Status Approved
Decision The HTTP API is the sole frontend, further frontends are only expressions of the API.
Consequences Web frontend can only use API frontend
2022-10-04 Declarative First
Status Approved
Decision Prefer declarative (YAML-based) approaches for defining processes and interfaces to reduce free coding and increase robustness.
Consequences Frontrunners Benthos as backend, and Lowdefy (or uteam) as prototype web-frontend.
2022-09-16 Multi-model Database
Status Approved
Decision Use (property)-graph / document / key-value database as central catalog component for maximal flexible data model.
Consequences Frontrunner ArcadeDB (or OrientDB) as database.

10. Quality Requirements

10.1 Quality Requirements

Quality Category Quality ID Description
Functional Suitability Appropriateness F0 DatAasee should fulfill the expected overall functionality.
Transferability Installability T0 Installation should work in various container-based environments.
Compatibility Interoperability C0 The available protocols (and format parsers) should fit the most common systems.
Operability Ease of Use O0 The API should be self-describing, well documented, and following standards and best practices.
Maintainability Modularity M0 New protocols, format parsers or other pipelines should be implementable without too much effort.
Maintainability Reusability M1 The protocol and format parser codes serve as sample and documentation.

10.2 Quality Scenarios

ID Scenario
F0 Stakeholder project evaluation
T0 Setup of DatAasee by a new operator
C0 Ingesting from a new source system
O0 User and (downstream) developer API Usage
M0 Extending the compatibility to new systems
M1 Development of a follow-up project to DatAasee

11. Risks & Technical Debt

Risk Description Mitigation
DBMS project might cease ArcadeDB is a small project which has small-project risks However, ArcadeDB is derived from OrientDB, which could be a replacement (but not drop-in).
Processor project might complicate Benthos was acquired by "Red Panda" who may change its license or of the connectors Using hard fork bento or self-maintain.

12. Glossary

Term Acronym Definition
Metadata MD All statements about a (tangible or digital) information object.
Metadata-Set A record containing metadata.
Intra Metadata Metadata about the underlying data.
Inter Metadata Metadata about data related to the underlying data.
Descriptive Metadata Metadata describing the underlying data.
Process Metadata Metadata about lineage.
Technical Metadata Metadata about format and structure.
Administrative Metadata Metadata about accessibility.
Social Metadata Metadata about usage and discoverability.
Database DB Collection of related records.
Database Management System DBMS The software running the databases.
Backend BE Software component encoding the internal logic.
Frontend FE (Web-based) software component presenting a user interface.
Container CTR Software packaged into standardized unit for operating-system-level virtualization.
Data Catalog DCAT Inventory of databases.
Metadata Catalog MDCAT Inventory of databases of metadata.
Data Lake DL Structured, semi-structures, and unstructured data architecture.
Metadata Lake MDL Structured, semi-structures, and unstructured data architecture for metadata management.
Extract-Transform-Load ETL A typical ingestion process for structured data.
Extract-Load-Transform ELT A typical ingestion process for unstructured data.
Extract-transform-Load-Transform EtLT An ingestion process for semi-structured data.
Declarative Programming Programming style of expressing logic without prescribing control flow ("what", not "how").
Low-Code Functionality assembly using high-level prefabricated components.
Declarative Low-Code Defining an application only by configuration of components (and minimal explicit transformations).
Application Programming Interface API Specification and implementation of a way for software to interact (here HTTP API).
Domain Specific Language DSL A formal language designed for a particular application.
Command-Query-Responsibility-Segregation CQRS API pattern separating read and write requests.