DatAasee Architecture Documentation
Version: 0.2
The Metadata-Lake (MDL) DatAasee gathers research metadata and bibliographic
data from a multitude of sources, associates them with the underlying data-sets, and
provides an HTTP API for access, which is utilized by a (prototype) web frontend.
The main goal of the Metadata-Lake is to provide a one-stop shop for research
data discovery at university, research, academic, and scientific libraries.
Sections:
Introduction & Goals
Constraints
Context & Scope
Solution Strategy
Building Block View
Runtime View
Deployment View
Crosscutting Concepts
Architectural Decisions
Quality Requirements
Risks & Technical Debt
Glossary
Summary:
Data Architecture: Data-Lake with Metadata Catalog
Software Architecture: 3-Tier Architecture
Data-Tier Model: Graph of wide, denormalized, one-big-table vertices
Logic-Tier Type: Semantic layer
Presentation-Tier Type: HTTP-API
1.1 Requirements Overview
Research and bibliographic (meta)data is maintained in various distributed
databases, and there is no central access point to browse, search, or locate
data-sets. The metadata-lake therefore:
... allows users to search, filter and browse metadata (and data).
... incorporates metadata of research outputs as well as bibliographic metadata.
... cleans, normalizes, and provides metadata.
... facilitates exports of data/metadata bundles to external repositories.
... integrates with other services and processes.
The database is the core component (included)
The backend encapsulates the database and spans the API (included)
A frontend uses the API (optionally included)
Imports of sources to the database via the backend (through the API)
Exports to services are triggered externally (through the API)
Consumers can interact (through the API)
Quality goals and associated scenarios:
Functional Suitability: F0
Transferability: T0
Compatibility: C0
Operability: O0
Maintainability: M0, M1
2.1 Technical Constraints
Cloud Deployability: To integrate into existing infrastructure and operation environments, a containerized service is required.
Interoperability: Data pipelining is required to be compatible with existing systems such as databases.
Extensibility: Components such as metadata schemas, data pipelines, and metadata exports are required to be extensible.
2.2 Organisational Constraints
OAI-PMH: Many existing data sources provide an OAI-PMH API, which needs to be supported.
S3: File-based ingest also has to be possible via object storage, particularly Ceph's S3 API.
K8s: If possible, Kubernetes should be supported (in addition to Compose).
Standards and their functions:
JSON: Serialization language for all external messages
JSON:API: External message format standardization
JSON Schema: External message content validation
YAML: Internal processor (and prototype frontend) declaration language
StrictYAML: Preferred declaration language dialect
OpenAPI: External API definition and documentation format
MD5: Raw metadata checksums
XXH64: Identifier hashing
Base64URL: Identifier encoding (see the sketch after this list)
Compose: Deployment and orchestration
Documentation standards and their functions:
divio: Software documentation structure
arc42: Software architecture documentation
yasql: Database schema documentation
Channels and their descriptions:
Interact: All unprivileged functionality
Search: Query metadata records
Control: Monitor, trigger ingests and backups (privileged)
Forward: Send metadata record(s) to a service
Import: Ingest metadata records from a source system
Channel realizations:
Interact: Unprivileged HTTP API
Search: Requested and responded through the HTTP API
Control: Privileged HTTP API
Forward: Performed via HTTP
Import: Pulled via HTTP
Three-tier architecture:
HTTP-API is the primary presentation tier (part of the backend)
Web frontend (exclusively using the API) is the secondary presentation tier
Two main components:
Database (data tier)
Backend (state-less application tier)
All components are packaged in containers for:
infrastructure compatibility
cloud deployability
All messaging happens via HTTP APIs:
internally between components (containers)
externally via endpoints (including the frontend)
Source code and external messages are plain text in standardized formats:
External messages are in JSON, formatted per JSON:API, and documented by JSON Schemas.
Declarative sources are in YAML, following StrictYAML.
Separate horizontal scaling of database and backend for high availability:
Database has replication capability
Backend has no state, hence scaling it is unproblematic
Further components are optional:
Storage is not necessary since only metadata is handled; payload data is only referenced
Web-Frontend uses HTTP API (prototype is included)
Declarative realization for a high level of abstraction via:
Internal Queries: ArcadeDB SQL (external queries may use various query languages)
Processes: Configuration-based + Bloblang (data mapping language); see the sketch after this list
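A minimal sketch of such a configuration-based process with a Bloblang mapping, in Benthos-style YAML; the source URL, database name, port, and field names are placeholders and assumptions, not the actual DatAasee processor definitions.

```yaml
# Illustrative Benthos-style pipeline sketch (placeholders only, not the actual DatAasee configuration).
input:
  http_client:                  # pull records from a source system over HTTP
    url: "https://source.example.org/oai?verb=ListRecords&metadataPrefix=oai_dc"
    verb: GET
pipeline:
  processors:
    - mapping: |                # Bloblang data mapping: normalize the pulled record
        root.title = this.title | "unknown"
        root.source = "example-source"
output:
  http_client:                  # hand the normalized record to the database's HTTP API
    url: "http://database:2480/api/v1/command/dataasee"   # assumed ArcadeDB command endpoint and database name
    verb: POST
```

Declarations of this kind keep ingest logic in configuration plus small Bloblang mappings instead of free-form code, which matches the declarative-first decision below.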
Imports metadata from source systems (DB) via pull
Provides API to interact with metadata (endpoints)
Exports metadata to other services (triggered via endpoints)
Source Databases (External)
Known URLs (i.e., service or database endpoints) holding metadata
Bulk ingested
Pollable regularly for updates
Prototype Web-Frontend (Optional)
Included prototype frontend
External to core system
Template and documentation for a production frontend
Container holding an ArcadeDB database system
This core component stores and serves all metadata
A system backup saves its database
Container holding a Benthos stream processor
This component spans the external API endpoints and translates between data formats as well as between API and database
Has no state
Prototype Web-Frontend (Optional)
Container holding a Lowdefy web-frontend
This optional component renders a web-based user interface
Uses the API endpoints, but from the internal network, thus the frontend does not need the external port
The native schema is created via SQL (during build)
Enumerated types are inserted via SQL (during build)
The initialization script loads the schema and preloaded data
The HTTP API endpoints are set up
Custom configurable components (templates) are defined
Reusable fixed components (resources) are defined
Pages are defined via YAML (see the sketch after this list)
Static assets (images and styles) are loaded
Reused template blocks are loaded
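As an illustration of the YAML-defined pages, a Lowdefy-style page declaration looks roughly as follows; the page and block identifiers, block types, and properties are assumptions, not the prototype's actual pages.

```yaml
# Illustrative Lowdefy-style page sketch (assumed ids, block types, and properties).
id: search
type: PageHeaderMenu            # page container block providing layout and menu
properties:
  title: Search
blocks:
  - id: query_input
    type: TextInput             # free-text search field
    properties:
      title: Search metadata
  - id: results_title
    type: Title
    properties:
      content: Results
```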
See the ready endpoint documentation.
See the api endpoint documentation.
See the schema endpoint documentation.
See the attributes endpoint documentation.
See the stats endpoint documentation.
See the metadata endpoint documentation.
See the insert endpoint documentation.
See the ingest endpoint documentation.
See the backup endpoint documentation.
See the health endpoint documentation.
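Since the external API is defined and documented via OpenAPI (see the standards above), an endpoint entry takes roughly the following form; this excerpt is a hand-written sketch, and only the project's OpenAPI file is authoritative.

```yaml
# Hand-written OpenAPI sketch of a single endpoint (the project's OpenAPI file is authoritative).
openapi: 3.0.3
info:
  title: DatAasee API (excerpt)
  version: "0.2"
paths:
  /ready:
    get:
      summary: Readiness probe
      responses:
        "200":
          description: Backend is ready to accept requests
```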
See compose.yaml for deployment details; a minimal sketch follows after this list.
All components are separately containerized.
All communication between components is performed via HTTP and in JSON.
Read access is granted to every user without limitation.
Write access (trigger ingest or backup, insert record) is only granted to the "admin" user.
Container images are multi-stage, with a generic base stage and custom develop and release stages.
All images run a health check.
All components provide (internal) ready endpoints and write logs to standard output.
Secrets are mounted as files.
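A compose.yaml along the lines described above might be sketched as follows; image names, ports, health-check commands, and secret paths are placeholders, and the compose.yaml shipped with the project is authoritative.

```yaml
# Illustrative compose sketch (placeholder names, ports, and paths; the shipped compose.yaml is authoritative).
services:
  database:
    image: dataasee/database:latest          # placeholder image name
    healthcheck:                             # all images run a health check
      test: ["CMD", "curl", "-sf", "http://localhost:2480/api/v1/ready"]  # assumes curl in the image
      interval: 30s
    secrets:
      - db_password                          # secrets are mounted as files
  backend:
    image: dataasee/backend:latest           # placeholder image name
    depends_on:
      - database
    ports:
      - "8000:8000"                          # placeholder external API port
    secrets:
      - admin_password
secrets:
  db_password:
    file: ./secrets/db_password
  admin_password:
    file: ./secrets/admin_password
```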
9. Architectural Decisions
Each decision is recorded with: Timestamp, Title, Status, Decision, Consequences.
2024-07-04: Indirect Processor Dependency Updates
Status: Approved
Decision: Indirect processor dependency updates do not cause a (minor) version update.
Consequences: A release image build (of the current version) can be triggered, and processor dependencies are updated in the process.
2024-06-03: API Licensing
Status: Approved
Decision: The OpenAPI license definition is additionally licensed under CC-BY.
Consequences: Easier third-party reimplementation of the DatAasee API.
2024-02-21: OAI vs. Non-OAI Metadata Format Variants
Status: Approved
Decision: Non-OAI variants of the DC and DataCite formats are supported.
Consequences: More lenient, less strict ingest of fields.
2024-01-17: Compose-only Deployment
Status: Approved
Decision: Deployment is solely distributed and initiated via the compose.yaml.
Consequences: The compose file and orchestrator have central importance.
2023-11-20: Database Storage
Status: Approved
Decision: The database uses in-container storage; only backups are stored outside.
Consequences: Faster database at the price of fixed savepoints.
2023-08-24: Record Identifier
Status: Approved
Decision: Use xxhash64 / SHA256 of the ingested or inserted raw record.
Consequences: The identifier is reproducible but not a URL.
2023-08-08: Ingest Modularity
Status: Approved
Decision: Ingest sources are passed via the API to the backend.
Consequences: Sources can be maintained outside and appended during runtime.
2023-05-16: Graph Edges
Status: Approved
Decision: Graph edges are only set by ingest (or other automatic) processes, not by a user.
Consequences: Edge semantics need to be machine-interpretable.
2022-12-07: Frontend Language
Status: Approved
Decision: Use English only for the frontend and for metadata labels and comments.
Consequences: Additional translations (German) are not prepared for now.
2022-10-10: Only Virtual Storage
Status: Approved
Decision: No explicit storage component for data; only metadata is managed.
Consequences: No storage interface or instance (i.e., to Ceph) is developed, but URL references (to data storage) are stored.
2022-10-05: API-only Frontend
Status: Approved
Decision: The HTTP API is the sole frontend; further frontends are only expressions of the API.
Consequences: The web frontend can only use the API frontend.
2022-10-04: Declarative First
Status: Approved
Decision: Prefer declarative (YAML-based) approaches for defining processes and interfaces to reduce free coding and increase robustness.
Consequences: Frontrunners: Benthos as backend, and Lowdefy (or uteam) as prototype web-frontend.
2022-09-16: Multi-model Database
Status: Approved
Decision: Use a (property-)graph / document / key-value database as the central catalog component for a maximally flexible data model.
Consequences: Frontrunner: ArcadeDB (or OrientDB) as database.
10.1 Quality Requirements
F0 (Functional Suitability, Appropriateness): DatAasee should fulfill the expected overall functionality.
T0 (Transferability, Installability): Installation should work in various container-based environments.
C0 (Compatibility, Interoperability): The available protocols (and format parsers) should fit the most common systems.
O0 (Operability, Ease of Use): The API should be self-describing, well documented, and follow standards and best practices.
M0 (Maintainability, Modularity): New protocols, format parsers, or other pipelines should be implementable without undue effort.
M1 (Maintainability, Reusability): The protocol and format parser codes serve as samples and documentation.
Associated scenarios:
F0: Stakeholder project evaluation
T0: Setup of DatAasee by a new operator
C0: Ingesting from a new source system
O0: User and (downstream) developer API usage
M0: Extending the compatibility to new systems
M1: Development of a follow-up project to DatAasee
11. Risks & Technical Debt
Risk: The DBMS project might cease.
Description: ArcadeDB is a small project, which carries small-project risks.
Mitigation: ArcadeDB is derived from OrientDB, which could serve as a replacement (but not a drop-in one).
Risk: The processor project might become complicated.
Description: Benthos was acquired by Redpanda, which may change its license or that of the connectors.
Mitigation: Use the hard fork bento, or self-maintain.
12. Glossary
Metadata (MD): All statements about a (tangible or digital) information object.
Metadata-Set: A record containing metadata.
Intra Metadata: Metadata about the underlying data.
Inter Metadata: Metadata about data related to the underlying data.
Descriptive Metadata: Metadata describing the underlying data.
Process Metadata: Metadata about lineage.
Technical Metadata: Metadata about format and structure.
Administrative Metadata: Metadata about accessibility.
Social Metadata: Metadata about usage and discoverability.
Database (DB): Collection of related records.
Database Management System (DBMS): The software running the databases.
Backend (BE): Software component encoding the internal logic.
Frontend (FE): (Web-based) software component presenting a user interface.
Container (CTR): Software packaged into a standardized unit for operating-system-level virtualization.
Data Catalog (DCAT): Inventory of databases.
Metadata Catalog (MDCAT): Inventory of databases of metadata.
Data Lake (DL): Structured, semi-structured, and unstructured data architecture.
Metadata Lake (MDL): Structured, semi-structured, and unstructured data architecture for metadata management.
Extract-Transform-Load (ETL): A typical ingestion process for structured data.
Extract-Load-Transform (ELT): A typical ingestion process for unstructured data.
Extract-transform-Load-Transform (EtLT): An ingestion process for semi-structured data.
Declarative Programming: Programming style of expressing logic without prescribing control flow ("what", not "how").
Low-Code: Functionality assembly using high-level prefabricated components.
Declarative Low-Code: Defining an application only by configuration of components (and minimal explicit transformations).
Application Programming Interface (API): Specification and implementation of a way for software to interact (here, the HTTP API).
Domain Specific Language (DSL): A formal language designed for a particular application.
Command-Query-Responsibility-Segregation (CQRS): API pattern separating read and write requests.