DatAasee Architecture Documentation

Version: 0.2

The Metadata-Lake (MDL) DatAasee gathers research metadata and bibliographic data from a multitude of sources, associates them with underlying data-sets, and provides an HTTP API for access, which is used by a (prototype) web frontend.

The main goal of the Metadata-Lake is to provide a one-stop shop for research data discovery at university, research, academic, and scientific libraries.

Sections:

Introduction & Goals
Constraints
Context & Scope
Solution Strategy
Building Block View
Runtime View
Deployment View
Crosscutting Concepts
Architectural Decisions
Quality Requirements
Risks & Technical Debt
Glossary

Summary:

Data Architecture: Data-Lake with Metadata Catalog
Software Architecture: 3-Tier Architecture
- Data-Tier Model: Graph of wide, denormalized one-big-table vertex
- Logic-Tier Type: Semantic layer
- Presentation-Tier Type: HTTP-API

1. Introduction & Goals

1.1 Requirements Overview

Research and bibliographic (meta)data is maintained in various distributed databases, and there is no central access point to browse, search, or locate data-sets. To address this, the metadata-lake:

... allows users to search, filter and browse metadata (and data).
... incorporates metadata of research outputs as well as bibliographic metadata.
... cleans, normalizes, and provides metadata.
... facilitates exports of data/metadata bundles to external repositories.
... integrates with other services and processes.

The database is the core component (included)
The backend encapsulates the database and spans the API (included)
A frontend uses the API (optionally included)
Imports of sources to the database via the backend (through the API)
Exports to services are triggered externally (through the API)
Consumers can interact (through the API)

1.2 Quality Goals

Quality Goal	Associated Scenarios
Functional Suitability	F0
Transferability	T0
Compatibility	C0
Operability	O0
Maintainability	M0, M1

2. Constraints

2.1 Technical Constraints

Constraint	Explanation
Cloud Deployability	To integrate into existing infrastructure and operation environments, a containerized service is required.
Interoperability	Data pipelining is required to be compatible with existing systems such as databases.
Extensibility	Components such as metadata schemas, data pipelines, and metadata exports are required to be extensible.

2.2 Organisational Constraints

Constraint	Explanation
OAI-PMH	Many existing data sources provide an OAI-PMH API, which needs to be supported.
S3	File-based ingest also has to be possible via object storage, particularly Ceph's S3 API.
K8s	If possible, Kubernetes should be supported (in addition to Compose).
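
As an illustration of the OAI-PMH constraint, the following is a minimal sketch of how a harvesting client might build `ListRecords` requests and follow resumption tokens. The verb, argument names, and XML namespace come from the OAI-PMH 2.0 protocol; the function names and the base URL in the usage note are hypothetical and not taken from DatAasee, whose ingest is realized declaratively in the backend.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# XML namespace defined by the OAI-PMH 2.0 protocol.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None, resumption_token=None):
    """Build an OAI-PMH ListRecords request URL (hypothetical helper)."""
    if resumption_token:
        # resumptionToken is an exclusive argument: no other arguments besides the verb.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_date:
            # Selective harvesting: only records changed since this date.
            params["from"] = from_date
    return f"{base_url}?{urlencode(params)}"


def next_token(response_xml):
    """Return the resumption token of a response, or None if the list is complete."""
    root = ET.fromstring(response_xml)
    token = root.find(".//oai:resumptionToken", OAI_NS)
    if token is None or not (token.text or "").strip():
        return None
    return token.text.strip()
```

A harvester would call `list_records_url("https://repo.example.org/oai", from_date="2024-01-01")` (placeholder URL), fetch the result, and repeat with `resumption_token=next_token(body)` until `next_token` returns `None`.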

2.3 Conventions

Technical

Standard	Function
JSON	Serialization language for all external messages
JSON:API	External message format standardization
JSON Schema	External message content validation
YAML	Internal processor (and prototype frontend) declaration language
StrictYAML	Preferred declaration language dialect
OpenAPI	External API definition and documentation format
MD5	Raw metadata checksums
XXH64	Identifier Hashing
Base64URL	Identifier Encoding
Compose	Deployment and orchestration
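
The checksum, hashing, and encoding conventions above could be combined roughly as follows. This is a sketch under stated assumptions, not the DatAasee implementation: XXH64 is not part of the Python standard library, so the first 8 bytes of a SHA-256 digest stand in for the 64-bit hash here, and stripping the Base64 padding is likewise an assumption. The function names are hypothetical.

```python
import base64
import hashlib


def raw_checksum(raw: bytes) -> str:
    """MD5 hex digest of the raw metadata record."""
    return hashlib.md5(raw).hexdigest()


def record_identifier(raw: bytes) -> str:
    """Derive a reproducible, URL-safe identifier from a raw record.

    ASSUMPTION: 8 bytes of SHA-256 substitute for the 64-bit XXH64 digest,
    and trailing '=' padding is removed from the Base64URL encoding.
    """
    digest64 = hashlib.sha256(raw).digest()[:8]
    return base64.urlsafe_b64encode(digest64).rstrip(b"=").decode("ascii")
```

Because the identifier depends only on the raw record bytes, ingesting the same record twice yields the same identifier. A 64-bit digest encodes to 11 Base64URL characters once padding is removed.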

Content

Standard	Function
DataCite	Core metadata vocabulary
FRBR	Entity relationships
Fields of Science	Scientific classification
SPDX License List	Software license names
ISO 8601	Date and time formatting
ISO 639-1	Language name abbreviations
DOI	Preferred resource identifier
ORCID	Preferred creator identifier
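
To make two of the content conventions concrete, the sketch below validates an ORCID iD via its check character (ISO 7064 MOD 11-2, as documented by ORCID) and formats a timestamp according to ISO 8601. The function names are hypothetical; whether and where DatAasee performs such checks is not stated in this document.

```python
import re
from datetime import datetime, timezone

# Four groups of four characters; the last character may be 'X'.
ORCID_PATTERN = re.compile(r"\d{4}-\d{4}-\d{4}-\d{3}[\dX]")


def is_valid_orcid(orcid: str) -> bool:
    """Check the format and the MOD 11-2 check character of an ORCID iD."""
    if not ORCID_PATTERN.fullmatch(orcid):
        return False
    chars = orcid.replace("-", "")
    total = 0
    for ch in chars[:-1]:
        total = (total + int(ch)) * 2
    check = (12 - total % 11) % 11
    expected = "X" if check == 10 else str(check)
    return chars[-1] == expected


def to_iso8601(timestamp: float) -> str:
    """Format a Unix timestamp as an ISO 8601 date-time in UTC."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).isoformat()
```

For example, `is_valid_orcid("0000-0002-1825-0097")` returns `True`, and `to_iso8601(0)` returns `"1970-01-01T00:00:00+00:00"`.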

Documentation

Standard	Function
divio	Software documentation structure
arc42	Software architecture documentation
yasql	Database schema documentation

3. Context & Scope

3.1 Business Context

Channel	Description
Interact	All unprivileged functionality
Search	Query metadata records
Control	Monitor, trigger ingests and backups (privileged)
Forward	Send metadata record(s) to service
Import	Ingest metadata records from source system

3.2 Technical Context

Channel	Description
Interact	Unprivileged `HTTP` API
Search	Requested and responded through `HTTP` API
Control	Privileged `HTTP` API
Forward	Performed via `HTTP`
Import	Pulled via `HTTP`

4. Solution Strategy

Three-tier architecture:
- HTTP-API is the primary presentation tier (part of the backend)
- Web frontend (exclusively using the API) is the secondary presentation tier
Two main components:
- Database (data tier)
- Backend (state-less application tier)
All components are packaged in containers for:
- infrastructure compatibility
- cloud deployability
All messaging happens via HTTP APIs:
- internal between components (containers)
- external via endpoints (including frontend)
Source codes and external messages are in plain text and in standardized formats:
- External messages are in JSON, formatted as JSON:API, and documented by JSON Schemas.
- Declarative sources are in YAML, following StrictYAML.
Separate horizontal scaling of database and backend for high availability:
- Database has replication capability
- Backend has no state, hence unproblematic
Further components are optional:
- Storage is not necessary since only metadata is handled; payload data is only referenced
- Web-Frontend uses HTTP API (prototype is included)
Declarative realization for high level of abstraction via:
- Internal Queries: ArcadeDB SQL (external queries may use various query languages)
- Processes: Configuration-based + Bloblang (data mapping language)
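
The message conventions of this strategy (JSON serialization, JSON:API formatting) can be illustrated with a small sketch. The top-level members `jsonapi`, `data`, `links`, and `errors` and the resource members `type`, `id`, and `attributes` are defined by the JSON:API specification; the resource type `metadata`, the attribute names, and the function names are hypothetical and do not describe the actual DatAasee responses.

```python
import json


def jsonapi_document(record_id: str, attributes: dict, self_url: str) -> str:
    """Wrap one record in a JSON:API top-level document (hypothetical helper)."""
    document = {
        "jsonapi": {"version": "1.1"},
        "data": {
            "type": "metadata",  # ASSUMPTION: illustrative resource type
            "id": record_id,
            "attributes": attributes,
        },
        "links": {"self": self_url},
    }
    return json.dumps(document, ensure_ascii=False)


def jsonapi_error(status: int, title: str) -> str:
    """Build a JSON:API error document; 'data' and 'errors' never appear together."""
    return json.dumps({"errors": [{"status": str(status), "title": title}]})
```

A consumer can rely on the same envelope for every endpoint and only needs to inspect whether `data` or `errors` is present.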

5. Building Block View

Level 0 (Outside View)

DatAasee

Imports metadata from source systems (DB) via pull
Provides API to interact with metadata (endpoints)
Exports metadata to other services (triggered via endpoints)

Source Databases (External)

Known URLs (i.e. service or database endpoints) holding metadata
Bulk ingested
Pollable regularly for updates

Prototype Web-Frontend (Optional)

Included prototype frontend
External to core system
Template and documentation for a production frontend

Level 1 (Inside View)

Database

Container holding an ArcadeDB database system
This core component stores and serves all metadata
A system backup saves its database

Backend

Container holding a Benthos stream processor
This component spans the external API endpoints and translates between data formats as well as between API and database
Has no state
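
The format translation performed by the backend can be pictured as a stateless, table-driven mapping. In DatAasee this is realized declaratively (configuration plus Bloblang); the Python sketch below only illustrates the idea. The source keys are Dublin Core element names, while the target field names and the function name are hypothetical.

```python
# Hypothetical mapping from Dublin Core element names to target field names.
FIELD_MAP = {
    "title": "name",
    "creator": "creators",
    "date": "publicationYear",
    "identifier": "identifiers",
}


def translate(source: dict) -> dict:
    """Rename known fields and drop empty values; unmapped fields are ignored."""
    target = {}
    for src_key, dst_key in FIELD_MAP.items():
        value = source.get(src_key)
        if value not in (None, "", []):
            target[dst_key] = value
    return target
```

Since `translate` keeps no state between calls, any number of backend instances could apply it in parallel, which is what makes horizontal scaling of the backend unproblematic.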

Prototype Web-Frontend (Optional)

Container holding a Lowdefy web-frontend
This optional component renders a web-based user interface
Uses the API endpoints, but from the internal network; thus the frontend does not need the external port

Level 2 (Container View)

Database

The native schema is created via SQL (during build)
Enumerated types are inserted via SQL (during build)
The initialization script loads the schema and preloaded data

Backend

The HTTP API endpoints are set up
Custom configurable components (templates) are defined
Reusable fixed components (resources) are defined

Prototype Web-Frontend

Pages are defined via YAML
Static assets (images and styles) are loaded
Reused template blocks are loaded

6. Runtime View

Processes

7. Deployment View

Level 0

See compose.yaml for deployment details.

8. Crosscutting Concepts

Internal Concepts

All components are separately containerized.
All communication between components is performed via HTTP and in JSON.

Security Concepts

Read access is granted to every user without limitation.
Write access (trigger ingest or backup, insert record) is only granted to the "admin" user.
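
The read/write rule above amounts to a very small authorization check. The following sketch assumes that the `/insert`, `/ingest`, and `/backup` endpoints are the write operations named above and that the caller's user name has already been authenticated; the authentication mechanism itself is not described here, and the names `WRITE_ENDPOINTS`, `ADMIN_USER`, and `is_allowed` are hypothetical.

```python
from typing import Optional

# ASSUMPTION: these endpoints correspond to "trigger ingest or backup, insert record".
WRITE_ENDPOINTS = {"/insert", "/ingest", "/backup"}
ADMIN_USER = "admin"


def is_allowed(endpoint: str, user: Optional[str]) -> bool:
    """Return True if the (already authenticated) user may call the endpoint."""
    if endpoint in WRITE_ENDPOINTS:
        # Write access is only granted to the "admin" user.
        return user == ADMIN_USER
    # Read access is granted to every user, including anonymous ones.
    return True
```

With this rule, `is_allowed("/metadata", None)` is `True`, while `is_allowed("/ingest", None)` is `False`.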

Development Concepts

Container images are multi-stage with a generic base stage and a custom develop and release stage.
All images run a health check.

Operational Concepts

All components provide (internal) ready endpoints and write logs to the standard output.
Secrets are mounted as files.
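
How an operator-side script might consume these two concepts is sketched below: polling a ready endpoint until the component responds, and reading a secret from a mounted file. The URL, the file path, and all function names are hypothetical; retry counts and delays are arbitrary defaults.

```python
import time
import urllib.error
import urllib.request
from pathlib import Path


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET request to the URL yields a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(url: str, probe=http_ok, attempts: int = 10, delay: float = 1.0) -> bool:
    """Poll a ready endpoint until it answers or the attempts are exhausted."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False


def read_secret(path: str) -> str:
    """Read a secret from a mounted file, dropping surrounding whitespace."""
    return Path(path).read_text(encoding="utf-8").strip()
```

The `probe` parameter exists so the polling logic can be exercised without a network, for example `wait_until_ready("http://backend:8080/ready", probe=lambda url: True)` with a placeholder URL.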

9. Architectural Decisions

Timestamp	Title
Status	...
Decision	...
Consequences	...

2024-07-04	Indirect Processor Dependency Updates
Status	Approved
Decision	Indirect processor dependency updates do not cause a (minor) version update.
Consequences	A release image build (of the current version) can be triggered and processor dependencies are updated in the process.

2024-06-03	API Licensing
Status	Approved
Decision	The OpenAPI license definition is additionally licensed under CC-BY.
Consequences	Easier third-party reimplementation of the DatAasee API.

2024-02-21	Use OAI vs Non-OAI metadata format variants
Status	Approved
Decision	Non-OAI variants of the DC and DataCite formats are supported.
Consequences	More lenient ingest of fields.

2024-01-17	Compose-only Deployment
Status	Approved
Decision	Deployment is solely distributed and initiated by the `compose.yaml`.
Consequences	The compose file and orchestrator have central importance.

2023-11-20	Database Storage
Status	Approved
Decision	Database uses in-container storage, only backups are stored outside.
Consequences	Faster database at the price of fixed savepoints.

2023-08-24	Record Identifier
Status	Approved
Decision	Use xxhash64 / SHA256 of ingested or inserted raw record.
Consequences	Identifier is reproducible but not a URL.

2023-08-08	Ingest Modularity
Status	Approved
Decision	Ingest sources are passed via API to the backend.
Consequences	Sources can be maintained outside and appended during runtime.

2023-05-16	Graph Edges
Status	Approved
Decision	Graph edges are only set by ingest (or other automatic) processes, not by a user.
Consequences	Edge semantics need to be machine-interpretable.

2022-12-07	Frontend Language
Status	Approved
Decision	Use English language only for frontend and metadata labels and comments.
Consequences	Additional translations (German) are not prepared for now.

2022-10-10	Only Virtual Storage
Status	Approved
Decision	No explicit storage component for data, only metadata is managed.
Consequences	No interface or instance (i.e. to Ceph) is developed, but URL references (to data storage) are stored.

2022-10-05	API-only Frontend
Status	Approved
Decision	The HTTP API is the sole frontend, further frontends are only expressions of the API.
Consequences	The web frontend can only use the API frontend.

2022-10-04	Declarative First
Status	Approved
Decision	Prefer declarative (YAML-based) approaches for defining processes and interfaces to reduce free coding and increase robustness.
Consequences	Frontrunners Benthos as backend, and Lowdefy (or uteam) as prototype web-frontend.

2022-09-16	Multi-model Database
Status	Approved
Decision	Use (property)-graph / document / key-value database as central catalog component for maximal flexible data model.
Consequences	Frontrunner ArcadeDB (or OrientDB) as database.

10. Quality Requirements

10.1 Quality Requirements

Quality Category	Quality	ID	Description
Functional Suitability	Appropriateness	F0	DatAasee should fulfill the expected overall functionality.
Transferability	Installability	T0	Installation should work in various container-based environments.
Compatibility	Interoperability	C0	The available protocols (and format parsers) should fit the most common systems.
Operability	Ease of Use	O0	The API should be self-describing, well documented, and following standards and best practices.
Maintainability	Modularity	M0	New protocols, format parsers or other pipelines should be implementable without too much effort.
Maintainability	Reusability	M1	The protocol and format parser codes serve as sample and documentation.

10.2 Quality Scenarios

ID	Scenario
F0	Stakeholder project evaluation
T0	Setup of DatAasee by a new operator
C0	Ingesting from a new source system
O0	User and (downstream) developer API Usage
M0	Extending the compatibility to new systems
M1	Development of a follow-up project to DatAasee

11. Risks & Technical Debt

Risk	Description	Mitigation
DBMS project might cease	`ArcadeDB` is a small project and carries the associated risks	`ArcadeDB` is derived from `OrientDB`, which could serve as a replacement (but not drop-in).
Processor project might complicate	`Benthos` was acquired by Redpanda, which may change its license or that of the connectors	Use the hard fork `bento` or self-maintain.

12. Glossary

Term	Acronym	Definition
Metadata	MD	All statements about a (tangible or digital) information object.
Metadata-Set		A record containing metadata.
Intra Metadata		Metadata about the underlying data.
Inter Metadata		Metadata about data related to the underlying data.
Descriptive Metadata		Metadata describing the underlying data.
Process Metadata		Metadata about lineage.
Technical Metadata		Metadata about format and structure.
Administrative Metadata		Metadata about accessibility.
Social Metadata		Metadata about usage and discoverability.
Database	DB	Collection of related records.
Database Management System	DBMS	The software running the databases.
Backend	BE	Software component encoding the internal logic.
Frontend	FE	(Web-based) software component presenting a user interface.
Container	CTR	Software packaged into standardized unit for operating-system-level virtualization.
Data Catalog	DCAT	Inventory of databases.
Metadata Catalog	MDCAT	Inventory of databases of metadata.
Data Lake	DL	Structured, semi-structured, and unstructured data architecture.
Metadata Lake	MDL	Structured, semi-structured, and unstructured data architecture for metadata management.
Extract-Transform-Load	ETL	A typical ingestion process for structured data.
Extract-Load-Transform	ELT	A typical ingestion process for unstructured data.
Extract-transform-Load-Transform	EtLT	An ingestion process for semi-structured data.
Declarative Programming		Programming style of expressing logic without prescribing control flow ("what", not "how").
Low-Code		Functionality assembly using high-level prefabricated components.
Declarative Low-Code		Defining an application only by configuration of components (and minimal explicit transformations).
Application Programming Interface	API	Specification and implementation of a way for software to interact (here HTTP API).
Domain Specific Language	DSL	A formal language designed for a particular application.
Command-Query-Responsibility-Segregation	CQRS	API pattern separating read and write requests.


`/ready` Endpoint

`/api` Endpoint

`/schema` Endpoint

`/attributes` Endpoint

`/stats` Endpoint

`/metadata` Endpoint

`/insert` Endpoint

`/ingest` Endpoint

`/backup` Endpoint

`/health` Endpoint