Deploy • Ingest • Architecture • To Do • Links
This framework is based on the DataHub ingestion architecture, which "supports an extremely flexible ingestion architecture that can support push, pull, asynchronous and synchronous models". For the time being, this framework only supports asynchronous communication.
For a deep dive into metadata ingestion architectures, check out this awesome article by Shirshanka Das from LinkedIn.
## Deploy

Prerequisites:

- Azure subscription with `Contributor` access
- Latest version of Terraform (getting started guide here)
```sh
az login
terraform init
terraform apply
```
## Ingest

Developers are encouraged to write plugins to integrate with various data sources. The `plugins/example` directory contains example code that can be used as a starting point for writing a plugin. This example follows the sidecar pattern.
At a high level, a plugin is responsible for three things (sketched in code below):

- Capturing metadata changes
- Modelling the metadata using the Schema Registry
- Emitting metadata events
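A minimal sketch of those three responsibilities might look like the following. It assumes Python with the `azure-eventhub` package, plain JSON serialisation in place of a registered schema, and the `CONNECTION_STRING`/`EVENTHUB` variables used later in this guide; it is illustrative only, not the actual plugin API of this repo.

```python
import json
import os

from azure.eventhub import EventData, EventHubProducerClient


def capture_changes():
    """Capture metadata changes from a source system (hypothetical stub)."""
    yield {"entity": "dataset", "name": "sales.orders", "change": "CREATED"}


def model_event(change):
    """Model the change as a Metadata Change Event; a real plugin would
    validate this against a schema fetched from the Schema Registry."""
    return {"schemaVersion": 1, "payload": change}


def emit(producer, event):
    """Emit the serialised event to the Event Streaming service."""
    batch = producer.create_batch()
    batch.add(EventData(json.dumps(event)))
    producer.send_batch(batch)


if __name__ == "__main__":
    producer = EventHubProducerClient.from_connection_string(
        conn_str=os.environ["CONNECTION_STRING"],
        eventhub_name=os.environ["EVENTHUB"],
    )
    with producer:
        for change in capture_changes():
            emit(producer, model_event(change))
        print("MCE Published")
```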
To get started:
- Install Docker
- Install the Azure CLI
Next, run the following snippet. The example code will send a mock Metadata Change Event (MCE) to the Metaverse API. Execution is successful when `MCE Published` is printed to the console.
```sh
az login

export CONNECTION_STRING=$(az eventhubs \
    namespace authorization-rule keys list \
    --resource-group rg-metaverse-resources \
    --namespace-name ehns-metaverse \
    --name RootManageSharedAccessKey \
    --query primaryConnectionString \
    --out tsv)

export EVENTHUB=eh-metaverse

docker run -it \
    --env CONNECTION_STRING="$CONNECTION_STRING" \
    --env EVENTHUB="$EVENTHUB" \
    $(docker build -q .)
```
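To verify the mock event actually arrived, one option (an assumption on my part, not part of the example code) is to read the hub back with a small Python consumer using the same environment variables and the `azure-eventhub` package:

```python
import os

from azure.eventhub import EventHubConsumerClient


def on_event(partition_context, event):
    # Print each event body; the mock MCE should appear here.
    print(partition_context.partition_id, event.body_as_str())


consumer = EventHubConsumerClient.from_connection_string(
    conn_str=os.environ["CONNECTION_STRING"],
    consumer_group="$Default",  # the default consumer group always exists
    eventhub_name=os.environ["EVENTHUB"],
)
with consumer:
    # "-1" starts from the beginning of the retained stream; Ctrl+C to stop.
    consumer.receive(on_event=on_event, starting_position="-1")
```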
## Architecture

The diagram below best illustrates how the framework operates. The left side shows how the internal components interact; the right side shows how a Metadata Management system might consume and produce metadata events.
- Metadata producers retrieve a schema from the Schema Service. This serves as a contract which allows both producing and consuming applications to evolve independently (see the sketch after this list).
- Metadata producers emit metadata events which are serialised using this shared schema. These events are received by the Event Streaming service.
- Metadata events are persisted to the Object Storage service, where they are retained indefinitely.
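As a sketch of the first two steps, Azure's Schema Registry (hosted on the Event Hubs namespace) can hold the shared contract. The namespace, schema group, and Avro definition below are illustrative assumptions, not values from this repo:

```python
import json

from azure.identity import DefaultAzureCredential
from azure.schemaregistry import SchemaRegistryClient

# Assumed names; align these with your deployment.
FQNS = "ehns-metaverse.servicebus.windows.net"
GROUP = "metaverse"

# A tiny Avro contract for a Metadata Change Event.
definition = json.dumps({
    "type": "record",
    "name": "MetadataChangeEvent",
    "namespace": "metaverse",
    "fields": [
        {"name": "entity", "type": "string"},
        {"name": "change", "type": "string"},
    ],
})

client = SchemaRegistryClient(
    fully_qualified_namespace=FQNS,
    credential=DefaultAzureCredential(),
)
# Registering returns a stable schema id; producers serialise against it and
# consumers resolve it, so both sides can evolve independently.
props = client.register_schema(GROUP, "MetadataChangeEvent", definition, "Avro")
print(props.id)
```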
The context diagram shows how this system fits into the world around it. See c4model.com.
The `docs/architecture` directory contains lightweight architecture decision records (ADRs). To understand ADRs, see this blog post.
For examples, check out this repo.
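For a sense of how lightweight these records are, here is the shape of the canonical first ADR that tools such as adr-tools generate (an illustration, not a record from this repo):

```markdown
# 1. Record architecture decisions

## Status

Accepted

## Context

We need to record the architectural decisions made on this project.

## Decision

We will use Architecture Decision Records, as described by Michael Nygard.

## Consequences

See Michael Nygard's article, linked in the Links section below.
```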
## To Do

- Create infrastructure code for Event Hubs and Data Lake
- Getting started guide for infrastructure developers
- Create a single infrastructure test to be used as a reference
- Create sample code for metadata producers
- Document Schema Registry creation
- Automate Schema Registry creation
- Create infrastructure test for Schema Registry
- Document infrastructure testing
- Apply data governance and security controls for Data Lake
- Apply security permissions for services
- Apply security permissions for users
- Export resource manager audit logs to Log Analytics
## Links

- https://engineering.linkedin.com/blog/2020/datahub-popular-metadata-architectures-explained
- https://cognitect.com/blog/2011/11/15/documenting-architecture-decisions
- https://martinfowler.com/eaaDev/EventSourcing.html
- https://martinfowler.com/articles/platform-prerequisites.html
- https://github.com/alphagov/govuk-aws/tree/master/docs/architecture/decisions