This repository contains an example of how to integrate Data Mesh Manager into a Google Cloud Platform (GCP) account and automate permission granting in BigQuery based on agreed data usage agreements.
It reads the Data Mesh Manager events API and uses serverless GCP services such as Cloud Functions, Firestore, Cloud Storage, Secret Manager, PubSub, and Cloud Scheduler.
The infrastructure is set up using Terraform.
- We do not handle deleted data usage agreements, so make sure to deactivate data usage agreements before deleting them. Otherwise, the granted permissions will remain in place.
- Not all kinds of output ports are supported at this point. Currently, we support only BigQuery tables with views.
For a better understanding of how the integration works, see this simple architecture diagram. Arrows show access direction.
                                          ┌─────────────────┐
                                          │                 │
                                          │Data Mesh Manager│
                                          │                 │
                                          └─────────────────┘
                                             ▲           ▲
                                             │           │
                                             │           │
                                             │           │
┌────────────────────────────────────────────┼───────────┼──────────────────────────────────────────┐
│                                            │           │                                          │
│                                            │           │  4. read usage agreement information     │
│                1. pull events              │           │                                          │
│           ┌────────────────────────────────┘           └────────────────────────┐                 │
│           │                                                                     │                 │
│           │                                                                     │                 │
│           │                                                                     │                 │
│  ┌────────┴────────┐                ┌─────────────────┐               ┌─────────┴─────────┐       │
│  │    poll_feed    │   2. write     │   dmm_events    │  3. trigger   │ manage_permissions│       │
│  │                 ├───────────────►│                 ├──────────────►│                   │       │
│  │ [Cloud Function]│                │  [PubSub Topic] │               │  [Cloud Function] │       │
│  └─────────────────┘                └─────────────────┘               └─────────┬─────────┘       │
│                                                                                 │                 │
│                                                                                 │ 5. manage       │
│                                                                                 │                 │
│                                                                                 ▼                 │
│                                                                        ┌────────────────┐         │
│                                                                        │    BigQuery    │         │
│                                                                        │   Authorized   │         │
│                                                                        │      View      │         │
│                                                                        └────────────────┘         │
│                                                                                                   │
│                                                                                                   │
│                                                                                  [GCP Integration]│
└───────────────────────────────────────────────────────────────────────────────────────────────────┘
- Execution: The function runs every minute, scheduled by a Cloud Scheduler job that publishes a message to a PubSub topic, which triggers the function.
- Reading Events from Data Mesh Manager: It reads all unprocessed events from the Data Mesh Manager API.
- Sending Events to PubSub: These events are then sent to a PubSub topic for further processing.
- Tracking Last Event ID: To resume processing at the correct feed position, the function stores the ID of the last processed event in a Firestore document, which subsequent executions read before polling again (see the sketch after this list).
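The following is a minimal Python sketch of what such a polling function could look like, not the actual implementation. The Data Mesh Manager events endpoint, the Firestore document path, and the environment variable names are assumptions for illustration; check the real code and API documentation for the actual values.

import json
import os

import requests
from google.cloud import firestore, pubsub_v1

PROJECT_ID = os.environ["GCP_PROJECT"]                       # assumed env var
TOPIC_ID = os.environ.get("TOPIC_ID", "dmm_events")          # assumed env var
DMM_API_KEY = os.environ["DMM_API_KEY"]                      # assumed env var
EVENTS_URL = "https://api.datamesh-manager.com/api/events"   # assumed endpoint

db = firestore.Client()
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT_ID, TOPIC_ID)
state_doc = db.collection("dmm").document("last_event")      # assumed document path

def poll_feed(event, context):
    """Triggered every minute by Cloud Scheduler via PubSub."""
    snapshot = state_doc.get()
    last_event_id = (snapshot.to_dict() or {}).get("lastEventId") if snapshot.exists else None

    # 1. Read all unprocessed events from the Data Mesh Manager API.
    params = {"lastEventId": last_event_id} if last_event_id else {}
    response = requests.get(EVENTS_URL, params=params,
                            headers={"x-api-key": DMM_API_KEY}, timeout=30)
    response.raise_for_status()

    # 2. Send each event to the dmm_events PubSub topic for further processing.
    for dmm_event in response.json():
        publisher.publish(topic_path, json.dumps(dmm_event).encode("utf-8")).result()
        last_event_id = dmm_event["id"]

    # 3. Remember the last event ID so the next run resumes at the correct feed position.
    if last_event_id:
        state_doc.set({"lastEventId": last_event_id})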
- Execution: The function is triggered by new messages on the PubSub topic.
- Filtering Relevant Events: The function selectively processes events based on their type. It focuses on events of the types DataUsageAgreementActivatedEvent and DataUsageAgreementDeactivatedEvent.
- DataUsageAgreementActivatedEvent: When a DataUsageAgreementActivatedEvent occurs, the function authorizes the BigQuery view against the source BigQuery dataset (see the sketch after this list). These policies allow access from a producing data product's output port to a consuming data product. Events are skipped if a policy already exists. The data usage agreement in Data Mesh Manager is tagged with gcp-integration and gcp-integration-active.
- DataUsageAgreementDeactivatedEvent: When a DataUsageAgreementDeactivatedEvent occurs, the function removes the permissions of the consuming data product to access the output port of the producing data product. Events are skipped if no corresponding policy is found. The data usage agreement in Data Mesh Manager is tagged with gcp-integration and gcp-integration-inactive.
- Extra Information: To effectively process the events, the function may retrieve additional information from the Data Mesh Manager API, including details about the data usage agreement, the data products involved, and the teams associated with them.
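As a rough Python illustration of the grant/revoke logic (not the actual implementation), the sketch below adds or removes the consuming view as an authorized view on the producing dataset. The event payload layout and the placeholder table IDs are assumptions; in the real integration they are resolved via the Data Mesh Manager API and the gcp-table-id custom fields.

import base64
import json

from google.cloud import bigquery

bq = bigquery.Client()

def view_access_entry(consumer_table_id):
    # An authorized-view entry has no role; it references the view by project/dataset/table.
    view_ref = bigquery.TableReference.from_string(consumer_table_id)
    return bigquery.AccessEntry(role=None, entity_type="view", entity_id=view_ref.to_api_repr())

def manage_permissions(event, context):
    """Triggered by new messages on the dmm_events PubSub topic."""
    dmm_event = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    event_type = dmm_event.get("type")
    if event_type not in ("DataUsageAgreementActivatedEvent", "DataUsageAgreementDeactivatedEvent"):
        return  # ignore all other event types

    # Placeholders: in the real integration these are looked up via the Data Mesh Manager API
    # (data usage agreement -> data products -> gcp-table-id custom fields).
    provider_table_id = "provider-project.provider_dataset.orders"
    consumer_table_id = "consumer-project.consumer_dataset.orders_view"

    source_dataset = bq.get_dataset(provider_table_id.rsplit(".", 1)[0])
    entries = list(source_dataset.access_entries)
    entry = view_access_entry(consumer_table_id)

    if event_type == "DataUsageAgreementActivatedEvent":
        if entry in entries:
            return  # policy already exists, skip
        entries.append(entry)
    else:
        if entry not in entries:
            return  # no corresponding policy found, skip
        entries.remove(entry)

    source_dataset.access_entries = entries
    bq.update_dataset(source_dataset, ["access_entries"])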
To allow the integration to work, your data products in Data Mesh Manager must contain some metadata in their custom fields.
A consuming data product requires information about its BigQuery table id. We use the notation of the data product specification here.
dataProductSpecification: 0.0.1
info:
  id: example_consumer_id
  name: Example Consumer Data Product
owner:
  teamId: example_team_id
custom:
  gcp-table-id: <project-name>.<dataset-name>.<table-name>
A providing data product also requires information about the BigQuery table id. We use the notation of the data product specification here.
dataProductSpecification: 0.0.1
info:
  id: example_provider_id
  name: Example Provider Data Product
owner:
  teamId: example_team_id
custom:
  gcp-table-id: <project-name>.<dataset-name>.<table-name>
outputPorts:
  - id: example_output_port_id
- Setup Terraform Variables: An example of a minimum configuration can be found here. Copy this file and name the copy terraform.tfvars. Set your credentials.
- Login Into GCP: There are multiple options to authenticate, detailed in the Terraform provider documentation.
- Deployment: Nothing more than terraform apply is needed to deploy everything to your GCP project.
This project is distributed under the MIT License. It includes various open-source dependencies, each governed by its respective license.
For more details, please refer to the LICENSES file.