Merge pull request #80 from databrickslabs/feature/v0.0.8
Merging feature/v0.0.8 release
ravi-databricks authored Aug 6, 2024
2 parents 3555aaa + a15c517 commit 1bc01ba
Showing 123 changed files with 120,551 additions and 26,981 deletions.
20 changes: 9 additions & 11 deletions .github/workflows/release.yml
@@ -7,20 +7,21 @@ on:

jobs:
release:
runs-on: ${{ matrix.os }}
strategy:
max-parallel: 1
matrix:
python-version: [ 3.9 ]
os: [ ubuntu-latest ]
runs-on: ubuntu-latest
environment: release
permissions:
# Used to authenticate to PyPI via OIDC and sign the release's artifacts with sigstore-python.
id-token: write
# Used to attach signing artifacts to the published release.
contents: write

steps:
- uses: actions/checkout@v1

- name: Set up Python ${{ matrix.python-version }}
- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: ${{ matrix.python-version }}
python-version: 3.9
cache: 'pip' # caching pip dependencies
cache-dependency-path: setup.py

@@ -35,6 +36,3 @@ jobs:

- name: Publish a Python distribution to PyPI
uses: pypa/gh-action-pypi-publish@release/v1
with:
user: __token__
password: ${{ secrets.LABS_PYPI_TOKEN }}
3 changes: 0 additions & 3 deletions .gitignore
@@ -151,9 +151,6 @@ deployment-merged.yaml
.idea/
.vscode/

# ignore integration test onboarding file.
integration-tests/conf/dlt-meta/onboarding.json

.databricks
.databricks-login.json
demo/conf/onboarding.json
14 changes: 14 additions & 0 deletions CHANGELOG.md
@@ -1,4 +1,18 @@
# Changelog
## [v.0.0.8]
- Added dlt append_flow api support: [PR](https://github.com/databrickslabs/dlt-meta/pull/58)
- Added dlt append_flow api support for silver layer: [PR](https://github.com/databrickslabs/dlt-meta/pull/63)
- Added support for file metadata columns for autoloader: [PR](https://github.com/databrickslabs/dlt-meta/pull/56)
- Added support for Bring your own custom transformation: [Issue](https://github.com/databrickslabs/dlt-meta/issues/68)
- Added support to Unify PyPI releases with GitHub OIDC: [PR](https://github.com/databrickslabs/dlt-meta/pull/62)
- Added demo for append_flow and file_metadata options: [PR](https://github.com/databrickslabs/dlt-meta/issues/74)
- Added Demo for silver fanout architecture: [PR](https://github.com/databrickslabs/dlt-meta/pull/83)
- Added documentation in docs site for new features: [PR](https://github.com/databrickslabs/dlt-meta/pull/64)
- Added unit tests to showcase silver layer fanout examples: [PR](https://github.com/databrickslabs/dlt-meta/pull/67)
- Fixed issue for No such file or directory: '/demo': [PR](https://github.com/databrickslabs/dlt-meta/issues/59)
- Fixed issue DLT-META CLI onboard command issue for Azure: databricks.sdk.errors.platform.ResourceAlreadyExists: [PR](https://github.com/databrickslabs/dlt-meta/issues/51)
- Fixed issue Changed dbfs.create to mkdirs for CLI: [PR](https://github.com/databrickslabs/dlt-meta/pull/53)
- Fixed issue DLT-META CLI should use pypi lib instead of whl: [PR](https://github.com/databrickslabs/dlt-meta/pull/79)

## [v.0.0.7]
- Added dlt-meta cli documentation and readme with browser support: [PR](https://github.com/databrickslabs/dlt-meta/pull/45)
14 changes: 10 additions & 4 deletions README.md
@@ -40,10 +40,9 @@
---

# Project Overview
`DLT-META` is a metadata-driven framework designed to work with [Delta Live Tables](https://www.databricks.com/product/delta-live-tables). This framework enables the automation of bronze and silver data pipelines by leveraging metadata recorded in an onboarding JSON file. This file, known as the Dataflowspec, serves as the data flow specification, detailing the source and target metadata required for the pipelines.

`DLT-META` is a metadata-driven framework based on Databricks [Delta Live Tables](https://www.databricks.com/product/delta-live-tables) (aka DLT) which lets you automate your bronze and silver data pipelines.

With this framework you need to record the source and target metadata in an onboarding json file which acts as the data flow specification aka Dataflowspec. A single generic `DLT` pipeline takes the `Dataflowspec` and runs your workloads.
In practice, a single generic DLT pipeline reads the Dataflowspec and uses it to orchestrate and run the necessary data processing workloads. This approach streamlines the development and management of data pipelines, allowing for a more efficient and scalable data processing workflow.
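
For illustration only, a single Dataflowspec entry pairs source and target metadata along these lines. The field names below are hypothetical, shown as a Python dict for brevity (the real onboarding file is JSON); consult the docs site for the actual schema.

```python
# Hypothetical Dataflowspec entry -- field names are illustrative only,
# not the framework's actual schema; the real onboarding file is JSON.
dataflow_spec_entry = {
    "data_flow_id": "100",
    "data_flow_group": "A1",
    "source_format": "cloudFiles",
    "source_details": {"source_path": "s3://example-bucket/customers/"},
    "bronze_table": "bronze.customers",
    "bronze_reader_options": {"cloudFiles.format": "json"},
    "silver_table": "silver.customers",
    "silver_transformation_json": "conf/silver_transformations.json",
}
```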

### Components:

@@ -128,11 +127,18 @@ If you want to run existing demo files please follow these steps before running
```

```commandline
databricks labs dlt-meta onboard
dlt_meta_home=$(pwd)
```

```commandline
export PYTHONPATH=$dlt_meta_home
```
```commandline
databricks labs dlt-meta onboard
```
![onboardingDLTMeta.gif](docs/static/images/onboardingDLTMeta.gif)


The above commands will prompt you for onboarding details. If you have cloned the dlt-meta git repo, accept the defaults, which use the config from the demo folder.
![onboardingDLTMeta_2.gif](docs/static/images/onboardingDLTMeta_2.gif)

197 changes: 185 additions & 12 deletions demo/README.md
@@ -1,30 +1,42 @@
# [DLT-META](https://github.com/databrickslabs/dlt-meta) DEMO's
1. [DAIS 2023 DEMO](#dais-2023-demo): Showcases DLT-META's capability to create Bronze and Silver DLT pipelines with initial and incremental modes automatically.
2. [Databricks Techsummit Demo](#databricks-tech-summit-fy2024-demo): Automated ingestion of hundreds of data sources into bronze and silver DLT pipelines.
3. [Append FLOW Autoloader Demo](#append-flow-autoloader-file-metadata-demo): Write to the same target from multiple sources using [dlt.append_flow](https://docs.databricks.com/en/delta-live-tables/flows.html#append-flows) and adding the [File metadata column](https://docs.databricks.com/en/ingestion/file-metadata-column.html)
4. [Append FLOW Eventhub Demo](#append-flow-eventhub-demo): Write to the same target from multiple sources using [dlt.append_flow](https://docs.databricks.com/en/delta-live-tables/flows.html#append-flows) and adding the [File metadata column](https://docs.databricks.com/en/ingestion/file-metadata-column.html)
5. [Silver Fanout Demo](#silver-fanout-demo): This demo showcases the implementation of fanout architecture in the silver layer.



# DAIS 2023 DEMO
## [DAIS 2023 Session Recording](https://www.youtube.com/watch?v=WYv5haxLlfA)
This Demo launches Bronze and Silver DLT pipleines with following activities:
This Demo launches Bronze and Silver DLT pipelines with following activities:
- Customer and Transactions feeds for initial load
- Adds new feeds Product and Stores to existing Bronze and Silver DLT pipelines with metadata changes.
- Runs Bronze and Silver DLT for incremental load for CDC events

### Steps:
1. Launch Terminal/Command promt
1. Launch Terminal/Command prompt

2. Install [Databricks CLI](https://docs.databricks.com/dev-tools/cli/index.html)

3. ```git clone https://github.com/databrickslabs/dlt-meta.git ```
3. ```commandline
git clone https://github.com/databrickslabs/dlt-meta.git
```

4. ```cd dlt-meta```
4. ```commandline
cd dlt-meta
```

5. Set the Python environment variable in the terminal
```commandline
dlt_meta_home=$(pwd)
```
export PYTHONPATH=<<local dlt-meta path>>

```commandline
export PYTHONPATH=$dlt_meta_home
```

6. Run the command ```python demo/launch_dais_demo.py --source=cloudfiles --uc_catalog_name=<<uc catalog name>> --cloud_provider_name=aws --dbr_version=13.3.x-scala2.12 --dbfs_path=dbfs:/dais-dlt-meta-demo-automated_new```
6. Run the command ```python demo/launch_dais_demo.py --source=cloudfiles --uc_catalog_name=<<uc catalog name>> --cloud_provider_name=aws --dbr_version=15.3.x-scala2.12 --dbfs_path=dbfs:/dais-dlt-meta-demo-automated```
- cloud_provider_name : aws or azure or gcp
- dbr_version : Databricks Runtime Version
- dbfs_path : Path on your Databricks workspace where demo will be copied for launching DLT-META Pipelines
@@ -53,17 +65,28 @@ This demo will launch auto generated tables(100s) inside single bronze and silver DLT pipeline

2. Install [Databricks CLI](https://docs.databricks.com/dev-tools/cli/index.html)

3. ```git clone https://github.com/databrickslabs/dlt-meta.git ```
3. ```commandline
git clone https://github.com/databrickslabs/dlt-meta.git
```

4. ```cd dlt-meta```
4. ```commandline
cd dlt-meta
```

5. Set the Python environment variable in the terminal
```commandline
dlt_meta_home=$(pwd)
```
export PYTHONPATH=<<local dlt-meta path>>

```commandline
export PYTHONPATH=$dlt_meta_home
```

6. Run the command ```python demo/launch_techsummit_demo.py --source=cloudfiles --cloud_provider_name=aws --dbr_version=13.3.x-scala2.12 --dbfs_path=dbfs:/techsummit-dlt-meta-demo-automated ```
- cloud_provider_name : aws or azure or gcp
6. Run the command
```commandline
python demo/launch_techsummit_demo.py --source=cloudfiles --cloud_provider_name=aws --dbr_version=15.3.x-scala2.12 --dbfs_path=dbfs:/techsummit-dlt-meta-demo-automated
```
- cloud_provider_name : aws or azure
- dbr_version : Databricks Runtime Version
- dbfs_path : Path on your Databricks workspace where demo will be copied for launching DLT-META Pipelines
- you can provide `--profile=databricks_profile name` if you already have the Databricks CLI configured; otherwise the command prompt will ask for host and token
@@ -82,4 +105,154 @@ This demo will launch auto generated tables(100s) inside single bronze and silver DLT pipeline

- Copy the displayed token

- Paste to command prompt


# Append Flow Autoloader file metadata demo:
This demo will perform the following tasks (a conceptual sketch follows the list):
- Read from different source paths using Autoloader and write to the same target using the append_flow API
- Read from different delta tables and write to the same silver table using the append_flow API
- Add file_name and file_path to the target bronze table for the autoloader source using the [File metadata column](https://docs.databricks.com/en/ingestion/file-metadata-column.html)
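
The sketch below shows, in rough form, the pattern this demo automates: several Autoloader sources appended into one streaming target with `dlt.append_flow`, plus file name and path captured from the `_metadata` column. Paths and table names are illustrative; DLT-META generates the real flows from the onboarding file.

```python
import dlt
from pyspark.sql.functions import col

# `spark` is provided by the DLT runtime when this runs inside a pipeline.

# One streaming target that several flows append into.
dlt.create_streaming_table(name="bronze_customers")

def autoload(path):
    # Autoloader source; the _metadata column exposes file_path / file_name per input file.
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load(path)
        .withColumn("source_file_path", col("_metadata.file_path"))
        .withColumn("source_file_name", col("_metadata.file_name"))
    )

# Illustrative source paths -- the demo derives the real ones from the onboarding file.
@dlt.append_flow(target="bronze_customers")
def customers_feed_1():
    return autoload("/demo/resources/data/customers_1/")

@dlt.append_flow(target="bronze_customers")
def customers_feed_2():
    return autoload("/demo/resources/data/customers_2/")
```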

1. Launch Terminal/Command prompt

2. Install [Databricks CLI](https://docs.databricks.com/dev-tools/cli/index.html)

3. ```commandline
git clone https://github.com/databrickslabs/dlt-meta.git
```

4. ```commandline
cd dlt-meta
```

5. Set the Python environment variable in the terminal
```commandline
dlt_meta_home=$(pwd)
```

```commandline
export PYTHONPATH=$dlt_meta_home
```

6. ```commandline
python demo/launch_af_cloudfiles_demo.py --cloud_provider_name=aws --dbr_version=15.3.x-scala2.12 --dbfs_path=dbfs:/tmp/DLT-META/demo/ --uc_catalog_name=ravi_dlt_meta_uc
```

- cloud_provider_name : aws or azure or gcp
- dbr_version : Databricks Runtime Version
- dbfs_path : Path on your Databricks workspace where demo will be copied for launching DLT-META Pipelines
- uc_catalog_name: Unity catalog name
- you can provide `--profile=databricks_profile name` if you already have the Databricks CLI configured; otherwise the command prompt will ask for host and token

![af_am_demo.png](docs/static/images/af_am_demo.png)

# Append Flow Eventhub demo:
- Read from different Eventhub topics and write to the same target tables using the append_flow API (a conceptual sketch follows)
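
As a rough sketch (connection details, secret contents, and table names are assumptions, not the project's actual code), an Eventhub topic can be read through its Kafka-compatible endpoint on port 9093 and appended into a shared target:

```python
import dlt

# `spark` and `dbutils` are provided by the Databricks/DLT runtime.
# Assumption: the secret holds the full Event Hubs connection string
# (Endpoint=sb://...;SharedAccessKeyName=...;SharedAccessKey=...).
connection_string = dbutils.secrets.get(
    scope="eventhubs_dltmeta_creds", key="RootManageSharedAccessKey"
)
sasl_jaas = (
    "kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required "
    f'username="$ConnectionString" password="{connection_string}";'
)

dlt.create_streaming_table(name="bronze_events")

@dlt.append_flow(target="bronze_events")
def eventhub_main_feed():
    return (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "dltmeta.servicebus.windows.net:9093")
        .option("subscribe", "dltmeta_demo")  # primary eventhub topic
        .option("kafka.security.protocol", "SASL_SSL")
        .option("kafka.sasl.mechanism", "PLAIN")
        .option("kafka.sasl.jaas.config", sasl_jaas)
        .load()
    )
```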

### Steps:
1. Launch Terminal/Command prompt

2. Install [Databricks CLI](https://docs.databricks.com/dev-tools/cli/index.html)

3. ```commandline
git clone https://github.com/databrickslabs/dlt-meta.git
```

4. ```commandline
cd dlt-meta
```
5. Set the Python environment variable in the terminal
```commandline
dlt_meta_home=$(pwd)
```
```commandline
export PYTHONPATH=$dlt_meta_home
```
6. Eventhub
- Needs an Eventhub instance running
- Needs two Eventhub topics: one for the main feed (eventhub_name) and one for the append flow feed (eventhub_name_append_flow)
- Create a Databricks secrets scope for the Eventhub keys
- ```commandline
databricks secrets create-scope eventhubs_dltmeta_creds
```
- ```commandline
databricks secrets put-secret --json '{
"scope": "eventhubs_dltmeta_creds",
"key": "RootManageSharedAccessKey",
"string_value": "<<value>>"
}'
```
- Create Databricks secrets to store the producer and consumer keys using the scope created above

- The following arguments are mandatory for running the EventHubs demo:
  - cloud_provider_name: Cloud provider name e.g. aws or azure
  - dbr_version: Databricks Runtime Version e.g. 15.3.x-scala2.12
  - uc_catalog_name: Unity catalog name e.g. ravi_dlt_meta_uc
  - dbfs_path: Path on your Databricks workspace where demo will be copied for launching DLT-META Pipelines e.g. dbfs:/tmp/DLT-META/demo/
  - eventhub_namespace: Eventhub namespace e.g. dltmeta
  - eventhub_name: Primary Eventhub name e.g. dltmeta_demo
  - eventhub_name_append_flow: Secondary Eventhub name for the append flow feed e.g. dltmeta_demo_af
  - eventhub_producer_accesskey_name: Producer Databricks access key name e.g. RootManageSharedAccessKey
  - eventhub_consumer_accesskey_name: Consumer Databricks access key name e.g. RootManageSharedAccessKey
  - eventhub_secrets_scope_name: Databricks secret scope name e.g. eventhubs_dltmeta_creds
  - eventhub_port: Eventhub port e.g. 9093

7. ```commandline
python3 demo/launch_af_eventhub_demo.py --cloud_provider_name=aws --dbr_version=15.3.x-scala2.12 --dbfs_path=dbfs:/tmp/DLT-META/demo/ --uc_catalog_name=ravi_dlt_meta_uc --eventhub_name=dltmeta_demo --eventhub_name_append_flow=dltmeta_demo_af --eventhub_secrets_scope_name=dltmeta_eventhub_creds --eventhub_namespace=dltmeta --eventhub_port=9093 --eventhub_producer_accesskey_name=RootManageSharedAccessKey --eventhub_consumer_accesskey_name=RootManageSharedAccessKey --eventhub_accesskey_secret_name=RootManageSharedAccessKey --uc_catalog_name=ravi_dlt_meta_uc
```

![af_eh_demo.png](docs/static/images/af_eh_demo.png)


# Silver Fanout Demo
- This demo will showcase the onboarding process for the silver fanout pattern.
- Run the onboarding process for the bronze cars table, which contains data from various countries.
- Run the onboarding process for the silver tables, which have a `where_clause` based on the country condition specified in [silver_transformations_cars.json](https://github.com/databrickslabs/dlt-meta/blob/main/demo/conf/silver_transformations_cars.json).
- Run the Bronze DLT pipeline, which produces the cars table.
- Run the Silver DLT pipeline, fanning out from the bronze cars table to country-specific tables such as cars_usa, cars_uk, cars_germany, and cars_japan (a conceptual sketch follows this list).
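
Conceptually, the fanout amounts to several silver tables filtering the same bronze source. The sketch below is illustrative only; the real tables are generated by DLT-META from the onboarding file and the `where_clause` entries in silver_transformations_cars.json.

```python
import dlt

# Illustrative country filters; in the demo they come from the where_clause
# entries in silver_transformations_cars.json.
COUNTRY_TABLES = {
    "cars_usa": "country = 'USA'",
    "cars_uk": "country = 'UK'",
    "cars_germany": "country = 'Germany'",
    "cars_japan": "country = 'Japan'",
}

def make_fanout_table(name, where_clause):
    @dlt.table(name=name)
    def _silver():
        # Reads the bronze cars table; in the demo bronze and silver run as
        # separate pipelines, so the real code may read a fully qualified table instead.
        return dlt.read("cars").where(where_clause)
    return _silver

for table_name, predicate in COUNTRY_TABLES.items():
    make_fanout_table(table_name, predicate)
```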

### Steps:
1. Launch Terminal/Command prompt

2. Install [Databricks CLI](https://docs.databricks.com/dev-tools/cli/index.html)

3. ```commandline
git clone https://github.com/databrickslabs/dlt-meta.git
```

4. ```commandline
cd dlt-meta
```
5. Set the Python environment variable in the terminal
```commandline
dlt_meta_home=$(pwd)
```
```commandline
export PYTHONPATH=$dlt_meta_home
```
6. Run the command ```python demo/launch_silver_fanout_demo.py --source=cloudfiles --uc_catalog_name=<<uc catalog name>> --cloud_provider_name=aws --dbr_version=15.3.x-scala2.12 --dbfs_path=dbfs:/dais-dlt-meta-silver-fanout```
- cloud_provider_name : aws or azure
- dbr_version : Databricks Runtime Version
- dbfs_path : Path on your Databricks workspace where demo will be copied for launching DLT-META Pipelines
- you can provide `--profile=databricks_profile name` if you already have the Databricks CLI configured; otherwise the command prompt will ask for host and token.

- 6a. Databricks Workspace URL:
  - Enter your workspace URL, in the format https://<instance-name>.cloud.databricks.com. To get your workspace URL, see Workspace instance names, URLs, and IDs.

- 6b. Token:
  - In your Databricks workspace, click your Databricks username in the top bar, and then select User Settings from the drop-down.

- On the Access tokens tab, click Generate new token.

- (Optional) Enter a comment that helps you to identify this token in the future, and change the token’s default lifetime of 90 days. To create a token with no lifetime (not recommended), leave the Lifetime (days) box empty (blank).

- Click Generate.

- Copy the displayed token

- Paste to command prompt

![silver_fanout_workflow.png](docs/static/images/silver_fanout_workflow.png)

![silver_fanout_dlt.png](docs/static/images/silver_fanout_dlt.png)
27 changes: 27 additions & 0 deletions demo/conf/afam_silver_transformations.json
@@ -0,0 +1,27 @@
[
{
"target_table": "customers",
"select_exp": [
"address",
"email",
"firstname",
"id",
"lastname",
"operation_date",
"operation",
"_rescued_data"
]
},
{
"target_table": "transactions",
"select_exp": [
"id",
"customer_id",
"amount",
"item_count",
"operation_date",
"operation",
"_rescued_data"
]
}
]
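
Each entry above names a silver target table and the `select_exp` projection applied when loading it. Roughly speaking (a sketch of the idea, not DLT-META's actual code path, and the bronze table name is assumed), that corresponds to:

```python
# Sketch: apply one transformation entry from the JSON above.
# `spark` comes from the Databricks runtime; the bronze table name is assumed.
entry = {
    "target_table": "customers",
    "select_exp": ["address", "email", "firstname", "id", "lastname",
                   "operation_date", "operation", "_rescued_data"],
}

silver_df = spark.read.table("bronze_customers").selectExpr(*entry["select_exp"])
silver_df.write.mode("append").saveAsTable(entry["target_table"])
```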
