
Module 6 Exactly Once Improvement

In this module you will add a new Azure Function that parses Databricks metadata files, and enhance the ADX ingestion functions, to prevent the rare cases in which duplicated data is ingested.

We aim to provision the areas highlighted in light yellow in the following system architecture diagram.

architecture-module6

Module Goal

  • Deploy the Databricks metadata parser function.
  • Enhance the ADX ingestion functions to check for duplicated files.

Module Preparation

  • Azure Subscription
  • PowerShell Core (version 6.x or later) environment (PowerShell Core runs on Windows, macOS, and Linux)
  • Azure CLI (available to install on Windows, macOS, and Linux)
  • Scripts provided in this module:
    • update-ingestion-event-grid-for-dbsmetadata.ps1
    • create-table-storage-and-update-keyvault.ps1
    • create-dbsmetadatahandler-function.ps1
    • deploy-dbsmetadatahandler-function.ps1
    • ingestion-function-enable-duplicate-check.ps1
  • Azure Functions Core Tools

Make sure you have all the preparation items ready before you start.


Step 1: Update Event Grid and Storage Queue

We need to change the Event Grid settings and create a new storage queue for the new Databricks metadata handler function. Add the following additional parameters to the provision-config.json file, and modify the configuration values according to your needs.

{
    "EventGrid": {
        "DBSMetadataQueueName": "databricks-output-metadata",
        "DBSMetadataQueueCount": "1",
        "DBSMetadataEventFilters": [
            {"key": "Subject", "operatorType": "StringEndsWith", "values": ["0","1","2","3","4","5","6","7","8","9",".compact"]},
            {"key": "Subject", "operatorType": "StringContains", "values": ["_spark_metadata"]}
        ],
        "EventGridTemplatePath": "../Azure/event-trigger/StorageEventTrigger.json"
    }
}

Then run update-ingestion-event-grid-for-dbsmetadata.ps1 to update the Event Grid subscription and create the storage queue.
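
If you are starting from a fresh PowerShell Core session, the invocation looks roughly like this; the subscription id is a placeholder, and it is assumed the script is run from the folder that contains provision-config.json:

# Sign in and select the target subscription before running the provisioning script.
az login
az account set --subscription "<your-subscription-id>"

# Run the script from the repository folder that contains provision-config.json (assumption).
./update-ingestion-event-grid-for-dbsmetadata.ps1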

When the script is finished, you can verify the resource creation result in Azure Portal.

You will find two new storage queues. meta-storage-queue

There is a new Event Grid filter that monitors Databricks metadata files.
meta-storage-queue
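
If you prefer the command line over the portal, the Azure CLI can also confirm the new queue and the updated Event Grid subscription; the storage account name, subscription id, and resource group below are placeholders:

# List the storage queues in the ingestion storage account (account name is a placeholder).
az storage queue list --account-name "<storage-account-name>" --auth-mode login --output table

# List the Event Grid subscriptions on that storage account to check the new metadata filter.
az eventgrid event-subscription list `
    --source-resource-id "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account-name>" `
    --output table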

Step 2: Deploy File Name Check Table storage

We will create a table storage to track each ingested file and use the records to prevent ingesting the same file twice. Please update the provision-config.json file and modify the Table Storage configuration values according to your needs.

{
    "TableStorageAccountName": "tables",
    "TableStorageSku": "Standard_RAGRS",
    "TableTemplatePath": "../Azure/datalake/StorageTable.json"
}

Then run create-table-storage-and-update-keyvault.ps1 with the required parameters.

When the script is finished, you can verify the resource creation result in Azure Portal. table-storage
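
The same check can be done from the Azure CLI; the table storage account name is a placeholder for the account created by the script:

# Confirm the table storage account exists and show its SKU.
az storage account show --name "<table-storage-account>" --query "{name:name, sku:sku.name}" --output table

# List the tables in the account; the CLI falls back to the account key if you have access to it.
az storage table list --account-name "<table-storage-account>" --output table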

Step 3: Deploy Databricks MetaData handler

In this step we will deploy the new Databricks metadata handler Azure Function. Add the following additional parameters to the provision-config.json file, and modify the configuration values according to your needs.

{
    "Functions": {
        "dbsMetadataHandlerFunction": {
            "FunctionName": "dbsmetafunc",
            "Path": "databricksmetadatahandler",
            "FunctionFolder": "metadatahandler",
            "MetadataHandlerfuncTemplatePath": "../Azure/function/FunctionApp.json",
            "MetadataHandlerfuncSettingsTemplatePath": "../Azure/function/appsettings/dbsmetafunc.json",
            "IngestionSasTokenName": "ingestiontoken"
        }
    }
}

Run create-dbsmetadatahandler-function.ps1 to create the Azure Functions resource for the Databricks metadata handler function.

Then run deploy-dbsmetadatahandler-function.ps1 to deploy the Azure Function code to the newly created Azure Functions resource.

When the script is finished, you can verify the resource creation result in Azure Portal. table-storage
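
If the deployment script fails, or you want to redeploy the function code later, Azure Functions Core Tools (listed in the preparation items) can publish the project directly. The function app name matches the FunctionName configured above; the local project folder is an assumption based on the FunctionFolder setting:

# Publish the metadata handler project to the function app created by the previous script.
cd ./metadatahandler
func azure functionapp publish dbsmetafunc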

Step 4: Update Ingestion Azure Functions

By now we have deployed a new Databricks metadata handler function, which makes sure that only files from completed Databricks data processing batches are passed to the downstream ingestion pipeline.

The next step is to enable the duplicated file check in the ingestion Azure Functions. The ingestion Azure Function deployed in Module 3 already has a feature to log each processed file name and prevent a file with the same name from being ingested again. To enable it, set "IS_DUPLICATE_CHECK" to TRUE in the function's configuration and set "STORAGE_TABLE_ACCOUNT" to the name of the table storage created in Step 2. You can modify the settings through the Azure Portal or with the ingestion-function-enable-duplicate-check.ps1 script.

funcitons-config
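
If you would rather use the Azure CLI than the portal or the script, the two settings can be applied like this; the function app and resource group names are placeholders:

# Enable the duplicate-file check on the ingestion function app (names are placeholders).
az functionapp config appsettings set `
    --name "<ingestion-function-app>" `
    --resource-group "<resource-group>" `
    --settings "IS_DUPLICATE_CHECK=TRUE" "STORAGE_TABLE_ACCOUNT=<table-storage-account>"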

Step 5: Test Data Ingestion

Following the steps listed in Module 4, you can test whether the updated data ingestion pipeline works correctly.

You can also use Azure Storage Explorer to check if the file name check table in table storage has logged all the ingested files.

table-storage-data
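
Besides Azure Storage Explorer, the table can also be queried from the Azure CLI; the table name is a placeholder, since it is assigned by the provisioning script:

# List the logged file-name records in the file name check table (names are placeholders).
az storage entity query --table-name "<file-name-check-table>" --account-name "<table-storage-account>" --output table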