Lambda Architecture Using Azure PaaS Services
This process creates an ACS cluster that runs several containers, each writing events to Event Hubs. From there the flow forks into:
Batch Layer – Event Hubs writes to a Storage Account through the Event Hubs Archive feature; an hourly Spark job then runs over the Avro files produced during that hour and consolidates them into Parquet files (see the sketch after this list). These files can be used for offline reporting, data exploration, Hive tables, etc.
Speed Layer – Stream Analytics reads from Event Hubs and writes the data to CosmosDB (using the native API), from which data can be read and used for online processing.
Serving Layer – This layer is not included yet.
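The hourly consolidation in the batch layer is an ordinary Spark batch job. As one way to picture it, here is a minimal sketch that submits such a job through the HDInsight Livy REST endpoint; the consolidate.py name and the storage paths are illustrative assumptions, not the script's actual layout.

# Minimal sketch (hypothetical names/paths): submit the hourly consolidation job through the
# HDInsight Livy endpoint. consolidate.py is assumed to read the previous hour's Avro files
# and rewrite them as Parquet; adjust the paths to match your storage layout.
$hdiClusterName = "lambdahdi"                                # assumed cluster name
$hdiCreds = Get-Credential -Message "HDInsight cluster (Ambari) login"
$hour = (Get-Date).ToUniversalTime().AddHours(-1).ToString("yyyy/MM/dd/HH")

$jobBody = @{
    file = "wasbs:///scripts/consolidate.py"                 # assumed job location
    args = @("wasbs:///eventhubs-archive/$hour", "wasbs:///parquet/$hour")
} | ConvertTo-Json

Invoke-RestMethod -Uri "https://$hdiClusterName.azurehdinsight.net/livy/batches" `
    -Method Post -Credential $hdiCreds -Body $jobBody -ContentType "application/json"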
The script automates all aspects of the deployment and creates a functional data pipeline in less than an hour (most of that time goes to provisioning the HDI and ACS clusters).
The (JSON) data schema written to Event Hubs can be changed without touching any of the existing components; the pipeline was designed to be schema agnostic.
$geoLocation="West Europe" # Azure region used for the deployment
Please run PowerShell as Administrator and invoke the deployment script:
PS > <directory>\invokeLambda.ps1
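If invokeLambda.ps1 does not establish an Azure session itself, sign in and select the target subscription before running it. A minimal sketch, assuming the AzureRM module is installed; the subscription name is a placeholder:

# Sign in and pick the subscription the pipeline should be deployed into.
Set-ExecutionPolicy RemoteSigned -Scope Process       # allow the local script to run
Login-AzureRmAccount                                  # interactive Azure sign-in
Select-AzureRmSubscription -SubscriptionName "<your subscription>"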
https://docs.microsoft.com/en-us/azure/container-service/container-service-intro
$dcosAgentCount = 1 # Number of agents
$dcosAgentVMSize = "Standard_D3_v2"
$dcosMasterVMSize = "Standard_D3_v2"
$dcosLinuxAdminUsername = "azureuser"
$dcosMasterCount = 1 # Use 3 for production
$dcosSshRSAPublicKey = "ssh-rsa AAAAB3NzaC1yc2EAAAABJQAAAQEAlwUbj59tAoinx6BqJXID4Ej2Xa5m3tsI3jQpVDOiyniR6hvIS+quuTayc2cyB6w3vyLXdFBwWvdPOuxxNoGpzA+N0k9uBym216oa4uLbxiCmuo6rbTiseYBjS/7Y/NCwLsAPbqyRdbyGVgp7gmRusVS3gEXt8mRGEszSAOYYKXq8vsOvzoq0BgpOypLQojKmkw7+YXleMwYJ8ac9EM6R8w3sECJpPR7dyOQJn6ZA+eHvMft87lo/Q0xu1yS1UB4RDoNwF3E3e4ej+37pAacRr+IHHPrFW8UKV9lmpruDEf/4k8njmatE8Mhwk31v/OGCri2gDAMVE+hQlm1cFjum1Q== rsa-key-20170430" # BE SURE TO CHANGE THIS IF YOU ARE GOING TO USE ANY KIND OF PRODUCTION DATA ON THIS CLUSTER
# Please make sure that the network settings do not collide with other deployments in your subscription
$dcosAgentprivateSubnet="10.0.0.0/16"
$dcosAgentpublicSubnet="10.1.0.0/16"
$dcosFirstConsecutiveStaticIP="172.16.0.5"
$dcosMasterSubnet="172.16.0.0/24"
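The SSH public key above should be replaced with your own (generated with ssh-keygen or PuTTYgen, for example). Roughly, the DC/OS parameters map onto an ACS ARM template deployment; the sketch below only shows the shape of that call, with an assumed resource group name, an assumed local template file, and illustrative template parameter names.

# Minimal sketch (assumed names): deploy the ACS DC/OS cluster from an ARM template.
$rgName = "lambda-rg"                                 # assumed resource group name
New-AzureRmResourceGroup -Name $rgName -Location $geoLocation -Force

$acsParams = @{
    agentCount         = $dcosAgentCount
    agentVMSize        = $dcosAgentVMSize
    masterVMSize       = $dcosMasterVMSize
    masterCount        = $dcosMasterCount
    linuxAdminUsername = $dcosLinuxAdminUsername
    sshRSAPublicKey    = $dcosSshRSAPublicKey
}
New-AzureRmResourceGroupDeployment -ResourceGroupName $rgName `
    -TemplateFile ".\acs-dcos.json" -TemplateParameterObject $acsParams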
https://docs.microsoft.com/en-us/azure/event-hubs/event-hubs-what-is-event-hubs
The Event Hubs namespace is currently created with 1 Throughput Unit.
$ehArchiveTime = 300 # Archive interval in seconds
$ehArchiveSize = 314572800 # Archive size limit in bytes (300 MB)
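A minimal sketch of how the namespace and hub might be created, reusing the $rgName placeholder from the ACS sketch; the names and partition count are assumptions. The Archive interval and size limit above, together with their blob destination, are wired up by the full script, and the exact Archive parameters differ between AzureRM.EventHub versions, so the script's own call is authoritative.

# Minimal sketch (assumed names): a Standard namespace with 1 Throughput Unit and one hub.
New-AzureRmEventHubNamespace -ResourceGroupName $rgName -NamespaceName "lambdaehns" `
    -Location $geoLocation -SkuName "Standard" -SkuCapacity 1
New-AzureRmEventHub -ResourceGroupName $rgName -NamespaceName "lambdaehns" `
    -EventHubName "lambdaeh" -PartitionCount 4 -MessageRetentionInDays 1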
https://docs.microsoft.com/en-us/azure/storage/storage-introduction#blob-storage
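The Archive destination is a regular blob storage account; a minimal sketch, with an assumed (globally unique) account name:

# Minimal sketch (assumed account name): the storage account that receives the archived Avro files.
New-AzureRmStorageAccount -ResourceGroupName $rgName -Name "lambdastore01" `
    -Location $geoLocation -SkuName "Standard_LRS" -Kind "Storage"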
https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-spark-overview
$hdiSparkVersion = "2.0" # Spark Version
$hdiClusterLoginUserName = "azureuser" # Ambari login user name
$hdiClusterLoginPassword = "Ab12345678!1" # Cluster password for all accounts; change this before any real deployment
$hdiSshUserName = "azureuserssh" # SSH User
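A minimal sketch of the cluster creation, with an assumed cluster name, node count, and default storage container, pointing at the storage account sketched above; provisioning this cluster is one of the longest steps of the deployment.

# Minimal sketch (assumed names): a Linux-based Spark 2.0 HDInsight cluster.
$hdiPassword = ConvertTo-SecureString $hdiClusterLoginPassword -AsPlainText -Force
$hdiHttpCred = New-Object System.Management.Automation.PSCredential -ArgumentList $hdiClusterLoginUserName, $hdiPassword
$hdiSshCred  = New-Object System.Management.Automation.PSCredential -ArgumentList $hdiSshUserName, $hdiPassword

New-AzureRmHDInsightCluster -ResourceGroupName $rgName -ClusterName "lambdahdi" `
    -Location $geoLocation -ClusterType Spark -OSType Linux -ClusterSizeInNodes 2 `
    -Version "3.5" -ComponentVersion @{ "Spark" = $hdiSparkVersion } `
    -HttpCredential $hdiHttpCred -SshCredential $hdiSshCred `
    -DefaultStorageAccountName "lambdastore01.blob.core.windows.net" `
    -DefaultStorageAccountKey "<storage account key>" -DefaultStorageContainer "lambdahdi"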
https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-introduction
$saNumberOfStreamingUnits = 12 # Streaming Units allocated to the Stream Analytics job
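Conceptually, the speed layer is a pass-through Stream Analytics query from the Event Hubs input to the DocumentDB output, sized by $saNumberOfStreamingUnits. A minimal sketch, with assumed job, input, and output names; the job, input, and output definition JSON files (and their New-AzureRmStreamAnalyticsInput/Output calls) are not shown here.

# Minimal sketch (assumed names): create the job, attach a pass-through transformation, start it.
$transformation = @{
    name       = "passthrough"
    properties = @{
        streamingUnits = $saNumberOfStreamingUnits
        query          = "SELECT * INTO cosmosdbOutput FROM eventhubInput"
    }
} | ConvertTo-Json -Depth 5
Set-Content -Path ".\transformation.json" -Value $transformation

New-AzureRmStreamAnalyticsJob -ResourceGroupName $rgName -Name "lambdaSA" -File ".\job.json"
New-AzureRmStreamAnalyticsTransformation -ResourceGroupName $rgName -JobName "lambdaSA" `
    -Name "passthrough" -File ".\transformation.json" -Force
Start-AzureRmStreamAnalyticsJob -ResourceGroupName $rgName -Name "lambdaSA"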
CosmosDB is Azure's globally distributed NoSQL database service.
Since its native PowerShell cmdlets don't support all the operations required by this automation script, I used the following project as a baseline for what is used here: https://github.com/savjani/Azure-DocumentDB-Powershell-Cmdlets
The collection is created with the SQL API (the native DocumentDB API). The partitionKey is hardcoded as "id".
$docdbConsistencyLevel = "BoundedStaleness" # Strong | BoundedStaleness | Session | ConsistentPrefix | Eventual
$docdbMaxStalenessPrefix= 100 # When used with Bounded Staleness consistency, this value represents the number of stale requests tolerated. Accepted range for this value is 1 – 2,147,483,647.
$docdbMaxIntervalInSeconds = 5 # When used with Bounded Staleness consistency, this value represents the time amount of staleness (in seconds) tolerated. Accepted range for this value is 1 - 100.
$docdbDBName="DB1" # Database name
$docdbCollName="coll1" # Collection name
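The database account itself can be created as a generic ARM resource, with the consistency settings above mapped onto its consistencyPolicy; database and collection creation then go through the community cmdlets referenced above (or the DocumentDB REST API). A minimal sketch, with an assumed account name:

# Minimal sketch (assumed account name): create the DocumentDB/CosmosDB account via ARM.
$docdbProperties = @{
    databaseAccountOfferType = "Standard"
    locations = @(@{ locationName = $geoLocation; failoverPriority = 0 })
    consistencyPolicy = @{
        defaultConsistencyLevel = $docdbConsistencyLevel
        maxStalenessPrefix      = $docdbMaxStalenessPrefix
        maxIntervalInSeconds    = $docdbMaxIntervalInSeconds
    }
}
New-AzureRmResource -ResourceType "Microsoft.DocumentDb/databaseAccounts" `
    -ResourceGroupName $rgName -ResourceName "lambdadocdb" -Location $geoLocation `
    -ApiVersion "2015-04-08" -PropertyObject $docdbProperties -Force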
Planned improvements:
- Automate Event Hubs Throughput Units scaling.
- Streamline script outputs
- Add error handling
- Add "Deployment Size" parameter and scale services accordingly.
- Set CosmosDB partitionKey and RU/s as parameters.