Lambda Architecture Using Azure PaaS Services
This process creates an ACS cluster that runs several containers, each writing events to Event Hubs. From there the flow forks into:
Batch Layer – Event Hubs writes to a Storage Account through the Event Hubs Archive feature; an hourly Spark job then runs over the Avro files produced during that hour and consolidates them into Parquet files (see the sketch after this list). These files can be used for offline reporting, data exploration, Hive tables, etc.
Speed Layer – Stream Analytics reads from Event Hubs and writes the data to CosmosDB (using the native API), from which data can be read and used for online processing.
Serving Layer – This layer is not included yet.
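The hourly consolidation in the batch layer is an ordinary Spark batch job. As one way to picture it, here is a minimal sketch that submits such a job through the HDInsight Livy REST endpoint; the consolidate.py name and the storage paths are illustrative assumptions, not the script's actual layout.

# Minimal sketch (hypothetical names/paths): submit the hourly consolidation job through the
# HDInsight Livy endpoint. consolidate.py is assumed to read the previous hour's Avro files
# and rewrite them as Parquet; adjust the paths to match your storage layout.
$hdiClusterName = "lambdahdi"                                # assumed cluster name
$hdiCreds = Get-Credential -Message "HDInsight cluster (Ambari) login"
$hour = (Get-Date).ToUniversalTime().AddHours(-1).ToString("yyyy/MM/dd/HH")

$jobBody = @{
    file = "wasbs:///scripts/consolidate.py"                 # assumed job location
    args = @("wasbs:///eventhubs-archive/$hour", "wasbs:///parquet/$hour")
} | ConvertTo-Json

Invoke-RestMethod -Uri "https://$hdiClusterName.azurehdinsight.net/livy/batches" `
    -Method Post -Credential $hdiCreds -Body $jobBody -ContentType "application/json"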
The script automates all aspects of the deployment and creates a functional data pipeline in less than an hour (most of that time goes to provisioning the HDI and ACS clusters).
The (JSON) data schema written to Event Hubs can be changed without touching any of the existing components; the pipeline was designed to be schema agnostic.
$geoLocation="West Europe" # Azure region used for the deployment
Please run PowerShell as Administrator and invoke the deployment script:
PS > <directory>\invokeLambda.ps1
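If invokeLambda.ps1 does not establish an Azure session itself, sign in and select the target subscription before running it. A minimal sketch, assuming the AzureRM module is installed; the subscription name is a placeholder:

# Sign in and pick the subscription the pipeline should be deployed into.
Set-ExecutionPolicy RemoteSigned -Scope Process       # allow the local script to run
Login-AzureRmAccount                                  # interactive Azure sign-in
Select-AzureRmSubscription -SubscriptionName "<your subscription>"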
https://docs.microsoft.com/en-us/azure/container-service/container-service-intro
$dcosAgentCount = 1 # Number of agents
$dcosAgentVMSize = "Standard_D3_v2"
$dcosMasterVMSize = "Standard_D3_v2"
$dcosLinuxAdminUsername = "azureuser"
$dcosMasterCount = 1 # Use 3 for production
$dcosSshRSAPublicKey = "ssh-rsa AAAAB3NzaC1yc2EAAAABJQAAAQEAlwUbj59tAoinx6BqJXID4Ej2Xa5m3tsI3jQpVDOiyniR6hvIS+quuTayc2cyB6w3vyLXdFBwWvdPOuxxNoGpzA+N0k9uBym216oa4uLbxiCmuo6rbTiseYBjS/7Y/NCwLsAPbqyRdbyGVgp7gmRusVS3gEXt8mRGEszSAOYYKXq8vsOvzoq0BgpOypLQojKmkw7+YXleMwYJ8ac9EM6R8w3sECJpPR7dyOQJn6ZA+eHvMft87lo/Q0xu1yS1UB4RDoNwF3E3e4ej+37pAacRr+IHHPrFW8UKV9lmpruDEf/4k8njmatE8Mhwk31v/OGCri2gDAMVE+hQlm1cFjum1Q== rsa-key-20170430" # BE SURE TO CHANGE THIS IF YOU ARE GOING TO USE ANY KIND OF PRODUCTION DATA ON THIS CLUSTER
# Please make sure that the network settings do not collide with other deployments in your subscription
$dcosAgentprivateSubnet="10.0.0.0/16"
$dcosAgentpublicSubnet="10.1.0.0/16"
$dcosFirstConsecutiveStaticIP="172.16.0.5"
$dcosMasterSubnet="172.16.0.0/24"
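The SSH public key above should be replaced with your own (generated with ssh-keygen or PuTTYgen, for example). Roughly, the DC/OS parameters map onto an ACS ARM template deployment; the sketch below only shows the shape of that call, with an assumed resource group name, an assumed local template file, and illustrative template parameter names.

# Minimal sketch (assumed names): deploy the ACS DC/OS cluster from an ARM template.
$rgName = "lambda-rg"                                 # assumed resource group name
New-AzureRmResourceGroup -Name $rgName -Location $geoLocation -Force

$acsParams = @{
    agentCount         = $dcosAgentCount
    agentVMSize        = $dcosAgentVMSize
    masterVMSize       = $dcosMasterVMSize
    masterCount        = $dcosMasterCount
    linuxAdminUsername = $dcosLinuxAdminUsername
    sshRSAPublicKey    = $dcosSshRSAPublicKey
}
New-AzureRmResourceGroupDeployment -ResourceGroupName $rgName `
    -TemplateFile ".\acs-dcos.json" -TemplateParameterObject $acsParams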
https://docs.microsoft.com/en-us/azure/event-hubs/event-hubs-what-is-event-hubs
The Event Hubs namespace is currently created with 1 Throughput Unit.
$ehArchiveTime = 300 # Archive interval in seconds
$ehArchiveSize = 314572800 # Archive size limit in bytes (300 MB)
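A minimal sketch of how the namespace and hub might be created, reusing the $rgName placeholder from the ACS sketch; the names and partition count are assumptions. The Archive interval and size limit above, together with their blob destination, are wired up by the full script, and the exact Archive parameters differ between AzureRM.EventHub versions, so the script's own call is authoritative.

# Minimal sketch (assumed names): a Standard namespace with 1 Throughput Unit and one hub.
New-AzureRmEventHubNamespace -ResourceGroupName $rgName -NamespaceName "lambdaehns" `
    -Location $geoLocation -SkuName "Standard" -SkuCapacity 1
New-AzureRmEventHub -ResourceGroupName $rgName -NamespaceName "lambdaehns" `
    -EventHubName "lambdaeh" -PartitionCount 4 -MessageRetentionInDays 1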
https://docs.microsoft.com/en-us/azure/storage/storage-introduction#blob-storage
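The Archive destination is a regular blob storage account; a minimal sketch, with an assumed (globally unique) account name:

# Minimal sketch (assumed account name): the storage account that receives the archived Avro files.
New-AzureRmStorageAccount -ResourceGroupName $rgName -Name "lambdastore01" `
    -Location $geoLocation -SkuName "Standard_LRS" -Kind "Storage"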
https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-spark-overview
$hdiSparkVersion = "2.0" # Spark Version
$hdiClusterLoginUserName = "azureuser" # Ambari login user name
$hdiClusterLoginPassword = "Ab12345678!1" # Cluster password for all accounts; change this before any real deployment
$hdiSshUserName = "azureuserssh" # SSH User
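A minimal sketch of the cluster creation, with an assumed cluster name, node count, and default storage container, pointing at the storage account sketched above; provisioning this cluster is one of the longest steps of the deployment.

# Minimal sketch (assumed names): a Linux-based Spark 2.0 HDInsight cluster.
$hdiPassword = ConvertTo-SecureString $hdiClusterLoginPassword -AsPlainText -Force
$hdiHttpCred = New-Object System.Management.Automation.PSCredential -ArgumentList $hdiClusterLoginUserName, $hdiPassword
$hdiSshCred  = New-Object System.Management.Automation.PSCredential -ArgumentList $hdiSshUserName, $hdiPassword

New-AzureRmHDInsightCluster -ResourceGroupName $rgName -ClusterName "lambdahdi" `
    -Location $geoLocation -ClusterType Spark -OSType Linux -ClusterSizeInNodes 2 `
    -Version "3.5" -ComponentVersion @{ "Spark" = $hdiSparkVersion } `
    -HttpCredential $hdiHttpCred -SshCredential $hdiSshCred `
    -DefaultStorageAccountName "lambdastore01.blob.core.windows.net" `
    -DefaultStorageAccountKey "<storage account key>" -DefaultStorageContainer "lambdahdi"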
https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-introduction
$saNumberOfStreamingUnits = 12 # Streaming Units allocated to the Stream Analytics job
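Conceptually, the speed layer is a pass-through Stream Analytics query from the Event Hubs input to the DocumentDB output, sized by $saNumberOfStreamingUnits. A minimal sketch, with assumed job, input, and output names; the job, input, and output definition JSON files (and their New-AzureRmStreamAnalyticsInput/Output calls) are not shown here.

# Minimal sketch (assumed names): create the job, attach a pass-through transformation, start it.
$transformation = @{
    name       = "passthrough"
    properties = @{
        streamingUnits = $saNumberOfStreamingUnits
        query          = "SELECT * INTO cosmosdbOutput FROM eventhubInput"
    }
} | ConvertTo-Json -Depth 5
Set-Content -Path ".\transformation.json" -Value $transformation

New-AzureRmStreamAnalyticsJob -ResourceGroupName $rgName -Name "lambdaSA" -File ".\job.json"
New-AzureRmStreamAnalyticsTransformation -ResourceGroupName $rgName -JobName "lambdaSA" `
    -Name "passthrough" -File ".\transformation.json" -Force
Start-AzureRmStreamAnalyticsJob -ResourceGroupName $rgName -Name "lambdaSA"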
CosmosDB is Azure's globally distributed NoSQL database service.
Since its native PowerShell cmdlets don't support all the operations required by this automation script, I used the following project as a baseline for what is used here: https://github.com/savjani/Azure-DocumentDB-Powershell-Cmdlets
The collection is created with the SQL API (the native DocumentDB API). The partitionKey is hardcoded as "id".
$docdbConsistencyLevel = "BoundedStaleness" # Strong | BoundedStaleness | Session | ConsistentPrefix | Eventual
$docdbMaxStalenessPrefix= 100 # When used with Bounded Staleness consistency, this value represents the number of stale requests tolerated. Accepted range for this value is 1 – 2,147,483,647.
$docdbMaxIntervalInSeconds = 5 # When used with Bounded Staleness consistency, this value represents the time amount of staleness (in seconds) tolerated. Accepted range for this value is 1 - 100.
$docdbDBName="DB1" # Database name
$docdbCollName="coll1" # Collection name
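The database account itself can be created as a generic ARM resource, with the consistency settings above mapped onto its consistencyPolicy; database and collection creation then go through the community cmdlets referenced above (or the DocumentDB REST API). A minimal sketch, with an assumed account name:

# Minimal sketch (assumed account name): create the DocumentDB/CosmosDB account via ARM.
$docdbProperties = @{
    databaseAccountOfferType = "Standard"
    locations = @(@{ locationName = $geoLocation; failoverPriority = 0 })
    consistencyPolicy = @{
        defaultConsistencyLevel = $docdbConsistencyLevel
        maxStalenessPrefix      = $docdbMaxStalenessPrefix
        maxIntervalInSeconds    = $docdbMaxIntervalInSeconds
    }
}
New-AzureRmResource -ResourceType "Microsoft.DocumentDb/databaseAccounts" `
    -ResourceGroupName $rgName -ResourceName "lambdadocdb" -Location $geoLocation `
    -ApiVersion "2015-04-08" -PropertyObject $docdbProperties -Force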
Planned improvements:
- Automate Event Hubs Throughput Units scaling.
- Streamline script outputs
- Add error handling
- Add "Deployment Size" parameter and scale services accordingly.
- Set CosmosDB partitionKey and RU/s as parameters.