Examples showing how Spark Streaming applications can be simulated, with data persisted to Azure Blob storage, a Hive table, and an Azure SQL table, using Azure Service Bus Event Hubs as the flow-control manager.
```
spark-submit --master yarn-cluster <...SparkConfigurations...> \
  --class com.microsoft.spark.streaming.simulations.workloads.EventhubsEventCount \
  spark-streaming-data-persistence-simulations-2.0.0.jar \
  --eventhubs-namespace <eventHubsNamespace> --eventhubs-name <eventHubsName> \
  --eventSizeInChars <eventSizeInChars> --partition-count <partitionCount> \
  --batch-interval-in-seconds <batchIntervalInSeconds> \
  --checkpoint-directory <checkpointDirectory> --event-count-folder <eventCountFolder> \
  --job-timeout-in-minutes <jobTimeoutInMinutes>

spark-submit --master yarn-cluster <...SparkConfigurations...> \
  --class com.microsoft.spark.streaming.simulations.workloads.EventhubsToAzureBlobAsJSON \
  spark-streaming-data-persistence-simulations-2.0.0.jar \
  --eventhubs-namespace <eventHubsNamespace> --eventhubs-name <eventHubsName> \
  --eventSizeInChars <eventSizeInChars> --partition-count <partitionCount> \
  --batch-interval-in-seconds <batchIntervalInSeconds> \
  --checkpoint-directory <checkpointDirectory> --event-count-folder <eventCountFolder> \
  --event-store-folder <eventStoreFolder> --job-timeout-in-minutes <jobTimeoutInMinutes>

spark-submit --master yarn-cluster <...SparkConfigurations...> \
  --class com.microsoft.spark.streaming.simulations.workloads.EventhubsToHiveTable \
  spark-streaming-data-persistence-simulations-2.0.0.jar \
  --eventhubs-namespace <eventHubsNamespace> --eventhubs-name <eventHubsName> \
  --eventSizeInChars <eventSizeInChars> --partition-count <partitionCount> \
  --batch-interval-in-seconds <batchIntervalInSeconds> \
  --checkpoint-directory <checkpointDirectory> --event-count-folder <eventCountFolder> \
  --event-hive-table <eventHiveTable> --job-timeout-in-minutes <jobTimeoutInMinutes>

spark-submit --master yarn-cluster <...SparkConfigurations...> \
  --class com.microsoft.spark.streaming.simulations.workloads.EventhubsToSQLTable \
  spark-streaming-data-persistence-simulations-2.0.0.jar \
  --eventhubs-namespace <eventHubsNamespace> --eventhubs-name <eventHubsName> \
  --eventSizeInChars <eventSizeInChars> --partition-count <partitionCount> \
  --batch-interval-in-seconds <batchIntervalInSeconds> \
  --checkpoint-directory <checkpointDirectory> --event-count-folder <eventCountFolder> \
  --sql-server-fqdn <sqlServerFQDN> --sql-database-name <sqlDatabaseName> \
  --database-username <databaseUsername> --database-password <databasePassword> \
  --event-sql-table <eventSQLTable> --job-timeout-in-minutes <jobTimeoutInMinutes>
```
```
spark-submit --master yarn-cluster \
  --class com.microsoft.spark.streaming.simulations.workloads.EventhubsEventCount \
  --num-executors 24 --executor-memory 2G --executor-cores 1 --driver-memory 4G \
  spark-streaming-data-persistence-simulations-2.0.0.jar \
  --eventhubs-namespace 'sparkeventhubswestus' --eventhubs-name 'eventhubs8westus' \
  --partition-count 8 --batch-interval-in-seconds 15 \
  --checkpoint-directory 'hdfs://mycluster/EventCheckpoint-15-8-16' \
  --event-count-folder '/EventCount-15-8-16/EventCount15' \
  --job-timeout-in-minutes -1
```
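The concrete example above can be wrapped in a small launcher script so that the cluster and workload parameters live in one place. The sketch below is an illustration, not part of the repository; every variable value is an example, and the final command is `echo`ed rather than executed so the script can be dry-run on a machine without `spark-submit` installed.

```shell
#!/usr/bin/env sh
# Hypothetical launcher for the EventhubsEventCount workload.
# All values below are examples; adjust them for your cluster.
EVENTHUBS_NAMESPACE='sparkeventhubswestus'
EVENTHUBS_NAME='eventhubs8westus'
PARTITION_COUNT=8
# Rule of thumb from the notes: executors >= 2x the partition count.
NUM_EXECUTORS=$((PARTITION_COUNT * 2))
BATCH_INTERVAL_SECONDS=15
CHECKPOINT_DIR='hdfs://mycluster/EventCheckpoint-15-8-16'
EVENT_COUNT_FOLDER='/EventCount-15-8-16/EventCount15'
JOB_TIMEOUT_MINUTES=-1   # -1 keeps the job running until it is stopped manually

# Print the full command instead of running it (dry run).
build_cmd() {
  echo spark-submit --master yarn-cluster \
    --class com.microsoft.spark.streaming.simulations.workloads.EventhubsEventCount \
    --num-executors "$NUM_EXECUTORS" --executor-memory 2G \
    --executor-cores 1 --driver-memory 4G \
    spark-streaming-data-persistence-simulations-2.0.0.jar \
    --eventhubs-namespace "$EVENTHUBS_NAMESPACE" --eventhubs-name "$EVENTHUBS_NAME" \
    --partition-count "$PARTITION_COUNT" \
    --batch-interval-in-seconds "$BATCH_INTERVAL_SECONDS" \
    --checkpoint-directory "$CHECKPOINT_DIR" \
    --event-count-folder "$EVENT_COUNT_FOLDER" \
    --job-timeout-in-minutes "$JOB_TIMEOUT_MINUTES"
}

build_cmd
```

Replacing `echo spark-submit` with `spark-submit` turns the dry run into a real submission.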
- Use any string for `--eventhubs-namespace` and `--eventhubs-name` when first launching the Spark streaming application.
- Reuse the same strings when restarting the application so that it resumes from the saved checkpoint.
- Use any integer for `--partition-count`, and set `--num-executors` to at least double that number.
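The checkpoint note above can be sketched as a small restart helper. This is a hypothetical script, not part of the repository: it only reports whether a submission would start fresh or resume, based on whether the checkpoint directory already exists (the behavior Spark Streaming checkpoint recovery relies on).

```shell
#!/usr/bin/env sh
# Hypothetical restart check: report whether a submission would start
# fresh or resume from an existing checkpoint directory.
# CHECKPOINT_DIR is an example value; on a real cluster it would be an
# hdfs:// or wasb:// path, checked with hdfs dfs -test -d instead.
CHECKPOINT_DIR="${CHECKPOINT_DIR:-/tmp/EventCheckpoint-demo}"

checkpoint_state() {
  if [ -d "$1" ]; then
    echo "resume"   # checkpoint exists: reuse the exact same --eventhubs-* strings
  else
    echo "fresh"    # first launch: any strings are accepted
  fi
}

checkpoint_state "$CHECKPOINT_DIR"
```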
To build and run the examples, you need:
- Java 1.8 SDK
- Maven 3.x
- Scala 2.11
```
mvn clean
mvn package
```