Examples showing how Spark Streaming applications can be simulated, with data persisted to Azure Blob storage, a Hive table, and an Azure SQL table, using Azure Service Bus Event Hubs as the flow-control manager.
```
spark-submit --master yarn-cluster <...SparkConfigurations...> \
  --class com.microsoft.spark.streaming.simulations.workloads.EventhubsEventCount \
  spark-streaming-data-persistence-simulations-2.0.0.jar \
  --eventhubs-namespace <eventHubsNamespace> --eventhubs-name <eventHubsName> \
  --eventSizeInChars <eventSizeInChars> --partition-count <partitionCount> \
  --batch-interval-in-seconds <batchIntervalInSeconds> \
  --checkpoint-directory <checkpointDirectory> --event-count-folder <eventCountFolder> \
  --job-timeout-in-minutes <jobTimeoutInMinutes>

spark-submit --master yarn-cluster <...SparkConfigurations...> \
  --class com.microsoft.spark.streaming.simulations.workloads.EventhubsToAzureBlobAsJSON \
  spark-streaming-data-persistence-simulations-2.0.0.jar \
  --eventhubs-namespace <eventHubsNamespace> --eventhubs-name <eventHubsName> \
  --eventSizeInChars <eventSizeInChars> --partition-count <partitionCount> \
  --batch-interval-in-seconds <batchIntervalInSeconds> \
  --checkpoint-directory <checkpointDirectory> --event-count-folder <eventCountFolder> \
  --event-store-folder <eventStoreFolder> --job-timeout-in-minutes <jobTimeoutInMinutes>

spark-submit --master yarn-cluster <...SparkConfigurations...> \
  --class com.microsoft.spark.streaming.simulations.workloads.EventhubsToHiveTable \
  spark-streaming-data-persistence-simulations-2.0.0.jar \
  --eventhubs-namespace <eventHubsNamespace> --eventhubs-name <eventHubsName> \
  --eventSizeInChars <eventSizeInChars> --partition-count <partitionCount> \
  --batch-interval-in-seconds <batchIntervalInSeconds> \
  --checkpoint-directory <checkpointDirectory> --event-count-folder <eventCountFolder> \
  --event-hive-table <eventHiveTable> --job-timeout-in-minutes <jobTimeoutInMinutes>

spark-submit --master yarn-cluster <...SparkConfigurations...> \
  --class com.microsoft.spark.streaming.simulations.workloads.EventhubsToSQLTable \
  spark-streaming-data-persistence-simulations-2.0.0.jar \
  --eventhubs-namespace <eventHubsNamespace> --eventhubs-name <eventHubsName> \
  --eventSizeInChars <eventSizeInChars> --partition-count <partitionCount> \
  --batch-interval-in-seconds <batchIntervalInSeconds> \
  --checkpoint-directory <checkpointDirectory> --event-count-folder <eventCountFolder> \
  --sql-server-fqdn <sqlServerFQDN> --sql-database-name <sqlDatabaseName> \
  --database-username <databaseUsername> --database-password <databasePassword> \
  --event-sql-table <eventSQLTable> --job-timeout-in-minutes <jobTimeoutInMinutes>
```
```
spark-submit --master yarn-cluster \
  --class com.microsoft.spark.streaming.simulations.workloads.EventhubsEventCount \
  --num-executors 24 --executor-memory 2G --executor-cores 1 --driver-memory 4G \
  spark-streaming-data-persistence-simulations-2.0.0.jar \
  --eventhubs-namespace 'sparkeventhubswestus' --eventhubs-name 'eventhubs8westus' \
  --partition-count 8 --batch-interval-in-seconds 15 \
  --checkpoint-directory 'hdfs://mycluster/EventCheckpoint-15-8-16' \
  --event-count-folder '/EventCount-15-8-16/EventCount15' \
  --job-timeout-in-minutes -1
```
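The concrete example above can be wrapped in a small launcher script so that the cluster and workload parameters live in one place. The sketch below is an illustration, not part of the repository; every variable value is an example, and the final command is `echo`ed rather than executed so the script can be dry-run on a machine without `spark-submit` installed.

```shell
#!/usr/bin/env sh
# Hypothetical launcher for the EventhubsEventCount workload.
# All values below are examples; adjust them for your cluster.
EVENTHUBS_NAMESPACE='sparkeventhubswestus'
EVENTHUBS_NAME='eventhubs8westus'
PARTITION_COUNT=8
# Rule of thumb from the notes: executors >= 2x the partition count.
NUM_EXECUTORS=$((PARTITION_COUNT * 2))
BATCH_INTERVAL_SECONDS=15
CHECKPOINT_DIR='hdfs://mycluster/EventCheckpoint-15-8-16'
EVENT_COUNT_FOLDER='/EventCount-15-8-16/EventCount15'
JOB_TIMEOUT_MINUTES=-1   # -1 keeps the job running until it is stopped manually

# Print the full command instead of running it (dry run).
build_cmd() {
  echo spark-submit --master yarn-cluster \
    --class com.microsoft.spark.streaming.simulations.workloads.EventhubsEventCount \
    --num-executors "$NUM_EXECUTORS" --executor-memory 2G \
    --executor-cores 1 --driver-memory 4G \
    spark-streaming-data-persistence-simulations-2.0.0.jar \
    --eventhubs-namespace "$EVENTHUBS_NAMESPACE" --eventhubs-name "$EVENTHUBS_NAME" \
    --partition-count "$PARTITION_COUNT" \
    --batch-interval-in-seconds "$BATCH_INTERVAL_SECONDS" \
    --checkpoint-directory "$CHECKPOINT_DIR" \
    --event-count-folder "$EVENT_COUNT_FOLDER" \
    --job-timeout-in-minutes "$JOB_TIMEOUT_MINUTES"
}

build_cmd
```

Replacing `echo spark-submit` with `spark-submit` turns the dry run into a real submission.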
- Use any string for `--eventhubs-namespace` and `--eventhubs-name` when first launching the Spark streaming application.
- Reuse the same strings when restarting the application so that it resumes from the saved checkpoint.
- Use any integer for `--partition-count`, and set `--num-executors` to at least double that number.
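The checkpoint note above can be sketched as a small restart helper. This is a hypothetical script, not part of the repository: it only reports whether a submission would start fresh or resume, based on whether the checkpoint directory already exists (the behavior Spark Streaming checkpoint recovery relies on).

```shell
#!/usr/bin/env sh
# Hypothetical restart check: report whether a submission would start
# fresh or resume from an existing checkpoint directory.
# CHECKPOINT_DIR is an example value; on a real cluster it would be an
# hdfs:// or wasb:// path, checked with hdfs dfs -test -d instead.
CHECKPOINT_DIR="${CHECKPOINT_DIR:-/tmp/EventCheckpoint-demo}"

checkpoint_state() {
  if [ -d "$1" ]; then
    echo "resume"   # checkpoint exists: reuse the exact same --eventhubs-* strings
  else
    echo "fresh"    # first launch: any strings are accepted
  fi
}

checkpoint_state "$CHECKPOINT_DIR"
```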
To build and run the examples, you need:
- Java 1.8 SDK
- Maven 3.x
- Scala 2.11
```
mvn clean
mvn package
```