scala-kafka-streaming-spark-flink-akka

Data streaming to/from three major event/processing frameworks
  1. Spark Structured Streaming
    • Includes homogeneous data processed through a multimodel pipeline:
      1. four models are trained with 10,000 data records stored in MongoDB
      2. the models are Spark MLlib logistic regression, support vector classifier, gradient-boosted trees, and random forest
      3. both the training and streaming apps are launched from Scala with a SparkLauncher app (see the first sketch after this list)
    • Includes heterogeneous data, with a separate pipeline per data stream (one of the hardest problems in Spark Structured Streaming):
      1. the schema for each model/pipeline to be included is configured dynamically
      2. an arbitrary number of data streams can be handled once a model and pipeline have been trained for each stream
      3. the Spark SQL writeStream Kafka sink cannot be used because of checkpoint issues that arise after a DataFrame union or when a DataFrame is split into separate streams
      4. instead, a Kafka producer is implemented directly in the process method of a ForeachWriter supplied to writeStream.foreach (see the second sketch after this list)
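
For reference, a minimal sketch of how training and streaming apps might be launched from Scala with SparkLauncher; the jar path, main class, and master URL below are illustrative placeholders, not the repository's actual values.

```scala
import org.apache.spark.launcher.SparkLauncher

object LaunchTraining {
  def main(args: Array[String]): Unit = {
    // Launch a separate Spark application as a child process.
    val handle = new SparkLauncher()
      .setAppResource("target/scala-2.12/streaming-apps.jar") // hypothetical jar
      .setMainClass("example.TrainModels")                    // hypothetical main class
      .setMaster("local[*]")
      .setConf(SparkLauncher.DRIVER_MEMORY, "2g")
      .startApplication()

    // Poll the handle until the launched application reaches a terminal state.
    while (!handle.getState.isFinal) Thread.sleep(1000)
    println(s"Launched app finished with state: ${handle.getState}")
  }
}
```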
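
And a minimal sketch of the ForeachWriter approach, assuming String key and value columns; the topic name and broker address are placeholders.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.sql.{ForeachWriter, Row}

class KafkaForeachWriter(topic: String, brokers: String) extends ForeachWriter[Row] {
  @transient private var producer: KafkaProducer[String, String] = _

  // Called once per partition/epoch: create the producer on the executor.
  override def open(partitionId: Long, epochId: Long): Boolean = {
    val props = new Properties()
    props.put("bootstrap.servers", brokers)
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    producer = new KafkaProducer[String, String](props)
    true
  }

  // Called once per row: publish directly to Kafka, bypassing the built-in sink.
  override def process(row: Row): Unit =
    producer.send(new ProducerRecord(topic, row.getString(0), row.getString(1)))

  override def close(errorOrNull: Throwable): Unit =
    if (producer != null) producer.close()
}
```

The writer would be attached with df.writeStream.foreach(new KafkaForeachWriter("out-topic", "localhost:9092")).start(), so each row is published by the producer itself rather than through the checkpointed Kafka sink.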
  2. Apache Flink
    • four models are trained with 10,000 data records stored in MongoDB
    • logistic regression from Deeplearning4j, support vector classifier from libSVM (source), gradient-boosted trees and random forest from XGBoost
    • models are incorporated into map functions within the Flink streaming event framework
    • data events are filtered ahead of the map functions to direct each event to the model that should be used for it (see the sketch after this list)
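
A minimal sketch of this filter-then-map routing, with a stub standing in for the imported models and a hypothetical Event type carrying the model name:

```scala
import org.apache.flink.streaming.api.scala._

object FlinkRouting {
  case class Event(model: String, features: Array[Double])

  // Placeholder for inference with an imported model (e.g. Deeplearning4j).
  def scoreWithLogReg(e: Event): Double = 0.0

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val events: DataStream[Event] = env.fromElements(
      Event("logreg", Array(1.0, 2.0)),
      Event("xgboost", Array(0.5, 0.7))
    )

    // Filter before mapping so each map function only sees its own model's events.
    val logRegScores = events
      .filter(_.model == "logreg")
      .map(e => scoreWithLogReg(e))

    logRegScores.print()
    env.execute("model-routing-sketch")
  }
}
```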
  3. Akka Streams with Alpakka Kafka
    • the same models are used as in the Apache Flink case
    • models are incorporated through map functions in the same way
    • a data simulator (EpdGen) is also included for testing the trained application
    • in each case, the original stream is branched four times after some initial preprocessing
    • each branch filters on the Kafka consumer key to determine which of the four models is applied to a given data record event (see the sketch after this list)
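
A minimal sketch of the four-way branch with Alpakka Kafka and a GraphDSL Broadcast; the broker, topic, and model key names are placeholders, and the scoring step is stubbed:

```scala
import akka.NotUsed
import akka.actor.ActorSystem
import akka.kafka.{ConsumerSettings, Subscriptions}
import akka.kafka.scaladsl.Consumer
import akka.stream.ClosedShape
import akka.stream.scaladsl.{Broadcast, Flow, GraphDSL, RunnableGraph, Sink}
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer

object AkkaBranching extends App {
  implicit val system: ActorSystem = ActorSystem("branching")

  val settings = ConsumerSettings(system, new StringDeserializer, new StringDeserializer)
    .withBootstrapServers("localhost:9092")
    .withGroupId("models")

  val source = Consumer.plainSource(settings, Subscriptions.topics("events"))

  // One branch per model: filter on the consumer record key, then score (stubbed).
  def branch(model: String): Flow[ConsumerRecord[String, String], Double, NotUsed] =
    Flow[ConsumerRecord[String, String]]
      .filter(_.key == model)
      .map(_ => 0.0) // placeholder for model inference

  val graph = RunnableGraph.fromGraph(GraphDSL.create() { implicit b =>
    import GraphDSL.Implicits._
    val bcast = b.add(Broadcast[ConsumerRecord[String, String]](4))
    source ~> bcast
    Seq("logreg", "svm", "gbt", "rf").zipWithIndex.foreach { case (model, i) =>
      bcast.out(i) ~> branch(model) ~> Sink.foreach[Double](println)
    }
    ClosedShape
  })

  graph.run()
}
```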
The Akka Streams and Apache Flink frameworks enable the direct application of TensorFlow/Keras models on the JVM with Scala (or Java): the saved weights from a TensorFlow/Keras model can be imported directly into Deeplearning4j and used in streaming data applications (see the sketch below).
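
A minimal sketch of that import path, assuming a Keras Sequential model saved to HDF5; the file path and feature values are placeholders:

```scala
import org.deeplearning4j.nn.modelimport.keras.KerasModelImport
import org.nd4j.linalg.factory.Nd4j

object KerasImportExample extends App {
  // Import a Keras Sequential model (architecture + weights) saved via model.save("model.h5").
  val model = KerasModelImport.importKerasSequentialModelAndWeights("model.h5")

  // Score a single record; a stream's map function would call output() the same way.
  val features = Nd4j.create(Array(Array(0.1, 0.2, 0.3, 0.4)))
  val prediction = model.output(features)
  println(prediction)
}
```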
