scala-kafka-streaming-spark-flink-akka

Data streaming to/from three major event/processing frameworks
  1. Spark Structured Streaming
    • Includes homogeneous data processed through a multimodel pipeline:
      1. four models are trained with 10,000 data records stored in MongoDB
      2. the models are Spark MLlib logistic regression, support vector classifier, gradient-boosted trees, and random forest
      3. both the training and streaming apps are launched from Scala with a SparkLauncher app (see the first sketch after this list)
    • Includes heterogeneous data, with a separate pipeline per data stream (one of the hardest problems in Spark Structured Streaming):
      1. the schema for each model/pipeline to be included is configured dynamically
      2. an arbitrary number of data streams can be handled once a model and pipeline have been trained for each stream
      3. the Spark SQL writeStream Kafka sink cannot be used because of checkpoint issues that arise after a DataFrame union or when a DataFrame is split into separate streams
      4. instead, a Kafka producer is implemented directly in the process method of a ForeachWriter supplied to writeStream.foreach (see the second sketch after this list)
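
For reference, a minimal sketch of how training and streaming apps might be launched from Scala with SparkLauncher; the jar path, main class, and master URL below are illustrative placeholders, not the repository's actual values.

```scala
import org.apache.spark.launcher.SparkLauncher

object LaunchTraining {
  def main(args: Array[String]): Unit = {
    // Launch a separate Spark application as a child process.
    val handle = new SparkLauncher()
      .setAppResource("target/scala-2.12/streaming-apps.jar") // hypothetical jar
      .setMainClass("example.TrainModels")                    // hypothetical main class
      .setMaster("local[*]")
      .setConf(SparkLauncher.DRIVER_MEMORY, "2g")
      .startApplication()

    // Poll the handle until the launched application reaches a terminal state.
    while (!handle.getState.isFinal) Thread.sleep(1000)
    println(s"Launched app finished with state: ${handle.getState}")
  }
}
```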
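
And a minimal sketch of the ForeachWriter approach, assuming String key and value columns; the topic name and broker address are placeholders.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.sql.{ForeachWriter, Row}

class KafkaForeachWriter(topic: String, brokers: String) extends ForeachWriter[Row] {
  @transient private var producer: KafkaProducer[String, String] = _

  // Called once per partition/epoch: create the producer on the executor.
  override def open(partitionId: Long, epochId: Long): Boolean = {
    val props = new Properties()
    props.put("bootstrap.servers", brokers)
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    producer = new KafkaProducer[String, String](props)
    true
  }

  // Called once per row: publish directly to Kafka, bypassing the built-in sink.
  override def process(row: Row): Unit =
    producer.send(new ProducerRecord(topic, row.getString(0), row.getString(1)))

  override def close(errorOrNull: Throwable): Unit =
    if (producer != null) producer.close()
}
```

The writer would be attached with df.writeStream.foreach(new KafkaForeachWriter("out-topic", "localhost:9092")).start(), so each row is published by the producer itself rather than through the checkpointed Kafka sink.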
  2. Apache Flink
    • four models are trained with 10,000 data records stored in MongoDB
    • logistic regression from Deeplearning4j, support vector classifier from libSVM (source), gradient-boosted trees and random forest from XGBoost
    • models are incorporated into map functions within the Flink streaming event framework
    • data events are filtered ahead of the map functions to direct each event to the model that should be used for it (see the sketch after this list)
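
A minimal sketch of this filter-then-map routing, with a stub standing in for the imported models and a hypothetical Event type carrying the model name:

```scala
import org.apache.flink.streaming.api.scala._

object FlinkRouting {
  case class Event(model: String, features: Array[Double])

  // Placeholder for inference with an imported model (e.g. Deeplearning4j).
  def scoreWithLogReg(e: Event): Double = 0.0

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val events: DataStream[Event] = env.fromElements(
      Event("logreg", Array(1.0, 2.0)),
      Event("xgboost", Array(0.5, 0.7))
    )

    // Filter before mapping so each map function only sees its own model's events.
    val logRegScores = events
      .filter(_.model == "logreg")
      .map(e => scoreWithLogReg(e))

    logRegScores.print()
    env.execute("model-routing-sketch")
  }
}
```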
  3. Akka Streams with Alpakka Kafka
    • the same models are used as in the Apache Flink case
    • models are incorporated through map functions in the same way
    • a data simulator (EpdGen) is also included for testing the trained application
    • in each case, the original stream is branched four times after some initial preprocessing
    • each branch filters on the Kafka consumer key to determine which of the four models is applied to a given data record event (see the sketch after this list)
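
A minimal sketch of the four-way branch with Alpakka Kafka and a GraphDSL Broadcast; the broker, topic, and model key names are placeholders, and the scoring step is stubbed:

```scala
import akka.NotUsed
import akka.actor.ActorSystem
import akka.kafka.{ConsumerSettings, Subscriptions}
import akka.kafka.scaladsl.Consumer
import akka.stream.ClosedShape
import akka.stream.scaladsl.{Broadcast, Flow, GraphDSL, RunnableGraph, Sink}
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer

object AkkaBranching extends App {
  implicit val system: ActorSystem = ActorSystem("branching")

  val settings = ConsumerSettings(system, new StringDeserializer, new StringDeserializer)
    .withBootstrapServers("localhost:9092")
    .withGroupId("models")

  val source = Consumer.plainSource(settings, Subscriptions.topics("events"))

  // One branch per model: filter on the consumer record key, then score (stubbed).
  def branch(model: String): Flow[ConsumerRecord[String, String], Double, NotUsed] =
    Flow[ConsumerRecord[String, String]]
      .filter(_.key == model)
      .map(_ => 0.0) // placeholder for model inference

  val graph = RunnableGraph.fromGraph(GraphDSL.create() { implicit b =>
    import GraphDSL.Implicits._
    val bcast = b.add(Broadcast[ConsumerRecord[String, String]](4))
    source ~> bcast
    Seq("logreg", "svm", "gbt", "rf").zipWithIndex.foreach { case (model, i) =>
      bcast.out(i) ~> branch(model) ~> Sink.foreach[Double](println)
    }
    ClosedShape
  })

  graph.run()
}
```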
The Akka Streams and Apache Flink frameworks enable the direct application of TensorFlow/Keras models on the JVM with Scala (or Java): the saved weights from a TensorFlow/Keras model can be imported directly into Deeplearning4j and used in streaming data applications (see the sketch below).
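
A minimal sketch of that import path, assuming a Keras Sequential model saved to HDF5; the file path and feature values are placeholders:

```scala
import org.deeplearning4j.nn.modelimport.keras.KerasModelImport
import org.nd4j.linalg.factory.Nd4j

object KerasImportExample extends App {
  // Import a Keras Sequential model (architecture + weights) saved via model.save("model.h5").
  val model = KerasModelImport.importKerasSequentialModelAndWeights("model.h5")

  // Score a single record; a stream's map function would call output() the same way.
  val features = Nd4j.create(Array(Array(0.1, 0.2, 0.3, 0.4)))
  val prediction = model.output(features)
  println(prediction)
}
```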
