-
- Includes homogeneous data processed through multiple models:
- four models are trained on 10,000 data records stored in MongoDB
- Spark MLlib logistic regression, support vector classifier, gradient boosted trees, and random forest
- Both the training and streaming apps are launched from Scala with a SparkLauncher app
- Includes heterogeneous data processed through multiple pipelines:
- heterogeneous data with a separate pipeline for each data stream is one of the most difficult problems in Spark Structured Streaming
- the schema is configured dynamically for each model-pipeline to be included
- an arbitrary number of data streams can be handled once a model and pipeline have been trained for each data stream
- cannot use Spark SQL writeStream due to checkpoint issues that arise after a DataFrame union or when separating DataFrames into separate streams
- instead, a Kafka producer is implemented directly in foreach after writeStream, in the process method
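
The last point above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: it assumes the per-stream prediction DataFrame has string columns named `key` and `value`, and the names `KafkaSinkSettings`, `KafkaForeachWriter`, and `KafkaForeachSink`, along with the broker address and topic passed in, are hypothetical.

```scala
import java.util.Properties

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.sql.{DataFrame, ForeachWriter, Row}
import org.apache.spark.sql.streaming.StreamingQuery

// Hypothetical helper: builds producer settings for a given broker list.
object KafkaSinkSettings {
  def producerProps(bootstrapServers: String): Properties = {
    val props = new Properties()
    props.put("bootstrap.servers", bootstrapServers)
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props
  }
}

// Sends each row to Kafka from inside process(), using a plain Kafka producer.
class KafkaForeachWriter(bootstrapServers: String, topic: String)
    extends ForeachWriter[Row] {

  // Created in open() on the executor, because KafkaProducer is not serializable.
  @transient private var producer: KafkaProducer[String, String] = _

  override def open(partitionId: Long, epochId: Long): Boolean = {
    producer = new KafkaProducer[String, String](
      KafkaSinkSettings.producerProps(bootstrapServers))
    true // process every partition and epoch
  }

  override def process(row: Row): Unit = {
    // Assumption: the DataFrame carries string columns "key" and "value".
    val key = row.getAs[String]("key")
    val value = row.getAs[String]("value")
    producer.send(new ProducerRecord[String, String](topic, key, value))
  }

  override def close(errorOrNull: Throwable): Unit =
    if (producer != null) producer.close()
}

object KafkaForeachSink {
  // Attaches the writer to one per-stream prediction DataFrame.
  def start(predictions: DataFrame, bootstrapServers: String, topic: String): StreamingQuery =
    predictions.writeStream
      .foreach(new KafkaForeachWriter(bootstrapServers, topic))
      .start()
}
```

Under these assumptions, each data stream would get its own `KafkaForeachSink.start(...)` call, so no DataFrame union is needed before writing.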
- Includes homogeneous data processed through multiple models:
- four models are trained on 10,000 data records stored in MongoDB
- logistic regression from deeplearning4j, support vector classifier from libSVM, gradient boosted trees and random forest from XGBoost
- Models are incorporated into map functions within the Flink streaming event framework
- Data events are filtered before the map functions to direct each event to the model that should score it
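
The filter-then-map pattern described above might look like the following sketch with the Flink Scala DataStream API. The `Event` and `Prediction` types, the key names `lr` and `svc`, and the `Scorers` functions are hypothetical placeholders; they stand in for the trained deeplearning4j, libSVM, and XGBoost models rather than calling those libraries.

```scala
import org.apache.flink.streaming.api.scala._

// Hypothetical event type: modelKey says which model should score the record.
case class Event(modelKey: String, features: Array[Double])
case class Prediction(modelKey: String, score: Double)

// Placeholder scorers standing in for the trained models (not the real libraries).
object Scorers {
  def logistic(x: Array[Double]): Double = 1.0 / (1.0 + math.exp(-x.sum))
  def sign(x: Array[Double]): Double = if (x.sum >= 0) 1.0 else -1.0
}

object FilterThenMapJob {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // A tiny in-memory source in place of the Kafka consumer.
    val events: DataStream[Event] = env.fromElements(
      Event("lr", Array(0.5, -0.5)),
      Event("svc", Array(-2.0, 1.0))
    )

    // Filter first, so each map function only sees events meant for its model.
    val lr = events
      .filter(_.modelKey == "lr")
      .map(e => Prediction(e.modelKey, Scorers.logistic(e.features)))

    val svc = events
      .filter(_.modelKey == "svc")
      .map(e => Prediction(e.modelKey, Scorers.sign(e.features)))

    lr.union(svc).print()
    env.execute("filter-then-map sketch")
  }
}
```

Two of the four branches are shown; the remaining two would follow the same shape with their own key and scorer.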
-
- The same models are used as in the Apache Flink case
- models are incorporated through map functions in the same way
- a data simulator (EpdGen) is also included in this file for testing the trained application
- In each case, the original stream is branched four times after some initial preprocessing.
- Each branch filters on the Kafka consumer key to determine which of the four models is used for a given data record event
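
The branching described in the last two points is independent of the framework, so it can be illustrated with plain Scala collections standing in for a stream. Everything named here is an assumption for illustration: the `Record` shape, the four key names, the preprocessing step, and the placeholder functions used in place of the trained models.

```scala
// Hypothetical record shape: key is the Kafka consumer key, value the payload.
final case class Record(key: String, value: Double)

object BranchByKey {
  // Hypothetical key names, one per model; the functions are placeholders
  // for the trained models and only tag the value with the model name.
  val models: Map[String, Double => String] = Map(
    "lr"  -> ((v: Double) => s"lr:$v"),
    "svc" -> ((v: Double) => s"svc:$v"),
    "gbt" -> ((v: Double) => s"gbt:$v"),
    "rf"  -> ((v: Double) => s"rf:$v")
  )

  // Initial preprocessing shared by all branches (here: drop non-finite values).
  def preprocess(in: Seq[Record]): Seq[Record] =
    in.filter(r => !r.value.isNaN && !r.value.isInfinite)

  // One branch per model: filter on the key, then map through that model.
  def branches(in: Seq[Record]): Map[String, Seq[String]] = {
    val cleaned = preprocess(in)
    models.map { case (key, model) =>
      key -> cleaned.filter(_.key == key).map(r => model(r.value))
    }
  }
}
```

In a streaming framework, each `filter(...).map(...)` pair would be applied to the same preprocessed stream, giving four independent branches.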
msb1/scala-kafka-streaming-spark-flink-akka
About
Data streaming to/from three major event/processing frameworks