This is a reference application (which we will continuously improve) showing how to integrate Spark Structured Streaming, Apache Cassandra, and Apache Kafka for streaming computations. The example computes a streaming word count.
git clone https://github.com/knoldus/structured-streaming-application.git
cd structured-streaming-application
If this is your first time running sbt, it will fetch a large number of dependencies, so the first build may take a while.
cd structured-streaming-application
sbt clean compile
1. Download the latest Apache Cassandra release and extract the archive.
2. Start Cassandra (you may need to prefix the command with sudo, or chown /var/lib/cassandra first). On the command line:
./apache-cassandra-{version}/bin/cassandra -f
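To verify that Cassandra is up before moving on, you can run a quick query through the cqlsh tool shipped in the same distribution (the {version} placeholder matches the one above):

./apache-cassandra-{version}/bin/cqlsh -e "DESCRIBE KEYSPACES"

If this prints the list of keyspaces, the node is accepting connections.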
3. Download Kafka 2.11-0.10.2.1, extract it, and start ZooKeeper and the Kafka server:
cd kafka_2.11-0.10.2.1
bin/zookeeper-server-start.sh config/zookeeper.properties
bin/kafka-server-start.sh config/server.properties
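Depending on the broker's auto-create setting, you may also need to create the knolx topic yourself (the topic name comes from the TOPIC environment variable below; single-node replication and partition settings are assumed):

bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic knolx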
1. Set the environment variables, e.g.:
export BOOTSTRAP_SERVERS_CONFIG="localhost:9092"
export TOPIC="knolx"
export CASSANDRA_HOSTS="localhost"
export CASSANDRA_KEYSPACE="knolx"
export SPARK_MASTER="local"
export SPARK_APP_NAME="knolx"
export CHECKPOINT_DIR="/tmp/knolx"
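Since both the streaming application and the data feed run in separate shells (see below), it can be handy to keep these settings in a small script and source it in each shell; a sketch (the file name is illustrative):

```shell
# Save the settings to a file so every shell can share them
cat > /tmp/knolx-env.sh <<'EOF'
export BOOTSTRAP_SERVERS_CONFIG="localhost:9092"
export TOPIC="knolx"
export CASSANDRA_HOSTS="localhost"
export CASSANDRA_KEYSPACE="knolx"
export SPARK_MASTER="local"
export SPARK_APP_NAME="knolx"
export CHECKPOINT_DIR="/tmp/knolx"
EOF

# Load the settings into the current shell before running sbt
. /tmp/knolx-env.sh
echo "$TOPIC"   # prints knolx
```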
2. Start the Structured Streaming application:
cd /path/to/structured-streaming-application
sbt run
Multiple main classes detected, select one to run:
[1] knolx.kafka.DataStreamer
[2] knolx.spark.StructuredStreamingWordCount
Enter number: 2
3. Start the Kafka data feed. In a second shell, run:
cd /path/to/structured-streaming-application
sbt run
Multiple main classes detected, select one to run:
[1] knolx.kafka.DataStreamer
[2] knolx.spark.StructuredStreamingWordCount
Enter number: 1
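To check that the data feed is actually publishing to Kafka, independently of Spark, you can tail the topic with the console consumer shipped with Kafka (a localhost broker is assumed):

cd kafka_2.11-0.10.2.1
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic knolx

You should see the messages produced by DataStreamer scroll past as they arrive.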
After a few seconds, you should see data arriving in Cassandra. Enter this in the cqlsh shell:
cqlsh> select * from wordcount;
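If cqlsh reports that the table does not exist, you are probably not in the right keyspace; qualify the table name with the keyspace set in CASSANDRA_KEYSPACE above (assuming the table lives in that keyspace):

cqlsh> select * from knolx.wordcount;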
This confirms that the app has published data to Kafka and that Spark is streaming the results into Cassandra.