In this workshop you’ll implement a data pipeline, using NiFi to ingest data and write it to Kudu tables.
-
Lab 1 - On Apache NiFi, obtain the data.
-
Lab 2 - On the NiFi cluster, prepare the data, process each record, and save the results to Kudu.
In this lab you will run a simple Python script that simulates IoT sensor data from some hypothetical machines and sends the data to an MQTT broker (mosquitto). The MQTT broker plays the role of a gateway that is connected to many different types of sensors through the MQTT protocol. Your cluster comes with an embedded MQTT broker that the simulation script publishes to. For convenience, we will use NiFi to run the script rather than shell commands.
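The contents of simulate.py are not shown in this workshop. As a rough sketch only, a simulator of this kind might build one JSON reading per run and hand it to an MQTT client. Everything below is an assumption made for illustration: the field names, the `iot/sensor-<id>` topic layout, and the `publish` callable, which stands in for a real MQTT client.

```python
import json
import random
import time
from typing import Callable


def make_reading(sensor_id: int, rng: random.Random) -> dict:
    """Build one fake sensor reading (field names are hypothetical)."""
    return {
        "sensor_id": sensor_id,
        "sensor_ts": int(time.time() * 1_000_000),  # microseconds since epoch
        "temperature": round(rng.gauss(50.0, 10.0), 2),
        "pressure": round(rng.gauss(100.0, 5.0), 2),
    }


def publish_reading(publish: Callable[[str, str], None], sensor_id: int,
                    rng: random.Random) -> str:
    """Serialize a reading and pass it to publish(topic, payload)."""
    topic = f"iot/sensor-{sensor_id}"  # assumed topic layout under iot/
    payload = json.dumps(make_reading(sensor_id, rng))
    publish(topic, payload)
    return topic


if __name__ == "__main__":
    # Stand-in publisher that prints instead of talking to a broker.
    publish_reading(lambda topic, payload: print(topic, payload), 1,
                    random.Random(42))
```

With a real broker, the `publish` argument would wrap the publish call of whichever MQTT client library the script uses.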
-
Go to Apache NiFi and add a Processor (ExecuteProcess) to the canvas.
-
Right-click the processor and select Configure (or, alternatively, just double-click the processor). On the PROPERTIES tab, set the properties shown below to run our Python simulation script.
Command: python3
Command Arguments: /opt/demo/simulate.py
-
In the SCHEDULING tab, set Run Schedule to 1 sec.
Alternatively, you could set it to another time interval, such as 30 sec or 1 min.
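Conceptually, the combination of Command, Command Arguments and Run Schedule amounts to "start this command once per interval". The sketch below models that idea with the standard library only. It is not how NiFi schedules processors internally, and the function name `run_on_schedule` is made up for this example.

```python
import subprocess
import sys
import time
from typing import List


def run_on_schedule(command: List[str], interval_sec: float, runs: int) -> List[int]:
    """Run `command` `runs` times, waiting `interval_sec` between starts."""
    exit_codes = []
    for i in range(runs):
        started = time.monotonic()
        result = subprocess.run(command, capture_output=True, text=True)
        exit_codes.append(result.returncode)
        if i < runs - 1:
            # Sleep for whatever is left of the interval.
            elapsed = time.monotonic() - started
            time.sleep(max(0.0, interval_sec - elapsed))
    return exit_codes


if __name__ == "__main__":
    # Harmless stand-in command; on the lab host the command would be
    # ["python3", "/opt/demo/simulate.py"] with an interval of 1.0 seconds.
    print(run_on_schedule([sys.executable, "-c", "print('tick')"], 0.1, 2))
```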
-
In the SETTINGS tab, check the "success" relationship in the AUTOMATICALLY TERMINATED RELATIONSHIPS section. Click Apply.
-
You can then right-click the processor and select Start to run the simulator.
-
After a few seconds, right-click the processor, select Stop, and look at the provenance. You’ll see that it has run a number of times and produced results.
Cloudera Edge Flow Management gives you a visual overview of all MiNiFi agents in your environment and allows you to update the flow configuration for each one, with version control thanks to the NiFi Registry integration. In this lab, you will create the MiNiFi flow and publish it for the MiNiFi agent to pick up.
-
Open the EFM Web UI at http://<public_dns>:10080/efm/ui/. Ensure you see your MiNiFi agent’s heartbeat messages in the Events Monitor. Click on the info icon on a heartbeat record to see the details of the heartbeat.
-
You can then select the Flow Designer tab. To build a dataflow, select the desired class (iot-1) from the table and click OPEN. Alternatively, you can double-click on the desired class.
-
Add a ConsumeMQTT Processor to the canvas by dragging the processor icon to the canvas, selecting the ConsumeMQTT processor type and clicking on the Add button. Once the processor is on the canvas, double-click it and configure it with the settings below:
Broker URI: tcp://edge2ai-1.dim.local:1883
Client ID: minifi-iot
Topic Filter: iot/#
Max Queue Size: 60
Make sure you scroll down on the properties page to set the Topic Filter and Max Queue Size.
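The Topic Filter `iot/#` uses the MQTT multi-level wildcard, so the processor receives every topic under `iot/`. If the wildcard rules are unfamiliar, the following minimal sketch shows how `#` (any number of levels, last position only) and `+` (exactly one level) are matched. The function name and the sample topics are made up for this example, and the special handling of topics starting with `$` is left out.

```python
def topic_matches(topic_filter: str, topic: str) -> bool:
    """Return True if `topic` matches an MQTT-style topic filter.

    `+` matches exactly one level; `#` matches the parent level and any
    number of child levels, and is only valid as the last filter level.
    """
    filter_levels = topic_filter.split("/")
    topic_levels = topic.split("/")
    for i, level in enumerate(filter_levels):
        if level == "#":
            # Multi-level wildcard: accept only when it is the final level.
            return i == len(filter_levels) - 1
        if i >= len(topic_levels):
            return False
        if level != "+" and level != topic_levels[i]:
            return False
    return len(filter_levels) == len(topic_levels)


if __name__ == "__main__":
    for sample in ("iot/sensor-1", "iot/plant/a/temp", "other/sensor-1"):
        print(sample, topic_matches("iot/#", sample))
```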
-
Add a Remote Process Group (RPG) to the canvas and configure it as follows:
URL: http://edge2ai-1.dim.local:8080/nifi
Transport Protocol: HTTP
-
At this point you need to connect the ConsumeMQTT processor to the RPG. For this, you first need to add an Input Port to the remote NiFi server. Open the NiFi Web UI at http://<public_dns>:8080/nifi/ and drag the Input Port to the canvas. Call it something like "from Gateway".
-
To terminate the NiFi Input Port, let’s add a Funnel to the canvas for now…
-
… and set up a connection from the Input Port to it. To set up a connection, hover the mouse over the Input Port until an arrow symbol is shown in the center. Click on the arrow, drag it and drop it on the Funnel to connect the two elements.
-
Right-click on the Input Port and start it. Alternatively, click on the Input Port to select it and then press the start ("play") button on the Operate panel:
-
You will need the ID of the Input Port to complete the connection of the ConsumeMQTT processor to the RPG (NiFi). Double-click on the Input Port and copy its ID.
-
Back in the Flow Designer, connect the ConsumeMQTT processor to the RPG. The connection requires an ID; paste the ID you copied from the Input Port here. Make sure that there are NO SPACES!
Double-click the connection to check the configuration.
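Stray whitespace around a pasted ID is easy to miss. As a small sanity check, you could run the copied value through something like the sketch below before pasting it. It assumes the Input Port ID is formatted as a standard hyphenated UUID; the helper name and the sample ID are hypothetical.

```python
import uuid


def clean_port_id(pasted: str) -> str:
    """Strip surrounding whitespace and check the ID looks like a UUID."""
    candidate = pasted.strip()
    if any(ch.isspace() for ch in candidate):
        raise ValueError("ID contains whitespace inside the value")
    # uuid.UUID raises ValueError if the text is not a well-formed UUID.
    parsed = uuid.UUID(candidate)
    if str(parsed) != candidate.lower():
        raise ValueError("ID is not in hyphenated UUID form")
    return candidate


if __name__ == "__main__":
    # Hypothetical ID with a leading space and trailing newline from copy/paste.
    print(clean_port_id(" 0a1b2c3d-0170-1000-ffff-ffffa1b2c3d4\n"))
```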
-
The flow is now complete, but before publishing it, create a bucket in the NiFi Registry so that all versions of your flows are stored for review and audit. Open the NiFi Registry at http://<public_dns>:18080/nifi-registry, click on the wrench/spanner icon in the top-right corner and create a bucket called IoT (ATTENTION: the bucket name is CASE-SENSITIVE).
-
You can now publish the flow for the MiNiFi agent to automatically pick up. Click Publish, add a descriptive comment for your changes and click Apply.
-
Go back to the NiFi Registry Web UI and click on the NiFi Registry name, next to the Cloudera logo. If the flow publishing was successful, you should see the flow’s version details in the NiFi Registry.
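If you prefer to double-check from a script rather than the Web UI, the sketch below shows one way to confirm that a bucket with the exact, case-sensitive name exists. It assumes the Registry is unsecured and exposes its bucket list as JSON at `/nifi-registry-api/buckets`; treat that path, and both function names, as assumptions to verify against your environment.

```python
import json
from typing import Iterable
from urllib.request import urlopen


def has_bucket(buckets: Iterable[dict], name: str) -> bool:
    """Exact, case-sensitive match on the bucket name."""
    return any(b.get("name") == name for b in buckets)


def fetch_buckets(registry_base_url: str) -> list:
    """Fetch the bucket list; the endpoint path is an assumption."""
    url = registry_base_url.rstrip("/") + "/nifi-registry-api/buckets"
    with urlopen(url) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Offline example with a hand-written bucket list.
    sample = [{"name": "IoT"}]
    print(has_bucket(sample, "IoT"), has_bucket(sample, "iot"))
```

On the lab host you would pass the Registry base URL (host and port 18080) to `fetch_buckets` and feed the result to `has_bucket`.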
-
At this point, you can test the edge flow up until NiFi. Start the NiFi simulator (ExecuteProcess processor) again and confirm you can see the messages queued in NiFi.
-
You can stop the simulator (Stop the NiFi processor) once you confirm that the flow is working correctly.
In this lab, you will create a NiFi flow to receive the data from all gateways and push it to Kafka.
Before we start building our flow, let’s create a Process Group to help organize the flows in the NiFi canvas and also to enable flow version control.
-
Open the NiFi Web UI, create a new Process Group and name it something like Process Sensor Data.
-
We want to be able to version control the flows we will add to the Process Group. In order to do that, we first need to connect NiFi to the NiFi Registry. On the NiFi global menu, click on "Controller Settings", navigate to the "Registry Clients" tab and add a Registry client with the following URL:
Name: NiFi Registry
URL: http://edge2ai-1.dim.local:18080
-
On the NiFi Registry Web UI, add another bucket for storing the Sensor flow we’re about to build. Call it SensorFlows.
-
Back on the NiFi Web UI, to enable version control for the Process Group, right-click on it, select Version > Start version control and enter the details below. Once you finish, an icon will appear on the Process Group, indicating that version control is now enabled for it.
Registry: NiFi Registry
Bucket: SensorFlows
Flow Name: SensorProcessGroup
-
Let’s also enable processors in this Process Group to use schemas stored in Schema Registry. Right-click on the Process Group, select Configure and navigate to the Controller Services tab. Click the + icon and add a HortonworksSchemaRegistry service. After the service is added, click on the service’s cog icon, go to the Properties tab, configure it with the following Schema Registry URL and click Apply.
URL: http://edge2ai-1.dim.local:7788/api/v1
-
Click on the lightning bolt icon to enable the HortonworksSchemaRegistry Controller Service.
-
Still on the Controller Services screen, let’s add two additional services to handle the reading and writing of JSON records. Click on the + button and add the following two services:
-
JsonTreeReader, with the following properties:
Schema Access Strategy: Use 'Schema Name' Property
Schema Registry: HortonworksSchemaRegistry
Schema Name: ${schema.name} (already set by default)
-
JsonRecordSetWriter, with the following properties:
Schema Write Strategy: HWX Schema Reference Attributes
Schema Access Strategy: Use 'Schema Name' Property
Schema Registry: HortonworksSchemaRegistry
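To make the role of a schema-driven reader more concrete, here is a rough model of the idea: a schema is looked up by name, and each JSON record is read according to the fields that schema declares. This is only an illustration. The in-memory `SCHEMAS` table, the schema name `SensorReading` and its fields are all hypothetical, whereas the real services fetch their schemas from Schema Registry.

```python
import json

# Hypothetical in-memory stand-in for Schema Registry: name -> {field: type}.
SCHEMAS = {
    "SensorReading": {"sensor_id": int, "sensor_ts": int, "temperature": float},
}


def read_record(raw: str, schema_name: str) -> dict:
    """Parse one JSON record and coerce it to the named schema's fields."""
    schema = SCHEMAS[schema_name]  # KeyError if the schema name is unknown
    data = json.loads(raw)
    record = {}
    for field, field_type in schema.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        # Coerce the value to the declared type, e.g. 20 -> 20.0.
        record[field] = field_type(data[field])
    return record


if __name__ == "__main__":
    raw = '{"sensor_id": 1, "sensor_ts": 10, "temperature": 20, "extra": "x"}'
    # In this sketch, fields not declared in the schema are left out.
    print(read_record(raw, "SensorReading"))
```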
-
Enable the JsonTreeReader and the JsonRecordSetWriter Controller Services you just created by clicking on their respective lightning bolt icons.
-
Double-click on the newly created process group to expand it.
-
Inside the process group, add a new Input Port and name it "Sensor Data".
-
We need to tell NiFi which schema should be used to read and write the Sensor data. For this we’ll use an UpdateAttribute processor to add an attribute to the FlowFile indicating the schema name.
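The reader's Schema Name property is set to `${schema.name}`, which is resolved from the FlowFile's attributes at run time. That is why adding the attribute is enough to select the schema. The sketch below models only this plain attribute substitution. NiFi Expression Language supports much more than this, and the attribute value `SensorReading` is a made-up placeholder rather than the name used in this workshop.

```python
import re

# Matches a plain ${attribute.name} reference (no functions or nesting).
_EXPR = re.compile(r"\$\{([^}]+)\}")


def resolve(value: str, attributes: dict) -> str:
    """Replace each ${name} with the matching attribute (empty if unset)."""
    return _EXPR.sub(lambda m: attributes.get(m.group(1), ""), value)


if __name__ == "__main__":
    flowfile_attributes = {"schema.name": "SensorReading"}  # hypothetical value
    print(resolve("${schema.name}", flowfile_attributes))
```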
Add an UpdateAttribute processor by dragging the processor icon to the canvas:
-
Double-click the UpdateAttribute processor and configure it as follows: