Objective: In this exercise, we will merge two files and store the result in a HANA database. To do this we will introduce the Structured Data operators as well as the Data Workflow operators. To keep the complexity manageable, the exercise splits the task into two parts:
- Read the performance.csv file from an object store and, after a projection and an aggregation of the data, store it in a HANA table.
- Add the configuration.csv file and, after a projection, join it with the "transform" branch of the first part.
If you already have some experience with the Structured Data operators, you might prefer to try it on your own instead of going through the detailed description of the exercise. Here are the main tasks you have to do:
- Create a new graph.
- Use the Structured File Consumer operator to read in the file performance.csv, which was generated in the previous exercise.
- Add a Data Transform operator to your graph.
- Add the Projection operator, name the target columns, and convert the data types ("C0": "DATE" - date, "C1": "CELLID" - int32, "C2": "KEY1" - float32, "C3": "KEY2" - float32).
- Add an Aggregation operator and automap the input to the target. Click on the "edit" icon of the target fields "KEY1" and "KEY2" and choose "Aggregation Type": AVG. This reduces the table to a single record for each day and device ("CELLID"), which you can later join with the configuration.csv file, where there is only one setting per day.
- Add a second Projection operator with additional target columns "NOM_KEY1"/float32 and "NOM_KEY2"/float32 and initialize them with "0". Add a filter on "CELLID" and "DATE" to finally get only one record for validation. This step is only necessary for an intermediate test.
- Add an "output" operator.
- On the top-most pipeline level, add a Table Producer with your target HANA table (<USER>_CELLSTATUS) and mode overwrite.
- Add a Workflow Terminator operator.

Save, run and check the result.
After completing these steps you will have created your first data transformation with a projection, an aggregation, and a save to a target table.
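For orientation only, the sketch below shows roughly what this part-1 transform amounts to in plain pandas. The Modeler operators do all of this for you; the file path, the assumption that the file has no header row, and the filter values are taken from the steps in this exercise.

```python
import pandas as pd

# Structured File Consumer: read the raw file. Header handling is an
# assumption -- adjust if your generated file already contains column names.
perf = pd.read_csv("performance.csv", header=None,
                   names=["C0", "C1", "C2", "C3"])

# Projection: rename the columns and convert the data types.
perf = perf.rename(columns={"C0": "DATE", "C1": "CELLID",
                            "C2": "KEY1", "C3": "KEY2"})
perf = perf.astype({"CELLID": "int32", "KEY1": "float32", "KEY2": "float32"})

# Aggregation: one record per day and device, averaging KEY1 and KEY2 (AVG).
agg = (perf.groupby(["DATE", "CELLID"], as_index=False)
           .agg(KEY1=("KEY1", "mean"), KEY2=("KEY2", "mean")))

# Second Projection: add the NOM_* columns initialized with 0 and apply the
# optional test filter.
agg["NOM_KEY1"] = 0.0
agg["NOM_KEY2"] = 0.0
test = agg[(agg["CELLID"] == 1234512) & (agg["DATE"] == "2020-11-01")]
print(test)
```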
- Create a new graph using the + icon in the upper left corner of the Modeler.
  Note: Unlike the List Files operator that you used in the previous exercise, many other operators, such as the Structured Data operators, require an input in order to begin the pipeline execution. For this reason we must include the Workflow Trigger operator in our graph.
- Add the Workflow Trigger and Structured File Consumer operators to the canvas and connect them. Right-click the Structured File Consumer operator and open the configuration menu to parameterize it.
- (Optional) Rename the Label to Performance. This makes the pipeline more readable.
- Set Storage type (drop-down menu): sdl
- In sdl Connection, click on the pencil icon and set:
  - Configuration Type: Configuration Manager
  - Connection ID: DI_DATA_LAKE
  - Save
- Select the source file by clicking on the monitor icon: /shared/DAT263/input/performance.csv
- You may click on the Data Preview button to inspect the data you are working with.
- Set Fail on string truncation: True
  Info: This operator reads a small sample of the dataset to infer the length of a string data type. When this parameter is set to True, the reader will exit and fail if a string truncation has happened, i.e. when the length of a string in the metadata is smaller than the actual data in the source file (see the sketch below).
- Set Fetch size: 1000 records (default)
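The sampling behavior described in the Info note can be illustrated conceptually as follows. This is not the operator's implementation, just a minimal sketch of the idea, assuming the file has no header row and using the first (date) column as the string column to check.

```python
import pandas as pd

# Conceptual illustration of "Fail on string truncation": infer the maximum
# string length from a small sample, then fail if the full dataset contains
# longer values. Sample size and column choice are illustrative only.
sample = pd.read_csv("performance.csv", header=None, nrows=100, dtype=str)
inferred_len = sample[0].str.len().max()  # column 0 holds the date string

full = pd.read_csv("performance.csv", header=None, dtype=str)
if full[0].str.len().max() > inferred_len:
    raise ValueError("String truncation: data is longer than the inferred length")
```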
- Click on Save (mid-top of the pipeline canvas).
- In the pop-up window 'Save' enter:
  - Name: <USER>.MergeCellStatus
  - Description: Merge to CellStatus
  - Category: <USER>
- Click "OK".

The Data Transform operator that you add next does the core part of the whole pipeline.
- Add a Data Transform operator to the pipeline.
- Draw a line from the outport of the Structured File Consumer onto the Data Transform operator. This creates a new input port on the Data Transform operator.
- Double-click the Data Transform operator. This opens a new modeling perspective.
- From the operator list on the left, add the Projection operator, link the outport of the "input1" operator to the Projection operator, and open the configuration of the operator by double-clicking the gray triangle.
- Click on the automapping button and change the names and the data types at the target according to the table below:

  | Source | Target | Data Type |
  |--------|--------|-----------|
  | C0     | DATE   | date      |
  | C1     | CELLID | int32     |
  | C2     | KEY1   | float32   |
  | C3     | KEY2   | float32   |
- Return to the modeling view by clicking on the < icon in the top-left corner of the pipeline canvas.
- We can now perform an aggregation, similar to a SQL GROUP BY, so that we get one distinct record for each day and device (= "CELLID"). Add an Aggregation operator and connect it to the Projection operator.
- Open the configuration screen by double-clicking on the Aggregation operator or by clicking on the grey triangle at its bottom right corner. Automap the input to the target, then click on the "edit" icon of the target fields "KEY1" and "KEY2" and choose "Aggregation Type": AVG.
- Return to the Data Transform modeling view, add a second Projection operator, and connect it to the Aggregation operator.
  - Double-click the operator to configure it.
  - Once again, automap the source to the target.
  - Add two additional target columns by clicking on the + icon, with the following properties: NOM_KEY1 (float32, default value 0) and NOM_KEY2 (float32, default value 0).
  - (Optional) This step is not needed for the final productive pipeline, but during development it is helpful for testing intermediate steps.
    - Switch to the Filter tab.
    - Add a condition, similar to a SQL WHERE condition: "CELLID" = 1234512 AND "DATE" = '2020-11-01'
- Return to the Data Transform modeling perspective, right-click on the outport of the last Projection operator, and select Create Data target.
- By clicking on auto-layout you can rearrange your pipeline.
- Open the Configuration menu of the output operator. Under the General section you can verify the outgoing columns of the Transform operator.
- Return to the main pipeline modeling perspective by using the < button in the top-left corner of the Data Transform canvas.
- Add a Table Producer operator and connect its input to the output of the Data Transform operator.
- Configure the Table Producer with the following parameters:
  - Rename Label: HANA CELLSTATUS (or your own preferred descriptive term)
  - Database type: HANA
  - HANA Connection:
    - Configuration Type: Configuration Manager
    - Connection ID: HANAC_WS
  - HANA Target Table: TECHED.<USER>_CELLSTATUS
    Do not select an existing table by clicking on the "screen" icon; enter the name of the table literally. This creates a non-existent table using the data and data types provided by the input data.
  - Mode: overwrite - creates a new table or overwrites an existing table with the new table structure.
  - Batch size: 1000 (default)
- Add and connect a Workflow Terminator operator, and you are done!

Now you can save and run the pipeline and check the result with the Metadata Explorer. If you applied the optional filter in the second Projection, you should see only one record.
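To make the overwrite mode more tangible: outside the pipeline, a rough equivalent of what the Table Producer does could look like the following sketch with the SAP hdbcli Python driver. Host, port, credentials, and the sample row are placeholders; inside the pipeline, the HANAC_WS connection and the incoming data provide all of this.

```python
from hdbcli import dbapi  # SAP HANA Python client

# Placeholders only -- in the pipeline these come from the HANAC_WS connection.
conn = dbapi.connect(address="<hana-host>", port=30015,
                     user="<user>", password="<password>")
cur = conn.cursor()

# "overwrite" mode: drop and recreate the target table with a structure
# derived from the input data, then insert the rows.
try:
    cur.execute('DROP TABLE "TECHED"."<USER>_CELLSTATUS"')
except dbapi.Error:
    pass  # table did not exist yet
cur.execute('''CREATE TABLE "TECHED"."<USER>_CELLSTATUS" (
    "DATE" DATE, "CELLID" INTEGER,
    "KEY1" REAL, "KEY2" REAL,
    "NOM_KEY1" REAL, "NOM_KEY2" REAL)''')

rows = [("2020-11-01", 1234512, 1.0, 2.0, 0.0, 0.0)]  # illustrative row only
cur.executemany(
    'INSERT INTO "TECHED"."<USER>_CELLSTATUS" VALUES (?, ?, ?, ?, ?, ?)',
    rows)
conn.commit()
conn.close()
```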
The second part of the exercise is to add a join with the configuration file in order to compare the daily settings with the average of the performance values sent during the day. The more complicated part has already been accomplished and you have learnt the general concept of the Structured Data operators, so the following steps should be much easier.
Similar to what you already did in Exercise 1.1:
- Add the Workflow Trigger and Structured File Consumer operators.
- Configure the Structured File Consumer operator:
  - Rename the Label to Configuration
  - Set Storage type to sdl
  - Set SDL Connection to DI_DATA_LAKE
  - Choose the SDL Source file to be /shared/DAT263/input/configuration.csv
- Connect the output of the Structured File Consumer operator to the existing Data Transform operator. This will automatically create a new input port.

Double-click the Data Transform operator to open it. You will then see a second input operator, input2, which provides the data read from the configuration file.
- Add a new Projection operator and connect it to the new input2 operator.
- Configure the new Projection operator:
  - Automap the source columns to the target columns.
  - Change the names and the data types of the target columns as follows:

    | Source | Target   | Data Type |
    |--------|----------|-----------|
    | C0     | DATE     | date      |
    | C1     | CELLID   | int32     |
    | C2     | NOM_KEY1 | float32   |
    | C3     | NOM_KEY2 | float32   |
- Add a Join operator and connect the new Projection2 operator to the top inport of the Join operator.
- Remove the connection from the Aggregation operator's output to the Projection1 operator by clicking on the connection to mark it and then right-clicking to remove it.
- Connect the outport of the Aggregation operator with the bottom inport of the Join operator.
- Configure the Join operator:
  - From the Definition view, join both input DATE columns by dragging one join column onto the other.
  - In the opened Join Definition section you should now see the join "Join_Input1"."DATE" = "Join_Input2"."DATE". If it does not display, simply double-click the link between the two inputs. Create a composite join by also joining the CELLID columns from both inputs: "Join_Input1"."DATE" = "Join_Input2"."DATE" AND "Join_Input1"."CELLID" = "Join_Input2"."CELLID"
  - Switch to the Columns view (link in the top-left corner).
  - Automap the source to the target.
  - Remove one DATE and one CELLID target column by marking the target column and clicking on the wastebasket icon.
- Connect the Join outport with the inport of the Projection1 operator that is sending its output to the "output" operator.
- Open the Projection configuration and map the source columns NOM_KEY1 and NOM_KEY2 to the corresponding target columns.
- Return to the main pipeline layer, save the graph and start it.
Again, you should see only one record if you chose to apply the filter, but this time it should include all columns with populated values.
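Conceptually, part 2 adds the projection of the configuration file and the composite join to the earlier part-1 sketch. The following pandas sketch is again for orientation only; paths and header handling are assumptions, and the type conversions are omitted for brevity.

```python
import pandas as pd

# Aggregated performance data, as in the part-1 sketch (final column names
# applied directly while reading, for brevity).
perf = pd.read_csv("performance.csv", header=None,
                   names=["DATE", "CELLID", "KEY1", "KEY2"])
agg = (perf.groupby(["DATE", "CELLID"], as_index=False)
           .agg(KEY1=("KEY1", "mean"), KEY2=("KEY2", "mean")))

# Projection2: the configuration columns renamed as in the table above.
conf = pd.read_csv("configuration.csv", header=None,
                   names=["DATE", "CELLID", "NOM_KEY1", "NOM_KEY2"])

# Join: composite join, i.e.
# "Join_Input1"."DATE" = "Join_Input2"."DATE"
#   AND "Join_Input1"."CELLID" = "Join_Input2"."CELLID"
merged = agg.merge(conf, on=["DATE", "CELLID"], how="inner")
print(merged.head())
```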
Attention! If you see duplicate rows, consider that your input may already have contained duplicate rows. This can occur if you re-run the pipeline from Exercise 1, where data is appended to the source file. In that case, make sure to delete any existing target file before re-running the pipeline from Exercise 1.
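If you want to confirm that the duplicates really originate from the source file, a quick check in pandas could look like this (path and header handling are assumptions):

```python
import pandas as pd

# Count exact duplicate rows in a source file.
perf = pd.read_csv("performance.csv", header=None,
                   names=["DATE", "CELLID", "KEY1", "KEY2"])
print(perf.duplicated().sum(), "duplicate rows in performance.csv")
```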
You have now created a rather complex data transformation from two different data sources, with joins, aggregation and filtering, and stored the result in a different type of storage.
Solution Example: Exercise 1 - Example
Continue to Exercise 2: Validating Data Quality