Mystique Unicorn App uses streams to handle events between their microservices. One of those streams receives events containing sensitive data. They have recently decided to give their customers more control and updated their privacy policy with an opt-out clause, which lets customers give or withdraw their consent for data sharing. Removing personal details from events for customers who opt out is commonly referred to as data anonymization.
Data anonymization has been defined as a process by which personal data is irreversibly altered in such a way that a data subject can no longer be identified directly or indirectly, either by the data controller alone or in collaboration with any other party.
Source: Wikipedia
As their AWS solutions architect, can you help them implement this privacy policy in their stream?

AWS offers multiple capabilities to process streaming events. They are using Kinesis Data Streams to ingest the stream of customer events. The event payload will look like the one shown below. In our demo, we are going to assume that only `dob` and `ssn_no` are sensitive fields that need to be scrubbed when the consent is `false`.
{
"name": "Gnoll",
"dob": "2011-04-22",
"gender": "M",
"ssn_no": "807831104",
"data_share_consent": False,
"evnt_time": "2021-01-31T20:31:26.114917"
}
We can leverage the Kinesis Data Firehose data-transformation capability to invoke a Lambda function that scrubs sensitive data based on the feature flag `data_share_consent`. This flag lets us scrub data only for customers who have chosen to opt out of data sharing. Once the data has been scrubbed, we fill those fields with the standard text `REDACTED_CONTENT` and add an additional boolean field `data_redacted`. If we later want to quickly collate or further process these events, this flag will be helpful.
{
"name": "Gnoll",
"dob": "REDACTED_CONTENT",
"gender": "M",
"ssn_no": "REDACTED_CONTENT",
"data_share_consent": False,
"evnt_time": "2021-01-31T20:31:26.114917",
"data_redacted": True
}
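To make the transformation concrete, here is a minimal sketch of what such a Firehose transformation Lambda could look like. The handler layout, the `SENSITIVE_FIELDS` list and the `scrub_event` helper are illustrative assumptions rather than the repository's actual code; the real implementation ships with the filter stack.

```python
import base64
import json

# Assumption for this sketch: these are the fields treated as sensitive
SENSITIVE_FIELDS = ["dob", "ssn_no"]
REDACTION_TEXT = "REDACTED_CONTENT"


def scrub_event(evnt: dict) -> dict:
    """Redact sensitive fields when the customer has opted out of data sharing."""
    if not evnt.get("data_share_consent", False):
        for field in SENSITIVE_FIELDS:
            if field in evnt:
                evnt[field] = REDACTION_TEXT
        evnt["data_redacted"] = True
        # Possible extension (left as an exercise below): also record a
        # timestamp of when the scrubbing happened.
    return evnt


def lambda_handler(event, context):
    """Kinesis Data Firehose transformation handler.

    Firehose hands the Lambda a batch of base64-encoded records; each record
    is returned with its recordId, a result of 'Ok' and the re-encoded,
    transformed data so that delivery to S3 continues.
    """
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        scrubbed = scrub_event(payload)
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(
                (json.dumps(scrubbed) + "\n").encode("utf-8")
            ).decode("utf-8"),
        })
    return {"records": output}
```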
Another potential field we can add is the timestamp of when the data was scrubbed. I will leave that as an additional exercise for you. After successfully processing the events, we will persist them in S3. The final AWS architecture looks something like this,
In this article, we will build an architecture similar to the one shown above. We will start backwards so that all the dependencies are satisfied.
- This demo, its instructions, scripts and CloudFormation template are designed to be run in `us-east-1`. With a few modifications you can try it out in other regions as well (not covered here).
- AWS CLI Installed & Configured - Get help here
- AWS CDK Installed & Configured - Get help here
- Python Packages - Change the commands below to suit your OS; the following is written for Amazon Linux 2
  - Python3 - `yum install -y python3`
  - Python Pip - `yum install -y python-pip`
  - Virtualenv - `pip3 install virtualenv`
- Get the application code

  git clone https://github.com/miztiik/sensitive-data-filter-instream
  cd sensitive-data-filter-instream
- We will use `cdk` to make our deployments easier. Let's go ahead and install the necessary components.

  # You should have npm pre-installed
  # If you DON'T have cdk installed
  npm install -g aws-cdk

  # Make sure you are in the root directory
  python3 -m venv .venv
  source .venv/bin/activate
  pip3 install -r requirements.txt
The very first time you deploy an AWS CDK app into an environment (account/region), you'll need to install a bootstrap stack. Otherwise, just go ahead and deploy using `cdk deploy`.

  cdk bootstrap
  cdk ls
  # Follow on-screen prompts
You should see an output of the available stacks,
  sensitive-data-producer-stack
  sensitive-data-filter-stack
Let us walk through each of the stacks,
- Stack: sensitive-data-producer-stack
  This stack will create a Kinesis data stream and the producer Lambda function. Each Lambda invocation runs for about a minute, ingesting a stream of events. The feature flag `data_share_consent` will randomly toggle between `true` and `false`, giving us both types of payload in our stream (a rough sketch of such a producer follows this stack walkthrough).

  Initiate the deployment with the following command,
cdk deploy sensitive-data-producer-stack
  After successfully deploying the stack, check the `Outputs` section of the stack. You will find the `streamDataProcessor` producer Lambda function. We will invoke this function later during our testing phase.

- Stack: sensitive-data-filter-stack
  This stack will create the Kinesis Data Firehose delivery stream that receives events from the Kinesis data stream. It will also provision the Lambda function that scrubs the sensitive fields before the events are delivered to S3.
Initiate the deployment with the following command,
cdk deploy sensitive-data-filter-stack
  After successfully deploying the stack, check the `Outputs` section of the stack. You will find the `SensitiveDataFilter` Lambda function and the `FirehoseDataStore` where the customer events will eventually be stored.
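For context, a producer that randomly toggles the consent flag could look roughly like the sketch below. The stream name, the field values and the fixed record count are assumptions for illustration; the actual generator ships with the producer stack and runs for about a minute.

```python
import datetime
import json
import random

import boto3

kinesis = boto3.client("kinesis")

# Assumption for this sketch: the real stream name would normally be passed
# in via an environment variable set by the CDK stack.
STREAM_NAME = "sensitive-data-stream"


def make_event() -> dict:
    """Generate a fake customer event; the consent flag toggles randomly."""
    return {
        "name": random.choice(["Gnoll", "Kalashtar", "Vedalken", "Half-Elf"]),
        "dob": "2011-04-22",
        "gender": random.choice(["M", "F"]),
        "ssn_no": str(random.randint(100_000_000, 999_999_999)),
        "data_share_consent": random.choice([True, False]),
        "evnt_time": datetime.datetime.now().isoformat(),
    }


def lambda_handler(event, context):
    record_count = 0
    # A fixed count keeps the sketch short; the real producer loops on a timer.
    for _ in range(100):
        payload = make_event()
        kinesis.put_record(
            StreamName=STREAM_NAME,
            Data=json.dumps(payload),
            PartitionKey=payload["name"],
        )
        record_count += 1
    return {
        "statusCode": 200,
        "body": json.dumps({"message": {"status": True, "record_count": record_count}}),
    }
```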
- Invoke Producer Lambda: Let us start by invoking the Lambda from the producer stack `sensitive-data-producer-stack` using the AWS Console. If you want to ingest more events, use another browser window and invoke the Lambda again.

  {
    "statusCode": 200,
    "body": "{\"message\": {\"status\": true, \"record_count\": 1168}}"
  }
  In this invocation, the producer has ingested about `1168` customer events into the stream.

- Check FirehoseDataStore:
  After about `60` seconds, navigate to the data store S3 bucket created by the Firehose stack `sensitive-data-filter-stack`. You will be able to find an object key similar to this: `sensitive-data-filter-stack-fhdatastore6289deb2-1h8i5lr61plswphi-data/2021/02/01/21/phi_data_filter-1-2021-02-01-21-44-38-2ee4ff4d-5019-4eaf-a910-9b2d1ad0ed2b`. Kinesis Data Firehose does not have a native mechanism to set the file extension, and I was not too keen on setting up another Lambda just to add a suffix. The file contents, however, should be one valid `JSON` object per line.
The contents of the file should look like this,
... {"name": "Shardmind", "dob": "REDACTED_CONTENT", "gender": "F", "ssn_no": "REDACTED_CONTENT", "data_share_consent": false, "evnt_time": "2021-01-31T22:09:45.008532", "data_redacted": true} {"name": "Kalashtar", "dob": "1942-09-05", "gender": "M", "ssn_no": "231793521", "data_share_consent": true, "evnt_time": "2021-01-31T22:09:45.591946"} {"name": "Vedalken", "dob": "1954-06-18", "gender": "F", "ssn_no": "288109737", "data_share_consent": true, "evnt_time": "2021-01-31T22:09:45.631935"} ... {"name": "Half-Orc", "dob": "REDACTED_CONTENT", "gender": "M", "ssn_no": "REDACTED_CONTENT", "data_share_consent": false, "evnt_time": "2021-01-31T22:09:45.691951", "data_redacted": true} {"name": "Lizardfolk", "dob": "REDACTED_CONTENT", "gender": "F", "ssn_no": "REDACTED_CONTENT", "data_share_consent": false, "evnt_time": "2021-01-31T22:09:45.752012", "data_redacted": true} {"name": "Half-Elf", "dob": "1951-05-19", "gender": "F", "ssn_no": "533665204", "data_share_consent": true, "evnt_time": "2021-01-31T22:09:45.811942"} ...
  You can observe that the sensitive information for customers who have opted out of data sharing has been scrubbed. A quick programmatic check is sketched below.
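If you would rather verify the scrubbing programmatically than eyeball the file, a check could look like the sketch below. The bucket name and object key are placeholders; substitute the values from your own Firehose data store.

```python
import json

import boto3

s3 = boto3.client("s3")

# Placeholders: use the bucket and object key from your own Firehose data store
BUCKET = "your-firehose-datastore-bucket"
KEY = "your-object-key"

obj = s3.get_object(Bucket=BUCKET, Key=KEY)
for line in obj["Body"].read().decode("utf-8").splitlines():
    if not line.strip():
        continue
    evnt = json.loads(line)
    if not evnt.get("data_share_consent"):
        # Opted-out customers should have no sensitive values left
        assert evnt["dob"] == "REDACTED_CONTENT"
        assert evnt["ssn_no"] == "REDACTED_CONTENT"
        assert evnt.get("data_redacted") is True
print("All opted-out events are scrubbed.")
```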
Here we have demonstrated how to use Kinesis Data Firehose and a Lambda function to scrub sensitive data from streaming events. You can extend this further by enriching the events before storing them in S3, or by partitioning them better for ingestion into data lake platforms.
If you want to destroy all the resources created by the stack, execute the command below to delete the stack, or delete the stack from the console as well. Remember to clean up:

- Resources created during Deploying The Application
- CloudWatch Lambda LogGroups
- Any other custom resources you have created for this demo
  # Delete from cdk
  cdk destroy
  # Follow any on-screen prompts

  # Delete the CF Stack, if you used CloudFormation to deploy the stack
  aws cloudformation delete-stack \
      --stack-name "MiztiikAutomationStack" \
      --region "${AWS_REGION}"
This is not an exhaustive list; please carry out any other steps applicable to your needs.
This repository aims to show new developers, Solution Architects & Ops Engineers in AWS how to scrub sensitive data from streaming events. Building on that knowledge, these Udemy courses (course #1, course #2) help you build complete architectures in AWS.
Thank you for your interest in contributing to our project. Whether it is a bug report, new feature, correction, or additional documentation or solutions, we greatly value feedback and contributions from our community. Start here
Buy me a coffee.
Level: 300