Scrub Sensitive Data from Event Streams - Kinesis

Mystique Unicorn App uses streams to handle events between their microservices. One of those streams receives events containing sensitive data. They have recently decided to give their customers more freedom and updated their privacy policy with an opt-out policy. This lets customers give or withdraw their consent for data sharing. Scrubbing the data of customers who opt out is commonly referred to as data anonymization.

Data anonymization has been defined as a process by which personal data is irreversibly altered in such a way that a data subject can no longer be identified directly or indirectly, either by the data controller alone or in collaboration with any other party.
Source: Wikipedia

As their AWS solutions architect, can you help them implement this privacy policy in their stream?

🎯 Solutions

AWS offers multiple capabilities to process streaming events. They are using Kinesis Data Streams to ingest the stream of customer events. The event payload will look like the one shown below. In our demo, we assume that only dob and ssn_no are sensitive fields that need to be scrubbed when data_share_consent is False.

{
  "name": "Gnoll",
  "dob": "2011-04-22",
  "gender": "M",
  "ssn_no": "807831104",
  "data_share_consent": false,
  "evnt_time": "2021-01-31T20:31:26.114917"
}

Miztiik Automation: Sensitive Data Filter Instream

We can leverage the Kinesis Data Firehose transformation capability to implement a Lambda function that scrubs sensitive data based on the feature flag data_share_consent. This flag allows us to scrub data only for customers who have chosen to opt out of data sharing. Once we have scrubbed the data, we will fill those fields with the standard text REDACTED_CONTENT and add an additional boolean field data_redacted. If we later want to quickly collate or further process these events, this flag will be helpful. A rough sketch of such a transformation function follows the sample payload below.

Miztiik Automation: Sensitive Data Filter Instream

{
  "name": "Gnoll",
  "dob": "REDACTED_CONTENT",
  "gender": "M",
  "ssn_no": "REDACTED_CONTENT",
  "data_share_consent": false,
  "evnt_time": "2021-01-31T20:31:26.114917",
  "data_redacted": true
}
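
A minimal sketch of such a transformation function is shown below. This is an illustrative approximation, not the exact code shipped in this repo; it assumes the standard Kinesis Data Firehose transformation event shape (base64-encoded records in, base64-encoded records out) and hard-codes the field names used in this demo.

import base64
import json

# Fields considered sensitive in this demo, and the standard replacement text
SENSITIVE_FIELDS = ["dob", "ssn_no"]
REDACTION_TEXT = "REDACTED_CONTENT"


def lambda_handler(event, context):
    output = []
    for record in event["records"]:
        # Firehose hands over the original Kinesis payload base64 encoded
        payload = json.loads(base64.b64decode(record["data"]))

        # Scrub only when the customer has opted out of data sharing
        if payload.get("data_share_consent") is False:
            for field in SENSITIVE_FIELDS:
                if field in payload:
                    payload[field] = REDACTION_TEXT
            payload["data_redacted"] = True

        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            # A trailing newline keeps the S3 object at one JSON object per line
            "data": base64.b64encode(
                (json.dumps(payload) + "\n").encode("utf-8")
            ).decode("utf-8"),
        })

    return {"records": output}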

Another potential field we can add is a timestamp of when the data was scrubbed. I will leave that as an additional exercise for you. After successfully processing the events, we will persist them in S3. The final AWS architecture looks something like this:

Miztiik Automation: Sensitive Data Filter Instream

In this article, we will build an architecture similar to the one shown above. We will start backwards so that all the dependencies are satisfied.

  1. 🧰 Prerequisites

    This demo, its instructions, scripts and CloudFormation template are designed to be run in us-east-1. With a few modifications you can try it out in other regions as well (not covered here).

    • 🛠 AWS CLI Installed & Configured - Get help here
    • 🛠 AWS CDK Installed & Configured - Get help here
    • 🛠 Python Packages - Change the commands below to suit your OS; the following is written for Amazon Linux 2
      • Python3 - yum install -y python3
      • Python Pip - yum install -y python-pip
      • Virtualenv - pip3 install virtualenv
  2. ⚙️ Setting up the environment

    • Get the application code

      git clone https://github.com/miztiik/sensitive-data-filter-instream
      cd sensitive-data-filter-instream
  3. 🚀 Prepare the dev environment to run AWS CDK

    We will use CDK to make our deployments easier. Let's go ahead and install the necessary components.

    # You should have npm pre-installed
    # If you DON'T have CDK installed
    npm install -g aws-cdk
    
    # Make sure you are in the repository root directory
    python3 -m venv .venv
    source .venv/bin/activate
    pip3 install -r requirements.txt

    The very first time you deploy an AWS CDK app into an environment (account/region), you’ll need to install a bootstrap stack. Otherwise, just go ahead and deploy using cdk deploy.

    cdk bootstrap
    cdk ls
    # Follow on screen prompts

    You should see an output listing the available stacks:

    sensitive-data-producer-stack
    sensitive-data-filter-stack
  4. 🚀 Deploying the application

    Let us walk through each of the stacks:

    • Stack: sensitive-data-producer-stack

      This stack will create a Kinesis data stream and the producer Lambda function. Each Lambda invocation runs for about a minute, ingesting a stream of events. The feature flag data_share_consent is randomly toggled between True and False, giving us both types of payload in our stream. (A rough sketch of the event generation appears after this stack's notes below.)

      Initiate the deployment with the following command,

      cdk deploy sensitive-data-producer-stack

      After successfully deploying the stack, check the Outputs section of the stack. You will find the streamDataProcessor producer Lambda function. We will invoke this function later during our testing phase.
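
      The rough shape of the producer's event generation is sketched below. This is only an approximation for orientation, not the exact code in the stack; the stream name and the sample values are assumptions, so check the stack outputs for the real stream name.

      import datetime
      import json
      import random

      import boto3

      kinesis = boto3.client("kinesis")
      STREAM_NAME = "sensitive-data-stream"  # assumption: use the stream name from the stack outputs


      def generate_event():
          # data_share_consent toggles randomly so the stream carries both payload types
          return {
              "name": random.choice(["Gnoll", "Kalashtar", "Vedalken", "Half-Orc"]),
              "dob": "1970-01-01",
              "gender": random.choice(["M", "F"]),
              "ssn_no": str(random.randint(100000000, 999999999)),
              "data_share_consent": random.choice([True, False]),
              "evnt_time": datetime.datetime.now().isoformat(),
          }


      def put_events(count=100):
          # Write each event to the Kinesis data stream as a JSON blob
          for _ in range(count):
              event = generate_event()
              kinesis.put_record(
                  StreamName=STREAM_NAME,
                  Data=json.dumps(event).encode("utf-8"),
                  PartitionKey=event["name"],
              )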

    • Stack: sensitive-data-filter-stack

      This stack will create the Firehose delivery stream that receives the events from the Kinesis data stream. It will also provision a Lambda function that Firehose invokes to scrub sensitive fields in flight. (A stripped-down sketch of this wiring appears after this stack's notes below.)

      Initiate the deployment with the following command,

      cdk deploy sensitive-data-filter-stack

      After successfully deploying the stack, check the Outputs section of the stack. You will find the SensitiveDataFilter Lambda function and the FirehoseDataStore bucket where the customer events will eventually be stored.
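
      For orientation only, below is a stripped-down sketch of how a transformation Lambda can be attached to a Firehose delivery stream using the CDK low-level CfnDeliveryStream construct. It is written against CDK v2 for illustration and takes the ARNs of already-created resources as parameters; the construct IDs and parameter names here are assumptions, not necessarily the ones used by this repo's stack.

      from aws_cdk import Stack
      from aws_cdk import aws_kinesisfirehose as firehose
      from constructs import Construct


      class SensitiveDataFilterSketch(Stack):
          """Illustrative only: wires a Lambda processor into a Firehose delivery stream."""

          def __init__(self, scope: Construct, construct_id: str, *,
                       stream_arn: str, bucket_arn: str, role_arn: str,
                       scrubber_fn_arn: str, **kwargs) -> None:
              super().__init__(scope, construct_id, **kwargs)

              firehose.CfnDeliveryStream(
                  self, "fhDataStore",
                  delivery_stream_type="KinesisStreamAsSource",
                  # Read customer events from the existing Kinesis data stream
                  kinesis_stream_source_configuration=firehose.CfnDeliveryStream.KinesisStreamSourceConfigurationProperty(
                      kinesis_stream_arn=stream_arn,
                      role_arn=role_arn,
                  ),
                  # Deliver to S3, invoking the scrubber Lambda on the way through
                  extended_s3_destination_configuration=firehose.CfnDeliveryStream.ExtendedS3DestinationConfigurationProperty(
                      bucket_arn=bucket_arn,
                      role_arn=role_arn,
                      processing_configuration=firehose.CfnDeliveryStream.ProcessingConfigurationProperty(
                          enabled=True,
                          processors=[
                              firehose.CfnDeliveryStream.ProcessorProperty(
                                  type="Lambda",
                                  parameters=[
                                      firehose.CfnDeliveryStream.ProcessorParameterProperty(
                                          parameter_name="LambdaArn",
                                          parameter_value=scrubber_fn_arn,
                                      )
                                  ],
                              )
                          ],
                      ),
                  ),
              )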

  5. 🔬 Testing the solution

    1. Invoke Producer Lambda: Let us start by invoking the Lambda from the producer stack sensitive-data-producer-stack using the AWS Console. If you want to ingest more events, open another browser window and invoke the Lambda again.

      {
        "statusCode": 200,
        "body": "{\"message\": {\"status\": true, \"record_count\": 1168}}"
      }

      Here in this invocation, the producer has ingested about 1168 customer events into the stream.

    2. Check FirehoseDataStore:

      After about 60 seconds, navigate to the data store S3 bucket created by the firehose stack sensitive-data-filter-stack. You will be able to find an object with a key similar to this: sensitive-data-filter-stack-fhdatastore6289deb2-1h8i5lr61plswphi-data/2021/02/01/21/phi_data_filter-1-2021-02-01-21-44-38-2ee4ff4d-5019-4eaf-a910-9b2d1ad0ed2b.

      Kinesis Data Firehose does not have a native mechanism to set the file extension, and I was not too keen on setting up another Lambda just to add a suffix. The file contents, however, should be one valid JSON object per line.

    The contents of the file should look like this,

    ...
    {"name": "Shardmind", "dob": "REDACTED_CONTENT", "gender": "F", "ssn_no": "REDACTED_CONTENT", "data_share_consent": false, "evnt_time": "2021-01-31T22:09:45.008532", "data_redacted": true}
    {"name": "Kalashtar", "dob": "1942-09-05", "gender": "M", "ssn_no": "231793521", "data_share_consent": true, "evnt_time": "2021-01-31T22:09:45.591946"}
    {"name": "Vedalken", "dob": "1954-06-18", "gender": "F", "ssn_no": "288109737", "data_share_consent": true, "evnt_time": "2021-01-31T22:09:45.631935"}
    ...
    {"name": "Half-Orc", "dob": "REDACTED_CONTENT", "gender": "M", "ssn_no": "REDACTED_CONTENT", "data_share_consent": false, "evnt_time": "2021-01-31T22:09:45.691951", "data_redacted": true}
    {"name": "Lizardfolk", "dob": "REDACTED_CONTENT", "gender": "F", "ssn_no": "REDACTED_CONTENT", "data_share_consent": false, "evnt_time": "2021-01-31T22:09:45.752012", "data_redacted": true}
    {"name": "Half-Elf", "dob": "1951-05-19", "gender": "F", "ssn_no": "533665204", "data_share_consent": true, "evnt_time": "2021-01-31T22:09:45.811942"}
    ...

    You can observe that the sensitive information for customers who have opted not to share has been scrubbed.
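
    If you would rather verify from the command line than eyeball the console, a small sketch like the one below can pull one of the Firehose objects and count the redacted records. The bucket name and object key are placeholders; substitute the values from your own account.

    import json

    import boto3

    s3 = boto3.client("s3")

    # Placeholders: use the bucket from the filter stack outputs and the key you located above
    BUCKET = "your-fhdatastore-bucket-name"
    KEY = "2021/02/01/21/your-object-key"

    body = s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read().decode("utf-8")

    # Firehose wrote one JSON object per line
    records = [json.loads(line) for line in body.splitlines() if line.strip()]
    redacted = [r for r in records if r.get("data_redacted")]
    print(f"{len(redacted)} of {len(records)} records were redacted")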

  6. 📢 Conclusion

    Here we have demonstrated how to use Kinesis Data Firehose and a Lambda function to scrub sensitive data from streaming events. You can extend this further by enriching the items before storing them in S3, or by partitioning them better for ingestion into data lake platforms.

  7. 🧹 CleanUp

    If you want to destroy all the resources created by the stack, execute the command below to delete the stack, or delete the stack from the console as well.

    • Resources created during Deploying The Application
    • Delete CloudWatch Lambda LogGroups
    • Any other custom resources, you have created for this demo
    # Delete from cdk
    cdk destroy
    
    # Follow any on-screen prompts
    
    # Delete the CloudFormation stack, if you used CloudFormation to deploy the stack
    aws cloudformation delete-stack \
      --stack-name "MiztiikAutomationStack" \
      --region "${AWS_REGION}"

    This is not an exhaustive list; please carry out any other steps that may be applicable to your setup.

📌 Who is using this

This repository aims to show new developers, Solution Architects and Ops Engineers in AWS how to scrub sensitive data from events. Based on that knowledge, these Udemy courses - course #1, course #2 - help you build complete architectures in AWS.

💡 Help/Suggestions or 🐛 Bugs

Thank you for your interest in contributing to our project. Whether it is a bug report, new feature, correction, or additional documentation or solutions, we greatly value feedback and contributions from our community. Start here

👋 Buy me a coffee

ko-fi: Buy me a coffee ☕.


🏷️ Metadata


Level: 300