The DataMasque AWS RDS Masking Blueprint provides a reusable masked data provisioning pipeline using AWS Step Functions and DataMasque APIs. These AWS CloudFormation-based templates are designed to automate the masking of production RDS databases, creating masked snapshots that are safe for use in non-production environments. This workflow can also be triggered directly from the DataMasque UI.
The diagram below illustrates the reference architecture for DataMasque in AWS. The provided CloudFormation template automates the process of masking production RDS snapshots and generating masked RDS snapshots for provisioning non-production databases. These automated steps are highlighted in blue in the diagram. Refer to the DataMasque AWS Service Catalog Template to use an AWS Service Catalog product as an end-user interface for provisioning non-production databases from masked RDS snapshots.
The CloudFormation template deploys the following AWS resources:
- An AWS Step Functions workflow.
- Five AWS Lambda functions.
- IAM roles for the Step Functions workflow and Lambda functions.
The Step Functions workflow orchestrates tasks by invoking AWS Lambda functions and DataMasque masking APIs. It irreversibly replaces sensitive data, such as PII, PCI, and PHI, with realistic, functional, and consistent masked values based on the rulesets provided for the masking run.
Refer to the DataMasque documentation for detailed information on the permissions required by the DataMasque EC2 instance to initiate this automation from the DataMasque UI.
Before triggering the workflow, you must create two secrets:
1. DataMasque Instance Secret: Contains authentication details for connecting to the DataMasque instance to execute masking APIs.
2. Database Connection Secret: Contains connection details for the database you want to mask.
## Database Connection Secret Naming Convention
The secret's name must start with `datamasque/` and end with `connections`. For example: `datamasque/usersdb/postgres_connections`.
The secret should include the following details for the database to be masked:
```json
{
  "username": "postgres",
  "password": "xxxxx",
  "engine": "postgres",
  "host": "dtq-of-datamasque-2b-02-aurora-cluster.cluster-xxxx.ap-southeast-2.rds.amazonaws.com",
  "port": "5432",
  "dbname": "postgres",
  "schema": "Prod"
}
```
- username: Database username for the staged database.
- password: Password for the provided username.
- engine: The database engine (e.g., postgres).
- host: RDS/Aurora endpoint of the source database (DataMasque connects to the staged database, not the source).
- port: Port number of the database.
- dbname: Name of the database.
- schema: Schema name for the masking job.
Optional database-specific parameters:
- service_name: Applicable for Oracle connections.
- connection_fileset: Used for MySQL and MariaDB connections.
Note: Secrets must reside in the same AWS account where the DataMasque instance is deployed.
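As one option, the database connection secret can be created with the AWS CLI in the account where DataMasque is deployed. This is a minimal sketch; the secret name and connection values below are illustrative and should be replaced with your own.

```bash
# Create the database connection secret, following the naming convention above.
# All values are placeholders for your staged database's connection details.
aws secretsmanager create-secret \
  --name datamasque/usersdb/postgres_connections \
  --secret-string '{
    "username": "postgres",
    "password": "xxxxx",
    "engine": "postgres",
    "host": "your-cluster.cluster-xxxx.ap-southeast-2.rds.amazonaws.com",
    "port": "5432",
    "dbname": "postgres",
    "schema": "Prod"
  }'
```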
Once the secrets are created, the workflow can be triggered from the DataMasque UI by providing:
- The Database Secret Identifier created above.
- The deployed Step Function.
- The Source RDS/Aurora Cluster Identifier.
- The DataMasque Ruleset for the masking job.
- Optional PreferredAZ parameter to deploy the staged database in a specific Availability Zone. By default, the staged RDS instance is created in the same AZ as the DataMasque EC2 instance to save data transfer costs.
## Workflow Execution Steps
- Selects the latest available snapshot of the source database. If none exist, it creates a new snapshot.
- Restores an RDS instance or Aurora cluster from the snapshot in the same AWS account.
- Creates a temporary DataMasque connection to the staged database.
- Executes the masking job using the specified ruleset.
- Generates a masked snapshot of the staged database.
- Removes the temporary connection and deletes the staged database.
Note: If the Step Functions execution fails, any staged database created by the automation must be deleted manually.
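If a failed run leaves a staged database behind, it can be removed with the AWS CLI. The identifiers below are placeholders; the automation appends a `datamasque` postfix to the source identifier, as described later in this document.

```bash
# Staged RDS instance (placeholder identifier based on the example input used later):
aws rds delete-db-instance \
  --db-instance-identifier demo-oracle-datamasque-03-rds-datamasque \
  --skip-final-snapshot

# Staged Aurora cluster (placeholder identifiers): delete the cluster instance(s) first, then the cluster.
aws rds delete-db-instance \
  --db-instance-identifier demo-aurora-instance-datamasque \
  --skip-final-snapshot
aws rds delete-db-cluster \
  --db-cluster-identifier demo-aurora-cluster-datamasque \
  --skip-final-snapshot
```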
Upon successful completion, a masked RDS snapshot is produced, ready for provisioning non-production databases.
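To confirm that the masked snapshot exists, you can list the manual snapshots taken from the staging instance. This sketch assumes a source instance named `source-postgres-rds`, matching the naming example later in this document.

```bash
# List manual snapshots created from the staging instance.
aws rds describe-db-snapshots \
  --snapshot-type manual \
  --db-instance-identifier source-postgres-rds-datamasque \
  --query 'DBSnapshots[*].[DBSnapshotIdentifier,Status,SnapshotCreateTime]' \
  --output table
```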
The diagram below describes the connectivity between the DataMasque instance, the AWS Lambda functions (provisioned by this template) and the staging RDS instance (created by the Step Functions workflow).
- AWS CLI: Configured with appropriate credentials for the target AWS account.
- AWS SAM CLI: https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/serverless-sam-cli-install.html
- Python 3.9 Runtime: Installed locally.
- DataMasque Instance: Ensure it is configured with a security group that allows outbound connections to RDS instances on required ports.
- Secrets: Authentication information for connecting to the DataMasque instance.
- Security Group (ID): The SG with inbound rules allowing the DataMasque instance to connect to the staged database. The automation attaches the SG to the restored staged database.
Make sure you have created a secret with the following keys and values:
{"username":"datamasque","password":"Example$P@ssword"}
Parameter | Description |
---|---|
VpcId | VPC ID where the lambdas will be deployed. |
SubnetIds | List of Subnet IDs where the lambdas will be deployed. It's recommended to provide at least two SubnetIds for redundancy and availability. |
DatamasqueBaseUrl | DataMasque instance URL with the EC2's private IP, i.e. https://<ec2-instance-private-ip>. |
DatamasqueSecretArn | Secret with DataMasque instance credentials. |
DataMasqueSecurityGroup | The Security Group ID that allows the DataMasque instance to connect to the staged RDS database. |
- Clone this repository.
- Open a terminal in the cloned repository directory.
- Run `sam build`.
- Run `sam deploy --guided`.
During the guided deployment, you will be asked if you would like to save the parameters in an AWS SAM configuration file, `samconfig.toml`.
An example of the configuration file is presented below:
```toml
version = 0.1

[dtq.deploy.parameters]
stack_name = "dm-masking-blueprint"
resolve_s3 = true
s3_prefix = "dm-masking-blueprint"
region = "ap-southeast-2"
profile = "datamasque-dtq"
confirm_changeset = true
capabilities = "CAPABILITY_IAM"
disable_rollback = true
parameter_overrides = "DataMasqueSecurityGroup=\"sg-xxxxxxxxxxxx\" VpcId=\"vpc-XXXXXXXXXXXXXXX\" DatamasqueSecretArn=\"arn:aws:secretsmanager:region:11111111111111:secret:dm/credentials-Jqufnp\" SubnetIds=\"subnet-XXXXXXXXXXX,subnet-XXXXXXXXXXXX\" DatamasqueBaseUrl=\"https://10.90.6.14/\""
image_repositories = []
```
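Once the configuration is saved, subsequent deployments can reuse it without the guided prompts. This assumes the `dtq` environment name used in the example above.

```bash
# Build and deploy using the saved "dtq" environment from samconfig.toml.
sam build
sam deploy --config-env dtq
```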
Please ensure the following network connectivity is configured after deploying the CloudFormation stack:
- The Security Group ID provided to the DataMasqueSecurityGroup parameter must allow inbound connections from the DataMasque EC2 instance on the required port(s). This Security Group is attached to the staged database by the automation.
- The DataMasque EC2 instance must allow inbound connections from the DatamasqueRun Lambda.
- Grant permissions for the Step Functions workflow and Lambda functions to use the KMS key configured on the source database to encrypt masked snapshots if you are not using the default RDS key. Note: this template assumes the source database uses the default RDS KMS key, as every organization might have a different key configuration standard.
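If the source database uses a customer-managed KMS key, one way to grant the workflow access is a KMS grant for the roles created by this stack. This is a sketch, not part of the template; the key ARN, account ID, and role name below are placeholders for your own values.

```bash
# Placeholder ARNs: replace with the source database's customer-managed key
# and the IAM role(s) created by this stack for the Step Function and Lambda functions.
aws kms create-grant \
  --key-id arn:aws:kms:ap-southeast-2:111111111111:key/your-cmk-id \
  --grantee-principal arn:aws:iam::111111111111:role/dm-masking-blueprint-StatesExecutionRole \
  --operations Decrypt DescribeKey CreateGrant GenerateDataKeyWithoutPlaintext ReEncryptFrom ReEncryptTo
```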
You can also execute the Step Function manually with an input like the following:
```json
{
  "DBInstanceIdentifier": "demo-oracle-datamasque-03-rds",
  "DBSecretIdentifier": "arn:aws:secretsmanager:ap-southeast-2:897875381524:secret:datamasque/demo/oracle_connections-TpvoiO",
  "DataMasqueRulesetId": "4f115c49-43bb-4cbc-a5b8-55a5aa9509e0",
  "PreferredAZ": "ap-southeast-2b"
}
```
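For example, the execution can be started with the AWS CLI. The state machine ARN below is a placeholder; use the ARN output by your deployed stack.

```bash
# Save the input above as input.json, then start the execution.
aws stepfunctions start-execution \
  --state-machine-arn arn:aws:states:ap-southeast-2:111111111111:stateMachine:dm-masking-blueprint \
  --input file://input.json

# Optionally poll the execution status using the ARN returned above.
aws stepfunctions describe-execution --execution-arn <execution-arn>
```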
The AWS SAM template creates a CloudWatch Events rule that schedules a Step Function execution once a week; this rule is disabled by default. To use this scheduling functionality, edit the rule to specify your target DBInstanceIdentifier along with the other required parameters, and then enable the rule.
```yaml
Events:
  Schedule:
    Type: Schedule # More info about Schedule Event Source: https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/sam-property-statemachine-schedule.html
    Properties:
      Description: Schedule to run the DATAMASQUE state machine weekly
      Enabled: False
      Schedule: "rate(7 days)"
      Input: '{
        "DBInstanceIdentifier": "demo-oracle-datamasque-03-rds",
        "DBSecretIdentifier": "arn:aws:secretsmanager:ap-southeast-2:123456789:secret:datamasque/demo/oracle_connections-TpvoiO",
        "DataMasqueRulesetId": "4f115c49-43bb-4cbc-a5b8-55a5aa9509e0",
        "PreferredAZ": "ap-southeast-2b"
        }'
```
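After updating the rule's input, it can be enabled from the console or with the AWS CLI. The name prefix below is an assumption based on the example stack name; the generated rule name depends on your stack.

```bash
# Find the schedule rule created by the stack (name prefix is an assumption).
aws events list-rules --name-prefix dm-masking-blueprint

# Inspect the target input (update it via the console or "aws events put-targets"), then enable the rule.
aws events list-targets-by-rule --rule <rule-name>
aws events enable-rule --name <rule-name>
```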
- The staging RDS instance created will have a `-datamasque` postfix appended to the DBInstanceIdentifier:
RDS database | Endpoint |
---|---|
Source RDS instance | source-postgres-rds.xxxxxxxxxx.ap-southeast-2.rds.amazonaws.com |
Staging RDS instance | source-postgres-rds-datamasque.xxxxxxxxxx.ap-southeast-2.rds.amazonaws.com |
- The RDS username, password and connection port will be the same as the source RDS instance.
- The staging RDS instance created during the execution of the Step Function will be deleted when the execution is completed.
- The masked RDS snapshot created during the execution of the Step Function will be preserved when the execution is completed.
The following table describes the states and details of the step function definition.
Step | Description |
---|---|
Describe DB Snapshots | Fetch the latest snapshot of the source RDS instance and create if none exists. |
CheckSnapshotStatus | Choice step to check the status of selected source RDS snapshot. |
WaitforSnapshot | 120 seconds wait step before checking the status of source RDS snapshot if not in available state. |
Describe DB Instances | Captures configuration of source RDS instance to be masked. |
Restore DB from Snapshot | Restores the source RDS snapshot with the same configuration as the source RDS. |
CheckDBAvailability | Checks the status of the stage RDS instance to ensure it's available after the restore. |
IsDBAvailable | Choice step to check if the restored stage database is in an available state. |
WaitBeforeRetry | Wait step of 60 seconds before retrying the CheckDBAvailability step. |
FailState | Common Fail step referenced by multiple steps if a failure is encountered during execution. |
Datamasque API run | Executes the DataMasque masking run on the staging database. |
IsMaskRunComplete | Choice step to check the status of the masking run. |
MaskingRunInProgress | Wait step of 60 seconds before checking the masking run status again. |
CheckMaskingRunStatus | Step to check the status of the masking run. |
CreateDBSnapshot | Step to create a snapshot of the masked staging database. |
CheckMaskedSnapshotStatus | Choice step to check the status of the masked snapshot. |
WaitforMaskedSnapshot | Wait step of 60 seconds before checking the status of the masked snapshot. |
DeleteStageDBChoice | Choice step to decide whether to delete the cluster or instance based on the source database type. |
DeleteStgClusterInstance | Step to delete the database instance that is part of the staged Aurora cluster. |
DeleteStgCluster | Step to delete the staged Aurora cluster. |
DeleteStgRDS | Step to delete the staged database instance if the source database is an RDS instance. |
OutputMaskedSnapshot | Step to display the ARN of the masked snapshot. |
Sharing snapshots encrypted with the default service key for RDS is not supported (see the AWS documentation on sharing a DB snapshot). To share your encrypted snapshot with another account, you will also need to share the custom master key (CMK) with that account through KMS (see the AWS documentation on changing a key policy).
The masked snapshot can be shared with the following methods:
- Use the native mechanism within the AWS Console.
- Use an existing CI/CD pipeline to copy and re-encrypt the snapshot.
- Enable support for RDS instances hosted in a different AWS account from the one where the DataMasque instance is deployed.
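For the copy-and-re-encrypt approach, a sketch with the AWS CLI is shown below. The snapshot identifiers, KMS key ARN and target account ID are placeholders.

```bash
# Copy the masked snapshot, re-encrypting it with a shareable customer-managed key.
aws rds copy-db-snapshot \
  --source-db-snapshot-identifier masked-snapshot-id \
  --target-db-snapshot-identifier masked-snapshot-id-shared \
  --kms-key-id arn:aws:kms:ap-southeast-2:111111111111:key/your-cmk-id

# Share the re-encrypted snapshot with the target account.
aws rds modify-db-snapshot-attribute \
  --db-snapshot-identifier masked-snapshot-id-shared \
  --attribute-name restore \
  --values-to-add 222222222222
```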