# Amazon SageMaker & AWS LakeFormation Integration - Vend Temporary Credentials to Read Data from S3 into Pandas and Spark DataFrames
This example demonstrates how to read data from Amazon S3 using temporary credentials vended by LakeFormation, without the execution role having any direct access to S3. The vended credentials are used to read data from S3 into a Pandas DataFrame and a Spark DataFrame.

This solution contains a utility function that invokes LakeFormation APIs to grant permissions and vend temporary credentials for a table registered with the Glue Data Catalog. This utility function is invoked from a SageMaker Studio notebook to get the temporary credentials, which are then used to read data from S3. Note that the temporary credentials are retrieved for a separate, application-specific role that represents the application that wants to read the data. The application in this example is the SageMaker Studio notebook: it asks LakeFormation (through the credential vending utility function) for credentials so that it can read the data. It is important to note that the application itself has NO access to read the data from S3 and depends entirely upon the temporary vended credentials, thus enforcing fine-grained access control through LakeFormation.
This solution requires that you have SageMaker Studio set up in your account and have the necessary IAM permissions to set up policies for LakeFormation and S3 access as described in the next section.

## Setup

This solution requires LakeFormation, Glue, and IAM role setup. Each of these is described in the sections below.
There are two IAM roles involved in this solution: a `SageMaker Execution Role` and an `Application Role`. The `SageMaker Execution Role` is used to run the notebook and is also the one that has the required access to vend temporary credentials to the `Application Role`. The `Application Role`, as the name suggests, is a role tied to the application (in this case the SageMaker notebook); it has no access of its own to read data from S3, but it can be assigned temporary credentials by the `SageMaker Execution Role` that enable it to temporarily read data from S3.
- Create an inline policy with LakeFormation permissions as shown below and assign it to the `SageMaker Execution Role`.

  ```json
  {
    "Version": "2012-10-17",
    "Statement": {
      "Effect": "Allow",
      "Action": [
        "lakeformation:GetDataAccess",
        "lakeformation:GetDataLakeSettings",
        "lakeformation:GrantPermissions",
        "glue:GetTable",
        "glue:CreateTable",
        "glue:GetTables",
        "glue:GetDatabase",
        "glue:GetDatabases",
        "glue:CreateDatabase",
        "glue:GetUserDefinedFunction",
        "glue:GetUserDefinedFunctions",
        "glue:GetPartition",
        "glue:GetPartitions"
      ],
      "Resource": "*"
    }
  }
  ```
- Add the `AmazonSageMakerFullAccess` managed policy (this might already be attached).
- Add an inline policy to allow `sts:AssumeRole` and `sts:TagSession` on the `Application Role`, which gets fine-grained access control over the table in the Glue data catalog.

  ```json
  {
    "Version": "2012-10-17",
    "Statement": [
      {
        "Sid": "",
        "Effect": "Allow",
        "Action": [
          "sts:AssumeRole",
          "sts:TagSession"
        ],
        "Resource": [
          "arn:aws:iam::<your-account-id>:role/<application-role>"
        ]
      }
    ]
  }
  ```
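If you prefer to script this part of the setup, inline policies like the ones above can be attached with boto3. This is a minimal sketch; the role and policy names are placeholders, not taken from the sample:

```python
import json


def build_assume_role_policy(application_role_arn: str) -> dict:
    """Inline policy letting the execution role assume and tag the application role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "",
                "Effect": "Allow",
                "Action": ["sts:AssumeRole", "sts:TagSession"],
                "Resource": [application_role_arn],
            }
        ],
    }


def attach_inline_policy(role_name: str, policy_name: str, policy: dict) -> None:
    """Attach a policy document inline to an IAM role (requires AWS credentials)."""
    import boto3  # deferred import; only needed when actually calling AWS

    boto3.client("iam").put_role_policy(
        RoleName=role_name,
        PolicyName=policy_name,
        PolicyDocument=json.dumps(policy),
    )


# Example (hypothetical names):
# policy = build_assume_role_policy("arn:aws:iam::111122223333:role/my-application-role")
# attach_inline_policy("my-sagemaker-execution-role", "AllowAssumeApplicationRole", policy)
```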
Update the trust relationship of the `SageMaker Execution Role` as shown below.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "sagemaker.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    },
    {
      "Sid": "AllowAssumeRoleAndPassSessionTag",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<account-id>:root"
      },
      "Action": [
        "sts:AssumeRole",
        "sts:TagSession"
      ],
      "Condition": {
        "StringLike": {
          "aws:RequestTag/LakeFormationAuthorizedCaller": "<your-session-value-tag>"
        }
      }
    },
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "glue.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    },
    {
      "Sid": "AllowPassSessionTags",
      "Effect": "Allow",
      "Principal": {
        "AWS": "<application-role>"
      },
      "Action": "sts:TagSession",
      "Condition": {
        "StringLike": {
          "aws:RequestTag/LakeFormationAuthorizedCaller": "<your-session-value-tag>"
        }
      }
    }
  ]
}
```
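Trust policies like the one above can also be applied programmatically. A sketch that fills in the `<placeholder>` markers and overwrites the role's trust relationship; the helper names are illustrative, not part of the sample:

```python
import json


def render_policy(template: str, placeholders: dict) -> dict:
    """Replace <marker> placeholders in a policy template and parse the result."""
    for key, value in placeholders.items():
        template = template.replace(f"<{key}>", value)
    return json.loads(template)


def update_trust_policy(role_name: str, trust_policy: dict) -> None:
    """Overwrite the role's trust relationship (requires AWS credentials)."""
    import boto3  # deferred so the pure helper above works without boto3 installed

    boto3.client("iam").update_assume_role_policy(
        RoleName=role_name,
        PolicyDocument=json.dumps(trust_policy),
    )
```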
On the LakeFormation administration page, make the `SageMaker Execution Role` a data lake admin.
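Data lake admins can also be managed through the LakeFormation API instead of the console. A hedged boto3 sketch; `with_admin` is an illustrative helper, not part of the sample:

```python
def with_admin(admins: list, role_arn: str) -> list:
    """Return the admin list with the role appended, if it is not already present."""
    entry = {"DataLakePrincipalIdentifier": role_arn}
    return admins if entry in admins else admins + [entry]


def add_data_lake_admin(role_arn: str) -> None:
    """Make a role a LakeFormation data lake administrator (requires AWS credentials)."""
    import boto3

    lf = boto3.client("lakeformation")
    settings = lf.get_data_lake_settings()["DataLakeSettings"]
    settings["DataLakeAdmins"] = with_admin(settings.get("DataLakeAdmins", []), role_arn)
    lf.put_data_lake_settings(DataLakeSettings=settings)
```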
Update the trust relationship of the `Application Role` as shown below.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowAssumeRoleAndPassSessionTag",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<account-id>:root"
      },
      "Action": [
        "sts:AssumeRole",
        "sts:TagSession"
      ]
    },
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "lakeformation.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    },
    {
      "Sid": "AllowPassSessionTags",
      "Effect": "Allow",
      "Principal": {
        "AWS": "<lf-role-to-be-assumed>"
      },
      "Action": "sts:TagSession",
      "Condition": {
        "StringLike": {
          "aws:RequestTag/LakeFormationAuthorizedCaller": "<your-session-value-tag>"
        }
      }
    }
  ]
}
```
Assign the following inline policy to the `Application Role` so that it can call `lakeformation:GetDataAccess` and read table and partition metadata from Glue.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "lakeformation:GetDataAccess",
        "glue:GetTable",
        "glue:GetTables",
        "glue:GetDatabase",
        "glue:GetDatabases",
        "glue:CreateDatabase",
        "glue:GetUserDefinedFunction",
        "glue:GetUserDefinedFunctions",
        "glue:GetPartition",
        "glue:GetPartitions"
      ],
      "Resource": "*"
    }
  ]
}
```
The `sagemaker-lf-credential-vending.ipynb` notebook creates a Glue data catalog table and also places a file in S3 to hydrate the table. If you already have data in S3 that you would like to read as a Glue data catalog table, you can do so by updating the `DATABASE_NAME`, `DATABASE_S3_LOCATION`, and `TABLE_NAME` variables in the `sagemaker-lf-credential-vending.ipynb` notebook.
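For reference, registering existing CSV data in S3 as a Glue data catalog table looks roughly like this with boto3. The schema and names below are placeholders; the notebook's own `create_database`/`create_table` calls are the source of truth:

```python
def csv_table_input(table_name: str, s3_location: str, columns: list) -> dict:
    """Build a TableInput describing a comma-delimited CSV table stored in S3."""
    return {
        "Name": table_name,
        "TableType": "EXTERNAL_TABLE",
        "StorageDescriptor": {
            # Placeholder schema: every column typed as string for simplicity.
            "Columns": [{"Name": name, "Type": "string"} for name in columns],
            "Location": s3_location,
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                "Parameters": {"field.delim": ","},
            },
        },
    }


def create_catalog_table(database_name: str, table_name: str, s3_location: str, columns: list) -> None:
    """Create the database and register the table (requires AWS credentials)."""
    import boto3

    glue = boto3.client("glue")
    glue.create_database(DatabaseInput={"Name": database_name})
    glue.create_table(
        DatabaseName=database_name,
        TableInput=csv_table_input(table_name, s3_location, columns),
    )
```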
Once you have executed all the steps in the Setup section above, you are ready to run the code.
- Open the `sagemaker-lf-credential-vending.ipynb` notebook and run all cells.
  - The SageMaker notebook `sagemaker-lf-credential-vending.ipynb` runs with the `SageMaker Execution Role`. The `SageMaker Execution Role` is used to create a database and a Glue data catalog table using the `create_database` and `create_table` APIs.
  - For the purpose of this sample, we insert `dummy_data` as a `csv` file within the S3 bucket corresponding to our table in the Glue data catalog.
  - More information is given in the notebook as comments.
- The notebook sets `AllowFullTableExternalDataAccess` to `True` in `settings['DataLakeSettings']` to vend temporary credentials for the Glue table.
- The notebook uses the `get_lf_temp_credentials` function provided by the `lf_vend_credentials.py` module to get the temporary credentials to read data from S3, and then the `read_spark_lf_data` and `read_pandas_lf_data` functions from `read_data.py` to read data using these credentials into a Spark and a Pandas DataFrame respectively. Note that the code in this notebook refers to the data it needs to read via the Glue table name rather than by its path in S3, because LakeFormation and Glue hide those details from the application.
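The `AllowFullTableExternalDataAccess` update that the notebook performs can be sketched with boto3 roughly as follows; the helper names here are illustrative, not the notebook's:

```python
def with_full_table_access(settings: dict) -> dict:
    """Return a copy of the data lake settings with external full-table access enabled."""
    updated = dict(settings)
    updated["AllowFullTableExternalDataAccess"] = True
    return updated


def enable_full_table_external_access() -> None:
    """Read-modify-write the LakeFormation data lake settings (requires AWS credentials)."""
    import boto3

    lf = boto3.client("lakeformation")
    settings = lf.get_data_lake_settings()["DataLakeSettings"]
    lf.put_data_lake_settings(DataLakeSettings=with_full_table_access(settings))
```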
- Vending temporary credentials: `lf_vend_credentials.py` uses the `SageMaker Execution Role` to grant the `Application Role` fine-grained access control on a list of specific columns. After granting the role fine-grained access control on the requested columns, the `SageMaker Execution Role` performs an `assume role` on the `Application Role` and gets temporary Glue table credentials, which contain the `AccessKeyId`, `SecretAccessKey`, and `SessionToken`. See `get_lf_temp_credentials` for more details.
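The vending flow described above can be sketched end to end with boto3. This is an illustrative outline under the setup in this README, not the actual `get_lf_temp_credentials` implementation:

```python
def vend_table_credentials(
    application_role_arn: str,
    table_arn: str,
    database_name: str,
    table_name: str,
    columns: list,
    session_tag_value: str,
) -> dict:
    """Grant column-level SELECT to the application role, assume it, and fetch
    temporary table credentials from LakeFormation (requires AWS credentials)."""
    import boto3

    # 1. Grant the application role SELECT on the requested columns only.
    boto3.client("lakeformation").grant_permissions(
        Principal={"DataLakePrincipalIdentifier": application_role_arn},
        Resource={
            "TableWithColumns": {
                "DatabaseName": database_name,
                "Name": table_name,
                "ColumnNames": columns,
            }
        },
        Permissions=["SELECT"],
    )

    # 2. Assume the application role, tagging the session so the trust policy's
    #    LakeFormationAuthorizedCaller condition is satisfied.
    assumed = boto3.client("sts").assume_role(
        RoleArn=application_role_arn,
        RoleSessionName="lf-credential-vending",
        Tags=[{"Key": "LakeFormationAuthorizedCaller", "Value": session_tag_value}],
    )["Credentials"]

    # 3. As the application role, ask LakeFormation for temporary table credentials
    #    (AccessKeyId, SecretAccessKey, SessionToken).
    lf = boto3.client(
        "lakeformation",
        aws_access_key_id=assumed["AccessKeyId"],
        aws_secret_access_key=assumed["SecretAccessKey"],
        aws_session_token=assumed["SessionToken"],
    )
    return lf.get_temporary_glue_table_credentials(
        TableArn=table_arn,
        SupportedPermissionTypes=["COLUMN_PERMISSION"],
    )
```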
- Reading data from S3 into `Pandas` and `Spark`: `read_data.py` uses the temporary credentials, the S3 path, the file type, and the list of columns that the `Application Role` has fine-grained access to, and reads the data into a `Pandas` and a `Spark` DataFrame. The `pandas_read_lf_data` function subsets and returns the data in a `Pandas` DataFrame, and the `spark_read_lf_data` function returns the data in a `Spark` DataFrame.
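Using the vended credentials to read the table data looks roughly like this. This is an illustrative sketch, not the actual `read_data.py` functions; the Pandas path needs `s3fs` installed and the Spark path needs the `hadoop-aws` s3a connector on the classpath:

```python
def read_csv_with_creds_pandas(s3_path: str, creds: dict, columns: list):
    """Read selected columns of a CSV from S3 into a Pandas DataFrame (needs s3fs)."""
    import pandas as pd

    return pd.read_csv(
        s3_path,  # e.g. "s3://my-bucket/prefix/data.csv" (placeholder)
        usecols=columns,
        storage_options={
            "key": creds["AccessKeyId"],
            "secret": creds["SecretAccessKey"],
            "token": creds["SessionToken"],
        },
    )


def read_csv_with_creds_spark(spark, s3a_path: str, creds: dict, columns: list):
    """Read the same CSV into a Spark DataFrame through the s3a connector."""
    hconf = spark._jsc.hadoopConfiguration()
    hconf.set(
        "fs.s3a.aws.credentials.provider",
        "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider",
    )
    hconf.set("fs.s3a.access.key", creds["AccessKeyId"])
    hconf.set("fs.s3a.secret.key", creds["SecretAccessKey"])
    hconf.set("fs.s3a.session.token", creds["SessionToken"])
    return spark.read.csv(s3a_path, header=True).select(*columns)
```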