ADR-11: Data analysis capability

Date:

2021-11-14

Status:

Proposed

Context:

We propose to choose Amazon data analysis services for Users Engagement Analysis Module. There are two options: Amazon Athena and Amazon Redshift.

Athena is a serverless service and does not need any infrastructure to create, manage, or scale data sets. It works directly on top of Amazon S3 data sets. It creates external tables and therefore does not manipulate S3 data sources, working as a read-only service from an S3 perspective.

On the other hand, Redshift is a petabyte-scale data warehouse used together with business intelligence tools for modern analytical solutions. Unlike Athena, Redshift requires a cluster for which we need to upload the data extracts and build tables before we can query.

	RedShift	Athena
Partitioning	Does not support direct partitioning by default Uses predefined distribution keys to optimize tables for parallel processing Poor manual partition key selection can dramatically impact query performance, so Redshift does it for you	Can partition by any key with up to 20,000 per table Supports several Serializer/Deserializer (SerDe) libraries for parsing data from different data formats: CSV, JSON, TSV, and Apache logs Does not support arrays or object identifier types
User Defined Functions	Supports UDFs with scalar and aggregate functions	Does not support UDFs
Data Formats and Types	Supports several Serializer/Deserializer (SerDe) libraries for parsing data from different data formats: CSV, JSON, TSV, and Apache logs Does not support arrays or object identifier types	Supports several Serializer/Deserializer (SerDe) libraries for parsing data from different data formats: CSV, JSON, TSV, Parquet, and ORC Beneficial due to Athena’s convenient data to query structure Supports complex data types like arrays, maps, and structs
Infrastructure	Requires a cluster to set itself up. A significant amount of time is required to prepare and set up the cluster. Once the cluster is ready to use, we need to load data into the tables. This also comes with a lag time depending on the amount of data being loaded.	Free from all such dependencies as it does not need infrastructure at all. It just creates its own external tables on top of Amazon S3 data sets.
Pricing	Based on type and number of nodes in a cluster Hourly rate for both dense compute nodes and dense storage nodes Predictable price with no penalty on excess queries, but can increase overall cost with fixed compute (SSD) and storage (HDD)	Price is per terabyte of data scanned during a query execution Minimum of 10 megabytes per query execution No charge for failed queries Most cost effective when data is compressed, partitioned, or converted to columnar format

Decision:

We propose to use AWS Glue with AWS Athena for data analysis capability. Amazon Athena is noteworthy due to its simple yet efficient quality. No initial set up is required which makes ad hoc querying easy. It’s practical for simple read and aggregated queries and is relatively cost effective. Generally, Athena works best for quickly and conveniently running queries at a low cost without needing to set up a complex infrastructure. We don’t need all the possibilities and complexity of the Redshift solution on this step.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ADR-11: Data analysis capability.md

ADR-11: Data analysis capability.md

ADR-11: Data analysis capability

Date:

Status:

Context:

Decision:

Files

ADR-11: Data analysis capability.md

Latest commit

History

ADR-11: Data analysis capability.md

File metadata and controls

ADR-11: Data analysis capability

Date:

Status:

Context:

Decision: