You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
We propose to choose Amazon data analysis services for Users Engagement Analysis Module. There are two options: Amazon Athena and Amazon Redshift.
Athena is a serverless service and does not need any infrastructure to create, manage, or scale data sets. It works directly on top of Amazon S3 data sets. It creates external tables and therefore does not manipulate S3 data sources, working as a read-only service from an S3 perspective.
On the other hand, Redshift is a petabyte-scale data warehouse used together with business intelligence tools for modern analytical solutions. Unlike Athena, Redshift requires a cluster for which we need to upload the data extracts and build tables before we can query.
RedShift
Athena
Partitioning
Does not support direct partitioning by default
Uses predefined distribution keys to optimize tables for parallel processing
Poor manual partition key selection can dramatically impact query performance, so Redshift does it for you
Can partition by any key with up to 20,000 per table
Supports several Serializer/Deserializer (SerDe) libraries for parsing data from different data formats: CSV, JSON, TSV, and Apache logs
Does not support arrays or object identifier types
User Defined Functions
Supports UDFs with scalar and aggregate functions
Does not support UDFs
Data Formats and Types
Supports several Serializer/Deserializer (SerDe) libraries for parsing data from different data formats: CSV, JSON, TSV, and Apache logs
Does not support arrays or object identifier types
Supports several Serializer/Deserializer (SerDe) libraries for parsing data from different data formats: CSV, JSON, TSV, Parquet, and ORC
Beneficial due to Athena’s convenient data to query structure
Supports complex data types like arrays, maps, and structs
Infrastructure
Requires a cluster to set itself up. A significant amount of time is required to prepare and set up the cluster. Once the cluster is ready to use, we need to load data into the tables. This also comes with a lag time depending on the amount of data being loaded.
Free from all such dependencies as it does not need infrastructure at all. It just creates its own external tables on top of Amazon S3 data sets.
Pricing
Based on type and number of nodes in a cluster
Hourly rate for both dense compute nodes and dense storage nodes
Predictable price with no penalty on excess queries, but can increase overall cost with fixed compute (SSD) and storage (HDD)
Price is per terabyte of data scanned during a query execution
Minimum of 10 megabytes per query execution
No charge for failed queries
Most cost effective when data is compressed, partitioned, or converted to columnar format
Decision:
We propose to use AWS Glue with AWS Athena for data analysis capability. Amazon Athena is noteworthy due to its simple yet efficient quality. No initial set up is required which makes ad hoc querying easy. It’s practical for simple read and aggregated queries and is relatively cost effective. Generally, Athena works best for quickly and conveniently running queries at a low cost without needing to set up a complex infrastructure. We don’t need all the possibilities and complexity of the Redshift solution on this step.