[Integration-RFC] Flint Integration Planning And Adaptation Tutorial #144
Labels
documentation
Improvements or additions to documentation
integration
integration related content
schema
schema related issue
Flint Integration Planning And Adaptation Tutorial
The next tutorial is intended to support a general procedure template for integration developers to be able to adapt for transforming their OpenSearch index based integration into a S3 flint based integration.
Ingestion Tools
The initial step for an integration to become Flint (S3) compatible is to understand the ingestion policy. Ingestion can be achieved using many tools including:
AWS Data Firehose:
This is a fully managed service for delivering real-time streaming data to destinations such as Amazon S3. It's suitable for both proprietary API logs (by sending data to Firehose directly through its API) and standard logs. For AWS ELB logs or Apache logs, you'd typically stream these to Firehose using a log agent or AWS Lambda.
Amazon Kinesis Agent:
A pre-built Java application that offers an easy way to collect and send data to AWS services like Amazon Kinesis Streams, Kinesis Data Firehose, or CloudWatch Logs. You can configure it to monitor log files for various services, including Apache HTTP Server, and forward them to S3.
FluentBit / Data-Prepper
Both are open-source data collectors for unifying data collection and consumption for better use and understanding of data. They can be used to collect data from different sources, transform it, and send it to various destinations, including S3. They are highly flexible and can work with proprietary log formats as well as common ones like Apache logs.
AWS Lambda:
For more custom ingestion logic, such as preprocessing or filtering before storing in S3, AWS Lambda can be used. Lambda can be triggered by various AWS services (like S3 uploads, CloudWatch Logs, API Gateway for proprietary APIs) to process and ingest logs into S3.
CloudWatch Logs:
While primarily a monitoring and observability service, CloudWatch Logs can be configured to route your logs to S3 using a subscription filter and AWS Lambda or directly to S3 for AWS service logs like ELB.
Custom API Solutions: For proprietary APIs, a custom application that ingests logs via API and then uploads them to S3 might be necessary. This can be hosted on EC2 or run as a containerized application on ECS or EKS.
Log Storage Format
The critical step is determining the optimal storage format for your data within Amazon S3.
Raw / CSV
Parquet
JSON
GZIP / ZIP
Table Definition Statement
Once the ingestion process has established and the log format was selected, a table definition is needed to allow queries to run on top of S3 using the build in catalog - wherever it is Hive or Glue.
It effectively maps the data stored in S3 to a structured format that query services can understand, allowing for complex analytical queries to be executed efficiently. This includes column names, data types, and potentially partition keys, which are critical for optimizing query performance.
Using AWS Glue Data Catalog
Integrating with Hive Metastore
Automated Schema Detection and Table Creation
In the context of Flint (S3) based integrations, manually defining table schemas for large or complex datasets can be cumbersome and prone to errors.
AWS provides a powerful tool, the AWS Glue Crawler, to automate the process of schema detection and table creation, significantly simplifying data management tasks.
AWS Glue Crawler
Considerations for Using AWS Glue Crawler
Best Practices for Automated Schema Detection
Projections and materialized views table definitions
Once a table is defined, it can now used as the base of creating multiple data projections and Materialized views that reflects different aspects of the data and allow both to visualize and to summarize data in an effective and efficient way.
There are many strategists for this phase and we cant review all of them here. For better understanding and selecting a projection strategy the following considerations are needed to be reviewed:
Domain of the data - whether its Observability / Security or any other specific domain that has its particular entities and relationships which will determine the dimensions which are important to reflect and project.
Partitioning of the data - as mentioned in the previous part, the partitioning will determine the way data is collected and stored and therefor will directly effect the type of projections and visualizations
The additional dimensions in the aggregation queries are meant to allow simple filtering and grouping without the need to fetch additional data
Index & Dashboard Naming Conventions
Index name should represent the integration source:
aws_vps_flow...
aws_cloud_front...
apache...
Live queries - should have a
live
phrase indicating the purpose of this MV query including the time frame this query works on:aws_vps_flow_live_week_view
aws_cloud_front_hour_view
When a visualization uses this MV query projection it should state it in its description:
vpc_``live_requests_view_graph
Aggregated queries - should have
agg
phrase indicating the purpose of this MV query including the time frame this query works on and the aggregated type and dimensions :aws_vps_flow_agg_sum_weekly_requests_And_bytes
aws_vps_flow_agg_avg_daily_http_status
When a visualization uses this MV query projection it should state it in its description:
vpc_agg``_avg_daily_http_status_pie_chart
MV naming Parts:
integration_id
: integration identificationprojection-type:
live stream or aggregated projectiontimeframe:
time for the live stream or the aggregated timeframeprojected_fields:
fields being aggregated or view indicating showing allversion:
integration / MV version$integration_id$_$projection-type$_$timeframe$_$projected_fields$_version
Another fields important for future considerations
group_by
dimensions (fields) attached to each time frameThe next plan will display an opinionated strategy that is composed of the following parts:
Domain: Observability**, Partitioning:** Time**, Data nature:** time-series data points with
Live stream view:
The live stream view is a direct copy of the most recent data stream (weekly or daily depending on the resolution)
According to the partition nature of the data, the live stream should have the appropriate where clause that gets the correct partition in addition to the time based live stream specification
Aggregation view of Numeric field:
According to the domain this integration is covering, there will be numerous aggregations summaries of numeric data points such as
amount of requests|
duration of events
|average of bytes
and so on.Each such aggregation is mostly time based and has the following time bucket strategy:
- Daily
- Hourly
- Weekly
Bysupporting multi-scale time based aggregation we can support multiple granular resolution while preserving efficient storage and compute processes .
The following is a 60 minute MV query aggregation summary of total bytes and packets.
In addition It adds the following dimensions for filters
- direction
- src.svc_name
- dst.svc_name
- accountid
Timely window summary of a grouping query
A time window of aggregated summary of a group of values is important to be able to identify trends and anomalies . They allow to formulate a baseline to compare with.
The following is a 60 minute MV query aggregation window summary of top destination IPs ordered by amount of sent bytes
SQL support and best practice
The text was updated successfully, but these errors were encountered: