Add Snowflake Iceberg
yarkhinephyo committed Jan 21, 2024
1 parent a9b7760 commit 143e766
Showing 3 changed files with 109 additions and 2 deletions.
@@ -3,7 +3,7 @@ layout: post
title: "Intuition Behind the Attention Head of Transformers"
date: 2022-04-09 14:20:00 +0800
category: [Tech]
-tags: [NLP, Data-Science]
+tags: [Data-Science]
---

Even as I frequently use transformers for NLP projects, I have struggled with the intuition behind the multi-head attention mechanism outlined in the paper - [Attention Is All You Need](https://arxiv.org/abs/1706.03762). This post will act as a memo for my future self.
2 changes: 1 addition & 1 deletion _posts/2022-04-13-primer-to-sift.markdown
@@ -3,7 +3,7 @@ layout: post
title: "Primer to Scale-invariant Feature Transform"
date: 2022-04-13 09:30:00 +0800
category: [Tech]
-tags: [Computer-Vision, Software-Engineering]
+tags: [Data-Science]
---

Scale-invariant Feature Transform, also known as SIFT, is a method to consistently represent features in an image even under different scales, rotations and lighting conditions. Since the video series by First Principles of Computer Vision covers the details very well, the post covers mainly my intuition. The topic requires prior knowledge on using [Laplacian of Gaussian](https://en.wikipedia.org/wiki/Discrete_Laplace_operator) for edge detection in images.
107 changes: 107 additions & 0 deletions _posts/2024-01-22-snowflake-new-streams-talk.markdown
@@ -0,0 +1,107 @@
---
layout: post
title: "Learning Points: Snowflake Iceberg, Streaming, Unistore"
date: 2024-01-21 10:30:00 +0800
category: [Tech]
tags: [Video-Takeaway]
---

This [video](https://youtu.be/Kr-Vzvkyabw?si=uo_ZV4IpUHhE9MJU) is about Snowflake Iceberg Tables, Streaming Ingest and Unistore. The presenters are N.Single, T.Jones and A.Motivala, speaking as part of the Database Seminar Series by the CMU Database Group.

### Problems with Traditional Data Lakes

Traditional data lakes use file systems as the metadata layer. For example, the data for each table is organized in a directory, and partitioning is implemented through nested directories. Using directories as database tables causes several problems.

- It is not easy to provide ACID guarantees. For example, inserts that span multiple partitions are not atomic.
- Tools may access the file system directly, without consistent metadata updates.
- Schema evolution is very error prone.
- There is no clear way to implement access control.
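
To make the atomicity problem concrete, here is a minimal Python sketch. It assumes a hypothetical table laid out as `events/date=.../part-0.parquet` in a temporary local directory; the function names and the placeholder file contents are illustrative and not taken from the talk. A reader that discovers the table by listing directories can observe an insert halfway through.

```python
import tempfile
from pathlib import Path


def list_table_files(table_dir):
    """A reader 'discovers' the table by listing the directory tree."""
    return sorted(str(p.relative_to(table_dir)) for p in table_dir.rglob("*.parquet"))


def simulate_insert(partitions=("date=2024-01-20", "date=2024-01-21")):
    """Insert one file per partition and record what a reader sees after each write.

    There is no commit step: every file becomes visible as soon as it is written.
    """
    observations = []
    with tempfile.TemporaryDirectory() as tmp:
        table_dir = Path(tmp) / "events"
        for partition in partitions:
            part_dir = table_dir / partition
            part_dir.mkdir(parents=True, exist_ok=True)
            # Placeholder bytes stand in for a real Parquet file.
            (part_dir / "part-0.parquet").write_bytes(b"placeholder")
            observations.append(list_table_files(table_dir))
    return observations


if __name__ == "__main__":
    for step, files in enumerate(simulate_insert(), start=1):
        print(f"after write {step}: {files}")
```

The first observation contains only one of the two partitions, which is exactly the partial state that a transactional table format has to hide from readers.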

### Apache Iceberg

Apache Iceberg is an open table format specification that addresses these problems.

- It describes how to perform updates to the table.
- It specifies how to achieve snapshot isolation. File membership is defined per snapshot.
- Schema evolution is easier with Iceberg metadata.
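
The idea that file membership is defined per snapshot can be sketched in a few lines of Python. This is a simplified model under my own assumptions: the class and method names are hypothetical, and the actual Iceberg specification tracks files through metadata files, manifest lists and manifests rather than an in-memory set.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Optional


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    data_files: FrozenSet[str]  # the complete set of files visible in this snapshot


@dataclass
class Table:
    snapshots: Dict[int, Snapshot] = field(default_factory=dict)
    current_snapshot_id: Optional[int] = None

    def commit(self, added=(), removed=()):
        """Create a new immutable snapshot, then switch the current pointer to it."""
        if self.current_snapshot_id is None:
            base = frozenset()
        else:
            base = self.snapshots[self.current_snapshot_id].data_files
        new_id = len(self.snapshots) + 1
        self.snapshots[new_id] = Snapshot(new_id, (base - frozenset(removed)) | frozenset(added))
        self.current_snapshot_id = new_id  # the only mutation readers can observe
        return new_id

    def scan(self, snapshot_id=None):
        """Return the files of a given snapshot (defaults to the current one)."""
        if snapshot_id is None:
            snapshot_id = self.current_snapshot_id
        return self.snapshots[snapshot_id].data_files


if __name__ == "__main__":
    table = Table()
    first = table.commit(added=["a.parquet", "b.parquet"])
    table.commit(added=["c.parquet"], removed=["a.parquet"])
    print(sorted(table.scan(first)))  # a reader pinned to the first snapshot
    print(sorted(table.scan()))       # a reader on the latest snapshot
```

A reader that started on the first snapshot keeps seeing the same files even after later commits, because older snapshots are never modified.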

### Snowflake Architecture

Table metadata and data are stored as Parquet files in the customer's bucket.

```
               +-------------------------------------------------+
Cloud services | Authentication and Authorization                |
               +-------------------------------------------------+
               | Infra Manager | Optimizer | Transaction Manager |
               +-------------------------------------------------+
               | Metadata Storage (Customer's Bucket)            |
               +-------------------------------------------------+
               +-------------------+         +-------------------+
Compute        | Warehouse         |         | Warehouse         |
               +-------------------+         +-------------------+
               +-------------------------------------------------+
Storage        | Data (Customer's Bucket)                        |
               +-------------------------------------------------+
```

Customers have to provide Snowflake with External Volumes on any of the cloud providers, together with access credentials. Data and metadata files are written to the External Volume.

### Metadata Generation

Snowflake originally had its own files for storing snapshot metadata. To support the Iceberg format, each table commit requires generating both Iceberg metadata and internal Snowflake metadata.

Generating the additional Iceberg metadata files on the commit path would increase query latency significantly, so they are generated in the background at the same time.

Once the Snowflake metadata files are generated, the transaction is considered committed. If the server crashes before the Iceberg metadata is generated, the next request goes to a new Snowflake server and the Iceberg metadata is generated on the fly.
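
The commit and recovery behaviour described above can be sketched as follows. This is only an illustration of the idea, not Snowflake's implementation: the `Server` class, the dictionary standing in for durable storage and the metadata layout are all assumptions of mine.

```python
def new_durable_store():
    """Hypothetical durable storage that survives a server crash."""
    return {"snowflake": {}, "iceberg": {}}


class Server:
    def __init__(self, store):
        self.store = store
        self.pending = []  # in-memory queue of background work, lost on a crash

    def commit(self, version, data_files):
        # Writing the internal metadata is the commit point.
        self.store["snowflake"][version] = list(data_files)
        # Iceberg metadata is produced later, off the commit path.
        self.pending.append(version)

    def run_background_generation(self):
        while self.pending:
            self._generate_iceberg(self.pending.pop(0))

    def _generate_iceberg(self, version):
        files = self.store["snowflake"][version]
        self.store["iceberg"][version] = {"snapshot_id": version, "files": list(files)}

    def read_iceberg_metadata(self, version):
        # If background generation never ran (e.g. the old server crashed),
        # derive the Iceberg metadata from the committed internal metadata now.
        if version not in self.store["iceberg"]:
            self._generate_iceberg(version)
        return self.store["iceberg"][version]


if __name__ == "__main__":
    store = new_durable_store()
    old_server = Server(store)
    old_server.commit(1, ["f1.parquet"])
    # Simulated crash: old_server and its pending queue are simply dropped.
    new_server = Server(store)
    print(new_server.read_iceberg_metadata(1))
```

The point of the sketch is that the internal metadata is the source of truth, so the Iceberg metadata can always be regenerated from it after a crash.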

### How Spark Accesses Iceberg

The Iceberg SDK accesses a catalog, which returns the location of the metadata files in the customer's bucket. The SDK then interprets the metadata files and returns the locations of the data files to Spark through an API.

```
Spark ---> Iceberg SDK ---> 1. Catalog (Hive, Glue)
  |             |
  |             -----------> 2. Storage (Snapshot Metadata)
  |
  ------------> 3. Data Files
```
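
The three numbered steps can be sketched in Python with plain dictionaries standing in for the catalog and the bucket. The table name, paths and metadata layout below are hypothetical, and the real Iceberg SDK resolves data files through manifest lists and manifests rather than a flat list.

```python
# Hypothetical catalog: table name -> location of the current metadata file.
CATALOG = {"db.customer": "bucket/customer/metadata/v2.json"}

# Hypothetical object storage: path -> already-parsed file content.
STORAGE = {
    "bucket/customer/metadata/v2.json": {
        "current_snapshot_id": 2,
        "snapshots": {
            1: ["bucket/customer/data/a.parquet"],
            2: ["bucket/customer/data/a.parquet", "bucket/customer/data/b.parquet"],
        },
    },
}


def plan_scan(table_name, catalog=CATALOG, storage=STORAGE):
    """Resolve a table name to the data files of its current snapshot."""
    metadata_location = catalog[table_name]          # 1. ask the catalog
    metadata = storage[metadata_location]            # 2. read the snapshot metadata
    snapshot_id = metadata["current_snapshot_id"]
    return list(metadata["snapshots"][snapshot_id])  # 3. data files for the engine to read


if __name__ == "__main__":
    for path in plan_scan("db.customer"):
        print(path)
```

The engine only ever talks to the catalog and to object storage, which is why any Iceberg-compatible engine can read the same table.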

### Snowpipe Streaming

Before this feature, the original Snowpipe continuously copied data from a bucket to a table behind the scenes, in batches. However, there was no low-latency, high-throughput, in-order processing feature. Snowpipe Streaming provides:

- Ingestion to tables over HTTPS.
- Exactly once (?) and per-channel ordering guarantees.
- Low latency, queryable after seconds.
- High throughput, GB/s supported.
- Low overhead for configuration.

New concepts include:

- Channel - Logical partition that represents a connection from a single client to a destination table.
- Client SDK - Accepts bytes from application, writes data to cloud storage as blobs and registers them to Snowflake.
- Mixed table - Contains both BDEC files (chunks in Arrow format) written by the client SDK and files in Snowflake's proprietary FDN format. In the background, the BDEC files are rewritten into FDN format. The rewriting is transparent to users, since queries can run on a mixture of BDEC and FDN files. However, the rewriting requires additional compute, which is charged to the customer.

The implementation details:

- User code uses the Snowpipe Streaming Client SDK to open a Channel and write rows to the Channel.
- Client SDK writes BDEC files to the Streaming Ingest's internal storage (Blobstore). Note that FDN files exist in the same Blobstore.
- Client SDK registers the blob with Snowflake's Frontend node via a REST API.
- Frontend node fans out per-table registration requests to Snowflake's Commit Service and provides a progress update to the client SDK.
- The Commit Service validates and deduplicates chunks per-table in memory. Then it commits by changing table version references to the new Arrow chunks (BDEC).
- Snowpipe creates regular FDN files from BDEC files. At this point, queries would reflect the newly added data in BDEC files.
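
The write, register and commit steps above can be sketched in Python as a toy model. Everything here is an assumption made for illustration: the class names, the use of a per-channel sequence number for deduplication and the in-memory dictionary standing in for the Blobstore do not come from the talk or from the real client SDK.

```python
class CommitService:
    """Toy commit service: deduplicates registrations and tracks committed chunks."""

    def __init__(self):
        self.committed_chunks = []  # chunk paths visible to queries, in commit order
        self.last_sequence = {}     # channel name -> highest committed sequence number

    def register(self, channel, sequence, blob_path):
        if sequence <= self.last_sequence.get(channel, -1):
            return False  # duplicate registration, e.g. a client retry
        self.last_sequence[channel] = sequence
        self.committed_chunks.append(blob_path)
        return True


class Channel:
    """Toy channel: one client writing ordered chunks to one table."""

    def __init__(self, name, blobstore, commit_service):
        self.name = name
        self.blobstore = blobstore
        self.commit_service = commit_service
        self.next_sequence = 0

    def insert_rows(self, rows):
        sequence = self.next_sequence
        blob_path = f"{self.name}/chunk-{sequence}.bdec"
        self.blobstore[blob_path] = list(rows)                          # write the blob
        self.commit_service.register(self.name, sequence, blob_path)   # register it
        self.next_sequence += 1
        return sequence, blob_path


def query(blobstore, commit_service):
    """Read all rows from committed chunks, in commit order."""
    return [row for path in commit_service.committed_chunks for row in blobstore[path]]


if __name__ == "__main__":
    blobstore, service = {}, CommitService()
    channel = Channel("orders-client-1", blobstore, service)
    sequence, path = channel.insert_rows([{"id": 1}, {"id": 2}])
    channel.insert_rows([{"id": 3}])
    service.register(channel.name, sequence, path)  # a retried registration is ignored
    print(query(blobstore, service))
```

In this model a retried registration does not duplicate rows, and rows from one channel stay in the order in which they were written.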

### Unistore

Unistore is Snowflake's product for combining transactional and analytical workloads on one platform.

It introduces a new table type, the hybrid table, which works with existing Snowflake tables and supports transactional features such as unique keys, referential integrity constraints and cross-domain transactions.

```
CREATE HYBRID TABLE CustomerTable (
    customer_id int primary key,
    full_name varchar(256),
    ...
);
```
