# awesome-lakehouse-guide

A repository for everything related to open table formats (Apache Iceberg, Apache Hudi, Delta Lake) and the overall Lakehouse architecture.

*(Diagram: lakehouse architecture)*
## Research Papers - Data Engineering (Lakehouse, Distributed Systems, Open Source)

| Paper Name | Area | Quick Summary |
| --- | --- | --- |
| BtrBlocks | File Format | BtrBlocks introduces an efficient columnar storage format aimed at optimizing compression and decompression for data lakes, particularly when dealing with large datasets in cloud environments. The paper highlights how BtrBlocks outperforms traditional formats like Apache Parquet in both compression ratio and decompression speed. By using a combination of lightweight encoding schemes and a novel floating-point compression method called `Pseudodecimal Encoding`, BtrBlocks achieves significant improvements in scan performance, making it particularly useful for cloud-native systems that rely on high-throughput network environments like AWS S3. Key benefits include a 2.2x increase in scan speed and a 1.8x reduction in cost compared to Parquet. |
| The Data Lakehouse: Data Warehousing & More | Lakehouse | This paper discusses the evolution of data systems, focusing on the Data Lakehouse architecture. The authors explain how traditional RDBMS-OLAP systems, foundational for data warehousing, face challenges due to their rigid architecture and limitations in handling diverse analytical workloads. The Data Lakehouse architecture addresses these shortcomings by combining the strengths of data lakes (scalable storage for diverse data types) and data warehouses (efficient query performance and ACID transactions). Overall, the paper provides side-by-side comparisons of a data lakehouse and a warehouse. |
| Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics | Lakehouse | This paper introduces the Lakehouse architecture as a next-generation data management platform designed to unify the benefits of data lakes and data warehouses. Traditional data architectures often face challenges related to data complexity, delayed processing, and the high costs of managing separate lake and warehouse environments. The Lakehouse concept addresses these issues by enabling low-cost storage with ACID transactional capabilities, data versioning, and SQL performance similar to data warehouses. |
| The Story of AWS Glue | Data Tools | AWS Glue is a serverless data integration service that simplifies the extraction, transformation, and loading (ETL) of data across various sources. Initially designed for batch ETL jobs, AWS Glue has evolved into a more versatile tool, supporting interactive debugging and dynamic scaling, making it suitable for a wide range of data processing tasks. One of its key components is the AWS Glue Data Catalog, which serves as a scalable metadata service, allowing users to catalog datasets from various sources. The paper details the architectural evolution of AWS Glue over six years, including lessons learned from customer use cases such as loading data warehouses, integrating on-premises databases, and ingesting streaming data. |
| Auto-WLM: Machine learning enhanced workload management in Amazon Redshift | Data Tools | This paper explores Auto-WLM (Automatic Workload Management), a machine learning-based system used in Amazon Redshift to manage and optimize query workloads efficiently. Traditional workload management systems rely on static configurations and manual tuning, but Auto-WLM introduces an intelligent, automated approach that adjusts resources dynamically based on real-time conditions. This capability has been implemented in production, showing improvements in query throughput and resource utilization, particularly under complex and mixed workload scenarios. |
| A Deep Dive into Common Open Formats for Analytical DBMSs | File Format | This paper evaluates the suitability of open columnar storage formats (Apache Arrow, Parquet, and ORC) for analytical database management systems (DBMSs). It provides an in-depth analysis of the encoding techniques, compression methods, and performance trade-offs associated with these formats. Key insights include comparisons between encoding methods such as Bit-Packed Encoding (BP), Dictionary Encoding (DICT), and Run-Length Encoding (RLE). The paper emphasizes the trade-offs between space efficiency and query performance across these formats, providing guidance on selecting the right format for different analytical workloads. |
| XTable in Action: Seamless Interoperability in Data Lakes | Storage | The paper introduces XTable, a solution designed to enhance interoperability between open lakehouse table formats (LSTs) such as Delta Lake, Apache Hudi, and Apache Iceberg in data lakes. XTable facilitates seamless metadata translation without rewriting tables, allowing data stored in one format to be accessible in the others without costly migrations. The paper covers the internal architecture of XTable and presents real-world applications. |
| Apache Flink™: Stream and Batch Processing in a Single Engine | Compute Engine | The paper explores Apache Flink, an open-source stream and batch processing engine. Flink provides a unified system to handle real-time data streams and batch data processing. The core of Flink's architecture is a distributed dataflow engine that enables scalable, fault-tolerant, and low-latency processing of large datasets. Key capabilities include state management, event-time processing, and exactly-once semantics. The paper covers Flink's application to workloads such as real-time analytics, machine learning, and ETL pipelines. |
| Automated multidimensional data layouts in Amazon Redshift | Storage | This paper explores the introduction of Multidimensional Data Layouts (MDDL) in Amazon Redshift, which enhances query performance by automatically optimizing the physical data layout. Traditional data layout techniques, such as single-column and compound sort keys, are effective for some queries but struggle with multi-column filtering. MDDL improves on this by dynamically organizing data across multiple dimensions, allowing for more efficient data pruning and skipping. The paper claims that this layout achieves up to an 85% reduction in workload runtime and up to 100× faster performance on specific queries. |
| Vortex: A Stream-oriented Storage Engine For Big Data Analytics | Storage | The paper presents Vortex, a storage engine developed within Google BigQuery to support real-time and batch data analytics. Vortex operates as a stream-first system, capable of handling both types of workloads efficiently and addressing the challenges of managing petabyte-scale data ingestion and processing. It achieves sub-second data freshness and low-latency query performance. Vortex integrates with BigQuery's distributed query engine, Dremel, and leverages Google's Colossus file system for robust, disaster-resilient storage. |

## Blogs

| Blog Title | Tags | Quick Summary |
| --- | --- | --- |
| Getting Started with Flink SQL and Apache Iceberg | Apache Iceberg, Flink | How to get started with Flink SQL and Apache Iceberg for real-time processing (see the Flink SQL sketch below this table). |
| Apache Hudi (Part 1): History, Getting Started | Apache Hudi | Discusses the motivations behind Apache Hudi (from its inception at `Uber`) and provides insights on the various ways to learn Hudi. |
| How Z-Ordering in Apache Iceberg Helps Improve Performance | Apache Iceberg, Optimization | Explains how Z-ordering optimizes query performance by clustering data across multiple dimensions, reducing the need to scan unnecessary files. Although the blog is centered on Apache Iceberg, the concepts apply to other formats as well (see the Z-ordering sketch below this table). |
| What is Apache XTable — Interoperability for Apache Hudi, Iceberg & Delta Lake | Apache Iceberg, Apache Hudi, Delta Lake, Interoperability | Discusses Apache XTable, a framework designed to enable seamless interoperability between Apache Hudi, Apache Iceberg, and Delta Lake, allowing users to manage data across these open table formats with a unified approach. |
| Building Analytical Apps on the Lakehouse using Apache Hudi, Daft & Streamlit | Apache Hudi, Daft, Streamlit | Shows hands-on examples of building data applications (dashboards) directly on top of an open lakehouse platform, using Apache Hudi, Daft, and Streamlit to enable seamless data visualization and exploration. |
| Streamlining Data Quality in Apache Iceberg with write-audit-publish & branching | Apache Iceberg, Data Quality | Explains how Apache Iceberg's write-audit-publish (WAP) pattern, especially with its branching feature, enables efficient data quality checks. Data is staged, audited, and published only if it meets quality standards, isolating experimental data from production while leveraging branch-specific snapshots (see the WAP sketch below this table). |
| What is a Data Lakehouse & How does it Work? | Apache Hudi, Apache Iceberg, Delta Lake, Lakehouse | An introductory blog on the Lakehouse architecture. Explains how this architecture merges the scalability and cost benefits of data lakes with the reliability and ACID transactional support of data warehouses. |
| Puffins and Icebergs: Additional Stats for Apache Iceberg Tables | Apache Iceberg, Stochastic streaming | Introduces the Puffin format in Apache Iceberg, designed to enhance query efficiency by storing additional statistics and secondary indexes. Puffin allows metadata enrichment for better query planning, with one use case being the ability to store approximate distinct counts (NDV) using data sketches. |
| Using Apache Hudi & Iceberg tables in Databricks with Apache XTable | Apache Iceberg, Apache Hudi, Databricks | A practical example of using Apache XTable to achieve interoperability between the Apache Hudi, Iceberg, and Delta Lake formats when building workflows in Databricks. |
| Open Table Formats and the Open Data Lakehouse, In Perspective | Apache Hudi, Apache Iceberg, Lakehouse | Argues that while open table formats are intended to make data architectures more open and interoperable, organizations often remain constrained by proprietary tools and table services for essential functions, preventing a fully open data architecture. |
| Hudi-rs with DuckDB, Polars, Daft, DataFusion — Single-node Lakehouse | Apache Hudi, DuckDB, Apache DataFusion, Daft | Demonstrates how to use Apache Hudi with Rust-backed libraries such as DuckDB, Polars, Daft, and DataFusion to build a single-node lakehouse without JVM or Spark dependencies, enabling efficient data processing within a Python ecosystem. |
| Ultimate Directory of Apache Iceberg Resources | Apache Iceberg | A directory of resources for learning about the Apache Iceberg format. |
| Virtualization + Lakehouse + Mesh = Data At Scale | Data Lakehouse, Virtualization, Data Mesh | Explains why virtualization, the lakehouse, and data mesh are complementary trends. |
| Hands-on with Apache Iceberg on Your Laptop | Apache Iceberg | An end-to-end walkthrough, entirely on your laptop, of ingesting data with Spark, running analytics with Dremio, and creating visualizations in a notebook with Polars & Seaborn, all on Apache Iceberg tables. |
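
For the Flink SQL and Apache Iceberg entry above, here is a minimal PyFlink sketch of the kind of setup that blog covers. The catalog name `lake`, the local Hadoop-catalog warehouse path, and the table `lake.db.events` are illustrative assumptions, and it assumes the Iceberg Flink runtime JAR is already on Flink's classpath.

```python
# Minimal sketch: Flink SQL writing to an Apache Iceberg table.
# Assumes the iceberg-flink-runtime JAR is on the Flink classpath;
# catalog/table names and the warehouse path are placeholders.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Hypothetical local Hadoop catalog; swap in your own catalog configuration.
t_env.execute_sql("""
    CREATE CATALOG lake WITH (
      'type' = 'iceberg',
      'catalog-type' = 'hadoop',
      'warehouse' = 'file:///tmp/warehouse'
    )
""")
t_env.execute_sql("CREATE DATABASE IF NOT EXISTS lake.db")
t_env.execute_sql("CREATE TABLE IF NOT EXISTS lake.db.events (id BIGINT, msg STRING)")
t_env.execute_sql("INSERT INTO lake.db.events VALUES (1, 'hello'), (2, 'iceberg')").wait()
```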
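
The Z-ordering blog centers on rewriting an Iceberg table's data files with a z-order sort. A minimal PySpark sketch of that call is below; the catalog name `lake`, table `db.events`, and sort columns are placeholders, and it assumes a SparkSession already configured with the Iceberg Spark runtime and SQL extensions.

```python
# Minimal sketch: z-order compaction of an Iceberg table via the
# rewrite_data_files Spark procedure (assumes an Iceberg-enabled SparkSession).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-zorder").getOrCreate()

# 'lake' is a placeholder catalog; 'db.events' and the columns are placeholders.
spark.sql("""
    CALL lake.system.rewrite_data_files(
      table => 'db.events',
      strategy => 'sort',
      sort_order => 'zorder(event_ts, user_id)'
    )
""").show()
```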
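
The write-audit-publish entry describes Iceberg's branch-based WAP flow; here is a minimal PySpark sketch of it under the same assumptions (Iceberg-enabled SparkSession, placeholder catalog `lake`, table `db.orders`, branch `audit_branch`, and a hypothetical `staging_orders` source).

```python
# Minimal sketch: write-audit-publish (WAP) with Iceberg branches.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-wap").getOrCreate()

spark.sql("ALTER TABLE lake.db.orders SET TBLPROPERTIES ('write.wap.enabled'='true')")
spark.sql("ALTER TABLE lake.db.orders CREATE BRANCH audit_branch")

# Write: route new data to the audit branch instead of main.
spark.conf.set("spark.wap.branch", "audit_branch")
spark.sql("INSERT INTO lake.db.orders SELECT * FROM staging_orders")  # placeholder source

# Audit: run a data-quality check against the branch.
bad_rows = spark.sql(
    "SELECT count(*) FROM lake.db.orders VERSION AS OF 'audit_branch' WHERE amount < 0"
).first()[0]

# Publish: fast-forward main to the audited branch only if the check passes.
if bad_rows == 0:
    spark.sql("CALL lake.system.fast_forward('db.orders', 'main', 'audit_branch')")
```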

## Code/Notebooks

| Description | Tags |
| --- | --- |
| Creating Hudi Tables on Amazon S3 using Spark SQL (a hedged PySpark sketch follows this table). | Apache Hudi |
| Running Inline Clustering in Apache Hudi. | Apache Hudi |
| Creating Iceberg Tables on Amazon S3 using Spark (a hedged PySpark sketch follows this table). | Apache Iceberg |
| Implementing CDC use cases in Apache Iceberg. | Apache Iceberg |
| Lakehouse on your Laptop Docker Compose Examples. | Apache Iceberg |
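
For the Hudi-on-S3 notebook, here is a minimal PySpark sketch of creating a Hudi table with Spark SQL. The bucket path, table schema, and session configs are assumptions rather than details from the notebook, and the Hudi Spark bundle plus S3 filesystem JARs are assumed to be on the classpath.

```python
# Minimal sketch: creating an Apache Hudi table on Amazon S3 with Spark SQL.
# Assumes the hudi-spark bundle and S3 filesystem JARs are on the classpath;
# 's3://my-bucket/...' is a placeholder location.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("hudi-s3")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
    .getOrCreate()
)

spark.sql("""
    CREATE TABLE IF NOT EXISTS hudi_trips (
      uuid STRING,
      rider STRING,
      fare DOUBLE,
      ts BIGINT
    ) USING hudi
    TBLPROPERTIES (primaryKey = 'uuid', preCombineField = 'ts')
    LOCATION 's3://my-bucket/warehouse/hudi_trips'
""")

spark.sql("INSERT INTO hudi_trips VALUES ('id-1', 'rider-A', 19.10, 1695115999911)")
```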
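
Similarly, for the Iceberg-on-S3 notebook, a minimal PySpark sketch of creating an Iceberg table backed by S3. The catalog name `lake`, the Hadoop catalog type, the warehouse path, and the table schema are assumptions; the iceberg-spark-runtime and AWS bundle JARs are assumed to be on the classpath.

```python
# Minimal sketch: creating an Apache Iceberg table on Amazon S3 with Spark.
# Assumes iceberg-spark-runtime and AWS bundle JARs on the classpath;
# catalog 'lake' and the S3 warehouse path are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iceberg-s3")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3://my-bucket/iceberg-warehouse")
    .getOrCreate()
)

spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.db.trips (
      uuid STRING,
      rider STRING,
      fare DOUBLE,
      ts TIMESTAMP
    ) USING iceberg
    PARTITIONED BY (days(ts))
""")

spark.sql(
    "INSERT INTO lake.db.trips VALUES ('id-1', 'rider-A', 19.10, TIMESTAMP '2024-01-01 10:00:00')"
)
```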
