A Native Engine for Spark SQL with vectorized SIMD optimizations
Spark SQL works very well with structured row-based data. It uses WholeStageCodeGen to improve performance by generating Java code that is JIT-compiled at runtime. However, the Java JIT compiler is usually not very good at exploiting the latest SIMD instructions, especially for complicated queries. Apache Arrow provides a CPU-cache-friendly columnar in-memory layout, and its SIMD-optimized compute kernels and LLVM-based expression compiler Gandiva are highly efficient. Native SQL Engine uses these technologies to bring better performance to Spark SQL.
With SPARK-27396 it is possible to pass an RDD of ColumnarBatch to operators. We implemented this API with the Arrow columnar format.
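As a minimal sketch of the SPARK-27396 columnar API (the object name and query below are illustrative, not part of this project), a physical plan that supports columnar execution can hand back an RDD of ColumnarBatch directly instead of rows:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.vectorized.ColumnarBatch

// Illustrative only: check whether the physical plan uses the columnar
// path introduced by SPARK-27396 and, if so, consume RDD[ColumnarBatch].
object ColumnarBatchDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("columnar-batch-demo")
      .master("local[*]")
      .getOrCreate()

    val df = spark.range(0, 1000).selectExpr("id", "id * 2 AS doubled")
    val plan: SparkPlan = df.queryExecution.executedPlan

    if (plan.supportsColumnar) {
      // Each partition yields ColumnarBatch objects instead of rows.
      val rowsPerBatch = plan.executeColumnar()
        .map((batch: ColumnarBatch) => batch.numRows())
        .collect()
      println(s"rows per batch: ${rowsPerBatch.mkString(", ")}")
    } else {
      println("this plan falls back to row-based execution")
    }
    spark.stop()
  }
}
```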
A native Parquet reader was developed to speed up data loading. It is based on Apache Arrow Dataset. For details please check Arrow Data Source.
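A minimal sketch of reading Parquet through an Arrow-Dataset-backed reader from Spark; the short format name "arrow" and the path are assumptions here, so consult the Arrow Data Source guide for the exact format name and options:

```scala
import org.apache.spark.sql.SparkSession

// Sketch, under the assumption that the Arrow data source is registered
// under the short name "arrow".
object ArrowDataSourceDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("arrow-data-source-demo")
      .master("local[*]")
      .getOrCreate()

    // Read Parquet files via the native Arrow Dataset reader instead of
    // Spark's built-in Java Parquet reader.
    val df = spark.read
      .format("arrow")               // assumed short name of the data source
      .load("/path/to/parquet/dir")  // placeholder path

    df.printSchema()
    df.show(10)
    spark.stop()
  }
}
```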
We implemented common operators based on Apache Arrow Compute and Gandiva. A SQL expression is compiled into an expression tree, serialized with protobuf, and passed to the native kernels. The native kernels then evaluate these expressions against the input columnar batches.
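To illustrate the idea, the sketch below builds a small "a + b" expression tree with Gandiva's Java TreeBuilder API (which serializes the tree to protobuf under the hood) and compiles it into a native projector. The field names and the demo object are illustrative and are not this project's internal API:

```scala
import java.util.Arrays

import org.apache.arrow.gandiva.evaluator.Projector
import org.apache.arrow.gandiva.expression.TreeBuilder
import org.apache.arrow.vector.types.pojo.{ArrowType, Field, Schema}

// Sketch: compile the expression add(a, b) into LLVM code with Gandiva.
object GandivaExprDemo {
  def main(args: Array[String]): Unit = {
    val int32 = new ArrowType.Int(32, true)
    val a = Field.nullable("a", int32)
    val b = Field.nullable("b", int32)
    val out = Field.nullable("a_plus_b", int32)

    // Build the tree add(a, b) and wrap it into an expression with a result field.
    val addNode = TreeBuilder.makeFunction(
      "add",
      Arrays.asList(TreeBuilder.makeField(a), TreeBuilder.makeField(b)),
      int32)
    val expr = TreeBuilder.makeExpression(addNode, out)

    // Compile to native code; at evaluation time the projector consumes
    // Arrow record batches and writes results into output vectors.
    val schema = new Schema(Arrays.asList(a, b))
    val projector = Projector.make(schema, Arrays.asList(expr))
    println(s"compiled projector: $projector")
    projector.close()
  }
}
```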
We implemented a columnar shuffle to improve shuffle performance. With the columnar layout we can apply very efficient, per-column data compression for different data formats.
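Below is a hypothetical configuration sketch for enabling a columnar shuffle together with shuffle compression. The shuffle-manager class name is an assumption; the installation guide lists the authoritative settings:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: enable a columnar shuffle manager and shuffle compression.
object ColumnarShuffleConfigDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("columnar-shuffle-demo")
      .master("local[*]")
      // Assumed class name of the columnar shuffle manager; see the installation guide.
      .config("spark.shuffle.manager", "org.apache.spark.shuffle.sort.ColumnarShuffleManager")
      // Standard Spark settings: compress shuffle blocks with lz4.
      .config("spark.shuffle.compress", "true")
      .config("spark.io.compression.codec", "lz4")
      .getOrCreate()

    // A join plus aggregation triggers a shuffle exchange over columnar batches.
    val left = spark.range(0, 100000).selectExpr("id", "id % 10 AS k")
    val right = spark.range(0, 1000).selectExpr("id AS k", "id * 100 AS v")
    left.join(right, "k").groupBy("k").count().show()
    spark.stop()
  }
}
```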
Check out the detailed installation/testing guide for a quick start.