A Native Engine for Spark SQL with vectorized SIMD optimizations
Spark SQL works very well with structured row-based data. It uses WholeStageCodeGen to improve performance by generating Java code that is JIT-compiled at runtime. However, the Java JIT compiler is usually not very good at exploiting the latest SIMD instructions, especially for complicated queries. Apache Arrow provides a CPU-cache-friendly columnar in-memory layout, and its SIMD-optimized compute kernels and LLVM-based expression compiler Gandiva are highly efficient. Native SQL Engine uses these technologies to bring better performance to Spark SQL.
With SPARK-27396 it is possible to pass an RDD of ColumnarBatch to operators. We implemented this API with the Arrow columnar format.
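As a minimal sketch of the SPARK-27396 columnar API (the object name and query below are illustrative, not part of this project), a physical plan that supports columnar execution can hand back an RDD of ColumnarBatch directly instead of rows:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.vectorized.ColumnarBatch

// Illustrative only: check whether the physical plan uses the columnar
// path introduced by SPARK-27396 and, if so, consume RDD[ColumnarBatch].
object ColumnarBatchDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("columnar-batch-demo")
      .master("local[*]")
      .getOrCreate()

    val df = spark.range(0, 1000).selectExpr("id", "id * 2 AS doubled")
    val plan: SparkPlan = df.queryExecution.executedPlan

    if (plan.supportsColumnar) {
      // Each partition yields ColumnarBatch objects instead of rows.
      val rowsPerBatch = plan.executeColumnar()
        .map((batch: ColumnarBatch) => batch.numRows())
        .collect()
      println(s"rows per batch: ${rowsPerBatch.mkString(", ")}")
    } else {
      println("this plan falls back to row-based execution")
    }
    spark.stop()
  }
}
```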
A native Parquet reader was developed to speed up data loading. It is based on Apache Arrow Dataset. For details please check Arrow Data Source.
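A minimal sketch of reading Parquet through an Arrow-Dataset-backed reader from Spark; the short format name "arrow" and the path are assumptions here, so consult the Arrow Data Source guide for the exact format name and options:

```scala
import org.apache.spark.sql.SparkSession

// Sketch, under the assumption that the Arrow data source is registered
// under the short name "arrow".
object ArrowDataSourceDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("arrow-data-source-demo")
      .master("local[*]")
      .getOrCreate()

    // Read Parquet files via the native Arrow Dataset reader instead of
    // Spark's built-in Java Parquet reader.
    val df = spark.read
      .format("arrow")               // assumed short name of the data source
      .load("/path/to/parquet/dir")  // placeholder path

    df.printSchema()
    df.show(10)
    spark.stop()
  }
}
```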
We implemented common operators based on Apache Arrow Compute and Gandiva. A SQL expression is compiled into an expression tree, serialized with protobuf, and passed to the native kernels. The native kernels then evaluate these expressions against the input columnar batches.
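To illustrate the idea, the sketch below builds a small "a + b" expression tree with Gandiva's Java TreeBuilder API (which serializes the tree to protobuf under the hood) and compiles it into a native projector. The field names and the demo object are illustrative and are not this project's internal API:

```scala
import java.util.Arrays

import org.apache.arrow.gandiva.evaluator.Projector
import org.apache.arrow.gandiva.expression.TreeBuilder
import org.apache.arrow.vector.types.pojo.{ArrowType, Field, Schema}

// Sketch: compile the expression add(a, b) into LLVM code with Gandiva.
object GandivaExprDemo {
  def main(args: Array[String]): Unit = {
    val int32 = new ArrowType.Int(32, true)
    val a = Field.nullable("a", int32)
    val b = Field.nullable("b", int32)
    val out = Field.nullable("a_plus_b", int32)

    // Build the tree add(a, b) and wrap it into an expression with a result field.
    val addNode = TreeBuilder.makeFunction(
      "add",
      Arrays.asList(TreeBuilder.makeField(a), TreeBuilder.makeField(b)),
      int32)
    val expr = TreeBuilder.makeExpression(addNode, out)

    // Compile to native code; at evaluation time the projector consumes
    // Arrow record batches and writes results into output vectors.
    val schema = new Schema(Arrays.asList(a, b))
    val projector = Projector.make(schema, Arrays.asList(expr))
    println(s"compiled projector: $projector")
    projector.close()
  }
}
```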
We implemented a columnar shuffle to improve shuffle performance. With the columnar layout we can apply very efficient, per-column data compression for different data formats.
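Below is a hypothetical configuration sketch for enabling a columnar shuffle together with shuffle compression. The shuffle-manager class name is an assumption; the installation guide lists the authoritative settings:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: enable a columnar shuffle manager and shuffle compression.
object ColumnarShuffleConfigDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("columnar-shuffle-demo")
      .master("local[*]")
      // Assumed class name of the columnar shuffle manager; see the installation guide.
      .config("spark.shuffle.manager", "org.apache.spark.shuffle.sort.ColumnarShuffleManager")
      // Standard Spark settings: compress shuffle blocks with lz4.
      .config("spark.shuffle.compress", "true")
      .config("spark.io.compression.codec", "lz4")
      .getOrCreate()

    // A join plus aggregation triggers a shuffle exchange over columnar batches.
    val left = spark.range(0, 100000).selectExpr("id", "id % 10 AS k")
    val right = spark.range(0, 1000).selectExpr("id AS k", "id * 100 AS v")
    left.join(right, "k").groupBy("k").count().show()
    spark.stop()
  }
}
```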
Check out the detailed installation/testing guide for a quick start.