Releases: apache/incubator-gluten
v1.2.0
Release Notes - Gluten version 1.2.0
We are pleased to announce that Gluten v1.2.0 has been published as the first official Apache release.
Highlights (Velox backend only)
- Support Spark 3.2.2, 3.3.1, 3.4.2, and 3.5.1 with all unit tests passing (where the data type is supported)
- Support 31 common Spark operators (based on Spark 3.2)
- Support 266 common Spark functions (based on Spark 3.2)
- Velox codebase updated to 2024/07/05
- New RSS support: add Apache Uniffle integration
- New Data Lake support: Iceberg, Delta Lake
- New File Format Support: CSV
- Enhanced CI workflow
- Refreshed documentation on the Gluten website (https://gluten.apache.org/)
- Improved stability for spill, OOM, and other cases
- More bug fixes
What's Changed
- [CORE] Move all columnar rules to post-columnar transitions by @zhztheplayer in #4790
- [GLUTEN-4398][FOLLOW] Mask PullOutPostProject and PullOutPreProject id by @zwangsheng in #4815
- [GLUTEN-2956][VL] Support Spark NullType by @PHILO-HE in #2996
- [CORE] Add logical link to rewritten spark plan by @ulysses-you in #4817
- [GLUTEN-4803][UT] Add Golden Files for TPC-H Spark33 + Gluten Execution Plan by @zwangsheng in #4804
- [VL] Allow replacing installed minio package by @PHILO-HE in #4825
- [VL] Daily Update Velox Version (2024_03_01) by @GlutenPerfBot in #4821
- [VL] Enable more tests of GlutenParquetIOSuite for Spark32/33/34 by @Yohahaha in #4823
- [GLUTEN-1632][CH]Daily Update Clickhouse Version (20240302) by @lwz9103 in #4837
- [GLUTEN-4039][VL] support map_keys and map_values by @konjac in #4826
- [GLUTEN-4424][CORE] Upgrade spark version to 3.5.1 in Gluten by @JkSelf in #4822
- [VL] Daily Update Velox Version (2024_03_04) by @GlutenPerfBot in #4841
- [GLUTEN-4813] Replace resize/reserve to resize_extact/reserve_exact to reduce memory by @taiyang-li in #4824
- [GLUTEN-1632][CH]Daily Update Clickhouse Version (20240305) by @lwz9103 in #4849
- [VL] Fix boost installation issue and remove useless QueryCtx by @PHILO-HE in #4850
- [VL] Enable "parquet v2 pages - delta encoding" test for Spark33/Spark34 by @Yohahaha in #4816
- [CORE] Support FileSourceScanExec driver metrics for spark3.4/3.5 by @zhli1142015 in #4848
- [GLUTEN-4772][VL] Support empty map/array literal by @WangGuangxin in #4771
- [GLUTEN-4860][CELEBORN] Replace celeborn link by @kerwin-zk in #4861
- [VL][CI] Fix CI failure related to Celeborn by @PHILO-HE in #4862
- [CORE] Support In list option contains non-foldable expression by @ulysses-you in #4843
- [VL] Daily Update Velox Version (2024_03_05) by @GlutenPerfBot in #4852
- [VL] Enable more tests in GlutenParquetQuerySuite for Spark32/33/34 by @Yohahaha in #4854
- [CORE] ColumnarShuffleExchangeExec should respect advisoryPartitionSize for Spark3.5 by @ulysses-you in #4865
- [GLUTEN-4853][CORE] Only trim Alias when its child is semantically equal to resAttr by @liujiayi771 in #4857
- [VL] minor change for delta ut by @zhli1142015 in #4869
- [VL] Add libsodium.so to thirdparty lib for CentOS8 by @kerwin-zk in #4870
- [VL] Updated documentation, refactoring and added more testcases for BNLJ by @Surbhi-Vijay in #4782
- [VL] Daily Update Velox Version (2024_03_06) by @GlutenPerfBot in #4868
- [MINOR] Remove ExtendedAnalysisException by @PHILO-HE in #4864
- [GLUTEN-4831][VL] Support StructType in HashAggregate by @WangGuangxin in #4832
- [VL] Support inline function by @marin-ma in #4847
- [VL] Add flushable decimal sum test case by @liujiayi771 in #4871
- [CORE] Add synchronized for ExplainUtils processPlan by @ulysses-you in #4876
- [VL] Rewrite collect_set and collect_list aggregate function by @ulysses-you in #4805
- [VL] Fix and use flattenVector by @marin-ma in #4783
- [VL] Enable tests of ParquetPartitionDisconverySuite for Spark33/34 by @Yohahaha in #4881
- [CORE] Minor adjustment to columnar rule list, and move all columnar sub-rules to one source folder by @zhztheplayer in #4863
- [VL] Merge Partial and PartialMerge logic in generateMergeCompanionNode by @liujiayi771 in #4883
- [CORE] Fix Spark-3.5 CI by @ulysses-you in #4886
- [GLUTEN-4424][CORE] Follow up upgrading spark version to 3.5.1 by @JkSelf in #4845
- Add .asf.yml by @yaooqinn in #4892
- Update Vulnerability Handling Process by @yaooqinn in #4894
- [VL] Daily Update Velox Version (2024_03_07) by @GlutenPerfBot in #4877
- [GLUTEN-1632][CH]Daily Update Clickhouse Version (20240308) by @lwz9103 in #4890
- [CORE] ColumnarBroadcastExchangeExec should set/cancel with job tag for Spark3.5 by @ulysses-you in #4882
- [VL] Daily Update Velox Version (2024_03_08) by @GlutenPerfBot in #4895
- [VL] Pass partition id to velox functions by @zhli1142015 in #4344
- Add Incubation Standard Disclaimer by @yaooqinn in #4911
- [GLUTEN-4835][CORE] Match metric names with Spark by @clee704 in #4834
- [Gluten-4732][CH] delta-mergetree support update/delete/upsert/insert in a more native delta way by @binmahone in #4733
- [GLUTEN-4898][CH]Bug fix to date diff by @KevinyhZou in #4900
- [VL] Daily Update Velox Version(2024_03_11) by @GlutenPerfBot in #4908
- [DOC] Update release & configuration doc by @PHILO-HE in #4910
- [VL] Support lead window function by @ulysses-you in #4902
- [VL] Fix protobuf configure arguments in get_velox.sh by @liujiayi771 in #4920
- [Gluten-4918][CH]support CTAS for clickhouse table by @binmahone in #4919
- [GLUTEN-4926][CELEBORN] CelebornShuffleManager should remove shuffleId from columnarShuffleIds after unregistering shuffle by @SteNicholas in #4927
- [Gluten-4912][CH]Support Specifying columns in clickhouse tables to b… by @binmahone in #4925
- [Gluten-4706] [CH][CORE] Add a mode to execute count distinct directly instead o… by @binmahone in #4708
- [VL] Daily Update Velox Version (2024_03_12) by @GlutenPerfBot in #4923
- [GLUTEN-4914][CH] Fix exceptions in ASTParser by @taiyang-li in #4916
- [DOC] Minor fix for wrong gluten folder used in doc by @leoluan2009 in #4938
- [VL] Refine log plan/split json into one line by @Yohahaha in #4934
- [VL] Support posexplode function and code refactoring on GenerateExecTransformer by @marin-ma in #4901
- [CORE] Prior to #4893, add vanilla Spark's original scan source code to keep git history by @zhztheplayer in #4931
- [VL] Fix wrong plan equality due to case class inheritance by @zhztheplayer in #4893
- [GLUTEN-3559][VL] enable more sql query tests for Spark34 by @zhouyuan in #4880
- [VL] Daily Update Velox Version (2024_03_13) by @GlutenPerfBot in #4944
- [VL]Bucket join support for Iceberg tables by @SinghAsDev in #4859
- [GLUTEN-4827][UT] Add Golden Files for TPC-H Spark34 + Gluten Execution Plan by @zwangsheng in https://github.com/apache/i...
v1.2.0-rc3
What's Changed
- [CORE] Move all columnar rules to post-columnar transitions by @zhztheplayer in #4790
- [GLUTEN-4398][FOLLOW] Mask PullOutPostProject and PullOutPreProject id by @zwangsheng in #4815
- [GLUTEN-2956][VL] Support Spark NullType by @PHILO-HE in #2996
- [CORE] Add logical link to rewritten spark plan by @ulysses-you in #4817
- [GLUTEN-4803][UT] Add Golden Files for TPC-H Spark33 + Gluten Execution Plan by @zwangsheng in #4804
- [VL] Allow replacing installed minio package by @PHILO-HE in #4825
- [VL] Daily Update Velox Version (2024_03_01) by @GlutenPerfBot in #4821
- [VL] Enable more tests of GlutenParquetIOSuite for Spark32/33/34 by @Yohahaha in #4823
- [GLUTEN-1632][CH]Daily Update Clickhouse Version (20240302) by @lwz9103 in #4837
- [GLUTEN-4039][VL] support map_keys and map_values by @konjac in #4826
- [GLUTEN-4424][CORE] Upgrade spark version to 3.5.1 in Gluten by @JkSelf in #4822
- [VL] Daily Update Velox Version (2024_03_04) by @GlutenPerfBot in #4841
- [GLUTEN-4813] Replace resize/reserve to resize_extact/reserve_exact to reduce memory by @taiyang-li in #4824
- [GLUTEN-1632][CH]Daily Update Clickhouse Version (20240305) by @lwz9103 in #4849
- [VL] Fix boost installation issue and remove useless QueryCtx by @PHILO-HE in #4850
- [VL] Enable "parquet v2 pages - delta encoding" test for Spark33/Spark34 by @Yohahaha in #4816
- [CORE] Support FileSourceScanExec driver metrics for spark3.4/3.5 by @zhli1142015 in #4848
- [GLUTEN-4772][VL] Support empty map/array literal by @WangGuangxin in #4771
- [GLUTEN-4860][CELEBORN] Replace celeborn link by @kerwin-zk in #4861
- [VL][CI] Fix CI failure related to Celeborn by @PHILO-HE in #4862
- [CORE] Support In list option contains non-foldable expression by @ulysses-you in #4843
- [VL] Daily Update Velox Version (2024_03_05) by @GlutenPerfBot in #4852
- [VL] Enable more tests in GlutenParquetQuerySuite for Spark32/33/34 by @Yohahaha in #4854
- [CORE] ColumnarShuffleExchangeExec should respect advisoryPartitionSize for Spark3.5 by @ulysses-you in #4865
- [GLUTEN-4853][CORE] Only trim Alias when its child is semantically equal to resAttr by @liujiayi771 in #4857
- [VL] minor change for delta ut by @zhli1142015 in #4869
- [VL] Add libsodium.so to thirdparty lib for CentOS8 by @kerwin-zk in #4870
- [VL] Updated documentation, refactoring and added more testcases for BNLJ by @Surbhi-Vijay in #4782
- [VL] Daily Update Velox Version (2024_03_06) by @GlutenPerfBot in #4868
- [MINOR] Remove ExtendedAnalysisException by @PHILO-HE in #4864
- [GLUTEN-4831][VL] Support StructType in HashAggregate by @WangGuangxin in #4832
- [VL] Support inline function by @marin-ma in #4847
- [VL] Add flushable decimal sum test case by @liujiayi771 in #4871
- [CORE] Add synchronized for ExplainUtils processPlan by @ulysses-you in #4876
- [VL] Rewrite collect_set and collect_list aggregate function by @ulysses-you in #4805
- [VL] Fix and use flattenVector by @marin-ma in #4783
- [VL] Enable tests of ParquetPartitionDisconverySuite for Spark33/34 by @Yohahaha in #4881
- [CORE] Minor adjustment to columnar rule list, and move all columnar sub-rules to one source folder by @zhztheplayer in #4863
- [VL] Merge Partial and PartialMerge logic in generateMergeCompanionNode by @liujiayi771 in #4883
- [CORE] Fix Spark-3.5 CI by @ulysses-you in #4886
- [GLUTEN-4424][CORE] Follow up upgrading spark version to 3.5.1 by @JkSelf in #4845
- Add .asf.yml by @yaooqinn in #4892
- Update Vulnerability Handling Process by @yaooqinn in #4894
- [VL] Daily Update Velox Version (2024_03_07) by @GlutenPerfBot in #4877
- [GLUTEN-1632][CH]Daily Update Clickhouse Version (20240308) by @lwz9103 in #4890
- [CORE] ColumnarBroadcastExchangeExec should set/cancel with job tag for Spark3.5 by @ulysses-you in #4882
- [VL] Daily Update Velox Version (2024_03_08) by @GlutenPerfBot in #4895
- [VL] Pass partition id to velox functions by @zhli1142015 in #4344
- Add Incubation Standard Disclaimer by @yaooqinn in #4911
- [GLUTEN-4835][CORE] Match metric names with Spark by @clee704 in #4834
- [Gluten-4732][CH] delta-mergetree support update/delete/upsert/insert in a more native delta way by @binmahone in #4733
- [GLUTEN-4898][CH]Bug fix to date diff by @KevinyhZou in #4900
- [VL] Daily Update Velox Version(2024_03_11) by @GlutenPerfBot in #4908
- [DOC] Update release & configuration doc by @PHILO-HE in #4910
- [VL] Support lead window function by @ulysses-you in #4902
- [VL] Fix protobuf configure arguments in get_velox.sh by @liujiayi771 in #4920
- [Gluten-4918][CH]support CTAS for clickhouse table by @binmahone in #4919
- [GLUTEN-4926][CELEBORN] CelebornShuffleManager should remove shuffleId from columnarShuffleIds after unregistering shuffle by @SteNicholas in #4927
- [Gluten-4912][CH]Support Specifying columns in clickhouse tables to b… by @binmahone in #4925
- [Gluten-4706] [CH][CORE] Add a mode to execute count distinct directly instead o… by @binmahone in #4708
- [VL] Daily Update Velox Version (2024_03_12) by @GlutenPerfBot in #4923
- [GLUTEN-4914][CH] Fix exceptions in ASTParser by @taiyang-li in #4916
- [DOC] Minor fix for wrong gluten folder used in doc by @leoluan2009 in #4938
- [VL] Refine log plan/split json into one line by @Yohahaha in #4934
- [VL] Support posexplode function and code refactoring on GenerateExecTransformer by @marin-ma in #4901
- [CORE] Prior to #4893, add vanilla Spark's original scan source code to keep git history by @zhztheplayer in #4931
- [VL] Fix wrong plan equality due to case class inheritance by @zhztheplayer in #4893
- [GLUTEN-3559][VL] enable more sql query tests for Spark34 by @zhouyuan in #4880
- [VL] Daily Update Velox Version (2024_03_13) by @GlutenPerfBot in #4944
- [VL]Bucket join support for Iceberg tables by @SinghAsDev in #4859
- [GLUTEN-4827][UT] Add Golden Files for TPC-H Spark34 + Gluten Execution Plan by @zwangsheng in #4828
- [VL] Verify unhex has been offloaded to native successfully by @Yohahaha in #4937
- [VL] Support skewness aggregate function by @liujiayi771 in #4939
- [GLUTEN-1632][CH]Daily Update Clickhouse Version (20240314) by @lwz9103 in #4948
- [VL] parquet file metadata columns support in velox by @gaoyangxiaozhu in #3870
- [VL] Daily Update Velox Version (2024_03_14) by @GlutenPerfBot in #4949
- [VL] Untangle code of TransformPreOverrides by @zhztheplayer in ht...
v1.2.0-rc2
What's Changed
- [CORE] Move all columnar rules to post-columnar transitions by @zhztheplayer in #4790
- [GLUTEN-4398][FOLLOW] Mask PullOutPostProject and PullOutPreProject id by @zwangsheng in #4815
- [GLUTEN-2956][VL] Support Spark NullType by @PHILO-HE in #2996
- [CORE] Add logical link to rewritten spark plan by @ulysses-you in #4817
- [GLUTEN-4803][UT] Add Golden Files for TPC-H Spark33 + Gluten Execution Plan by @zwangsheng in #4804
- [VL] Allow replacing installed minio package by @PHILO-HE in #4825
- [VL] Daily Update Velox Version (2024_03_01) by @GlutenPerfBot in #4821
- [VL] Enable more tests of GlutenParquetIOSuite for Spark32/33/34 by @Yohahaha in #4823
- [GLUTEN-1632][CH]Daily Update Clickhouse Version (20240302) by @lwz9103 in #4837
- [GLUTEN-4039][VL] support map_keys and map_values by @konjac in #4826
- [GLUTEN-4424][CORE] Upgrade spark version to 3.5.1 in Gluten by @JkSelf in #4822
- [VL] Daily Update Velox Version (2024_03_04) by @GlutenPerfBot in #4841
- [GLUTEN-4813] Replace resize/reserve to resize_extact/reserve_exact to reduce memory by @taiyang-li in #4824
- [GLUTEN-1632][CH]Daily Update Clickhouse Version (20240305) by @lwz9103 in #4849
- [VL] Fix boost installation issue and remove useless QueryCtx by @PHILO-HE in #4850
- [VL] Enable "parquet v2 pages - delta encoding" test for Spark33/Spark34 by @Yohahaha in #4816
- [CORE] Support FileSourceScanExec driver metrics for spark3.4/3.5 by @zhli1142015 in #4848
- [GLUTEN-4772][VL] Support empty map/array literal by @WangGuangxin in #4771
- [GLUTEN-4860][CELEBORN] Replace celeborn link by @kerwin-zk in #4861
- [VL][CI] Fix CI failure related to Celeborn by @PHILO-HE in #4862
- [CORE] Support In list option contains non-foldable expression by @ulysses-you in #4843
- [VL] Daily Update Velox Version (2024_03_05) by @GlutenPerfBot in #4852
- [VL] Enable more tests in GlutenParquetQuerySuite for Spark32/33/34 by @Yohahaha in #4854
- [CORE] ColumnarShuffleExchangeExec should respect advisoryPartitionSize for Spark3.5 by @ulysses-you in #4865
- [GLUTEN-4853][CORE] Only trim Alias when its child is semantically equal to resAttr by @liujiayi771 in #4857
- [VL] minor change for delta ut by @zhli1142015 in #4869
- [VL] Add libsodium.so to thirdparty lib for CentOS8 by @kerwin-zk in #4870
- [VL] Updated documentation, refactoring and added more testcases for BNLJ by @Surbhi-Vijay in #4782
- [VL] Daily Update Velox Version (2024_03_06) by @GlutenPerfBot in #4868
- [MINOR] Remove ExtendedAnalysisException by @PHILO-HE in #4864
- [GLUTEN-4831][VL] Support StructType in HashAggregate by @WangGuangxin in #4832
- [VL] Support inline function by @marin-ma in #4847
- [VL] Add flushable decimal sum test case by @liujiayi771 in #4871
- [CORE] Add synchronized for ExplainUtils processPlan by @ulysses-you in #4876
- [VL] Rewrite collect_set and collect_list aggregate function by @ulysses-you in #4805
- [VL] Fix and use flattenVector by @marin-ma in #4783
- [VL] Enable tests of ParquetPartitionDisconverySuite for Spark33/34 by @Yohahaha in #4881
- [CORE] Minor adjustment to columnar rule list, and move all columnar sub-rules to one source folder by @zhztheplayer in #4863
- [VL] Merge Partial and PartialMerge logic in generateMergeCompanionNode by @liujiayi771 in #4883
- [CORE] Fix Spark-3.5 CI by @ulysses-you in #4886
- [GLUTEN-4424][CORE] Follow up upgrading spark version to 3.5.1 by @JkSelf in #4845
- Add .asf.yml by @yaooqinn in #4892
- Update Vulnerability Handling Process by @yaooqinn in #4894
- [VL] Daily Update Velox Version (2024_03_07) by @GlutenPerfBot in #4877
- [GLUTEN-1632][CH]Daily Update Clickhouse Version (20240308) by @lwz9103 in #4890
- [CORE] ColumnarBroadcastExchangeExec should set/cancel with job tag for Spark3.5 by @ulysses-you in #4882
- [VL] Daily Update Velox Version (2024_03_08) by @GlutenPerfBot in #4895
- [VL] Pass partition id to velox functions by @zhli1142015 in #4344
- Add Incubation Standard Disclaimer by @yaooqinn in #4911
- [GLUTEN-4835][CORE] Match metric names with Spark by @clee704 in #4834
- [Gluten-4732][CH] delta-mergetree support update/delete/upsert/insert in a more native delta way by @binmahone in #4733
- [GLUTEN-4898][CH]Bug fix to date diff by @KevinyhZou in #4900
- [VL] Daily Update Velox Version(2024_03_11) by @GlutenPerfBot in #4908
- [DOC] Update release & configuration doc by @PHILO-HE in #4910
- [VL] Support lead window function by @ulysses-you in #4902
- [VL] Fix protobuf configure arguments in get_velox.sh by @liujiayi771 in #4920
- [Gluten-4918][CH]support CTAS for clickhouse table by @binmahone in #4919
- [GLUTEN-4926][CELEBORN] CelebornShuffleManager should remove shuffleId from columnarShuffleIds after unregistering shuffle by @SteNicholas in #4927
- [Gluten-4912][CH]Support Specifying columns in clickhouse tables to b… by @binmahone in #4925
- [Gluten-4706] [CH][CORE] Add a mode to execute count distinct directly instead o… by @binmahone in #4708
- [VL] Daily Update Velox Version (2024_03_12) by @GlutenPerfBot in #4923
- [GLUTEN-4914][CH] Fix exceptions in ASTParser by @taiyang-li in #4916
- [DOC] Minor fix for wrong gluten folder used in doc by @leoluan2009 in #4938
- [VL] Refine log plan/split json into one line by @Yohahaha in #4934
- [VL] Support posexplode function and code refactoring on GenerateExecTransformer by @marin-ma in #4901
- [CORE] Prior to #4893, add vanilla Spark's original scan source code to keep git history by @zhztheplayer in #4931
- [VL] Fix wrong plan equality due to case class inheritance by @zhztheplayer in #4893
- [GLUTEN-3559][VL] enable more sql query tests for Spark34 by @zhouyuan in #4880
- [VL] Daily Update Velox Version (2024_03_13) by @GlutenPerfBot in #4944
- [VL]Bucket join support for Iceberg tables by @SinghAsDev in #4859
- [GLUTEN-4827][UT] Add Golden Files for TPC-H Spark34 + Gluten Execution Plan by @zwangsheng in #4828
- [VL] Verify unhex has been offloaded to native successfully by @Yohahaha in #4937
- [VL] Support skewness aggregate function by @liujiayi771 in #4939
- [GLUTEN-1632][CH]Daily Update Clickhouse Version (20240314) by @lwz9103 in #4948
- [VL] parquet file metadata columns support in velox by @gaoyangxiaozhu in #3870
- [VL] Daily Update Velox Version (2024_03_14) by @GlutenPerfBot in #4949
- [VL] Untangle code of TransformPreOverrides by @zhztheplayer in ht...
v1.2.0-rc1
What's Changed
- [CORE] Move all columnar rules to post-columnar transitions by @zhztheplayer in #4790
- [GLUTEN-4398][FOLLOW] Mask PullOutPostProject and PullOutPreProject id by @zwangsheng in #4815
- [GLUTEN-2956][VL] Support Spark NullType by @PHILO-HE in #2996
- [CORE] Add logical link to rewritten spark plan by @ulysses-you in #4817
- [GLUTEN-4803][UT] Add Golden Files for TPC-H Spark33 + Gluten Execution Plan by @zwangsheng in #4804
- [VL] Allow replacing installed minio package by @PHILO-HE in #4825
- [VL] Daily Update Velox Version (2024_03_01) by @GlutenPerfBot in #4821
- [VL] Enable more tests of GlutenParquetIOSuite for Spark32/33/34 by @Yohahaha in #4823
- [GLUTEN-1632][CH]Daily Update Clickhouse Version (20240302) by @lwz9103 in #4837
- [GLUTEN-4039][VL] support map_keys and map_values by @konjac in #4826
- [GLUTEN-4424][CORE] Upgrade spark version to 3.5.1 in Gluten by @JkSelf in #4822
- [VL] Daily Update Velox Version (2024_03_04) by @GlutenPerfBot in #4841
- [GLUTEN-4813] Replace resize/reserve to resize_extact/reserve_exact to reduce memory by @taiyang-li in #4824
- [GLUTEN-1632][CH]Daily Update Clickhouse Version (20240305) by @lwz9103 in #4849
- [VL] Fix boost installation issue and remove useless QueryCtx by @PHILO-HE in #4850
- [VL] Enable "parquet v2 pages - delta encoding" test for Spark33/Spark34 by @Yohahaha in #4816
- [CORE] Support FileSourceScanExec driver metrics for spark3.4/3.5 by @zhli1142015 in #4848
- [GLUTEN-4772][VL] Support empty map/array literal by @WangGuangxin in #4771
- [GLUTEN-4860][CELEBORN] Replace celeborn link by @kerwin-zk in #4861
- [VL][CI] Fix CI failure related to Celeborn by @PHILO-HE in #4862
- [CORE] Support In list option contains non-foldable expression by @ulysses-you in #4843
- [VL] Daily Update Velox Version (2024_03_05) by @GlutenPerfBot in #4852
- [VL] Enable more tests in GlutenParquetQuerySuite for Spark32/33/34 by @Yohahaha in #4854
- [CORE] ColumnarShuffleExchangeExec should respect advisoryPartitionSize for Spark3.5 by @ulysses-you in #4865
- [GLUTEN-4853][CORE] Only trim Alias when its child is semantically equal to resAttr by @liujiayi771 in #4857
- [VL] minor change for delta ut by @zhli1142015 in #4869
- [VL] Add libsodium.so to thirdparty lib for CentOS8 by @kerwin-zk in #4870
- [VL] Updated documentation, refactoring and added more testcases for BNLJ by @Surbhi-Vijay in #4782
- [VL] Daily Update Velox Version (2024_03_06) by @GlutenPerfBot in #4868
- [MINOR] Remove ExtendedAnalysisException by @PHILO-HE in #4864
- [GLUTEN-4831][VL] Support StructType in HashAggregate by @WangGuangxin in #4832
- [VL] Support inline function by @marin-ma in #4847
- [VL] Add flushable decimal sum test case by @liujiayi771 in #4871
- [CORE] Add synchronized for ExplainUtils processPlan by @ulysses-you in #4876
- [VL] Rewrite collect_set and collect_list aggregate function by @ulysses-you in #4805
- [VL] Fix and use flattenVector by @marin-ma in #4783
- [VL] Enable tests of ParquetPartitionDisconverySuite for Spark33/34 by @Yohahaha in #4881
- [CORE] Minor adjustment to columnar rule list, and move all columnar sub-rules to one source folder by @zhztheplayer in #4863
- [VL] Merge Partial and PartialMerge logic in generateMergeCompanionNode by @liujiayi771 in #4883
- [CORE] Fix Spark-3.5 CI by @ulysses-you in #4886
- [GLUTEN-4424][CORE] Follow up upgrading spark version to 3.5.1 by @JkSelf in #4845
- Add .asf.yml by @yaooqinn in #4892
- Update Vulnerability Handling Process by @yaooqinn in #4894
- [VL] Daily Update Velox Version (2024_03_07) by @GlutenPerfBot in #4877
- [GLUTEN-1632][CH]Daily Update Clickhouse Version (20240308) by @lwz9103 in #4890
- [CORE] ColumnarBroadcastExchangeExec should set/cancel with job tag for Spark3.5 by @ulysses-you in #4882
- [VL] Daily Update Velox Version (2024_03_08) by @GlutenPerfBot in #4895
- [VL] Pass partition id to velox functions by @zhli1142015 in #4344
- Add Incubation Standard Disclaimer by @yaooqinn in #4911
- [GLUTEN-4835][CORE] Match metric names with Spark by @clee704 in #4834
- [Gluten-4732][CH] delta-mergetree support update/delete/upsert/insert in a more native delta way by @binmahone in #4733
- [GLUTEN-4898][CH]Bug fix to date diff by @KevinyhZou in #4900
- [VL] Daily Update Velox Version(2024_03_11) by @GlutenPerfBot in #4908
- [DOC] Update release & configuration doc by @PHILO-HE in #4910
- [VL] Support lead window function by @ulysses-you in #4902
- [VL] Fix protobuf configure arguments in get_velox.sh by @liujiayi771 in #4920
- [Gluten-4918][CH]support CTAS for clickhouse table by @binmahone in #4919
- [GLUTEN-4926][CELEBORN] CelebornShuffleManager should remove shuffleId from columnarShuffleIds after unregistering shuffle by @SteNicholas in #4927
- [Gluten-4912][CH]Support Specifying columns in clickhouse tables to b… by @binmahone in #4925
- [Gluten-4706] [CH][CORE] Add a mode to execute count distinct directly instead o… by @binmahone in #4708
- [VL] Daily Update Velox Version (2024_03_12) by @GlutenPerfBot in #4923
- [GLUTEN-4914][CH] Fix exceptions in ASTParser by @taiyang-li in #4916
- [DOC] Minor fix for wrong gluten folder used in doc by @leoluan2009 in #4938
- [VL] Refine log plan/split json into one line by @Yohahaha in #4934
- [VL] Support posexplode function and code refactoring on GenerateExecTransformer by @marin-ma in #4901
- [CORE] Prior to #4893, add vanilla Spark's original scan source code to keep git history by @zhztheplayer in #4931
- [VL] Fix wrong plan equality due to case class inheritance by @zhztheplayer in #4893
- [GLUTEN-3559][VL] enable more sql query tests for Spark34 by @zhouyuan in #4880
- [VL] Daily Update Velox Version (2024_03_13) by @GlutenPerfBot in #4944
- [VL]Bucket join support for Iceberg tables by @SinghAsDev in #4859
- [GLUTEN-4827][UT] Add Golden Files for TPC-H Spark34 + Gluten Execution Plan by @zwangsheng in #4828
- [VL] Verify unhex has been offloaded to native successfully by @Yohahaha in #4937
- [VL] Support skewness aggregate function by @liujiayi771 in #4939
- [GLUTEN-1632][CH]Daily Update Clickhouse Version (20240314) by @lwz9103 in #4948
- [VL] parquet file metadata columns support in velox by @gaoyangxiaozhu in #3870
- [VL] Daily Update Velox Version (2024_03_14) by @GlutenPerfBot in #4949
- [VL] Untangle code of TransformPreOverrides by @zhztheplayer in ht...
v1.2.0-rc0
What's Changed
- [CORE] Move all columnar rules to post-columnar transitions by @zhztheplayer in #4790
- [GLUTEN-4398][FOLLOW] Mask PullOutPostProject and PullOutPreProject id by @zwangsheng in #4815
- [GLUTEN-2956][VL] Support Spark NullType by @PHILO-HE in #2996
- [CORE] Add logical link to rewritten spark plan by @ulysses-you in #4817
- [GLUTEN-4803][UT] Add Golden Files for TPC-H Spark33 + Gluten Execution Plan by @zwangsheng in #4804
- [VL] Allow replacing installed minio package by @PHILO-HE in #4825
- [VL] Daily Update Velox Version (2024_03_01) by @GlutenPerfBot in #4821
- [VL] Enable more tests of GlutenParquetIOSuite for Spark32/33/34 by @Yohahaha in #4823
- [GLUTEN-1632][CH]Daily Update Clickhouse Version (20240302) by @lwz9103 in #4837
- [GLUTEN-4039][VL] support map_keys and map_values by @konjac in #4826
- [GLUTEN-4424][CORE] Upgrade spark version to 3.5.1 in Gluten by @JkSelf in #4822
- [VL] Daily Update Velox Version (2024_03_04) by @GlutenPerfBot in #4841
- [GLUTEN-4813] Replace resize/reserve to resize_extact/reserve_exact to reduce memory by @taiyang-li in #4824
- [GLUTEN-1632][CH]Daily Update Clickhouse Version (20240305) by @lwz9103 in #4849
- [VL] Fix boost installation issue and remove useless QueryCtx by @PHILO-HE in #4850
- [VL] Enable "parquet v2 pages - delta encoding" test for Spark33/Spark34 by @Yohahaha in #4816
- [CORE] Support FileSourceScanExec driver metrics for spark3.4/3.5 by @zhli1142015 in #4848
- [GLUTEN-4772][VL] Support empty map/array literal by @WangGuangxin in #4771
- [GLUTEN-4860][CELEBORN] Replace celeborn link by @kerwin-zk in #4861
- [VL][CI] Fix CI failure related to Celeborn by @PHILO-HE in #4862
- [CORE] Support In list option contains non-foldable expression by @ulysses-you in #4843
- [VL] Daily Update Velox Version (2024_03_05) by @GlutenPerfBot in #4852
- [VL] Enable more tests in GlutenParquetQuerySuite for Spark32/33/34 by @Yohahaha in #4854
- [CORE] ColumnarShuffleExchangeExec should respect advisoryPartitionSize for Spark3.5 by @ulysses-you in #4865
- [GLUTEN-4853][CORE] Only trim Alias when its child is semantically equal to resAttr by @liujiayi771 in #4857
- [VL] minor change for delta ut by @zhli1142015 in #4869
- [VL] Add libsodium.so to thirdparty lib for CentOS8 by @kerwin-zk in #4870
- [VL] Updated documentation, refactoring and added more testcases for BNLJ by @Surbhi-Vijay in #4782
- [VL] Daily Update Velox Version (2024_03_06) by @GlutenPerfBot in #4868
- [MINOR] Remove ExtendedAnalysisException by @PHILO-HE in #4864
- [GLUTEN-4831][VL] Support StructType in HashAggregate by @WangGuangxin in #4832
- [VL] Support inline function by @marin-ma in #4847
- [VL] Add flushable decimal sum test case by @liujiayi771 in #4871
- [CORE] Add synchronized for ExplainUtils processPlan by @ulysses-you in #4876
- [VL] Rewrite collect_set and collect_list aggregate function by @ulysses-you in #4805
- [VL] Fix and use flattenVector by @marin-ma in #4783
- [VL] Enable tests of ParquetPartitionDisconverySuite for Spark33/34 by @Yohahaha in #4881
- [CORE] Minor adjustment to columnar rule list, and move all columnar sub-rules to one source folder by @zhztheplayer in #4863
- [VL] Merge Partial and PartialMerge logic in generateMergeCompanionNode by @liujiayi771 in #4883
- [CORE] Fix Spark-3.5 CI by @ulysses-you in #4886
- [GLUTEN-4424][CORE] Follow up upgrading spark version to 3.5.1 by @JkSelf in #4845
- Add .asf.yml by @yaooqinn in #4892
- Update Vulnerability Handling Process by @yaooqinn in #4894
- [VL] Daily Update Velox Version (2024_03_07) by @GlutenPerfBot in #4877
- [GLUTEN-1632][CH]Daily Update Clickhouse Version (20240308) by @lwz9103 in #4890
- [CORE] ColumnarBroadcastExchangeExec should set/cancel with job tag for Spark3.5 by @ulysses-you in #4882
- [VL] Daily Update Velox Version (2024_03_08) by @GlutenPerfBot in #4895
- [VL] Pass partition id to velox functions by @zhli1142015 in #4344
- Add Incubation Standard Disclaimer by @yaooqinn in #4911
- [GLUTEN-4835][CORE] Match metric names with Spark by @clee704 in #4834
- [Gluten-4732][CH] delta-mergetree support update/delete/upsert/insert in a more native delta way by @binmahone in #4733
- [GLUTEN-4898][CH]Bug fix to date diff by @KevinyhZou in #4900
- [VL] Daily Update Velox Version(2024_03_11) by @GlutenPerfBot in #4908
- [DOC] Update release & configuration doc by @PHILO-HE in #4910
- [VL] Support lead window function by @ulysses-you in #4902
- [VL] Fix protobuf configure arguments in get_velox.sh by @liujiayi771 in #4920
- [Gluten-4918][CH]support CTAS for clickhouse table by @binmahone in #4919
- [GLUTEN-4926][CELEBORN] CelebornShuffleManager should remove shuffleId from columnarShuffleIds after unregistering shuffle by @SteNicholas in #4927
- [Gluten-4912][CH]Support Specifying columns in clickhouse tables to b… by @binmahone in #4925
- [Gluten-4706] [CH][CORE] Add a mode to execute count distinct directly instead o… by @binmahone in #4708
- [VL] Daily Update Velox Version (2024_03_12) by @GlutenPerfBot in #4923
- [GLUTEN-4914][CH] Fix exceptions in ASTParser by @taiyang-li in #4916
- [DOC] Minor fix for wrong gluten folder used in doc by @leoluan2009 in #4938
- [VL] Refine log plan/split json into one line by @Yohahaha in #4934
- [VL] Support posexplode function and code refactoring on GenerateExecTransformer by @marin-ma in #4901
- [CORE] Prior to #4893, add vanilla Spark's original scan source code to keep git history by @zhztheplayer in #4931
- [VL] Fix wrong plan equality due to case class inheritance by @zhztheplayer in #4893
- [GLUTEN-3559][VL] enable more sql query tests for Spark34 by @zhouyuan in #4880
- [VL] Daily Update Velox Version (2024_03_13) by @GlutenPerfBot in #4944
- [VL]Bucket join support for Iceberg tables by @SinghAsDev in #4859
- [GLUTEN-4827][UT] Add Golden Files for TPC-H Spark34 + Gluten Execution Plan by @zwangsheng in #4828
- [VL] Verify unhex has been offloaded to native successfully by @Yohahaha in #4937
- [VL] Support skewness aggregate function by @liujiayi771 in #4939
- [GLUTEN-1632][CH]Daily Update Clickhouse Version (20240314) by @lwz9103 in #4948
- [VL] parquet file metadata columns support in velox by @gaoyangxiaozhu in #3870
- [VL] Daily Update Velox Version (2024_03_14) by @GlutenPerfBot in #4949
- [VL] Untangle code of TransformPreOverrides by @zhztheplayer in ht...
v1.1.1
Release Notes - Gluten - Version 1.1.1
We are pleased to announce that Gluten has been accepted as an Apache Incubating project. Additionally, we are excited to unveil the release of Gluten-1.1.1. This version marks the final release before our transition to Apache.
Highlights (Velox backend only)
- Support Spark 3.2, 3.3, and 3.4(API only)
- Support 30 common Spark Operators
- Support 220 common Spark Functions
- Velox codebase updated to 2024/02/29
- Refactor Data Lake API to support Delta Lake Scan and Iceberg read COW table
- Better S3, GCS support
- More stability in Spill support
- Enhance metric support for spill, shuffle, and additional metrics.
- Enhance fallback case support by expanding coverage for missing cases and updating messages accordingly
- Enhance Shuffle including merge before compressing, push based shuffle, and more
- More Bug Fixing
What's Changed
- [GLUTEN-3855][VL] Fix ORC related failed UT by @chenxu14 in #3805
- [VL] Support IsNull filter pushdown by @rui-mo in #3791
- [VL] Update velox-backend-limitations.md by @FelixYBW in #3639
- [GLUTEN-2169][VL] Enable GlutenEnsureRequirementsSuite in unit tests by @JkSelf in #3860
- [CH] Fix exception of pb MessageToJsonString by @exmy in #3823
- [GLUTTEN-3851][VL] Add remaining filter time metric by @zhli1142015 in #3852
- [VL] Support ignoreNulls for NthValue window function by @PHILO-HE in #3857
- [VL] Enable using static link for QAT by @marin-ma in #3863
- [VL] Fix assertion failures when mixing use of partial aggregation spilling and flushing by @zhztheplayer in #3872
- [GLUTEN-3796][VL][FOLLOW_UP] Correct test name match and move black list to exclude in VeloxTestSettings by @zwangsheng in #3874
- [GLUTEN-3528][VL] Construct unique & non-overlapping partition/sort keys for window operator by @PHILO-HE in #3883
- [GLUTEN-3879][CH] salt 1% of TPCH-1 data to NULL instead of 10% by @binmahone in #3880
- [VL] Doc refresh by @zhouyuan in #3882
- [GLUTEN-3865][CH] Refactor aggregating without keys by @lgbo-ustc in #3866
- [GLUTEN-3722][CH] Improve shuffle writer by @taiyang-li in #3728
- [VL] Map date_format to a Velox function name by @PHILO-HE in #3878
- [VL]Daily Update Velox Version (20231129) by @yma11 in #3877
- [CORE] Add InputIteratorTransformer to decouple ReadRel and iterator index by @ulysses-you in #3854
- [GLUTEN-3732][VL] Use arrow result-returning variants FileWriter::Open API by @yangzhg in #3733
- [CORE] Move validate methods from TransformerApi to ValidatorApi by @exmy in #3881
- [GLUTEN-3824][CH]Bug fix hdfs path contains space by @KevinyhZou in #3825
- [GLUTEN-1632][CH]Daily Update Clickhouse Version (20231201) by @lwz9103 in #3898
- [VL] Break up spilling operation to two phases: shrink phase and spill phase by @zhztheplayer in #3895
- [GLUTEN-1699][VL] Support loadLibFromJar on RedHat 7/8 by @ychris78 in #3893
- [GLUTEN-3906] [VL] fix: fix package.sh failed for x86 by @lzjqsdd in #3907
- [GLUTEN-3750][CH]Bug fix json parse error by @KevinyhZou in #3751
- [GLUTEN-3902][VL] Add documentation to configure the Velox+GCS connector by @tigrux in #3902
- [DOC] Revise Gluten document by @PHILO-HE in #3892
- [VL]Daily Update Velox Version (20231203) by @yma11 in #3913
- [VL] Minor improvements for CI stale bot by @zhztheplayer in #3888
- [VL] Avoid reapplying code patches for external projects when ENABLE_EP_CACHE=ON by @zhztheplayer in #3916
- [VL] minor change for fallback log by @zhli1142015 in #3919
- [VL] Add sort merge join metrics by @ulysses-you in #3920
- [GLUTEN-3378][CORE] Datasource V2 data lake read support by @liujiayi771 in #3843
- [VL] ENABLE_EP_CACHE=ON still uses cached Velox build although the build arguments were changed by @zhztheplayer in #3926
- [VL] Make bloom_filter_agg fall back when might_contain is not transformable by @zhli1142015 in #3917
- [VL][CI] update docker build script by @zhouyuan in #3904
- [GLUTEN-3917][FOLLOWUP] Add back SparkShimLoader import by @ulysses-you in #3940
- [VL] Fix VeloxTPCHV1BhjSuite and VeloxTPCHV2Suite useV1SourceList by @liujiayi771 in #3930
- [VL] Fix syntax error in stale.yml by @zhztheplayer in #3945
- [GLUTEN-3854][CORE][FOLLOWUP] Add ColumnarInputAdapter back to recover UI graph by @ulysses-you in #3933
- [GLUTEN-1632][CH]Daily Update Clickhouse Version (20231206) by @lwz9103 in #3938
- [VL] Add output row metric for InputIteratorTransformer by @Yohahaha in #3939
- [GLUTEN-3927][CH] Improve the performance of element_at by @taiyang-li in #3928
- [GLUTEN-3908][CH] Improve shuffle split for clickhouse backend by remove ColumnNullable's memcmp by @KevinyhZou in #3909
- [GLUTEN-3924][CORE] Match hive UDF name in case-insensitive mode during expression transformation by @taiyang-li in #3925
- [GLUTEN-3958] Use getDeclaredConstructor().newInstance() in ScanTransformerFactory by @liujiayi771 in #3961
- [GLUTEN-3944][CH]Fix gluten.jar with delta20 when use spark 3.3 by @lwz9103 in #3947
- [VL] gluten-te: In dockerfiles, use symbolic link for /opt/velox by @zhztheplayer in #3946
- [VL]Daily Update Velox Version (20231206) by @yma11 in #3954
- Revert "[GLUTEN-3908][CH] Improve shuffle split for clickhouse backend by remove ColumnNullable's memcmp" by @baibaichen in #3965
- [GLUTEN-3890][CH] Respect spill_threshold for all buffers in shuffle writer by @taiyang-li in #3891
- [CORE] Fix wrong fallback cost by @ulysses-you in #3967
- [GLUTEN-3922][CH] Fix incorrect shuffle hash id value when executing modulo by @zzcclp in #3923
- [VL] quick fix for static build git conflict by @zhouyuan in #3971
- [GLUTEN-3486][CH] Fix AQE cannot coalesce shuffle partitions by @exmy in #3941
- [GLUTEN-3949][CH] Merge small blocks from upstream phase into a large one by @lgbo-ustc in #3952
- [GLUTEN-3948][CH] Fix exception and diff of trunc function by @exmy in #3968
- [GLUTEN-3979][CORE] Use exists() instead of map().exists() to improve code readability by @dcoliversun in #3980
- [VL]Daily Update Velox Version (20231208) by @yma11 in #3973
- Revert "[VL] Make bloom_filter_agg fall back when might_contain is not transformable (#3917)" by @loneylee in #3977
- [GLUTEN-3580][VL] support read data from abfs with account key by @gaoyangxiaozhu in #3897
- [GLUTEN-3991][CH] Fix the incorrect display name for the mergetree file format by @zzcclp in #3992
- [VL] gluten-te: Enable BuildKit to support --cache-from by @zhztheplayer in #3964
- [GLUTEN-3841][CH] Support spill in 2nd aggregate stage by @lgbo-ustc in #3772
- [VL] Daily Update Velox Version (20231211) by @zhztheplayer in #3999
- [VL] Fix StringToMap test failure by @PHILO-HE in #3995
- [VL] Make bloom_filter_agg fall back when might_contain is not transformable by @zhli1142015 in #3994
- [VL] Following #3996, fix CI error "Runtime factory already registered" by @zhztheplayer in #4001
- [VL] Fix linking simdjson error when building benchmark by @PHILO-HE in #3960
- [GLUTEN-4002][CH] Update InputIteratorTransformer metrics by @zzcclp in https://github.com/...
Gluten v1.1.0
Release Notes - Gluten - Version 1.1.0
We are excited to announce the release of Gluten-1.1.0.
This version is the culmination of work from 45 contributors, who delivered features and bug fixes across more than 800 commits since 1.0.0.
Highlights (Velox backend only)
- 20% performance improvement in Decision Support Benchmarks compared to v1.0.0
- Support Spark 3.2 and Spark 3.3
- Support Spark 3.4 (experimental)
- Pass all Velox UTs and Spark 3.2/3.3 SQL-related UTs
- Support Ubuntu 20.04/22.04, CentOS 7/8, alinux 3, Anolis 7/8
- Support File System: localfs, HDFS, S3, OSS(via s3a), GCS
- Support File Format: Parquet, ORC
- Support Data Lake: deltalake (experimental)
- Support Data Types: Primitive Type, Decimal, Date, Timestamp, Array (partial), Map (partial), Struct (partial)
- Support 28 common Spark Operators, detail here
- Support 199 common Spark Functions, detail here
- Support Dynamic Memory Pool and Spill
- Support Velox UDF
- Support Gluten UI to print fallback event in History Server
- Support Hadoop HA and Kerberos
- Velox code updated to 20231123(commit-id: aff0cde)
- Document improvement for support features and configuration
Known Issues
- Only static partition write is supported in Spark 3.2 and 3.3
New Features
#3722 | [CH] improve mutex usage in shuffle writer |
#2063 | [CH] Spark sql config load dynamic by task |
#3257 | [VL] We may need more metrics collected by Velox |
#3528 | [VL] Construct unique partition/sort keys and removing overlapping sort key for window plan |
#3381 | [CH]Reuse last WholeStageTransformer instead of creating new one in FileFormatWriter |
#2118 | [CH] Support hive udtf |
#2128 | [CH]Support tablesample clause |
#2163 | [CH] support approx_percentile aggregate function |
#2193 | [CH] Support some array functions |
#2207 | [CH] Support function to_utc_timestamp/from_utc_timestamp |
#2136 | [CH] HiveTransform add metrics readBytes |
#2439 | [VL] array_aggregate support with lambda function |
#2451 | [CH] Support StaticInvoke function |
#2460 | Avoid force check Java thread in native side |
#2465 | Remove operator level fallback policy |
#2472 | [CH] Remove BasicScanExecTransformer#getInputFilePaths when CH support more general partition location parsing |
#3187 | [CH] Implement runtime native bloom filter |
#2267 | [CH] Support urldecoder which is used in reflect("java.net.URLDecoder", "decode", event.event_info['currenturl'], "UTF-8") |
#2309 | Implement Streaming Window in Velox backend to reduce the memory usage. |
#2323 | [CH] Build optimization |
#2343 | [VL] ShuffleWrite: Larger shuffle size than vanilla spark and long compression time |
#2365 | [CH] gluten should support setting max bytes for a partition for orc/parquet |
#2390 | [CH] Aligning the NULL and NaN compare semantics of Spark and CH |
#2600 | [CH] enhance S3 client caching |
#2617 | [VL][Spark 3.3+] support pushdown aggregate to native scan instead of fallback |
#2619 | [VL][Spark 3.3+] support match columns use fieldIds in native instead of fallback |
#2667 | [VL] Stacktrace-categorized memory allocation dumping for debugging |
#2730 | Request for documentation on how to write a backend for 3rd party engines |
#2761 | [DOC] A doc named index.md share same content with README.md |
#2772 | [VL] When performance degradation,What factors may affect the performance? |
#2783 | [VL]Run CI with DEBUG build mode to enhance stability |
#2791 | [VL] Support spark function: concat_ws |
#2793 | Code refactor: move some common code to a root module named common |
#2807 | Code cleanup: FunctionConfig may be useless |
#2515 | when we will support spark -gpu ,now we need spark -gpu feature to train big model |
#2535 | UnsupportedOperationException is abused |
#2593 | List parquet write semantic differents in Spark and gluten |
#2804 | Handle timeZoneId for TimezoneAwareExpression |
#2815 | [VL] complex data type support in parquet scan |
#2825 | [VL] In Java, consolidate GlutenColumnarBatchSerializer and CelebornColumnarBatchSerializer |
#2826 | [VL] Use a dedicate class to maintain gluten native config |
#2845 | [VL] Separate each jni wrapper to different files |
#2874 | [VL] support spark.sql.decimalOperations.allowPrecisionLoss |
#2877 | [VL] Support read iceberg |
#2905 | [VL] Support percentile function |
#2919 | [VL] Support ORC format in HiveTableScanExecTransformer |
#2956 | [VL] Support NullType in Project |
#2975 | [VL] Track MemoryManager feature |
#3015 | [CH] ReusedExchange: Gluten does not touch it or does not support it |
#3017 | [VL] Allow users to set spill partitions/levels |
#3033 | [CH] Support aggregation spill for the second stage |
#3049 | [CORE] Statement level controls whether to use gluten |
#3817 | [CH] Optimize mergetree prewhere |
#3704 | [CH] support tuple subcolumn pruning for orc/parquet |
#3784 | DNM |
#3144 | [CH] Aggregation supports complicate type |
#3715 | [VL] Add support for GCS |
#2106 | [VL] CI: allow to benchmark TPCH performance on comment |
#3702 | [VL] Add sort based window support in velox backend |
#2404 | [VL] Enable Velox memory reclaimer for auto disk-spilling |
#3082 | [CORE] Support columnar CollectLimit |
#3739 | [VL] Add config to disable velox file handle cache |
#3055 | [VL] Use mixed memory (off-heap and on-heap) for native |
#3077 | [VL] EP: Centralized lifecycle management for C++ / JNI contextual objects |
#3142 | [VL] Tight Java-C++ object binding |
#3075 | [VL] Support static partition write in VL backend |
#2533 | Degrade Arrow version to 8.0 in VL backend. |
#2629 | Use Project + Unnest to implement Expand operator |
#3132 | Add streamingwindow support in velox backend |
#3361 | Support Spark 3.4 in Gluten. |
#3425 | [VL] Create Hdfs folder in Gluten side when writing hdfs file |
#3541 | [VL] Add minimal GHA CI job for debug build |
[#3705](https://... |
Gluten v1.0.0
Release Notes - Gluten - Version 1.0.0
Highlights (Velox backend only)
- Support Spark 3.2 and Spark3.3
- Pass all Velox and Spark3.2 UTs, and part of the Spark3.3 UTs
- Support Ubuntu 20.04/22.04, CentOS 7/8, alinux 3, Anolis 7/8
- Support FileSystem: localfs, HDFS, S3, OSS (via s3a)
- Support data types: Primitive type, Decimal, Date, Timestamp
- Support 20 operators, detail here
- Support 164 functions, detail here
- Support native Parquet write
- Support native ORC read
- Support Intel® In-memory Analytics Accelerator (IAA/IAX) hardware accelerator in Shuffle compression
- Support cap-based spill (static memory allocation) for join/agg/sort operator (experimental feature)
- Support static build method via vcpkg
- Support local cache (experimental feature)
- 2.71x speedup in Decision Support Benchmark1 (TPC-H Like) testing
- 2.29x speedup in Decision Support Benchmark2 (TPC-DS Like) testing
- Velox code updated to commit
- Document improvement for support features and configuration
Known Issues
- Parquet write only supports the compression.codec, parquet.block.size and parquet.block.rows configurations
- Velox backend does not support dynamic partition write and bucket write
- Spill may throw OutOfMemoryException
New Features
- [GLUTEN-1243][VL] Support bit_xor aggregate function
- [GLUTEN-1245][VL][Feat] Add VeloxParquetFileFormat to support parquet write in velox backend
- [GLUTEN-1270][VL][Feat] Support multiple HDFS endpoints
- [GLUTEN-1306][VL] feat: Link static depends via vcpkg
- [GLUTEN-1306][FOLLOWUP] vcpkg setup script add alinux3 support
- [GLUTEN-1346][VL] Support native velox row to column
- [GLUTEN-1367] Support running gluten on anolis
- [GLUTEN-1371][VL] Support First/Last aggregate functions
- [GLUTEN-1374][VL] RangePartitioning supports velox columnar batch
- [GLUTEN-1409][VL] feat: Support named_struct in Velox backend
- [GLUTEN-1476][VL] Support GetStructField
- [GLUTEN-1478] Support ordered result check for MapData
- [GLUTEN-1490] refactor substrait literals using generics, and support map/struct/array literals based on it
- [GLUTEN-1521][Core] Support to add the customer columnar rules by config
- [GLUTEN-1623][VL] Support asinh, acosh, atanh, sec, csc math functions for Velox backend
- [GLUTEN-1638][VL] feat: Add hdfs support in parquet write
- [GLUTEN-1640] Support judging whether the execution plan has a fallback
- [GLUTEN-1654][VL] support approx_count_distinct for velox
- [GLUTEN-1658][CORE] feat: Support SparkResourcesUtil.scala in k8s
- [GLUTEN-1662][VL] feat: Support InsertIntoHiveDirCommand in velox parquet write
- [GLUTEN-1704][VL] Support metrics on splits and row groups by
- [GLUTEN-1794][VL] support split preload
- [GLUTEN-1860] StructLiteral support null literal
- [CORE] Support submit subqueries concurrently to improve scalar subquery performance
- [VL] package.sh support centos7 and centos8
- [VL] feat: support partial merge phase in aggregation
- [VL] package and velox scripts add alinux support
- [VL] feat: support more distinct functions
- [VL] Support mocking map stage with no input files in micro benchmark
- [VL] add support for reading ORC
- [VL] add long decimal type support for Orc file format
Improvements
- [GLUTEN-842][VL] convert expand op to expand exec in velox
- [GLUTEN-842] remove group id transformer
- [GLUTEN-1108][VL] Init NativeRowToColumnarJniWrapper with memory pool and schema
- [GLUTEN-1199] Avoid throwing exception from destructor of JavaInputStreamAdaptor
- [GLUTEN-1205][VL] Rename some class name and dir name for columnar sh…
- [GLUTEN-1205][VL] Refactor shuffle partition writer
- [GLUTEN-1205][VL] Refactor shuffle partitioner
- [GLUTEN-1205][VL][FOLLOWUP] Refactor shuffle partition writer
- [GLUTEN-1209][VL] refactor: Refactor Java Celeborn into an independent module
- [GLUTEN-1296][VL] Remove some logs in CI
- [GLUTEN-1325][VL] Optimize decimal arithmetic
- [GLUTEN-1331][CORE] Enable some functions
- [GLUTEN-1336][VL] add spark3.3 UT under connector and expression
- [GLUTEN-1336][VL] move Spark3.3 Unit tests to separate job
- [GLUTEN-1336][VL] add more spark3.3 UT
- [GLUTEN-1336][VL] CI: move slow tests into another job for Spark3.3
- [GLUTEN-1357][CORE] Change soft-affinity log level from INFO to DEBUG
- [GLUTEN-1369][Core] Move config 'spark.gluten.enabled' to GlutenConfig from QueryPlanSelector
- [GLUTEN-1393][VL] feat: Change velox pipeline input from arrow to velox ValueStreamNode
- [GLUTEN-1407] Let profile control shim version
- [GLUTEN-1416][VL] NoSuchMethodError from shaded Arrow
- [GLUTEN-1433][VL] feat: offload timestamp scan to Velox - phase 1
- [GLUTEN-1433][VL] Enable GlutenStatisticsCollectionSuite
- [GLUTEN-1434][VL] Delete some unused files and functions
- [GLUTEN-1434][VL] Refactor to add ColumnarBatchIterator
- [GLUTEN-1434][VL] Remove unused arrow code and add GLUTEN_CHECK and GLUTEN_DCHECK
- [GLUTEN-1458][VL][CI] feat: Adding Spark3.3 w/ Ubuntu22.04 test
- [GLUTEN-1476][VL] Enable scan on struct and map types
- [GLUTEN-1476][CORE] Use correct field name in struct type
- [GLUTEN-1478][VL] enable timestamp expression tests
- [GLUTEN-1478] Enable failed UT in GlutenIntervalExpressionsSuite
- [GLUTEN-1478][VL] Enable some spark UTs for cast function
- [GLUTEN-1478][VL] Enable tests on casting from string to decimal
- [GLUTEN-1478][VL] Enable test on casting from decimal to bool
- [GLUTEN-1480][DOC] Refactor to enable github pages
- [GLUTEN-1491][VL][feat] Refine row_number() method in velox backend
- [GLUTEN-1500][VL] feat: Use 0.6 * task memory cap as spill threshold for all spillable operators
- [GLUTEN-1500][VL] Implement OOM cap shared by tasks, and spill threshold shared by tasks and operators
- [GLUTEN-1500][VL] Integrate with Velox arbitration API
- [GLUTEN-1533][VL][Feat] Replace sort agg with gluten hash agg
- [[GLUTEN-1534][VL]](https://github.com/oap-proj...
Gluten 0.5.0
Change log
Generated on 2023-04-07
Gluten 0.5.0
Gluten 0.5.0 is the first preview release from the repository (https://github.com/oap-project/gluten).
In this release, we have merged 971 PRs and fixed 216 issues.
Here are the major highlights of Gluten 0.5.0:
- Support Spark3.2 and Spark3.3
- Support Ubuntu20.04 or later
- Support CentOS7 and 8
- Support JDK8 only
- Support GCC9 or later
- Use Substrait as unified plan
- Use Velox as default backend engine
- Use Celeborn as default RSS
- Support most popular data types including Boolean, Byte, Short, Int, Long, Float, Double, Date, Decimal, String, ...etc.
- Support Spill for Sort, Agg, and Join operators
- Pass all Spark3.2 unit tests
- 2.5x speedup in Decision Support Benchmark1(TPC-H Like) testing
- 2x speedup in Decision Support Benchmark2(TPC-DS Like) testing
- Support Intel QAT accelerators in Shuffle compression
Limitations
- Complex data types such as Array, Map, and Struct are not supported
- OOM may occur in operators that do not support spill
- Decimal results may mismatch in some cases
Features
#974 | [CH] Support string repeat function |
#1008 | [CH] Support locate function |
#1273 | Implement cast decimal to int |
#1223 | [CH] support reading from S3 and using Clickhouse local cache to speed up |
#1131 | [Gluten-core] Add an option to only fallback once |
#1165 | Reduce GC Time when executing BHJ for CH backend. |
#1147 | [Gluten-core]Make validate failure logLevel configurable |
#1100 | Making transformer plan log more obvious |
#1112 | Refactor Gluten metrics and add apis for each backend |
#926 | gluten timezone not the same as backend |
#1039 | Remove compute pid metric in shuffle operator. |
#882 | Selective query execution |
#959 | Upgrade Arrow version to 11.0.0 |
#969 | Docker for gluten running on centos 8 |
#986 | Align and enrich metrics compare to Spark |
#972 | Can we separate native dynamic library from build generated jars? |
#913 | No Spark Shim Provider found for 3.2.0 |
#853 | Support named struct type |
#888 | Clickhouse backend broadcast relation support r2c |
#850 | Add cast check in ExpressionTransformer |
#825 | Setup development environment for macOS |
#788 | Pass needed hadoop conf from driver to executor |
Bugs Fixed
#1284 | Scala double data is wrongly compared with null in a ut |
#729 | Validation failed for GlutenHashAggregateExecTransformer class |
#799 | This operator doesn't support doExecuteColumnar |
#527 | archives for Spark patch versions become unavailable on new releases affecting shims versioning |
#523 | Some basic failed SQL cases |
#1028 | [VL] SusbtraitToVeloxPlan error |
#858 | Sort result mismatch issue with different input records. |
#877 | Array/Map DataType result mismatch issue when containing null value |
#1227 | [CH] Scalar subquery filters execute twice for parquet file |
#1265 | [CH] Rescale decimal trigger fallback |
#1233 | [CH] Fix fallback issue when reading csv files |
#1235 | [CH] Fix missing reading from the broadcasted value when executing DPP |
#1234 | [CH] Fix error 'Invalid number of columns in chunk pushed to OutputPort' when executing hash agg after union all |
#1207 | shims-spark32 and shims-spark33 may be depended on at the same time |
#1161 | Bundle built by buildbundle-veloxbe.sh for Spark3.3 is broken |
#1210 | [CH] Fix the wrong table path of the orders table for TPCH in UT |
#1175 | FileNotFoundException while executing spark jobs -.so files |
#1179 | [VL] CI is failing on boost's checksum |
#1162 | [CK]fix CoaleseBatches metrics |
#1124 | Memory management not suitable with Velox split preload feature. |
#1149 | Run tpc-ds core |
#741 | Handle remainder for the case that its right input is zero |
#1090 | [TPCH][VL] tpch has some query execution error logs but queries could finish and the result is correct |
#1068 | [VL] Managed memory leak in imported Spark UTs |
#772 | Velox does not install folly in centos8 by default, break compile in centos8. |
#789 | Jar conflicts on Arrow and Protobuf between Vanilla Spark and Gluten |
#700 | AARCH64 port of Gluten |
#1027 | [VL] unsupported method |
#1072 | [CH] Fix NPE when executing BatchScanExecTransformer.getInputFilePaths with MergeTree DS V2 |
#489 | cannot build gluten (velox backend) in Amazon Linux 2 |
#1012 | Enable local cache throw exception |
#995 | Fix memory leak for ClickHouse Backend |
#914 | System variables related to Folly could not be found when compiling gluten. |
#990 | Failed to build velox |
#946 | Upgrade arrow version to 10.0.1 |
#860 | CH backend inset result not equals spark result |
#601 | Can't decide data type of null value in gluten test framework, when transforming InternalRow to DataFrame |
#843 | Unable to convert BHJ to SHJ by using hint |
#826 | ch_backend not support inset is empty |
#815 | Gluten + Velox backend does not support Struct dataset with same element name. |
#563 | Error compiling within -Pbackends-xx,spark-3.3,spark-ut |
#560 | An unsupportedOperationException interrupted the query execution |
#770 | VeloxRuntimeError when reading parquet file with only meta data |
#800 | [UT]ExpectedAnswer may not match SparkAnswer when is sorted |
#676 | WholeStageTransformerSuite#logForFailedTest() swallows exceptions |
#790 | Join RuntimeException when having duplicated equal-join keys |
#757 | Parquet scan not offloaded |
#797 | It won't load the libparquet.so.1000 when we use Gluten with Velox backend and run it on the yarn. |
#784 | No Spark Shim Provider found for 3.3.0 |
#547 | Jar conflict issue |
#727 | build from local velox repo doesn't work |
PRs
#1266 | [GLUTEN-1246] [CORE] Fix scale may be negative issue |
#1313 | [VL] Update doc for centos7 install |
#1312 | [CH] Ignore ch backend tpcds suite |
#1198 | [VL] fix: Update Velox setup scripts for centos 7 |
#1294 | [VL] Following #1185, do some clean-ups against Velox + Celeborn CI |
[#1196](https://github.com/oa... |