GaussDB(DWS)性能调优:HashAgg策略优化

【问题背景】

GROUP BY是业务中常见的SQL操作,DWS使用HashAgg算子承载GROUP BY计算逻辑。在不同场景下具体实现行为会有一些差别,主要有以下三种形式。

以如下表定义为例

CREATE TABLE t1(a int, b int, c int) DISTRIBUTE BY HASH(a);

假设agg下层算子所输出结果集的分布列为setA,agg操作的group by列为setB,则在Stream框架下,Agg操作具体有三种实现。

场景1:setA是setB的一个子集

如下,GROUP BY列(setB)是a,HashAggregate底层算子Seq Scan输出数据的分布列(setA)也是a

postgres=# EXPLAIN SELECT a, count(1) FROM t1 GROUP BY a;
 id |          operation           | E-rows | E-width | E-costs  
----+------------------------------+--------+---------+---------
  1 | ->  Streaming (type: GATHER) |     30 |       4 | 15.56   
  2 |    ->  HashAggregate         |     30 |       4 | 14.31   
  3 |       ->  Seq Scan on t1     |     30 |       4 | 14.14   
(3 rows)

场景2:setA不是setB的一个子集

对于这种场景,Stream执行框架分为如下三种计划形态:
planA: hashagg+gather(redistribute)+hashagg
planB:redistribute+hashagg(+gather)
planC:hashagg+redistribute+hashagg(+gather)

GaussDB(DWS)提供了guc参数best_agg_plan来干预执行计划,强制其生成上述对应的执行计划,此参数取值范围为0,1,2,3
•取值为1时,强制生成计划planA。
•取值为2时,强制生成计划planB。
•取值为3时,强制生成计划planC。
•取值为0时,优化器会根据以上三种计划的估算代价选择最优的一种计划生成。

postgres=# SET best_agg_plan TO 1;
SET
postgres=# EXPLAIN SELECT b,count(1) FROM t1 GROUP BY b;
 id |            operation            | E-rows | E-width | E-costs 
----+---------------------------------+--------+---------+---------
  1 | ->  HashAggregate               |      8 |       4 | 15.83   
  2 |    ->  Streaming (type: GATHER) |     25 |       4 | 15.83   
  3 |       ->  HashAggregate         |     25 |       4 | 14.33   
  4 |          ->  Seq Scan on t1     |     30 |       4 | 14.14   
(4 rows)
postgres=# SET best_agg_plan TO 2;
SET
postgres=# EXPLAIN SELECT b,count(1) FROM t1 GROUP BY b;
 id |                operation                | E-rows | E-width | E-costs 
----+-----------------------------------------+--------+---------+---------
  1 | ->  Streaming (type: GATHER)            |     30 |       4 | 15.85   
  2 |    ->  HashAggregate                    |     30 |       4 | 14.60   
  3 |       ->  Streaming(type: REDISTRIBUTE) |     30 |       4 | 14.45   
  4 |          ->  Seq Scan on t1             |     30 |       4 | 14.14   
(4 rows)
postgres=# SET best_agg_plan TO 3;
SET
postgres=# EXPLAIN SELECT b,count(1) FROM t1 GROUP BY b;
 id |                operation                | E-rows | E-width | E-costs 
----+-----------------------------------------+--------+---------+---------
  1 | ->  Streaming (type: GATHER)            |     30 |       4 | 15.84   
  2 |    ->  HashAggregate                    |     30 |       4 | 14.59   
  3 |       ->  Streaming(type: REDISTRIBUTE) |     25 |       4 | 14.59   
  4 |          ->  HashAggregate              |     25 |       4 | 14.33   
  5 |             ->  Seq Scan on t1          |     30 |       4 | 14.14   
(5 rows)

【性能影响分析】

hashagg+redistribute+hashagg的计算方式是

  1. 先对本地结果集进行一次GROUP BY
  2. 把第一步的结果集按照GROUP BY列分布
  3. 重分布后的数据进行二次GROUP BY

假如第一步的GROUP BY的结果集收敛效果好(行数大幅缩减),那么第二步和第三操作的数据量就会大幅降低;假如第一步的GROUP BY的结果集收敛效果不好(行数没有大幅缩减),那么第二步和第三操作的代价和redistribute+hashagg的执行代价差不多

简而言之就是如果GROUP BY的收敛效果好,建议走hashagg+redistribute+hashagg;如果GROUP BY的收敛效果不好,建议走redistribute+hashagg

【优化案例】

场景1:GROUP BY收敛效果好,但走了redistribute+hashagg

这种场景建议使用hint,强制语句走双层agg(hashagg+redistribute+hashagg)的方式提升性能,比如如下用例

SELECT 
    s.reconcile_method, s.trans_entity_bank_acc_id 
FROM (
    SELECT /* 去重之后排序 */
        trans_entity_bank_acc_id, source AS reconcile_method, 
        row_number() OVER(PARTITION BY trans_entity_bank_acc_id ORDER BY statement_date DESC) AS rn
    FROM (
        SELECT  /* 数据去重 */
            tms.trans_entity_bank_acc_id, bank_st.source, bank_st.statement_date   
        FROM dwnodi.dwr_fin_tr_bank_statement_fi bank_st, dwrtretl.dwr_tr_tms_settle_f_tmp6 tms
        WHERE tms.trans_entity_bank_acc_id = bank_st.bank_account_id   
        AND tms.settlement_date <= bank_st.statement_date AND bank_st.del_flag = 'N'
        GROUP BY tms.trans_entity_bank_acc_id, bank_st.source, bank_st.statement_date
    )t
)s WHERE rn = 1;

explain perfromance信息如下,其中GROUP BY的耗时(HashAgg+redistribute)占比为(20063.376-10664.837)/20232.459 = 46%

GaussDB(DWS)性能调优:HashAgg策略优化

如下,在GROUP BY语句对应层级查询语句的SELECT关键字后面加上/*+ set(best_agg_plan 3) */,使用hint信息强制语句走hashagg+redistribute+hashagg执行方式

SELECT 
    s.reconcile_method, s.trans_entity_bank_acc_id 
FROM (
    SELECT /* 去重之后排序 */
        trans_entity_bank_acc_id, source AS reconcile_method, 
        row_number() OVER(PARTITION BY trans_entity_bank_acc_id ORDER BY statement_date DESC) AS rn
    FROM (
        SELECT /*+ set(best_agg_plan 3) */ /* 数据去重 */
            tms.trans_entity_bank_acc_id, bank_st.source, bank_st.statement_date   
        FROM dwnodi.dwr_fin_tr_bank_statement_fi bank_st, dwrtretl.dwr_tr_tms_settle_f_tmp6 tms
        WHERE tms.trans_entity_bank_acc_id = bank_st.bank_account_id   
        AND tms.settlement_date <= bank_st.statement_date AND bank_st.del_flag = 'N'
        GROUP BY tms.trans_entity_bank_acc_id, bank_st.source, bank_st.statement_date
    )t
)s WHERE rn = 1

优化后语句的执行效果如下,其中GROUP BY的耗时(HashAgg+redistribute)占比为((13067.651-10274.636)/13154.634 = 21%。其中可以明显看到重分布时间(stream算子)明显缩短

GaussDB(DWS)性能调优:HashAgg策略优化

场景2:GROUP BY收敛效果不好,但走了hashagg+redistribute+hashagg

这种场景建议使用hint,强制语句走单层agg(redistribute+hashagg)的方式提升性能,比如

原创文章,作者:奋斗,如若转载,请注明出处:https://blog.ytso.com/tech/bigdata/316842.html

(0)
上一篇 6天前
下一篇 6天前

相关推荐

发表回复

登录后才能评论