GaussDB(DWS)性能调优:row_number()场景性能优化

很多业务场景里面有类似如下SQL实现获取每组的最大或者最小值的目的

SELECT 
    s.reconcile_method, s.trans_entity_bank_acc_id 
FROM (
    SELECT tms.trans_entity_bank_acc_id, tms.source AS reconcile_method        
    FROM (SELECT 
            tms.trans_entity_bank_acc_id, bank_st.source, 
            row_number() OVER(PARTITION BY tms.trans_entity_bank_acc_id ORDER BY bank_st.statement_date DESC) cnt 
        FROM dwrtretl.dwr_tr_tms_settle_f_tmp6 tms, 
        (SELECT bank_st.source, bank_st.bank_account_id, bank_st.statement_date   
        FROM dwnodi.dwr_fin_tr_bank_statement_fi bank_st
        WHERE 1 = 1 AND bank_st.del_flag = 'N') bank_st   
        WHERE tms.trans_entity_bank_acc_id = bank_st.bank_account_id   
        AND tms.settlement_date <= bank_st.statement_date
    ) tms       WHERE tms.cnt = 1
)s

如上的SQL语句在执行的时候会导致操作的数据集按照PARTITION BY和ORDER BY指定的组合字段做排序,具体计划如下

GaussDB(DWS)性能调优:row_number()场景性能优化

排序是一个高内存、低性能的操作动作,当数据量很大时,排序动作(Sort算子)会导致极大耗时。上述SQL的执行信息如下,其中Sort算子的耗时占比为(286227.739-83911.249)/287404.461 = 70.39%

GaussDB(DWS)性能调优:row_number()场景性能优化

因此我们一般应当避免大数据量(超千万级)的排序。当出现因为row_number导致执行执行慢的场景时,一般的优化思路有两个

  1. 减小row_number操作处理的数据量。适用场景一般是row_number操作的数据源重复度比较高的场景
  2. 通过等价改写的方式避免排序操作。适用场景一般是输出列和row_number的PARTITION BY和ORDER BY列的子集的场景

根据实际业务SQL的满足情况,优先检视建议方案1

场景一:通过提前去重操作,降低row_number操作的数据量

原始SQL

SELECT 
    s.reconcile_method, s.trans_entity_bank_acc_id 
FROM (
    SELECT tms.trans_entity_bank_acc_id, tms.source AS reconcile_method        
    FROM (SELECT 
            tms.trans_entity_bank_acc_id, bank_st.source, 
            row_number() OVER(PARTITION BY tms.trans_entity_bank_acc_id ORDER BY bank_st.statement_date DESC) cnt 
        FROM dwrtretl.dwr_tr_tms_settle_f_tmp6 tms, 
        (SELECT bank_st.source, bank_st.bank_account_id, bank_st.statement_date   
        FROM dwnodi.dwr_fin_tr_bank_statement_fi bank_st
        WHERE 1 = 1 AND bank_st.del_flag = 'N') bank_st   
        WHERE tms.trans_entity_bank_acc_id = bank_st.bank_account_id   
        AND tms.settlement_date <= bank_st.statement_date
    ) tms       WHERE tms.cnt = 1
)s

优化后SQL

SELECT 
    s.reconcile_method, s.trans_entity_bank_acc_id 
FROM (
    SELECT 
        trans_entity_bank_acc_id, source AS reconcile_method, 
        row_number() OVER(PARTITION BY trans_entity_bank_acc_id ORDER BY statement_date DESC) AS rn
    FROM (
        SELECT 
            tms.trans_entity_bank_acc_id, bank_st.source, bank_st.statement_date   
        FROM dwnodi.dwr_fin_tr_bank_statement_fi bank_st, dwrtretl.dwr_tr_tms_settle_f_tmp6 tms
        WHERE tms.trans_entity_bank_acc_id = bank_st.bank_account_id   
        AND tms.settlement_date <= bank_st.statement_date AND bank_st.del_flag = 'N'
        GROUP BY tms.trans_entity_bank_acc_id, bank_st.source, bank_st.statement_date
    )t
)s WHERE rn = 1

优化后的执行情况

GaussDB(DWS)性能调优:row_number()场景性能优化

场景二:反向关联,降低row_number操作的数据量

优化后SQL

WITH t AS( /* row_number要处理的结果集 */
    SELECT 
        tms.trans_entity_bank_acc_id, bank_st.source, bank_st.statement_date   
    FROM dwnodi.dwr_fin_tr_bank_statement_fi bank_st, dwrtretl.dwr_tr_tms_settle_f_tmp6 tms
    WHERE tms.trans_entity_bank_acc_id = bank_st.bank_account_id   
    AND tms.settlement_date <= bank_st.statement_date AND bank_st.del_flag = 'N'
    GROUP BY tms.trans_entity_bank_acc_id, bank_st.source, bank_st.statement_date
),

m AS( /* 求满足rn = 1的PARTITION BY列和ORDER BY列 */
    SELECT 
         trans_entity_bank_acc_id, /* PARTITION BY列 */
         max(statement_date) AS statement_date /* ORDER BY列 */
    FROM t
    GROUP BY trans_entity_bank_acc_id
)

SELECT 
    s.reconcile_method, s.trans_entity_bank_acc_id 
FROM (
    SELECT 
        trans_entity_bank_acc_id, source AS reconcile_method, 
        row_number() OVER(PARTITION BY trans_entity_bank_acc_id ORDER BY statement_date DESC) AS rn
    FROM ( /* 跟原始结果集反关联,获取row_number()要处理的最小结果集 */
        SELECT t.*
        FROM t
        INNER JOIN m ON m.trans_entity_bank_acc_id = t.trans_entity_bank_acc_id 
        AND m.statement_date = t.statement_date
    )f
)s WHERE rn = 1

优化后的执行情况

GaussDB(DWS)性能调优:row_number()场景性能优化

原创文章,作者:奋斗,如若转载,请注明出处:https://blog.ytso.com/tech/bigdata/317280.html

(0)
上一篇 13小时前
下一篇 13小时前

相关推荐

发表回复

登录后才能评论