[Spark SQL] Source Code Analysis: the Optimizer

Preface

From the previous posts we know that the overall Spark SQL processing flow is as follows:

  • sqlText is parsed by SqlParser into an unresolved LogicalPlan;
  • the analyzer module binds it against the catalog, producing a resolved LogicalPlan;
  • the optimizer module optimizes the resolved LogicalPlan, producing an optimized LogicalPlan;
  • SparkPlanner converts the optimized LogicalPlan into a PhysicalPlan (a SparkPlan);
  • prepareForExecution() converts the PhysicalPlan into an executable physical plan;
  • execute() runs the executable physical plan.

The optimizer module in detail

The optimizer and every module after it run only once an action has been triggered. The optimizer's job is to turn the resolved LogicalPlan into an optimized LogicalPlan.
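The deferred behaviour described above can be illustrated with a small toy model. This is only a sketch: `ToyQueryExecution`, its stage names, and the string "plans" are hypothetical and are not Spark classes. The one thing it borrows from Spark is the pattern of exposing each stage as a lazy value, so that a stage runs only when something downstream first asks for it.

```scala
// Toy model of a lazily evaluated query pipeline. All names are hypothetical;
// only the lazy-val pattern mirrors how Spark's QueryExecution exposes stages.
final class ToyQueryExecution(sqlText: String) {
  // Records which stages have actually run, in order.
  val trace = scala.collection.mutable.ArrayBuffer.empty[String]

  private def stage(name: String, input: String): String = {
    trace += name
    s"$name($input)"
  }

  // Each stage is computed at most once, and only when first accessed.
  lazy val parsed: String    = stage("parse", sqlText)
  lazy val analyzed: String  = stage("analyze", parsed)
  lazy val optimized: String = stage("optimize", analyzed)
  lazy val physical: String  = stage("plan", optimized)

  // Stand-in for an action such as collect(): forces every remaining stage.
  def execute(): String = stage("execute", physical)
}

object LazyPipelineDemo {
  def main(args: Array[String]): Unit = {
    val qe = new ToyQueryExecution("SELECT 1")
    println(qe.analyzed)             // forces parse and analyze only
    println(qe.trace.mkString(", ")) // optimize has not run yet
    println(qe.execute())            // the "action" forces optimize, plan, execute
    println(qe.trace.mkString(", "))
  }
}
```

Accessing `analyzed` leaves the optimizer untouched; only the call that stands in for an action pulls the remaining stages through.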

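The optimizer rewrites the syntax tree using rules distilled from experts' many years of SQL optimization experience, such as predicate pushdown, column pruning, and constant folding. The pattern is very close to the Analyzer's: Optimizer also extends RuleExecutor and defines a large number of optimization Rules: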

def batches: Seq[Batch] = {
    // Technically some of the rules in Finish Analysis are not optimizer rules and belong more
    // in the analyzer, because they are needed for correctness (e.g. ComputeCurrentTime).
    // However, because we also use the analyzer to canonicalized queries (for view definition),
    // we do not eliminate subqueries or compute current time in the analyzer.
    Batch("Finish Analysis", Once,
      EliminateSubqueryAliases,
      EliminateView,
      ReplaceExpressions,
      ComputeCurrentTime,
      GetCurrentDatabase(sessionCatalog),
      RewriteDistinctAggregates,
      ReplaceDeduplicateWithAggregate) ::
    //////////////////////////////////////////////////////////////////////////////////////////
    // Optimizer rules start here
    //////////////////////////////////////////////////////////////////////////////////////////
    // - Do the first call of CombineUnions before starting the major Optimizer rules,
    //   since it can reduce the number of iteration and the other rules could add/move
    //   extra operators between two adjacent Union operators.
    // - Call CombineUnions again in Batch("Operator Optimizations"),
    //   since the other rules might make two separate Unions operators adjacent.
    Batch("Union", Once,
      CombineUnions) ::
    Batch("Pullup Correlated Expressions", Once,
      PullupCorrelatedPredicates) ::
    Batch("Subquery", Once,
      OptimizeSubqueries) ::
    Batch("Replace Operators", fixedPoint,
      ReplaceIntersectWithSemiJoin,
      ReplaceExceptWithAntiJoin,
      ReplaceDistinctWithAggregate) :: // replaces Distinct with Aggregate
    Batch("Aggregate", fixedPoint,
      RemoveLiteralFromGroupExpressions,
      RemoveRepetitionFromGroupExpressions) ::
    Batch("Operator Optimizations", fixedPoint, Seq(
      // Operator push down
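      PushProjectionThroughUnion, // pushes projections down through Union
      ReorderJoin(conf),
      EliminateOuterJoin(conf),
      PushPredicateThroughJoin, // predicate pushdown through joins
      PushDownPredicate, // predicate pushdown
      LimitPushDown(conf),
      ColumnPruning, // column pruning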
      InferFiltersFromConstraints(conf),
      // Operator combine
      CollapseRepartition,
      CollapseProject,
      CollapseWindow,
      CombineFilters, // merges adjacent filters
      CombineLimits, // merges adjacent limits
      CombineUnions,
      // Constant folding and strength reduction
      NullPropagation(conf), // null handling
      FoldablePropagation,
      OptimizeIn(conf), // optimizes the IN keyword, replacing it with InSet
      ConstantFolding, // constant optimization: constants that can be computed are evaluated here directly
      ReorderAssociativeOperator,
      LikeSimplification, // expression simplification
      BooleanSimplification,
      SimplifyConditionals,
      RemoveDispensableExpressions,
      SimplifyBinaryComparison,
      PruneFilters(conf),
      EliminateSorts,
      SimplifyCasts,
      SimplifyCaseConversionExpressions,
      RewriteCorrelatedScalarSubquery,
      EliminateSerialization,
      RemoveRedundantAliases,
      RemoveRedundantProject,
      SimplifyCreateStructOps,
      SimplifyCreateArrayOps,
      SimplifyCreateMapOps) ++
      extendedOperatorOptimizationRules: _*) ::
    Batch("Check Cartesian Products", Once,
      CheckCartesianProducts(conf)) ::
    Batch("Join Reorder", Once,
      CostBasedJoinReorder(conf)) ::
    Batch("Decimal Optimizations", fixedPoint, // precision optimization
      DecimalAggregates(conf)) ::
    Batch("Object Expressions Optimization", fixedPoint,
      EliminateMapObjects,
      CombineTypedFilters) ::
    Batch("LocalRelation", fixedPoint,
      ConvertToLocalRelation,
      PropagateEmptyRelation) ::
    Batch("OptimizeCodegen", Once,
      OptimizeCodegen(conf)) ::
    Batch("RewriteSubquery", Once,
      RewritePredicateSubquery,
      CollapseProject) :: Nil
  }
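
To make one of the rule families listed above more concrete, here is a minimal sketch of constant folding on a toy expression tree. The types `Expr`, `Lit`, `Attr`, `Add`, `Mul` and the object `ToyConstantFolding` are assumptions made up for this illustration; they are not Catalyst classes, and Spark's own ConstantFolding rule is more general than this.

```scala
// Toy expression tree; illustrative types only, not Catalyst's.
sealed trait Expr
final case class Lit(value: Int) extends Expr
final case class Attr(name: String) extends Expr
final case class Add(left: Expr, right: Expr) extends Expr
final case class Mul(left: Expr, right: Expr) extends Expr

object ToyConstantFolding {
  // Rewrites bottom-up: fold the children first, then the node itself.
  def fold(e: Expr): Expr = e match {
    case Add(l, r) =>
      (fold(l), fold(r)) match {
        case (Lit(a), Lit(b)) => Lit(a + b) // both sides constant: evaluate now
        case (fl, fr)         => Add(fl, fr)
      }
    case Mul(l, r) =>
      (fold(l), fold(r)) match {
        case (Lit(a), Lit(b)) => Lit(a * b)
        case (fl, fr)         => Mul(fl, fr)
      }
    case leaf => leaf // literals and column references are left as they are
  }

  def main(args: Array[String]): Unit = {
    // x + (1 + 2 * 3): the constant subtree collapses, the column reference stays.
    val expr = Add(Attr("x"), Add(Lit(1), Mul(Lit(2), Lit(3))))
    println(fold(expr))
  }
}
```

The point of the sketch is that a rule is just a tree-to-tree function: any subtree made only of constants is evaluated once at optimization time instead of once per row.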

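As with the analyzer, the batches are executed by RuleExecutor's execute method, which walks through them in order, so that is not analyzed again here. Some examples of these optimizations are available here.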
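
As a rough illustration of how batches with the Once and fixedPoint strategies behave, the following is a simplified toy executor. Everything in it (`Plan`, `Rule`, `Batch`, `ToyRuleExecutor`, `ToyCombineFilters`) is hypothetical and written only for this sketch; it models the idea of running each batch's rules until the plan stops changing or an iteration limit is hit, and it is not Spark's RuleExecutor implementation.

```scala
// Toy logical plan; illustrative types only, not Catalyst's.
sealed trait Plan
final case class Scan(table: String) extends Plan
final case class Filter(conditions: Set[String], child: Plan) extends Plan
final case class Project(columns: List[String], child: Plan) extends Plan

trait Rule { def name: String; def apply(plan: Plan): Plan }

// Merges a Filter sitting directly on another Filter. It deliberately merges
// only one pair per pass, so a chain of filters needs several passes.
object ToyCombineFilters extends Rule {
  val name = "ToyCombineFilters"
  def apply(plan: Plan): Plan = plan match {
    case Filter(c1, Filter(c2, child)) => Filter(c1 ++ c2, child)
    case Filter(c, child)              => Filter(c, apply(child))
    case Project(cols, child)          => Project(cols, apply(child))
    case scan: Scan                    => scan
  }
}

sealed trait Strategy { def maxIterations: Int }
case object Once extends Strategy { val maxIterations = 1 }
final case class FixedPoint(maxIterations: Int) extends Strategy

final case class Batch(name: String, strategy: Strategy, rules: Rule*)

object ToyRuleExecutor {
  // Runs batches in order. Within a batch, applies all rules repeatedly until
  // the plan stops changing or the strategy's iteration limit is reached.
  def execute(batches: Seq[Batch], plan: Plan): Plan =
    batches.foldLeft(plan) { (batchInput, batch) =>
      var current = batchInput
      var iteration = 0
      var changed = true
      while (changed && iteration < batch.strategy.maxIterations) {
        val next = batch.rules.foldLeft(current)((p, rule) => rule(p))
        changed = next != current
        current = next
        iteration += 1
      }
      current
    }
}

object ToyRuleExecutorDemo {
  def main(args: Array[String]): Unit = {
    val plan =
      Filter(Set("a > 1"), Filter(Set("b < 2"), Filter(Set("c = 3"), Scan("t"))))
    val once  = Seq(Batch("Combine once", Once, ToyCombineFilters))
    val fixed = Seq(Batch("Combine to fixed point", FixedPoint(100), ToyCombineFilters))
    println(ToyRuleExecutor.execute(once, plan))  // two Filter nodes remain
    println(ToyRuleExecutor.execute(fixed, plan)) // a single Filter remains
  }
}
```

With Once, the three stacked filters are only partly merged; with a fixed-point strategy the same rule is reapplied until nothing changes, which is why batches whose rules can enable one another are run to a fixed point.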
