
ARTS - 2019 Week 6-3




A Deep Dive into Query Execution Engine of Spark SQL

Spark SQL enables Spark to perform efficient and fault-tolerant relational query processing using analytics database technology. Relational queries are compiled into executable physical plans consisting of transformations and actions on RDDs, with Java code generated for them. The generated code is compiled to Java bytecode, executed by the JVM, and optimized by the JIT compiler into native machine code at runtime. This talk takes a deep dive into the Spark SQL execution engine, covering pipelined execution, whole-stage code generation, UDF execution, memory management, vectorized readers, and lineage-based RDD transformations and actions.

[PDF] A Deep Dive into Query Execution Engine of Spark SQL



  • Spark Components

    • Spark Core、Data Source Connectors
    • Catalyst Optimization & Tungsten Execution
    • SparkSession / DataFrame / Dataset APIs
    • SQL、Spark ML、Spark Streaming、Spark Graph、3rd-party Libraries
  • SQL Pipeline

Parser -> Analysis -> Logical Optimization -> Physical Planning -> Code Generation -> Execution
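The pipeline above can be illustrated with a toy expression compiler (plain Python, no Spark involved; the function names and tiny tuple-based AST are made up for illustration):

```python
# Toy version of the Spark SQL pipeline: parse -> analyze -> optimize -> codegen.
# Expressions are ("add", a, b) tuples, column names, or integer literals.

def parse(sql_like):
    # "Parser": here just a pre-built AST standing in for `col + (1 + 2)`
    return ("add", "col", ("add", 1, 2))

def analyze(tree, schema):
    # "Analysis": resolve column names against the schema
    if isinstance(tree, str):
        if tree not in schema:
            raise ValueError(f"unresolved column: {tree}")
        return ("colref", schema.index(tree))
    if isinstance(tree, tuple) and tree[0] == "add":
        return ("add", analyze(tree[1], schema), analyze(tree[2], schema))
    return tree

def optimize(tree):
    # "Logical Optimization": constant folding
    if isinstance(tree, tuple) and tree[0] == "add":
        left, right = optimize(tree[1]), optimize(tree[2])
        if isinstance(left, int) and isinstance(right, int):
            return left + right
        return ("add", left, right)
    return tree

def codegen(tree):
    # "Code Generation": emit a Python expression over `row`
    if isinstance(tree, int):
        return str(tree)
    if tree[0] == "colref":
        return f"row[{tree[1]}]"
    return f"({codegen(tree[1])} + {codegen(tree[2])})"

plan = optimize(analyze(parse("SELECT col + 1 + 2"), ["col"]))
expr = codegen(plan)              # -> "(row[0] + 3)"
fn = eval(f"lambda row: {expr}")  # "Execution"
```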


  • Physical Planning

    • Transform logical operators into physical operators
    • Choose between different physical alternatives
    • Includes physical traits of the execution engine
    • Some ops may be mapped into multiple physical nodes
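Choosing between physical alternatives can be sketched as a simple size-based rule (illustrative only; Spark's actual decision uses statistics and the `spark.sql.autoBroadcastJoinThreshold` setting):

```python
# Illustrative planner rule: broadcast the smaller side of a join if it fits
# under a threshold, otherwise fall back to a sort-merge join.
BROADCAST_THRESHOLD = 10 * 1024 * 1024  # bytes; stands in for the configured threshold

def choose_join(left_size, right_size):
    # left_size / right_size: estimated sizes of the join inputs in bytes
    if min(left_size, right_size) <= BROADCAST_THRESHOLD:
        return "BroadcastHashJoin"
    return "SortMergeJoin"
```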
  • Code Generation

    • No virtual function calls
    • Data in CPU registers
    • Loop unrolling & SIMD
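The difference between operator-at-a-time iteration and a whole-stage-generated loop can be sketched as follows (illustrative Python; the real engine generates Java):

```python
# Volcano-style execution: each operator is an iterator, with per-row
# (virtual) calls between operators.
def volcano(rows):
    def scan():
        yield from rows
    def filt(child):
        for r in child:
            if r > 10:
                yield r
    def proj(child):
        for r in child:
            yield r * 2
    return list(proj(filt(scan())))

# Whole-stage code generation fuses the operators into one tight loop:
# no per-row iterator calls, values stay in locals (CPU registers),
# and the loop body is friendly to unrolling and SIMD.
def fused(rows):
    out = []
    for r in rows:
        if r > 10:
            out.append(r * 2)
    return out
```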
  • Fault Tolerance and Failure Handling

    • Mid-query recovery model: recompute lost partitions from lineage
    • Task and fetch failures: retry policies
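Lineage-based recovery can be sketched as replaying the chain of deterministic transformations over the source partition (toy Python; all names are illustrative):

```python
# Lineage-based recovery: a lost partition is recomputed by replaying the
# chain of deterministic transformations from the source data.
lineage = [lambda p: [x + 1 for x in p],           # map
           lambda p: [x for x in p if x % 2 == 1]] # filter

source_partitions = {0: [1, 2, 3], 1: [4, 5, 6]}

def compute(pid):
    data = source_partitions[pid]
    for step in lineage:
        data = step(data)
    return data

cache = {0: compute(0), 1: compute(1)}
cache.pop(1)  # simulate losing partition 1 (e.g., an executor failure)
recovered = cache.get(1) or compute(1)  # mid-query recovery via lineage
```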
  • Memory Management


    • Execution Memory

        • Buffer intermediate results
        • Normally short-lived

    • Storage Memory

        • Reuse data for future computation
        • Cached data can be long-lived
        • LRU eviction for spilled data
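The LRU eviction of storage memory can be sketched with a small bookkeeping class (illustrative; Spark's real block manager and eviction policy are more involved):

```python
from collections import OrderedDict

# Sketch of storage-memory bookkeeping: cached blocks are evicted in
# least-recently-used order when a new block would exceed the memory budget.
class StorageMemory:
    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0
        self.blocks = OrderedDict()  # block_id -> size, least recently used first

    def get(self, block_id):
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)  # mark as recently used
            return True
        return False

    def put(self, block_id, size):
        while self.used + size > self.capacity and self.blocks:
            _, freed = self.blocks.popitem(last=False)  # evict the LRU block
            self.used -= freed
        self.blocks[block_id] = size
        self.used += size
```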
  • Delta Lake (Vectorized)

    • Full ACID transactions
    • Schema management
    • Scalable metadata handling
    • Data versioning and time travel
    • Unified batch/streaming support
    • Record update and deletion
    • Data expectations
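Data versioning and time travel can be sketched as immutable snapshots keyed by version number (a toy model, not Delta Lake's actual transaction-log implementation):

```python
# Toy illustration of data versioning / time travel: each commit appends an
# immutable snapshot, and readers can pin any past version.
class VersionedTable:
    def __init__(self):
        self.versions = [[]]  # version 0: empty table

    def commit(self, rows):
        self.versions.append(list(rows))  # new immutable snapshot

    def read(self, version=None):
        if version is None:
            version = len(self.versions) - 1  # latest version
        return self.versions[version]  # "time travel" to a past version
```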
  • UDF

Convert to the corresponding data format -> invoke the UDF -> convert back to the internal data format

    • Java/Scala UDFs
    • Hive UDFs
    • Python/Pandas UDFs
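The convert -> invoke -> convert-back flow can be sketched as follows (JSON stands in for the internal row format, and `run_udf` is a made-up name, not Spark's API):

```python
import json

# Sketch of the UDF call path: rows in an "internal" format (here JSON bytes)
# are converted to plain Python values, the UDF runs, and the results are
# converted back. The formats are illustrative, not Spark's actual encoding.
def run_udf(udf, internal_rows):
    out = []
    for raw in internal_rows:
        value = json.loads(raw)         # internal format -> UDF's data format
        result = udf(value)             # invoke the UDF
        out.append(json.dumps(result))  # UDF's data format -> internal format
    return out

def upper(s):
    return s.upper()
```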
  • PySpark (Koalas)

Python code is executed via Py4J

Physical Operator (JVM) -> PythonRunner (JVM) -> serialize/deserialize data -> execute code (Python worker)
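The JVM-to-worker round trip can be sketched with `pickle` standing in for the real wire format (function names are illustrative, not Spark internals):

```python
import pickle

# Sketch of the PythonRunner data path: the JVM side serializes a batch of
# rows, a Python worker deserializes it, applies the Python function, and
# serializes the results back to the JVM.
def jvm_side(rows, worker):
    payload = pickle.dumps(rows)         # serialize the batch for the worker
    result_payload = worker(payload)
    return pickle.loads(result_payload)  # deserialize results back in the JVM

def python_worker(payload):
    rows = pickle.loads(payload)         # deserialize the batch
    results = [r * 10 for r in rows]     # execute the Python code
    return pickle.dumps(results)         # serialize results back
```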


Common JVM Options

  • -XX:MetaspaceSize: initial Metaspace size
  • -XX:MaxMetaspaceSize: maximum Metaspace size
  • -XX:+UseG1GC: use the G1 garbage collector
  • -XX:MaxGCPauseMillis: target for the maximum GC pause time
  • -XX:G1HeapRegionSize: size of a G1 region
  • -XX:MinMetaspaceFreeRatio: minimum percentage of free Metaspace after a Metaspace GC
  • -XX:MaxMetaspaceFreeRatio: maximum percentage of free Metaspace after a Metaspace GC
  • -XX:+PrintGCDetails: print GC details
  • -XX:+PrintGCTimeStamps: print timestamps (time since JVM start) in GC logs
  • -XX:+PrintGCDateStamps: print dates and times in GC logs
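In practice such options are passed to the driver and executors through Spark configuration, e.g. (the flag values and the job file name are illustrative placeholders):

```shell
# Illustrative spark-submit invocation; "your_job.py" is a placeholder.
spark-submit \
  --conf "spark.driver.extraJavaOptions=-XX:MetaspaceSize=256m -XX:MaxMetaspaceSize=512m" \
  --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:+PrintGCDetails -XX:+PrintGCDateStamps" \
  your_job.py
```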


Spark Job Execution Flow (Cluster Mode on YARN)