
ARTS - 2019 Week 6-3

20190616~20190622

Algorithm

Review

A Deep Dive into Query Execution Engine of Spark SQL

Spark SQL enables Spark to perform efficient and fault-tolerant relational query processing with analytic database technologies. Relational queries are compiled into executable physical plans consisting of transformations and actions on RDDs, with Java code generated along the way. That code is compiled to Java bytecode, executed by the JVM, and optimized by the JIT compiler into native machine code at runtime. This talk takes a deep dive into the Spark SQL execution engine, covering pipelined execution, whole-stage code generation, UDF execution, memory management, vectorized readers, and lineage-based RDD transformations and actions.

[PDF] A Deep Dive into Query Execution Engine of Spark SQL

Notes

Basics

  • Spark components

    • Spark Core、Data Source Connectors
    • Catalyst Optimization & Tungsten Execution
    • SparkSession / DataFrame / Dataset APIs
    • SQL、Spark ML、Spark Streaming、Spark Graph、3rd-party Libraries
  • SQL processing flow

Parser -> Analysis -> Logical Optimization -> Physical Planning -> Code Generation -> Execution
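
As a concrete illustration of the pipeline above, here is a minimal sketch (the table and query are made up for the example) that prints the parsed, analyzed, optimized, and physical plans:

```scala
import org.apache.spark.sql.SparkSession

object ExplainDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("explain-demo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // A toy table; any DataFrame works here.
    Seq((1, "a"), (2, "b"), (3, "a")).toDF("id", "key")
      .createOrReplaceTempView("t")

    // explain(true) prints every stage of the pipeline:
    // == Parsed Logical Plan ==, == Analyzed Logical Plan ==,
    // == Optimized Logical Plan ==, and == Physical Plan ==.
    spark.sql("SELECT key, count(*) AS cnt FROM t GROUP BY key").explain(true)

    spark.stop()
  }
}
```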

Agenda

  • Physical planning

    • Transform logical operators into physical operators
    • Choose between different physical alternatives (see the join-selection sketch after this list)
    • Includes physical traits of the execution engine
    • Some operators may be mapped into multiple physical nodes
  • Code generation (see the whole-stage codegen sketch after this list)

    • No virtual function calls
    • Data in CPU registers
    • Loop unrolling & SIMD
  • Fault tolerance and failure handling (see the lineage sketch after this list)

    • Mid-query recovery model: recompute lost partitions from their lineage
    • Task and fetch failures: retry strategies
  • Memory management (see the memory-config sketch after this list)

Reserved memory, user memory, execution memory, and storage memory:

    • Execution memory

      • Buffers intermediate results
      • Normally short-lived
    • Storage memory

      • Reuses data for future computation
      • Cached data can be long-lived
      • LRU eviction for spilled data
  • Delta Lake (vectorized reader; see the time-travel sketch after this list)

    • Full ACID transactions
    • Schema management
    • Scalable metadata handling
    • Data versioning and time travel
    • Unified batch/streaming support
    • Record update and deletion
    • Data expectations
  • UDF (see the Scala UDF sketch after this list)

Convert input to the UDF's data format -> invoke the UDF -> convert the result back to the internal data format

    • Java/Scala UDFs
    • Hive UDFs
    • Python/Pandas UDFs
  • PySpark (Koalas)

Python code is executed through Py4J:

Physical Operator (JVM) -> PythonRunner (JVM) -> serialize/deserialize data -> execute code (Python worker)
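
For the physical-planning item above, a minimal sketch (toy tables) of how the planner chooses between physical join alternatives, and how a broadcast hint steers that choice:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object JoinSelectionDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    val big   = spark.range(0L, 1000000L).toDF("id")
    val small = Seq((1L, "a"), (2L, "b")).toDF("id", "tag")

    // The planner picks a physical join (sort-merge, shuffled hash,
    // or broadcast hash) based on statistics and thresholds.
    big.join(small, "id").explain()

    // A broadcast hint forces a BroadcastHashJoin physical node.
    big.join(broadcast(small), "id").explain()

    spark.stop()
  }
}
```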
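
For the code-generation item, Spark ships debug helpers that dump the Java source produced by whole-stage codegen; a minimal sketch:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.debug._

object CodegenDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()

    val df = spark.range(0, 100).selectExpr("id * 2 AS doubled")

    // Operators fused into one codegen stage appear in the physical
    // plan prefixed with "*".
    df.explain()

    // Dumps the generated Java source for each codegen stage.
    df.debugCodegen()

    spark.stop()
  }
}
```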
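
For the fault-tolerance item, lost partitions are recomputed from the RDD lineage graph; toDebugString prints that lineage (the input file below is hypothetical):

```scala
import org.apache.spark.sql.SparkSession

object LineageDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val counts = sc.textFile("input.txt")  // hypothetical input file
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // The dependency chain that mid-query recovery replays
    // when a partition is lost.
    println(counts.toDebugString)

    spark.stop()
  }
}
```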
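
For the memory-management item, a sketch of the settings that carve up the unified execution/storage region (the values shown are Spark's defaults, spelled out for illustration):

```scala
import org.apache.spark.sql.SparkSession

object MemoryConfigDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      // Fraction of (heap - reserved memory) shared by execution
      // and storage; the remainder is user memory.
      .config("spark.memory.fraction", "0.6")
      // Portion of that region protected for storage (cached blocks);
      // execution may borrow the rest, evicting cached blocks LRU-first.
      .config("spark.memory.storageFraction", "0.5")
      .getOrCreate()

    println(spark.conf.get("spark.memory.fraction"))
    spark.stop()
  }
}
```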
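
For the Delta Lake item, time travel reads an earlier snapshot of a table by version or timestamp; a minimal sketch, assuming the delta-core package is on the classpath and using a made-up path:

```scala
import org.apache.spark.sql.SparkSession

object DeltaTimeTravelDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()

    val path = "/tmp/events-delta"  // hypothetical table location

    // Write two versions of the same table.
    spark.range(0, 10).write.format("delta").mode("overwrite").save(path)
    spark.range(0, 20).write.format("delta").mode("overwrite").save(path)

    // Time travel: read the table as of version 0.
    val v0 = spark.read.format("delta").option("versionAsOf", 0L).load(path)
    println(v0.count())  // 10

    spark.stop()
  }
}
```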
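
For the UDF item, a minimal Scala UDF sketch; the format conversion described above happens around every call:

```scala
import org.apache.spark.sql.SparkSession

object UdfDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // Spark converts each input from its internal row format to JVM
    // objects, calls the function, and converts the result back.
    spark.udf.register("plusOne", (x: Int) => x + 1)

    Seq(1, 2, 3).toDF("n").createOrReplaceTempView("nums")
    spark.sql("SELECT n, plusOne(n) AS m FROM nums").show()

    spark.stop()
  }
}
```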

Tip

Common JVM options (see the sketch after this list for passing them to Spark executors)

-Dproperty=value

Sets a system property, readable via System.getProperty.

-verbose:gc

Prints information about each GC event.

-Xms6g

Initial heap size.

-Xmx6g

Maximum heap size.

-XX:MetaspaceSize=96m

Initial Metaspace size (the occupancy threshold that triggers the first Metaspace GC).

-XX:MaxMetaspaceSize=96m

Maximum Metaspace size.

-XX:+UseG1GC

Use the G1 garbage collector.

-XX:MaxGCPauseMillis=20

Target maximum GC pause time.

-XX:InitiatingHeapOccupancyPercent=35

Heap occupancy threshold that triggers a concurrent marking cycle.

-XX:G1HeapRegionSize=16M

Size of a G1 heap region.

-XX:MinMetaspaceFreeRatio=50

Minimum percentage of Metaspace to keep free after a Metaspace GC.

-XX:MaxMetaspaceFreeRatio=80

Maximum percentage of Metaspace to keep free after a Metaspace GC.

-XX:+PrintGCDetails

Print detailed GC information.

-XX:+PrintGCTimeStamps

Print a timestamp (seconds since JVM start) with each GC event.

-XX:+PrintGCDateStamps

Print the wall-clock date and time with each GC event.
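
A minimal sketch of wiring some of these options into Spark executors; note that executor heap size must go through spark.executor.memory, since Spark rejects -Xmx in extraJavaOptions:

```scala
import org.apache.spark.sql.SparkSession

object ExecutorJvmOptionsDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      // Heap size is set through Spark's own option, not -Xms/-Xmx.
      .config("spark.executor.memory", "6g")
      // GC tuning and logging flags are passed through verbatim.
      .config("spark.executor.extraJavaOptions",
        "-XX:+UseG1GC -XX:MaxGCPauseMillis=20 " +
        "-XX:InitiatingHeapOccupancyPercent=35 -verbose:gc -XX:+PrintGCDetails")
      .getOrCreate()

    spark.stop()
  }
}
```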

Share

Spark job execution flow (cluster mode on YARN)
