ARTS - 2019 Week 6-3
20190616~20190622
Algorithm
Review
A Deep Dive into Query Execution Engine of Spark SQL
Spark SQL enables Spark to perform efficient and fault-tolerant relational query processing using analytics database technologies. Relational queries are compiled into executable physical plans consisting of transformations and actions on RDDs, with Java code generated for them; that code is compiled to Java bytecode, executed by the JVM, and optimized by the JIT into native machine code at runtime. This talk takes a deep dive into the Spark SQL execution engine, covering pipelined execution, whole-stage code generation, UDF execution, memory management, vectorized readers, and lineage-based RDD transformations and actions.
[PDF] A Deep Dive into Query Execution Engine of Spark SQL
Review
Basics
- Spark components
- Spark Core, Data Source Connectors
- Catalyst Optimization & Tungsten Execution
- SparkSession / DataFrame / Dataset APIs
- SQL, Spark ML, Spark Streaming, Spark Graph, 3rd-party Libraries
- SQL compilation pipeline (inspected in the sketch below):
  Parser -> Analysis -> Logical Optimization -> Physical Planning -> Code Generation -> Execution
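A quick way to watch these stages is `Dataset.queryExecution`, which exposes the plan after each phase. A minimal sketch (Spark 2.4-era API; the data and column names are made up):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("plans").getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "name").filter($"id" > 1)

val qe = df.queryExecution
println(qe.logical)       // parsed logical plan
println(qe.analyzed)      // after analysis (names/types resolved via the catalog)
println(qe.optimizedPlan) // after Catalyst's logical optimization
println(qe.executedPlan)  // the physical plan selected by the planner
```

The later sketches in these notes reuse the `spark` session created here.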
Agenda
- Physical planning (sketch below)
- Transform logical operators into physical operators
- Choose between different physical alternatives
- Includes physical traits of the execution engine
- Some ops may be mapped into multiple physical nodes
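A concrete example of these alternatives is join selection: the same logical join can become a SortMergeJoin or a BroadcastHashJoin depending on the real `spark.sql.autoBroadcastJoinThreshold` setting. A sketch, reusing `spark` from above:

```scala
val big   = spark.range(1000000L).toDF("id")
val small = spark.range(100L).toDF("id")

// Disable broadcast joins: the planner falls back to SortMergeJoin.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
big.join(small, "id").explain()

// Allow broadcasting tables up to 10 MB: the small side is broadcast.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10L * 1024 * 1024)
big.join(small, "id").explain()
```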
- Code generation (sketch below)
- No virtual function calls
- Data in CPU registers
- Loop unrolling & SIMD
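To see this generated code, Spark ships a debug helper in `org.apache.spark.sql.execution.debug` that prints the Java source produced for each whole-stage subtree. A sketch, reusing `spark` from above (the query itself is made up):

```scala
import org.apache.spark.sql.execution.debug._

val q = spark.range(1000L).filter("id > 100").selectExpr("id + 1 AS id1")
q.debugCodegen() // dumps the generated Java for each WholeStageCodegen subtree
```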
- Fault tolerance and failure handling (sketch below)
- Mid-query recovery model: recompute lost partitions from their lineage
- Task and fetch failures: handled by retry policies
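The lineage that drives this recovery is visible on any RDD via `toDebugString`; a lost partition is rebuilt by replaying exactly this chain of parent dependencies. A sketch, reusing `spark` from above:

```scala
val rdd = spark.sparkContext
  .parallelize(1 to 100, numSlices = 4)
  .map(_ * 2)
  .filter(_ % 3 == 0)

println(rdd.toDebugString) // the chain of parent RDDs, i.e. the lineage
```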
- Memory management (sketch below)
- Reserved memory, user memory, execution memory, storage memory
- Execution memory: buffers intermediate results; normally short-lived
- Storage memory: reuses data for future computation; cached data can be long-lived; LRU eviction when data must spill
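The split between the unified execution/storage region and the rest of the heap is controlled by two real settings, `spark.memory.fraction` and `spark.memory.storageFraction`; the values below are the defaults, shown for illustration only:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Fraction of (heap - 300 MB reserved) shared by execution and storage.
  .set("spark.memory.fraction", "0.6")
  // Part of that region protected for storage; execution may borrow the rest.
  .set("spark.memory.storageFraction", "0.5")
```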
- Delta Lake (vectorized; sketch below)
- Full ACID transactions
- Schema management
- Scalable metadata handling
- Data versioning and time travel
- Unified batch/streaming support
- Record update and deletion
- Data expectations
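A sketch of the basic Delta Lake API, including the time-travel read. It requires the delta-core dependency on the classpath, the path is hypothetical, and it reuses `spark` from above:

```scala
val path = "/tmp/events_delta" // hypothetical location

// Each write becomes a new table version with full ACID guarantees.
spark.range(10L).toDF("id").write.format("delta").mode("overwrite").save(path)

// Read the latest version, then "time travel" back to version 0.
val latest = spark.read.format("delta").load(path)
val v0     = spark.read.format("delta").option("versionAsOf", 0).load(path)
```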
- UDF (sketch below)
Convert to the UDF's data format -> invoke the UDF -> convert back to the internal data format
- Java/Scala UDFs
- Hive UDFs
- Python/Pandas UDFs
- PySpark (Koalas)
Executes Python code via Py4J:
Physical operator (JVM) -> PythonRunner (JVM) -> serialize/deserialize data -> execute code (Python worker)
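A sketch of the JVM side: a Scala UDF built with the real `org.apache.spark.sql.functions.udf` helper. For each row, Spark converts the internal (UnsafeRow) data to Scala objects, calls the function, and converts the result back, which is the conversion cost noted above. `plusOne` is a made-up example; reuses `spark` from above:

```scala
import org.apache.spark.sql.functions.{col, udf}

val plusOne = udf((x: Int) => x + 1) // hypothetical UDF

val out = spark.range(5L).toDF("id")
  .select(plusOne(col("id").cast("int")).as("id_plus_one"))
out.show()
```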
Tip
Common JVM Flags
-Dproperty=value
Set a system property; read it back with System.getProperty
-verbose:gc
Print information for every GC event
-Xms6g
Initial heap size
-Xmx6g
Maximum heap size
-XX:MetaspaceSize=96m
Initial Metaspace size
-XX:MaxMetaspaceSize=96m
Maximum Metaspace size
-XX:+UseG1GC
Use the G1 garbage collector
-XX:MaxGCPauseMillis=20
Target maximum GC pause time (milliseconds)
-XX:InitiatingHeapOccupancyPercent=35
Heap occupancy threshold that triggers a concurrent marking cycle
-XX:G1HeapRegionSize=16M
Size of each G1 region
-XX:MinMetaspaceFreeRatio=50
Minimum percentage of Metaspace kept free after a Metaspace GC
-XX:MaxMetaspaceFreeRatio=80
Maximum percentage of Metaspace kept free after a Metaspace GC
-XX:+PrintGCDetails
Print GC details
-XX:+PrintGCTimeStamps
Print the time since JVM start at each GC
-XX:+PrintGCDateStamps
Print the date and wall-clock time at each GC
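In a Spark job, flags like these are usually passed through the real `spark.executor.extraJavaOptions` setting. Note that Spark forbids setting the executor heap (-Xms/-Xmx) there; the heap is sized via `spark.executor.memory` instead. A sketch with illustrative values:

```scala
import org.apache.spark.SparkConf

val gcConf = new SparkConf()
  .set("spark.executor.memory", "6g") // takes the place of -Xmx for executors
  .set("spark.executor.extraJavaOptions",
    "-verbose:gc -XX:+UseG1GC -XX:MaxGCPauseMillis=20 " +
      "-XX:InitiatingHeapOccupancyPercent=35 " +
      "-XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
```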