ARTS - 2019 Week 6-2¶
20190609~20190615
Algorithm¶
Review¶
Apache Spark Core - Deep Dive — Proper Optimization¶
Optimizing spark jobs through a true understanding of spark core.
Learn
- What is a partition?
- What is the difference between read/shuffle/write partitions?
- How to increase parallelism and decrease output files?
- Where does shuffle data go between stages?
- What is the "right" size for your spark partitions and files?
- Why does a job slow down with only a few tasks left and never finish?
- Why doesn't adding nodes decrease my compute time?
【PDF】Apache Spark Core - Deep Dive — Proper Optimization
Review¶
层次架构¶
- Cluster、Driver、Executor(Core
、Storage ) - Application -(Action)-> Job -(Shuffle)-> Stage -(Partition)-> Task
总结¶
-
充分利用资源
- Core
- Memory
- Disk
- Network
- Data
- Cost
-
基线与问题
- 资源:Core、Memory、Disk、Network
- 作业:Job、Stage、Task、Spill
-
减少数据扫描
- 分区 - Partition Filter
- 分桶
- Z-Ordering - Colocate
-
正确设置分区
类型
- Input - 控制分区大小 spark.sql.files.maxPartitionBytes (mutable)
- Shuffle - 控制分区数量 spark.sql.shuffle.partitions
- Output - 控制分区大小 maxRecordsPerFile、Coalesce、Repartition、localCheckpoint
原则
- 数据探索,资源预估
- 均衡数据,保证并行度 - 128MB/100W、保持倍数
- Input、Shuffle、Output
- 避免 Spill
-
平衡与取舍
- Input Partitions
- Shuffle Partitions
- Output Files
- Spills
- GC Times
-
Join 优化
- SortMerge Join – 两侧数据量都很大
- Broadcast Join – 一侧数据量小
- BroadcastNestedLoop Join - 没有相等谓词
- Skew Join - Salting,Grouping
- Range Join - Point、Overlap
-
减少数据移动与重新分区
- df#cache、df#persist
- CACHE TABLE
-
使用向量化 UDF
Tip¶
Maven 依赖¶
依赖¶
依赖元素
- 坐标:groupId, artifactId, version(范围)
- 作用域(scope):compile、test、provided 、runtime
- 类别(classifier):不同构建方式的标识
依赖原则
- 最短路劲原则
- 最先定义原则
解决冲突
- 明确定义依赖
- 排除冲突依赖
- 调整依赖作用域
- 通过Shade插件调整
Share¶
Spark 作业耗时分析¶
- 外部影响、依赖
- 内部流程、机制