ARTS Check-in Log

hyperj/note.arts

ARTS - 2019 Week 6-2¶

20190609~20190615

Algorithm¶

Review¶

Apache Spark Core - Deep Dive — Proper Optimization ¶

Optimizing Spark jobs through a true understanding of Spark Core.

Learn

What is a partition?
What is the difference between read/shuffle/write partitions?
How to increase parallelism and decrease output files?
Where does shuffle data go between stages?
What is the "right" size for your Spark partitions and files?
Why does a job slow down with only a few tasks left and never finish?
Why doesn't adding nodes decrease my compute time?

[PDF] Apache Spark Core - Deep Dive — Proper Optimization

Review¶

Layered architecture¶

Cluster, Driver, Executor (Core, Storage)
Application -(Action)-> Job -(Shuffle)-> Stage -(Partition)-> Task
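
The hierarchy above implies one task per partition per stage, so the number of "waves" a stage needs depends on how many cores are available. A minimal sketch of that arithmetic follows; the function names and the sample numbers are hypothetical and only illustrate the relationship, they are not taken from the talk.

```python
def count_tasks(stage_partitions):
    """Total tasks in a job: each stage runs one task per partition."""
    return sum(stage_partitions)


def task_waves(partitions, total_cores):
    """Number of scheduling rounds a stage needs when each core runs one task at a time."""
    if total_cores <= 0:
        raise ValueError("total_cores must be positive")
    # Integer ceiling division: a partially filled last wave still costs a full round.
    return -(-partitions // total_cores)


if __name__ == "__main__":
    # Hypothetical job: one input stage and two shuffle stages.
    print(count_tasks([80, 200, 200]))
    print(task_waves(200, 96))
```

A stage with 200 partitions on 96 cores needs three waves, and the last wave leaves most cores idle, which is one reason adding nodes does not always shorten a job.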

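Summary¶

Make full use of resources
- Core
- Memory
- Disk
- Network
- Data
- Cost
Establish a baseline and identify problems
- Resources: Core, Memory, Disk, Network
- Jobs: Job, Stage, Task, Spill
Reduce the data scanned
- Partitioning: Partition Filter
- Bucketing
- Z-Ordering: Colocate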
Set partitions correctly

Types
- Input: control partition size with spark.sql.files.maxPartitionBytes (mutable)
- Shuffle: control partition count with spark.sql.shuffle.partitions
- Output: control partition size with maxRecordsPerFile, Coalesce, Repartition, localCheckpoint
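
To make the input and output settings above concrete, here is a rough sketch of how partition and file counts follow from them. It is a simplification under stated assumptions: real Spark input splitting also accounts for file boundaries, spark.sql.files.openCostInBytes, and the available cores, and the sketch assumes empty partitions write no file. The function names are hypothetical.

```python
MIB = 1024 * 1024


def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def estimate_input_partitions(total_bytes, max_partition_bytes=128 * MIB):
    """Approximate input partition count for a given maxPartitionBytes.

    Simplified: treats the input as one splittable stream of bytes.
    """
    return max(1, ceil_div(total_bytes, max_partition_bytes))


def estimate_output_files(records_per_partition, max_records_per_file):
    """Approximate output file count when each write task starts a new
    file after max_records_per_file records (assumes empty partitions
    produce no file)."""
    return sum(
        ceil_div(n, max_records_per_file)
        for n in records_per_partition
        if n > 0
    )


if __name__ == "__main__":
    print(estimate_input_partitions(10 * 1024 * MIB))            # 10 GiB of input
    print(estimate_output_files([2_500_000, 900_000], 1_000_000))
```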
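Principles
- Explore the data and estimate resources
- Balance the data and preserve parallelism: about 128 MB / 1,000,000 records per partition, and keep partition counts in multiples (e.g. of the core count)
- Input, Shuffle, Output
- Avoid spill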
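
The sizing principle above can be sketched as a small calculation: divide the shuffle data by a target partition size, then round up to a multiple of the core count so every wave of tasks is full. Reading "multiples" as multiples of the total core count is an assumption of this sketch, as are the function name and the sample figures.

```python
MIB = 1024 * 1024


def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def suggest_shuffle_partitions(shuffle_bytes, total_cores, target_bytes=128 * MIB):
    """Suggest a value for spark.sql.shuffle.partitions.

    Step 1: enough partitions that each holds at most target_bytes.
    Step 2: round up to a multiple of total_cores (assumed interpretation
            of "keep multiples"), which also guarantees at least one
            task per core.
    """
    if total_cores <= 0:
        raise ValueError("total_cores must be positive")
    by_size = max(1, ceil_div(shuffle_bytes, target_bytes))
    return ceil_div(by_size, total_cores) * total_cores


if __name__ == "__main__":
    # Hypothetical stage: 54 GiB of shuffle data on a 96-core cluster.
    print(suggest_shuffle_partitions(54 * 1024 * MIB, total_cores=96))
```

Rounding up rather than down keeps each partition at or below the target size, which errs on the side of avoiding spill.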
Balance and trade-offs
- Input Partitions
- Shuffle Partitions
- Output Files
- Spills
- GC Times
Join optimization
- SortMerge Join: both sides are large
- Broadcast Join: one side is small
- BroadcastNestedLoop Join: no equality predicate
- Skew Join: Salting, Grouping
- Range Join: Point, Overlap
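
As an illustration of the salting idea for skewed joins, the following sketch uses plain Python lists instead of Spark DataFrames. The large side gets a random salt appended to each key, the small side is replicated once per salt value, and the join runs on the (key, salt) pair, so a hot key is spread over several buckets while the result stays the same. All names and the sample data are hypothetical.

```python
import random
from collections import defaultdict


def hash_join(left, right):
    """Inner join two lists of (key, value) pairs on key."""
    index = defaultdict(list)
    for key, value in right:
        index[key].append(value)
    return [(key, lv, rv) for key, lv in left for rv in index.get(key, [])]


def salted_join(large, small, num_salts, seed=0):
    """Inner join that spreads each key of `large` over num_salts buckets."""
    rng = random.Random(seed)
    # Large side: attach a random salt so one hot key maps to many join keys.
    salted_large = [((key, rng.randrange(num_salts)), value) for key, value in large]
    # Small side: replicate every row once per salt so each bucket finds its match.
    replicated_small = [
        ((key, salt), value) for key, value in small for salt in range(num_salts)
    ]
    joined = hash_join(salted_large, replicated_small)
    # Drop the salt again; callers only see the original key.
    return [(key, lv, rv) for (key, _salt), lv, rv in joined]


if __name__ == "__main__":
    large = [("hot", i) for i in range(1000)] + [("cold", 1)]
    small = [("hot", "H"), ("cold", "C")]
    assert sorted(salted_join(large, small, 8)) == sorted(hash_join(large, small))
```

The cost of salting is that the small side grows by a factor of num_salts, so the salt count is a trade-off between skew relief and extra data.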
Reduce data movement and repartitioning
- df#cache, df#persist
- CACHE TABLE
Use vectorized UDFs

Tip¶

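Maven dependencies¶

Dependencies¶

Dependency elements

Coordinates: groupId, artifactId, version (range)
Scope: compile, test, provided, runtime
Classifier: an identifier that distinguishes artifacts built in different ways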

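Dependency mediation rules

Shortest path wins (the nearest definition is used)
First declaration wins (when path lengths are equal)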
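
The two rules above can be sketched as a breadth-first walk over the dependency tree: shallower levels are visited first, and within a level the declaration order is kept, so the first version seen for an artifact is the one that wins. This is a simplified model, not Maven's actual resolver; it ignores scopes, exclusions, and version ranges, and all coordinates in it are hypothetical.

```python
from collections import deque


def resolve(root, graph):
    """Pick one version per artifact using nearest-wins, then first-declared-wins.

    `graph` maps a node to its declared dependencies, in declaration order.
    Coordinates are "groupId:artifactId:version" strings.
    """
    chosen = {}
    queue = deque(graph.get(root, []))
    while queue:
        coord = queue.popleft()
        artifact, version = coord.rsplit(":", 1)
        if artifact in chosen:
            # A nearer or earlier declaration already won; skip this subtree.
            continue
        chosen[artifact] = version
        queue.extend(graph.get(coord, []))
    return chosen


if __name__ == "__main__":
    graph = {
        "app": ["org.a:lib-a:1.0", "org.b:lib-b:1.0"],
        "org.a:lib-a:1.0": ["org.c:lib-c:1.0", "org.x:lib-x:1.0"],
        "org.b:lib-b:1.0": ["org.d:lib-d:1.0", "org.x:lib-x:2.0"],
        "org.c:lib-c:1.0": ["org.d:lib-d:2.0"],
    }
    for artifact, version in resolve("app", graph).items():
        print(artifact, version)
```

In this example lib-d resolves to 1.0 because it sits at depth 2 rather than depth 3, and lib-x resolves to 1.0 because lib-a is declared before lib-b at the same depth.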

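Resolving conflicts

Declare the dependency explicitly
Exclude the conflicting dependency
Adjust the dependency scope
Adjust with the Shade plugin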

Spark job duration analysis ¶

External influences and dependencies
Internal processes and mechanisms

Reference¶

Maven