ARTS - 2019 Week 8-3¶
20190818~20190824
Algorithm¶
Review¶
Deep Dive Scheduler in Apache Spark¶
As a core component of a data processing platform, the scheduler is responsible for scheduling tasks on compute units. Built on a Directed Acyclic Graph (DAG) compute model, the Spark scheduler works together with the Block Manager and Cluster Backend to use cluster resources efficiently and deliver high performance across various workloads. This talk dives into the technical details of the full lifecycle of a typical Spark workload as it is scheduled and executed, and also discusses how to tune the Spark scheduler for better performance.
[PDF] Deep Dive Scheduler in Apache Spark
SparkContext¶
- Main program entry point; submits and cancels jobs
- SchedulerBackend (CoarseGrainedSchedulerBackend, LocalSchedulerBackend), DAGScheduler, TaskScheduler
Scheduling Process¶
- RDD, DAGScheduler, TaskScheduler, Worker
- RDD, Stage, TaskSet
DAGScheduler¶
- RDD -> Stage
- Stage -> TaskSet
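The two steps above can be illustrated with a toy model. This is a minimal sketch in plain Python, not Spark's implementation: the `Rdd` class, `build_stages`, and `task_set` are hypothetical names invented for the illustration. It assumes that a stage is a group of RDDs linked by narrow dependencies, that a new stage starts at every shuffle dependency, and that a stage yields one task per partition.

```python
from dataclasses import dataclass, field


@dataclass
class Rdd:
    """Hypothetical stand-in for an RDD in the lineage graph."""
    name: str
    partitions: int
    # Each dependency is (parent, kind), where kind is "narrow" or "shuffle".
    deps: list = field(default_factory=list)


def build_stages(final_rdd):
    """Walk the lineage backwards, cutting a new stage at every shuffle dependency."""
    stages = []  # parent stages are appended before their children
    seen = {}    # name of the last RDD of a stage -> stage index

    def visit(rdd):
        if rdd.name in seen:
            return seen[rdd.name]
        members, parent_ids = [], []
        stack = [rdd]
        while stack:
            current = stack.pop()
            if current.name in members:
                continue
            members.append(current.name)
            for parent, kind in current.deps:
                if kind == "shuffle":
                    parent_ids.append(visit(parent))  # boundary: parent stage
                else:
                    stack.append(parent)              # narrow: same stage
        stages.append({"rdds": members, "parents": parent_ids,
                       "num_tasks": rdd.partitions})
        seen[rdd.name] = len(stages) - 1
        return seen[rdd.name]

    visit(final_rdd)
    return stages


def task_set(stage_id, stage):
    """Stage -> TaskSet: one (stage_id, partition) task per partition."""
    return [(stage_id, partition) for partition in range(stage["num_tasks"])]


# A word-count style lineage: three narrow steps, then one shuffle.
text = Rdd("textFile", 4)
words = Rdd("flatMap", 4, [(text, "narrow")])
pairs = Rdd("map", 4, [(words, "narrow")])
counts = Rdd("reduceByKey", 2, [(pairs, "shuffle")])

for stage_id, stage in enumerate(build_stages(counts)):
    print(stage_id, stage["rdds"], task_set(stage_id, stage))
```

In this example the three narrow transformations collapse into one stage with four tasks, and the shuffle produces a second stage with two tasks that depends on the first.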
TaskScheduler¶
- Batch, Barrier
- FIFO, Fair
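To make the difference between the two scheduling modes concrete, here is a simplified sketch in plain Python. `TaskSetInfo`, `fifo_order`, and `fair_order` are hypothetical names, not Spark classes. The FIFO rule orders by job, then stage. The fair rule here looks only at running tasks divided by weight, which is a deliberate simplification: Spark's fair algorithm also takes each pool's minimum share into account.

```python
from dataclasses import dataclass


@dataclass
class TaskSetInfo:
    """Hypothetical summary of a task set waiting for resources."""
    name: str
    job_id: int
    stage_id: int
    running_tasks: int = 0
    weight: int = 1


def fifo_order(task_sets):
    """FIFO: earlier jobs first, then earlier stages within a job."""
    return sorted(task_sets, key=lambda t: (t.job_id, t.stage_id))


def fair_order(task_sets):
    """Simplified fair ordering: fewest running tasks per unit of weight first."""
    return sorted(task_sets, key=lambda t: (t.running_tasks / t.weight, t.name))


waiting = [
    TaskSetInfo("late-small", job_id=1, stage_id=1, running_tasks=2),
    TaskSetInfo("early-big", job_id=0, stage_id=0, running_tasks=8),
]
print([t.name for t in fifo_order(waiting)])
print([t.name for t in fair_order(waiting)])
```

Under FIFO the earlier job keeps receiving offers first even though it already runs eight tasks. Under the fair rule the later job, which runs fewer tasks, moves to the front.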
TaskSetManager¶
- Locality-aware scheduling is implemented as delay scheduling
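A rough sketch of the delay scheduling idea, in plain Python: a task set prefers the most local placement and relaxes to the next locality level only after waiting for a while. The function names are hypothetical, and the single `wait_per_level` value is an assumption modeled on the 3-second default of `spark.locality.wait`. Spark itself allows a separate wait per level and resets its timer when a task launches, which this sketch leaves out.

```python
# Locality levels from most to least local.
LEVELS = ["PROCESS_LOCAL", "NODE_LOCAL", "RACK_LOCAL", "ANY"]


def allowed_level(seconds_waiting, wait_per_level=3.0):
    """Return the least local level currently acceptable after waiting this long."""
    steps = int(seconds_waiting // wait_per_level)
    return LEVELS[min(steps, len(LEVELS) - 1)]


def can_launch(offer_level, seconds_waiting, wait_per_level=3.0):
    """True if a resource offer at `offer_level` is local enough to accept now."""
    allowed = allowed_level(seconds_waiting, wait_per_level)
    return LEVELS.index(offer_level) <= LEVELS.index(allowed)


print(allowed_level(0.0))            # still insisting on the best locality
print(can_launch("ANY", 1.0))        # too early to give up locality
print(can_launch("NODE_LOCAL", 4.0)) # one wait period has elapsed
```

The trade-off is between a short delay and the cost of reading data remotely: a longer wait favors locality, a shorter one favors starting tasks sooner.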
Handle Failures¶
- Task: maxTaskFailures (configured by spark.task.maxFailures)
- Stage: maxStageFailures (configured by spark.stage.maxConsecutiveAttempts)
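The task-level limit can be pictured as a bounded retry loop. The following is a minimal sketch in plain Python, not Spark code: `run_with_retries` is a hypothetical helper, and the default of 4 mirrors the default value of `spark.task.maxFailures`. It assumes the limit counts failures of the same task, so a limit of 4 allows at most four attempts before the job is given up.

```python
def run_with_retries(task, max_task_failures=4):
    """Run `task` until it succeeds or has failed `max_task_failures` times."""
    failures = 0
    while True:
        try:
            return task()
        except Exception as error:
            failures += 1
            if failures >= max_task_failures:
                # The scheduler would abort the whole job at this point.
                raise RuntimeError(
                    f"task failed {failures} times; aborting job"
                ) from error


attempts = {"count": 0}


def flaky_task():
    """Hypothetical task that fails twice before succeeding."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ValueError("transient failure")
    return "done"


print(run_with_retries(flaky_task), attempts["count"])
```

Stage-level handling follows the same pattern one level up: a stage that keeps failing is resubmitted only a bounded number of times before the job is aborted.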
SchedulerBackend¶
- Manages the resources available for scheduling
Worker¶
- ExternalShuffleService
Improve Job Performance¶
- Break long-running tasks into simple/short tasks
- Broadcast small hot input files
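The broadcast tip can be sketched without Spark. This is an illustration in plain Python with a hypothetical `broadcast_join` helper: the small dataset is turned into a lookup table once (standing in for a broadcast variable), and every partition of the large dataset joins against it locally, so the large side never needs to be shuffled.

```python
def broadcast_join(large_partitions, small_table):
    """Join each partition of the large side against a small lookup table."""
    lookup = dict(small_table)  # stands in for the broadcast value

    def join_partition(rows):
        # Each partition is processed independently, with no data exchange.
        return [(key, value, lookup[key]) for key, value in rows if key in lookup]

    return [join_partition(rows) for rows in large_partitions]


clicks = [[(1, "home"), (2, "cart")], [(3, "search"), (1, "checkout")]]
users = [(1, "alice"), (3, "carol")]
print(broadcast_join(clicks, users))
```

The same partition-at-a-time structure also reflects the first tip: more, smaller partitions mean shorter tasks that the scheduler can spread more evenly across executors.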
Tip¶
YARN Scheduler¶
FIFO
- First-in, first-out queue
- Cluster resources are not shared
Capacity
- Organizes resources through elastic hierarchical queues
- Jobs within a queue are scheduled first-in, first-out
Fair
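- Organizes resources through elastic hierarchical queues
- Supports different resource scheduling algorithms
- Supports preemption, based on waiting time and resources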
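The contrast between FIFO and fair allocation can be shown with a toy calculation. This is a sketch in plain Python under strong assumptions: a single resource dimension, no queue hierarchy, no preemption, and hypothetical function names. The fair variant uses max-min fairness, which is one common fair-sharing rule and is not claimed to match YARN's exact policy.

```python
def fifo_allocation(capacity, demands):
    """Serve applications in submission order (dict order) until capacity runs out."""
    allocation, remaining = {}, float(capacity)
    for name, demand in demands.items():
        allocation[name] = min(float(demand), remaining)
        remaining -= allocation[name]
    return allocation


def max_min_fair_share(capacity, demands):
    """Split capacity evenly, then redistribute what small demands leave unused."""
    allocation = {name: 0.0 for name in demands}
    remaining = float(capacity)
    unsatisfied = {name for name, demand in demands.items() if demand > 0}
    while unsatisfied and remaining > 1e-9:
        share = remaining / len(unsatisfied)
        for name in sorted(unsatisfied):
            grant = min(share, demands[name] - allocation[name])
            allocation[name] += grant
            remaining -= grant
        unsatisfied = {n for n in unsatisfied
                       if demands[n] - allocation[n] > 1e-9}
    return allocation


demands = {"app-a": 20, "app-b": 50, "app-c": 80}
print(fifo_allocation(100, demands))
print(max_min_fair_share(100, demands))
```

With 100 units, FIFO gives the first two applications everything they ask for and leaves the last with the remainder, while the fair rule satisfies the small demand fully and splits the rest evenly between the two larger ones.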