

ARTS - 2019 Week 8-3¶

20190818~20190824

Algorithm¶

107. Binary Tree Level Order Traversal II
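
A minimal sketch of one common approach to this problem: a breadth-first traversal that collects values level by level, then reverses the list of levels. The `TreeNode` class and the function name `level_order_bottom` are assumptions made for this illustration, not taken from the notes.

```python
from collections import deque
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TreeNode:
    val: int
    left: Optional["TreeNode"] = None
    right: Optional["TreeNode"] = None


def level_order_bottom(root: Optional[TreeNode]) -> List[List[int]]:
    """Return node values level by level, from the deepest level up to the root."""
    if root is None:
        return []
    levels: List[List[int]] = []
    queue = deque([root])
    while queue:
        level: List[int] = []
        # Everything currently in the queue belongs to the same level.
        for _ in range(len(queue)):
            node = queue.popleft()
            level.append(node.val)
            if node.left:
                queue.append(node.left)
            if node.right:
                queue.append(node.right)
        levels.append(level)
    # Levels were collected top-down; the problem asks for bottom-up.
    levels.reverse()
    return levels


tree = TreeNode(3, TreeNode(9), TreeNode(20, TreeNode(15), TreeNode(7)))
print(level_order_bottom(tree))  # [[15, 7], [9, 20], [3]]
```

Reversing once at the end keeps the traversal itself identical to a plain level order traversal.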

Review¶

Deep Dive Scheduler in Apache Spark ¶

As a core component of a data processing platform, the scheduler is responsible for scheduling tasks on compute units. Built on a Directed Acyclic Graph (DAG) compute model, the Spark scheduler works together with the Block Manager and Cluster Backend to use cluster resources efficiently and deliver high performance across various workloads. This talk dives into the technical details of the full lifecycle of a typical Spark workload as it is scheduled and executed, and also discusses how to tune the Spark scheduler for better performance.

[PDF] Deep Dive Scheduler in Apache Spark

SparkContext¶

Main entry point of the program; submits and cancels jobs
SchedulerBackend (CoarseGrainedSchedulerBackend, LocalSchedulerBackend), DAGScheduler, TaskScheduler

Scheduling Process¶

RDD, DAGScheduler, TaskScheduler, Worker
RDD, Stage, TaskSet

DAGScheduler¶

RDD -> Stage
Stage -> TaskSet
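
The RDD -> Stage step can be sketched with a toy model: RDDs linked by narrow dependencies stay in the same stage, and every shuffle dependency starts a new parent stage. This is only an illustration of the idea in plain Python, not Spark's actual DAGScheduler code; `ToyRDD` and `build_stages` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class ToyRDD:
    name: str
    # Each entry is (parent RDD, True if the dependency needs a shuffle).
    parents: List[Tuple["ToyRDD", bool]] = field(default_factory=list)


def build_stages(final_rdd: ToyRDD) -> List[List[str]]:
    """Return stages in execution order; each stage lists the names of its RDDs."""
    stages: List[List[str]] = []
    done: Set[str] = set()  # stage roots that were already expanded

    def visit(root: ToyRDD) -> None:
        if root.name in done:
            return
        done.add(root.name)
        members: List[str] = []
        shuffle_parents: List[ToyRDD] = []
        seen: Set[str] = set()
        stack = [root]
        while stack:
            rdd = stack.pop()
            if rdd.name in seen:
                continue
            seen.add(rdd.name)
            members.append(rdd.name)
            for parent, is_shuffle in rdd.parents:
                if is_shuffle:
                    shuffle_parents.append(parent)  # stage boundary
                else:
                    stack.append(parent)  # narrow: same stage
        # Parent stages must run first, so they are emitted first.
        for parent in shuffle_parents:
            visit(parent)
        stages.append(members[::-1])

    visit(final_rdd)
    return stages


lines = ToyRDD("lines")
words = ToyRDD("words", [(lines, False)])
counts = ToyRDD("counts", [(words, True)])  # e.g. a reduceByKey-style shuffle
frequent = ToyRDD("frequent", [(counts, False)])
print(build_stages(frequent))  # [['lines', 'words'], ['counts', 'frequent']]
```

In this model, Stage -> TaskSet would then amount to creating one task per partition of each stage, which the sketch leaves out.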

TaskScheduler¶

Batch, Barrier
FIFO, Fair
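
The difference between the FIFO and Fair modes can be illustrated with a toy ordering function. This is a deliberately simplified sketch: the `Schedulable` class and its fields are assumptions for illustration, and the fair rule below (fewest running tasks per unit of weight first) ignores details such as minimum shares that a real fair scheduler would consider.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Schedulable:
    name: str
    job_id: int
    stage_id: int
    running_tasks: int = 0
    weight: int = 1


def fifo_order(items: List[Schedulable]) -> List[str]:
    """Earlier jobs first; within a job, earlier stages first."""
    return [s.name for s in sorted(items, key=lambda s: (s.job_id, s.stage_id))]


def fair_order(items: List[Schedulable]) -> List[str]:
    """Whoever currently holds the fewest running tasks per unit of weight goes first."""
    ranked = sorted(items, key=lambda s: (s.running_tasks / s.weight, s.name))
    return [s.name for s in ranked]


pending = [
    Schedulable("a", job_id=0, stage_id=0, running_tasks=8),
    Schedulable("b", job_id=1, stage_id=1, running_tasks=2),
    Schedulable("c", job_id=2, stage_id=2, running_tasks=4, weight=4),
]
print(fifo_order(pending))  # ['a', 'b', 'c']
print(fair_order(pending))  # ['c', 'b', 'a']
```

Under FIFO the oldest job always gets the next offer, while the fair ordering lets later, lighter jobs make progress alongside a large one.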

TaskSetManager¶

Locality-aware scheduling is achieved through delay scheduling
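
Delay scheduling can be sketched as follows: a task set first insists on the best locality level, and only after waiting for a while without launching anything does it accept a worse one. The level names mirror Spark's locality levels and the 3-second default mirrors `spark.locality.wait`, but the functions themselves are hypothetical simplifications, not Spark's implementation.

```python
# Locality levels from best to worst.
LEVELS = ["PROCESS_LOCAL", "NODE_LOCAL", "RACK_LOCAL", "ANY"]


def allowed_locality(waited_s: float, wait_per_level_s: float = 3.0) -> str:
    """Relax the allowed locality by one level for every full wait period that has passed."""
    steps = int(waited_s // wait_per_level_s)
    return LEVELS[min(steps, len(LEVELS) - 1)]


def can_launch(task_level: str, waited_s: float, wait_per_level_s: float = 3.0) -> bool:
    """A task may launch only if its locality is at least as good as the allowed level."""
    allowed = allowed_locality(waited_s, wait_per_level_s)
    return LEVELS.index(task_level) <= LEVELS.index(allowed)


print(allowed_locality(0.0))   # PROCESS_LOCAL
print(allowed_locality(3.5))   # NODE_LOCAL
print(can_launch("ANY", 2.9))  # False: still waiting for a local slot
print(can_launch("ANY", 9.0))  # True: waited long enough to run anywhere
```

The trade-off is a short scheduling delay in exchange for a better chance of running the task where its data already lives.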

Handle Failures¶

Task: maxTaskFailures
Stage: maxStageFailures
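
The task-level limit can be pictured as a bounded retry loop: a failed task is retried until it has failed `maxTaskFailures` times, at which point the whole task set is given up. The sketch below is a plain Python illustration; `run_with_retries` and `TaskSetAborted` are hypothetical names, and the default of 4 is assumed here to match Spark's documented default for `spark.task.maxFailures`.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class TaskSetAborted(Exception):
    """Raised when one task has failed too many times."""


def run_with_retries(task: Callable[[int], T], max_task_failures: int = 4) -> T:
    """Run task(attempt) until it succeeds or has failed max_task_failures times."""
    failures = 0
    while True:
        try:
            return task(failures)
        except Exception as exc:
            failures += 1
            if failures >= max_task_failures:
                raise TaskSetAborted(f"task failed {failures} times") from exc


def flaky(attempt: int) -> str:
    # Fails on the first two attempts, then succeeds.
    if attempt < 2:
        raise RuntimeError("executor lost")
    return "ok"


print(run_with_retries(flaky))  # ok
```

The stage-level limit works in the same spirit one level up: a stage that keeps failing is retried a bounded number of times before the job is aborted.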

SchedulerBackend¶

Manages the resources used for scheduling

Worker¶

ExternalShuffleService

Improve Job Performance¶

Break long-running tasks into simple/short tasks
Broadcast small hot input files

Tip¶

YARN Scheduler¶

FIFO

First-in, first-out queue
Cluster resources are not shared

Capacity

Organizes resources through elastic hierarchical queues
Jobs within a queue run in FIFO order

Fair

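Organizes resources through elastic hierarchical queues
Supports different resource scheduling algorithms
Supports preemptive scheduling based on wait time and resources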

Reference
