
ARTS - 2019 Week 8-3

20190818~20190824

Algorithm

Review

Deep Dive Scheduler in Apache Spark

As a core component of a data processing platform, the scheduler is responsible for scheduling tasks on compute units. Built on a Directed Acyclic Graph (DAG) compute model, the Spark scheduler works together with the Block Manager and the cluster backend to efficiently utilize cluster resources for high performance across various workloads. This talk dives into the technical details of the full lifecycle of a typical Spark workload as it is scheduled and executed, and also discusses how to tune the Spark scheduler for better performance.

【PDF】Deep Dive Scheduler in Apache Spark

SparkContext

  • Entry point of the driver program; submits and cancels jobs (see the sketch below)
  • Creates the SchedulerBackend (CoarseGrainedSchedulerBackend or LocalSchedulerBackend), the DAGScheduler and the TaskScheduler
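
A minimal sketch (local mode; app name and numbers are illustrative) of the role described above: constructing a SparkContext wires up the SchedulerBackend, DAGScheduler and TaskScheduler, and runJob is the low-level submission path that actions call into.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SchedulerEntrySketch {
  def main(args: Array[String]): Unit = {
    // Creating the context builds LocalSchedulerBackend (for local[*] masters),
    // DAGScheduler and TaskSchedulerImpl behind the scenes.
    val conf = new SparkConf().setAppName("scheduler-demo").setMaster("local[2]")
    val sc   = new SparkContext(conf)

    val rdd = sc.parallelize(1 to 100, numSlices = 4)

    // runJob is the job-submission entry point that actions such as count() call.
    val partialSums = sc.runJob(rdd, (iter: Iterator[Int]) => iter.sum)
    println(partialSums.sum)

    sc.stop()
  }
}
```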

Scheduling Process

  • Flow: an RDD action goes from the DAGScheduler through the TaskScheduler down to the Workers
  • The work is represented as RDD -> Stage -> TaskSet along the way

DAGScheduler

  • RDD -> Stage: the RDD lineage is cut into Stages at shuffle (wide) dependencies
  • Stage -> TaskSet: each Stage is submitted to the TaskScheduler as a TaskSet (sketch below)
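
A short sketch of the stage cut, reusing the `sc` from the sketch above: the map stays in the same stage, the shuffle introduced by reduceByKey starts a new one, and each stage is handed to the TaskScheduler as a TaskSet.

```scala
// Reuses the active SparkContext `sc` from the previous sketch.
val words  = sc.parallelize(Seq("a", "b", "a", "c"), numSlices = 2)
val counts = words.map(w => (w, 1)).reduceByKey(_ + _) // shuffle dependency => new stage

// toDebugString prints the lineage with its shuffle/stage structure.
println(counts.toDebugString)

// The action triggers the DAGScheduler: two stages, each submitted as a TaskSet.
counts.collect().foreach(println)
```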

TaskScheduler

  • Task types: regular (batch) tasks and barrier tasks
  • Scheduling policies: FIFO (default) and FAIR (config sketch below)
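
A hedged sketch of both points: the policy is picked with spark.scheduler.mode (FIFO by default), and barrier execution is requested per RDD; the allocation-file path is illustrative and `sc` is the context from the earlier sketch.

```scala
import org.apache.spark.SparkConf

// FAIR enables pool-based scheduling of concurrent jobs inside one application.
val fairConf = new SparkConf()
  .set("spark.scheduler.mode", "FAIR")
  .set("spark.scheduler.allocation.file", "/etc/spark/fairscheduler.xml") // illustrative path

// Barrier scheduling (Spark 2.4+): all tasks of the stage launch together or not at all.
val data    = sc.parallelize(1 to 8, numSlices = 2)
val doubled = data.barrier().mapPartitions(iter => iter.map(_ * 2))
doubled.collect()
```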

TaskSetManager

  • Locality-aware scheduling via delay scheduling: wait briefly for a better-locality slot before falling back (configs below)
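
Delay scheduling is tuned with the spark.locality.wait family of settings; the values shown are the documented defaults.

```scala
import org.apache.spark.SparkConf

// TaskSetManager waits up to these timeouts for a better-locality slot before
// degrading PROCESS_LOCAL -> NODE_LOCAL -> RACK_LOCAL -> ANY.
val conf = new SparkConf()
  .set("spark.locality.wait", "3s")          // default wait for each downgrade step
  .set("spark.locality.wait.process", "3s")
  .set("spark.locality.wait.node", "3s")
  .set("spark.locality.wait.rack", "3s")
```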

Handle Failures

  • Task level: retries are bounded by maxTaskFailures
  • Stage level: retries are bounded by maxStageFailures (configs below)
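
The two retry limits correspond to standard configuration keys (defaults shown); exhausting either limit aborts the job.

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.task.maxFailures", "4")             // retries per task before giving up
  .set("spark.stage.maxConsecutiveAttempts", "4") // consecutive stage attempts (e.g. after fetch failures)
```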

SchedulerBackend

  • Manages the compute resources (executors and their cores) offered to the TaskScheduler (sizing sketch below)
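
The resources the backend tracks and offers to the TaskScheduler are sized with the usual executor settings; the numbers are illustrative.

```scala
import org.apache.spark.SparkConf

// CoarseGrainedSchedulerBackend registers these executors and offers their
// cores to the TaskScheduler as resource offers.
val conf = new SparkConf()
  .set("spark.executor.instances", "4")
  .set("spark.executor.cores", "2")
  .set("spark.executor.memory", "4g")
```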

Worker

  • Hosts executors and, optionally, the ExternalShuffleService so shuffle files outlive individual executors (config below)
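
A sketch of turning the external shuffle service on, so shuffle files are served by the node rather than by the executor that wrote them, which in turn lets idle executors be released under dynamic allocation.

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.shuffle.service.enabled", "true")   // shuffle files served by the node's shuffle service
  .set("spark.dynamicAllocation.enabled", "true") // idle executors can then be removed safely
```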

Improve Job Performance

  • Break long-running tasks into simple/short tasks
  • Broadcast small hot input files
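
A sketch of both tips, assuming the `sc` from the earlier sketch; the input path, partition count and lookup table are hypothetical.

```scala
// Tip 1: break work into more, shorter tasks; a higher partition count makes
// stragglers and retries cheaper.
val events = sc.textFile("hdfs:///data/events", minPartitions = 400)

// Tip 2: broadcast a small, frequently read input once per executor
// instead of re-reading or shipping it with every task.
val smallLookup: Map[String, Int] = Map("a" -> 1, "b" -> 2) // stand-in for a small hot file
val bc = sc.broadcast(smallLookup)

val scored = events.map(line => bc.value.getOrElse(line, 0))
scored.count()
```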

Tip

YARN Scheduler

FIFO

  • A single first-in, first-out queue
  • Cluster resources are not shared among jobs

Capacity

  • Organizes resources into elastic, hierarchical queues
  • Jobs within a queue are scheduled first-in, first-out

Fair

  • Organizes resources into elastic, hierarchical queues
  • Supports different resource scheduling policies per queue (FIFO, fair, DRF)
  • Supports preemption, triggered by wait time and resource thresholds
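
From the Spark side, the Capacity/Fair queue an application lands in is chosen at submit time; the queue name below is illustrative.

```scala
import org.apache.spark.SparkConf

// Route this Spark application into a specific YARN scheduler queue
// ("default" is used when unset).
val conf = new SparkConf()
  .setMaster("yarn")
  .set("spark.yarn.queue", "prod") // illustrative queue name
```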

References

Share

Log Collection

Reference