Skip to content

ARTS - 2019 Week 6-2

20190609~20190615

Algorithm

Review

Apache Spark Core - Deep Dive — Proper Optimization

Optimizing spark jobs through a true understanding of spark core.

Learn

  • What is a partition?
  • What is the difference between read/shuffle/write partitions?
  • How to increase parallelism and decrease output files?
  • Where does shuffle data go between stages?
  • What is the "right" size for your spark partitions and files?
  • Why does a job slow down with only a few tasks left and never finish?
  • Why doesn't adding nodes decrease my compute time?

【PDF】Apache Spark Core - Deep Dive — Proper Optimization

Review

层次架构

  • Cluster、Driver、Executor(Core、Storage
  • Application -(Action)-> Job -(Shuffle)-> Stage -(Partition)-> Task

总结

  • 充分利用资源

    • Core
    • Memory
    • Disk
    • Network
    • Data
    • Cost
  • 基线与问题

    • 资源:Core、Memory、Disk、Network
    • 作业:Job、Stage、Task、Spill
  • 减少数据扫描

    • 分区 - Partition Filter
    • 分桶
    • Z-Ordering - Colocate
  • 正确设置分区

    类型

    • Input - 控制分区大小 spark.sql.files.maxPartitionBytes (mutable)
    • Shuffle - 控制分区数量 spark.sql.shuffle.partitions
    • Output - 控制分区大小 maxRecordsPerFile、Coalesce、Repartition、localCheckpoint

    原则

    • 数据探索,资源预估
    • 均衡数据,保证并行度 - 128MB/100W、保持倍数
    • Input、Shuffle、Output
    • 避免 Spill
  • 平衡与取舍

    • Input Partitions
    • Shuffle Partitions
    • Output Files
    • Spills
    • GC Times
  • Join 优化

    • SortMerge Join – 两侧数据量都很大
    • Broadcast Join – 一侧数据量小
    • BroadcastNestedLoop Join - 没有相等谓词
    • Skew Join - Salting,Grouping
    • Range Join - Point、Overlap
  • 减少数据移动与重新分区

    • df#cache、df#persist
    • CACHE TABLE
  • 使用向量化 UDF

Tip

Maven 依赖

依赖

依赖元素

  • 坐标:groupId, artifactId, version(范围)
  • 作用域(scope):compile、test、provided 、runtime
  • 类别(classifier):不同构建方式的标识

依赖原则

  • 最短路劲原则
  • 最先定义原则

解决冲突

  • 明确定义依赖
  • 排除冲突依赖
  • 调整依赖作用域
  • 通过Shade插件调整

Share

Spark 作业耗时分析

  • 外部影响、依赖
  • 内部流程、机制

Reference