ARTS - 2019 Week 8-1

20190804~20190810

Algorithm

Review

How to Extend Apache Spark with Customized Optimizations

There is a growing set of optimization mechanisms that allow you to achieve competitive SQL performance. Spark has extension points that let third parties add customizations and optimizations without needing to merge them into Apache Spark. This is very powerful and helps extensibility. We have added some enhancements to the existing extension points framework to enable finer-grained control. This talk will be a deep dive into the extension points that are available in Spark today. We will also talk about the enhancements we developed to make this API more powerful. This talk will benefit developers who are looking to customize Spark in their deployments.

[PDF] How to Extend Apache Spark with Customized Optimizations

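Extension use cases

  • Performance optimization: use referential integrity constraints and indexes to reduce the amount of data scanned
  • Integration with third-party applications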

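Problem and options

How do you get custom development onto a cluster?

  • Merge the code into the central (upstream) repository
  • Modify the code and maintain your own branch
  • Use the extension points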

Extension Points API

  • Added in Spark 2.2 (SPARK-18127); API enhancements in SPARK-26249
  • Pluggable and extensible
  • Extend SparkSession through SparkSessionExtensions

SparkSessionExtensions

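Inject custom logic through a hook mechanism

  • Spark SQL: parser, analyzer, optimizer, planner strategies, functions
  • Enabled through withExtensions or the spark.sql.extensions configuration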
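
The hook mechanism described above can be illustrated with a small toy model. Spark's real API is Scala (`SparkSession.builder.withExtensions` and methods such as `injectOptimizerRule` on `SparkSessionExtensions`); the sketch below is not that API. Every class and function name in it (`ToyExtensions`, `ToySessionBuilder`, `ToySession`, `drop_noop_filters`) is hypothetical, and a "plan" is assumed to be just a list of operator names. It only shows the pattern: callbacks register rules on an extensions object, and the session applies them later.

```python
from typing import Callable, List

Plan = List[str]               # toy "query plan": a list of operator names
Rule = Callable[[Plan], Plan]  # a rule rewrites one plan into another


class ToyExtensions:
    """Hypothetical registry of injected rules (a stand-in, not Spark's class)."""

    def __init__(self) -> None:
        self.optimizer_rules: List[Rule] = []

    def inject_optimizer_rule(self, rule: Rule) -> None:
        self.optimizer_rules.append(rule)


class ToySession:
    """Applies every injected optimizer rule in registration order."""

    def __init__(self, extensions: ToyExtensions) -> None:
        self._extensions = extensions

    def optimize(self, plan: Plan) -> Plan:
        for rule in self._extensions.optimizer_rules:
            plan = rule(plan)
        return plan


class ToySessionBuilder:
    """Collects configuration callbacks and runs them when the session is built."""

    def __init__(self) -> None:
        self._configurators: List[Callable[[ToyExtensions], None]] = []

    def with_extensions(self, configure: Callable[[ToyExtensions], None]) -> "ToySessionBuilder":
        self._configurators.append(configure)
        return self

    def get_or_create(self) -> ToySession:
        extensions = ToyExtensions()
        for configure in self._configurators:
            configure(extensions)
        return ToySession(extensions)


def drop_noop_filters(plan: Plan) -> Plan:
    """Example custom rule: remove filters that always evaluate to true."""
    return [op for op in plan if op != "Filter(true)"]


session = (
    ToySessionBuilder()
    .with_extensions(lambda ext: ext.inject_optimizer_rule(drop_noop_filters))
    .get_or_create()
)
optimized = session.optimize(["Scan(t)", "Filter(true)", "Project(a)"])
print(optimized)
```

The point of the pattern is that the custom rule lives outside the engine's code base: nothing in `ToySession` knows about `drop_noop_filters` until the callback registers it.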

References

Tip

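HBase Rowkey Design

Design principles

  • Length: keep the rowkey within 16 bytes
  • Uniqueness: every rowkey must be unique
  • Ordering: rows are stored sorted by ASCII (byte) order
  • Distribution: spread rows more evenly across the nodes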
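
The length and ordering principles above interact: because keys are compared byte by byte, decimal strings of different lengths do not sort numerically, while fixed-width big-endian encodings do. The following is a minimal sketch of that idea. The field names `user_id` and `timestamp_ms` and the helper `build_rowkey` are hypothetical, chosen only for illustration, and the 16-byte limit is the guideline from the list rather than a hard HBase limit.

```python
import struct

MAX_ROWKEY_BYTES = 16  # length guideline from the design principles


def build_rowkey(user_id: int, timestamp_ms: int) -> bytes:
    """Pack two non-negative integers as fixed-width big-endian bytes (8 + 8 = 16)."""
    key = struct.pack(">QQ", user_id, timestamp_ms)
    if len(key) > MAX_ROWKEY_BYTES:
        raise ValueError("rowkey exceeds the length guideline")
    return key


# Byte-wise comparison puts the string "10" before "2" ...
string_order = sorted([b"2", b"10"])

# ... while fixed-width big-endian keys keep numeric order.
binary_order = sorted([build_rowkey(1, 10), build_rowkey(1, 2)])

print(string_order)
print([struct.unpack(">QQ", k) for k in binary_order])
```

Zero-padding decimal strings to a fixed width achieves the same ordering, at the cost of more bytes per key.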

Ways to avoid data hotspots

  • Reversing
  • Salting
  • Hashing
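
The three techniques can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not HBase client code: the function names, the `NN|` prefix format, the bucket count, and the use of MD5 are all assumptions made for the example. Salting here uses a random prefix, and hashing uses a prefix derived from the key itself so that a reader can recompute it.

```python
import hashlib
import random


def reversed_key(key: str) -> bytes:
    """Reversing: flip the key so its fast-changing tail comes first."""
    return key[::-1].encode("utf-8")


def salted_key(key: str, buckets: int = 4, rng=random) -> bytes:
    """Salting: prepend a randomly chosen bucket id (two digits, so buckets <= 100)."""
    salt = rng.randrange(buckets)
    return f"{salt:02d}|{key}".encode("utf-8")


def hashed_key(key: str, buckets: int = 4) -> bytes:
    """Hashing: prepend a bucket id computed from the key, so it is repeatable."""
    bucket = hashlib.md5(key.encode("utf-8")).digest()[0] % buckets
    return f"{bucket:02d}|{key}".encode("utf-8")


if __name__ == "__main__":
    for k in ["20190804", "20190805", "20190806"]:
        print(reversed_key(k), salted_key(k), hashed_key(k))
```

With a random salt, a point lookup has to try every bucket prefix, whereas a hash-derived prefix can be recomputed from the original key. Either prefix also adds bytes, which counts against the length guideline.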

Share

Metadata integration

Reference