ARTS - 2019 Week 8-1¶
20190804~20190810
Algorithm¶
Review¶
How to Extend Apache Spark with Customized Optimizations¶
There are a growing set of optimization mechanisms that allow you to achieve competitive SQL performance. Spark has extension points that help third parties to add customizations and optimizations without needing these optimizations to be merged into Apache Spark. This is very powerful and helps extensibility. We have added some enhancements to the existing extension points framework to enable some fine grained control. This talk will be a deep dive at the extension points that is available in Spark today. We will also talk about the enhancements to this API that we developed to help make this API more powerful. This talk will be of benefit to developers who are looking to customize Spark in their deployments.
【PDF】How to Extend Apache Spark with Customized Optimizations
扩展案例
- 性能优化:参考信息完整性的约束、索引来减少数据扫描范围
- 整合第三方应用
问题与方案
如何将定制开发应用到集群?
- 合并代码到中心仓库
- 修改代码,维护自己分支
- 使用扩展接口
扩展接口(Extension Points API)
- Spark 2.2 in SPARK-18127,SPARK-26249: API Enhancements
- 可插拔、可扩展
- 通过 SparkSessionExtensions,扩展 SparkSession
SparkSessionExtensions
通过钩子机制,注入定制方法
- Spark SQL:解析、分析、优化、策略、函数
- 通过 withExtensions 或
spark.sql.extensions
配置
参考
- https://developer.ibm.com/code/2017/11/30/learn-extension-points-apache-spark-extend-spark-catalyst-optimizer/
- https://rtahboub.github.io/blog/2018/writing-customized-parser/
Tip¶
HBase Rowkey 设计¶
设计原则
- 长度原则:控制在16字节
- 唯一原则
- 排序原则:按照ASCII有序排序
- 散列原则:更均匀的分布在各个节点
避免数据热点方法
- Reversing
- Salting
- Hashing