

ARTS - 2019 Week 8-1¶

20190804~20190810

Algorithm¶

Review¶

How to Extend Apache Spark with Customized Optimizations ¶

There are a growing set of optimization mechanisms that allow you to achieve competitive SQL performance. Spark has extension points that help third parties to add customizations and optimizations without needing these optimizations to be merged into Apache Spark. This is very powerful and helps extensibility. We have added some enhancements to the existing extension points framework to enable some fine grained control. This talk will be a deep dive at the extension points that is available in Spark today. We will also talk about the enhancements to this API that we developed to help make this API more powerful. This talk will be of benefit to developers who are looking to customize Spark in their deployments.

【PDF】How to Extend Apache Spark with Customized Optimizations

扩展案例

性能优化：参考信息完整性的约束、索引来减少数据扫描范围
整合第三方应用

问题与方案

如何将定制开发应用到集群？

合并代码到中心仓库
修改代码，维护自己分支
使用扩展接口

扩展接口（Extension Points API）

Spark 2.2 in SPARK-18127，SPARK-26249: API Enhancements
可插拔、可扩展
通过 SparkSessionExtensions，扩展 SparkSession

SparkSessionExtensions

通过钩子机制，注入定制方法

Spark SQL：解析、分析、优化、策略、函数
通过 withExtensions 或 spark.sql.extensions 配置

参考

Tip¶

HBase Rowkey 设计¶

设计原则

长度原则：控制在16字节
唯一原则
排序原则：按照ASCII有序排序
散列原则：更均匀的分布在各个节点

避免数据热点方法

Reversing
Salting
Hashing

ARTS - 2019 Week 8-1¶

Algorithm¶

Review¶

How to Extend Apache Spark with Customized Optimizations ¶

Tip¶

HBase Rowkey 设计¶

元数据集成 ¶

Reference¶

ARTS - 2019 Week 8-1¶

Algorithm¶

Review¶

How to Extend Apache Spark with Customized Optimizations¶

Tip¶

HBase Rowkey 设计¶

Share¶

元数据集成¶

Reference¶

How to Extend Apache Spark with Customized Optimizations ¶

元数据集成 ¶