

ARTS - 2019 Week 7-1¶

20190707~20190713

Algorithm¶

38. Count and Say

Review¶

Understanding Query Plans and Spark UIs ¶

The common use cases of Spark SQL include ad hoc analysis, logical warehouse, query federation, and ETL processing. Spark SQL also powers the other Spark libraries, including structured streaming for stream processing, MLlib for machine learning, and GraphFrame for graph-parallel computation. For boosting the speed of your Spark applications, you can perform the optimization efforts on the queries prior employing to the production systems. Spark query plans and Spark UIs provide you insight on the performance of your queries. This talk discloses how to read and tune the query plans for enhanced performance. It will also cover the major related features in the recent and upcoming releases of Apache Spark.

【PDF】Understanding Query Plans and Spark UIs

Read Plan¶

Spark UI 或者 Spark History Server 的 SQL 标签页 - Details 按钮

Parsed Plan
Analyzed Plan
Optimized Plan
Physical Plan

Understand and Tune Plans¶

明确查询数据的类型，避免类型隐式转换问题
创建 Spark ORC 模式表（using orc），自动谓词下推
开启嵌套结构裁剪（spark.sql.optimizer.nestedSchemaPruning.enabled=true）
使用非确定 UDF（asNondeterministic），避免不正确的重复调用
使用跨 Session 的缓存

Spark 3.0

Join Hints(BROADCAST/MERGE/SHUFFLE_HASH/SHUFFLE_REPLICATE_NL)

Track Execution¶

一次查询对应多个作业，一个作业对应一个DAG

标签页：Job、Stage、Executor
Job：执行时间、任务数量与状态
Stage：执行时间、异常任务、耗时任务、任务均衡、数据倾斜、任务数量与状态、分区数量、本地性
Executor：存储、任务、GC、Shuffle、线程信息

Delta Lake

通过在 transaction log 中存储元数据信息，替代原用基于 metastore

对应的分区可能非常多 - 不受 Hive Metastore 影响
生成的文件可能非常多 - 元数据与数据分离，数据合并写
新的数据不直接可见，需刷新表（Refresh Table） - 读取正在计算的数据

特性

完全 ACID 事务支持
Schema 管理
数据时间及版本信息
统一批处理与流计算支持
元数据管理能力
支持更新与删除
数据可预期

Tip¶

Hive 常用参数¶

set hive.mapred.mode=strict; -- 使用严格模式，检查可能不合理计算
set hive.exec.parallel=true；-- 开启并发执行
set hive.exec.parallel.thread.number=16; -- 最大并行度
set hive.exec.dynamic.partition=true; -- 开启动态分区
set hive.exec.dynamic.partition.mode=nonstrict; -- 使用非严格模式，允许全部分区字段为动态分区
set hive.exec.max.dynamic.partitions=1000; -- 允许最大创建分区总数
set hive.exec.max.dynamic.partitions.pernode=100; -- 每个Mapper或Reducer节点允许最大创建分区总数
set hive.merge.mapfiles=true; -- 合并mapper作业小文件
set hive.merge.mapredfiles=true; -- 合并reducer作业小文件
set hive.merge.size.per.task=256000000; -- 合并文件大小
set hive.merge.smallfiles.avgsize=64000000; -- 当文件平均大小小于该值，启动Job合并小文件

ARTS - 2019 Week 7-1¶

Algorithm¶

Review¶

Understanding Query Plans and Spark UIs ¶

Read Plan¶

Understand and Tune Plans¶

Track Execution¶

Tip¶

Hive 常用参数¶

漫谈数据质量 ¶

Reference¶

ARTS - 2019 Week 7-1¶

Algorithm¶

Review¶

Understanding Query Plans and Spark UIs¶

Read Plan¶

Understand and Tune Plans¶

Track Execution¶

Tip¶

Hive 常用参数¶

Share¶

漫谈数据质量¶

Reference¶

Understanding Query Plans and Spark UIs ¶

漫谈数据质量 ¶