ARTS - 2019 Week 7-3¶
20190721~20190727
Algorithm¶
Review¶
Efficient Upserts into Data Lakes with Databricks Delta¶
Databricks Delta, the next-generation engine built on top of Apache Spark™, now supports the MERGE command, which allows you to efficiently upsert and delete records in your data lakes. MERGE dramatically simplifies how a number of common data pipelines can be built; all the complicated multi-hop processes that inefficiently rewrote entire partitions can now be replaced by simple MERGE queries. This finer-grained update capability simplifies how you build your big data pipelines for various use cases ranging from change data capture to GDPR.
背景及问题
- 合规要求
- 数据变更
- 数据删除
- 效率
- 质量
支持 MERGE 命令
支持更新及删除数据湖中的记录
- 细粒度:文件粒度,而不是分区粒度
- 高效:过滤无关的数据
- 事务:支持事务,并发读写
Tip¶
Flink 常用参数¶
- jobmanager.heap.size=8g -- JobManager 堆内存大小
- taskmanager.heap.size=4g -- TaskManager 堆内存大小
- parallelism.default=1 -- 作业的默认并行度
- taskmanager.numberOfTaskSlots=1 -- TaskManager 的并行度
- state.checkpoints.dir=hdfs:///checkpoints -- 设置 checkpoints 目录
- state.savepoints.dir=hdfs:///savepoints -- 设置 savepoints 目录
- state.backend=rocks -- 设置状态存储类型
- state.backend.async=true -- 使用异步快照
- state.backend.incremental=true -- 创建增量状态存储