Skip to content

ARTS - 2019 Week 7-3

20190721~20190727

Algorithm

Review

Efficient Upserts into Data Lakes with Databricks Delta

Databricks Delta, the next-generation engine built on top of Apache Spark™, now supports the MERGE command, which allows you to efficiently upsert and delete records in your data lakes. MERGE dramatically simplifies how a number of common data pipelines can be built; all the complicated multi-hop processes that inefficiently rewrote entire partitions can now be replaced by simple MERGE queries. This finer-grained update capability simplifies how you build your big data pipelines for various use cases ranging from change data capture to GDPR.

背景及问题

  • 合规要求
  • 数据变更
  • 数据删除
  • 效率
  • 质量

支持 MERGE 命令

支持更新及删除数据湖中的记录

  • 细粒度:文件粒度,而不是分区粒度
  • 高效:过滤无关的数据
  • 事务:支持事务,并发读写

Tip

  • jobmanager.heap.size=8g -- JobManager 堆内存大小
  • taskmanager.heap.size=4g -- TaskManager 堆内存大小
  • parallelism.default=1 -- 作业的默认并行度
  • taskmanager.numberOfTaskSlots=1 -- TaskManager 的并行度
  • state.checkpoints.dir=hdfs:///checkpoints -- 设置 checkpoints 目录
  • state.savepoints.dir=hdfs:///savepoints -- 设置 savepoints 目录
  • state.backend=rocks -- 设置状态存储类型
  • state.backend.async=true -- 使用异步快照
  • state.backend.incremental=true -- 创建增量状态存储

Share

Reference