Skip to content

ARTS - 2019 Week 7-4

20190728~20190803

Algorithm

Review

Fast and Reliable Apache Spark SQL Engine

Building the next generation Spark SQL engine at speed poses new challenges to both automation and testing. At Databricks, we are implementing a new testing framework for assessing the quality and performance of new developments as they produced. Having more than 1,200 worldwide contributors, Apache Spark follows a rapid pace of development. At this scale, new testing tooling such as random query and data generation, fault injection, longevity stress, and scalability tests are essential to guarantee a reliable and performance Spark later in production. By applying such techniques, we will demonstrate the effectiveness of our testing infrastructure by drilling-down into cases where correctness and performance regressions have been found early. In addition, showing how they have been root-caused and fixed to prevent regressions in production and boosting the continuous delivery of new features.

【PDF】Fast and Reliable Apache Spark SQL Engine

目标

making high quality releases automatic and frequent

问题

  • 每个月数百次提交,错误一定会发生
  • 需要投入非常大的精力进行测试工作
  • 单元测试不足以保证正确性和性能

方案

  • 持续集成
  • 分类与报警

正确性

  • 生成随机查询
  • 生成随机数据
  • 递归查询模型
  • 概率查询类型
  • 查询算子覆盖率分析

性能

  • 基准测试:负载、基准、性能
  • 根源分析:火焰图(慢方法、新方法)

总结

自动化工具保障正确性与性能

Tip

HBase 客户端优化

  • scan 设置合理的缓存:scan.setCaching(1000);
  • get 使用批量请求方式:hTable.get(getList);
  • 请求指定需要的列族或列:addColumn、addFamily
  • 离线批量读取禁止缓存:scan.setBlockCache(false);

Share

Reference