Skip to content

ARTS - 2019 Week 8-2

20190811~20190817

Algorithm

Review

Smart Join Algorithms for Fighting Skew at Scale

Consumer apps like Yelp generate log data at huge scale, and often this is distributed according to a power law, where a small number of users, businesses, locations, or pages are associated with a disproportionately large amount of data. This kind of data skew can cause problems for distributed algorithms, especially joins, where all the rows with the same key must be processed by the same executor. Even just a single over-represented entity can cause a whole job to slow down or fail. One approach to this problem is to remove outliers before joining, and this might be fine when training a machine learning model, but sometimes you need to retain all the data. Thankfully, there are a few tricks you can use to counteract the negative effects of skew while joining, by artificially redistributing data across more machines. This talk will walk through some of them, with code examples.

【PDF】Smart Join Algorithms for Fighting Skew at Scale

数据倾斜

  • 数据分布

    • 正太分布
    • 幂律分布
  • 异常值

产生问题

通过分位数诊断:执行时间、数据大小、数据量、spill、gc

  • 热点分片、读写
  • 数据加载
  • Join
  • Shuffle

Spark Joins

  • Shuffled hash join
  • Broadcast join

解决方案

  • 倾斜数据 key 增加随机因子,关联数据扩张对应倍数
  • 改进:只处理频繁项对应的 key

自动方案

  • 预估频繁项,自动调整扩张比例
  • 两边都倾斜,分别生成随机 key 和扩张 key

Tip

Redis 数据类型

STRING

  • 字符串、整数、浮点数
  • GET、SET、DEL

LIST

  • 链表
  • RPUSH、LRANGE、LINDEX、LPOP

SET

  • 无序无重复集合
  • SADD、SMEMBERS、SISMEMBER、SREM

HASH

  • 无序散列表
  • HSET、HGET、HGETALL、HDEL

ZSET

  • 有序无重复集合
  • ZADD、ZRANGE、ZRANGEBYSCORE、ZREM

Shadowsocks 加速 Github

配置代理

# Windows
git config --global http.proxy 'socks5://127.0.0.1:1080'
git config --global https.proxy 'socks5://127.0.0.1:1080'
# MacOS
git config --global http.proxy 'socks5://127.0.0.1:1086'
git config --global https.proxy 'socks5://127.0.0.1:1086'

取消代理

git config --global --unset http.proxy
git config --global --unset https.proxy

Share

元数据仓库

Reference