ARTS - 2019 Week 8-2¶
20190811~20190817
Algorithm¶
Review¶
Smart Join Algorithms for Fighting Skew at Scale¶
Consumer apps like Yelp generate log data at huge scale, and often this is distributed according to a power law, where a small number of users, businesses, locations, or pages are associated with a disproportionately large amount of data. This kind of data skew can cause problems for distributed algorithms, especially joins, where all the rows with the same key must be processed by the same executor. Even just a single over-represented entity can cause a whole job to slow down or fail. One approach to this problem is to remove outliers before joining, and this might be fine when training a machine learning model, but sometimes you need to retain all the data. Thankfully, there are a few tricks you can use to counteract the negative effects of skew while joining, by artificially redistributing data across more machines. This talk will walk through some of them, with code examples.
【PDF】Smart Join Algorithms for Fighting Skew at Scale
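Data skew
- Data distribution
  - Normal distribution
  - Power-law distribution
- Outliers
Problems it causes
Diagnose with quantiles of: execution time, data size, record count, spill, GC
- Hot shards, hot reads and writes
- Data loading
- Join
- Shuffle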
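The quantile-based diagnosis noted above can be sketched in a few lines. This is a minimal illustration, not something taken from the talk: the `skew_report` helper and the sample durations are hypothetical, and the same idea applies to any per-task metric (data size, record count, spill, GC time).

```python
import statistics

def skew_report(values):
    """Summarise a per-task metric (duration, bytes read, record count, ...) by quantiles."""
    ordered = sorted(values)
    # quantiles(n=100) returns 99 cut points; index 49 is the median, index 94 is p95.
    cuts = statistics.quantiles(ordered, n=100, method="inclusive")
    median, p95 = cuts[49], cuts[94]
    return {
        "min": ordered[0],
        "median": median,
        "p95": p95,
        "max": ordered[-1],
        # A max far above the median suggests that one task holds a hot key.
        "max_over_median": ordered[-1] / median if median else float("inf"),
    }

# Hypothetical task durations in seconds: 99 ordinary tasks and one straggler.
durations = [10.0] * 99 + [600.0]
report = skew_report(durations)
print(report)
```

When the median and p95 sit close together but the maximum is far away, the job is most likely waiting on a handful of skewed tasks rather than being slow across the board.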
Spark Joins
- Shuffled hash join
- Broadcast join
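To make the difference between the two strategies concrete, here is a pure-Python sketch (not Spark code). It assumes rows are `(key, value)` tuples; `partition_of`, `shuffled_hash_join`, `broadcast_join` and the sample data are hypothetical names chosen for illustration.

```python
import zlib
from collections import defaultdict

def partition_of(key, num_partitions):
    # crc32 gives a hash that is stable across runs (unlike built-in hash() for str).
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions

def shuffled_hash_join(left, right, num_partitions):
    """Route both sides by hash(key), then hash-join inside each partition."""
    left_parts = defaultdict(list)
    right_parts = defaultdict(list)
    for key, value in left:
        left_parts[partition_of(key, num_partitions)].append((key, value))
    for key, value in right:
        right_parts[partition_of(key, num_partitions)].append((key, value))

    joined = []
    rows_per_partition = {}
    for p in range(num_partitions):
        rows_per_partition[p] = len(left_parts[p])
        lookup = defaultdict(list)
        for key, value in right_parts[p]:
            lookup[key].append(value)
        for key, lvalue in left_parts[p]:
            for rvalue in lookup.get(key, []):
                joined.append((key, lvalue, rvalue))
    return joined, rows_per_partition

def broadcast_join(large, small):
    """Copy the small side to every worker as a lookup table; the large side never moves."""
    lookup = defaultdict(list)
    for key, value in small:
        lookup[key].append(value)
    return [(k, lv, rv) for k, lv in large for rv in lookup.get(k, [])]

# Hypothetical data: one hot key owns 90 of the 100 event rows.
events = [("hot", i) for i in range(90)] + [(f"user{i}", i) for i in range(10)]
users = [("hot", "H")] + [(f"user{i}", f"U{i}") for i in range(10)]

shuffled, rows_per_partition = shuffled_hash_join(events, users, num_partitions=4)
broadcast = broadcast_join(events, users)
print(rows_per_partition)
```

Both functions return the same rows, but in the shuffled version every row of the hot key lands in a single partition, which is exactly where skew hurts. The broadcast version avoids the shuffle entirely, at the cost of requiring the small side to fit in memory on every worker.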
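Solutions
- Add a random factor (salt) to the key on the skewed side, and replicate the rows on the other side by the same factor
- Improvement: apply this only to the keys of frequent items (hot keys)
Automatic approach
- Estimate the frequent items and adjust the replication factor automatically
- If both sides are skewed, generate a random key and a replicated key for each side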
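The salting idea above can be sketched in pure Python as follows. This is an illustration under assumptions, not the code from the talk: `hash_join`, `salted_join`, `replication`, and `hot_threshold` are hypothetical names, hot keys are found with an exact count (a real job would more likely estimate them from a sample), and only the one-sided skew case is covered.

```python
import random
from collections import Counter, defaultdict

def hash_join(left, right):
    """Plain in-memory equi-join on the first tuple element."""
    lookup = defaultdict(list)
    for key, value in right:
        lookup[key].append(value)
    return [(k, lv, rv) for k, lv in left for rv in lookup.get(k, [])]

def salted_join(skewed, other, replication=4, hot_threshold=10, seed=0):
    """Join on (key, salt) so rows of one hot key spread over `replication` sub-keys."""
    rng = random.Random(seed)
    counts = Counter(key for key, _ in skewed)
    # Improvement from the notes: only frequent keys are salted and replicated.
    hot_keys = {key for key, n in counts.items() if n >= hot_threshold}

    # Skewed side: attach a random salt to hot keys, salt 0 to everything else.
    salted_left = [
        ((key, rng.randrange(replication) if key in hot_keys else 0), value)
        for key, value in skewed
    ]
    # Other side: replicate hot-key rows once per salt value so every salted row finds a match.
    salted_right = []
    for key, value in other:
        salts = range(replication) if key in hot_keys else (0,)
        salted_right.extend(((key, s), value) for s in salts)

    joined = hash_join(salted_left, salted_right)
    # Drop the salt again; the rows are the same as those of an unsalted join.
    result = [(key, lv, rv) for (key, _salt), lv, rv in joined]
    return result, Counter(k for k, _ in salted_left)

# Hypothetical data: one hot key owns 90 of the 100 event rows.
events = [("hot", i) for i in range(90)] + [(f"user{i}", i) for i in range(10)]
users = [("hot", "H")] + [(f"user{i}", f"U{i}") for i in range(10)]

result, sub_key_sizes = salted_join(events, users)
print(max(sub_key_sizes.values()))  # largest group after salting, below the original 90
```

The join output is unchanged, but the 90 rows of the hot key are now split across up to `replication` salted sub-keys, so a hash partitioner can send them to different executors. The price is that the hot-key rows on the other side are duplicated `replication` times, which is why restricting the trick to frequent keys matters.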
Tip¶
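Redis data types¶
STRING
- Strings, integers, floating-point numbers
- GET, SET, DEL
LIST
- Linked list
- RPUSH, LRANGE, LINDEX, LPOP
SET
- Unordered collection of unique members
- SADD, SMEMBERS, SISMEMBER, SREM
HASH
- Unordered hash table
- HSET, HGET, HGETALL, HDEL
ZSET
- Ordered collection of unique members
- ZADD, ZRANGE, ZRANGEBYSCORE, ZREM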
Speeding up GitHub with Shadowsocks¶
Configure the proxy
# Windows
git config --global http.proxy 'socks5://127.0.0.1:1080'
git config --global https.proxy 'socks5://127.0.0.1:1080'
# MacOS
git config --global http.proxy 'socks5://127.0.0.1:1086'
git config --global https.proxy 'socks5://127.0.0.1:1086'
Remove the proxy
git config --global --unset http.proxy
git config --global --unset https.proxy