Data Management
Griffin
Apache Griffin is a data quality service platform built on Apache Hadoop and Apache Spark. It provides a framework for defining data quality models, executing data quality measurements, automating data profiling and validation, and presenting a unified data quality view across multiple data systems. It aims to address data quality challenges in both batch and streaming contexts.
Measure Model
- Accuracy - Does the data reflect real-world objects or a verifiable source?
- Profiling - Apply statistical analysis and assessment of data values within a dataset for consistency, uniqueness, and logic.
- Completeness - Is all necessary data present?
- Timeliness - Is the data available at the time it is needed?
- Anomaly detection - Pre-built algorithm functions that identify items, events, or observations which do not conform to an expected pattern or to other items in a dataset.
- Validity - Are all data values within the data domains specified by the business?
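To make the first two measures concrete, here is a minimal sketch of accuracy and completeness checks over plain Python records. The record fields, the trusted `source` mapping, and the helper names are illustrative assumptions, not part of Griffin's actual API; Griffin itself evaluates such rules on Spark.

```python
# Illustrative data quality measures in the spirit of Griffin's model.
# The record schema and the "source" reference data are hypothetical.

def accuracy(records, source):
    """Fraction of records whose value matches a verifiable source."""
    matched = sum(1 for r in records if source.get(r["id"]) == r["value"])
    return matched / len(records) if records else 1.0

def completeness(records, required_fields):
    """Fraction of records with all required fields present and non-null."""
    complete = sum(
        1 for r in records
        if all(r.get(f) is not None for f in required_fields)
    )
    return complete / len(records) if records else 1.0

records = [
    {"id": 1, "value": "a", "ts": "2024-01-01"},
    {"id": 2, "value": "b", "ts": None},
    {"id": 3, "value": "x", "ts": "2024-01-03"},
]
source = {1: "a", 2: "b", 3: "c"}  # trusted reference data

print(accuracy(records, source))               # 2 of 3 values match
print(completeness(records, ["value", "ts"]))  # 2 of 3 have all fields
```

In a real deployment these ratios would be computed per batch or window and compared against configured thresholds to raise alerts.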
Data Integration
Gobblin: A distributed data integration framework that simplifies ingestion and replication of large datasets for both streaming and batch data ecosystems.
NiFi: An easy-to-use, powerful, and reliable system to process and distribute data.
Distributed Storage
Alluxio: Open Source Memory Speed Virtual Distributed Storage
Cassandra: Manage massive amounts of data, fast, without losing sleep
HBase: The Hadoop database, a distributed, scalable, big data store.
HDFS: A distributed file system that provides high-throughput access to application data.
Distributed Computing
Flink: A framework and distributed processing engine for stateful computations over unbounded and bounded data streams.
MapReduce: A YARN-based system for parallel processing of large data sets.
Spark: A fast and general engine for large-scale data processing.
Storm: A distributed realtime computation system for reliably processing unbounded streams of data.
Tez: A framework for building high-performance batch and interactive data processing applications on YARN.
Interactive Analytics
Drill: Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud Storage
Druid: A high-performance, column-oriented, distributed data store.
Hive: Data warehouse software that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL.