Projects

Data Management

Griffin

Apache Griffin is a data quality service platform built on Apache Hadoop and Apache Spark. It provides a framework for defining data quality models, executing data quality measurements, automating data profiling and validation, and visualizing data quality in a unified way across multiple data systems. It aims to address data quality challenges in big data and streaming contexts.

Measure Model
  • Accuracy - Does the data reflect real-world objects or a verifiable source?
  • Profiling - Statistical analysis and assessment of the data values within a dataset for consistency, uniqueness and logic
  • Completeness - Is all necessary data present?
  • Timeliness - Is the data available at the time it is needed?
  • Anomaly detection - Pre-built algorithm functions that identify items, events or observations which do not conform to an expected pattern or to other items in a dataset
  • Validity - Are all data values within the data domains specified by the business?
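
To make three of these dimensions concrete, here is a minimal sketch in plain Python. It is not Griffin's API or its measure configuration. The function names (accuracy, completeness, validity), the record layout, and the sample data are all hypothetical and chosen only for illustration. Accuracy is assumed here to mean the fraction of source records that have an exact match in a trusted target dataset.

```python
from typing import Any, Collection, Mapping, Sequence

Record = Mapping[str, Any]


def accuracy(source: Sequence[Record], target: Sequence[Record],
             keys: Sequence[str]) -> float:
    """Fraction of source records with an exact match in target on `keys`."""
    if not source:
        return 1.0
    # Index the target by the compared fields for O(1) lookups.
    target_index = {tuple(r.get(k) for k in keys) for r in target}
    matched = sum(
        1 for r in source if tuple(r.get(k) for k in keys) in target_index
    )
    return matched / len(source)


def completeness(records: Sequence[Record], field: str) -> float:
    """Fraction of records in which `field` is present and not None."""
    if not records:
        return 1.0
    present = sum(1 for r in records if r.get(field) is not None)
    return present / len(records)


def validity(records: Sequence[Record], field: str,
             domain: Collection[Any]) -> float:
    """Fraction of records whose `field` value lies in the allowed domain."""
    if not records:
        return 1.0
    valid = sum(1 for r in records if r.get(field) in domain)
    return valid / len(records)


# Hypothetical sample data: `target` plays the role of the verifiable source.
source = [
    {"id": 1, "country": "US", "age": 34},
    {"id": 2, "country": "DE", "age": None},
    {"id": 3, "country": "XX", "age": 51},
    {"id": 4, "country": "FR", "age": 28},
]
target = [
    {"id": 1, "country": "US", "age": 34},
    {"id": 2, "country": "DE", "age": None},
    {"id": 4, "country": "FR", "age": 29},
]

acc = accuracy(source, target, ["id", "country", "age"])
comp = completeness(source, "age")
val = validity(source, "country", {"US", "DE", "FR"})

print(f"accuracy={acc:.2f} completeness={comp:.2f} validity={val:.2f}")
```

In this sample, records 3 and 4 have no exact counterpart in the target, one record is missing its age, and one country code falls outside the allowed set. On real workloads these ratios would typically be computed as distributed jobs (for example on Spark) rather than over in-memory lists.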

Data Integration

Gobblin
NiFi

Distributed Storage

Alluxio: Open Source Memory Speed Virtual Distributed Storage
Cassandra: Manage massive amounts of data, fast, without losing sleep
HBase: The Hadoop database, a distributed, scalable, big data store.
HDFS: A distributed file system that provides high-throughput access to application data.

Distributed Computing

MapReduce: A YARN-based system for parallel processing of large data sets.
Spark: A fast and general engine for large-scale data processing.
Storm
Tez

Interactive Analytics

Drill: Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud Storage
Druid: A high-performance, column-oriented, distributed data store.
Hive
Impala
Kylin
Phoenix: OLTP and operational analytics for Apache Hadoop
Presto

Scheduling & Workflow

Airflow
Azkaban
Oozie

Machine Learning

Hivemall: A scalable machine learning library for Apache Hive, Apache Spark, and Apache Pig.

Deep Learning

Keras
TensorFlow: An open-source software library for Machine Intelligence
TFLearn: Deep learning library featuring a higher-level API for TensorFlow.
TensorLayer

Virtualization

KVM
OpenVZ: Virtuozzo Containers
Xen