Data Management
Griffin
Apache Griffin is a data quality service platform built on Apache Hadoop and Apache Spark. It provides a framework for defining data quality models, executing data quality measurements, automating data profiling and validation, and presenting a unified data quality view across multiple data systems. It aims to address data quality challenges in both batch and streaming contexts.
Measure Model
- Accuracy - Does the data reflect real-world objects or a verifiable source?
- Profiling - Apply statistical analysis and assessment of data values within a dataset for consistency, uniqueness, and logic.
- Completeness - Is all necessary data present?
- Timeliness - Is the data available at the time it is needed?
- Anomaly detection - Pre-built algorithm functions that identify items, events, or observations which do not conform to an expected pattern or to other items in a dataset.
- Validity - Are all data values within the data domains specified by the business?
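To make the first two measures concrete, here is a minimal sketch of accuracy and completeness checks over plain Python records. The record fields, the trusted `source` mapping, and the helper names are illustrative assumptions, not part of Griffin's actual API; Griffin itself evaluates such rules on Spark.

```python
# Illustrative data quality measures in the spirit of Griffin's model.
# The record schema and the "source" reference data are hypothetical.

def accuracy(records, source):
    """Fraction of records whose value matches a verifiable source."""
    matched = sum(1 for r in records if source.get(r["id"]) == r["value"])
    return matched / len(records) if records else 1.0

def completeness(records, required_fields):
    """Fraction of records with all required fields present and non-null."""
    complete = sum(
        1 for r in records
        if all(r.get(f) is not None for f in required_fields)
    )
    return complete / len(records) if records else 1.0

records = [
    {"id": 1, "value": "a", "ts": "2024-01-01"},
    {"id": 2, "value": "b", "ts": None},
    {"id": 3, "value": "x", "ts": "2024-01-03"},
]
source = {1: "a", 2: "b", 3: "c"}  # trusted reference data

print(accuracy(records, source))               # 2 of 3 values match
print(completeness(records, ["value", "ts"]))  # 2 of 3 have all fields
```

In a real deployment these ratios would be computed per batch or window and compared against configured thresholds to raise alerts.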
Data Integration
Gobblin: A distributed data integration framework that simplifies ingestion and replication of large datasets for both streaming and batch data ecosystems.
NiFi: An easy-to-use, powerful, and reliable system to process and distribute data.
Distributed Storage
Alluxio: Open Source Memory Speed Virtual Distributed Storage
Cassandra: Manage massive amounts of data, fast, without losing sleep
HBase: The Hadoop database, a distributed, scalable, big data store.
HDFS: A distributed file system that provides high-throughput access to application data.
Distributed Computing
Flink: A framework and distributed processing engine for stateful computations over unbounded and bounded data streams.
MapReduce: A YARN-based system for parallel processing of large data sets.
Spark: A fast and general engine for large-scale data processing.
Storm: A distributed realtime computation system for reliably processing unbounded streams of data.
Tez: A framework for building high-performance batch and interactive data processing applications on YARN.
Interactive Analytics
Drill: Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud Storage
Druid: A high-performance, column-oriented, distributed data store.
Hive: Data warehouse software that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL.