Projects & Articles
Apache Gobblin: Distributed Data Integration Framework
A distributed data integration framework that simplifies common aspects of big data integration, such as data ingestion, replication, organization, and lifecycle management, for both streaming and batch data ecosystems.
Apache Atlas: Data Governance and Metadata Framework for Hadoop
Atlas is a scalable and extensible set of core foundational governance services that enables enterprises to meet their compliance requirements within Hadoop effectively and efficiently, and to integrate with the broader enterprise data ecosystem.
Apache Atlas provides open metadata management and governance capabilities so organizations can build a catalog of their data assets, classify and govern those assets, and collaborate around them across data scientists, analysts, and the data governance team.
Key capabilities include:
- Metadata types & instances
- Security & Data Masking
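To illustrate how metadata types and instances relate, here is a minimal sketch of building an Atlas v2-style entity payload. It assumes the Atlas v2 REST endpoints (`/api/atlas/v2/types/typedefs` for type definitions, `/api/atlas/v2/entity` for instances); the `example_dataset` type, its attributes, and the `PII` classification are hypothetical names chosen for illustration, not part of Atlas itself.

```python
import json

def build_dataset_entity(qualified_name, owner, classifications):
    """Build an Atlas v2-style entity payload for a hypothetical dataset type.

    In a real deployment the type ("example_dataset" here) would first be
    registered via POST /api/atlas/v2/types/typedefs.
    """
    return {
        "entity": {
            "typeName": "example_dataset",        # hypothetical type name
            "attributes": {
                "qualifiedName": qualified_name,  # unique name within the catalog
                "name": qualified_name.split("@")[0],
                "owner": owner,
            },
            # Classifications (tags) are what governance policies such as
            # data masking and access control hook into.
            "classifications": [{"typeName": c} for c in classifications],
        }
    }

payload = build_dataset_entity("sales.orders@prod", "data-eng", ["PII"])
print(json.dumps(payload, indent=2))
# On a running Atlas server this payload would be POSTed to /api/atlas/v2/entity.
```

Tagging the instance with a classification like `PII` is what ties the metadata catalog to the security and data-masking capabilities listed above.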
Apache Griffin: Data Quality Service for Big Data
Apache Griffin is a Data Quality Service platform built on Apache Hadoop and Apache Spark. It provides a framework for defining a data quality model, executing data quality measurements, and automating data profiling and validation, along with unified data quality visualization across multiple data systems. It addresses data quality challenges in big data and streaming contexts.
- Accuracy - Does the data reflect the real-world objects or a verifiable source?
- Profiling - Apply statistical analysis and assessment of data values within a dataset for consistency, uniqueness, and logic.
- Completeness - Is all necessary data present?
- Timeliness - Is the data available at the time needed?
- Anomaly detection - Pre-built algorithmic functions for identifying items, events, or observations that do not conform to an expected pattern or to other items in a dataset.
- Validity - Are all data values within the data domains specified by the business?
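The accuracy measure above can be sketched in a few lines: the fraction of target records that match a record in a trusted source. This is a minimal illustration only; Griffin itself evaluates comparable rules on Spark at scale, and the record layout and `order_id` key here are hypothetical.

```python
def accuracy(source, target, key):
    """Return (matched, total, ratio) for target records found in source by key.

    Accuracy here means: how many target records can be verified against
    the source-of-truth dataset.
    """
    source_keys = {row[key] for row in source}
    matched = sum(1 for row in target if row[key] in source_keys)
    total = len(target)
    return matched, total, (matched / total if total else 1.0)

# Hypothetical source-of-truth and downstream datasets.
source = [{"order_id": 1}, {"order_id": 2}, {"order_id": 3}]
target = [{"order_id": 1}, {"order_id": 2}, {"order_id": 9}]  # one unmatched row

matched, total, ratio = accuracy(source, target, "order_id")
print(f"accuracy: {matched}/{total} = {ratio:.2%}")  # → accuracy: 2/3 = 66.67%
```

A completeness check follows the same shape (count records with all required fields present), which is why a single measurement framework can cover several of the dimensions listed above.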