This version of the notes is based on Spark SQL 2.x and draws on open-source projects related to mainstream distributed SQL compute engines.

Spark SQL

  • Spark Core (RDD APIs), Data Source Connectors
  • Catalyst Optimization, Tungsten Execution
  • SparkSession, Dataset/DataFrame APIs, SQL
  • Structured Streaming, MLlib, GraphFrame, TensorFrames


  • Spark SQL: Spark SQL is Apache Spark's module for working with structured data.
  • Hive: The Apache Hive™ data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Structure can be projected onto data already in storage. A command line tool and JDBC driver are provided to connect users to Hive.
  • Presto: A distributed SQL query engine for big data.