Web applications and services today collect, store, and analyze an immense amount of data. The sheer size of this data has fundamentally changed the bottlenecks in systems for big data analytics. In particular, memory bandwidth and CPU performance continue to grow at a much faster rate than the bandwidth between the CPU and slower storage devices (SSDs, disks, etc.). The result is an I/O bottleneck that is getting worse, and will continue to do so!
A fundamental approach to alleviating the I/O bottleneck is to use data compression. Traditional compression techniques have led to significant gains in storage costs, energy costs, and performance for a wide variety of batch processing jobs. They have also been used to reduce I/O bottlenecks in columnar stores, with significant performance improvements for OLAP workloads that typically require scanning the entire dataset (see Daniel Abadi’s thesis, this paper, and the references within).
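To make the scan-on-compressed-data idea concrete, here is a minimal sketch (our own illustration, not code from the systems cited above) of evaluating a predicate directly on a run-length-encoded (RLE) column, a classic columnar-store technique: the scan inspects each (value, run-length) pair exactly once, so its cost tracks the compressed size rather than the raw row count.

```scala
// Count rows matching a predicate directly on an RLE-compressed column,
// without materializing the raw rows. Each run contributes its full
// length if its value satisfies the predicate.
def countMatches(rleColumn: Seq[(String, Int)], predicate: String => Boolean): Long =
  rleColumn.iterator.collect { case (value, runLength) if predicate(value) => runLength.toLong }.sum

// Example: a 1M-row column compressed down to three runs.
val column = Seq(("US", 600000), ("EU", 300000), ("APAC", 100000))
countMatches(column, _ == "EU") // => 300000
```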
However, the aforementioned compression and query execution techniques are unsuitable for a wide variety of workloads that do not necessarily require data scans (e.g., point queries). One example is search, a fundamental primitive supported by many web applications and services: Facebook search, Twitter search, LinkedIn search, airline and hotel search, and services that are built specifically around search (Google, Bing, and Yelp, to name a few). Another example is random access, as typically performed via the get() interface in key-value stores, NoSQL stores, document stores, etc. Queries in such workloads are often short-lived (ideally, sub-millisecond), and data scans and/or decompression lead to significant performance degradation. Given the large number of applications that run such workloads, we at AMPLab decided to take a stab at this problem and asked the following fundamental question:
Is it possible to execute point queries (e.g., search and random access) directly on compressed data without performing data scans?
Exploring the above question led to the Succinct project! At a high level, Succinct enables a wide range of queries, including search, range, and wildcard queries over arbitrary strings, as well as random access into the input data, all directly on a compressed representation of the input. What differentiates Succinct from previous systems that support point queries is that it answers these queries without storing any indexes, without data scans, and without data decompression: all the required information is embedded within the compressed representation, and queries execute directly on it.
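Since the code is not out yet, here is a hypothetical sketch of the kind of interface such a system exposes; the trait and method names below are our illustrative assumptions, not the actual Succinct API.

```scala
// Hypothetical interface sketch; names are illustrative assumptions,
// not the released Succinct API.
trait CompressedStore {
  // Offsets of all occurrences of `query` in the input, found without
  // decompressing or scanning the raw data.
  def search(query: Array[Byte]): Seq[Long]

  // Random access: extract `length` bytes starting at `offset`,
  // directly from the compressed representation.
  def extract(offset: Long, length: Int): Array[Byte]

  // Number of occurrences of `query`.
  def count(query: Array[Byte]): Long
}
```

The key point is what is absent: no separate index structure sits alongside the data; the compressed representation itself answers search, count, and extract.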
On real-world and benchmark datasets, Succinct can execute sub-millisecond search queries while keeping as much as an order of magnitude more input data in faster storage than state-of-the-art systems that provide similar functionality using indexes. For example, on a server with 128GB of RAM, Succinct can serve as much as 163-250GB of raw data, depending on the dataset, while executing search queries within a millisecond; that is, the compressed representation, together with everything needed to execute queries, is smaller than the raw input. Thus, Succinct can answer queries from faster storage for a much larger range of input sizes, leading to lower query latency than existing systems.
Over the next couple of weeks, we will be providing much more information on Succinct: the techniques, the tradeoffs, and benchmark results over several real-world applications! We are also very excited about the upcoming open-source release of Succinct on Apache Spark, which will make point queries extremely efficient on Apache Spark. Stay tuned!
Finally, I will also be writing a lot more about several very exciting follow-up projects on Succinct that we have been working on at AMPLab.
Related link: Succinct