This page was last modified on 25 June 2016, at 20:14.
|Written in||C++, Java|
|Type||Database, Business Intelligence, Data Warehouse|
LucidDB is an open source database purpose-built to power data warehouses, OLAP servers and business intelligence systems. LucidDB is the first and only open-source RDBMS purpose-built entirely for data warehousing and business intelligence. It is based on architectural cornerstones such as column-store, bitmap indexing, hash join/aggregation, and page-level multiversioning. Most database systems (both proprietary and open-source) start life with a focus on transaction processing capabilities, then get analytical capabilities bolted on as an afterthought (if at all). By contrast, every component of LucidDB was designed with the requirements of flexible, high-performance data integration and sophisticated query processing in mind. Moreover, comprehensiveness within the focused scope of its architecture means simplicity for the user: no DBA required.
Column Store and Bitmap Indexes
Recent DBMS research has established the superiority of a "column-store" architecture for read-mostly database systems such as LucidDB. In LucidDB, database tables are vertically partitioned and stored in a highly compressed form. Vertical partitioning means that each page on disk stores values from only one column rather than entire rows; as a result, compression algorithms are much more effective because they can operate on homogeneous value domains, often with only a few distinct values. For example, a column storing the state component of a US address only has 50 possible values, so each value can be stored using only 6 bits instead of the 2-byte character strings used in a traditional uncompressed representation.
Vertical partitioning also means that a query that only accesses a subset of the columns of the referenced tables can avoid reading the other columns entirely. The net effect of vertical partitioning is greatly improved performance due to reduced disk I/O and more effective caching (data compression allows a greater logical dataset size to fit into a given amount of physical memory). Compression also allows disk storage to be used more effectively (e.g. for maintaining more indexes).
The companion to column store is bitmap indexing, which has well-known advantages for data warehousing. LucidDB's bitmap index implementation takes advantage of column store features; for example, bitmaps are built directly off of the compressed row representation, and are themselves stored compressed, reducing load time significantly. And at query time, they can be rapidly intersected to identify the exact portion of the table which contributes to query results. All access paths support asynchronous I/O with intelligent prefetch for optimal use of disk bandwidth.
It should be noted that LucidDB is not suitable for use as a transactional database. LucidDB is very fast at bulk-loading or updating large amounts of data at once, but it is not intended to work well for the single-row operations typical of transactional systems. Best practice is to separate analytical systems from transactional systems; LucidDB can be used as a data warehouse, data mart, or operational data store in tandem with the traditional transactional systems used as data sources.
Although LucidDB is primarily intended as a read-only data warehouse, write operations are required for loading data into the warehouse. To allow reads to continue during data loads and updates, LucidDB uses page versioning. Data pages are read based on a snapshot of the data at the start of the initiating transaction. When a page needs to be updated, a new version of the page is created and chained from the original page. Each subsequent write transaction will create a new version of the page and add it to the existing page chain. Therefore, long-running, read-only transactions can continue to read older snapshots while newer transactions will read more up-to-date snapshots. Pages that are no longer in use can be reclaimed so the page chains don't grow forever. By versioning at the page level, the overhead of locating the desired data snapshot is minimized. LucidDB is able to quickly identify the appropriate page by walking through the page chain metadata. There's no need to manipulate data at the row-level, or to reconstruct a snapshot based on previously logged information.
Query Optimization and Execution
LucidDB's optimizer is designed with the assumptions of a data warehousing environment in mind, so no hints are needed to get it to choose the best plan for typical analytical query patterns. In particular, it supports star joins based on bitmap indexes via a well-known semijoin technique, meaning that usually the only fact table rows accessed are those actually needed to compute query results. The optimizer uses a mix of heuristics and cost-based analysis to achieve hint-free planning. In particular, cost-based analysis is used to determine the order in which joins are executed, as well as which bitmap indexes to use when applying table-level filters and star join optimizations. The analysis uses data statistics gathered and stored as metadata in the system catalogs, allowing the optimizer to realistically compare one option versus another even when many joins are involved. By using cost-based analysis in a targeted fashion for these complex areas, LucidDB is able to consider a large space of viable candidates for join and indexing combinations. By using heuristics in other areas, LucidDB keeps optimization time to a minimum by avoiding an explosion in the search space.
The query executor includes a high-performance sorter, hash join, and hash aggregation. All of these execution algorithms are optimized to take best advantage of available memory. A built-in resource governor assigns memory, balancing between availability and the needs of various operations. The resource governor uses statistics-based optimizer estimates for predicting which algorithms need the most memory to perform well, and distributes available memory accordingly. The execution algorithms are also capable of multi-level disk-based partitioning for handling even the largest data sets. The hash-based algorithms include adaptive skew-resistance and pre-filtering. The only limit on the data size that can be processed is available temp disk space.
- Column-store tables
- Intelligent indexing
- Page-level multi-versioning
- Hot+incremental backup
- Warehouse labels
- Star join optimization
- Cost-based join ordering and index selection
- Hash join/aggregation
- Intelligent prefetch
- INSERT/UPSERT as bulk load
- SQL/MED architecture
- JDBC foreign data wrapper
- Flat file foreign data wrapper
- SQL/JRT architecture
- User-defined functions
- User-defined transformations
- JDBC, HTTP
Cite error: Invalid
parameter "group" is allowed only.
<references />, or
<references group="..." />