C-Store: A Columnar Database
C-Store was one of the earlier database designs that led to a progression of databases centered on column-oriented storage. Before it, the common design pattern was to store database records as contiguous rows. C-Store focuses on read-optimized workloads, such as data warehouses, while also supporting infrequent writes. A large class of applications, such as data warehouses and CRMs, load data once and then run many interactive queries on it; such workloads benefit from being read-optimized. With a column-oriented database, one can bring only the required attributes/columns into memory and run the analysis on them. If the same data were stored row-oriented, many unnecessary attributes would be accessed or brought into memory. A further advantage of column-oriented storage is that all values of a column share the same data type and can therefore be encoded for efficient storage and retrieval (for row-oriented storage this becomes tricky).

C-Store stores a group of columns, sorted on some key, in physically contiguous space; such a group is called a projection. In addition, C-Store assumes a distributed architecture in which different projections may land on different nodes in the system. C-Store is k-safe: the database remains accessible and its data recoverable/queryable even if k nodes in the system fail. Since the same column may belong to multiple projections, and those projections may land on different nodes, there is natural redundancy in the system that helps in case of node failures.
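To make the read-optimization argument concrete, here is a minimal sketch (with made-up data, not C-Store's actual format) contrasting row-oriented and column-oriented layouts of the same table:

```python
# Row-oriented layout: each record is stored contiguously.
rows = [
    ("alice", 34, 90000, "eng"),
    ("bob",   41, 85000, "sales"),
    ("carol", 29, 95000, "eng"),
]

# Column-oriented layout: each attribute is stored contiguously.
columns = {
    "name":   ["alice", "bob", "carol"],
    "age":    [34, 41, 29],
    "salary": [90000, 85000, 95000],
    "dept":   ["eng", "sales", "eng"],
}

# An analytic query such as "average salary" touches only one column here,
# instead of scanning every field of every row.
avg_salary = sum(columns["salary"]) / len(columns["salary"])
print(avg_salary)  # 90000.0
```

In the row layout, the same query would have to read every tuple in full just to extract one field per row.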
To support read-optimized workloads and avoid concurrency/locking issues with writes, C-Store uses two logical stores: a Writeable Store (WS) and a Read-optimized Store (RS). All writes go to WS, and a tuple-mover component periodically moves the changed records to RS using a merge-out process. To support lock-free access to RS, queries run against a snapshot of RS (as of the very recent past) without taking any locks; this mechanism is called snapshot isolation.
C-Store supports the traditional relational model: tables comprising columns as attributes, with the usual notions of primary and foreign keys. Most row stores physically store data in tabular format, with indices used to speed up access. C-Store instead physically stores only column groups (projections) in a contiguous manner. A projection is anchored on a given table: it contains one or more columns from that anchor table, and may also include columns from other tables, as long as those tables can be reached from the anchor via foreign keys. For example, given two tables EMP(name, age, salary, dept) and DEPT(dname, floor), a projection EMP1 could be:
EMP1: (name, dept, DEPT.dname | name)
For the projection EMP1, each column is stored separately. The order of values in the projection depends on the sort key, indicated here by the column name after the '|' character. Every projection is divided into one or more segments, and each segment is given a segment identifier (SID). Segmentation is based on the sort key, so each segment covers a range of key values. One important requirement is that every row in the database must be reconstructable from the projections; a set of projections with this property is called a "covering set" of projections.
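A small sketch of value-range segmentation may help; the fixed-size split below is purely illustrative (C-Store's actual segmentation policy is value-based, and the function name is hypothetical):

```python
def segment_projection(sorted_keys, segment_size):
    """Split a projection's sort-key column into segments; each segment
    gets an SID and covers a contiguous range of sort-key values.
    (Illustrative only: fixed-size chunks stand in for C-Store's
    value-based segmentation.)"""
    segments = {}
    for sid, start in enumerate(range(0, len(sorted_keys), segment_size), 1):
        chunk = sorted_keys[start:start + segment_size]
        segments[sid] = (chunk[0], chunk[-1])  # key range covered by this SID
    return segments

names = sorted(["ann", "bob", "eve", "joe", "kim", "lee"])
print(segment_projection(names, 2))
# {1: ('ann', 'bob'), 2: ('eve', 'joe'), 3: ('kim', 'lee')}
```

Because segments partition the sort-key space, a query with a predicate on the sort key can be routed to just the segments whose ranges overlap it.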
To reconstruct the entire table, or all records that a query should return, C-Store uses join indices and storage keys. A storage key identifies a logical row (tuple) within a given segment. Storage keys are stored explicitly in WS but not in RS, where they are implicit in a tuple's position. Join indices use storage keys to reconstruct the original table across multiple segments. If two projections T1 and T2 together cover a table, then for each tuple of T1 the join index stores the (segment ID, storage key) of the corresponding tuple in T2. Here is an example of a join index that maps the EMP1(name, age | age) projection to EMP3(name, salary | salary).
Maintaining join indices is expensive: any update to a projection requires updating its incoming and outgoing join indices.
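A minimal single-segment sketch of such a join index (hypothetical data and layout; both projections are shown with one segment, so the SID is always 1):

```python
# EMP1 is sorted on name; EMP3 maps storage key -> tuple, sorted on salary.
emp1 = [("ann", 30), ("bob", 25), ("eve", 41)]                    # (name, age | name)
emp3 = {1: ("bob", 70000), 2: ("ann", 80000), 3: ("eve", 90000)}  # SK -> (name, salary | salary)

# Join index: for each EMP1 position, the (SID, storage key) locating the
# same logical tuple in EMP3.
name_to_sk = {name: sk for sk, (name, _) in emp3.items()}
join_index = [(1, name_to_sk[name]) for name, _ in emp1]
print(join_index)  # [(1, 2), (1, 1), (1, 3)]
```

The cost noted above is visible here: inserting or reordering a tuple in either projection forces the corresponding join-index entries to be rewritten.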
RS: Read optimized store
The read-optimized store uses efficient encoding schemes chosen based on the distribution of the data in a column. For example, a column with few distinct values can use a run-length-encoding-like scheme, describing each run as (value, position of first occurrence, number of occurrences). A column with many distinct values can instead store the first value followed by deltas from the previous value; the deltas are small, which reduces the space needed to store the column. Join indices are also stored in RS; as discussed earlier, a join index is really two columns of (segment ID, storage key) pairs.
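The two schemes just described can be sketched as follows (simplified toy encoders, not C-Store's on-disk format):

```python
def rle_encode(column):
    """Encode a column with few distinct values as
    (value, first_position, count) triples, one per run of equal values."""
    out, pos = [], 0
    while pos < len(column):
        run = pos
        while run < len(column) and column[run] == column[pos]:
            run += 1
        out.append((column[pos], pos, run - pos))
        pos = run
    return out

def delta_encode(column):
    """Encode a column with many distinct values as the first value plus
    deltas from the previous value, keeping the stored numbers small."""
    return [column[0]] + [b - a for a, b in zip(column, column[1:])]

print(rle_encode(["eng", "eng", "sales", "sales", "sales"]))
# [('eng', 0, 2), ('sales', 2, 3)]
print(delta_encode([100, 103, 104, 110]))
# [100, 3, 1, 6]
```

Note how the choice depends on the column: RLE collapses long runs of repeated values, while delta encoding shrinks columns of many distinct but nearby values.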
WS: Write optimized Store
This is where all recently modified data resides. WS has the same structure as RS, so that a separate query optimizer is not needed; the only difference is that WS columns are not stored in encoded form, to facilitate ongoing updates. There is a 1:1 mapping between WS and RS: an SID and storage key (SK) identify the same tuple in both. The SID is explicitly stored in WS, and a new storage key is created whenever a value is inserted into WS. Each column is represented as (value, SK) pairs, with B-trees on SK used to index them. In addition, the sort key of the projection is represented as (s, SK) pairs, where s is the sort-key value. The combination of the sort-key index and the storage-key index provides efficient access to attributes while scanning or querying.
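A minimal sketch of a WS column, with a plain dict and a sorted list standing in for the B-trees (class and field names are hypothetical):

```python
import bisect

class WSColumn:
    """Sketch of a WS column: uncompressed (value, SK) pairs keyed by
    storage key, plus a (sort-key value, SK) index for the projection."""

    def __init__(self):
        self.by_sk = {}       # storage key -> value (stored uncompressed)
        self.sort_index = []  # sorted list of (sort_key_value, SK)

    def insert(self, sk, value, sort_key_value):
        self.by_sk[sk] = value
        bisect.insort(self.sort_index, (sort_key_value, sk))

col = WSColumn()
col.insert(sk=101, value=90000, sort_key_value="carol")
col.insert(sk=102, value=85000, sort_key_value="alice")
print(col.by_sk[102])     # 85000
print(col.sort_index[0])  # ('alice', 102)
```

The two indexes mirror the text: `by_sk` answers "give me the value for this storage key," while `sort_index` supports scans in sort-key order.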
Segments are the unit of operation in C-Store. Due to the distributed nature of the database, a storage allocator assigns particular segments to particular nodes. The paper suggests colocating the following: the columns of the same projection, corresponding segments of RS and WS, and a sender segment (like EMP3 in the figure above) with its join index.
Updates and Transactions
Inserts go to WS; an insert means updating all the columns of the given projection, including the sort-key column. The change can first land at any node in the system, and that node is responsible for creating a new storage key for the entry. A unique storage key is obtained by combining the node ID with a local counter; the counter is initialized to be larger than the largest storage key in RS, which keeps keys unique across all nodes. Locking is minimized by using snapshot isolation, the central idea being that read queries use an immutable snapshot from the recent past.
To provide snapshot isolation, C-Store must record the epochs (time ranges) at which records were inserted or deleted. WS maintains an insertion vector (IV) and a deleted record vector (DRV) for every logical record in a given segment. All records inserted before the low watermark (explained later) and not yet deleted are visible to every query. Looking at the IV and DRV, the execution engine can decide whether a record should be visible to a given query. Thinking of time as flowing from top to bottom, the high watermark (HWM) denotes the most recent time at which a snapshot-isolated read-only query can run, and the low watermark (LWM) denotes the earliest such time.
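The per-record visibility rule implied by the IV and DRV can be sketched as a single predicate (an illustrative simplification: C-Store stores these timestamps as vectors per segment, not per call):

```python
def visible(insert_epoch, delete_epoch, query_epoch):
    """A record is visible to a query running at query_epoch iff it was
    inserted at or before that epoch and not deleted at or before it.
    delete_epoch is None for a record that was never deleted."""
    inserted = insert_epoch <= query_epoch
    deleted = delete_epoch is not None and delete_epoch <= query_epoch
    return inserted and not deleted

print(visible(3, 7, 5))     # True  (inserted before, deleted after the snapshot)
print(visible(3, 4, 5))     # False (already deleted at the snapshot)
print(visible(6, None, 5))  # False (inserted after the snapshot)
```

A query's `query_epoch` must fall between the LWM and the HWM for this check to be answerable from the stored vectors.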
C-Store doesn’t track time at high fidelity. It divides time into epochs (a few seconds or more) and provides snapshot isolation at epoch granularity. As shown in the figure above, a timestamp authority (TA) assigns epochs to all nodes and maintains the HWM. It sends a “start of epoch” message to all nodes to indicate that a new epoch should begin. Each node sends an “epoch complete” message back once all transactions from the earlier epoch e have finished. When the TA has received all the complete messages, it announces e as the new HWM to all nodes; this is the isolated snapshot. The current epoch in the system, e + 1, is still being updated and is not yet visible to queries. The tuple mover ensures that no record in RS was created after the LWM, which lets RS avoid maintaining an IV.
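The TA's bookkeeping can be sketched as a small state machine (a hypothetical, single-threaded rendition of the message exchange described above, not the actual protocol implementation):

```python
class TimestampAuthority:
    """Sketch of the TA: it starts epoch e+1 on all nodes, and once every
    node has reported epoch e complete, it publishes e as the new high
    watermark (HWM), i.e. the queryable snapshot."""

    def __init__(self, nodes):
        self.nodes = set(nodes)
        self.epoch = 0    # current epoch being written
        self.hwm = -1     # most recent fully-completed epoch
        self.pending = set()

    def start_new_epoch(self):
        # "start of epoch": wait on "epoch complete" from every node.
        self.pending = set(self.nodes)
        self.epoch += 1

    def epoch_complete(self, node):
        self.pending.discard(node)
        if not self.pending:
            self.hwm = self.epoch - 1  # epoch e becomes the isolated snapshot

ta = TimestampAuthority(["n1", "n2"])
ta.start_new_epoch()     # nodes begin epoch 1; epoch 0 is draining
ta.epoch_complete("n1")
ta.epoch_complete("n2")
print(ta.hwm)  # 0
```

Until the last straggler reports completion, the HWM stays put, so queries keep reading the older, fully settled snapshot.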
To support commits across different nodes, each transaction gets a master. The master splits the transaction's work and assigns it to the involved nodes, then waits for all of them to finish. It then sends a “commit” message to all nodes; once every node has received it, locks can be released and the UNDO log deleted.
Recovery consists of 3 main cases:
- The failed node's data is intact. The node can be recovered by executing updates that were queued elsewhere.
- Both WS and RS on the node suffered data loss. The node must be reconstructed from the projections and join indices at other sites.
- The most common case: RS is intact, but WS is damaged. Because this scenario is so frequent, it is handled with a dedicated efficient mechanism; the paper contains the details.
The tuple mover is a merge process that runs in the background, moving qualified blocks of tuples from WS to RS. The LWM is essentially the earliest effective time at which a read-only transaction can run, and the tuple mover must ensure that all transactions before the LWM are reflected in RS. Tuples in WS with a deletion timestamp at or before the LWM are discarded, as no query should ever see them. Newly inserted tuples are added to RS by merging into a new temporary segment, which means each tuple gets a new storage key and SID; this in turn implies that the join indices must be maintained as well.
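The merge-out decision per tuple can be sketched as follows (a simplification under stated assumptions: one segment, fresh RS storage keys elided, and tuples deleted after the LWM are simply kept in WS, whereas the real system records such deletes in RS's DRV):

```python
def tuple_move(ws_tuples, lwm):
    """Sketch of the tuple mover's merge-out. ws_tuples maps
    storage key -> (value, insert_epoch, delete_epoch_or_None)."""
    to_rs, keep_ws = [], {}
    for sk, (value, ins, dele) in ws_tuples.items():
        if dele is not None and dele <= lwm:
            continue                    # invisible to every future query: discard
        if ins <= lwm and dele is None:
            to_rs.append(value)         # settled and live: gets a new SID/SK in RS
        else:
            keep_ws[sk] = (value, ins, dele)  # still in flux: stays in WS
    return to_rs, keep_ws

ws = {1: ("a", 1, 2), 2: ("b", 1, None), 3: ("c", 5, None)}
print(tuple_move(ws, lwm=3))  # (['b'], {3: ('c', 5, None)})
```

Tuple "a" was deleted before the LWM and vanishes, "b" moves to RS, and "c" is too recent to move.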
C-Store introduces a columnar database built on the novel idea of separating a write-optimized store (WS) from a read-optimized store (RS) for read-heavy workloads. It also supports a distributed commit mechanism using a per-transaction master, making it a fully distributed database. Its encoding schemes, covered in the paper but only touched on in this post, provide an efficient mechanism for compact storage.