
C-Store: A Column-oriented DBMS

Mike Stonebraker, Daniel J. Abadi, Adam Batkin, Xuedong Chen, Mitch Cherniack, Miguel Ferreira, Edmond Lau, Amerson Lin, Sam Madden, Elizabeth O'Neil, Pat O'Neil, Alex Rasin, Nga Tran, and Stan Zdonik.
MIT, Brandeis University, UMass Boston, and Brown University
Scribe by: Zuyu Zhang

One-line Summary

Overview/Main Points

Background
- What has changed since the architecture of DBMS became mature?
  1. Hardware
    - huge main memory: working sets (and more applications) can fit in memory
    - storage is no longer just disk vs. RAM (flash, PCM, ...)
    - multi-core systems: concurrency control becomes painful
  2. It is OK to be special purpose
    - a read-only DBMS on a RAID array can be 10x to 20x faster. Suitable for append-only transactions with scans.
  3. NoSQL
- Why take cs764?
  1. Legacy DBMSs cannot be retired soon.
  2. The "elephants" have adapted well over the years & evaluation.
  3. Old stuff helps in understanding new stuff.
row store
- Row scan for access path without index.
- SQL deals with rows (records).
- Select * From R

column store

Benefits

Scanning only a few columns is fast
- S(A₁, A₂, ..., A_n)
- Select S.A₁, S.A₂
  From S
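
A minimal sketch of why this query is cheap in a column store, assuming a toy in-memory layout with one Python list per attribute. The table contents and the `project` helper are hypothetical and only illustrate the idea that columns not named in the query are never read.

```python
# Hypothetical in-memory column store: one list per attribute of S.
S = {
    "A1": [1, 2, 3],
    "A2": ["x", "y", "z"],
    "A3": [0.5, 1.5, 2.5],  # not referenced by the query, so never read
}

def project(table, columns):
    """Return rows built from only the requested columns."""
    selected = [table[name] for name in columns]  # touch only these columns
    return list(zip(*selected))

# Select S.A1, S.A2 From S
result = project(S, ["A1", "A2"])
print(result)
```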

Compression

dictionary encoding
- ex: if a column has 1000 distinct values, 10 bits per value are enough (2^10 = 1024 ≥ 1000). A long string such as "Shanmugasundaram" is then stored as the code 0001011001.
- But decoding needs a dictionary lookup, which takes time.
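
A small sketch of dictionary encoding, assuming the dictionary is simply the sorted list of distinct values and the code is the index into it. The function names and sample column are made up for illustration; a real system would also pack the codes into `bits`-wide fields rather than keep Python integers.

```python
def dictionary_encode(column):
    """Replace each value by a small integer code; return (dictionary, codes, bits)."""
    dictionary = sorted(set(column))
    code_of = {value: code for code, value in enumerate(dictionary)}
    # Bits needed to distinguish len(dictionary) values (at least 1).
    bits = max(1, (len(dictionary) - 1).bit_length())
    codes = [code_of[value] for value in column]
    return dictionary, codes, bits

def dictionary_decode(dictionary, codes):
    """Decoding costs one dictionary lookup per value."""
    return [dictionary[code] for code in codes]

names = ["b", "a", "b", "c"]
dictionary, codes, bits = dictionary_encode(names)
# A column with 1000 distinct values needs 10 bits per code.
_, _, bits_1000 = dictionary_encode(list(range(1000)))
```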
run-length encoding
- Works well when the column is sorted and has few distinct values.
- ex: 100 consecutive rows with value 3 ⇒ (3, 100).
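
The example above can be sketched with `itertools.groupby`, which groups consecutive equal values. The helper names are hypothetical, and the (value, count) pair layout follows the (3, 100) notation used in these notes.

```python
from itertools import groupby

def rle_encode(column):
    """Collapse each run of equal consecutive values into (value, run_length)."""
    return [(value, len(list(run))) for value, run in groupby(column)]

def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, count in pairs for _ in range(count)]

encoded = rle_encode([3] * 100)
mixed = rle_encode([1, 1, 2, 2, 2, 5])
```

An unsorted column breaks into many short runs, which is why the scheme pays off mainly on sorted data.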

delta encoding

- Works well when the column is sorted and has many distinct values.
- ex:

Original column	Compressed column
100,001	100,001
100,003	2
100,003	0
100,007	4
...	...
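
A minimal sketch of the delta encoding shown in the table, assuming the first value is stored in full and every later entry stores the difference from its predecessor. The function names are illustrative only.

```python
from itertools import accumulate

def delta_encode(column):
    """Keep the first value, then store each value as a difference from the previous one."""
    if not column:
        return []
    deltas = [column[0]]
    for previous, current in zip(column, column[1:]):
        deltas.append(current - previous)
    return deltas

def delta_decode(deltas):
    """A running sum restores the original values."""
    return list(accumulate(deltas))

original = [100001, 100003, 100003, 100007]
compressed = delta_encode(original)
```

On a sorted column the differences are small non-negative integers, so they fit in far fewer bits than the full values.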

Only one column (the one the data is sorted on) would get a good compression rate.

Late materialization

ex: σ_R.A=8M (R)
R.A is compressed, so it is a tiny column.
The select operator outputs a bitmap of the matching positions, which is then used to fetch the values of the other columns.
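
A rough sketch of late materialization under simplifying assumptions: columns are uncompressed Python lists, the bitmap is a list of booleans, and the constant 8 stands in for the 8M in the example. The table `R` and both helper functions are hypothetical.

```python
# Hypothetical table R stored column by column.
R = {
    "A": [8, 3, 8, 5],
    "B": ["w", "x", "y", "z"],
    "C": [10, 20, 30, 40],
}

def select_positions(column, predicate):
    """Scan a single column and return a bitmap: True where the predicate holds."""
    return [predicate(value) for value in column]

def materialize(table, bitmap, columns):
    """Fetch the other columns only at the positions set in the bitmap."""
    positions = [i for i, hit in enumerate(bitmap) if hit]
    return [tuple(table[name][i] for name in columns) for i in positions]

bitmap = select_positions(R["A"], lambda value: value == 8)
rows = materialize(R, bitmap, ["B", "C"])
```

Tuples are built only for qualifying positions, so columns B and C are not touched for rows that fail the predicate.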
Better cache performance than row-store

Row-store	a₁b₁c₁a₂b₂c₂...
C-store	a₁a₂a₃a₄a₅a₆...

hash join
- ex: emp⋈ dept
- Store join results for materialization
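
One possible reading of the note above, sketched under stated assumptions: the hash join runs over the join-key columns only and stores its result as pairs of positions, and the remaining columns are materialized afterwards from those positions. The `emp` and `dept` contents, the `dept_id` join key, and the helper name are all made up for illustration.

```python
# Hypothetical column-wise tables.
emp = {"name": ["ann", "bob", "cy"], "dept_id": [10, 20, 10]}
dept = {"dept_id": [10, 20], "dname": ["db", "os"]}

def hash_join_positions(build_keys, probe_keys):
    """Classic hash join on key columns; returns (probe_pos, build_pos) pairs."""
    table = {}
    for position, key in enumerate(build_keys):      # build phase
        table.setdefault(key, []).append(position)
    return [
        (probe_position, build_position)             # probe phase
        for probe_position, key in enumerate(probe_keys)
        for build_position in table.get(key, [])
    ]

# Join result stored as positions only.
pairs = hash_join_positions(dept["dept_id"], emp["dept_id"])
# Materialize the requested columns from the stored positions.
joined = [(emp["name"][e], dept["dname"][d]) for e, d in pairs]
```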

Disadvantages
- Reconstruction cost
  - Select * from R
  - How do you reconstruct rows?
    1. Store (key, attribute value) pairs for each column, and join them on the key. A row is logically (k, a, b, c), and the store looks like ...
    2. Store all columns in the same order, so that a position identifies a row.
  - The reconstruction cost is not that bad.
- updates (insertions)
  - One I/O per column, and even worse for column-store systems with replicas.
  - The impact is really bad, but it can be mitigated by batching updates.
  - Column stores are claimed not to work for such workloads!
  - Typical workload: append-only, and answer queries as fast as possible.
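
The two row-reconstruction options listed under "Reconstruction cost" can be sketched as follows. This is an illustration only: the column contents and function names are hypothetical, and the key-based variant uses a dictionary lookup as a stand-in for a join on the key.

```python
# Option 1: each column holds (key, value) pairs; rows are rebuilt by joining on the key.
col_a = [(1, "a1"), (2, "a2")]
col_b = [(2, "b2"), (1, "b1")]  # the order may differ between columns

def reconstruct_by_key(*columns):
    """Join the columns on their key to produce (k, a, b, ...) rows."""
    lookups = [dict(column) for column in columns]
    return [(key,) + tuple(lookup[key] for lookup in lookups)
            for key in sorted(lookups[0])]

# Option 2: all columns share one order, so the i-th entries together form row i.
def reconstruct_by_position(*columns):
    """Stitch rows together positionally; no keys are stored or joined."""
    return list(zip(*columns))

rows_by_key = reconstruct_by_key(col_a, col_b)
rows_by_position = reconstruct_by_position(["a1", "a2"], ["b1", "b2"])
```

The positional variant avoids storing keys and avoids the join, at the price of keeping every column in the same order.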

Relevance

Flaws