Stratified B-Trees and Versioned Dictionaries

Topics: File system, B-tree, Revision control Pages: 16 (4093 words) Published: January 1, 2013
Andy Twigg, Andrew Byde, Grzegorz Miłoś, Tim Moreton, John Wilkes† and Tom Wilkie (Acunu, †Google)
External-memory versioned dictionaries are fundamental
to file systems, databases and many other algorithms.
The ubiquitous data structure is the copy-on-write
(CoW) B-tree. Unfortunately, it doesn’t inherit the
B-tree’s optimality properties; it has poor space utilization, cannot offer fast updates, and relies on random IO
to scale. We describe the ‘stratified B-tree’, which is the first versioned dictionary offering fast updates and an optimal tradeoff between space, query and update costs.
1 Introduction
The (external-memory) dictionary is at the heart of any
file system or database, and many other algorithms. A
dictionary stores a mapping from keys to values. A versioned dictionary is a dictionary with an associated version
tree, supporting the following operations:
  • update(k,v,x): associate value x to key k in
    leaf version v;
  • range_query(k1,k2,v): return all keys (and
    values) in range [k1,k2] in version v;
  • clone(v): return a new child of version v that
    inherits all its keys and values.
Note that only leaf versions can be modified. If clone
only works on leaf versions, we say the structure is
partially-versioned; otherwise it is fully-versioned.
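The three operations can be pinned down with a small in-memory reference model. This is only a sketch of the interface semantics, not the stratified B-tree itself: the class name `VersionedDict`, the use of integer version ids with root version 0, and the per-version dictionaries are all assumptions made for illustration, and nothing here is external-memory efficient.

```python
class VersionedDict:
    """Naive in-memory model of a fully-versioned dictionary (illustration only)."""

    def __init__(self):
        self.parent = {0: None}   # version tree: version -> parent version
        self.children = {0: []}   # version -> list of child versions
        self.writes = {0: {}}     # version -> keys written directly in that version
        self._next = 1            # next unused version id (hypothetical numbering)

    def update(self, k, v, x):
        """Associate value x with key k in leaf version v."""
        if self.children[v]:
            raise ValueError("only leaf versions can be modified")
        self.writes[v][k] = x

    def clone(self, v):
        """Return a new child of version v that inherits all its keys and values."""
        child = self._next
        self._next += 1
        self.parent[child] = v
        self.children[child] = []
        self.children[v].append(child)
        self.writes[child] = {}
        return child

    def _lookup(self, k, v):
        # Walk towards the root; the nearest ancestor that wrote k wins.
        while v is not None:
            if k in self.writes[v]:
                return self.writes[v][k]
            v = self.parent[v]
        raise KeyError(k)

    def range_query(self, k1, k2, v):
        """Return sorted (key, value) pairs with k1 <= key <= k2 visible in version v."""
        keys = set()
        u = v
        while u is not None:
            keys.update(k for k in self.writes[u] if k1 <= k <= k2)
            u = self.parent[u]
        return [(k, self._lookup(k, v)) for k in sorted(keys)]


d = VersionedDict()
v1 = d.clone(0)
d.update("a", v1, 1)
v2 = d.clone(v1)          # v1 becomes an internal version
v3 = d.clone(v1)          # cloning an internal version: fully-versioned behaviour
d.update("a", v2, 2)
d.update("b", v3, 3)
print(d.range_query("a", "z", v2))   # [('a', 2)]
print(d.range_query("a", "z", v3))   # [('a', 1), ('b', 3)]
```

A partially-versioned variant of this model would simply reject `clone(v)` whenever `v` already has children.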
2 Related work
The B-tree was presented in 1972 [1], and it survives because it has many desirable properties; in particular, it
uses optimal space, and offers point queries in optimal
O(log_B N) IOs¹. More details can be found in [7].
¹ We use the standard notation B to denote the block size, and N the total number of elements inserted. For the analysis, we assume entries (including pointers) are of equal size, so B is the number of entries per block.
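To give a feel for the point-query bound, the sketch below counts the levels of an idealized tree in which every block holds exactly B entries. The function name `btree_levels` and the figures B = 1024 and N = 10^9 are illustrative assumptions, and real B-trees have partially filled blocks, so this is only an order-of-magnitude estimate.

```python
def btree_levels(n, b):
    """Smallest h with b**h >= n: levels of an idealized tree with fanout b over n entries."""
    h, capacity = 0, 1
    while capacity < n:
        capacity *= b   # each extra level multiplies the reachable entries by b
        h += 1
    return h


print(btree_levels(10**9, 1024))   # 3
print(btree_levels(10**9, 2))      # 30
```

With these assumed figures, a point query touches about 3 blocks, against about 30 node visits for a binary tree over the same data.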

A versioned B-tree is of great interest to storage
and file systems. In 1986, Driscoll et al. [8] presented
the ‘path-copying’ technique to make pointer-based
internal-memory data structures fully-versioned
(fully-persistent). Applying this technique to the B-tree
gives the copy-on-write (CoW) B-tree, first deployed in
EpisodeFS in 1992 [6]. Since then, it has become ubiquitous
in file systems and databases, e.g. WAFL [11], ZFS
[4], Btrfs [9], and many more.
The CoW B-tree does not share the same optimality
properties as the B-tree. Every update requires random
IOs to walk down the tree and then to write out a new
path, copying previous blocks. Many systems use a CoW
B-tree with a log file system, in an attempt to make the
writes sequential. Although this succeeds for light workloads, in general it leads to large space blowups, inefficient
caching, and poor performance.
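The path-copying idea can be sketched on a plain binary search tree, which is simpler than a B-tree but shows the same behaviour: an update copies every node on the root-to-leaf path and shares everything else with the previous version. The names `Node`, `cow_insert` and `lookup` are hypothetical, and an on-disk CoW B-tree would copy whole blocks rather than small nodes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    key: int
    value: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def cow_insert(root, key, value):
    """Return the root of a new version; nodes off the search path are shared."""
    if root is None:
        return Node(key, value)
    if key < root.key:
        return Node(root.key, root.value, cow_insert(root.left, key, value), root.right)
    if key > root.key:
        return Node(root.key, root.value, root.left, cow_insert(root.right, key, value))
    return Node(key, value, root.left, root.right)   # overwrite: copy this node only


def lookup(root, key):
    while root is not None:
        if key == root.key:
            return root.value
        root = root.left if key < root.key else root.right
    return None


v0 = None
for k in (50, 30, 70, 20, 40):
    v0 = cow_insert(v0, k, "old")
v1 = cow_insert(v0, 20, "new")     # new version; v0 is left untouched

print(lookup(v0, 20), lookup(v1, 20))   # old new
print(v1.right is v0.right)             # True: subtree off the path is shared
print(v1.left is v0.left)               # False: nodes on the path were copied
```

Every update allocates a fresh copy of the whole path, which is where the extra space and the random writes described above come from.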
3 This paper
For unversioned dictionaries, it is known that sacrificing
point lookup cost from O(log_B N) to O(log N) allows
update cost to be improved from O(log_B N) to
O((log N)/B). In practice, this improves update cost by about 2 orders of magnitude in exchange for point queries that are around 3x slower
[3]. This paper presents a recent construction, the stratified B-tree, which offers an analogous query/update
tradeoff for fully-versioned data. It offers fully-versioned updates around 2 orders of magnitude faster than the
CoW B-tree, and performs around one order of magnitude
faster for range queries, thanks to heavy use of sequential
IO. In addition, it is cache-oblivious [10] and
can be implemented without locking. This means it can
take advantage of many-core architectures and SSDs.
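The update-cost figures for the unversioned case quoted at the start of this section can be sanity-checked with a few lines of arithmetic. The sketch assumes the logarithm in O(log N) is base 2, ignores all constant factors, and uses B = 1024 entries per block purely as an example; the function names are hypothetical.

```python
import math


def update_costs(n, b):
    """Asymptotic IOs per update, constants ignored: (B-tree, write-optimized)."""
    btree = math.log(n, b)          # O(log_B N)
    buffered = math.log2(n) / b     # O((log N)/B), assuming base-2 logarithm
    return btree, buffered


def update_cost_ratio(b):
    """(log_B N) / ((log2 N) / B) simplifies to B / log2(B), independent of N."""
    return b / math.log2(b)


btree, buffered = update_costs(10**9, 1024)
print(btree / buffered)           # roughly 100
print(update_cost_ratio(1024))    # 102.4
```

Under these assumptions the ratio is about 100, which matches the "about 2 orders of magnitude" quoted for update cost.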
The downside is that point queries are slightly slower,
around 3x in our implementation. However, many applications, particularly for analytics and so-called ‘big data’
problems, require high ingest rates and range queries,
rather than point queries. For these applications, we
believe the stratified B-tree is a better choice than the
CoW B-tree, and all other known versioned dictionaries.
Acunu is developing a commercial open-source implementation
of stratified B-trees...

References: 1(3):173–189, 1972.
5(4):264–275, 1996.
New York, NY, USA, 2007. ACM.
[4] Jeff Bonwick and Matt Ahrens. The zettabyte file system,
In USENIX Annual Technical Conference, pages 43–60,
McGraw-Hill Higher Education, 2nd edition, 2001.
In STOC ’86, pages 109–121, New York, NY, USA,
USA, 1999. IEEE Computer Society.
[11] Dave Hitz and James Lau. File system design for an NFS
file server appliance, 1994.
SIGMOD Rec., 20(2):426–435, 1991.
Berkeley, CA, USA, 2003. USENIX Association.