Stratified B-trees and Versioned Dictionaries.
Andy Twigg, Andrew Byde, Grzegorz Miłoś, Tim Moreton, John Wilkes† and Tom Wilkie
Acunu, †Google
External-memory versioned dictionaries are fundamental
to file systems, databases and many other algorithms.
The ubiquitous data structure is the copy-on-write
(CoW) B-tree. Unfortunately, it doesn’t inherit the
B-tree’s optimality properties; it has poor space utilization, cannot offer fast updates, and relies on random IO
to scale. We describe the ‘stratified B-tree’, which is the first versioned dictionary offering fast updates and an optimal tradeoff between space, query and update costs.
1 Introduction
The (external-memory) dictionary is at the heart of any
file system or database, and many other algorithms. A
dictionary stores a mapping from keys to values. A versioned dictionary is a dictionary with an associated version
tree, supporting the following operations:
update(k,v,x): associate value x with key k in
leaf version v;
range_query(k1,k2,v): return all keys (and
values) in range [k1,k2] in version v;
clone(v): return a new child of version v that
inherits all its keys and values.
Note that only leaf versions can be modified. If clone
only works on leaf versions, we say the structure is
partially-versioned; otherwise it is fully-versioned.
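The three operations above can be illustrated with a naive in-memory reference model. This is only a sketch of the interface semantics, not the stratified B-tree and not an external-memory structure. The class name VersionedDict, the use of integer version ids with 0 as the root version, and the per-version write maps are all assumptions made for the example.

```python
class VersionedDict:
    """Naive reference model of a fully-versioned dictionary (illustrative only)."""

    def __init__(self):
        # Version 0 is the (assumed) root version of the version tree.
        self._parent = {0: None}   # version -> parent version
        self._children = {0: 0}    # version -> number of children
        self._writes = {0: {}}     # version -> keys written in that version
        self._next = 1

    def clone(self, v):
        """Return a new child of version v that inherits all its keys and values."""
        child = self._next
        self._next += 1
        self._parent[child] = v
        self._children[v] += 1
        self._children[child] = 0
        self._writes[child] = {}
        return child

    def update(self, k, v, x):
        """Associate value x with key k in leaf version v."""
        if self._children[v] > 0:
            raise ValueError("only leaf versions can be modified")
        self._writes[v][k] = x

    def range_query(self, k1, k2, v):
        """Return sorted (key, value) pairs with k1 <= key <= k2 in version v."""
        result = {}
        # Walk from v up to the root; the nearest write to a key wins.
        while v is not None:
            for k, x in self._writes[v].items():
                if k1 <= k <= k2 and k not in result:
                    result[k] = x
            v = self._parent[v]
        return sorted(result.items())
```

Because clone accepts any version here, including one that already has children, the model is fully-versioned in the sense defined above. Restricting clone to leaf versions would make it partially-versioned.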
2 Related work
The B-tree was presented in 1972, and it survives because it has many desirable properties; in particular, it
uses optimal space, and offers point queries in optimal
O(log_B N) IOs (see footnote 1). More details can be found in .
Footnote 1: We use the standard notation B to denote the block size, and N the total number of elements inserted. For the analysis, we assume entries (including pointers) are of equal size, so B is the number of entries per block.
A versioned B-tree is of great interest to storage
and file systems. In 1986, Driscoll et al. presented
the ‘path-copying’ technique to make pointer-based
internal-memory data structures fully-versioned
(fully-persistent). Applying this technique to the B-tree
gives the copy-on-write (CoW) B-tree, first deployed in
EpisodeFS in 1992. Since then, it has become ubiquitous
in file systems and databases, e.g. WAFL, ZFS,
Btrfs, and many more.
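Path copying can be sketched on a small pointer-based structure. The example below uses an unbalanced binary search tree rather than a B-tree, purely to keep it short. The names Node, insert and lookup are hypothetical. An update copies only the nodes on the root-to-leaf path and shares every other subtree with the previous version.

```python
from typing import NamedTuple, Optional


class Node(NamedTuple):
    """Immutable tree node; old versions are never modified in place."""
    key: int
    value: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(root, key, value):
    """Return the root of a new version; the version rooted at `root` is unchanged."""
    if root is None:
        return Node(key, value)
    if key < root.key:
        # Copy this node, pointing at a new left subtree; the right is shared.
        return root._replace(left=insert(root.left, key, value))
    if key > root.key:
        # Copy this node, pointing at a new right subtree; the left is shared.
        return root._replace(right=insert(root.right, key, value))
    return root._replace(value=value)


def lookup(root, key):
    """Return the value stored for `key` in the version rooted at `root`, or None."""
    while root is not None:
        if key == root.key:
            return root.value
        root = root.left if key < root.key else root.right
    return None


v1 = insert(insert(insert(None, 5, "a"), 2, "b"), 8, "c")
v2 = insert(v1, 1, "d")  # copies the nodes holding 5 and 2; the node holding 8 is shared
```

Each version is identified by its root, so keeping an old root keeps that version readable. In a CoW B-tree the copied path consists of whole blocks, which is why every update rewrites a root-to-leaf path of blocks.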
The CoW B-tree does not share the same optimality
properties as the B-tree. Every update requires random
IOs to walk down the tree and then to write out a new
path, copying previous blocks. Many systems use a CoW
B-tree with a log file system, in an attempt to make the
writes sequential. Although this succeeds for light workloads, in general it leads to large space blowups, inefficient
caching, and poor performance.
3 This paper
For unversioned dictionaries, it is known that accepting a
point lookup cost of O(log N) instead of O(log_B N) allows
update cost to be improved from O(log_B N) to
O((log N)/B). In practice, this improves update cost by about 2 orders of magnitude, in exchange for point queries that are around 3x slower.
This paper presents a recent construction, the Stratified B-tree, which offers an analogous query/update
tradeoff for fully-versioned data. It offers fully-versioned updates around 2 orders of magnitude faster than the
CoW B-tree, and performs around one order of magnitude
faster for range queries, thanks to heavy use of sequential
IO. In addition, it is cache-oblivious and
can be implemented without locking. This means it can
take advantage of many-core architectures and SSDs.
The downside is that point queries are slightly slower,
around 3x in our implementation. However, many applications, particularly for analytics and so-called ‘big data’
problems, require high ingest rates and range queries,
rather than point queries. For these applications, we
believe the stratified B-tree is a better choice than the
CoW B-tree, and all other known versioned dictionaries.
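To see where a figure of about 2 orders of magnitude can come from, the asymptotic bounds quoted in this section can be evaluated for sample parameters. The values B = 1024 entries per block and N = 2^30 elements are assumptions chosen for illustration, and the function names are hypothetical. The bounds hide constant factors and caching effects, so the numbers are indicative only. In particular, the ratio of the query bounds is an asymptotic figure and is not meant to reproduce the measured slowdown of around 3x.

```python
import math


def btree_cost(n, b):
    """O(log_B N) bound: B-tree point query or update cost, in IOs."""
    return math.log(n, b)


def fast_update_cost(n, b):
    """O((log N)/B) bound: amortized update cost of the write-optimized structure."""
    return math.log2(n) / b


def slow_query_cost(n):
    """O(log N) bound: point query cost of the write-optimized structure."""
    return math.log2(n)


def update_speedup(n, b):
    """Ratio of the two update bounds, ignoring constant factors."""
    return btree_cost(n, b) / fast_update_cost(n, b)


def query_slowdown(n, b):
    """Ratio of the two query bounds, ignoring constant factors (equals log2 B)."""
    return slow_query_cost(n) / btree_cost(n, b)


# Assumed sample parameters, not taken from the source.
B = 1024
N = 2 ** 30
print(f"update bound ratio: {update_speedup(N, B):.1f}")
print(f"query bound ratio:  {query_slowdown(N, B):.1f}")
```

With these assumed parameters the update bounds differ by a factor of roughly one hundred, which is consistent with the order of magnitude stated above.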
Acunu is developing a commercial open-source implementation
of stratified B-trees...