Stratified B-trees and Versioned Dictionaries
Andy Twigg, Andrew Byde, Grzegorz Miłoś, Tim Moreton, John Wilkes† and Tom Wilkie
Acunu, †Google
External-memory versioned dictionaries are fundamental
to file systems, databases and many other algorithms.
The ubiquitous data structure is the copy-on-write
(CoW) B-tree. Unfortunately, it doesn’t inherit the
B-tree’s optimality properties; it has poor space utilization, cannot offer fast updates, and relies on random IO
to scale. We describe the ‘stratified B-tree’, which is the first versioned dictionary offering fast updates and an optimal tradeoff between space, query and update costs.
1 Introduction
The (external-memory) dictionary is at the heart of any
file system or database, and many other algorithms. A
dictionary stores a mapping from keys to values. A versioned dictionary is a dictionary with an associated version
tree, supporting the following operations:
update(k, v, x): associate value x to key k in leaf version v;
range_query(k1, k2, v): return all keys (and values) in range [k1, k2] in version v;
clone(v): return a new child of version v that inherits all its keys and values.
Note that only leaf versions can be modified. If clone
only works on leaf versions, we say the structure is
partially-versioned; otherwise it is fully-versioned.
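The semantics of these three operations can be illustrated with a deliberately naive in-memory sketch. It is not the stratified B-tree and makes no attempt at IO or space efficiency: every version simply holds a full copy of its data. The class name `NaiveVersionedDict`, the choice of version 0 as the root, and the use of integer version ids are assumptions made for illustration only.

```python
class NaiveVersionedDict:
    """Semantics-only sketch: each version owns a full copy of its keys."""

    def __init__(self):
        self._data = {0: {}}       # version id -> {key: value}
        self._children = {0: []}   # version id -> child version ids
        self._next_id = 1          # version 0 is the (assumed) root version

    def is_leaf(self, v):
        return not self._children[v]

    def update(self, k, v, x):
        """Associate value x to key k in leaf version v."""
        if not self.is_leaf(v):
            raise ValueError(f"version {v} is not a leaf and cannot be modified")
        self._data[v][k] = x

    def range_query(self, k1, k2, v):
        """Return sorted (key, value) pairs with k1 <= key <= k2 in version v."""
        return sorted((k, x) for k, x in self._data[v].items() if k1 <= k <= k2)

    def clone(self, v):
        """Return a new child of version v that inherits its keys and values."""
        child = self._next_id
        self._next_id += 1
        self._data[child] = dict(self._data[v])   # full copy: simple but wasteful
        self._children[child] = []
        self._children[v].append(child)
        return child


d = NaiveVersionedDict()
d.update("a", 0, 1)        # version 0 is still a leaf here
v1 = d.clone(0)            # version 0 now has a child and becomes read-only
d.update("a", v1, 2)
d.update("b", v1, 3)
old_view = d.range_query("a", "z", 0)
new_view = d.range_query("a", "z", v1)
```

Because `clone` accepts any version, including internal ones, this sketch is fully-versioned in the sense defined above; restricting `clone` to leaf versions would make it partially-versioned.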
2 Related work
The B-tree was presented in 1972, and it survives because it has many desirable properties; in particular, it
uses optimal space, and offers point queries in optimal
O(log_B N) IOs¹. More details can be found in .
¹ We use the standard notation B to denote the block size, and N the total number of elements inserted. For the analysis, we assume entries (including pointers) are of equal size, so B is the number of entries per block.
A versioned B-tree is of great interest to storage
and file systems. In 1986, Driscoll et al. presented
the ‘path-copying’ technique to make pointer-based
internal-memory data structures fully-versioned
(fully-persistent). Applying this technique to the B-tree
gives the copy-on-write (CoW) B-tree, first deployed in
EpisodeFS in 1992. Since then, it has become ubiquitous
in file systems and databases, e.g. WAFL, ZFS,
Btrfs, and many more.
The CoW B-tree does not share the same optimality
properties as the B-tree. Every update requires random
IOs to walk down the tree and then to write out a new
path, copying previous blocks. Many systems use a CoW
B-tree with a log file system, in an attempt to make the
writes sequential. Although this succeeds for light workloads, in general it leads to large space blowups, inefficient
caching, and poor performance.
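The path-copying idea behind the CoW B-tree can be sketched on a much simpler structure. The example below uses an in-memory binary search tree rather than a B-tree, which is a simplification chosen for brevity; the names `Node`, `insert` and `lookup` are hypothetical. It shows that an update copies every node on the root-to-leaf path and shares all other nodes with the previous version.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    key: int
    value: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(root, key, value):
    """Return the root of a new version; the old version is left untouched."""
    if root is None:
        return Node(key, value)
    if key < root.key:
        # Copy this node, pointing at a new left subtree and the shared right one.
        return Node(root.key, root.value, insert(root.left, key, value), root.right)
    if key > root.key:
        return Node(root.key, root.value, root.left, insert(root.right, key, value))
    # Same key: copy the node with the new value, sharing both subtrees.
    return Node(key, value, root.left, root.right)


def lookup(root, key):
    while root is not None:
        if key == root.key:
            return root.value
        root = root.left if key < root.key else root.right
    return None


v0 = None
for k in (50, 30, 70, 20, 40):
    v0 = insert(v0, k, f"old-{k}")

v1 = insert(v0, 40, "new-40")   # copies the path 50 -> 30 -> 40 only
```

In an external-memory setting each copied node corresponds to a block that must be read and rewritten, which is where the random IOs described above come from.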
3 This paper
For unversioned dictionaries, it is known that sacrificing
point lookup cost from O(log_B N) to O(log N) allows
update cost to be improved from O(log_B N) to
O((log N)/B). In practice, this is about 2 orders of magnitude improvement for around 3x slower point queries.
This paper presents a recent construction, the Stratified B-tree, which offers an analogous query/update
tradeoff for fully-versioned data. It offers fully-versioned updates around 2 orders of magnitude faster than the
CoW B-tree, and performs around one order of magnitude
faster for range queries, thanks to heavy use of sequential
IO. In addition, it is cache-oblivious and
can be implemented without locking. This means it can
take advantage of many-core architectures and SSDs.
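To give a feel for the unversioned update bounds quoted at the start of this section, the following sketch evaluates the two asymptotic expressions for one set of sample parameters. The values B = 1024 entries per block and N = 2^30 elements are assumptions picked for illustration, the function names are hypothetical, and all constant factors are ignored, so the output is an order-of-magnitude indication rather than a measurement.

```python
import math


def btree_update_ios(n, b):
    """O(log_B N) with the constant dropped: IOs per B-tree update."""
    return math.log2(n) / math.log2(b)


def buffered_update_ios(n, b):
    """O((log N)/B) with the constant dropped: amortized IOs per update."""
    return math.log2(n) / b


N = 2 ** 30   # assumed number of elements
B = 1024      # assumed entries per block

btree = btree_update_ios(N, B)
buffered = buffered_update_ios(N, B)
speedup = btree / buffered

print(f"B-tree update:   {btree:.3f} IOs")
print(f"buffered update: {buffered:.3f} IOs (amortized)")
print(f"ratio:           {speedup:.0f}x")
```

For these sample values the ratio is roughly 100, in line with the "about 2 orders of magnitude" figure. The same constant-free expressions are too coarse to reproduce the measured point-query slowdown of around 3x, which depends on constants and implementation details.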
The downside is that point queries are slightly slower,
around 3x in our implementation. However, many applications, particularly for analytics and so-called ‘big data’
problems, require high ingest rates and range queries,
rather than point queries. For these applications, we
believe the stratified B-tree is a better choice than the
CoW B-tree, and all other known versioned dictionaries.
Acunu is developing a commercial open-source implementation
of stratified B-trees...