next-generation programming platform, currently in development
help fund the project

Twitter . GitHub . RSS

Easy snapshot isolation and PCBT merges

This post has some notes on how to implement database inserts and deletes for the indexing data structure discussed in the last post, the Prioritized Critical Bit Tree (PCBT).

As background, you may want to review the data structure being used for indexing, the PCBT. It’s similar to a radix trie except that the positions examined along each path are not necessarily monotonically increasing; instead, we examine whatever position has maximal entropy with respect to the set of values contained in the subtree of the structure. This raises a few questions:

Let’s look at the first question. The PCBT as described is a static data structure, a labeled tree with certain properties that enable lots of operations. How do we modify this structure? We could:

Let’s reject that approach. The other class of approaches is to use some form of dynamization. To dynamize a data structure, we provide a merge function for the structure, and a merge function for results of queries over the structure. There are various schemes we can use for deciding when to merge the various fragments (my term) of the structure, but as a rule of thumb we’ll aim for having a sequence of fragments which increase in size exponentially. To do a lookup, we check the smallest fragment (which likely lives in memory), then the next fragment, which is say double the size, and so on. In the naive approach we end up doing a logarithmic number of lookups, but it’s possible to improve on this (for instance, by using a bloom filter to find the fragment which could contain the value).

Because we never modify a fragment once it is created, achieving things like snapshot isolation becomes easy—just keep around the old fragments. Assuming we have a fast way of doing merges, we can achieve very high write speeds, too, as there is only contention at the level of the “insert queue” rather than contention on updates to individual nodes in the data structure.

Anyway, this is not a new idea, and some storage engines (like LevelDB, I’m told) use it to implement fast edits.

Merging PCBTs

I’ve already worked out how to merge PCBT results for most query operations like join, intersect, union, and so on. The implementations have kind of an obvious quality to them; there are generally two cases to deal with at each level—the case where both trees being combined examine the same position at their root, and the case where they differ in what position they examine. A little thought reveals what must be done in each case. See the code for details.

What about merging the PCBTs themselves? Let’s look at an example data set, consisting of a set of bitvectors (we’ll generalize this later):


For each column, from 0 to the length of the maximum bit-vector, we compute the number of 1s, 0s, and neithers (for bit vectors not long enough) in that column. Call this the “summary vector”. The summary vector takes up very little space. If there are N elements in the PCBT, with a maximum length of M, storing the summary vector even naively is only O(M * log N) (that is an upper bound; Θ-bound is tighter). We will choose to store the summary vector for each node in our PCBT, which it can be shown increases memory usage by at most a constant factor. (Note for instance that the leaf level summary vector contains all the information of the original bit vector, and is within a constant factor in size, so we can just store summary vectors if we wish.)

Given a summary vector, the root of the PCBT should be the position with “maximal information”. One scoring function is to just multiply entropy by total count (total count is just 0s + 1s + neithers). If all bit vectors have the same length, this is just the position with maximum entropy. If vectors are of different lengths (as in the above example), positions receive a linear weight based on how many vectors even have each position. Thus, a position with exactly 50/50 1s and zeros but only 4 vectors gets a lower score than a position with 55/45 but a thousand vectors.

To merge two PCBTs, we combine their summary vectors, then use this to decide the root position. We then partition the full data set based on this position. The 0s all go in one set, the 1s all go in another. Notice this can be done streaming, and we can compute the summary for the 0-set and 1-set as we go. We then run the same algorithm on each subset, recursively.

That’s the basic idea. Simple!

Something to be worked out still—there is almost definitely a way to leverage this structure to do the merge operation using more “block copies”, rather than having to literally examine every bit of every element on each level of partitioning. Especially in the common case where the PCBTs are “well-aligned” and corresponding nodes examine the same positions in the bit vectors.

Also not worked out yet—we need a file format for this structure that can be built up incrementally but has good locality. This is not trivial.

Merge schedules

Now that we have our merge operation, we just need a schedule for deciding when to do merges. A simple schedule is just to merge fragments whenever the size of the nth fragment exceeds half the size of the (n-1)th fragment. As in a binary counter, this occasionally leads to long chains of ‘carries’. There are various ways of deamortizing this so that only a constant number of merges are done in the worst case for each insert. Ed Kmett has a few talks on cache-oblivious data structures (also here) that discuss how to do this sort of thing nicely. But I’ll leave this work for later.

Appendix: extensions

comments powered by Disqus