I’m taking a break from work on the editor to do a design writeup. This post presents a design for the Unison persistent data API. With this API, users will be able to store arbitrary Unison values, including functions, without having to deal with any parsing or serialization logic. Unlike SQL, this API doesn’t awkwardly force the user to encode everything as tuples or named tuples—sum types (or any other type) are totally fine to persist and query for.
Though the API doesn’t force awkward encodings onto the user, it’s not dynamically typed or schemaless either. A `Data` object has a particular type: it’s a `Data a`, which denotes a set of values of type `a`. Thus a “schema migration” isn’t some dynamically typed SQL script that’s run against a live database and might explode at runtime; it’s an ordinary Unison function, statically checked using the Unison typechecker and trivially testable on a mock dataset.
All right, let’s look at the API:
```
-- Create an empty `Data` value
empty : Remote! (Data a)

-- Add to a `Data` value
add : a -> Data a -> Remote! ()

-- Keep only values matching the pattern
keep : Pattern a -> Data a -> Remote! ()
```
The `Remote!` type constructor is a bit like an `IO` type. Its use is not really essential here; the key point is just that these functions have imperative effects. See the distributed evaluation API for more info on it.
A `Data a` denotes a set of values of type `a`. `empty` is the only function for creating `Data` values. It conjures up a globally unique identifier to serve as the first-class handle for referencing the data set. Unlike SQL, datasets are first-class; you can pass them to functions, store them in lists, and so on. Notice that the `add` function has zero constraints on it—any Unison value at all can be persisted in a `Data`.
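As a rough sketch of the semantics, here is a hypothetical in-memory model in Haskell, with `IO` standing in for `Remote!`, an `IORef` list standing in for the distributed store, a plain predicate standing in for `Pattern`, and an invented `toList` helper for inspection:

```haskell
import Data.IORef

-- An in-memory stand-in for a persistent data set
newtype Data a = Data (IORef [a])

-- Create an empty Data value
empty :: IO (Data a)
empty = Data <$> newIORef []

-- Add to a Data value
add :: a -> Data a -> IO ()
add x (Data ref) = modifyIORef ref (x :)

-- Keep only values matching the predicate (a Pattern, in the real API)
keep :: (a -> Bool) -> Data a -> IO ()
keep p (Data ref) = modifyIORef ref (filter p)

-- For inspection only; not part of the API described in this post
toList :: Data a -> IO [a]
toList (Data ref) = readIORef ref
```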
A `Pattern a` denotes an `a -> Boolean`, but the API for creating them is limited. I’ll make the API completely explicit here for clarity, but there will likely be much nicer syntax for creating and viewing these patterns (or any other applicative functor) in the Unison editor:
```
negate : Pattern a -> Pattern a
literal : a -> Pattern a
blank : Pattern a
ap : Pattern (a -> b) -> Pattern a -> Pattern b
```
The `negate` function inverts a pattern—it fails whenever the original pattern succeeds and succeeds whenever the original fails. The `literal a` pattern matches any value with the same hash as the normal form of `a`—literally the same value as `a`. The `blank` pattern is `True` for any input. And `ap` lets us build up larger patterns which may contain `blank` values. For instance:
```
bobs : Pattern (Text, a)
bobs = literal (,) `ap` literal "Bob" `ap` blank

lefts : Pattern (Either a b)
lefts = literal Left `ap` blank

-- nicer syntax/display:
-- bobs = ("Bob", _)
-- lefts = Left _
```
The `bobs` pattern will match tuples whose first element is `"Bob"`, and the `lefts` pattern will match `Either` values which are `Left`. Thus, there is no special handling of tuples, and no need to contort your data model to represent every concept using tuples. If your data model is best described by a `Data (Either a b)` or any other sum type, that’s what you should be able to persist and query on!
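To make the matching semantics concrete, here is a deliberately untyped Haskell sketch: values live in a small universal type, and the `ap`-style composition shows up as structural patterns (`PPair`, `PLeft`). All constructor names here are invented for illustration:

```haskell
-- A small universal value type standing in for arbitrary Unison values
data Value
  = VText String
  | VPair Value Value
  | VLeft Value
  | VRight Value
  deriving (Eq, Show)

-- Patterns: literal, blank, negate, plus structural composition
data Pattern
  = Lit Value             -- literal: matches exactly this value
  | Blank                 -- blank: matches anything
  | Neg Pattern           -- negate: inverts a pattern
  | PPair Pattern Pattern -- structural composition, standing in for `ap`
  | PLeft Pattern
  | PRight Pattern
  deriving Show

matches :: Pattern -> Value -> Bool
matches (Lit v)     x           = v == x
matches Blank       _           = True
matches (Neg p)     x           = not (matches p x)
matches (PPair p q) (VPair a b) = matches p a && matches q b
matches (PLeft p)   (VLeft a)   = matches p a
matches (PRight p)  (VRight a)  = matches p a
matches _           _           = False

-- bobs = ("Bob", _);  lefts = Left _
bobs, lefts :: Pattern
bobs  = PPair (Lit (VText "Bob")) Blank
lefts = PLeft Blank
```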
Let’s now look at querying:
```
query : Data a -> Results a
empty : Results a
add : a -> Results a -> Results a
keep : Pattern a -> Results a -> Results a
union : Results a -> Results a -> Results a
intersection : Results a -> Results a -> Results a
-- etc
```
A `Results a` is a pure description of how to fetch a result set of `a` values. These descriptions may then be combined in various ways to build up or refine a query. In the end, we often want to look at some values matching the query. Here’s one way to do that:
```
ascending : Permutation a -> Ordering a
descending : Permutation a -> Ordering a
then : Ordering a -> Ordering a -> Ordering a
sort : Ordering a -> Results a -> Stream a
stream : Results a -> Stream a
take : Number -> Stream a -> Stream a
map : (a -> b) -> Stream a -> Stream b
-- etc
run : Stream a -> Remote! (Vector a)
```
A `Permutation a` denotes a permutation of the paths into `a`, which generalizes the notion of a column. A simple `Permutation` might be `id : Permutation a` (the identity permutation, which rearranges nothing) or `swap _1 _2 : Permutation (a,b)` (which swaps the `_1` path and the `_2` path). A `Permutation` can then be converted to an `Ordering`, either ascending or descending, and orderings may be chained together using the `then` function. Together these functions let us express arbitrary ways of sorting (“sort by the third element of the tuple, ascending, then the first element descending”); the only constraint is that we must order based on information already present, at least when manipulating a `Results` object. (You have more flexibility after the `Results` is converted to a `Stream` or list.)
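This style of chained ordering is easy to see in Haskell, where `Ordering` is a `Monoid` and `<>` on comparators plays the role of `then`, breaking ties left to right (a sketch of the idea, not the Unison implementation):

```haskell
import Data.List (sortBy)
import Data.Ord (comparing, Down(..))

-- "sort by the second component ascending, then the first descending";
-- (<>) on comparators breaks ties left to right, like `then` above.
sortExample :: [(Int, Char)] -> [(Int, Char)]
sortExample = sortBy (comparing snd <> comparing (Down . fst))
```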
Once you’re done manipulating your `Results`, you use `sort` to convert to a `Stream`, which can be further manipulated and eventually `run`.
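A minimal way to see `Results` as a pure description is to interpret it against in-memory lists; this Haskell sketch (constructor names invented) covers `empty`, `add`, `keep`, `union`, and `intersection`:

```haskell
import Data.List (union, intersect)

-- A Results value is just a description of a query...
data Results a
  = Empty
  | Add a (Results a)
  | Keep (a -> Bool) (Results a)
  | Union (Results a) (Results a)
  | Intersection (Results a) (Results a)

-- ...and runResults is one possible interpreter for it.
runResults :: Eq a => Results a -> [a]
runResults Empty              = []
runResults (Add x r)          = x : runResults r
runResults (Keep p r)         = filter p (runResults r)
runResults (Union r s)        = runResults r `union` runResults s
runResults (Intersection r s) = runResults r `intersect` runResults s
```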
See the appendix for details on how permutations are built up and some possible variations on this design.
Something that might not be obvious is that this API can be used to implement indices or key-value stores. Given a `Data (k,v)`, we can find the value(s) associated with a given key using the pattern `(k,_)`, or we can query in the other direction with the pattern `(_,v)`. In a future post, I’ll describe a data structure that can be used to implement the actual store and which can answer all pattern queries efficiently, as well as supporting all the other `Results` operations like `intersection`, etc. The data structure comes in a few flavors, one of which is adaptive and tunes itself to efficiently run the most common queries while still being reasonably efficient for other queries—this covers the 80% case. But it can also be used in a completely explicit fashion, with the programmer specifying the “query model” which determines exactly how the structure is organized and what pattern queries are most efficiently handled.
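The two lookup directions correspond to simple filters; modelled over an in-memory association list in Haskell (helper names invented for illustration):

```haskell
-- The (k,_) query: all values stored under a given key
lookupByKey :: Eq k => k -> [(k, v)] -> [v]
lookupByKey k = map snd . filter ((== k) . fst)

-- The (_,v) query: all keys mapping to a given value
lookupByValue :: Eq v => v -> [(k, v)] -> [k]
lookupByValue v = map fst . filter ((== v) . snd)
```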
That’s the API. Storing and querying for data should be simple, and shouldn’t force you to work with weird representations of your data types or worry about parsing and serialization. As in the distributed evaluation API, you can of course customize how things are encoded just by using regular Unison code, before the data is persisted and/or after it comes out of the store.
Now why isn’t it this simple today? The reason: databases and languages are usually designed and implemented completely separately. There’s a runtime barrier between the language and the database, and all static type information gets tossed out when crossing this barrier, just like going to and from text files or streams of bytes on a network socket. The programmer thus deals with parsing and serialization explicitly at this barrier, and there’s also often an impedance mismatch since the database represents concepts in different ways than our programs (or can’t represent some concepts well at all, like sum types). All this generates a huge amount of knock-on complexity. Taking a step back and addressing the problem in a cohesive way with better assumptions eliminates all that complexity. This is a running theme around here!
For bulk inserts, we can use the following API:
```
stream : Permutation a -> Results a -> Stream a
(++) : Stream a -> Stream a -> Stream a
map : (a -> b) -> Stream a -> Stream b
-- etc
adds : Stream a -> Data a -> Remote! ()
```
The `stream` function converts a result set to an ordered stream of values. We assume that `Stream` has a list-like API with `filter`, etc. The values of a `Stream a` can then be inserted into a `Data a`. This is one simple way of doing arbitrary “migrations”, though of course it involves doing a full reinsertion of all the data from the original data set. See the discussion of permutations below for ideas on other options.
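A migration, then, is just an ordinary typechecked function mapped over the old data. In this Haskell sketch, lists stand in for `Stream` and `Data`, and the record types are invented for illustration:

```haskell
-- Old and new schemas as ordinary data types (invented for illustration)
data OldUser = OldUser { fullName :: String } deriving Show
data NewUser = NewUser { firstName :: String, lastName :: String } deriving (Eq, Show)

-- The migration itself: a pure, statically checked, testable function
migrate :: OldUser -> NewUser
migrate (OldUser n) = case words n of
  (f : rest) -> NewUser f (unwords rest)
  []         -> NewUser "" ""

-- The stream/map/adds pipeline, with lists standing in for Stream and Data
migrateAll :: [OldUser] -> [NewUser]
migrateAll = map migrate
```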
The `adds` function is also useful for synchronizing a data set with some external source—in this case the `Stream a` might be built from some periodic source of updates to a third-party dataset.
Here’s an API for building permutations:
```
id : Permutation a
compose : Permutation a -> Permutation a -> Permutation a
swap : Path a b -> Path a c -> Permutation a

id : Path a a
compose : Path a b -> Path b c -> Path a c
_1 : Path (a,b) a
_2 : Path (a,b) b
-- etc

swap _1 _2 : Permutation (a,b)
```
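One way to make paths and swaps concrete is a getter/setter pair, roughly a lens. In this simplified Haskell sketch both swapped paths point at the same type, and a `Permutation a` is represented as a plain function `a -> a`:

```haskell
-- A Path as a getter/setter pair (a simplified lens)
data Path a b = Path { get :: a -> b, set :: b -> a -> a }

-- _1 and _2, specialized to homogeneous pairs for this sketch
p1 :: Path (a, a) a
p1 = Path fst (\x (_, b) -> (x, b))

p2 :: Path (a, a) a
p2 = Path snd (\x (a, _) -> (a, x))

-- swap two paths: write each path's old contents into the other
swapPaths :: Path a b -> Path a b -> a -> a
swapPaths p q x = set p (get q x) (set q (get p x) x)
```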
A similar idea can be used for expressing joins:
```
join : Join a b -> Results a -> Results b -> Results (a,b)
empty : Join a b
and : Join a b -> Join a b -> Join a b
align : View a x -> View b x -> Join a b

_1 : View (a,b) a
_2 : View (a,b) b
-- etc
compose : View a b -> View b c -> View a c
```
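The intended reading of `align` is easy to see with views as plain functions. This Haskell sketch interprets a join as a predicate used to filter the cartesian product, with lists standing in for `Results` and `joinOn` invented as the interpreter:

```haskell
-- A View is just a projection
type View a b = a -> b

-- align: two rows join when their projections agree
align :: Eq x => View a x -> View b x -> (a -> b -> Bool)
align f g a b = f a == g b

-- and: both alignment conditions must hold
andJ :: (a -> b -> Bool) -> (a -> b -> Bool) -> (a -> b -> Bool)
andJ p q a b = p a b && q a b

-- interpret a join over in-memory lists
joinOn :: (a -> b -> Bool) -> [a] -> [b] -> [(a, b)]
joinOn p as bs = [ (a, b) | a <- as, b <- bs, p a b ]
```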
And we can use a similar data type to build up strongly-typed migrations that can be interpreted efficiently by the data store.