Architecture: Transactor / Persistent, durable tree


I’ve been working since 2007 on a persistent, durable data store for Java (it started out as an XML database system and now also handles JSON), so I’m interested in how the Transactor works internally.

I sadly only work on the project in my spare time, but I love the idea of a huge persistent, durable tree of tries. Besides versioning, which enables time-travel queries, audits, corrections of errors, and so on, it is also ideal because you don’t need a WAL and you can swap the root page atomically. SirixDB [1] offers a single read-write transaction per resource in a database plus N concurrent (and, if you wish, parallel) completely isolated read-only transactions, which involve no locks at all. The resource itself is a huge persistent, durable tree of tries, and all data written in the single read-write transaction is buffered in memory. You can even revert to a previous state, while all states in between of course remain available, making it a fully persistent data structure.

The structural sharing might not be a big novelty anymore, but the data pages themselves are also versioned, enabling fine-granular updates, which are written into a log structure (so a storage medium with fast, fine-granular random reads in parallel, such as Intel Optane memory, is ideal). Essentially, you can also write a commit message and add an author in SirixDB.
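To make the structural-sharing idea concrete, here is a minimal Java sketch (all names hypothetical, nothing SirixDB-specific): an update copies only the path from the changed leaf to the root, the new root is swapped in atomically, and read-only snapshots keep seeing the old version without any locks.

```java
import java.util.concurrent.atomic.AtomicReference;

// Minimal sketch of structural sharing plus an atomic root swap.
// Real SirixDB pages form tries; a binary node keeps the idea small.
public final class PersistentTreeSketch {
    record Node(String value, Node left, Node right) {}   // immutable node

    private final AtomicReference<Node> root = new AtomicReference<>();

    // "Commit" a new version by swapping the root reference atomically.
    void commit(Node newRoot) { root.set(newRoot); }

    // Readers pin whatever root was current; later commits never affect it.
    Node snapshot() { return root.get(); }

    public static void main(String[] args) {
        PersistentTreeSketch tree = new PersistentTreeSketch();
        Node v1 = new Node("a", new Node("b", null, null), new Node("c", null, null));
        tree.commit(v1);
        Node snap = tree.snapshot();          // isolated read-only view of v1

        // Path copy: new left child and new root; the right subtree is shared.
        Node v2 = new Node("a", new Node("B", null, null), v1.right());
        tree.commit(v2);

        System.out.println(snap.left().value());                   // old version intact
        System.out.println(tree.snapshot().left().value());        // new version visible
        System.out.println(tree.snapshot().right() == v1.right()); // subtree shared
    }
}
```

Since old roots are never mutated, reverting to a previous revision is just committing an earlier root again.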

However, I wonder how best to reduce the write amplification for small writes, where only a few JSON or XML nodes are inserted/modified/deleted (a deletion of course creates a tombstone).

As such, maybe every 100 ms, or after a specific number of nodes or bytes has accumulated in memory, the huge tree is updated with the new data plus the copied ancestors and a new root node (in principle the single read-write transaction can already do this, for instance by auto-committing based on the aforementioned metrics). If I understood correctly, Datomic does something like that, and MongoDB’s default also seems to be 100 ms, during which a transaction can still fail before it is synced to disk and, hopefully, more data accumulates. However, do you always wait until the data is synced to disk before acknowledging a commit? And regarding scaling, do you sync to some nodes synchronously, acknowledge the commit, and replicate to the rest asynchronously?
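The auto-commit triggers described above (time interval, node count, byte budget, whichever fires first) could be sketched like this; the class and thresholds are illustrative, not SirixDB’s or Datomic’s actual API:

```java
// Sketch of an auto-commit policy: flush buffered novelty when a time
// interval elapses, a node count is reached, or a byte budget is
// exceeded, whichever comes first. All names are hypothetical.
public final class AutoCommitPolicy {
    private final long intervalMillis, maxNodes, maxBytes;
    private long lastCommitMillis, bufferedNodes, bufferedBytes;

    AutoCommitPolicy(long intervalMillis, long maxNodes, long maxBytes, long nowMillis) {
        this.intervalMillis = intervalMillis;
        this.maxNodes = maxNodes;
        this.maxBytes = maxBytes;
        this.lastCommitMillis = nowMillis;
    }

    // Record one buffered node modification of `bytes` serialized size.
    void record(long bytes) { bufferedNodes++; bufferedBytes += bytes; }

    // Should the read-write transaction auto-commit now?
    boolean shouldCommit(long nowMillis) {
        return bufferedNodes > 0
            && (nowMillis - lastCommitMillis >= intervalMillis
                || bufferedNodes >= maxNodes
                || bufferedBytes >= maxBytes);
    }

    // Reset counters after the tree's root page has been swapped.
    void committed(long nowMillis) {
        lastCommitMillis = nowMillis; bufferedNodes = 0; bufferedBytes = 0;
    }

    public static void main(String[] args) {
        AutoCommitPolicy p = new AutoCommitPolicy(100, 1_000, 1 << 20, 0);
        p.record(64);
        System.out.println(p.shouldCommit(50));   // nothing exceeded yet
        System.out.println(p.shouldCommit(150));  // 100 ms interval elapsed
        p.committed(150);
        System.out.println(p.shouldCommit(200));  // empty buffer never commits
    }
}
```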

Currently my main issue is that the transaction of course also allows one to simply issue a commit, perhaps programmatically when SirixDB is used as an embedded data store, even though the changed data might not justify an immediate sync to disk because of the leaf-to-root copying overhead.

Kind regards

[1] https://sirix.io

The Datomic transactor does not acknowledge a write until it has been validated and durably persisted to the Log. Since storage implementations vary, the specifics of “durable” depend on the underlying storage, but generally these writes are expected to be fully durable and replicated (e.g. when DynamoDB acknowledges a write, it is “synced” to disks in multiple AWS AZs).

Index rebuilds (which, once complete, swap out the root tree node) are done periodically, e.g. based on the amount of novelty accumulated in the log. In terms of “write amplification”, I guess it depends on what you mean. Datomic indexes are covering, and every datom is written to 3-5 indexes (always the Log, EAVT, and AEVT; sometimes AVET and/or VAET). Plus, every transaction itself generates at least one datom for the transaction…so strictly speaking there is a lot of “duplication” in the storage layer, but the IO cost of most of those writes can be deferred until the indexing job occurs. Only the write to the Log “blocks” the transaction. Similarly, given the per-transaction overhead, it is more space-efficient to write two datoms in one transaction than in two.
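The covering-index fan-out described above can be illustrated with a toy sketch; the datom representation and key encoding are made up, and only the set of index names follows the paragraph (in Datomic, AVET and VAET apply to indexed and reference attributes respectively):

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of covering-index fan-out: one datom [e a v t] is stored
// under several sort orders. Key encoding here is purely illustrative.
public final class IndexFanOutSketch {
    record Datom(String e, String a, String v, long t) {}

    // Which index entries a single datom contributes to. AVET/VAET are
    // conditional in Datomic; they are modeled as boolean flags here.
    static List<String> fanOut(Datom d, boolean indexed, boolean isRef) {
        List<String> keys = new ArrayList<>();
        keys.add("log:"  + d.t());                               // always the Log
        keys.add("eavt:" + d.e() + d.a() + d.v() + d.t());
        keys.add("aevt:" + d.a() + d.e() + d.v() + d.t());
        if (indexed) keys.add("avet:" + d.a() + d.v() + d.e() + d.t());
        if (isRef)   keys.add("vaet:" + d.v() + d.a() + d.e() + d.t());
        return keys;
    }

    public static void main(String[] args) {
        Datom d = new Datom("e1", "likes", "pizza", 42);
        System.out.println(fanOut(d, false, false).size()); // minimum fan-out
        System.out.println(fanOut(d, true, true).size());   // maximum fan-out
    }
}
```

Only the first of those writes (the Log) is on the commit’s critical path; the rest can be batched into the periodic index rebuild.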

On the read side, each Peer is responsible for combining the latest tree version it has with any novelty accumulated since then (which it keeps in memory). One of the first things a Peer does on startup is initialize this in-memory structure from the durable Log. From that point, Peers stay up to date by subscribing to the transactor (e.g. via ActiveMQ), which publishes new log entries as they are acknowledged. It isn’t necessary to rebuild indexes anywhere near every 100 ms…the rate at which these rebuilds occur is primarily governed by the amount of novelty that is in the log but not yet in the other indexes.
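A Peer’s read path, as described above, can be sketched as a two-level lookup; the names are hypothetical, and the real structures are sorted index segments rather than hash maps:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the peer-side read path: a lookup consults an in-memory
// novelty map (log entries accepted since the last index rebuild) and
// falls back to the last durable index snapshot. Names are hypothetical.
public final class PeerViewSketch {
    private final Map<String, String> durableIndex;                    // last index rebuild
    private final Map<String, String> novelty = new LinkedHashMap<>(); // newer log entries

    PeerViewSketch(Map<String, String> durableIndex) { this.durableIndex = durableIndex; }

    // Applied as the transactor publishes acknowledged log entries.
    void applyLogEntry(String key, String value) { novelty.put(key, value); }

    // Novelty shadows the durable index; a miss falls through to it.
    String lookup(String key) {
        String v = novelty.get(key);
        return v != null ? v : durableIndex.get(key);
    }

    public static void main(String[] args) {
        PeerViewSketch peer = new PeerViewSketch(Map.of("e1", "old", "e2", "kept"));
        peer.applyLogEntry("e1", "new");       // subscription delivers an update
        System.out.println(peer.lookup("e1")); // novelty wins
        System.out.println(peer.lookup("e2")); // served from the durable index
    }
}
```

After an index rebuild, the novelty it covered is dropped from memory and the durable snapshot reference is swapped, so the merged view stays small.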

The docs cover this process here:

So, you’re basically using two data structures, right? A WAL (the write to the Log) and the indexes, which are rebuilt more or less as in other database systems, except that the index (of indexes?) is of course a huge persistent, durable tree. So, I guess it’s a no-force/no-steal policy.

Currently, in SirixDB it’s possible to add an author and a commit timestamp, but I might get rid of this and also implement a simple resource-wide log/queue, which then performs a transactional group commit after a specific number of bytes has accumulated in memory :slight_smile: However, I don’t want to maintain a full log with redo/undo operations since a checkpoint, so the commit can only return once the group commit has swapped the root page of the huge index.
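The group commit in which callers block until the root page has been swapped might look roughly like this sketch (hypothetical names, byte threshold only; a real implementation would also flush on a timer):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Sketch of a group commit: writers enqueue changes and block on a
// future that is only completed once the batch's root-page swap has
// happened. All names and the threshold policy are hypothetical.
public final class GroupCommitSketch {
    private final List<CompletableFuture<Long>> waiters = new ArrayList<>();
    private final long flushThresholdBytes;
    private long bufferedBytes;
    private long revision;                 // which root page is current

    GroupCommitSketch(long flushThresholdBytes) { this.flushThresholdBytes = flushThresholdBytes; }

    // Enqueue a change; the returned future completes with the committed revision.
    synchronized CompletableFuture<Long> submit(long bytes) {
        CompletableFuture<Long> f = new CompletableFuture<>();
        waiters.add(f);
        bufferedBytes += bytes;
        if (bufferedBytes >= flushThresholdBytes) flush();
        return f;
    }

    // "Swap the root page" for the whole batch, then release every waiter.
    private void flush() {
        revision++;                        // stands in for the atomic root-page swap
        long committed = revision;
        waiters.forEach(f -> f.complete(committed));
        waiters.clear();
        bufferedBytes = 0;
    }

    public static void main(String[] args) throws Exception {
        GroupCommitSketch gc = new GroupCommitSketch(100);
        CompletableFuture<Long> a = gc.submit(40);   // below threshold: still pending
        System.out.println(a.isDone());
        CompletableFuture<Long> b = gc.submit(70);   // 110 >= 100: batch flushes
        System.out.println(a.get() + " " + b.get()); // both see the same revision
    }
}
```

Because commits only return after the swap, no redo/undo log is needed: an un-flushed batch simply never becomes visible.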

But in essence I think my ideas amount to almost the same thing. In the HTTP server for the REST API I can accumulate JSON or XML strings plus their insertion points in a queue and then have the read-write transaction commit based on a time interval (provided anything has changed at all) and a predefined number of bytes, whichever comes first.

Kind regards