I’ve been working since 2007 on a persistent, durable data store for Java (it started as an XML database system and now also handles JSON), so I’m interested in how the Transactor works internally.
Sadly I only work on the project in my spare time, but I love the idea of a huge persistent, durable tree of tries. Besides versioning, which allows time-travel queries, audits, corrections of errors and so on, it’s also ideal because you don’t need a WAL and you can swap the root page atomically. SirixDB offers a single read-write transaction per resource in a database plus N concurrent (if you wish, parallel) and completely isolated read-only transactions, which involve no locks at all. The resource itself is a huge persistent, durable tree of tries. All data written in the single read-write transaction is buffered in-memory. You can even revert to a previous state, and all states in-between of course remain available, making it a fully persistent data structure. The structural sharing might not be a big novelty anymore, but the data pages themselves are also versioned, allowing fine-granular updates, which are written into a log structure (so a storage medium with fast, fine-granular random reads in parallel is best, for instance Intel Optane Memory). You can also attach a commit message and an author to each commit in SirixDB.
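To make the atomic-root-swap idea concrete, here is a minimal illustrative sketch (not SirixDB’s actual API, and all names are mine): a copy-on-write tree where an update copies only the path from the changed leaf up to a new root (“path copying”), shares every untouched subtree, and publishes the new version with a single atomic swap. Readers holding the old root keep a fully consistent snapshot without any locks:

```java
import java.util.concurrent.atomic.AtomicReference;

// Illustrative only: a copy-on-write binary tree with path copying
// and an atomic root swap, as a stand-in for the versioned tree of tries.
final class CowTree {
    record Node(String value, Node left, Node right) {}

    private final AtomicReference<Node> root;

    CowTree(Node initialRoot) {
        this.root = new AtomicReference<>(initialRoot);
    }

    // Lock-free reader entry point: a snapshot is just the old root.
    Node snapshot() {
        return root.get();
    }

    // Replace the left-most leaf's value; only its ancestors are copied.
    void updateLeftmostLeaf(String newValue) {
        root.set(copyPath(root.get(), newValue)); // atomic root swap
    }

    private static Node copyPath(Node n, String v) {
        if (n.left() == null) {
            return new Node(v, null, n.right()); // the leaf itself
        }
        // Copy this ancestor, recurse left, share the right subtree as-is.
        return new Node(n.value(), copyPath(n.left(), v), n.right());
    }
}
```

Old snapshots stay valid after an update, which is exactly what enables time travel and lock-free readers.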
However, I wonder how best to reduce the write amplification caused by small writes, meaning cases where only a few JSON or XML nodes are inserted/modified/deleted (a deletion of course just creates a tombstone).
One idea: every 100ms, or after a specific number of nodes or bytes has accumulated in-memory, the huge tree is updated with the new data plus the copied ancestors and a new root node (in principle the single read-write transaction can already do this, for instance by auto-committing based on the aforementioned metrics). If I understood correctly, Datomic does something like this, and MongoDB’s default also seems to be a 100ms window in which a transaction can still fail before being synced to disk, while hopefully more data accumulates. However, do you always wait until the data is synced to disk before acknowledging a commit? And regarding scaling, do you sync to some nodes synchronously, then acknowledge the commit, and replicate to the rest asynchronously?
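The batching scheme I have in mind could be sketched like this (a hypothetical group-commit buffer, not existing SirixDB or Datomic code; the thresholds and names are assumptions): buffered updates are flushed when any one threshold is hit, either a time window, a node count, or an accumulated byte size, so that many small writes share one leaf-to-root copy and one sync to disk:

```java
// Hypothetical group-commit buffer: flush on time window, node count,
// or byte threshold, whichever is reached first.
final class GroupCommitBuffer {
    private final long maxDelayMillis;
    private final int maxNodes;
    private final long maxBytes;
    private final Runnable flushAction; // e.g. write pages + fsync + swap root

    private int bufferedNodes;
    private long bufferedBytes;
    private long firstBufferedAt = -1;

    GroupCommitBuffer(long maxDelayMillis, int maxNodes, long maxBytes,
                      Runnable flushAction) {
        this.maxDelayMillis = maxDelayMillis;
        this.maxNodes = maxNodes;
        this.maxBytes = maxBytes;
        this.flushAction = flushAction;
    }

    // Record a small write; the clock is passed in to keep this testable.
    synchronized void add(int nodes, long bytes, long nowMillis) {
        if (firstBufferedAt < 0) {
            firstBufferedAt = nowMillis;
        }
        bufferedNodes += nodes;
        bufferedBytes += bytes;
        maybeFlush(nowMillis);
    }

    // Called by add() and periodically by a timer thread.
    synchronized void maybeFlush(long nowMillis) {
        if (bufferedNodes == 0) {
            return;
        }
        boolean timeUp = nowMillis - firstBufferedAt >= maxDelayMillis;
        if (timeUp || bufferedNodes >= maxNodes || bufferedBytes >= maxBytes) {
            flushAction.run();
            bufferedNodes = 0;
            bufferedBytes = 0;
            firstBufferedAt = -1;
        }
    }
}
```

The key trade-off is that a commit acknowledged before the flush runs is not yet durable, which is exactly the question above about when to acknowledge.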
Currently my main issue is that the transaction of course also allows simply issuing a commit programmatically, for instance when SirixDB is used as an embedded data store, even when the amount of changed data might not justify an immediate sync to disk, given the leaf-to-root copying overhead.