I have three questions:
1st question: How can I verify the integrity of a Datomic backup? For example, is there a way to calculate a hash from a DB value? That way I could compute the hash against my production DB, compute the same hash against a restored backup of my production DB, and compare the two.
2nd question: How is a Datomic backup structured? What are values? What are roots?
3rd question: I have a Datomic DB running on AWS and have two backups of the same database. One is a recurring backup to an S3 bucket, running every hour; the other is a local backup that I run pseudo randomly whenever I feel like it. When I run the Linux command diff against these two backups, the S3 backup has more files in both the roots subdirectory and the values subdirectory, although no files actually differ according to the output. Why is this?
Possible answer to #3. I think Datomic stores data regarding the point-in-time that the backup occurs, in order to support point-in-time restore: https://docs.datomic.com/on-prem/backup.html#sec-5. So based on that assumption it would make sense that a backup run every hour would have more data than a backup run every week or month. It’s a little tricky to answer questions for ourselves with closed source code haha. We have to rely on Datomic support.
You’re correct regarding #3. The point-in-time backups will account for more data in the backup location.
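To see exactly which files the hourly backup holds that the ad hoc one lacks, a small script can report the same information as diff in a more structured form. This is only a sketch under assumptions: both backups are available as local directories (the S3 copy would have to be downloaded first), and the function names are illustrative, not part of Datomic.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def list_files(root: Path) -> set[str]:
    """Relative POSIX paths of all regular files under root."""
    return {p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()}


def compare_backups(a: Path, b: Path) -> dict[str, list[str]]:
    """Report files present on one side only, and shared files whose bytes differ."""
    files_a, files_b = list_files(a), list_files(b)
    common = files_a & files_b
    return {
        "only_in_a": sorted(files_a - files_b),
        "only_in_b": sorted(files_b - files_a),
        "differing": sorted(
            rel for rel in common if file_digest(a / rel) != file_digest(b / rel)
        ),
    }
```

If the explanation above holds, the expected result is extra entries under roots and values on the hourly side and an empty "differing" list. Note that this only compares the backup files with each other; it says nothing about whether either backup restores correctly.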
There is not currently a specific built-in method for verifying a backup’s integrity. I would recommend restoring a specific point-in-time backup to a secondary storage (dev would be fine) and ensuring that it restores fully.
I would also recommend periodically (perhaps weekly) running a full backup to a secondary empty site (e.g. a separate S3 bucket that isn’t used for incremental backups), as incremental backups to a single location will never “re-copy” segments that are already present in the existing backup location.
Regarding the structure - the roots and values are the internal representations of the Datomic indexes (segments).
This part of Rich’s talk on Datomic describes Datomic’s use of storage in more detail.
I will take a look at the video. Do you have any theories on ways to implement an integrity check for a Datomic backup? Do you think that simply restoring a backup without errors is good enough? Could queries on a successfully restored backup ever uncover data corruption that datomic restore-db would not uncover? Thanks again.
There are various ways of “ensuring” a backup, depending on the degree of assurance you require.
Restoring into a storage and connecting to it to be sure it can be read is a pretty good assurance, but, as you surmise, there are (unlikely) scenarios in which corruption could potentially exist at the datom level.
As with any approach of this sort, ultimately you could implement something O(n) that examined the entire database (i.e. walk the log or the indexes). Whether that is warranted is largely up to you.
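One way to picture such an O(n) check is a running hash over every datom, computed once against production and once against the restored copy. The sketch below is not a Datomic API. It assumes you have already exported the datoms from each database, in the same index order and as of the same point in time, as (e, a, v, tx, added) tuples; that tuple shape and the function name are hypothetical.

```python
import hashlib
from typing import Iterable, Tuple

# Hypothetical export shape for one datom: (entity, attribute, value, tx, added).
Datom = Tuple[int, int, object, int, bool]


def datoms_digest(datoms: Iterable[Datom]) -> str:
    """Fold an ordered stream of datoms into a single SHA-256 hex digest.

    The digest is order-sensitive, so both databases must be walked
    in the same order for the results to be comparable.
    """
    h = hashlib.sha256()
    for e, a, v, tx, added in datoms:
        # Include the value's type name so that 1 and "1" hash differently.
        record = f"{e}|{a}|{type(v).__name__}:{v!r}|{tx}|{int(added)}\n"
        h.update(record.encode("utf-8"))
    return h.hexdigest()
```

Matching digests would suggest the two walks saw identical datoms; a mismatch only says that something differs, not where. Because the walk touches every datom, the cost grows with the size of the database, which is the trade-off described above.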