Does gcStorage collect all of the garbage?

We are trying to understand the expected behavior of the storage footprint on disk over time, and how Datomic's gcStorage interacts with storage-level cleanup.

In a specific deployment, we are running on-prem Datomic 0.9.5544 with Cassandra 3.10 as the storage. Cassandra is scaled out over three nodes with a replication factor of 3, using SizeTieredCompactionStrategy with its default configuration.

Over several months, the database footprint of Cassandra SSTable files grew to over 900 GB. There were many entity updates and deletes over that period (averaging perhaps 1 million transactions per day). We were not running gcStorage during that period. We did a backup/restore, and the resulting storage footprint was approximately 40 GB.

We are now monitoring disk usage closely and running gcStorage daily on garbage older than 1 day. The Cassandra gc_grace_seconds is set to 4 days. The storage footprint is growing by several gigabytes per day, while the backup/restore size remains essentially unchanged. We are trying to understand whether this growth is simply compaction lag or additional uncollectable garbage generated by Datomic.
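For reference, the daily job amounts to something like the following sketch using the peer API's Connection.gcStorage (the connection URI below is a placeholder, not our real one):

```java
import java.util.Date;

import datomic.Connection;
import datomic.Peer;

public class DailyGcStorage {
    public static void main(String[] args) {
        // Placeholder URI: substitute the real host, keyspace.table, and db name.
        String uri = "datomic:cass://cassandra-host:9042/datomic.datomic/mydb";
        Connection conn = Peer.connect(uri);

        // Ask Datomic to collect storage garbage older than 1 day.
        Date oneDayAgo = new Date(System.currentTimeMillis() - 24L * 60 * 60 * 1000);
        conn.gcStorage(oneDayAgo);
    }
}
```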

(1) Should we expect running gcStorage plus storage-level compaction/cleanup to be as effective at reclaiming disk space as a backup/restore? Or is there additional garbage generated during normal Datomic operation that can only be cleaned up with a backup/restore?

(2) We are using Cassandra's SizeTieredCompactionStrategy, which is described as appropriate when rows are write-once. I expect that Datomic segments are write-once from Cassandra's perspective and that gcStorage will produce tombstones for those rows. Is that a correct expectation?

I would appreciate any guidance to help with our tuning.

  • Gordon

Hi Gordon,

You are correct that Datomic segments are immutable and write-once.

I would expect that your system as described will reach an “equilibrium” at some point. The specifics of that balance will depend on:

  • The total size of the “live” Datomic DB
  • The frequency and “older than” settings of gcStorage
  • The Cassandra tombstone grace period

The combination of these factors will eventually determine the steady-state overhead of Datomic's garbage (which is created during indexing jobs) and of Cassandra's reclamation of the space freed when you run gcStorage.
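As a rough back-of-the-envelope sketch (every number below is an illustrative assumption, not a measurement from your system): a dead segment lingers from the time indexing creates it, through gcStorage's "older than" window, through the tombstone grace period, and until compaction actually rewrites the SSTables, so the steady-state overhead is roughly daily garbage times that total delay:

```java
public class SteadyStateEstimate {
    public static void main(String[] args) {
        // All inputs are illustrative assumptions, not measurements.
        double garbagePerDayGb   = 3.0; // garbage produced by indexing jobs per day
        double gcWindowDays      = 1.0; // gcStorage "older than" setting
        double graceDays         = 4.0; // Cassandra gc_grace_seconds (4 days)
        double compactionLagDays = 2.0; // guess at average compaction delay

        // Garbage survives from creation until gcStorage tombstones it,
        // then through the grace period, then until compaction rewrites it.
        double steadyStateGb =
            garbagePerDayGb * (gcWindowDays + graceDays + compactionLagDays);
        System.out.printf("Estimated steady-state garbage: ~%.0f GB%n", steadyStateGb);
    }
}
```

With those made-up inputs the estimate comes out around 21 GB of transient garbage at equilibrium; the point is the proportionality, not the specific numbers.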

If your write load (and thus the size and frequency of indexing jobs) doesn't change, I would expect the system to reach an equilibrium point where storage grows only in proportion to the new data you are adding.

-Marshall