Does gcStorage collect all of the garbage?

We are trying to understand the expected behavior of the storage footprint on disk over time, and how Datomic's gcStorage interacts with storage-level cleanup.

In a specific deployment, we are running on-prem Datomic 0.9.5544 with Cassandra 3.10 as the storage. Cassandra is scaled out over three nodes with a replication factor of 3, using SizeTieredCompactionStrategy with its default configuration.

Over several months, the database footprint of Cassandra SSTable files grew to over 900 GB. There were many entity updates and deletes over that period (averaging perhaps 1 million transactions per day). We were not running gcStorage during that period. We did a backup/restore, and the resulting storage footprint was approximately 40 GB.

We are now monitoring disk usage closely and running gcStorage daily on garbage older than 1 day. The Cassandra gc_grace_seconds is set to 4 days. The storage footprint is growing by several gigabytes per day, while the backup/restore size remains essentially unchanged. We are trying to understand whether this growth is simply compaction lag or additional uncollectable garbage generated by Datomic.
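For reference, the daily job amounts to something like the following sketch using the peer API's Connection.gcStorage (the connection URI below is a placeholder, not our real one):

```java
import java.util.Date;

import datomic.Connection;
import datomic.Peer;

public class DailyGcStorage {
    public static void main(String[] args) {
        // Placeholder URI: substitute the real host, keyspace.table, and db name.
        String uri = "datomic:cass://cassandra-host:9042/datomic.datomic/mydb";
        Connection conn = Peer.connect(uri);

        // Ask Datomic to collect storage garbage older than 1 day.
        Date oneDayAgo = new Date(System.currentTimeMillis() - 24L * 60 * 60 * 1000);
        conn.gcStorage(oneDayAgo);
    }
}
```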

(1) Should we expect running gcStorage plus storage-level compaction/cleanup to be as effective at reclaiming disk space as a backup/restore? Or is there additional garbage generated during normal Datomic operation that can only be cleaned up with a backup/restore?

(2) We are using Cassandra's SizeTieredCompactionStrategy, which is described as appropriate when rows are write-once. I expect that Datomic segments are write-once from Cassandra's perspective and that gcStorage will produce tombstones for those rows. Is that a correct expectation?

I would appreciate any guidance to help with our tuning.

  • Gordon

Hi Gordon,

You are correct that Datomic segments are immutable and write-once.

I would expect that your system as described will reach an “equilibrium” at some point. The specifics of that balance will depend on:

  • The total size of the “live” Datomic DB
  • The frequency and “older than” settings of gcStorage
  • The Cassandra tombstone grace period

The combination of these factors will eventually determine the steady-state overhead of Datomic's garbage (which is created during indexing jobs) and of Cassandra's reclamation of the space freed when you run gcStorage.
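As a rough back-of-the-envelope sketch (every number below is an illustrative assumption, not a measurement from your system): a dead segment lingers from the time indexing creates it, through gcStorage's "older than" window, through the tombstone grace period, and until compaction actually rewrites the SSTables, so the steady-state overhead is roughly daily garbage times that total delay:

```java
public class SteadyStateEstimate {
    public static void main(String[] args) {
        // All inputs are illustrative assumptions, not measurements.
        double garbagePerDayGb   = 3.0; // garbage produced by indexing jobs per day
        double gcWindowDays      = 1.0; // gcStorage "older than" setting
        double graceDays         = 4.0; // Cassandra gc_grace_seconds (4 days)
        double compactionLagDays = 2.0; // guess at average compaction delay

        // Garbage survives from creation until gcStorage tombstones it,
        // then through the grace period, then until compaction rewrites it.
        double steadyStateGb =
            garbagePerDayGb * (gcWindowDays + graceDays + compactionLagDays);
        System.out.printf("Estimated steady-state garbage: ~%.0f GB%n", steadyStateGb);
    }
}
```

With those made-up inputs the estimate comes out around 21 GB of transient garbage at equilibrium; the point is the proportionality, not the specific numbers.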

If your write load (and thus the size and frequency of indexing jobs) doesn't change, I would expect the system to reach an equilibrium point where storage grows only in proportion to the new data you are adding.

-Marshall