Buildup of datomic.index.TransposedData instances in heap causing crashes, unsure how to clear

We’ve been seeing frequent application crashes caused by our app running out of memory. Looking at our metrics in AWS, we found a strange pattern: memory spikes periodically, then plateaus. Once the plateau gets high enough, a single spike is enough to freeze the Docker container and we have to restart it manually. We’re seeing a similar pattern in our dev environment, just on a much smaller scale.

After analyzing a heap dump, we found that a large portion of the heap was attributed to the Datomic class “datomic.index.TransposedData”. I believe this is the object cache on the peer, judging by the data types involved and a look through some of the data.


(heap dump showing datomic TransposedData)

We’re trying to understand why these objects are building up in the cache and what we can do to clear them out. I tried passing the datomic.objectCacheMax system property through our JVM_OPTS to limit the cache on the peer, but it didn’t seem to have much of an impact when testing in our dev environment. Another possibility we’re considering is that we’re holding onto the head of a lazy sequence (or some other long-lived reference) that keeps cache entries alive.
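
For reference, this is roughly how we’re passing it in (the value is illustrative, and datomic.objectCacheMax is the property name I found in the peer docs, so correct me if that’s not the right knob):

JVM_OPTS: -Xms32m -Xmx12g -Ddatomic.objectCacheMax=512m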

We are running datomic-pro 0.9.5561.62 with PostgreSQL as the storage backend, but we’re in the process of upgrading to peer 1.0.7075 since it looks like there’s an administer-system API we can try to use. We’ve also seen some improvement in our dev environment from upgrading, so we’re going to push the upgrade to production and see what effect it has.
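
If I’m reading the upgrade docs correctly, the call we’d run once the transactor and all peers are on the new version looks roughly like this (a sketch only; the URI is a placeholder, not our real connection string):

;; Run once, after the transactor and every peer are on the new version.
;; :uri below is a placeholder for our PostgreSQL storage URI.
(require '[datomic.api :as d])

(d/administer-system {:uri    "datomic:sql://our-db?jdbc:postgresql://db-host:5432/datomic?user=datomic&password=..."
                      :action :upgrade-schema})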

As an aside, does Datomic have any mechanism to roll back the internal PostgreSQL schema if something goes wrong during the upgrade and we have to revert, or will we have to rely on our own backups to restore the database?

Any advice would be greatly appreciated.

Here’s some more information about our system:

JVM_OPTS: -Xms32m -Xmx12g

Transactor properties:
memory-index-threshold=32m
memory-index-max=512m
object-cache-max=1g
heartbeat-interval-msec=10000

Updating the peer to the latest version seems to have stabilized things, but we’re still seeing a similar issue. The high memory usage is now showing up on the transactor, though, so it’s likely we just need to start tuning the transactor.
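
The direction we’re leaning is making sure the transactor heap leaves room for memory-index-max plus object-cache-max on top of normal overhead, along the lines of the stock production sizing (the JVM values below are illustrative, not what we currently run):

Transactor JVM: -Xms4g -Xmx4g
memory-index-threshold=32m
memory-index-max=512m
object-cache-max=1g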

@DanDonley have you seen a successful indexing job, and when was the last one? You can review the logs for the metric :CreateEntireIndexMsec as a proxy for index completion, as this metric is only reported when an indexing job completes.

Also, you mentioned you upgraded your peer. Did you also upgrade the transactor?

https://docs.datomic.com/pro/operation/deployment.html#upgrading-live-system

Thanks Jaret,

It turns out we didn’t have any metrics logging on the transactor, so I followed these instructions to enable it and ship the metrics to CloudWatch:
https://docs.datomic.com/pro/overview/aws.html#other-storages

However, I’m not seeing a “CreateEntireIndexMsec” metric anywhere, and I don’t think I missed a step. Are there separate instructions for enabling the other metrics?
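
In the meantime, one thing we may try is Datomic’s custom metrics callback, just to see exactly what the transactor reports before CloudWatch gets involved. A rough sketch, assuming the metrics-callback hook described in the monitoring docs (the namespace and logging are ours, and the code has to be on the transactor’s classpath):

;; transactor properties:
;;   metrics-callback=ourapp.metrics/handler
(ns ourapp.metrics)

(defn handler
  "Called by the transactor each reporting interval with a map of metrics."
  [metrics]
  (println "datomic metrics:" metrics)
  (when-let [idx (:CreateEntireIndexMsec metrics)]
    (println "indexing job completed, CreateEntireIndexMsec =" idx)))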

Thanks,
Dan Donley

Edit: we are seeing 21 other metrics, and they’re definitely coming from the transactor, but we’re not sure why roughly a dozen others are missing; we couldn’t find anything in the docs explaining it. And yes, we made sure the transactor and the peer were deployed with the same version of Datomic.

@DanDonley you should log a support case so we can help you resolve this issue. If you are not indexing, we need to address that. You can e-mail support@datomic.com or log a ticket from the portal: Datomic - Support

I see, thanks, we’ll do that.