Slow initial connection

#1

Lately, we’ve been having to wait for about 5 to 6 minutes for the initial Datomic connection to be established.

Our setup is:
  • Datomic transactors on AWS ECS, inside a Docker image, running version 0.9.5561
  • Actual data stored in an on-demand DynamoDB table; data size is 210 GB

We’ve done tests connecting from different locations and they all took the same amount of time, so it’s not latency.

Another observation: five months ago we were not having these issues; a connection took about 30 seconds.

We are inclined to think it is related to data size, with the peer going through the data to build its index.

Any advice/explanation would be appreciated.

Regards

#2

Your DDB isn’t small, but I’ve gotten closer to 1TB without such a long connect time.
Do you run gcStorage regularly/ever? What about bin/datomic gc-deleted-dbs (if applicable)?

#3

Some other thoughts: have you tuned memory-index-threshold or memory-index-max? My understanding is that the peer needs to load the newest storage tree root node plus the log segments representing any novelty accumulated since that tree root was built. If you have index threshold set high, then the novelty in the log can be large.
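For reference, the knobs in question live in transactor.properties. A sketch with illustrative values (not recommendations):

```
# Novelty accumulates in the transactor's memory index until it
# reaches memory-index-threshold, which kicks off an indexing job.
memory-index-threshold=32m

# Ceiling on in-memory novelty; transactions are throttled as this
# limit is approached. A connecting peer must load the log novelty
# accumulated since the last completed index, so a large gap between
# the threshold and this ceiling can mean more to load at connect.
memory-index-max=256m
```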

#4

Hi adam,

We run gcStorage once a week.
As for gc-deleted-dbs we do not run it because we do not delete any databases.

#5

This is our transactor.properties
memory-index-threshold=32m
memory-index-max=256m
object-cache-max=128m

Those values do not look big enough to be causing that much of a delay. Very peculiar issue.

#6

Hmmm…have you measured data transfer during peer startup? I’m curious if you can measure, e.g., 4 gigabytes needed to be transferred for the peer to connect, vs network transfer speed, vs other stats like CPU usage.

Also, your object cache is very small. I don’t know enough about the Peer internals to say with certainty that this would cause your issues, but one possibility is that you are thrashing the object cache and having to transfer the same segments multiple times as the Peer keeps invalidating cached data due to a full cache. The default object cache size is 50% of the Peer heap size, so 128m is WAY under provisioned.
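For what it’s worth, the peer-side cache is controlled by a JVM system property rather than transactor.properties. A sketch of launching a peer app with an explicit cache size (the heap and cache values are illustrative, and app.jar / my.peer.Main are placeholders):

```
# Peer object cache defaults to 50% of the peer heap;
# override it explicitly with a system property:
java -Xmx8g -Ddatomic.objectCacheMax=4g -cp app.jar my.peer.Main
```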

#7

Thanks for your time helping us try to figure out this issue.
I checked a few of the peers and it looks like:

  • in the first 5 minutes from startup it does a chunk of 80MB
  • followed by 5 minutes of nothing
  • and then between 100 - 250MB in the next 5 minutes (this is where the connection gets established)
  • after a total of 15 minutes, the peer is operational.

As far as CPU goes, it is at around 5% utilization.

The reason we have a low object cache is that we thought a big cache would cause long GC pauses, which would make the transactor and peers unavailable for a long time, and we are trying to avoid that. But we can try a bigger value and see how it goes.

Thanks again for your time. I will update once we get the time to test with a larger object cache.

#8

Hi adam,

Thanks again for your time helping us figure out the connection issue.

I’ve checked a few servers as they started up, and the data in is around 100-250 MB within the 5 minutes during which we are establishing the connection.

CPU is at around 5% during that time.

Also, I found out that the peers have their own cache setting

-Ddatomic.objectCacheMax=4096m

We could try and change the transactor object cache and see how it goes.

I’ll update you once we schedule some time to do that.

Thanks again.

#9

Some more updates: I was looking at the averages in the graphs. Switching to the maximums, the data downloaded within the 5-minute connection window can go up to 1.3 GB, and CPU actually went up to 25%. It is a 4-core CPU, so that could mean it’s utilizing one core to the max.

#10

Cool, yeah, I think I was conflating the transactor vs peer object cache in my prior reply anyway. The transactor can benefit from a somewhat larger cache, but it doesn’t seem like it should affect Peer connect time (just transaction performance, e.g. if you are hitting indexes for uniqueness checks during transactions, running transaction functions that retrieve a lot of data, etc.).

In terms of the startup time, which you’re reporting as 15 minutes, I’m a bit confused about how much of it is actually spent blocking on Peer.connect. If you write a test Peer program that does nothing but call Peer.connect and exit, does it take 15 minutes to run? The 100-250 MB you see transferring on Peer.connect corresponds to the memory index size (at that time) and makes sense with respect to the parameters you shared. But on-demand DDB should be able to get that done in much less than 5 minutes.
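A minimal harness for that test might look like the following. It assumes the Datomic peer library is on the classpath, and the storage URI is a placeholder you would swap for your real DynamoDB table, region, and database name:

```java
import datomic.Connection;
import datomic.Peer;

// Minimal test: measure the time blocked in Peer.connect, then exit.
public class ConnectTimer {
    public static void main(String[] args) {
        // Placeholder URI; pass your real one as the first argument.
        String uri = args.length > 0
            ? args[0]
            : "datomic:ddb://us-east-1/my-table/my-db";

        long start = System.nanoTime();
        Connection conn = Peer.connect(uri);   // blocks until connected
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;

        System.out.println("Peer.connect blocked for " + elapsedMs + " ms");

        conn.release();
        Peer.shutdown(true);                   // clean JVM exit
    }
}
```

Running this in the same environment as your app isolates the connect cost from anything else your startup path does (classpath scanning, framework init, warm-up queries, and so on).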

#11

It’s a very strange issue.

We’ve done the Peer test, and the blocking in Peer.connect is about 5.5 minutes. It takes the same amount of time when connecting from an EC2 instance in the same AZ, and we also did a test from our own machines, starting a REPL, and it took the same time.
This makes us think it’s not really the size of the data but something CPU-bound that the peer needs to do before it is ready. But at the moment we have no idea what that could be.

We are going to do a fresh deployment of our app in another country, and we’ll run some tests to see whether it is much faster with just the initial data; maybe that will help us identify the issue.