Help debugging high numbers of Datomic Client Timeouts

I’m trying to debug some apparent performance issues we’re seeing with Datomic Cloud, manifesting as a large number of Datomic Client Timeout errors with category interrupted (normally) or unavailable (occasionally) whenever we try to ramp up requests to our app beyond absolutely minimal levels. This seems to hold whether we set a reasonable timeout on requests (500ms, with a few retries) or an absurdly high one like 10 minutes. Given that Datomic Cloud itself is a bit of a black box, I was hoping that if I described our setup and issues, folks might have some suggestions as to where to look for the problem or how we might need to restructure things.
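For context, our client calls are roughly this shape (a simplified sketch rather than our actual code; the helper name, defaults, and retry policy are illustrative), with the per-request :timeout and a retry on the anomaly categories we’re seeing:

```clojure
(require '[datomic.client.api :as d])

(defn q-with-retries
  "Run a query, retrying on the anomaly categories we see most often.
   Timeout is per attempt, in milliseconds."
  [db query args {:keys [timeout retries] :or {timeout 500 retries 3}}]
  (loop [attempt 1]
    (let [res (try
                [:ok (d/q {:query   query
                           :args    (into [db] args)
                           :timeout timeout})]
                (catch clojure.lang.ExceptionInfo e
                  [:error (:cognitect.anomalies/category (ex-data e)) e]))]
      (if (= :ok (first res))
        (second res)
        (let [[_ category e] res]
          (if (and (< attempt retries)
                   (contains? #{:cognitect.anomalies/interrupted
                                :cognitect.anomalies/unavailable}
                              category))
            (recur (inc attempt))
            (throw e)))))))
```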

The app is fed by Kafka, and we have a unique database in Datomic per service per Kafka topic partition. Currently there are ~40 separate services reading from Kafka (though that is likely to increase) and 10 partitions (and thus databases) per service (again likely to increase), and each service only ever interacts at a given time with the specific database related to the message it has just received. The app is deployed as a series of Kubernetes pods, each of which runs a thread per service it has been configured to handle. We can scale all the way from one pod running all 400 service/partition pairs, only ever making 1 request to each of 40 databases at a time, to 400 pods each running a single service/partition pair for a max of 400 requests (1 per db) at a time. In reality we’re somewhere in the middle at the moment, with anywhere from 5-25 pods each running a subset of services, and each service running on 1-5 pods, depending on how we’re currently testing the scaling.
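To make the layout concrete, the database naming and connection handling look roughly like this (again a simplified sketch; the system name, region, and endpoint are placeholders rather than our real config):

```clojure
(require '[datomic.client.api :as d])

;; One client per process; placeholder system/region/endpoint values.
(def client
  (d/client {:server-type :cloud
             :region      "eu-west-1"
             :system      "our-datomic-system"
             :endpoint    "http://entry.our-datomic-system.eu-west-1.datomic.net:8182/"}))

(defn db-name
  "One database per service per Kafka topic partition."
  [service partition-id]
  (str service "-partition-" partition-id))

(defn conn-for
  "Connection used by the worker thread handling service/partition.
   With ~40 services x 10 partitions that's ~400 databases in total."
  [service partition-id]
  (d/connect client {:db-name (db-name service partition-id)}))
```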

We deployed Datomic Cloud (the production topology) from the standard CloudFormation template, so we currently have 2 i3.large instances handling all our requests. Our q count is only about 2x our transact count, with current testing at ~150 queries and ~75 transactions per second during a ‘load’ test. We’ve not yet deployed any query groups, although that is something we’re considering, especially as CPU use on the existing instances spikes to 80-90% during a test. Nor have we increased the size of the standard instance group from 2 to its maximum of 3, because the docs recommend talking to Datomic before doing that. We have changed the DynamoDB scaling policy to ‘on-demand’, so we’re not being limited by DynamoDB read/write capacity units. We do occasionally see some OpsTimeout on the CloudWatch dashboard, but not nearly as frequently as the client reports interrupted.

One other thing we are doing is going cross-VPC to communicate with Datomic. We already had all our K8s, Kafka, etc. set up in one VPC, the Datomic Cloud templates don’t allow deploying into an existing VPC, and we didn’t want to tie ourselves to recreating everything else if we ever needed to rebuild Datomic. So we’ve got a VPC for Datomic and a VPC for everything else, with inter-VPC communication between them. I had considered modifying the templates to work with the existing VPC, but don’t know what that would do to support from Cognitect.

If anyone has any pointers for things I should be trying or looking at to reduce our failures, that would be great, as would any sense of what sort of performance we should expect from Datomic Cloud in terms of handled requests/sec.

Hi @DanM

  1. I’d be happy to work with you in detail on this in a support case (support@cognitect.com). We can then circle back and share our findings here, especially if there is general advice for troubleshooting this that would help others (I’d like to look at your logs and metrics from a read-only AWS account if possible).

  2. The first thing that jumped out at me when reading your description of events is:

only ever making 1 request to each of 40 databases at a time, to 400 pods each running a single service/partition pair for a max of 400 requests (1 per db)

and

OpsTimeout on the CloudWatch dashboard

What does your Ops (especially OpsPending) dashboard chart look like? Does it correspond with the timeouts? My assumption here is that you may be inadvertently consuming all available threads, and other requests are timing out while waiting. I understand you aren’t seeing evidence of that in the OpsTimeout metric, but perhaps there is other evidence in the logs I can review with ReadOnly access.

Hi @jaret, that would be brilliant. Would it make most sense to email support@cognitect.com and include a link/reference to this thread in that case?

With respect to Ops[Timeout|Pending]: when we get a spike during a (low) load test, OpsPending does also spike, but not nearly as high. OpsPending spikes to ~9-10 (this seems reasonably consistent whenever we trigger our test), whereas when we see OpsTimeout it’s hitting 50-70 or so. However, we don’t necessarily see OpsTimeout (and do see client-side interrupted exceptions) every time we get an OpsPending spike; OpsTimeout seems to be a much more irregular thing.

This does match up with the high CPU load on the instances as well. During a test, when OpsPending spikes, we reliably see CPU spike to 70-90%, but we don’t always see OpsTimeout when that happens, and we often see interrupted exceptions without any spike on the graphs at all. Potentially that is because the spike lasts such a short time (only a few seconds) that it doesn’t show up at the minimum granularity of a CloudWatch dashboard graph (1 minute), but is still enough to cause a few errors?
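If it helps to line these up more precisely, something like the following would pull the per-minute datapoints for a metric so they can be matched against client-side error timestamps. This is a sketch using Cognitect’s aws-api (the api, endpoints, and monitoring deps); the :Namespace and :Dimensions values are placeholders that would need checking against whatever the system actually publishes to in the CloudWatch console:

```clojure
(require '[cognitect.aws.client.api :as aws])

;; CloudWatch's aws-api service name is :monitoring.
(def cw (aws/client {:api :monitoring}))

(defn metric-minutes
  "Fetch 1-minute datapoints for a metric (e.g. \"OpsPending\" or
   \"OpsTimeout\") over the last minutes-back minutes."
  [metric-name minutes-back]
  (let [end   (java.util.Date.)
        start (java.util.Date. (- (.getTime end) (* minutes-back 60 1000)))]
    (->> (aws/invoke cw {:op      :GetMetricStatistics
                         :request {:Namespace  "DatomicCloud"        ; placeholder
                                   :MetricName metric-name
                                   :Dimensions [{:Name  "SystemName" ; placeholder
                                                 :Value "our-datomic-system"}]
                                   :StartTime  start
                                   :EndTime    end
                                   :Period     60
                                   :Statistics ["Maximum" "Sum"]}})
         :Datapoints
         (sort-by :Timestamp))))
```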

Interestingly, from our more detailed breakdown of timings, it seems like transact calls either fail outright (with interrupted) or stay reasonably low on response times, whereas calls to q or db are where we see response times skyrocket to multiple seconds as well as interrupted exceptions. Maybe that points to a need for query groups, even though our read counts aren’t that much higher than our writes?
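If we do go down the query-group route, my understanding (untested; the group name and endpoint below are illustrative) is that the only client-side change would be a second client pointed at the query group’s endpoint, used for the q/db-heavy paths:

```clojure
;; Placeholder query-group name, system, region, and endpoint.
(def query-client
  (d/client {:server-type :cloud
             :region      "eu-west-1"
             :system      "our-datomic-system"
             :endpoint    "http://entry.our-query-group.eu-west-1.datomic.net:8182/"}))
```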

Yeah, e-mail support@cognitect.com with a general description and a link. I will pick up the case and send you our instructions for creating a read-only account.