I’m trying to debug some apparent performance issues we’re getting with Datomic Cloud, manifesting as a large number of Datomic Client Timeout errors with category `interrupted` (normally) or `unavailable` (occasionally) whenever we try to ramp up requests to our app beyond absolutely minimal levels. This seems to remain true whether we set a reasonable timeout on requests (500ms, with a few retries) or an absurdly high one like 10 minutes. Given that Datomic Cloud itself is a bit of a black box, I was hoping that if I described our setup and issues, folk might have some suggestions as to where to look for the problem or how we might need to restructure things.
The app is fed by Kafka, and we have a unique database in Datomic per service per Kafka topic partition. Currently there are ~40 separate services reading from Kafka (though that is likely to increase) and 10 partitions (and thus databases) per service (again likely to increase), and each service only ever interacts at a given time with the specific database related to the message it’s just received. The app is deployed as a series of Kubernetes pods, each of which runs a thread per service it has been configured to handle. We can scale all the way from one pod running all 400 service/partition pairs (making at most 1 concurrent request to each of 40 databases), up to 400 pods each running a single service/partition pair, for a max of 400 concurrent requests (1 per db). In reality we’re somewhere in the middle at the moment, with anywhere from 5-25 pods each running a subset of services and each service running on 1-5 pods, depending on how we’re currently testing the scaling.
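For reference, the way a thread picks the database for the message it has just received is essentially the following sketch; the client config values and the db naming scheme here are placeholders rather than our real ones.

```clojure
(require '[datomic.client.api :as d])

;; Placeholder client config: region, system, endpoint and creds-profile
;; below are illustrative values, not our actual setup.
(def client
  (d/client {:server-type   :cloud
             :region        "eu-west-1"
             :system        "our-system"
             :endpoint      "http://entry.our-system.eu-west-1.datomic.net:8182/"
             :creds-profile "our-profile"}))

(defn conn-for
  "Connection for the database backing a given service and Kafka partition.
  The db-name scheme is a placeholder."
  [service partition]
  (d/connect client {:db-name (str service "-" partition)}))
```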
We deployed Datomic Cloud (the production version) from the standard CloudFormation template, so we currently have 2 `i3.large` instances handling all our requests. Our `q` count is only about 2x our `transact` count, with current testing at ~150 queries and ~75 transactions per second during a ‘load’ test. We’ve not yet deployed any query groups, although that is something we’re considering, especially as CPU use on the existing instances spikes to 80-90% during a test. Nor have we increased the size of the standard instance group from 2 to its maximum of 3, because the docs recommend talking to Datomic before doing that. We have changed the DynamoDB scaling policy to ‘on-demand’, so we’re not being hit by any restrictions on DynamoDB read/write units. We do occasionally see some `OpsTimeout` on the CloudWatch dashboard, but not nearly as frequently as the client reports `interrupted`.
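The transact side goes through an equivalent wrapper (again a simplified, illustrative sketch, reusing the hypothetical `retryable?` set from the earlier snippet):

```clojure
(defn transact-with-retries
  "Apply tx-data, retrying a few times on retryable anomalies.
  Illustrative sketch; retry counts are placeholders."
  [conn tx-data {:keys [max-attempts] :or {max-attempts 3}}]
  (loop [attempt 1]
    (let [result (try
                   (d/transact conn {:tx-data tx-data})
                   (catch clojure.lang.ExceptionInfo e
                     (if (and (< attempt max-attempts)
                              (retryable? (:cognitect.anomalies/category (ex-data e))))
                       ::retry
                       (throw e))))]
      (if (= ::retry result)
        (recur (inc attempt))
        result))))
```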
One other thing we are doing is going cross-VPC to communicate with Datomic. We already had all our K8s, Kafka etc. set up in one VPC, and the Datomic Cloud templates don’t allow deploying into an existing VPC, nor did we want to tie ourselves to having to recreate everything else if we ever needed to rebuild Datomic. So we’ve got a VPC for Datomic and a VPC for everything else, with inter-VPC communication between them. I had considered modifying the templates to work with the existing VPC, but don’t know what that would mean for support from Cognitect.
If anyone has any pointers on things I should be trying or looking at to reduce our failures, that would be great, as would a sense of what sort of performance we should expect from Datomic Cloud in terms of requests handled per second.