Hi Jarrod, thanks for the reply!
I’m going to answer these out-of-order because the first one is probably the biggest.
We’re measuring response times with a library that provides stats macros which write to DataDog (GitHub - unbounce/clojure-dogstatsd-client: A thin veneer over the official java dogstatsd client), including a timing one. Effectively, we wrap each call to Datomic in a
(try ... (finally ...)) which records the timing information and writes it as a DataDog metric even if the request times out.
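For concreteness, here is a minimal sketch of that wrapper. The `report-timing!` helper is hypothetical (standing in for whichever dogstatsd timing macro/function we actually call); the try/finally shape is the point.

```clojure
;; Minimal sketch of the timing wrapper, assuming a hypothetical
;; `report-timing!` helper that forwards elapsed milliseconds to dogstatsd.
(defmacro with-datomic-timing
  "Runs body, always reporting elapsed time under metric-name,
  even when body throws (e.g. on a Datomic timeout)."
  [metric-name & body]
  `(let [start# (System/nanoTime)]
     (try
       ~@body
       (finally
         (let [elapsed-ms# (/ (- (System/nanoTime) start#) 1e6)]
           (report-timing! ~metric-name elapsed-ms#))))))
```

The finally block is what guarantees we still get a metric when the Datomic call times out rather than returning.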
Initially, we added this timing to identify bottlenecks in our code, as some operations were taking much longer than we were happy with. Those have now been identified and either fixed or we have a plan to fix them. This question was more about a general understanding of the system baseline. If you’ve been chatting to Jaret too, you probably know we’re having some issues with timeouts and overloading Datomic there; similarly, I’m trying to understand whether what seems like a relatively low request rate is actually asking a lot more of the system than we thought. I’ve not used Datomic (on-prem or Cloud) before, but others in the team have used on-prem (we’re all new to Cloud, I believe), and based on their on-prem experience they were expecting things to be noticeably faster than we are seeing.
I’m not sure what you mean in 3. We’re consuming messages from Kafka, with each message (that isn’t filtered as irrelevant) normally being a query, followed by a transact of the result of combining the data in the message with the data from the query (e.g. we receive intended updates for a user and want to retrieve the user, make sure the updates make sense in the context of the rest of the user, then write the updated user back if they do). In a number of our use cases, later messages will depend on or build on the update that was processed in earlier ones, so we need to know the
transact for message n has completed successfully before we do the
q for message n+1.
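The per-message loop above looks roughly like this sketch, assuming the Datomic Cloud client API; `relevant?`, `lookup-query`, and `build-tx-data` are hypothetical stand-ins for our own code. Since `d/transact` in the client API is synchronous, returning normally is the “completed successfully” signal we need before consuming message n+1.

```clojure
(require '[datomic.client.api :as d])

(defn handle-message! [conn msg]
  (when (relevant? msg)                     ; filter irrelevant messages
    (let [db      (d/db conn)
          current (d/q lookup-query db (:username msg))
          tx-data (build-tx-data current msg)]
      ;; Throws on failure/timeout, which stops us moving on to n+1.
      (d/transact conn {:tx-data tx-data}))))
```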
We have not run tests from inside the VPC yet. I have a number of things I think it would be useful to compare, including (specifically related to Datomic):
- running the app from within the same VPC as Datomic vs a separate one
- comparing throughput and response times of Datomic on-prem vs Datomic Cloud
- separating out queries to run through a dedicated query group instead of the transactors
I haven’t done any of these yet. I was hoping some of the answers on here would help nudge me as to which to try first.
On the transact specifics, there would be quite a lot of them. A single call to our API might trigger 3-5 query+transacts to different Datomic DBs (on the same cluster) to build the needed answer. The load test I’m running considers a single ‘user’ to be a series of 10 API calls, each using the response from the previous call to determine the endpoint to call next. And then it generates n users per second to do that. The queries are generally along the lines of
Lookup user by username:
[:find (pull ?user [::user-id])
 :in $ ?username
 :where [?user ::username ?username]]
List all API keys for user (the users in the test only have 1 each):
[:find (pull ?api-key [::api-key-id ::name ::live?])
 :in $ ?user-id
 :where [?user ::users/user-id ?user-id]
        [?api-key ::user ?user]]
List all orgs a user is a member of (each test user is only in 1 org):
[:find (pull ?organization ...)
 :in $ ?user-id
 :where [?organization ::o/organization-members ?member]
        [?member ::o/member-status ::o/current-member]
        [?member ::o/member-user ?user]
        [?user ::u/user-id ?user-id]]
We actually have ~400 DBs on the cluster at the moment, each of which has one single-threaded process doing the query…transact…next msg… loop against it for one specific type of call. We have the full schema in every DB for ease (which I can link in a service ticket if necessary), even though each individual DB will only ever actually store data against a subset of those attributes. The Datomic Cloud dashboard says that the cluster as a whole currently has ~2.6m datoms, but not how they’re divided between the DBs (I’ll have to look up the calls to get that info).
Given that my current testing is with GET requests to the API, the
transact calls we’re doing will be very simple. It’s a
:db/cas of a random UUID, which is belt-and-braces to make sure there aren’t two processes both handling messages for the same DB (as soon as one starts up and does an initial blind set of that value, it locks the other out). On top of that, it’s some basic Kafka offset data and additional tracing, which totals 6 simple string/int/keyword datoms and one which is a set of a small number (<10) of UUIDs.