What sort of q/transact response times should I be expecting?

At the moment our timing metrics are showing that when things are running smoothly a call to datomic.client.api/q is taking, at best, around 60ms. A call to datomic.client.api/transact is more like 40ms. Is that expected? Should we be getting response times much lower than that? The DynamoDB metrics we have are showing that the put latency to the DDB table itself is sub-5ms, and query latency is sub-2ms normally. So that’s a lot of overhead over and above DynamoDB itself, but obviously the timings from the client/request side would be higher than those reported internally by DDB.

I can totally see this being expected once you factor in HTTP request overhead between our client and the Datomic transactor (or query group, although we’re not using those yet), processing on the transactor, and so on. Our app is also in a separate VPC, so we’re accessing Datomic via a VPC endpoint pointing at the NLB, and I don’t know how much overhead that adds. But if those timings are expected then it means we probably need to rearchitect some of our app, so it would be useful to know whether we should be expecting things to run a lot faster than that or not.

Hi Dan,

I was chatting with Jaret about the dialogue you two have been having. In general, response times can vary based on a number of factors. Could you provide some additional details?

  1. What are the specifics of the transactions and queries you are running (number of datoms per transaction, schema, the queries themselves, and the size of the database)?

  2. How are you measuring response times, and what is the motivation (i.e., a specific metric you are required to meet or a general understanding of your system baseline)?

  3. Are the queries dependent upon the transactions having been completed?

  4. Have you run similar tests all from inside the same VPC?

Jarrod

Hi Jarrod, thanks for the reply!

I’m going to answer these out-of-order because the first one is probably the biggest.

We’re measuring response times with a library that provides stats macros which output to DataDog (unbounce/clojure-dogstatsd-client, a thin veneer over the official Java dogstatsd client), including a timing macro. Effectively, we’re wrapping each call to Datomic in a (try ... (finally ...)) which records the timing information and writes it as a DataDog metric even if the request times out.
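Roughly, the wrapper looks like the sketch below; record-timing! is a hypothetical stand-in for the dogstatsd timing call, and the metric name is just illustrative.

(require '[datomic.client.api :as d])

;; record-timing! stands in for the dogstatsd client call;
;; the real code uses the library's timing macro.
(defn record-timing! [metric elapsed-ms]
  (println metric elapsed-ms))

(defn timed-q
  "Run a query and record its wall-clock time, even if it throws or times out."
  [db query & args]
  (let [start (System/nanoTime)]
    (try
      (apply d/q query db args)
      (finally
        (record-timing! "datomic.client.q"
                        (/ (- (System/nanoTime) start) 1e6))))))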

Initially, we added this timing to identify bottlenecks in our code, as some operations were taking much longer than we were happy with. Those have now been identified and either fixed or we have a plan to fix them. This question was more around a general understanding of the system baseline. If you’ve been chatting to Jaret too you probably know we’re having some issues with timeouts and overloading Datomic, and in the same vein I’m trying to understand whether what seems like a relatively low request rate is actually asking a lot more of the system than we thought. I’ve not used Datomic (on-prem or Cloud) before, but others in the team have used on-prem (we’re all new to Cloud, I believe), and based on their on-prem experience they were expecting things to be noticeably faster than we are seeing.

I’m not sure what you mean in 3. We’re consuming messages from Kafka, with each message (that isn’t filtered out as irrelevant) normally resulting in a query, followed by a transact of the result of combining the data in the message with the data from the query (e.g. we receive intended updates for a user and want to retrieve the user, make sure the updates make sense in the context of the rest of the user, then write the updated user back if they do). In a number of our use cases, later messages depend on or build on the update that was processed in earlier ones, so we need to know the transact for message n has completed successfully before we do the q for message n+1.
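In rough Clojure, the per-message loop looks something like the sketch below (using the same d alias as above); apply-updates, valid-update? and user->tx-data are hypothetical stand-ins for our real handler code.

(defn handle-message! [conn msg]
  (let [db      (d/db conn)
        user    (d/q '[:find (pull ?user [*]) .
                       :in $ ?username
                       :where [?user ::username ?username]]
                     db (:username msg))
        ;; hypothetical helpers: merge the intended updates and sanity-check them
        updated (apply-updates user (:updates msg))]
    (when (valid-update? user updated)
      ;; d/transact blocks until the transaction has completed, so the next
      ;; message only runs its query once this write has succeeded
      (d/transact conn {:tx-data (user->tx-data updated)}))))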

We have not run tests from inside the VPC yet. I have a number of things I think it would be useful to compare, including (specifically related to Datomic):

  • running the app from within the same VPC as Datomic vs a separate one
  • comparing throughput and response times of Datomic on-prem vs Datomic Cloud
  • separating out queries to run through a dedicated query group instead of the transactors

I haven’t done any of these yet. I was hoping some of the answers on here would help nudge me as to which to try first.

As to q/transact specifics, there would be quite a lot of them. A single call to our API might trigger 3-5 query+transacts to different Datomic DBs (on the same cluster) to build the needed answer. The load test I’m running considers a single ‘user’ to be a series of 10 API calls, each using the response from the previous call to determine the endpoint to call next, and then it generates n users per second to do that. The queries are generally along the lines of:

Lookup user by username:

[:find (pull ?user [::user-id])
       :in $ ?username
       :where
       [?user ::username ?username]]

List all API keys for user (the users in the test only have 1 each):

[:find (pull ?api-key [::api-key-id ::name ::live?
                       {::user [::users/user-id]}
                       {::organization [::orgs/organization-id]}])
       :in $ ?user-id
       :where [?user ::users/user-id ?user-id]
              [?api-key ::user ?user]]

List all orgs a user is a member of (each test user is only in 1 org):

[:find (pull ?organization
             [::o/organization-id
              ::o/organization-name])
       :in $ ?user-id
       :where
       [?organization ::o/organization-members ?member]
       [?member ::o/member-status ::o/current-member]
       [?member ::o/member-user ?user]
       [?user ::u/user-id ?user-id]]

We actually have ~400 DBs on the cluster at the moment, each of which has a single-threaded process doing the query…transact…next msg… loop against it for one specific type of call. For ease we have the full schema in every DB (which I can link in a support ticket if necessary), even though each individual DB will only ever store data for a subset of the attributes. The Datomic Cloud dashboard says the cluster as a whole currently has ~2.6m datoms, but not how they’re divided between the DBs (I’ll have to look up the calls to get that info).

Given that my current testing is with GET requests to the API, the transact calls we’re doing will be very simple. Each one is a :db/cas of a random UUID, which is a belt-and-braces check to make sure there aren’t two processes both handling messages for the same DB (as soon as one starts up and does an initial blind set of that value, it locks the other out). On top of that it’s some basic Kafka offset data and additional tracing, which totals 6 simple string/int/keyword datoms plus one that is a set of a small number (<10) of UUIDs.
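One plausible shape for that transaction is sketched below; the entity and attribute names (::process-lock, ::kafka-offset, ::trace-ids) are hypothetical stand-ins for our real schema.

(def process-id (java.util.UUID/randomUUID))

(defn offset-tx-data [consumer-eid offset trace-ids]
  [;; :db/cas aborts the whole transaction unless ::process-lock still holds
   ;; this process's UUID, locking out any competing consumer
   [:db/cas consumer-eid ::process-lock process-id process-id]
   {:db/id         consumer-eid
    ::kafka-offset offset       ; basic offset/tracing datoms
    ::trace-ids    trace-ids}]) ; small set of UUIDs

;; (d/transact conn {:tx-data (offset-tx-data consumer-eid 42 trace-ids)})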

Dan,

I know there is an ongoing support ticket with more details, but I wanted to reply more generally about something here.

I think we can improve the last query. The first :where clause looks like it could produce a large number of intermediate results, which could unnecessarily increase the query time; it appears unnecessary in this case because at the time of the query we already know the user-id. It is always a good idea to start with the most specific clause first.
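Reordering the clauses of that query along those lines would look roughly like this:

[:find (pull ?organization [::o/organization-id
                            ::o/organization-name])
 :in $ ?user-id
 :where
 [?user ::u/user-id ?user-id]
 [?member ::o/member-user ?user]
 [?member ::o/member-status ::o/current-member]
 [?organization ::o/organization-members ?member]]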

For a query such as this I would think it could be as simple as something like:

(d/pull db '[:o/organization-id 
             :o/organization-name] 
        [:u/user-id #uuid "9fd8de38-15af-4187-941b-a039bcbab1cb"])