Does Datomic provide any query performance benefits when using comparison operators?

Does Datomic provide any query performance benefits when using comparison operators such as <, >, <=, or >= on indexed attributes (for example, numbers, strings, or dates)?
Or are indexes optimized only for equality lookups (=), with range queries still evaluated by scanning through index segments?

@nikolayandr Indexed attributes (:db/index true) will leverage AVET for range predicates, which makes them much more performant. See: Query Reference | Datomic
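
For illustration, a minimal sketch of reading a date range straight out of AVET with d/index-range, which is the same index a range predicate over an indexed attribute can take advantage of. The connection conn and the :Model/createdAt attribute are assumptions here (they come up later in this thread):

(require '[datomic.api :as d])

;; Sketch only: assumes `conn` is a peer connection and :Model/createdAt is an
;; indexed (:db/index true) :db.type/instant attribute.
;; d/index-range walks AVET between two values instead of scanning the whole attribute.
(let [db (d/db conn)]
  (count
   (seq
    (d/index-range db
                   :Model/createdAt                          ; AVET-indexed attribute
                   #inst "2025-07-01T00:00:00.000-00:00"     ; start of range (inclusive)
                   #inst "2025-08-01T00:00:00.000-00:00")))) ; end of range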

@jaret

I initially asked that question because I’ve encountered a somewhat unexpected performance issue related to indexed attributes.

I have a query that filters entities by a date range using an indexed attribute (:Model/createdAt). The query runs fine when there’s plenty of memory available, but it becomes extremely slow when memory is limited.

(time
 (d/query {:query '[:find (count ?e)
                    :in $ ?date1 ?date2 ?typeName
                    :where
                    [?e :Model/typeName ?typeName]
                    [?e :Model/createdAt ?createdAt]
                    [(>= ?createdAt ?date1)]
                    [(< ?createdAt ?date2)]]
           :args [(d/db conn)
                  #inst "2025-07-01T00:00:00.000-00:00"
                  #inst "2025-08-01T00:00:00.000-00:00"
                  "ViaxCustomerOrder"]
           :query-stats true}))

Result:

"Elapsed time: 11431.092938 msecs"

{:ret [[13933]],
  :query-stats
  {:query
   [:find
    (count ?e)
    :in
    $
    ?date1
    ?date2
    ?typeName
    :where
    [?e :Model/typeName ?typeName]
    [?e :Model/createdAt ?createdAt]
    [(>= ?createdAt ?date1)]
    [(< ?createdAt ?date2)]],
   :phases
   [{:sched
     (([(ground $__in__4) ?typeName]
       [(ground $__in__3) ?date2]
       [(ground $__in__2) ?date1]
       [?e :Model/typeName ?typeName]
       [?e :Model/createdAt ?createdAt]
       [(>= ?createdAt ?date1)]
       [(< ?createdAt ?date2)])),
     :clauses
     [{:clause [(ground $__in__4) ?typeName],
       :rows-in 0,
       :rows-out 1,
       :binds-in (),
       :binds-out [?typeName],
       :expansion 1}
      {:clause [(ground $__in__3) ?date2],
       :rows-in 1,
       :rows-out 1,
       :binds-in [?typeName],
       :binds-out [?date2 ?typeName]}
      {:clause [(ground $__in__2) ?date1],
       :rows-in 1,
       :rows-out 1,
       :binds-in [?date2 ?typeName],
       :binds-out [?date1 ?date2 ?typeName]}
      {:clause [?e :Model/typeName ?typeName],
       :rows-in 1,
       :rows-out 97278,
       :binds-in [?date1 ?date2 ?typeName],
       :binds-out [?date1 ?date2 ?e],
       :expansion 97277}
      {:clause [?e :Model/createdAt ?createdAt],
       :rows-in 97278,
       :rows-out 13933,
       :binds-in [?date1 ?date2 ?e],
       :binds-out [?date1 ?date2 ?createdAt ?e],
       :preds ([(>= ?createdAt ?date1)] [(< ?createdAt ?date2)])}]}]}}

When I run the same query using a non-indexed attribute (:BusinessInteraction/biCreatedAt), it executes dramatically faster — even with limited memory:

(time
 (d/query {:query '[:find (count ?e)
                    :in $ ?date1 ?date2 ?typeName
                    :where
                    [?e :Model/typeName ?typeName]
                    [?e :BusinessInteraction/biCreatedAt ?createdAt]
                    [(>= ?createdAt ?date1)]
                    [(< ?createdAt ?date2)]]
           :args [(d/db conn)
                  #inst "2025-07-01T00:00:00.000-00:00"
                  #inst "2025-08-01T00:00:00.000-00:00"
                  "ViaxCustomerOrder"]
           :query-stats true}))

Result:

"Elapsed time: 337 msecs"

{:ret [[13904]],
  :query-stats
  {:query
   [:find
    (count ?e)
    :in
    $
    ?date1
    ?date2
    ?typeName
    :where
    [?e :Model/typeName ?typeName]
    [?e :BusinessInteraction/biCreatedAt ?createdAt]
    [(>= ?createdAt ?date1)]
    [(< ?createdAt ?date2)]],
   :phases
   [{:sched
     (([(ground $__in__4) ?typeName]
       [(ground $__in__3) ?date2]
       [(ground $__in__2) ?date1]
       [?e :Model/typeName ?typeName]
       [?e :BusinessInteraction/biCreatedAt ?createdAt]
       [(>= ?createdAt ?date1)]
       [(< ?createdAt ?date2)])),
     :clauses
     [{:clause [(ground $__in__4) ?typeName],
       :rows-in 0,
       :rows-out 1,
       :binds-in (),
       :binds-out [?typeName],
       :expansion 1}
      {:clause [(ground $__in__3) ?date2],
       :rows-in 1,
       :rows-out 1,
       :binds-in [?typeName],
       :binds-out [?date2 ?typeName]}
      {:clause [(ground $__in__2) ?date1],
       :rows-in 1,
       :rows-out 1,
       :binds-in [?date2 ?typeName],
       :binds-out [?date1 ?date2 ?typeName]}
      {:clause [?e :Model/typeName ?typeName],
       :rows-in 1,
       :rows-out 97278,
       :binds-in [?date1 ?date2 ?typeName],
       :binds-out [?date1 ?date2 ?e],
       :expansion 97277}
      {:clause [?e :BusinessInteraction/biCreatedAt ?createdAt],
       :rows-in 97278,
       :rows-out 13904,
       :binds-in [?date1 ?date2 ?e],
       :binds-out [?date1 ?date2 ?createdAt ?e],
       :preds ([(>= ?createdAt ?date1)] [(< ?createdAt ?date2)])}]}]}}

Both queries return almost the same result set (about 13,900 entities), and when enough memory is available, they perform similarly. But when memory is tight, the indexed version becomes orders of magnitude slower.

I assume that this happens because Datomic tries to read large index segments into memory, and when the cache is full, it repeatedly evicts and reloads pages, leading to IO thrashing. The non-indexed field avoids this by not going through the index structure.

Does that explanation sound correct?
And is increasing the peer cache/memory really the only way to avoid this behavior, or is there a way for Datomic to fall back to a less index-intensive strategy when memory is constrained?

I assume that this happens because Datomic tries to read large index segments into memory, and when the cache is full, it repeatedly evicts and reloads pages, leading to IO thrashing. The non-indexed field avoids this by not going through the index structure.

Datomic datalog queries are always reading through an index: the question is only which one.

These queries are both doing roughly the same thing index-wise:

  1. Get the 97,278 Model entity ids (?e):
    1. If this attribute is indexed, it uses something like (->> (d/datoms $ :avet :Model/typeName ?typeName) (map :e))
    2. If not indexed, it’s more like (->> (d/datoms $ :aevt :Model/typeName) (filter #(= ?typeName (:v %))) (map :e))
  2. You now have a fully-realized set of ?e values. For each ?e see if the next clause is satisfied. Both queries use AEVT again, something like (->> (d/datoms $ :aevt the-attribute ?e) (filter #(and (>= (:v %) ?date1) (< (:v %) ?date2))))
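
Expanded into a rough, runnable sketch (attribute names and inputs taken from your query above; note that #inst values are java.util.Date, so the real comparison has to go through compare rather than >=/<):

;; Approximation only: this is roughly what the engine does for these clauses,
;; not Datomic's actual implementation.
(let [db        (d/db conn)
      date1     #inst "2025-07-01T00:00:00.000-00:00"
      date2     #inst "2025-08-01T00:00:00.000-00:00"
      ;; step 1: all ?e with the matching type, via AVET on :Model/typeName
      es        (map :e (d/datoms db :avet :Model/typeName "ViaxCustomerOrder"))
      ;; is a createdAt value inside [date1, date2) ?
      in-range? (fn [v] (and (<= 0 (compare v date1))
                             (neg? (compare v date2))))]
  ;; step 2: for each ?e, read its :Model/createdAt from AEVT and test the range
  (count (filter (fn [e]
                   (some #(in-range? (:v %))
                         (d/datoms db :aevt :Model/createdAt e)))
                 es)))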

So it doesn’t matter that :Model/createdAt has an AVET index and :BusinessInteraction/biCreatedAt does not, because neither query uses AVET for those attributes.

However, you could have other workloads on the same peer that keep the AEVT segments of one attribute hotter in cache than the other’s.

The way you diagnose this is with query io-stats, which tells you which indexes you read from and how many times, and which storage or cache tiers you read from and how many times and for how long.

With io-stats in hand you should have a better idea of where the time is spent in each query in the storage layers.
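
For example, a minimal sketch using your first query (the :io-context value is just a label keyword of your choosing; the rest of the call shape is the same as what you already have):

;; Adding :io-context to the d/query arg-map makes the peer return :io-stats
;; alongside :ret.
(d/query {:query '[:find (count ?e)
                   :in $ ?date1 ?date2 ?typeName
                   :where
                   [?e :Model/typeName ?typeName]
                   [?e :Model/createdAt ?createdAt]
                   [(>= ?createdAt ?date1)]
                   [(< ?createdAt ?date2)]]
          :args [(d/db conn)
                 #inst "2025-07-01T00:00:00.000-00:00"
                 #inst "2025-08-01T00:00:00.000-00:00"
                 "ViaxCustomerOrder"]
          :io-context :io/by-model-created-at})
;; => {:ret [[...]], :io-stats {:io-context :io/by-model-created-at, ...}}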

Thanks for the explanation, that helped clarify how both queries traverse indexes.

To make sure the test environment was as clean as possible, I ran both queries locally and restarted the peer before each execution, so no caching effects should be involved.
Here are the io-stats results:

Query using indexed field :Model/createdAt:

{:ret [[13933]],
  :io-stats
  {:io-context :io/by-model-created-at,
   :api :query,
   :api-ms 13303.14,
   :reads
   {:dir 194558,
    :aevt 194811,
    :avet 21,
    :inflight-lookup-ms 46.72,
    :aevt-load 42550,
    :deserialize-ms 40568.38,
    :deserialize 42472,
    :dev-ms 7468.45,
    :avet-load 9,
    :dev 42472,
    :ocache 45233}}}

Query using non-indexed field :BusinessInteraction/biCreatedAt:

{:ret [[13904]],
  :io-stats
  {:io-context :io/by-interaction-created-at,
   :api :query,
   :api-ms 536.7,
   :reads
   {:dir 194558,
    :aevt 196281,
    :avet 21,
    :inflight-lookup-ms 16.65,
    :aevt-load 223,
    :deserialize-ms 205.94,
    :deserialize 209,
    :dev-ms 81.76,
    :avet-load 9,
    :dev 209,
    :ocache 1966}}}

The difference is pretty significant: about 13 seconds vs 0.5 seconds, and especially the :aevt-load values (42,550 vs 223).

:Model/typeName itself is an indexed attribute.

After increasing the available peer memory, I repeated the test and got these results.

For the first (indexed) query, on a cold start (immediately after peer restart):

[{:ret [[13933]],
  :io-stats
  {:io-context :io/by-model-created-at,
   :api :query,
   :api-ms 2257.1,
   :reads {:aevt-load 5870, :deserialize 5860, :ocache 6124}}}]

Then, after running the same query again (without restarting the peer):

[{:ret [[13933]],
  :io-stats
  {:io-context :io/by-model-created-at,
   :api :query,
   :api-ms 239.55,
   :reads {:ocache 250}}}]

So even though the first (cold) run is still slower compared to the version using the non-indexed field, the second run executes much faster.

So, based on these results, does it mean that queries using indexed attributes are more dependent on how much memory is available to the peer, compared to equivalent queries that use non-indexed attributes?

In my tests, the query with an indexed field (:Model/createdAt) became much faster after increasing peer memory, while the non-indexed version stayed almost the same regardless of available RAM. That suggests the indexed one is more sensitive to cache size and memory pressure.

Is that interpretation correct?

So, based on these results, does it mean that queries using indexed attributes are more dependent on how much memory is available to the peer, compared to equivalent queries that use non-indexed attributes?

Still no. “Indexed attribute” means in this context that it has an AVET index via :db/index true in its schema (or implicitly via :db/unique). An “unindexed attribute” doesn’t have an AVET index. Both still have AEVT and EAVT (and VAET if the attribute is a :db.type/ref).
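
As a sketch of that distinction in schema terms (assuming both attributes are :db.type/instant, which matches how they are used in this thread):

;; Hypothetical schema for the two attributes discussed here.
[{:db/ident       :Model/createdAt
  :db/valueType   :db.type/instant
  :db/cardinality :db.cardinality/one
  :db/index       true}                   ; AVET maintained, in addition to EAVT/AEVT
 {:db/ident       :BusinessInteraction/biCreatedAt
  :db/valueType   :db.type/instant
  :db/cardinality :db.cardinality/one}]   ; EAVT/AEVT only, no AVET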

The io-stats show that there is no difference in AVET use between these two queries. Both queries read very little from AVET to satisfy the first clause [?e :Model/typeName ?typeName]. The remaining clauses only read from AEVT.

The “indexedness” of the attribute cannot possibly be relevant if the query isn’t even attempting to read from AVET for that attribute.

What we’re seeing here is that many, many more distinct segments are needed to answer questions about :Model/createdAt than :BusinessInteraction/biCreatedAt.

Let’s compare the stats in table form. First let’s see what is the same.

| :io-context | :io/by-model-created-at | :io/by-interaction-created-at | meaning |
|---|---|---|---|
| :avet | 21 | 21 | number of times reading an avet segment |
| :avet-load | 9 | 9 | number of avet segment object cache misses |
| :aevt | 194,811 | 196,281 | number of times reading an aevt segment (cache or not) |
| :dir | 194,558 | 194,558 | number of times reading a dir node segment (cache or not) |

So far so good. AVET is used for :Model/typeName: 21 distinct reads are needed, and 9 of them are ocache misses loaded from dev storage. This is the only section of io-stats that makes use of “indexed attribute” segments.

Both queries have the same number of rows of ?e (models with matching typename) so the following clauses should need approximately the same number of reads of AEVT (some nearly constant number per distinct ?e). And that’s what we see: approximately 195,000 :aevt reads, and exactly the same number of :dir reads.

But then we start to see the differences:

| :io-context | :io/by-model-created-at | :io/by-interaction-created-at | meaning |
|---|---|---|---|
| :ocache | 45,233 | 1,966 | number of requests to the object cache |
| :aevt-load | 42,550 | 223 | number of aevt segment object cache misses |
| :deserialize-ms | 40,568.38 | 205.94 | sum of time deserializing segments to load into the object cache |
| :deserialize | 42,472 | 209 | number of segments deserialized |
| deserialize ms avg | 0.96 | 0.99 | |
| :dev-ms | 7,468.45 | 81.76 | sum of time reading from dev storage |
| :inflight-lookup-ms | 46.72 | 16.65 | sum of time spent waiting for another in-flight storage request to complete instead of issuing an identical one (time that could have been spent in dev-ms, but was avoided) |
| :dev | 42,472 | 209 | number of reads from dev storage |
| dev ms avg | 0.18 | 0.47 | |

Looking up :Model/createdAt for 97,278 ?es takes roughly 43,000 more reads than biCreatedAt (compare the :ocache values). 42,550 AEVT segments had to be loaded from storage in the :Model/createdAt case, versus only 223 in the biCreatedAt case. Note that the average time per segment (a dev ms average of 0.18 ms per segment) is actually much better for :Model/createdAt, but there is just an enormous quantity of them.
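
A quick arithmetic check on those numbers makes the point: the per-segment cost is nearly identical, and the difference is almost entirely the segment count:

;; ~42,472 segments deserialized at ~0.96 ms each for :Model/createdAt,
;; versus ~209 segments at ~0.99 ms each for biCreatedAt:
(* 42472 0.96)   ; => ~40773 ms, in line with the ~40,568 ms of :deserialize-ms
(* 209   0.99)   ; => ~207 ms,   in line with the ~206 ms of :deserialize-ms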

How can this be?

  • You must have many, many more :Model/createdAt datoms than biCreatedAt datoms!
  • The “locality” of those datoms doesn’t align with the entities in your ?e set, so it has to read many more distinct segments from all over the AEVT index to find the data it needs.
  • Much more time (40 s) is spent deserializing items loaded from storage than loading them (7 s). Deserializing is cpu- and memory-allocation intensive; more memory or cpu cores (with a higher readConcurrency setting) might help do this work in parallel or avoid pauses due to jvm GC.

More memory will help keep your cache warm and avoid GC pressure related to deserialization, but if you have a completely cold peer with this much data it’s going to take time to load it.

It looks from your query-stats that your created-at date-range tests significantly decrease the number of rows (from 97k to approximately 14k). Is it possible that for these date ranges it would be more selective to find all models in the range first, then filter those by model type? You can confirm this with query-stats by seeing how many :rows-out emerge from your date-range tests. The createdAt query would then use AVET on :Model/createdAt (the biCreatedAt version would still use AEVT), and both would use AEVT on :Model/typeName as a filter over whatever was found by date.
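
A minimal sketch of that reordering (same attributes and inputs as your original query; worth comparing with :query-stats, and with :io-context as above):

;; Sketch: lead with the date-range clause so the engine should be able to
;; resolve it via AVET on :Model/createdAt (per the suggestion above), then
;; use :Model/typeName as a filter.
(d/query {:query '[:find (count ?e)
                   :in $ ?date1 ?date2 ?typeName
                   :where
                   [?e :Model/createdAt ?createdAt]
                   [(>= ?createdAt ?date1)]
                   [(< ?createdAt ?date2)]
                   [?e :Model/typeName ?typeName]]
          :args [(d/db conn)
                 #inst "2025-07-01T00:00:00.000-00:00"
                 #inst "2025-08-01T00:00:00.000-00:00"
                 "ViaxCustomerOrder"]
          :query-stats true})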

What settings are you using for each of these runs? You say “slow when memory is limited” and “increasing peer memory” but you don’t give any numbers for these comparisons. What are your jvm heap size, core count, and ocache settings for each of these results?

Your final two results look odd because all the io-stat numbers are wildly different from your previous runs (e.g. :ocache, which should be roughly the same every time), and they are also missing storage-load information even though you are clearly loading things (e.g. :aevt-load should come with :dev and :dev-ms, and :deserialize with :deserialize-ms). Are you sure these are exactly the same queries as the “less peer memory” cases, with the same database and input parameters? Are you omitting anything from the output?

I really appreciate your detailed explanation — it helps a lot to better understand how query execution actually works internally.
Yes, there are indeed hundreds of millions of entities in the database that have the attribute :Model/createdAt. Every entity that is created in our system has this attribute, while only a subset of them (the ones relevant for our search queries) also have the :BusinessInteraction/biCreatedAt attribute, usually differing by a few seconds or hours.

My machine has an Intel Core i9-10900 (3.7 GHz, 10 cores).
The “slow” test was run with 1 GB heap size, and the “fast” one with 5 GB.

Is it possible that for these date ranges it would be more selective to find all models in the range, then filter those by model type?

That’s actually an interesting suggestion. I tried this approach, and it performs several times faster when memory is limited (1 GB), but several times slower when more memory is available (5 GB). Sometimes it even fails with an error when the memory is low:

Execution error (OutOfMemoryError) at java.util.HashMap/resize (HashMap.java:710).
Java heap space

Your final two results look odd because all the io-stat numbers are wildly different…

I can’t really explain how that happened. I didn’t modify any input or output data. I simply copied the output and pretty-printed it with clojure.pprint/pprint.
Here’s the output for the cold query run with more memory (5 GB):

[{:ret [[13933]],
  :io-stats
  {:io-context :io/by-model-created-at,
   :api :query,
   :api-ms 2439.77,
   :reads
   {:dir 194558,
    :aevt 194797,
    :avet 21,
    :inflight-lookup-ms 20.31,
    :aevt-load 5874,
    :deserialize-ms 5902.04,
    :deserialize 5860,
    :dev-ms 1610.48,
    :avet-load 9,
    :dev 5860,
    :ocache 6128}}}]

And the second run on the warm cache shows:

[{:ret [[13933]],
  :io-stats
  {:io-context :io/by-model-created-at,
   :api :query,
   :api-ms 210.28,
   :reads {:avet 19, :aevt 194789, :dir 194558, :ocache 250}}}]

As you mentioned, all the detailed fields (like :aevt-load, :deserialize-ms, etc.) are present in the cold run but missing in the warm one — and :ocache values still differ between runs.

I think we’ll review our data model and move away from the approach where every entity in the database has a :Model/createdAt attribute.

Once again, thank you very much — your response was extremely helpful.

I can’t really explain how that happened. I didn’t modify any input or output data. I simply copied the output and pretty-printed it with clojure.pprint/pprint.

As you mentioned, all the detailed fields (like :aevt-load, :deserialize-ms, etc.) are present in the cold run but missing in the warm one — and :ocache values still differ between runs.

These two results you just posted make sense and are consistent with your other runs. Just so we’re on the same page, it’s this specific result that doesn’t make sense (missing :reads entries) or seem consistent (very different, much smaller :aevt-load numbers):

For the first (indexed) query, on a cold start (immediately after peer restart):

[{:ret [[13933]],
  :io-stats
  {:io-context :io/by-model-created-at,
   :api :query,
   :api-ms 2257.1,
   :reads {:aevt-load 5870, :deserialize 5860, :ocache 6124}}}]

I want to make sure you didn’t omit anything from :reads for brevity or change the query parameters.

I think we’ll review our data model and move away from the approach where every entity in the database has a :Model/createdAt attribute.

It really depends on what use cases these queries were meant to support. From the large quantity of data it seems like it may be supporting a batch job, but maybe this is an interactive activity feed? You may need a different, possibly composite, index, or to stop using query and use index-pull or similar.
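
For example, a hedged sketch with index-pull, assuming :Model/createdAt keeps its AVET index: it walks the index in date order from the lower bound, stops at the upper bound, and filters by type while streaming:

;; Sketch only: pull each entity in :Model/createdAt order starting at date1.
(->> (d/index-pull (d/db conn)
                   {:index    :avet
                    :selector [:db/id :Model/typeName :Model/createdAt]
                    :start    [:Model/createdAt #inst "2025-07-01T00:00:00.000-00:00"]})
     (take-while #(neg? (compare (:Model/createdAt %)
                                 #inst "2025-08-01T00:00:00.000-00:00")))
     (filter #(= "ViaxCustomerOrder" (:Model/typeName %)))
     count)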

Perhaps another post describing your specific problem at a high level would be a good place to understand and iterate. Focus on: what you’re trying to do (from a user/requirements perspective), what the obstacle is now (again from a user/requirements perspective), and what your implementation looks like today and how it falls short.