Query populous business-entity where there are changes at any level

Hello Datomic Forum! I hope my first post is up to snuff.

I’m trying to work out the most efficient way to query against a large
set of entities for a given time range. Specifically, we need to return
only the members of this group that have changed since a certain point
in time.

Since we’re dealing with a business-entity that has multiple
component-ref levels, it isn’t sufficient to only consider the root when
looking for changes. We use a recursive rule to find all the component
refs given the root:

(def component-rules
  "All component and transitive component entities."
  '[[(component ?e ?a ?child)
     [?e ?a ?child]
     [?a :db/isComponent true]]
    [(component ?e ?b ?grandchild)
     [?e ?a ?child]
     [?a :db/isComponent true]
     (component ?child ?b ?grandchild)]])

(defn component-tree
  "Given an eid, return all eids including it that can be directly or
  transitively traversed through component attributes."
  [db eid]
  (let [eid (d/entid db eid) ;; allow for passing a lookup ref
        descendents (d/q '[:find [?descendant ...]
                           :in $ % ?e
                           :where
                           (component ?e ?a ?descendant)]
                         db component-rules eid)]
    (into #{eid} descendents)))

We can then ask if an entity has changed since a given t:

(defn updated-since-t?
  "Return true if the given business entity -- i.e. the reference tree
   implied by eid -- has had any alterations since t."
  [db t eid]
  (let [db-since (d/since db t)
        eids (component-tree db eid)]
    (boolean
      (d/q '[:find ?e .
             :in $ [?e ...]
             :where [?e]]
           db-since eids))))

Here I need to get a hold of all the roots. A wasteful approach
would be to filter

(d/datoms db :aevt :membership-root-attr)

by updated-since-t?. But that’s prohibitive given number of members
and requirements around time. Note also that we can’t pass db-since
to d/datoms here because we would only iterate business entities
that had been created since-t – it’s a characteristic of our root
membership-attr that it’s only touched on creation.

At this point, I started to wonder if time is a more selective
criterion and if we could start with a tx range. I’ll elide the code,
but we can efficiently compute the maximal attribute set for a given
business-entity, and then pass those attributes to a query against
a tx range. This is pretty fast for a day range and 125 possible
attributes at all levels of the business entity.

(defn changes-for-span-by-attrs
  [db log attributes t-start t-end]
  (d/q '[:find [?e ...]
         :in $ ?log [?a ...] ?t1 ?t2
         :where
         [(tx-ids ?log ?t1 ?t2) [?tx ...]]
         [(tx-data ?log ?tx) [[?e ?a]]]]
       db log attributes t-start t-end))

Now we need filter to the common roots, which is where I can’t make
this approach fast enough.

(def get-root-rules
  '[[(component-ancestors ?e ?a ?parent)
     [?parent ?a ?e]
     [?a :db/isComponent true]]
    [(component-ancestors ?e ?a2 ?ancestor)
     [?parent ?a ?e]
     [?a :db/isComponent true]
     (component-ancestors ?parent ?a2 ?ancestor)]
    [(get-root ?e ?root ?membership-attr)
     [?e ?membership-attr]
     [(identity ?e) ?root]]
    [(get-root ?e ?ancestor ?membership-attr)
     (component-ancestors ?e _ ?ancestor)
     [?ancestor ?membership-attr]]])

(defn changed-entities-for-range-by-attrs
  ;; Takes too much time; i haven't seen it finish for a day range
  [db log membership-attr attributes t-start t-end]
  (d/q '[:find ?root
         :in $ ?log % ?membership-attr [?a ...] ?t1 ?t2
         :where
         [(tx-ids ?log ?t1 ?t2) [?tx ...]]
         [(tx-data ?log ?tx) [[?e ?a]]]
         ;; Then get root which could be the given ?e or an ancestor:
         (get-root ?e ?root ?membership-attr)]
       db log get-root-rules membership-attr attributes t-start t-end))

Happy to have any suggestions!

Hi @uwo,

I have a lot of questions, so I am going to break it up into bullet points so we can respond in-line.

  • Would it be fair to characterize this problem as a “Return the tree root for all trees where a node changed within some time interval” problem?

  • Assuming the answer is “yes” to the above question, will you need more tree/graph related queries in the future?

  • What is the distribution of tree depths? (a histogram would be great – a spreadsheet would work for this as data and percentiles)

  • What is the distribution of child node cardinality from tree root? (again histogram)

  • Of your entire transaction load, what percentage per month affects one or more of these trees?

  • Will these ALWAYS be trees or will this eventually become a graph?

  • Will the child nodes ever belong to multiple trees?

  • What is your growth rate in Datoms per month?

  • Are the attributes ever cardinality many ref attributes? Will they be in the future?

  • Are all descendent nodes component attributes?

  • Are the attributes ever cardinality many ref attributes? Will they be in the future?

  • Is the set of attributes growing?

  • Realistically, how fast is “Fast enough” for your p50, p75, p90, p95, p99, p99.9 and pMAX? (perhaps a table would help)

  • Can you provide a small but exemplary schema with 3+ attributes?

  • Where are these query results going? An ETL job, directly to a mobile app/ webpage, or another system?

Q: Would it be fair to characterize this problem as a “Return the tree root for all trees where a node changed within some time interval” problem?
A: Yes.

Q: Assuming the answer is “yes” to the above question, will you need more tree/graph related queries in the future?
A: For other business entities, likely yes.

Q: What is the distribution of tree depths? (a histogram would be great – a spreadsheet would work for this as data and percentiles)
A: I hope a frequency map suffices as a histogram. For bills, I’ve
included the histogram with and without our legacy data. As you can see,
all legacy bills have a depth of 1. We don’t want to include legacy
data for the query at hand. Early on we just iterated over an :avet :from-legacy? false.

Though legacy bills currently dominate the number of bills in our
system , they are also the smallest trees as you’ll see in the answer to
the next question.

;; We can pretty much ignore trees of depth 4 and higher -- I'm pretty
;; sure those are anomalous.
;; Note there are no flat bills, i.e. trees of depth 0.

;; With legacy bills
{1 7296512
 2 83893
 3 79159
 4 11
 5 8
 6 16
 7 1}

 ;; Without legacy bills
{1 24
 2 83893
 3 79159
 4 11
 5 8
 6 16
 7 1}

Q: What is the distribution of child node cardinality from tree root? (again histogram)
A: I’m answering this as, “What is the distribution of tree sizes?”
Again, I’ll split this into with and without legacy data. The legacy
trees are much smaller

;; Stats excluding legacy
{:max 128
 :mean 14.283841777429005
 :median 14
 :min 5
 :percentiles {25 12
               50 14
               75 16
               90 19
               95 23
               99 28
               99.9 44}
 :sample-count 163112
 :stdev 4.349581307924771
 :sum 2329866.0
 :variance 18.918857554248557}

;; Stats including legacy
{:max 128
 :mean 4.2020218242265
 :median 4
 :min 3
 :percentiles {25 4
               50 4
               75 4
               90 4
               95 4
               99 14
               99.9 23}
 :sample-count 7459600
 :stdev 1.645663974917032
 :sum 3.1345402E7
 :variance 2.7082099183397257}

;; And here are the raw frequencies:

;; Excluding legacy
{5 391
 6 122
 7 6306
 8 819
 9 4455
 10 2038
 11 11681
 12 37563
 13 10530
 14 37922
 15 7692
 16 11396
 17 3869
 18 4353
 19 9170
 20 3982
 21 1343
 22 1141
 23 982
 24 2574
 25 2441
 26 378
 27 322
 28 295
 29 343
 30 342
 31 105
 32 102
 33 65
 34 62
 35 54
 36 35
 37 13
 38 15
 39 29
 40 2
 41 1
 42 13
 43 2
 44 1
 45 2
 46 20
 47 1
 48 1
 50 6
 51 1
 52 1
 54 25
 55 1
 57 1
 58 16
 60 1
 61 1
 62 13
 63 1
 65 1
 66 14
 67 2
 70 11
 71 3
 74 7
 77 2
 78 3
 82 2
 85 1
 86 3
 90 4
 91 1
 94 5
 95 1
 98 4
 102 4
 103 1
 106 1
 110 1
 128 1}

;; Including
{3 170416
 4 7126072
 5 391
 6 122
 7 6306
 8 819
 9 4455
 10 2038
 11 11681
 12 37563
 13 10530
 14 37922
 15 7692
 16 11396
 17 3869
 18 4353
 19 9170
 20 3982
 21 1343
 22 1141
 23 982
 24 2574
 25 2441
 26 378
 27 322
 28 295
 29 343
 30 342
 31 105
 32 102
 33 65
 34 62
 35 54
 36 35
 37 13
 38 15
 39 29
 40 2
 41 1
 42 13
 43 2
 44 1
 45 2
 46 20
 47 1
 48 1
 50 6
 51 1
 52 1
 54 25
 55 1
 57 1
 58 16
 60 1
 61 1
 62 13
 63 1
 65 1
 66 14
 67 2
 70 11
 71 3
 74 7
 77 2
 78 3
 82 2
 85 1
 86 3
 90 4
 91 1
 94 5
 95 1
 98 4
 102 4
 103 1
 106 1
 110 1
 128 1}

Q: Of your entire transaction load, what percentage per month affects one or more of these trees?
A: For January of this year,

{:percentage ~0.439
 :total-txes 1044746
 :bill-txes 458716}

Q: Will these ALWAYS be trees or will this eventually become a graph?
A: Always trees. Some of these business entities may be part of a graph
if we include non-component references, but we’re only interested in the
component trees.

Q: Will the child nodes ever belong to multiple trees?
A: No. While researching this, I did find some child component-refs that
are being improperly shared across business entities. This is a bug that
we intend to fix.

Q: What is your growth rate in Datoms per month?
A: Still working on this – I’m running into trouble getting my initial
counts using d/datoms and d/as-of. I’ll get back to you once I’ve
figured it out.

Q: Are the attributes ever cardinality many ref attributes? Will they be in the future?
A: Yes.

Q: Are all descendent nodes component attributes?
A: Yes.

Q: Is the set of attributes growing?
A: Yes, with new feature work and dynamically in a some cases.

Q: Realistically, how fast is “Fast enough” for your p50, p75, p90, p95, p99, p99.9 and pMAX? (perhaps a table would help)
A: Please forgive my original language here. We have no SLA for this,
and I sort of buried the lede in this comment, “Takes too much time;
I haven’t seen it finish for a day range.” So, I suppose my pMAX is
finishing at all for a day-range.
Funny enough, while calculating the percentage of txes that touch
bills, I figured out that I needed to divide the query for a day’s
transactions into many smaller t-ranges, and that did the trick. It was
memory that I was butting up against when I attempted the percentage
calculation for a day without partitioning into ranges, and I suspect the
same is true of the original post.

Q: Can you provide a small but exemplary schema with 3+ attributes?
A: (Shared out of band.)

Q: Where are these query results going? An ETL job, directly to a mobile app/ webpage, or another system?
A: ETL. We’re exporting data to our analytics team. They’ve been
pulling data so far, but we’re likely to move to a push-approach with the
tx-report-queue plus some message queue. It’ll still be good to have
this query, however.

Awesome! So, I just want to confirm this resolves the whole issue for you. Or am I misunderstanding.

We’re always interested in understanding if customers have seen Datomic Analytics or if you’ve considered using analytics for this task as then you won’t have to ETL your data. I’d be happy to discuss details in a private case if that is not relevant to the current discussion or business data.

Q: We’re always interested in understanding if customers have seen
Datomic Analytics or if you’ve considered using analytics for this
task…
A: We definitely took a close look at the Presto connector. At the
moment, Datomic Analytics doesn’t seem like a good fit for us because of
computed properties. We can’t expect the other silo in our organization
to replicate the logic for our computed properties, especially some of
the computed properties that leverage Datomic history.

Q: So, I just want to confirm this resolves the whole issue for you. Or
am I misunderstanding.
A: I spoke too soon. I shouldn’t have speculated idly without first
testing. The issue was never computing all datoms that touch bills
changes-for-span-by-attrs in the original post – that function
returns almost immediately for a day range. The slow part of my code is
computing ancestors, which is needed to compute only the roots. Let me
illustrate (with some improved code):

(defn touched-eids-for-t-range-by-attrs
  ;; Named `changes-for-span-by-attrs` in the original post. This is
  ;; unchanged and was always fast, even for hundreds of attributes.
  "Given all possible business-entity attributes and a t-range, return
  the eids that were touched during that time."
  [db log attributes t-start t-end]
  (d/q '[:find [?e ...]
         :in $ ?log [?a ...] ?t1 ?t2
         :where
         [(tx-ids ?log ?t1 ?t2) [?tx ...]]
         [(tx-data ?log ?tx) [[?e ?a]]]]
       db log attributes t-start t-end))

(def component-ancestors-rules
  ;; This is simplified from the original
  '[[(component-ancestors [?e] ?parent)
     [?parent ?a ?e]
     [?a :db/isComponent true]]
    [(component-ancestors [?e] ?ancestor)
     (component-ancestors ?e ?ancestor)]])

(defn component-ancestors
  "Given an eid, return all ancestor eids that can be directly or
  transitively traversed thru component attributes."
  [db eid]
  (d/q '[:find [?ancestors ...]
         :in $ % ?e
         :where
         (component-ancestors ?e ?ancestors)]
       db component-ancestors-rules eid))

(defn get-root2
  ;; This replaces my original get-root, which used a recursive rule.
  "Returns the root eid or nil if the given eid isn't a member."
  [db membership-attr eid]
  (let [is-root? #(get (d/entity db %) membership-attr)]
  (if (is-root? eid)
    eid
    (let [as (component-ancestors db eid)]
      (first (filter is-root? as))))))

(defn get-touched-roots-for-range
  [db membership-attr attrs t-start t-end]
  (let [;; Needed to bind to attrs returned by tx-data:
        attr-ids (mapv #(d/entid db %) attrs)]
    (let [src (touched-eids-for-t-range-by-attrs db (d/log conn) attr-ids t-start t-end)
          xf (keep #(get-root2 db membership-attr %))]
      (into #{} xf src))))

Joe Lane mentioned to me that my recursive rules could be the reason
for the slowness, so I got rid of the get-root rule and slightly
improved component-ancestors. These changes got me from 85 seconds to
21 seconds for a day-range.

Outside of the two approaches I outlined in the original post, I’m curious
if there are others I should consider. We could always write a
:last-updated attribute on the root for every transaction that touches
a bill, but that doesn’t feely very Datomic-y.

Judging by what our legacy system sees on an average day, we can
expect about 10 times the number of bills coming through, and that
would be without any new customers. Are there better ways to answer
the question, “Return the tree root for all trees where a node changed
within some time interval”, or are the approaches outlined above what it
comes down to.

PS: Also, please forgive the confusing timeline I’ve presented of
“not returning at all” versus “taking too long”. I am now able to
get a result, and it’d be nice to know if there are any other obvious
improvements or alternative approaches.

Welp, I found it! Replacing my rules-based component-ancestors above with the one below got me from 21 seconds to 3 seconds for the same t-range. Those are cold-start times.

(defn component-ancestors2
  [db eid]
  (lazy-seq
    (when-let [;; is eid the value of any component attribute?
               component-parent (->> (d/datoms db :vaet eid)
                                     (filter #(:is-component (d/attribute db (:a %))))
                                     ;; Assumes no component references are shared:
                                     first
                                     :e)]
      (cons component-parent (component-ancestors2 db component-parent)))))

I’m definitely still curious to know about alternative approaches, and perhaps if there’s a lesson here about recursive rules. But, I know y’all don’t have infinite time, so I understand if you’d like to stick a fork in this one. :slight_smile:

An alternative approach:
Instead of write a “last changed” attribute into the database, you could keep that data in a peer-local index. This would need some time to initialize at start-up (with a potentially expensive query or re-calculation using historical transaction logs), but with that datastructure in place you could possibly have a way to make a very quick way of aggregating relations in components and detect changes from the transaction log.

I don’t have any datastructure in mind for you particular problem, but possibly something like:
entity-id → anscestor-id
and
ancestor-id → last change-tx

would do the trick quite easily. Some care has to be taken when new entities are created, but in my experience that is not very complicated to make some check if there is a new entity showing up.

Oh, btw, since entity ID:s are longs, consider using https://github.com/clojure/data.int-map

Hey, thanks for the suggestion! I’m pretty pleased with the current performance after replacing the recursive rules in the computation, but you raise a great idea.

If I understand, I believe we can accomplish something similar by merely memoizing get-root2 above – though it would not be as efficient in-memory, nor as direct, as what you outline using something like the data.int-map library.

P.S: Also, just to clarify, we do not write a :last-changed attribute, and prefer to compute that kind of data based on the component tree.

P.S-2: I would need to test this more to see if there were an appreciable speed up – but from a brief test, it appears that the peer-cache may already be benefitting get-root2 and get-component2 so much that memoizing provides little to no benefit. Again, this was a cursory test on my part.