Hi,
I need to apply additional logic (like mapping, conditionals, aggregating) to entities I get from Datomic. I had a hard time translating it into a Datomic query (I’m not sure it’s even possible in my case), which is why I used Datomic’s raw index access instead, so most of the work and logic is done in Clojure.
It worked fine until I got to ~500K entries, at which point the whole approach became very slow. The relevant code:
(defn entity->entry
  "Map entity into entry"
  [e]
  {:id     (:entry/uuid e)
   ;; each flat field increases mapping time (seemingly linearly)
   :date   (:entry/date e)
   :summ   (:entry/summ e)
   ;; nested fields, however, increase mapping time significantly more
   :groups (map #(-> % :dimension/group :group/name)
                (:entry/dimensions e))})
;; query code:
(->> (d/datoms db :aevt :entry/uuid)
     (map #(->> %
                :e
                (d/entity db)
                entity->entry)))
;; TODO: other actions on mapped entries ...
It takes about 30 seconds to run the query code just to map the entities, and the more fields I need in my query, the longer it takes.
The following optimisations were made:
Using pmap instead of map, which results in a ~2x speedup.
Using the :eavt index to access entity attributes instead of d/entity. The bottleneck seems to come from accessing individual fields through d/entity, so I came up with the idea of reading all of an entity’s fields at once via (d/datoms db :eavt eid) and then mapping them in Clojure with cached idents.
Optimization code:
(def idents
  "Cached map of all idents for attributes"
  (into {}
        (d/q '[:find ?e ?ident
               :where [?e :db/ident ?ident]]
             (d/db conn))))
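A possible alternative to the hand-rolled cache is d/ident, since idents are cached in memory on every peer; a minimal sketch (datom->pair is just an illustrative name):
;; idents are cached in memory on every peer, so d/ident is a cheap lookup
(defn datom->pair
  [db dtm]
  [(d/ident db (:a dtm)) (:v dtm)])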
(defn eid->entry
  [db eid]
  (->> (d/datoms db :eavt eid) ; access all the entity's datoms by eid at once
       ;; note: for cardinality-many attributes (e.g. :entry/dimensions),
       ;; later datoms overwrite earlier values in the map
       (reduce (fn [m dtm]
                 (let [attr-key (get idents (:a dtm))
                       v        (:v dtm)]
                   (assoc m attr-key v)))
               {}))) ; explicit seed; without it reduce starts from the first datom
;; new query code
(->> (d/datoms db :aevt :entry/uuid)
     (pmap #(->> %
                 :e
                 (eid->entry db))))
Now it runs in ~7 seconds, which is good, but I worry about future performance as the data grows.
Is my whole approach wrong? Are there easier ways to achieve better performance? Am I missing something?
My settings and parameters:
Datomic version: datomic-pro-0.9.5661
Datomic capacity settings in the properties file:
memory-index-threshold=32m
memory-index-max=512m
object-cache-max=1g
clj -Sdeps '{:deps {:github-terjesb {:git/url "https://gist.github.com/terjesb/c79d8a84e3eaa1bb7861fea17ee5a744" :sha "17005d2d36edc1783cfb0cb9eb3016cfc05cf798"}}}' -m user
I’m not sure if pull can elide a nested attribute in the result while still keeping its descendants, so I’m doing the grouping into :groups as a separate step.
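For reference, the nested pull plus the separate flattening step might look something like this sketch (attribute names taken from the mapping code above; eid->entry* is just an illustrative name):
(defn eid->entry*
  [db eid]
  (let [e (d/pull db
                  '[:entry/uuid :entry/date :entry/summ
                    {:entry/dimensions [{:dimension/group [:group/name]}]}]
                  eid)]
    {:id     (:entry/uuid e)
     :date   (:entry/date e)
     :summ   (:entry/summ e)
     ;; pull keeps the nesting, so :groups is flattened afterwards
     :groups (map #(get-in % [:dimension/group :group/name])
                  (:entry/dimensions e))}))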
I would like to go back to the beginning and understand why this was hard to write as a query. From the functional translation you posted, it looks like a lookup plus a few joins.
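For instance, a sketch of that shape (attribute names assumed from your mapping code; it yields one row per entry/group pair, so building :groups would still be an aggregation step afterwards):
(d/q '[:find ?uuid ?date ?summ ?group-name
       :where
       [?e :entry/uuid ?uuid]
       [?e :entry/date ?date]
       [?e :entry/summ ?summ]
       [?e :entry/dimensions ?dim]
       [?dim :dimension/group ?g]
       [?g :group/name ?group-name]]
     db)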
What is the schema, and what are the entity counts on both sides of the join? (If the 500K number is just the “entry” side, that number could be obscuring a ton of work on the “dimensions” side.)
This query’s performance is likely to be dominated by the peer, not the transactor or storage. Can you break out your settings to cover the three different machines (transactor, peer, storage) in play here?
What are the requirements? How often does this query run? Is the peer cold or hot at the time? What else will be running on the same box at the same time? Does your performance test match the anticipated production situation?
The “hard to write as a query” part wasn’t about “get all entries from the database”, but about the whole task the query function tries to accomplish: a process of mapping/transforming/conditionals/aggregating which was hard to translate to Datalog. So I resorted to traversing and processing each entry manually, mapping the entries before passing them to further logic (transforming/conditionals/aggregating etc.); see ;; TODO: other actions on mapped entries ... in my code excerpt.
Why bother writing a pull query with only an [?e :entry/uuid] clause if raw :aevt access gets the same result, only faster?
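That is, something like this sketch (the pull query I have in mind):
(d/q '[:find [(pull ?e [:entry/uuid :entry/date :entry/summ]) ...]
       :where [?e :entry/uuid]]
     db)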
Current entity counts — entries: ~500K, dimensions: ~7500, dimension groups: 10.
Schema (irrelevant parts and attributes are omitted):
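A sketch of the relevant attributes, as implied by the mapping code (the value types here are guesses):
;; relevant attributes, reconstructed from the mapping code;
;; the value types below are guesses
[{:db/ident       :entry/uuid
  :db/valueType   :db.type/uuid
  :db/cardinality :db.cardinality/one
  :db/unique      :db.unique/identity}
 {:db/ident       :entry/date
  :db/valueType   :db.type/instant
  :db/cardinality :db.cardinality/one}
 {:db/ident       :entry/summ
  :db/valueType   :db.type/bigdec
  :db/cardinality :db.cardinality/one}
 {:db/ident       :entry/dimensions
  :db/valueType   :db.type/ref
  :db/cardinality :db.cardinality/many}
 {:db/ident       :dimension/group
  :db/valueType   :db.type/ref
  :db/cardinality :db.cardinality/one}
 {:db/ident       :group/name
  :db/valueType   :db.type/string
  :db/cardinality :db.cardinality/one}]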
Actually, all three parts are running on my single dev machine (PC: 8GB RAM, Intel Core i7-3770 CPU @ 3.40 GHz). In production, all parts are supposed to run on a single machine too, with slightly better hardware.
Not often, really; the query runs when the user requests aggregated data about entries (which happens on initial load of the UI and when applying filters, changing dates, etc.). I’m not sure what you mean by a cold or hot peer?
More generally, “why use query instead of walking indexes myself?” Query has optimizations that your index walk might not, e.g. hash joins. And walking indexes cuts you off from future performance enhancements to query. So only do it when you are sure you need it.
Some query optimizations take advantage of fully realizing the result set in memory, so you might see better query vs. index performance with more RAM on the peer.
By using pmap for this query, and running all the processes on one box, you are expressing intent to let this query grind down all other activity while it runs. Is it really that urgent? If your SLA covers other jobs than this one query, you may hurt yourself overall.
Peers are part of your database, and they cache data as they go. If your test is from a cold start, then data will not be in the cache yet.
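One way to see the difference is to time the same call twice from a fresh REPL (entries here is a hypothetical wrapper around your query code):
(time (doall (entries db))) ; cold run: segments fetched from storage
(time (doall (entries db))) ; warm run: segments served from the object cache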
All that said, you can always beat a general thing with a custom specific thing – if you fully understand how all the pieces work! If you need it, do it.