Bad query performance or OOM for queries that join two variables

I have a schema that can be described by the following relations:
Order 1->* User 1->1 Address. In EDN it is described by the following attributes:

{
	:db/ident :User/address
	:db/valueType :db.type/ref
	:db/isComponent true
	:db/cardinality :db.cardinality/one
}

{
	:db/ident :Address/zipCode
	:db/valueType :db.type/string
	:db/index true
	:db/cardinality :db.cardinality/one
}

{
	:db/ident :Order/users
	:db/valueType :db.type/ref
	:db/cardinality :db.cardinality/many
}

And here is the original Datalog query (in plain English: get orders where users.address != 17592186093790 AND users.address.zipCode != "12345"):

[:find
 [?Order ...]
 :in $
 :where

 ; The next two conditions leave ?User_1 bound to 2000 elements
 [?User_1 :User/address ?User_address_compare_value_2]
 [(!= ?User_address_compare_value_2 17592186093790)]

 ; The next three conditions leave ?Address_4 bound to 2000 elements
 [?Address_4 :Address/zipCode ?Address_zipCode_compare_value_5]
 [(!= ?Address_zipCode_compare_value_5 "12345")]
 ; I believe the issue is in the next condition
 [?User_1 :User/address ?Address_4]

 [?Order :Order/users ?User_1]
]

This query executes quickly on a small amount of data, but when, for example, there are 4000 orders, each linked to 5 of 2000 users, and each user has its own address (so there are 2000 addresses), an out-of-memory error occurs. I believe the problem happens while executing the condition [?User_1 :User/address ?Address_4], which joins ?User_1 and ?Address_4. My current workaround is to split the query above into two separate queries and intersect their results:

(def orders1 (d/q '[:find
 [?Order ...]
 :in $
 :where

 [?User_1 :User/address ?User_address_compare_value_2]
 [(!= ?User_address_compare_value_2 17592186093790)]

 [?Order :Order/users ?User_1]
]
(d/db conn)))

(def orders2 (d/q '[:find
 [?Order ...]
 :in $
 :where

 [?Address_4 :Address/zipCode ?Address_zipCode_compare_value_5]
 [(!= ?Address_zipCode_compare_value_5 "12345")]
 [?User_1 :User/address ?Address_4]

 [?Order :Order/users ?User_1]
]
(d/db conn))
)

(clojure.set/intersection (set orders1) (set orders2))

This way the result is returned almost instantly.
My questions: am I doing something wrong in the original query? Can this issue be solved while still getting the result from a single query?


I think that every Datalog clause either adds tuples to the working set or removes already-added tuples, depending on what is bound when that clause is processed. These sorts of out-of-memory and performance issues usually crop up when you inadvertently introduce something like a cross product that has to be materialized for the query process to continue.
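To make the blow-up concrete, here is a minimal sketch in plain Clojure (the counts are taken from the numbers in the post above; the ranges are just stand-ins for entity ids) of how many intermediate tuples a cross product between two independently-bound 2000-element relations would materialize:

```clojure
;; Stand-ins for the two relations bound before the join clause runs:
;; ?User_1 (2000 users) and ?Address_4 (2000 addresses).
(def users     (range 2000))
(def addresses (range 2000))

;; If the engine must consider every combination before the join
;; filters them, the working set is the full cross product:
(* (count users) (count addresses))
;; 2000 x 2000 = 4,000,000 candidate tuples, each carrying both bindings
```

That intermediate size, not the final result, is what eats the memory.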

In this case it looks like you’re binding ?User_1 :User/address to both ?User_address_compare_value_2 and ?Address_4. Working from the English description of your query, it looks like you’re only finding users based on address (i.e. all addresses except one specific entity and one specific zip code). I think you should start your query with addresses, then join users (1:1), then finally join orders (which adds a tuple for every user/order pair and removes the tuple for every user without orders). I also moved your constants into the inputs.

[:find
 [?Order ...]
 :in $ ?notAddress ?notZip
 :where

 [?address :Address/zipCode ?zipCode]
 [(!= ?zipCode ?notZip)]
 [(!= ?address ?notAddress)]

 [?user :User/address ?address]
 [?Order :Order/users ?user]
]

Edit: I made some assumptions about your data. If you have many more addresses than users, it would make more sense to re-order this so that the [?user :User/address ?address] clause comes first and the “removal” clauses come after, but either way the key is not binding user address to two different variables.
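For completeness, a sketch of how the rewritten query would be invoked, assuming `d` is `datomic.api` and `conn` is your connection (the constant values are the ones from the original post):

```clojure
;; The two constants are now passed as inputs (?notAddress, ?notZip)
;; instead of being baked into the :where clauses.
(d/q '[:find [?Order ...]
       :in $ ?notAddress ?notZip
       :where
       [?address :Address/zipCode ?zipCode]
       [(!= ?zipCode ?notZip)]
       [(!= ?address ?notAddress)]
       [?user :User/address ?address]
       [?Order :Order/users ?user]]
     (d/db conn)
     17592186093790
     "12345")
```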

Hi @adam, thank you for posting your answer. My oversight: I didn’t explain some general context. We have a system that accepts SQL-like expressions (written by users against the high-level schema), like the one in the post, and generates Datalog clauses for each expression between operators (AND, OR). So the query might be not only get orders where users.address != 17592186093790 AND users.address.zipCode != "12345" but also users.address != 17592186093790 AND users.address.zipCode != "12345" AND users.address.line1 != "some line 1 text". So we can’t hand-tune each query the way you did for my example :slight_smile:

P.S. For the last-mentioned query the generated clauses are:

 ; Built from users.address != 17592186093790 
 [?User_1 :User/address ?User_address_compare_value_2]
 [(!= ?User_address_compare_value_2 ?User_address_value_3)]

 ; Built from users.address.zipCode != "12345" 
 [?Address_4 :Address/zipCode ?Address_zipCode_compare_value_5]
 [(!= ?Address_zipCode_compare_value_5 ?Address_zipCode_value_6)]

 ;Built from users.address.line1 != "some line 1 text"
 [?Address_4 :Address/line1 ?Address_line_compare_value_7]
 [(!= ?Address_line_compare_value_7 ?Address_line_value_8)]
 
 ; The next two conditions are common part for all three sql expressions, 
 ; so all interesting addresses are found, we should determine all users which 
 ; have those addresses and orders which have reference on found users
 [?User_1 :User/address ?Address_4]
 [?Order :Order/users ?User_1]

Sure, your query does have a “generated” flavor to it :wink:

So, two things:

  1. Even though you’re generating the query, you still shouldn’t be generating different bindings for the same datom. I don’t know the exact semantics of your expression language (probably outside the scope of a forum thread!), but if you can recognize that, e.g., ?User_address_compare_value_2 and ?Address_4 are the same thing and use a single symbol for both, I think you will get most of the benefit, even if the clause order is sub-optimal.
  2. Since query correctness in Datomic Datalog is entirely independent of clause order, you can think of a query optimizer as a function that takes a query and returns a query with the clauses (potentially) re-ordered. I haven’t tried it myself, but a project was mentioned here a while ago that attempts to apply some heuristics for performance improvements: https://github.com/ParkerICI/datomic-query-improver ; generated queries are a key use case.
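To illustrate point 1 with your generated clauses (a sketch only, not tested against your generator): once the duplicate binding is recognized, all three sub-expressions can share a single ?Address_4 variable, and ?User_address_compare_value_2 disappears entirely:

```clojure
;; users.address != 17592186093790 — compare the address entity itself
[?User_1 :User/address ?Address_4]
[(!= ?Address_4 ?User_address_value_3)]

;; users.address.zipCode != "12345" — reuse ?Address_4, no second binding
[?Address_4 :Address/zipCode ?Address_zipCode_compare_value_5]
[(!= ?Address_zipCode_compare_value_5 ?Address_zipCode_value_6)]

;; users.address.line1 != "some line 1 text"
[?Address_4 :Address/line1 ?Address_line_compare_value_7]
[(!= ?Address_line_compare_value_7 ?Address_line_value_8)]

[?Order :Order/users ?User_1]
```

With only one variable bound from [?User_1 :User/address …], there is no cross product to materialize.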

If you can’t manage to unify duplicate bindings, then mechanically splitting up the logical pieces as you illustrated in your original post is probably the way to go. Run each sub-expression independently and then combine the results based on whatever operator was used to combine those sub-expressions in the original SQL expression.
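In plain Clojure, combining per-sub-expression result sets might look like this (a sketch; `combine-and`/`combine-or` are hypothetical helper names, and the inputs are the collections returned by the split queries, like `orders1` and `orders2` in the original post):

```clojure
(require '[clojure.set :as set])

;; AND between sub-expressions -> set intersection of their results
(defn combine-and [& result-seqs]
  (apply set/intersection (map set result-seqs)))

;; OR between sub-expressions -> set union of their results
(defn combine-or [& result-seqs]
  (apply set/union (map set result-seqs)))

;; e.g. (combine-and orders1 orders2) reproduces the workaround above
```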


Big thanks to you for:

Even though you’re generating the query, you still shouldn’t be generating different bindings to the same datom

It really helps, even for the first case.

If you can’t manage to unify duplicate bindings

Actually we can :slight_smile: ?User_1 and ?Address_4 in the above example are already the result of unifying duplicate bindings, but your suggestion pointed me to one more place where unifying should be done.