Performance degradation and heap overflow with large collection binding input

Hi everyone,

I’m working on a query pattern where I pass a large number of entity IDs into the query as a collection binding. Here’s a simplified version of the query:

[:find ?MaterialImpl1
 :in $ [?MaterialImpl1 ...] ?MaterialImpl1_value_0_VR
 :where
 [?MaterialImpl1 :Model/typeName ?MaterialImpl1_value_0_VR]]

When the number of IDs is relatively small (e.g., up to 100,000), the query completes within a few seconds. However, at around 1,000,000 IDs the query slows down drastically and eventually fails with a Java heap space error (java.lang.OutOfMemoryError), which I assume is caused by memory pressure during query execution.

I’m trying to understand: Is there a known upper limit or best practice for using large collection bindings like [?e ...]?

Any advice or experience would be appreciated. Thank you!

Additionally, I’ve noticed that if we run a query like this:

[:find ?MaterialImpl1
 :in $ ?MaterialImpl1_value_0_VR
 :where
 [?_ :someAttribute ?MaterialImpl1]
 [?MaterialImpl1 :Model/typeName ?MaterialImpl1_value_0_VR]]

where ?MaterialImpl1 ends up bound to exactly the same set of entity IDs as in the problematic case, the query executes much faster and uses far less memory.

The issue is that it’s not always possible to construct the query this way, since the set of entity IDs bound to ?MaterialImpl1 may come from multiple places in the application logic, not just from a single [?_ :someAttribute ?MaterialImpl1] clause. In those cases, we are forced to pass a large collection as an input binding, which causes the performance degradation and memory pressure described in this thread.
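
To make the constraint concrete, here is a minimal sketch of one possible workaround: splitting the ID collection into smaller batches and running the same query once per batch. This is untested against Datomic and may or may not avoid the slowdown. The helper name query-in-batches, the run-query argument, and the batch size are all hypothetical; run-query is assumed to be a one-argument function that takes a batch of IDs and returns a collection of result tuples (for example, a wrapper around the first query above).

```clojure
;; Hypothetical helper, not part of Datomic.
;; run-query: assumed function of one argument (a batch of entity IDs) that
;;            returns a collection of result tuples for that batch, e.g.
;;            #(d/q query db % type-name), assuming datomic.api is aliased as d.
;; ids:       the full collection of entity IDs.
;; batch-size: how many IDs to bind per query call (value to be tuned).
(defn query-in-batches
  [run-query ids batch-size]
  ;; Split ids into batches, run the query on each batch, and merge all
  ;; result tuples into one set so duplicates collapse as in a single query.
  (into #{}
        (comp (partition-all batch-size)
              (mapcat run-query))
        ids))

;; Stand-in for a real query, used only to show the call shape:
;; it keeps the even IDs and returns each one as a 1-tuple.
(defn fake-query [batch]
  (for [id batch :when (even? id)] [id]))

(query-in-batches fake-query (range 10) 3)
```

Since each ID is matched independently in the :where clause, merging per-batch results into a set should give the same answer as one large query, but it does not explain why the single large binding is so much more expensive.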

@jaret I’d really appreciate it if you could confirm whether an efficient solution for this problem exists.

P.S. It’s quite surprising that a query operating over the exact same data set can have such a significant difference in performance.