Imposing peer `query` resource limits

Are there any practical ways to impose resource limits on individual queries, without resorting to dedicating a JVM to the query? In general I’ve found that a poorly constructed query can very easily take out an entire Peer, and the chaos that follows makes it difficult to track down the source of the issue. The scenario usually goes something like this:

  1. A misbehaved query is executed on a peer. It attempts to do something that quickly exhausts the available heap, while also producing copious amounts of garbage that can be collected.
  2. The JVM goes into a tailspin as more and more CPU cycles are consumed by the garbage collector. This isn’t always a simple “out of memory” condition: the working set of the query might be quite small, but it is doing something exponential in terms of the data it touches, producing garbage as fast as it can be collected.
  3. The CPU load causes other things to time out, connections get dropped, and “Transactor Unavailable” exceptions start popping up all over the place, none of them directly related to the bad query.

Are there any good “on JVM” ways of containing the impact a single query can have? A timeout on its own doesn’t really help here, since the one query that “brings down” the JVM tends to make everything take forever. Given that all the “work” queries do occurs in shared, Datomic-controlled thread pools, I’ve found it generally difficult to associate a given poor behavior with the query that caused it.
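One partial workaround for the attribution problem is to dispatch each query from a thread whose name carries a query label, so that thread dumps and CPU profilers can at least tie hot threads back to a query. A minimal Java sketch (the label and the query body here are made up; this only helps for work done on the dispatching thread, not work that Datomic hands off to its own internal pools):

```java
import java.util.concurrent.*;

public class NamedQueryRunner {
    // Run a query on a thread whose name embeds a query label, so thread
    // dumps and profilers can attribute CPU time back to that query.
    public static <T> T runLabeled(String label, Callable<T> query) throws Exception {
        ExecutorService ex = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "query-" + label);
            t.setDaemon(true);
            return t;
        });
        try {
            return ex.submit(query).get();
        } finally {
            ex.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a real query call: just report the thread name.
        String name = runLabeled("find-orders", () -> Thread.currentThread().getName());
        System.out.println(name); // prints "query-find-orders"
    }
}
```

Creating a fresh single-thread executor per query is wasteful, but it keeps the sketch simple; a real version would use a pooled executor with a per-task rename.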

@adam I can corroborate these exact behaviors in our production deployment. General code-instrumentation tools like Honeycomb have helped us identify bad-apple queries, but the impacts you’re describing are similar to what our team has observed. I believe that because the datalog engine needs to realize each logic variable in memory to perform any joins, and because there are no heuristics to estimate the size of a where clause in advance, there may not be any obvious recourse. We suspect specifically that the “transactor not available” errors are an outcome of the GC event skewing some internal timing. I’d be curious to hear if the Datomic team has addressed anything like this in the cloud offering.

Yeah, I’ve had some pretty good success with Glowroot recently on the instrumentation side. Fundamentally I assume it is doing the same thing as every other JVM APM-type tool: measuring wall-clock time between when a method is called and when it returns. Great for identifying and logging slow queries, capturing stack traces over a certain time threshold, etc. Where it breaks down (not sure if Honeycomb has a way around this) is when the query not only never finishes, but also creates the pathological behavior I described. Once you can’t keep a network connection open, all bets are kind of off.

I wonder if this kind of thing pushed the design of Cloud and Peer Server (as required by, e.g., analytics support, even for on-prem) away from the true “peer” architecture? After all, the JVM, OS, and VM are all resource-isolation “layers”; maybe asking for an additional layer inside the JVM is too much.

I don’t expect the datalog engine to save me from these pathological queries a priori, and I generally appreciate the power and flexibility afforded. But having an opt-in mechanism to limit query execution in some way would be operationally very helpful. The incident which inspired this post was solved by a very tedious process of elimination. The ability to ask query to throw an exception after it does something-coincident-with-memory-allocation more than so-many times (not sure what that would be…number of unification operations?) would have saved hours of operational firefighting.

In case you’re curious, the root cause in this case was an (ultimately unnecessary) [(identity ?a) ?b] clause that should have been [(identity ?b) ?a]…and up to that point, ?a had only ever been bound to a single value, which is apparently fine? Anyway, once it was bound to 25 values it created a scan-the-whole-database type of situation. Again, I don’t really expect the datalog engine to stop that in all cases, but I need to know where the problem is coming from.

@adam yep, this all sounds familiar. Due to the cascade of failure modes, it can be tricky to identify the original offender, since well-formed queries themselves start to show latency. I’m not aware of any JVM capability to limit allocation within a given thread (the way a Linux container can), but I concur that reducing the blast radius of a bad query would certainly help counterbalance the absence of query optimization (which itself IMO is not a deal-breaker).
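One related capability that does exist on HotSpot, though it measures rather than limits: `com.sun.management.ThreadMXBean` (the JDK-specific extension of the standard MBean) reports cumulative bytes allocated per thread, which a monitoring thread could poll to flag heavy allocators. A sketch, assuming a HotSpot JVM where allocation tracking is supported and enabled (the defaults on recent JDKs):

```java
import java.lang.management.ManagementFactory;

public class AllocationWatch {
    public static void main(String[] args) {
        // HotSpot-specific cast: the standard ThreadMXBean interface does not
        // expose per-thread allocation; the com.sun.management one does.
        com.sun.management.ThreadMXBean mx =
            (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();
        long tid = Thread.currentThread().getId();
        long before = mx.getThreadAllocatedBytes(tid);

        // Allocate roughly 100 MB on this thread.
        byte[][] junk = new byte[100][];
        for (int i = 0; i < 100; i++) junk[i] = new byte[1_000_000];

        long after = mx.getThreadAllocatedBytes(tid);
        // The counter is cumulative allocation, not live heap, so it grows
        // even if the arrays are immediately collectable.
        System.out.println(junk.length == 100 && (after - before) > 50_000_000);
    }
}
```

This can’t stop a runaway query, but a watchdog sampling these counters could at least name the thread (and, with labeled threads, the query) doing the allocating.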

@adam Perhaps the :timeout option in the client API cancels the query thread? If so, could you dispatch a new thread for the query and .interrupt() it if it exceeds a specified timeout?
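The dispatch-and-interrupt idea looks roughly like this in Java (the query here is simulated with a sleep; whether a real query actually stops depends entirely on the underlying code honoring interruption, which is not guaranteed):

```java
import java.util.concurrent.*;

public class QueryWithTimeout {
    // Run a (hypothetical) query call on its own thread and interrupt it
    // if it exceeds a deadline.
    static <T> T callWithTimeout(Callable<T> query, long millis) throws Exception {
        ExecutorService ex = Executors.newSingleThreadExecutor();
        Future<T> f = ex.submit(query);
        try {
            return f.get(millis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            f.cancel(true); // delivers Thread.interrupt() to the query thread
            throw e;
        } finally {
            ex.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        try {
            // Simulated slow query: sleeps far past the 100 ms deadline.
            callWithTimeout(() -> { Thread.sleep(60_000); return "done"; }, 100);
        } catch (TimeoutException e) {
            System.out.println("query timed out");
        }
    }
}
```

Note that even when the interrupt lands, it frees the caller but not necessarily the CPU and heap the runaway work is consuming, which is the crux of the GC-spiral problem described above.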


Adam, are you sure a timeout wouldn’t work? I can think of two cases that could occur after adding a timeout.

  1. If there are multiple long-running queries, the offending query may cause the other long-running queries to fail to finish. In this case multiple queries time out. Even though you cannot tell which of them is the offending query, over time, in the logs, you’ll see a pattern where the offending query appears more often, or by itself, or in every set of queries that timed out.
  2. There may be only the offending query that is long-running. In this case, all queries started before the offending query will finish, even though the system is incredibly slow. The queries that time out will all have started after the system became slow, except the offending query. In the logs, the offending query will always appear as the first query to time out. Of course, “long-running” is relative, but the first case I describe covers anything running “long enough” not to fall into this second case.

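The first heuristic above is mechanical enough to script: given batches of queries that timed out together, the query present in every batch is the likely offender. A toy Java version (query names are made up for illustration):

```java
import java.util.*;

public class TimeoutLogTriage {
    // Intersect the timeout batches: a query that appears in every
    // batch is the prime suspect.
    static Set<String> suspects(List<Set<String>> batches) {
        Set<String> s = new HashSet<>(batches.get(0));
        for (Set<String> b : batches) s.retainAll(b);
        return s;
    }

    public static void main(String[] args) {
        List<Set<String>> batches = List.of(
            Set.of("bad-query", "report-q"),
            Set.of("bad-query", "dashboard-q", "report-q"),
            Set.of("bad-query", "audit-q"));
        System.out.println(suspects(batches)); // prints [bad-query]
    }
}
```

A real version would parse timeout events out of the peer logs and group them into batches by time window first.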
I hear you solved the problem, and I don’t know the details of what was going on, so maybe I’m just showing some ignorance here. I just want to offer the possibility that a timeout may work to diagnose such problems, based on my understanding.

@jzwolak Yes, it could help to have timeouts. The offending query we found provides a good starting place for testing this, and I plan to do so. The challenge is that long-running queries aren’t necessarily bad queries.

Ideally I could emit some log message like “query foo has been running for more than X milliseconds” without actually terminating the query. Unfortunately, with the bad query running it seems quite possible that the log message would never make it off the system. Probably better than nothing, though.
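That log-without-terminating idea can be sketched with a watchdog thread that checks whether the query’s future has completed after a threshold (the query here is a stand-in sleep, and the label is hypothetical; under a real GC death spiral the watchdog thread is of course just as starved as everything else):

```java
import java.util.concurrent.*;

public class SlowQueryWatchdog {
    public static void main(String[] args) throws Exception {
        ExecutorService queries = Executors.newCachedThreadPool();
        ScheduledExecutorService watchdog = Executors.newSingleThreadScheduledExecutor();

        String label = "query-foo"; // hypothetical query name
        // Simulated query that outlives the 100 ms logging threshold.
        Future<?> f = queries.submit(() -> {
            try { Thread.sleep(500); } catch (InterruptedException ignored) {}
        });

        // Log from a separate thread; the query itself is left running.
        watchdog.schedule(() -> {
            if (!f.isDone())
                System.out.println(label + " has been running for more than 100 ms");
        }, 100, TimeUnit.MILLISECONDS);

        f.get(); // the query still finishes normally
        queries.shutdown();
        watchdog.shutdown();
    }
}
```

Because the message is emitted on the watchdog thread, it has a chance of making it to the log even while the query thread is bogged down, though nothing is guaranteed once the collector owns all the CPU.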
