Is calling `d/since` on every ion invocation problematic?


Problem Statement
I need an on-disk Lucene index kept up to date for a text search ion. Is calling d/since on every ion invocation problematic?

A system I am building will use information from Datomic to efficiently query an on-disk Lucene index, so I would like to use Datomic Cloud + ions to execute this query rather than creating a separate client application. The strings are small enough to fit inside a Datomic string attribute, so Datomic is the source of truth. (If these strings get bigger, I can instead store a reference to S3 or somewhere else as needed.)

I plan to make my own query group for this ion application. My Lucene queries are more advanced than a simple full-text search: they need the results of datalog queries both to limit which documents Lucene searches and to modify how Lucene scores the included documents for relevance.

I know there are concerns around using core.async with ions and some of the async methods in the client-cloud datomic.client.api. Are there any similar concerns with d/index-range or d/since within an ion?

I could avoid the call to d/index-range and use d/q instead, but I’m not sure if performance would suffer. I remember talking with Stu about wanting to avoid operations that would hammer the DynamoDB tx-log, but I can’t recall all the specifics and may be totally wrong.

Use Case
Updating a Lucene index on disk, which will be used to fulfill a search request. I only have a few fields to index and can limit the size of the d/index-range result by calling d/since first. Is calling d/since frequently (on every ion-based search request) problematic?

Example Code Sketch

;; d is assumed to be an alias for datomic.client.api
(defn search-handler [input]
  (let [db (d/db (get-conn))]
    ;; possibly do this in a single place so multiple ion invocations
    ;; don't attempt to update this resource at the same time
    (doseq [datom (d/index-range (d/since db @last-t-i-ran-this-process-with)
                                 {:attrid :doc/text ; placeholder attribute name
                                  :limit -1})]
      (update-lucene-index datom))
    ;; update the last-t-i-ran-this-process-with to the latest used t
    ;; after we have ensured the index is in sync
    (search-using-updated-query db input)))

The above is just a sketch, and it’s ugly with a bunch of holes; it’s intended to be pseudocode. I know there is a lot to consider when thinking about Lucene index readers and how to manage them. I’ll handle that.

Other Considerations
How to update the index upon machine restart: I can snapshot the index in S3 as of a particular T, but if my indexable fields change, that would obviously require a full re-index.

There was talk recently about running a background process on an ions instance. I could use that to trigger the re-index process once per second (or a larger unit of time, yay queueing theory!) like Elasticsearch does. In that kind of system, my background process would have to dictate the latest db value to the ion function in order to guarantee consistency. This will likely be my fallback solution.

I can solve these issues on my own if they detract. I’m mainly concerned about the perf and cost of calling d/since, possibly in combination with d/index-range.


Hi @jplane,

Calling d/since on every ion invocation is not problematic. However, I think you are misunderstanding the properties of a filtered DB (i.e. d/since or d/as-of) and assuming that it will be less expensive than the unfiltered db from d/db. Intuitively this makes sense, because less of the DB should mean less work. In practice, though, filters require a full scan of the DB to do their filtering.

“The first query is limited only by time. Such a query can be answered directly from the log, which is a time index. Answering the same question via asOf and since would require a filtering scan of the entire database.”

As the docs quoted above say, database time filters are not cheaper: answering a time-bounded question through them requires a filtering scan of the entire database.

For your needs, I believe you would be better served by calling d/tx-range on every ion invocation, since it reads the new transactions directly from the log.
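To make the suggestion concrete, here is a minimal sketch of an incremental sync driven by d/tx-range. It is only an illustration under assumptions, not a definitive implementation: `last-indexed-t`, `new-datoms`, `sync-index!`, and the `index-datom!` callback are hypothetical names, and the Datomic call is passed in as a function so the sketch does not depend on a live connection. It also does not address the concurrency question raised in the original sketch.

```clojure
;; Assumption: one atom per process remembers the last t already indexed.
(defonce last-indexed-t (atom 0))

(defn new-datoms
  "Given transactions shaped like the d/tx-range result (maps with :t and
  :data), returns [latest-t datoms], where latest-t is the highest t seen
  (or last-t if there were no transactions)."
  [last-t txes]
  [(reduce max last-t (map :t txes))
   (mapcat :data txes)])

(defn sync-index!
  "fetch-txes is a function of an arg-map, e.g. #(d/tx-range conn %) with d
  aliased to datomic.client.api. index-datom! is a hypothetical callback
  that applies a single datom to the Lucene index."
  [fetch-txes index-datom!]
  (let [last-t @last-indexed-t
        ;; :start is inclusive, so begin just after the last indexed t;
        ;; :limit -1 asks for all matching transactions.
        txes (vec (fetch-txes {:start (inc last-t) :limit -1}))
        [latest-t datoms] (new-datoms last-t txes)]
    (run! index-datom! datoms)
    ;; Only advance the marker once every datom has been applied.
    (reset! last-indexed-t latest-t)))
```

Inside the ion this could be called as `(sync-index! #(d/tx-range conn %) update-lucene-index)` before running the search. Note that each transaction's :data contains every datom in that transaction, so the callback would still need to skip attributes that are not indexed.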