Problem Statement
I need an on-disk lucene index kept up to date for a text search ion. Is calling d/since on every ion invocation problematic?
Background
A system I am building will use information from datomic to efficiently query an on-disk lucene index. I would therefore like to run this query with datomic cloud + ions rather than creating a separate client application. The strings are small enough to fit in a datomic string attribute, so datomic is the source of truth. (If these strings get bigger, I can instead store a reference to s3 or somewhere else as needed.)
I plan to make my own query group for this ion application. My lucene queries are more advanced than a simple full text search and require the results of datalog queries to limit the documents lucene searches and modify the way lucene scores the included documents for relevance.
I know there are concerns around using core.async with ions and with some of the async methods in the client-cloud datomic.client.api. Are there any similar concerns with d/index-range or d/since within an ion?
I could avoid the call to d/index-range and use d/q instead, but I’m not sure if performance would suffer. I remember talking with Stu about wanting to avoid operations that would hammer the dynamodb tx-log, but I can’t recall all the specifics and may be totally wrong.
Use Case
Updating a lucene index on disk which will be used to fulfill a search request. I only have a few fields I need to index, and I can limit the size of the d/index-range result by calling d/since. Is calling d/since frequently (on every ion-based search request) problematic?
Example Code Sketch
(defn search-handler [input]
  (let [db (d/db (get-conn))]
    ;; possibly do this in a single place so multiple ion invocations
    ;; don't attempt to update this resource at the same time
    (doseq [datom (d/index-range (d/since db @last-t-i-ran-this-process-with)
                                 {:attrid :doc/text ; hypothetical attribute; the client API requires :attrid
                                  :limit -1})]
      (update-lucene-index datom))
    ;; update the last-t-i-ran-this-process-with to the latest used t
    ;; after we have ensured the index is in sync
    (search-using-updated-query db input)))
The above is just a sketch. It’s ugly, it has a bunch of holes, and it’s intended to be pseudocode. I know there is a lot to consider when thinking about lucene index readers and how to manage them. I’ll handle that.
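To make the "single place" comment in the sketch concrete, here is a minimal sketch that serializes the catch-up with a JVM lock and only advances the stored t after the index has been updated. It is an assumption about one possible shape, not a recommendation from the Datomic docs: `last-synced-t`, `sync-lock`, `catch-up!`, `fetch-datoms-since` and `apply-datom!` are all hypothetical names, and the two function arguments stand in for the d/since + d/index-range call and the lucene writer update.

```clojure
;; Basis t of the last db value the on-disk index was synced to.
;; nil means the index has not been synced in this process yet.
(defonce last-synced-t (atom nil))

;; Monitor used to serialize catch-up across concurrent invocations.
(defonce sync-lock (Object.))

(defn catch-up!
  "Bring the index up to `target-t`, doing the work at most once per t.
  `fetch-datoms-since` (hypothetical) takes the previous t (or nil) and
  returns the datoms added after it; `apply-datom!` (hypothetical) writes
  one datom to the index. Returns the t the index is now synced to."
  [target-t fetch-datoms-since apply-datom!]
  (locking sync-lock
    (let [prev @last-synced-t]
      (when (or (nil? prev) (< prev target-t))
        (doseq [datom (fetch-datoms-since prev)]
          (apply-datom! datom))
        ;; Advance only after every datom has been applied.
        (reset! last-synced-t target-t))
      @last-synced-t)))
```

A handler would call `catch-up!` with the basis t of the db value it is about to search. If another invocation has already synced past that t, the call does no work and the returned t shows that the index is ahead of the handler's db.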
Other Considerations
How to update the index upon machine restart: I can snapshot the index in S3 as of a particular T, but if my indexable fields change, that would obviously require a full re-index.
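The restart decision could be sketched as a pure function over a small manifest stored next to the snapshot. The manifest shape (`:t` and `:fields`), the function name `restore-plan`, and the example attribute names are assumptions made for illustration only.

```clojure
(defn restore-plan
  "Decide how to rebuild the local index after a restart.
  `manifest` (hypothetical) is metadata saved with the S3 snapshot, e.g.
  {:t 1234 :fields #{:doc/title :doc/body}}, or nil when no snapshot exists.
  `current-fields` is the set of attributes the running code wants indexed."
  [manifest current-fields]
  (cond
    ;; No snapshot to start from.
    (nil? manifest)
    {:action :full-reindex}

    ;; The snapshot was built for a different set of indexable fields.
    (not= (:fields manifest) current-fields)
    {:action :full-reindex}

    ;; The snapshot is usable; only datoms after its T are needed.
    :else
    {:action :catch-up :since-t (:t manifest)}))
```

With this shape, a matching snapshot only needs the datoms added since its T, while a changed field set falls through to the full re-index.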
There was talk recently about running a background process on an ions instance. I could use that to trigger the re-index process once per second (or a larger unit of time, yay queueing theory!), like elasticsearch does. In that kind of system, my background process would have to dictate the latest db value to the ion function in order to guarantee consistency. This will likely be my fallback solution.
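A minimal sketch of that fallback, assuming a plain JVM scheduled executor is acceptable on the instance: the background task syncs the index and then publishes the db value it synced to, and handlers search against the published value rather than fetching their own. `synced-db`, `start-refresher!`, `latest-db` and `sync-index!` are hypothetical names; `latest-db` would wrap something like d/db, and `sync-index!` would do the d/since + d/index-range catch-up.

```clojure
(import '(java.util.concurrent Executors ScheduledExecutorService TimeUnit))

;; Latest db value the index is known to be in sync with.
;; Handlers read this instead of calling d/db themselves.
(defonce synced-db (atom nil))

(defn start-refresher!
  "Every `period-ms`, fetch the latest db with `latest-db` (hypothetical),
  sync the index with `sync-index!` (hypothetical), then publish that db.
  Returns the executor so the caller can shut it down."
  [latest-db sync-index! period-ms]
  (let [^ScheduledExecutorService exec (Executors/newSingleThreadScheduledExecutor)
        task (fn []
               (try
                 (let [db (latest-db)]
                   (sync-index! db)
                   ;; Publish only after the index has caught up to db.
                   (reset! synced-db db))
                 (catch Exception e
                   ;; An uncaught exception would cancel all future runs.
                   (println "index refresh failed:" (.getMessage e)))))]
    (.scheduleWithFixedDelay exec ^Runnable task 0 (long period-ms) TimeUnit/MILLISECONDS)
    exec))
```

Because the task runs on a single thread with a fixed delay between runs, two refreshes never overlap, and a handler that derefs `synced-db` never sees a db newer than the index.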
I can solve these issues on my own if they detract. I’m mainly concerned about the perf and cost of calling d/since, possibly in combination with d/index-range.