Datomic: can you rely on indices for incremental updates?


#1

Hi all,

We have written an incremental data pipeline on top of Datomic. The time concept is perfect for this: if you declare, in code, the tables and the links between tables that a transformation uses, you can build a system that automatically knows which entities require an update when a source table changes.
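As a sketch of that dependency idea (the class and table names here are hypothetical, not our actual pipeline): given a map from each source table to the target tables derived from it, a change to a source tells you, transitively, which targets need recomputation.

```java
import java.util.*;

public class DependencyGraph {
    // Map from a source table to the target tables derived from it.
    private final Map<String, Set<String>> derivedFrom = new HashMap<>();

    public void addDependency(String source, String target) {
        derivedFrom.computeIfAbsent(source, k -> new HashSet<>()).add(target);
    }

    // Transitively collect every target table affected by a change to `source`.
    public Set<String> affectedBy(String source) {
        Set<String> affected = new LinkedHashSet<>();
        Deque<String> work = new ArrayDeque<>();
        work.push(source);
        while (!work.isEmpty()) {
            for (String t : derivedFrom.getOrDefault(work.pop(), Set.of())) {
                if (affected.add(t)) work.push(t);
            }
        }
        return affected;
    }

    public static void main(String[] args) {
        DependencyGraph g = new DependencyGraph();
        g.addDependency("employees", "payroll");
        g.addDependency("payroll", "reports");
        System.out.println(g.affectedBy("employees")); // [payroll, reports]
    }
}
```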

That being said, we noticed that in the target tables it can happen that:

  • the object itself is updated,
  • but the indices are not.

In other words, if you do a “get all” on a table, there is an entity with name “Micky Mouse”; but if you filter on name == “Micky Mouse”, you get nothing until the indices are updated. This is also stated in the Datomic docs: “Updates the datom trees only occasionally, via background indexing jobs.”

However, is there no way to force the index to be up to date after a certain batch job? If you can’t be certain about the index, won’t people introduce a lot of hidden mistakes? E.g. I fetch all entities of a table that changed after a certain time and link them with another table; then I store the new date and repeat the procedure next time. What if the indices of that second table were not up to date? Since I’m working incrementally, there is no way I could know.

Or is my understanding of this incorrect?


#2

I’ll answer part of my own question:
https://docs.datomic.com/on-prem/capacity.html#indexing
“You can also explicitly request an indexing job by calling requestIndex.”


#3

Datomic uses a combination of persistent indexes on disk and a memory index to provide up-to-date access to the entire database at all times.

You’re correct that the persistent indexes on disk are only updated periodically, by indexing jobs. However, every time a transaction completes, the transactor notifies all connected peers with the updates included in that transaction so that they can incorporate them into their local memory index. When you issue a query, Datomic ‘answers’ the query from the combination of segments of the persistent index and the local memory index.

This is described in more detail here: https://docs.datomic.com/on-prem/architecture.html and here: https://docs.datomic.com/on-prem/indexes.html#efficient-accumulation (“Merges index trees with an in-memory representation of recent change so that peers see up-to-date indexes.”)

This means that if you issue a query against an up-to-date database value (i.e. you call d/db to get a new database value), you will see all transacted data, whether it has been incorporated into the persistent disk indexes or not.
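A toy sketch of that merge (these are simplified maps, not Datomic’s actual sorted index trees): a query is answered from the combination of the durable index and an in-memory index of recent novelty, so new datoms are visible immediately, before any background indexing job runs.

```java
import java.util.*;

public class MergedIndexToy {
    // Durable index, rebuilt only by occasional background indexing jobs.
    private final Map<Long, String> persistentIndex = new HashMap<>();
    // In-memory index of datoms transacted since the last indexing job.
    private final Map<Long, String> memoryIndex = new LinkedHashMap<>();

    public void transact(long entity, String name) {
        memoryIndex.put(entity, name);       // visible to readers immediately
    }

    public void backgroundIndexJob() {
        persistentIndex.putAll(memoryIndex); // fold novelty into the durable index
        memoryIndex.clear();
    }

    // Reads merge both sources; the memory index wins for recent changes.
    public String nameOf(long entity) {
        return memoryIndex.getOrDefault(entity, persistentIndex.get(entity));
    }

    public static void main(String[] args) {
        MergedIndexToy db = new MergedIndexToy();
        db.transact(1L, "Micky Mouse");
        // No indexing job has run yet, but the query still sees the new datom:
        System.out.println(db.nameOf(1L)); // Micky Mouse
        db.backgroundIndexJob();
        System.out.println(db.nameOf(1L)); // Micky Mouse
    }
}
```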


#4

Thank you for the answer.
In that case it does not make sense that I receive inconsistent results for two different queries.

Extra information: the only difference between the getAll call and the filtered call is that we insert a condition. How we reproduce it: we change an entity in our db, and directly afterwards we call getAll, which fetches the entity ids with a query and does a pull. Then we call the filtered getAll, which adds conditions to the id-fetching query, e.g. if we filter on firstName:

[?e :AE_Employee/FirstName ?firstname1]
[(= ?firstname1 "Brecht")]

If we change the name to “Brecht”, the getAll query shows the first name correctly, but the filtered one returns []… as if it is not yet updated in the index. After a while, the filtered query does return the correct result.

By now we have added a ‘syncIndex’ call after each bulk of transactions, and that seems to have fixed it.

Datomic version: datomic-pro-0.9.5561.50


#5

There are a couple of possibilities here.
First, are you getting a new database value before issuing your query (i.e. calling d/db)?
Are you running the query on the same peer that issued the transaction? If so, you should use the db-after value returned from the transaction.

Also, the transactor does have to stream new data to connected peers. It is possible that your transaction has completed but the peers have not yet all been updated. You can use the sync API to force the peer to get the newest database value available.
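A sketch of the race being described, using toy classes rather than the real Datomic API: the transactor streams novelty to peers asynchronously, so a peer’s current db value can briefly lag behind a completed transaction; a sync-style barrier blocks until the peer has incorporated a given basis t.

```java
import java.util.concurrent.*;

public class SyncBarrierToy {
    // The highest transaction t this peer has incorporated locally.
    private volatile long basisT = 0;

    public long basisT() { return basisT; }

    // Called when the transactor's novelty stream reaches this peer.
    public synchronized void applyNovelty(long t) {
        basisT = Math.max(basisT, t);
        notifyAll();
    }

    // sync(t): block until the local db value reflects transaction t.
    public synchronized void syncTo(long t) throws InterruptedException {
        while (basisT < t) wait();
    }

    public static void main(String[] args) throws Exception {
        SyncBarrierToy peer = new SyncBarrierToy();
        // Simulate the transactor delivering tx 42 after a small network delay.
        ScheduledExecutorService ex = Executors.newSingleThreadScheduledExecutor();
        ex.schedule(() -> peer.applyNovelty(42), 50, TimeUnit.MILLISECONDS);
        peer.syncTo(42);                  // only query after this returns
        System.out.println(peer.basisT()); // 42
        ex.shutdown();
    }
}
```

Querying without the barrier reads whatever basis t the peer happens to have, which is exactly the “transacted but not yet visible here” window the filtered query fell into.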


#6

The transaction is issued by another peer on a different machine.
Both queries are run on the same peer.
Every time I run a query I get a new db value, since I do not store it anywhere and always obtain it through connection.db() (Java API).
So I really don’t understand how this is possible.


#7

If you’re issuing the query very soon after the other peer has submitted the transaction, the querying peer may not have yet been updated by the transactor. There is a very small delay for that information to propagate over the network.

You can use the sync API to force the querying peer to wait for the latest update (or for a specific update by supplying a t value) before executing the query.