Datomic and linguistic data

Hi, I work at a company that manages a huge amount of linguistic data (2 billion sentences: roughly 20 million sentences in each of 100 languages).
Each sentence is likely to be edited over time, and we would like to implement a data storage system that guarantees traceability of changes.
Datomic seems well suited for that purpose thanks to its very clever time model, but I would like to get some feedback from experienced Datomic users about its suitability for this use case. Thanks

My 2 cents

Data Size

According to the latest guidance, Datomic is suitable for data sets of fewer than roughly 10 billion datoms.
If your model treats each sentence as an entity, you are within those parameters (keeping in mind that every attribute value on an entity is a datom, and every edit adds more), but if you plan to break sentences down into words, or even further, you will exceed that limit.
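A back-of-envelope calculation makes the point; the attribute counts here are pure assumptions for illustration, not a recommended schema:

```python
# Rough datom-count estimate (attribute counts are illustrative assumptions).
SENTENCES = 2_000_000_000
ATTRS_PER_SENTENCE = 5        # assumption: text, language, source, ...
AVG_WORDS_PER_SENTENCE = 15   # assumption
ATTRS_PER_WORD = 3            # assumption

# Sentence-level modeling: right at the ~10 billion datom guideline.
sentence_datoms = SENTENCES * ATTRS_PER_SENTENCE
print(f"{sentence_datoms:.1e}")  # 1.0e+10

# Word-level modeling: an order of magnitude past it.
word_datoms = SENTENCES * AVG_WORDS_PER_SENTENCE * ATTRS_PER_WORD
print(f"{word_datoms:.1e}")      # 9.0e+10
```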


Modeling History

If the evolution of sentences over time is a primary function of your application, it should be reflected in your data model. I.e., I would not recommend relying on Datomic's history model, but rather tracking the changes yourself.
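To make "tracking the changes yourself" concrete, here is a minimal sketch of an explicit, append-only revision model, in plain Python rather than any particular database. All names (`SentenceRevision`, `edit`, `as_of`) are made up for illustration:

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional

@dataclass(frozen=True)
class SentenceRevision:
    sentence_id: str   # stable identity shared by all revisions of a sentence
    text: str
    edited_at: datetime
    editor: str

# Append-only log: an edit never mutates a revision, it appends a new one,
# so the log is assumed to be in chronological order.
log: List[SentenceRevision] = []

def edit(sentence_id: str, text: str, editor: str,
         at: Optional[datetime] = None) -> None:
    log.append(SentenceRevision(sentence_id, text,
                                at or datetime.now(timezone.utc), editor))

def current(sentence_id: str) -> Optional[SentenceRevision]:
    """Latest revision, i.e. the current state of the sentence."""
    revs = [r for r in log if r.sentence_id == sentence_id]
    return revs[-1] if revs else None

def as_of(sentence_id: str, t: datetime) -> Optional[SentenceRevision]:
    """Latest revision at or before time t: a view across time."""
    revs = [r for r in log if r.sentence_id == sentence_id and r.edited_at <= t]
    return revs[-1] if revs else None
```

Because history is first-class in the model, "who changed what, and when" is an ordinary query rather than something delegated to the storage engine.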

In general, you want to access Datomic data through the database value, which provides the current state. I would reserve the Datomic history database for the rare use cases where you need a view of the database across time.

Adding history to your data model from the beginning also allows you to consider other databases. E.g., Postgres functions/triggers can automatically write a row's history on every change.
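The trigger pattern can be demonstrated end-to-end with SQLite (whose trigger syntax is close enough for illustration; in Postgres you would write a PL/pgSQL function plus a trigger). Table and column names are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sentence (
    id   INTEGER PRIMARY KEY,
    lang TEXT NOT NULL,
    text TEXT NOT NULL
);
CREATE TABLE sentence_history (
    sentence_id INTEGER NOT NULL,
    lang        TEXT NOT NULL,
    text        TEXT NOT NULL,
    changed_at  TEXT NOT NULL DEFAULT (datetime('now'))
);
-- On every update, snapshot the previous row into the history table.
CREATE TRIGGER sentence_audit
AFTER UPDATE ON sentence
BEGIN
    INSERT INTO sentence_history (sentence_id, lang, text)
    VALUES (OLD.id, OLD.lang, OLD.text);
END;
""")

conn.execute("INSERT INTO sentence (id, lang, text) VALUES (1, 'en', 'Helo world')")
conn.execute("UPDATE sentence SET text = 'Hello world' WHERE id = 1")
rows = conn.execute("SELECT text FROM sentence_history").fetchall()
print(rows)  # [('Helo world',)]
```

The application code only issues a plain `UPDATE`; the audit row is written automatically, which keeps traceability independent of any one client.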

I hope that helps.