Datomic and linguistic data

LaurentBie · December 2, 2021, 12:43pm

Hi, I work at a company that manages a huge amount of linguistic data: about 2 billion sentences, roughly 20 million sentences in each of 100 languages.
Each sentence is likely to be edited over time and we would like to implement a data storage system that guarantees traceability of changes.
Datomic seems well suited to that purpose thanks to its very clever time model, but I would like to get some feedback from experienced Datomic users about the suitability of Datomic for such a purpose. Thanks

ckws · December 6, 2021, 12:42pm

My 2 cents

Data Size

According to the latest information I have, Datomic is suitable for data sets of fewer than 10 billion datoms.
If your model consists of sentences, you are within the parameters, but if you are planning to break it down into words, or even further, you will exceed it.
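To make that concrete, here is a rough back-of-the-envelope sketch. The attribute count per entity, the words-per-sentence figure, and the "two datoms per edit" rule are my own illustrative assumptions, not official Datomic numbers, and transaction-entity datoms are ignored.

```python
# Rough datom-count estimate. All per-entity figures are illustrative assumptions.
DATOM_BUDGET = 10_000_000_000  # the ~10 billion figure mentioned above


def estimate_datoms(entities: int, attrs_per_entity: int, avg_edits: float = 0.0) -> int:
    """Estimate stored datoms for a set of entities.

    Assumes one datom per attribute value, and that each edit changes a single
    attribute, adding one retraction and one new assertion (2 datoms).
    """
    initial = entities * attrs_per_entity
    edit_datoms = int(entities * avg_edits * 2)
    return initial + edit_datoms


sentences = 2_000_000_000

# Sentence-level model: assume 3 attributes (e.g. id, language, text), no edits yet.
sentence_level = estimate_datoms(sentences, attrs_per_entity=3)

# Word-level model: assume ~10 words per sentence, 3 attributes per word.
word_level = estimate_datoms(sentences * 10, attrs_per_entity=3)

print(sentence_level <= DATOM_BUDGET)  # sentence-level fits under the budget
print(word_level <= DATOM_BUDGET)      # word-level does not
```

Note that passing a non-zero `avg_edits` pushes the sentence-level total up quickly, so the edit rate matters as much as the initial load.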

History

If tracking how sentences evolve over time is a primary function of your application, it should be reflected in the model. I.e., I would not recommend relying on Datomic’s history model; I would track the changes myself.
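A minimal sketch of what "history in the model" could look like, independent of any database. The `Sentence` and `Revision` names and their fields are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Revision:
    """One immutable version of a sentence (hypothetical shape)."""
    text: str
    author: str
    edited_at: datetime


@dataclass
class Sentence:
    """A sentence whose edits are first-class data rather than database history."""
    sentence_id: str
    language: str
    revisions: list[Revision] = field(default_factory=list)

    def edit(self, text: str, author: str) -> None:
        # Append-only: earlier revisions are never modified or removed.
        self.revisions.append(Revision(text, author, datetime.now(timezone.utc)))

    @property
    def current(self) -> str:
        # The latest revision is the current text.
        return self.revisions[-1].text


s = Sentence("s1", "fr")
s.edit("Bonjour le monde", author="alice")
s.edit("Bonjour, le monde !", author="bob")
```

With this shape, "who changed what and when" is an ordinary query over revisions, whichever store you pick.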

In general, you want to access Datomic data using the database value, which provides the current state. I would use the Datomic history database only for rare use cases where you need a view of the database across time.

Building history into your data model from the start also lets you consider other databases. E.g., Postgres functions/triggers can automatically write a datum's history on change.
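To illustrate the trigger idea, here is a small runnable sketch. Note that it uses SQLite (via Python's standard `sqlite3` module) as a stand-in, not Postgres; the Postgres trigger syntax differs, and the table and trigger names are hypothetical.

```python
import sqlite3

# In-memory SQLite database as a stand-in for Postgres (assumption for the demo).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sentence (
    id       INTEGER PRIMARY KEY,
    language TEXT NOT NULL,
    text     TEXT NOT NULL
);

CREATE TABLE sentence_history (
    sentence_id INTEGER NOT NULL,
    old_text    TEXT NOT NULL,
    changed_at  TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);

-- Copy the previous text into the history table whenever a sentence is edited.
CREATE TRIGGER sentence_audit
AFTER UPDATE OF text ON sentence
BEGIN
    INSERT INTO sentence_history (sentence_id, old_text)
    VALUES (OLD.id, OLD.text);
END;
""")

conn.execute("INSERT INTO sentence (id, language, text) VALUES (1, 'en', 'Helo world')")
conn.execute("UPDATE sentence SET text = 'Hello world' WHERE id = 1")

# The old version was captured by the trigger, with no application code involved.
history = conn.execute(
    "SELECT sentence_id, old_text FROM sentence_history"
).fetchall()
current = conn.execute("SELECT text FROM sentence WHERE id = 1").fetchone()[0]
```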

I hope that helps.

-ck

