`Loading database` exception

We are getting this exception more and more often on d/db calls, even on a freshly provisioned Datomic Cloud system using the latest CF templates and nearly the latest Clojure libs (the versions before the 2024 Feb 12 release):

clojure.lang.ExceptionInfo: Loading database #:cognitect.anomalies{:category :cognitect.anomalies/unavailable, :message "Loading database"}
	at datomic.core.anomalies$throw_if_anom.invokeStatic(anomalies.clj:94)
	at datomic.core.anomalies$throw_if_anom.invoke(anomalies.clj:88)
	at datomic.core.anomalies$throw_if_anom.invokeStatic(anomalies.clj:89)
	at datomic.core.anomalies$throw_if_anom.invoke(anomalies.clj:88)
	at datomic.cloud.client.local.Client$thunk__29831.invoke(local.clj:175)
	at datomic.cloud.client.local$create_db_proxy.invokeStatic(local.clj:282)
	at datomic.cloud.client.local$create_db_proxy.invoke(local.clj:280)
	at datomic.cloud.client.local.Connection.db(local.clj:103)
	at datomic.client.api$db.invokeStatic(api.clj:186)
	at datomic.client.api$db.invoke(api.clj:175)

In past years, it only happened on instances smaller than i3.large, e.g. t3.medium and t3.small, so the issue seemed to be related to the instance type. However, it did happen on our previous i3.large system in the ap-southeast-1 region, and it keeps happening even more often on our fresh system in us-west-2.

What should I do about it?

Should I wrap my d/db calls with with-retry too, like we wrap our d/connect calls?
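
A minimal sketch of what wrapping a d/db call in a retry could look like. The `with-retry` and `retryable-anomaly?` functions below are hypothetical helpers written for illustration, not part of the Datomic API, and treating `:cognitect.anomalies/unavailable` and `:cognitect.anomalies/busy` as the retryable categories is an assumption. Nothing here confirms that retrying d/db is the recommended fix.

```clojure
(defn retryable-anomaly?
  "True when ex carries an anomaly category that is assumed to be transient.
   Assumption: only :unavailable and :busy are worth retrying."
  [ex]
  (contains? #{:cognitect.anomalies/unavailable
               :cognitect.anomalies/busy}
             (:cognitect.anomalies/category (ex-data ex))))

(defn with-retry
  "Hypothetical helper. Calls (f); on a retryable ExceptionInfo it sleeps and
   calls (f) again, doubling the delay each time, for at most max-attempts
   calls in total. Non-retryable exceptions and the final failure are rethrown."
  [f {:keys [max-attempts initial-delay-ms]
      :or   {max-attempts 5 initial-delay-ms 100}}]
  (loop [attempt  1
         delay-ms initial-delay-ms]
    (let [result (try
                   {:ok (f)}
                   (catch clojure.lang.ExceptionInfo ex
                     (if (and (< attempt max-attempts)
                              (retryable-anomaly? ex))
                       {:retry ex}
                       (throw ex))))]
      (if (contains? result :ok)
        (:ok result)
        (do
          (Thread/sleep (long delay-ms))
          (recur (inc attempt) (* 2 delay-ms)))))))
```

Usage would be along the lines of `(with-retry #(d/db conn) {:max-attempts 5 :initial-delay-ms 100})`, where `d` is the `datomic.client.api` alias and `conn` is an existing connection.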

We are not creating new databases on our ion startup, but we do call d/create-database unconditionally, since it seemed idempotent, returns fast if the DB already exists, and is less code than doing it conditionally.

Is it possible that we are triggering the scenario described in the docs (Troubleshooting | Datomic) about this error?

Would it help to provide a :timeout to the d/connect call, assuming it’s also considered by the d/db call?

Correction:

We are not even calling d/create-database unconditionally anymore:

(defn ensure-db!
  "Returns Datomic DB reference (`{:db-name \"<database-name>\"}`) and ensures
   that the database is created."
  [datomic-client db-name]
  (let [db-ref {:db-name db-name}]
    (when-not (-> datomic-client (d/list-databases {}) set (contains? db-name))
      (d/create-database datomic-client db-ref))
    db-ref))

I too have been issuing the create-database unconditionally and so far have not correlated that to any runtime issues. I wish it were documented as being idempotent and efficient to perform unconditionally.


For the record, this issue is still plaguing us.

It mostly happens on t3.medium instances, mostly after a fresh deployment, which we do every few days. It happens on i3.large instances too, though less often, probably because we deploy there less often.

@jaret would it be possible to enquire about this with the dev team?

i’ve tried so many things.

i’m tracking db connection times on cloudwatch.

have the d/connect in a retry loop.

(rmap/rval
  (do
    (dcast/event {::msg (str "Connecting " ($ :db-name) " ...")})

    (dc/with-retry
      (fn []
        (let [conn (d/connect
                     ($ :client)
                     (-> ($ :db-ref)
                         (assoc :timeout 1000)))]
          (dcast/event {::msg (str "Connected: " ($ :db-name))})

          conn))

      :retry?
      (fn [ex]
        (dcast/event {::msg (str "Connecting " ($ :db-name) " again ...")
                      :ex   ex})
        (-> ex #_(doto pst) ex-data ::anom/category
            dc/retryable-anomaly?)))))

but this error seems to originate from d/db calls.

do we have to put those into an explicit retry loop too?

doesn’t it have a built-in retry, just so query group instances can reconnect to primary compute group instances?

does d/db handle the :timeout parameter documented in the datomic.client.api NS docstring?

just asking, because the d/db function docstring doesn’t have the usual “See namespace doc for timeout, offset/limit, and error handling.” sentence in it, but it would be important to know, to compute how long a specific retry logic would take to fail.
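
To make the timing question concrete, here is a rough sketch of the worst-case arithmetic for a retry loop with doubling backoff. It assumes every attempt is bounded by `timeout-ms`, which is exactly the open question for d/db, so treat the result as an estimate under that assumption. `worst-case-ms` and its option names are hypothetical, made up for this illustration.

```clojure
(defn worst-case-ms
  "Upper bound, in milliseconds, before a retry loop gives up, assuming
   every attempt blocks for the full timeout-ms and the sleep between
   attempts starts at initial-delay-ms and doubles each time.
   There is one sleep fewer than there are attempts."
  [{:keys [max-attempts timeout-ms initial-delay-ms]}]
  (let [delays (take (dec max-attempts)
                     (iterate #(* 2 %) initial-delay-ms))]
    (+ (* max-attempts timeout-ms)
       (reduce + delays))))

(worst-case-ms {:max-attempts 5 :timeout-ms 1000 :initial-delay-ms 100})
;; => 6500  (5 x 1000 ms of timeouts + 100 + 200 + 400 + 800 ms of sleeps)
```

If d/db does not honor `:timeout`, the per-attempt term is unbounded and only the sleep portion of this estimate holds.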

We have completely removed both the d/list-databases and the d/create-database calls from our ion deployment, but the result is the same: we still see these Loading database errors from d/db calls, even more frequently, though still mostly on our t3.medium instances.

We have even switched the datomic-<system> DynamoDB table’s capacity mode to On Demand, even after the latest upgrade to 1171-9390.

In On Demand mode these errors seem to be less frequent, but they haven’t disappeared completely, and we have even seen a burst of them (4-5 within a few hours) recently.