Difficulties connecting to production system for batch imports


I’m really struggling to get a batch import job to work reliably with Datomic Cloud.

I’m running it from an EC2 instance in the same VPC as my Cloud stack (prod, i3.xlarge), with a client arg-map that looks like this:

{:server-type :cloud
 :region "eu-central-1"
 :system "linnaeus"
 :endpoint "http://entry.linnaeus.eu-central-1.datomic.net:8182/"
 :timeout (* 1000 60 5)}

Connecting to the db (via d/connect) fails about 3 out of 4 times with a Datomic Client Timeout error. Even when the connection succeeds, d/transact fails after a few transactions with a Service Unavailable error:

#error{:cause "Service Unavailable",
       :data {:cognitect.anomalies/category :cognitect.anomalies/unavailable,
              :cognitect.anomalies/message "Service Unavailable",
              :http-result {:status 503,
                            :headers {"cache-control" "must-revalidate,no-cache,no-store",
                                      "content-length" "331",
                                      "server" "Jetty(9.3.7.v20160115)",
                                      "date" "Thu, 20 Dec 2018 12:25:05 GMT",
                                      "content-type" "text/html;charset=ISO-8859-1"},
                            :body "<html>
                                   <head>
                                   <meta http-equiv=\"Content-Type\" content=\"text/html;charset=ISO-8859-1\"/>
                                   <title>Error 503 </title>
                                   </head>
                                   <body>
                                   <h2>HTTP ERROR: 503</h2>
                                   <p>Problem accessing /api. Reason:
                                   <pre>    Async servlet timeout</pre></p>
                                   <hr /><a href=\"http://eclipse.org/jetty\">Powered by Jetty:// 9.3.7.v20160115</a><hr/>
                                   </body>
                                   </html>
                                   "}},
       :via [{:type clojure.lang.ExceptionInfo,
              :message "Service Unavailable",
              :data {:cognitect.anomalies/category :cognitect.anomalies/unavailable,
                     :cognitect.anomalies/message "Service Unavailable",
                     :http-result {:status 503,
                                   :headers {"cache-control" "must-revalidate,no-cache,no-store",
                                             "content-length" "331",
                                             "server" "Jetty(9.3.7.v20160115)",
                                             "date" "Thu, 20 Dec 2018 12:25:05 GMT",
                                             "content-type" "text/html;charset=ISO-8859-1"},
                                   :body "<html>
                                          <head>
                                          <meta http-equiv=\"Content-Type\" content=\"text/html;charset=ISO-8859-1\"/>
                                          <title>Error 503 </title>
                                          </head>
                                          <body>
                                          <h2>HTTP ERROR: 503</h2>
                                          <p>Problem accessing /api. Reason:
                                          <pre>    Async servlet timeout</pre></p>
                                          <hr /><a href=\"http://eclipse.org/jetty\">Powered by Jetty:// 9.3.7.v20160115</a><hr/>
                                          </body>
                                          </html>
                                          "}},
              :at [datomic.client.api.async$ares invokeStatic "async.clj" 56]}],
       :trace [[datomic.client.api.async$ares invokeStatic "async.clj" 56]
               [datomic.client.api.async$ares invoke "async.clj" 52]
               [datomic.client.api.sync$eval21871$fn__21876 invoke "sync.clj" 83]
               [datomic.client.api.protocols$eval17089$fn__17125$G__17074__17132 invoke "protocols.clj" 58]
               [datomic.client.api$transact invokeStatic "api.clj" 172]
               [datomic.client.api$transact invoke "api.clj" 155]
               [linnaeus.lab.crossref.datomic_import$transact_articles_from_chan_BANG_$fn__19902$fn__19905$f__15910__auto____19906
                invoke
                "datomic_import.clj"
                106]
               [clojure.lang.AFn run "AFn.java" 22]
               [io.aleph.dirigiste.Executor$3 run "Executor.java" 318]
               [io.aleph.dirigiste.Executor$Worker$1 run "Executor.java" 62]
               [manifold.executor$thread_factory$reify__15792$f__15793 invoke "executor.clj" 44]
               [clojure.lang.AFn run "AFn.java" 22]
               [java.lang.Thread run "Thread.java" 748]]}

Retrying with exponential backoff doesn’t help, even when the retries stretch over several minutes. The EC2 nodes of the Datomic stack show near-zero CPU, memory and network utilization, while the CloudWatch metrics show a constant HttpEndpointAsyncTimeout of 1.0.
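For anyone trying to reproduce this, here is a minimal sketch of the kind of backoff meant here. It is not the actual import code: retry-with-backoff, transient-anomaly?, and the default delays are hypothetical names and values chosen for illustration. It assumes the failure arrives as an ex-info whose data carries a :cognitect.anomalies/category, as in the error above.

```clojure
;; Hypothetical helper, not part of the Datomic client API.
(defn transient-anomaly?
  "True when the exception's ex-data marks it as :unavailable or :busy."
  [e]
  (contains? #{:cognitect.anomalies/unavailable
               :cognitect.anomalies/busy}
             (:cognitect.anomalies/category (ex-data e))))

;; Hypothetical helper; the default attempt count and delays are assumptions.
(defn retry-with-backoff
  "Calls (f) up to max-attempts times. After the n-th failure it sleeps
  base-ms * 2^(n-1) milliseconds, capped at max-ms. Exceptions for which
  retry? returns false, and the final failure, are rethrown."
  [f {:keys [max-attempts base-ms max-ms retry?]
      :or   {max-attempts 5
             base-ms      100
             max-ms       10000
             retry?       (constantly true)}}]
  (loop [attempt 1]
    (let [result (try
                   {:ok (f)}
                   (catch Exception e
                     (if (and (< attempt max-attempts) (retry? e))
                       {:retry e}
                       (throw e))))]
      (if (contains? result :ok)
        (:ok result)
        (do
          ;; recur is not allowed inside catch, so sleep and loop out here
          (Thread/sleep (long (min max-ms
                                   (* base-ms (bit-shift-left 1 (dec attempt))))))
          (recur (inc attempt)))))))

;; Usage (conn and batch are placeholders for a connection and a tx-data vector):
;; (retry-with-backoff #(d/transact conn {:tx-data batch})
;;                     {:retry? transient-anomaly?})
```

The result is wrapped in a map so that a nil or false return value from f is not mistaken for a failure.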

What’s driving me crazy is that these failures seem so random. Everything works fine for a dozen or so transactions, and then, despite virtually no load, every request fails.

What might be causing this?