Datomic version: 1.0.6362
After running for some time, our Clojure service starts failing to connect to the transactor:
org.apache.activemq.artemis.api.core.ActiveMQNotConnectedException: AMQ219006: Channel disconnected
org.apache.activemq.artemis.api.core.ActiveMQObjectClosedException: AMQ219017: Consumer is closed
org.apache.activemq.artemis.api.core.ActiveMQNotConnectedException: AMQ219010: Connection is destroyed
clojure.lang.ExceptionInfo: Error communicating with HOST 10.0.1.78 on PORT 4334
The "Connection is destroyed" and "Error communicating" exceptions then repeat until the service dies.
We cannot find any particular reason for that.
One possible clue is a GC allocation failure:
[Full GC (Allocation Failure) 57144M->33793M(57344M), 35.1363911 secs]
[Eden: 0.0B(2864.0M)->0.0B(6800.0M) Survivors: 0.0B->0.0B Heap: 57144.5M(57344.0M)->33793.5M(57344.0M)], [Metaspace: 155619K->155616K(159744K)]
[Times: user=50.45 sys=0.05, real=35.14 secs]
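The Full GC line above can be read mechanically. The following is a minimal Python sketch, assuming the Java 8 G1 log format exactly as quoted; the function name `summarize_full_gc` and the field names are hypothetical, chosen only for illustration. It extracts the pause length and the heap occupancy before and after the collection.

```python
import re

# Matches a Java 8 "Full GC" summary line in the form quoted in this post.
# Other JVM versions or collectors may format this line differently.
FULL_GC = re.compile(
    r"\[Full GC \((?P<cause>[^)]+)\)\s+"
    r"(?P<before>\d+)M->(?P<after>\d+)M\((?P<total>\d+)M\),\s+"
    r"(?P<secs>\d+(?:\.\d+)?) secs\]"
)

def summarize_full_gc(line):
    """Return pause time and heap occupancy for a Full GC line, or None."""
    m = FULL_GC.search(line)
    if m is None:
        return None
    before, after, total = (int(m.group(k)) for k in ("before", "after", "total"))
    return {
        "cause": m.group("cause"),
        "pause_secs": float(m.group("secs")),
        # Share of the heap in use just before and just after the collection.
        "used_before_pct": round(100.0 * before / total, 1),
        "used_after_pct": round(100.0 * after / total, 1),
        "reclaimed_mb": before - after,
    }

line = "[Full GC (Allocation Failure) 57144M->33793M(57344M), 35.1363911 secs]"
summary = summarize_full_gc(line)
print(summary)
```

For the quoted line this gives a pause of about 35 seconds, with the heap almost full before the collection and roughly 59% of it still occupied afterwards.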
Could GC be the cause here, or are there other possible causes?
@deivydasofc Did you update your version of Java recently? What version are you running? Are you seeing this error on the peer or the transactor?
Also, if this is the transactor, are you using the recommended GC flags? See: Transactor | Datomic
No, Java was not changed recently.
The Java version is currently good old 1.8.
I am seeing this on the peer.
I see that -XX:MaxGCPauseMillis=50 was not set on my side. Could setting it help?
A prolonged GC pause on the transactor could lead to a peer receiving timeouts or exceptions. I would set -XX:MaxGCPauseMillis as recommended. If you encounter this error again after setting that, I'd like to see the peer logs and transactor logs from the time of the event. By looking at both logs, I should be able to see the metrics for GC pauses and query timeouts or refusals, along with other context clues.
If you’d like me to review the logs in this instance I recommend e-mailing email@example.com to open a case with us and then you can attach or link to the logs from that case.
The GC pause was on the peer, not on the transactor.
The Java properties I sent you earlier are from the peer too (my bad).
The transactor runs with these JVM settings:
-Xms200g -Xmx200g -XX:+UseG1GC -XX:MaxGCPauseMillis=50
A 200g heap is large for a transactor. I'd be interested in hearing more about your system. The trade-off of giving a peer or transactor a larger heap is increased GC overhead. If you have the logs, I think I should look at both the peer and transactor logs from this event. Could you send me your transactor logs (both active and standby) and your peer logs for this event via our support portal? You can e-mail firstname.lastname@example.org or submit a request via the website here: https://support.cognitect.com/hc/en-us/requests/new
Do you always run your peer with -XX:+PrintGCDetails -verbose:gc and -XX:+PrintGCDateStamps, or did you add those for this case? I'd recommend not setting them unless you are specifically troubleshooting a GC issue.