After running for some time, the Clojure service starts hitting failed connections to the transactor:
org.apache.activemq.artemis.api.core.ActiveMQNotConnectedException: AMQ219006: Channel disconnected
at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.connectionDestroyed(ClientSessionFactoryImpl.java:374)
at org.apache.activemq.artemis.core.remoting.impl.netty.NettyConnector$Listener$1.run(NettyConnector.java:1228)
at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
...
org.apache.activemq.artemis.api.core.ActiveMQObjectClosedException: AMQ219017: Consumer is closed
at org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.checkClosed(ClientConsumerImpl.java:971)
at org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.receive(ClientConsumerImpl.java:204)
...
org.apache.activemq.artemis.api.core.ActiveMQNotConnectedException: AMQ219010: Connection is destroyed
at org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:460)
at org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:434)
at org.apache.activemq.artemis.core.protocol.core.impl.ActiveMQClientProtocolManager.createSessionContext(ActiveMQClientProtocolManager.java:300)
....
clojure.lang.ExceptionInfo: Error communicating with HOST 10.0.1.78 on PORT 4334
at datomic.connector$endpoint_error.invokeStatic(connector.clj:53)
at datomic.connector$endpoint_error.invoke(connector.clj:50)
at datomic.connector.TransactorHornetConnector$fn__9390.invoke(connector.clj:224)
at datomic.connector.TransactorHornetConnector.admin_request_STAR_(connector.clj:212)
at datomic.peer.Connection$fn__9646.invoke(peer.clj:219)
...
The "Connection is destroyed" and "Error communicating" errors repeat until the service is dead.
We cannot find any particular reason for this.
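For reference, the kind of reconnect-and-retry wrapper we could bolt onto peer calls would look roughly like this (illustrative only; it assumes the standard datomic.api peer API, the URI and function names are placeholders, and the peer normally reconnects on its own):

(require '[datomic.api :as d])

(def db-uri "datomic:dev://10.0.1.78:4334/our-db") ; placeholder URI

(defn query-with-reconnect
  "Run query-fn against a fresh db value, retrying once with a fresh
   connection if the peer reports a communication error. Illustrative
   only -- the peer normally reconnects by itself."
  [query-fn]
  (try
    (query-fn (d/db (d/connect db-uri)))
    (catch clojure.lang.ExceptionInfo e
      ;; e.g. "Error communicating with HOST 10.0.1.78 on PORT 4334"
      (println "Peer communication error, retrying once:" (.getMessage e))
      (Thread/sleep 5000)
      (query-fn (d/db (d/connect db-uri))))))

;; usage, e.g.:
;; (query-with-reconnect #(d/q '[:find (count ?e) . :where [?e :db/ident]] %))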
One possible clue is a GC allocation issue:
A prolonged GC pause on the transactor could lead to a peer receiving timeouts or exceptions. I would set MaxGCPauseMillis as recommended. If you encounter this error again after setting that, I’d like to see the peer logs and transactor logs at the time of the event. I should be able to see the metrics for GC pause/query timeout or refusal by looking at both logs along with other context clues.
If you’d like me to review the logs in this instance, I recommend e-mailing support@cognitect.com to open a case with us; you can then attach or link to the logs from that case.
The GC pause was on the peer, not on the transactor.
The Java props I sent you earlier are from the peer as well (my bad).
The transactor runs with the JVM settings -Xms200g -Xmx200g -XX:+UseG1GC -XX:MaxGCPauseMillis=50.
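In case it helps correlate pauses with the disconnects, here is a minimal sketch of reading cumulative GC pause totals from inside the peer process via the standard JDK GC MXBeans (nothing Datomic-specific; collector names and numbers depend on the GC in use):

(import '(java.lang.management ManagementFactory GarbageCollectorMXBean))

(defn gc-pause-totals
  "Cumulative collection count and approximate total pause time (ms)
   per collector, as reported by the JVM's GC MXBeans."
  []
  (vec (for [^GarbageCollectorMXBean gc (ManagementFactory/getGarbageCollectorMXBeans)]
         {:collector      (.getName gc)
          :collections    (.getCollectionCount gc)
          :total-pause-ms (.getCollectionTime gc)})))

;; e.g. (gc-pause-totals)
;; => [{:collector "G1 Young Generation", :collections 42, :total-pause-ms 830}
;;     {:collector "G1 Old Generation",   :collections 0,  :total-pause-ms 0}]
;; (numbers are illustrative; sample periodically and diff to spot long pauses)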
A 200g heap is a large heap for a transactor, so I’d be interested in hearing more about your system. The trade-off of a larger heap on the peer or transactor is increased GC overhead. If you have the logs, I should look at both the peer and transactor logs during this event. Could you supply your transactor logs (both active and standby) and peer logs for this event via our support portal? You can e-mail support@cognitect.com or use the website here: https://support.cognitect.com/hc/en-us/requests/new
Do you always run your peer with -XX:+PrintGCDetails, -verbose:gc, and -XX:+PrintGCDateStamps, or did you add those for this case? I’d recommend not having them set unless you are specifically troubleshooting a GC issue.