REPL on the Compute nodes?

While debugging my Ion Lambda, I would really like to connect to a REPL on the compute node.

I see something is running on port 5555, but nc/telnet doesn’t give me a REPL prompt.

Is it possible to have a REPL into the running code?

Hi Danie,

Datomic does not start a server REPL, and nodes run in a security group and private VPC that is locked down.

We have worked to eliminate the need to connect to individual boxes, as it is both error-prone and a security risk.

When I am developing an ion, I sort out all the minor gotchas at a local REPL before ever deploying, and from there have found ion events sufficient for tracking down e.g. configuration errors. What sorts of problems are you encountering?

Cheers,
Stu

Hi Stu,

I appreciate the locked-down nature of the nodes and the need for it, and I understand that individual boxes will become inconsistent if their state is changed via a REPL.

In this case, I wanted to inspect the state of a Kafka producer and send a few messages to trace what was happening, because it did not seem to pick up the properties that I gave it.

The cast events added to my confusion, because bootstrap.servers was printed correctly:

{
    "Msg": "Instantiating producer",
    "xxxProducerProps": {
        "bootstrap.servers": "ip-10-211-11-111.us-east-2.compute.internal:9092"
    },
    "Type": "Event",
    "Tid": 56,
    "Timestamp": 1550055900480
}

But the Producer error message still said localhost:

{
    "Msg": "[Producer clientId=clientid] Connection to node 0 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.",
    "DatomicCloudSlf4jLevel": "Warn",
    "DatomicCloudSlf4jSymbol": "org.apache.kafka.clients.NetworkClient",
    "DatomicCloudSlf4jSource": "SLF4J",
    "Type": "Event",
    "Tid": 57,
    "Timestamp": 1550055922977
}

At this point, I would have liked to poke around in a REPL.

I did figure out after a while that my Kafka broker had this set: advertised.host.name=localhost. The producer did actually connect to the broker at the bootstrap address, but the broker then advertised itself as localhost, so the producer's follow-up connections went to localhost. That is where the quite misleading error message came from.

This was not fun to track down.
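
For anyone hitting the same mismatch between the logged bootstrap.servers and the localhost error, here is a rough sketch of what was going on. It is a toy Python model, not the Kafka client: the functions broker_metadata and connection_targets are made up for illustration. It only mimics the two steps, where the client uses bootstrap.servers for its first connection and then connects to whatever host the broker advertises in its metadata.

```python
# Toy model of the Kafka client's two-step connection logic.
# Not the real client: both functions are hypothetical and simplified.

def broker_metadata(advertised_host, port):
    """What a single broker reports about itself in a metadata response."""
    return [{"node_id": 0, "host": advertised_host, "port": port}]

def connection_targets(bootstrap_servers, metadata):
    """Return the bootstrap address and the addresses used afterwards."""
    # Step 1: the first connection goes to an entry from bootstrap.servers.
    first = bootstrap_servers.split(",")[0].strip()
    # Step 2: later connections use the hosts the broker advertised.
    produce_to = [f"{node['host']}:{node['port']}" for node in metadata]
    return first, produce_to

bootstrap = "ip-10-211-11-111.us-east-2.compute.internal:9092"
metadata = broker_metadata(advertised_host="localhost", port=9092)

first, produce_to = connection_targets(bootstrap, metadata)
print("bootstrap connection:", first)
print("produce connections: ", produce_to)
```

So the producer properties were fine, and the localhost in the warning came from the broker side. The fix was on the broker: advertise an address the client can actually reach (advertised.host.name in my setup, or advertised.listeners on newer Kafka versions).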

We had an issue this morning, same symptoms as described over in…

… except this time it was persistent. I solved it by terminating the node. It returned 500 on requests, but for some reason the NLB was perfectly fine with this and didn’t terminate it automatically.

We’re now looking at a more liberal sprinkling of casts throughout the code to catch it if and when it happens again, but this feels a bit like black-box engineering. It would have been nice to have a more direct way to inspect the state of the machine.