Ions Silent Timeout Failure

So my Ions deployment has been silently failing for a while, just responding with {"message": "Endpoint request timed out"}. Looking at the logs, there’s no indication that anything is wrong and no alerts are being generated; it just appears to be silently timing out.

I turned on X-Ray tracing, which just shows the request being sent to the Lambda app and then timing out after 1 minute.

The puzzling thing is that it should just be calling a Ring GET request, so I’m not sure why it’s timing out. All throughout development it was working fine, and I released a version of it which my users were happy with. I have since been working on other things for a bit, until I get some time to go back to it again.
So 3 questions:

  1. Any idea how I can debug what’s wrong?
  2. At the very least, how can I get alerts etc. if it fails again? I really don’t want this to happen again…
  3. Do I have the wrong idea about the development model of Ions? Do they actively require a person to maintain them? That would be surprising to me, as it would essentially discount them as an option for delivering solutions for anyone but a dev shop.
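
On question 2, one option that does not depend on the ion's own logs is an external probe that calls the endpoint and treats the timeout body as a failure. The following is only a minimal sketch in Python using the standard library: the `probe` and `classify_response` names are made up for illustration, the URL passed in is a placeholder, and how the result gets turned into an alert is left open.

```python
import json
import urllib.error
import urllib.request

# The response body seen in the failing case described above.
TIMEOUT_MESSAGE = "Endpoint request timed out"


def classify_response(status, body):
    """Return "timeout", "ok", or "error" for one HTTP response."""
    try:
        message = json.loads(body).get("message", "")
    except (ValueError, AttributeError):
        # Body is not JSON, or is JSON but not an object.
        message = ""
    if message == TIMEOUT_MESSAGE:
        return "timeout"
    if 200 <= status < 300:
        return "ok"
    return "error"


def probe(url, timeout_seconds=10.0):
    """Fetch url once and classify the outcome (hypothetical helper)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds) as resp:
            body = resp.read().decode("utf-8", "replace")
            return classify_response(resp.status, body)
    except urllib.error.HTTPError as err:
        # Non-2xx responses arrive here; the body may still be readable.
        body = err.read().decode("utf-8", "replace")
        return classify_response(err.code, body)
    except OSError:
        # Connection failures and client-side timeouts.
        return "unreachable"
```

Run on a schedule from somewhere outside the system being checked, anything other than "ok" could feed whatever alerting is already in place.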

Is this a solo or production system?
The first thing I would attempt is restarting your compute node(s) that are running the ion.

Have you examined your Datomic CloudWatch dashboard and logs? Is the system up and reporting logs/metrics?

Hi @marshall,

Sorry for not responding sooner, I didn’t realise that you’d responded.

This is a solo system; we were evaluating how well Ions fit our various use-cases.

In this particular case I’d repointed our DNS to one of our fallback systems and left the ion as-is, because I wanted to sit down at some point and unpick what went wrong, but I’ve not had the time to do so yet.

Is restarting the main way of dealing with these? If so, is there some way of managing these? Should I be looking to implement some health check and then automatically reboot the machine if it fails?
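
To make the health check idea concrete, here is a rough sketch (Python, standard library only) of the decision part: only signal a restart after several failed checks in a row, so that a single slow request doesn't trigger a reboot. `RestartPolicy` and the threshold of 3 are my own assumptions, and the restart action itself is deliberately left out, since I don't know what the recommended mechanism is.

```python
class RestartPolicy:
    """Signal a restart after `threshold` consecutive failed health checks."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.consecutive_failures = 0

    def record(self, healthy):
        """Record one check result; return True when a restart is due."""
        if healthy:
            # Any success clears the run of failures.
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.threshold:
            # Reset so the next restart needs a fresh run of failures.
            self.consecutive_failures = 0
            return True
        return False
```

Each scheduled check would call `record(...)` with its result, and a return value of True would be the point at which to alert and, if that turns out to be the right remedy, restart the node.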

From what I can recall, the logs at the time didn’t show that anything was wrong, as I mentioned in my original post.

I was quite frustrated at the time because I had no idea that the system had been failing. I’m OK with systems failing, that happens, but silently failing? That’s less great.

Happy to sit down and diagnose it if you want to?

I understand, however, if your follow-up is “use the production tier” =)…