Ions Silent Timeout Failure

Folcon · February 24, 2020, 4:07pm

So my ions deployment has been silently failing for a while, just responding with {"message": "Endpoint request timed out"} . Looking at the logs there’s no info that anything is wrong, no alerts are being generated, it just appears to be silently timing out.

I turned on x-ray tracing errors, which just shows the request being sent to the lambda app and then timing out after 1 minute.

The thing that’s puzzling is that it should be just calling a ring get request, so I’m not sure why it’s timing out? All throughout development it was working fine, I released a version of it which my users were happy with. And have been subsequently working on other things for a bit, before I get some time to go back to it again.
So 3 questions:

Any idea how I can debug what’s wrong?
At the very least to get alerts etc if it just fails again? I really don’t want this to happen again…
Do I have a wrong idea about the development model of ions? Do they need actively require a person to maintain them? This is sort of surprising for me, as that does essentially discount them as an option to deliver solutions for anyone but a dev shop.

marshall · March 17, 2020, 3:03pm

Is this a solo or production system?
The first thing I would attempt is restarting your compute node(s) that are running the ion.

Have you examined your Datomic Cloudwatch dashboard and logs? Is the system up and reporting logs/metrics?

Folcon · May 26, 2020, 2:37pm

Hi @marshall,

Sorry for not responding sooner, I didn’t realise that you’d responded.

This is a solo system, we were evaluating how well Ion’s fits our various use-cases.

In this particular case I’d repointed our DNS to one of our fallback systems, I’d left the ion as-is because I wanted to sit down at some point and unpick what went wrong, but I’ve not had the time to do so yet.

Is restarting the main way of dealing with these? If so is there some way of managing these? Should I be looking to implement some health check and then automatically rebooting the machine if it fails?

The logs from what I can recall at the time didn’t output that anything was wrong as I mentioned in my original post.

I was quite frustrated at the time because I had no idea that the system had just been failing, I mean I’m ok with systems failure, that happens, but silently failing? That’s less great.

Happy to sit down and diagnose it if you want to?

I understand however if your followup is use the production tier =)…

Topic		Replies	Views
Ion deployment failure Troubleshooting	6	2967	May 28, 2019
Solo node became unresponsive and required reboot Troubleshooting	3	912	December 11, 2019
Cache behaviour on production topology Datomic Cloud	4	1394	February 22, 2019
Ions on production topology deploy Datomic Cloud	1	918	May 28, 2019
Ion-dev 0.9.282 and Ion 0.9.50 Announcements	0	555	February 23, 2021

Ions Silent Timeout Failure

Related topics