API Gateway internal server error

This looks like a cold-start issue, since it only happens when I make a request against API Gateway routes after not using them for a little while. The first request always returns an internal server error.

From the associated lambda logs, I see this error:

java.io.IOException: Connection reset by peer: datomic.ion.lambda.handler.exceptions.Fault
datomic.ion.lambda.handler.exceptions.Fault: java.io.IOException: Connection reset by peer
at datomic.ion.lambda.handler$throw_anomaly.invokeStatic(handler.clj:24)
at datomic.ion.lambda.handler$throw_anomaly.invoke(handler.clj:20)
at datomic.ion.lambda.handler.Handler.on_anomaly(handler.clj:139)
at datomic.ion.lambda.handler.Handler.handle_request(handler.clj:155)
at datomic.ion.lambda.handler$fn__4075$G__4011__4080.invoke(handler.clj:70)
at datomic.ion.lambda.handler$fn__4075$G__4010__4086.invoke(handler.clj:70)
at clojure.lang.Var.invoke(Var.java:396)
at datomic.ion.lambda.handler.Thunk.handleRequest(Thunk.java:35)

It takes about 5 seconds for the error to happen. The lambda timeout is set to 3 minutes, and the API Gateway timeout is the default.

Subsequent requests (with the exact same request parameters) work without issue. It also doesn’t seem to be 100% reproducible: sometimes I do get a response, albeit after the same ~5 second pause.

I have also hit this exception, so I opened a ticket with the Datomic team. Their response was: retry the request. I added retry to my code and haven’t hit the exception yet. IMO, the exception should be fixed rather than worked around.
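The retry itself is nothing elaborate, just a short backoff loop around the request. A minimal sketch using Python and the requests library (the endpoint URL, attempt count, and backoff values are placeholders, not the actual service):

```python
import time
import requests

# Placeholder endpoint; substitute the real API Gateway route.
ENDPOINT = "https://example.execute-api.us-east-1.amazonaws.com/dev/my-route"

def get_with_retry(url, attempts=3, backoff=0.5):
    """GET with a small retry/backoff for transient 5xx responses and connection resets."""
    resp, last_exc = None, None
    for i in range(attempts):
        try:
            resp = requests.get(url, timeout=10)
            if resp.status_code < 500:      # success, or a 4xx that retrying won't fix
                return resp
        except requests.exceptions.ConnectionError as exc:
            last_exc = exc                  # e.g. "Connection reset by peer"
        time.sleep(backoff * (2 ** i))      # 0.5s, 1s, 2s, ...
    if resp is not None:
        return resp                         # hand back the final 5xx response
    raise last_exc

response = get_with_retry(ENDPOINT)
print(response.status_code)
```

Only 5xx responses and connection-level errors are retried, so genuine client errors still surface immediately.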

I’m unsure what I could even retry in this case. I just have API Gateway pointing at the lambda, and my code literally returns a constant value.

I experienced this when I switched to the production topology. My entire app is served through a single ion connected to API Gateway, and I would randomly get 500 errors for certain routes. Sometimes the request for the CSS file would fail, other times the main JS file, and sometimes even the / route itself (so it would just display a blank page with “Connection reset by peer”). After switching back to the solo topology, the problem went away.

These errors indicate that the lambda has lost its connection to the Datomic node. This would be expected any time you deploy an ion or re-launch a system (i.e. anything that cycles the Datomic process or EC2 instances will break the connection).

Did you wrap your requests to the endpoint(s) in retry?

Can you provide more details as to what you’re building? It sounds like you’re building a full web app via a single Ion. We tend to use something like S3 for static site resources (e.g. your CSS), while using the Ion as a service endpoint for dynamic resources.

It may be that the lambda-backed endpoint approach is not an ideal fit for this use case. However, an upcoming release will provide alternative “plumbing” that doesn’t rely on lambdas, which may better fit this type of use.


The “Connection reset by peer” errors were happening a day after I deployed, so it wasn’t due to the node restarting.

I am not sure what this means.

The site I was referring to is Midibin. I agree that serving static resources through an ion is not ideal, but it’s just a small hobby project so I was more concerned with convenience.

The reason I wanted to use the production topology is that Midibin is very memory-hungry (it synthesizes music server-side) and the t2.small node just doesn’t have enough RAM, so I have to re-deploy the site regularly. But as I mentioned, the production topology has been no bueno so far. Not an emergency, though; I look forward to the upcoming release.

For our use case, we had a very simple REST API with three routes (each of them is only a GET).

These routes are called by our mobile apps on launch, and we don’t have any automatic retry for them, so an error here leads to an error dialog being presented to the user (yeah, we can debate that approach, but the solution of “just add a retry” is not that simple).

We see this error very often, and it’s not related to a deploy.

We basically migrated this API from a dedicated EC2 instance to an Ion (with a Lambda and API Gateway) in our dev stack, and our automated UI tests began failing occasionally because of this error. We have since migrated back to running this service on a dedicated EC2 instance, because the Lambda’s reliability wasn’t tenable as things stood.

Is there something else I should investigate that might be cycling the Datomic process?

I should be clearer: cycling the Datomic process is one thing that can cause this issue, but anything that disrupts the connection between the Lambda and the Datomic node will likely do so. I suspect a number of Lambda timeout-related problems can produce similar failures.

If you’re unable to implement a retry strategy on the client side of this interaction, I’d suggest either doing what you did (moving to an instance that runs your client) or using the (upcoming) non-lambda connection that will allow API Gateway to call the Datomic node directly.

OK. This error seems to coincide with a Lambda cold-start (though I don’t have clear evidence that it only occurs with cold-starts), so some kind of timeout seems possible. I’ll wait and reevaluate once a Lambda-free option is available.

But it’s kind of a shame, because this option (Ion + Lambda + API Gateway) seemed like an ideal solution for us; we tried running this code as a stand-alone Lambda, but startup times were abysmal. Even though there were some slow starts, the Ion solution was approaching something viable for production. We’re also not thrilled about the cost of dedicated EC2 instances for this service, and wanted a serverless option.

Thanks, we will keep an eye out.

This is also something we’re experiencing, exclusively on cold starts. It feels like a timeout, or a request being made too quickly at some stage in the chain.

In my case, I had the same problem with the Solo topology, with a completely different app and set of dependencies. Thus it seems to be something unrelated to our code.

I’d be happy to help beta-test a non-lambda template if that helps.


Setting up a CloudWatch event to ping your endpoint(s) every 5 minutes seems to resolve the cold-start issue.

I used a pre-built CloudFormation template for this, but you can create the CloudWatch event yourself.
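If you’d rather script it than use a template, the wiring is just a scheduled rule, an invoke permission, and a target. A rough sketch using Python and boto3 (the rule name, pinger Lambda, and endpoint URL are placeholders, and the CloudFormation template may wire things differently):

```python
"""
Keep-warm sketch: a CloudWatch Events rule fires every 5 minutes and invokes a
tiny "pinger" Lambda, which GETs the API Gateway endpoint so the ion-backed
Lambda rarely starts cold. All names/ARNs below are placeholders.
"""
import json
import boto3

ENDPOINT_URL = "https://example.execute-api.us-east-1.amazonaws.com/dev/"        # your route
PINGER_ARN = "arn:aws:lambda:us-east-1:123456789012:function:keep-warm-pinger"   # your pinger

events = boto3.client("events")
lam = boto3.client("lambda")

# 1. A rule that fires on a 5-minute schedule.
rule = events.put_rule(
    Name="ion-keep-warm",
    ScheduleExpression="rate(5 minutes)",
    State="ENABLED",
)

# 2. Allow CloudWatch Events to invoke the pinger Lambda.
lam.add_permission(
    FunctionName=PINGER_ARN,
    StatementId="allow-ion-keep-warm",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule["RuleArn"],
)

# 3. Point the rule at the pinger, passing the endpoint URL in the event payload.
events.put_targets(
    Rule="ion-keep-warm",
    Targets=[{"Id": "pinger", "Arn": PINGER_ARN, "Input": json.dumps({"url": ENDPOINT_URL})}],
)

# The pinger Lambda itself only needs to issue the GET, e.g.:
#
#   import urllib.request
#
#   def handler(event, context):
#       with urllib.request.urlopen(event["url"], timeout=10) as resp:
#           return {"status": resp.status}
```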

Apparently using a VPC with the Lambda can add up to a 15-second response delay on cold starts: https://www.zrzka.dev/2016/10/30/aws-journey-api-gateway-lambda-vpc-performance.html