Transactor stops responding with "Too many open files" error

We run Datomic Cloud with the production topology, serving accounting documents (e.g., invoices) in JSON format through a REST API (Lambda Ions + AWS API Gateway).

One customer queries hundreds of documents in a batch process, at a rate of approximately 20 requests/sec.

After some time, the transactor hangs with a “Too many open files” error. Has anyone experienced a similar error?

{
    "Msg": ":valcache/get-exception",
    "Ex": {
        "Via": [
            {
                "Type": "java.nio.file.FileSystemException",
                "Message": "/opt/ssd1/datomic/valcache/761/7dc57059-f6db-4398-b4a5-5cedc0704761: Too many open files",
                "At": [
                    "sun.nio.fs.UnixException",
                    "translateToIOException",
                    "UnixException.java",
                    91
                ]
            }
        ],
        "Trace": [
            [
                "sun.nio.fs.UnixException",
                "translateToIOException",
                "UnixException.java",
                91
            ],
            [
                "sun.nio.fs.UnixException",
                "rethrowAsIOException",
                "UnixException.java",
                102
            ],
            [
                "sun.nio.fs.UnixException",
                "rethrowAsIOException",
                "UnixException.java",
                107
            ],
            [
                "sun.nio.fs.UnixFileSystemProvider",
                "newFileChannel",
                "UnixFileSystemProvider.java",
                177
            ],
            [
                "java.nio.channels.FileChannel",
                "open",
                "FileChannel.java",
                287
            ],
            [
                "java.nio.channels.FileChannel",
                "open",
                "FileChannel.java",
                335
            ],
            [
                "datomic.cloud.valcache$direct_get",
                "invokeStatic",
                "valcache.clj",
                435
            ],
            [
                "datomic.cloud.valcache$direct_get",
                "invoke",
                "valcache.clj",
                426
            ],
            [
                "datomic.cloud.mc_cache.McCache$f__21348",
                "invoke",
                "mc_cache.clj",
                163
            ],
            [
                "clojure.lang.AFn",
                "run",
                "AFn.java",
                22
            ],
            [
                "java.util.concurrent.ThreadPoolExecutor",
                "runWorker",
                "ThreadPoolExecutor.java",
                1149
            ],
            [
                "java.util.concurrent.ThreadPoolExecutor$Worker",
                "run",
                "ThreadPoolExecutor.java",
                624
            ],
            [
                "java.lang.Thread",
                "run",
                "Thread.java",
                748
            ]
        ],
        "Cause": "/opt/ssd1/datomic/valcache/761/7dc57059-f6db-4398-b4a5-5cedc0704761: Too many open files"
    },
    "Type": "Event",
    "Tid": 84,
    "Timestamp": 1622445049738
}
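For anyone debugging this class of error, a quick Linux-side check is to compare the process's open-descriptor count against its soft limit. This is a sketch, assuming a Linux `/proc` filesystem; substitute the actual transactor JVM's PID for `self`:

```shell
# Soft limit on open file descriptors for the current process
echo "soft limit: $(ulimit -n)"

# Number of descriptors this process currently has open.
# For the transactor, replace "self" with its PID, e.g. /proc/12345/fd
echo "open fds:   $(ls /proc/self/fd | wc -l)"
```

A count that climbs steadily toward the limit while load stays constant points at a descriptor leak rather than legitimate concurrency.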

I’m also seeing the java.io.IOException "Too many open files", though our exception is coming from SLF4J:

{
    "Msg": "",
    "DatomicCloudSlf4jLevel": "Warn",
    "DatomicCloudSlf4jSymbol": "org.eclipse.jetty.server.AbstractConnector",
    "DatomicCloudSlf4jSource": "SLF4J",
    "Ex": {
        "Via": [
            {
                "Type": "java.io.IOException",
                "Message": "Too many open files",
                "At": [
                    "sun.nio.ch.ServerSocketChannelImpl",
                    "accept0",
                    "ServerSocketChannelImpl.java",
                    -2
                ]
            }
        ],
        "Trace": [
            [
                "sun.nio.ch.ServerSocketChannelImpl",
                "accept0",
                "ServerSocketChannelImpl.java",
                -2
            ],
            [
                "sun.nio.ch.ServerSocketChannelImpl",
                "accept",
                "ServerSocketChannelImpl.java",
                421
            ],
            [
                "sun.nio.ch.ServerSocketChannelImpl",
                "accept",
                "ServerSocketChannelImpl.java",
                249
            ],
            [
                "org.eclipse.jetty.server.ServerConnector",
                "accept",
                "ServerConnector.java",
                388
            ],
            [
                "org.eclipse.jetty.server.AbstractConnector$Acceptor",
                "run",
                "AbstractConnector.java",
                702
            ],
            [
                "org.eclipse.jetty.util.thread.QueuedThreadPool",
                "runJob",
                "QueuedThreadPool.java",
                773
            ],
            [
                "org.eclipse.jetty.util.thread.QueuedThreadPool$Runner",
                "run",
                "QueuedThreadPool.java",
                905
            ],
            [
                "java.lang.Thread",
                "run",
                "Thread.java",
                748
            ]
        ],
        "Cause": "Too many open files"
    },
    "Type": "Event",
    "Tid": 125,
    "Timestamp": XXXX
}

I think it began around the time that I changed the logback config, but I can’t understand why.

Were you able to solve yours?


@adamtait I do want to confirm that you’re getting this message on Cloud, on a primary compute node.

In general, the message indicates that you have too many connections open. If you can share some of your code, I can look for likely culprits or for places where monitoring might help. Are you using Cognitect’s AWS API? Specifically, are you calling aws/client? If you can’t provide your code, can you describe what it is doing? It’s possible that your code is opening too many files, or that file descriptors are leaking. But I’d like to know more about your application before speculating.

If you revert the change to the logback config, does the error go away? What change did you make?


Hey guys, I did not resolve it completely, but I managed to isolate it. I separated the web server Ion into a query group and left the database functions and other Ions in the main transactor.

Now I get the “Too many open files” error less frequently, and only in the query group. I’m still trying to debug this issue. I found this library that helps you detect file leaks:

https://file-leak-detector.kohsuke.org
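For reference, the project runs as a Java agent that records the stack trace of every file open and can serve a live report over HTTP. The jar path and port below are placeholders, so treat this as a sketch of the invocation rather than exact flags:

```shell
# Attach file-leak-detector at JVM startup; with http=PORT it serves
# a report of currently open files and where they were opened.
java -javaagent:path/to/file-leak-detector.jar=http=19999 \
     -jar my-app.jar
```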

I’ll report back and share the offending line of code as soon as I find it.


Hi @jaret - thanks for replying.
Yes - I’m getting this error on a primary compute machine (not on the transactor… so probably not related to @eraad’s problem).

I cannot share my code, but the connections idea has got me thinking.

In general, the message indicates that you have too many connections open.

I use github.com/igrishaev/etaoin to open a connection to a webdriver on another machine. Our machine does this many dozens of times per day. It’s possible that the connections are not being closed correctly, and dead connections are building up.
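One cheap way to test that theory on the affected node: connections the client side never cleaned up tend to sit in CLOSE_WAIT. This sketch reads /proc directly (Linux-only, no lsof or ss needed); CLOSE_WAIT is hex state code 08 in /proc/net/tcp:

```shell
# Count TCP sockets stuck in CLOSE_WAIT (hex state 08 in /proc/net/tcp).
# IPv6 sockets are listed separately in /proc/net/tcp6.
# A count that grows over the day suggests connections are opened
# but never closed.
awk 'NR > 1 && $4 == "08"' /proc/net/tcp | wc -l
```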

I think the logback config change may be a red herring. I imagine Datomic Ions probably ignore a bundled logback config anyway…