Thursday, April 26, 2018

How to Debug "java.io.IOException: Connection reset by peer"?

There are many reasons why WebLogic Server may throw the exception below:
java.io.IOException: Connection reset by peer
In this article, we will use one specific case for discussion.

Stack Trace


####<Apr 26, 2018, 8:42:37,381 AM UTC> <Error> <HTTP> <myserver> <CloudConsoleServer_MyServices> <[ACTIVE] ExecuteThread: '23' for queue: 'weblogic.kernel.Default (self-tuning)'> <<WLS Kernel>> <> <XaUWE0Co100000000> <1524732157381> <[severity-value: 8] [rid: 0:1:2] [partition-id: 0] [partition-name: DOMAIN] > <BEA-101019> <[ServletContext@15863685[app:cp-myservices.ear module:mycloud path:null spec-version:3.1 version:_18.2.4.0.0_180422.1400]] Servlet failed with an IOException.
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
        at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
        at sun.nio.ch.IOUtil.write(IOUtil.java:65)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
        at weblogic.socket.NIOOutputStream$SingleBufferWrite.writeTo(NIOOutputStream.java:841)
        at weblogic.socket.NIOOutputStream$BlockingWriter.flush(NIOOutputStream.java:455)
        at weblogic.socket.NIOOutputStream$BlockingWriter.write(NIOOutputStream.java:334)
        at weblogic.socket.NIOOutputStream.write(NIOOutputStream.java:220)
        at weblogic.servlet.internal.ChunkOutput.writeChunkTransfer(ChunkOutput.java:625)
        at weblogic.servlet.internal.ChunkOutput.writeChunks(ChunkOutput.java:587)
        at weblogic.servlet.internal.ChunkOutput.flush(ChunkOutput.java:471)
        at weblogic.servlet.internal.ChunkOutput$3.checkForFlush(ChunkOutput.java:757)
        at weblogic.servlet.internal.ChunkOutput.write(ChunkOutput.java:373)
        at weblogic.servlet.internal.ChunkOutputWrapper.write(ChunkOutputWrapper.java:165)
        at weblogic.servlet.internal.ServletOutputStreamImpl.write(ServletOutputStreamImpl.java:186)
        at java.io.ByteArrayOutputStream.writeTo(ByteArrayOutputStream.java:167)
        at oracle.adfinternal.view.faces.caching.filter.ResponseOutputStream.writeContentTo(ResponseOutputStream.java:74)
        at oracle.adfinternal.view.faces.caching.filter.AdfFacesCachingResponse._flushContent(AdfFacesCachingResponse.java:147)
        at oracle.adfinternal.view.faces.caching.filter.AdfFacesCachingResponse.flush(AdfFacesCachingResponse.java:136)

How to Debug


In this case, our server is connected to many applications on other servers, so the suspect peer could be either a browser or an application running in our infrastructure.

Given the stack trace, the first thing to do is look for clues in it. For example, in this case, we saw:

 weblogic.servlet.internal.ServletOutputStreamImpl.write(ServletOutputStreamImpl.java:186)
 java.io.ByteArrayOutputStream.writeTo(ByteArrayOutputStream.java:167)
 oracle.adfinternal.view.faces.caching.filter.ResponseOutputStream.writeContentTo(ResponseOutputStream.java:74)
 oracle.adfinternal.view.faces.caching.filter.AdfFacesCachingResponse._flushContent(AdfFacesCachingResponse.java:147)
 oracle.adfinternal.view.faces.caching.filter.AdfFacesCachingResponse.flush(AdfFacesCachingResponse.java:136)


which means that WebLogic Server was writing the servlet response back to the client when the IOException was thrown.

If this client were a browser, what happened could be:
The browser has shut down the connection: either the browser crashed, or it closed the connection explicitly because the user closed the page or cancelled navigation on it.
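
To see what an aborted connection looks like from the writing side, here is a minimal sketch using plain sockets rather than WebLogic. The class name ResetDemo and its method are hypothetical, and the client uses SO_LINGER with a zero timeout to force an abortive close, which is only one way a peer can reset a connection. The exact exception message (for example "Connection reset by peer" or "Broken pipe") depends on the OS and JDK, so the sketch only reports whatever IOException the writer sees.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.CountDownLatch;

// Hypothetical demo class; not part of WebLogic or ADF.
class ResetDemo {

    /**
     * Accepts one loopback connection, lets the client abort it, then keeps
     * writing to it. Returns the IOException seen by the writer as a string,
     * or null if no error surfaced within the write loop.
     */
    static String provokeReset() throws Exception {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket server = new ServerSocket(0, 1, loopback)) {
            int port = server.getLocalPort();
            CountDownLatch accepted = new CountDownLatch(1);

            Thread client = new Thread(() -> {
                try (Socket s = new Socket(loopback, port)) {
                    accepted.await();
                    // Linger 0 turns the close at the end of this block into
                    // an abortive close instead of a graceful shutdown.
                    s.setSoLinger(true, 0);
                } catch (IOException | InterruptedException ignored) {
                    // Nothing to do in this sketch.
                }
            });
            client.setDaemon(true);
            client.start();

            try (Socket peer = server.accept()) {
                accepted.countDown();
                client.join(); // the client has closed its socket by now
                OutputStream out = peer.getOutputStream();
                byte[] chunk = new byte[8192];
                for (int i = 0; i < 200; i++) {
                    out.write(chunk); // plays the role of the servlet response write
                    out.flush();
                    Thread.sleep(10);
                }
                return null;
            } catch (IOException e) {
                return e.toString();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("Writer saw: " + provokeReset());
    }
}
```

The point of the sketch is that the error surfaces on the server's write, not at the moment the client goes away, which matches the write frames at the top of the stack trace.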

If this client were an application (e.g., Selenium or another testing tool), what happened could be:
The tool has a timeout for how long it will wait for the response, and it gave up and closed the socket when that timeout expired.
If so, its logs may show that it closed the socket after some time.
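
The timeout scenario can be sketched with plain sockets as well. This is an illustration under stated assumptions, not how Selenium or any particular tool implements its timeouts: the class ImpatientClientDemo, its method, and the delay values are hypothetical. A slow server sleeps before answering, and the client uses a read timeout (Socket.setSoTimeout) and closes the socket when it expires.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical demo class; the delays stand in for a slow response and a tool's timeout.
class ImpatientClientDemo {

    /** Returns true if the client gave up waiting before the slow server answered. */
    static boolean clientTimesOut(int serverDelayMillis, int clientTimeoutMillis) throws Exception {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket server = new ServerSocket(0, 1, loopback)) {
            Thread slowServer = new Thread(() -> {
                try (Socket s = server.accept()) {
                    Thread.sleep(serverDelayMillis); // simulate a slow response
                    s.getOutputStream().write('x');  // may hit a socket the client already closed
                } catch (IOException | InterruptedException ignored) {
                    // A failure here is the server-side symptom discussed in the article.
                }
            });
            slowServer.setDaemon(true);
            slowServer.start();

            try (Socket client = new Socket(loopback, server.getLocalPort())) {
                client.setSoTimeout(clientTimeoutMillis); // how long read() may block
                client.getInputStream().read();
                return false; // the response arrived in time
            } catch (SocketTimeoutException e) {
                return true;  // the client gave up; its socket is closed on exit from the try block
            } finally {
                slowServer.join();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("Client timed out: " + clientTimesOut(600, 100));
    }
}
```

If the suspect tool logs its timeouts, a log entry at roughly the same time as the server-side IOException is a good hint that this is what happened.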


HTTP Keep-Alive


In [1], the author surmised that it could be:
"Very likely an issue with HTTP keepalive (persistent connections)."
However, this is not our case because:
Keep-alive keeps the socket open between requests. Our failure happened in the middle of a request, so keep-alive was not in play at that time. Conceptually, though, it is much the same thing: if the client (e.g., a browser) decides that the response isn't coming, it closes the socket.

If You Find Out Who's the Peer


Let's assume the client is another application running on Linux. Here are possible debugging steps:
The only thing to check at the system level is whether the machine was up the entire time: you can check its uptime, and look at dmesg for messages about the link going up or down. Otherwise, the application logs may tell you whether the process restarted or crashed, which is the more likely cause.
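
As a rough sketch of the uptime check, the following compares the peer machine's boot time with the time of the exception. The class UptimeCheck and its methods are hypothetical helpers. It assumes a Linux host where the first field of /proc/uptime is the uptime in seconds, and it uses the epoch-millisecond timestamp from the log record above (1524732157381) as the event time.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper; run it on the suspected peer machine.
class UptimeCheck {

    /** Parses the first field of a /proc/uptime line, e.g. "12345.67 54321.00". */
    static Duration parseUptime(String procUptimeLine) {
        String seconds = procUptimeLine.trim().split("\\s+")[0];
        return Duration.ofMillis(Math.round(Double.parseDouble(seconds) * 1000));
    }

    /** True if the machine booted no later than the event, i.e. it has not rebooted since. */
    static boolean wasUpAt(Instant now, Duration uptime, Instant event) {
        Instant bootTime = now.minus(uptime);
        return !bootTime.isAfter(event);
    }

    public static void main(String[] args) throws IOException {
        // Assumption: Linux, where /proc/uptime exists.
        String line = new String(Files.readAllBytes(Paths.get("/proc/uptime")), StandardCharsets.UTF_8);
        Instant event = Instant.ofEpochMilli(1524732157381L); // timestamp field from the log record
        System.out.println("Up since before the exception: "
                + wasUpAt(Instant.now(), parseUptime(line), event));
    }
}
```

A result of false means the machine rebooted after the exception was logged. A result of true only rules out a reboot; a restarted or crashed process still has to be checked in the application logs.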

Could tcpdump help in this case? Probably not: a capture will likely contain too much data unless you know how to filter for what you are looking for.


References

  1. Possible Causes for "Connection reset by peer" when using NIO
