Java and TCP status CLOSE_WAIT

In a previous post, I talked briefly about the TCP state CLOSE_WAIT. The problem occurs between two nodes (we will refer to the node initiating the connection as the client and the one receiving it as the server) when the server initiates a shutdown and closes the connection. The client, however, has not closed its socket, so the connection is now only half closed: the server's side is shut, and the client's side sits in CLOSE_WAIT until the client closes it.

With that review of the problem, we will now dive into the troubleshooting/debugging process. On my local machine, I was running two services that communicate with each other (the client and server for our example). The server was started and was happily serving requests from the client. Since the client used the HTTP connection pooling that I described in my previous post, it kept long-lived connections to reduce the overhead of creating a new connection on each request. To establish a baseline and see what things look like in the happy scenario, I did the following:

ps aux | grep client.jar

which produces the following output:

USER       PID %CPU %MEM     VSZ    RSS TTY    STAT START   TIME COMMAND
root     28559 34.4  1.0 4702560 348888 ?      Ssl  20:22   1:16 /jre/bin/java -jar /apps/client.jar

This gives me the ID of the running process, 28559 (it is actually a process started in a Docker container that I can see from the host system). Now we will run the following:

cd /proc/28559/net/
ls | grep tcp

This will show us the following:

tcp
tcp6

You will notice that there is a tcp and a tcp6 file. Depending on how the process binds its sockets, the connection information will be in one file or the other: tcp is the file for IPv4 connections/bindings and tcp6 is the file for IPv6 connections/bindings. Next we will look for the single connection that we care about with the following command:

watch -n 0.001 "grep 1F90 tcp6"

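The watch command simply reruns grep 1F90 tcp6 at the interval given by the -n option and redraws the output. As far as I know, the procps version of watch does not refresh faster than every 0.1 seconds, so the 0.001 here is effectively a request for the fastest refresh available. The 1F90 pattern is the hexadecimal form of port 8080, which is how ports are written in these files. This will show the output below: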

sl  local_address                         remote_address                        st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
 5: 0000000000000000FFFF00000C0011AC:9CC0 0000000000000000FFFF0000120011AC:1F90 01 00000000:00000001 00:00000000 00000000     0        0 4648852 1 0000000000000000 20 4 30 10 -1
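The address and port fields in this row are hexadecimal, so they are hard to read at a glance. As a minimal sketch, they can be decoded in plain shell. The function names hex_to_port, port_to_hex and hex_to_ipv4 are my own and not part of any tool, and the address decoding assumes a little-endian host (such as x86-64) and an IPv4-mapped address like the ones in the row above.

```shell
# Decode a hex port as printed in /proc/<pid>/net/tcp and tcp6.
hex_to_port() {
  printf '%d\n' "0x$1"
}

# Encode a decimal port as the 4-digit uppercase hex used in those files,
# which is handy for building the grep pattern.
port_to_hex() {
  printf '%04X\n' "$1"
}

# Decode the IPv4 address held in the last 8 hex digits of an address field.
# Assumption: little-endian host, so the four bytes are printed in reverse.
hex_to_ipv4() {
  # Keep the last four bytes and emit them in reversed order.
  set -- $(printf '%s\n' "$1" | sed 's/.*\(..\)\(..\)\(..\)\(..\)$/\4 \3 \2 \1/')
  printf '%d.%d.%d.%d\n' "0x$1" "0x$2" "0x$3" "0x$4"
}

hex_to_port 1F90                               # prints 8080
hex_to_port 9CC0                               # prints 40128
hex_to_ipv4 0000000000000000FFFF00000C0011AC   # prints 172.17.0.12
hex_to_ipv4 0000000000000000FFFF0000120011AC   # prints 172.17.0.18
```

Read this way, the row describes a connection from 172.17.0.12:40128 (the client) to 172.17.0.18:8080 (the server).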

What we are concerned about here is the 01 in the st column. The values in this column are hexadecimal codes that map to the Linux kernel TCP states. 01 indicates that the connection is TCP_ESTABLISHED (the first enum in the list). If we continue to watch this output and terminate the server.jar process, we will see the connection go from 01 to 08, which is TCP_CLOSE_WAIT.
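For reference, the st codes can be translated with a small lookup. This is only a sketch: the function names tcp_state_name and close_wait_rows are hypothetical, and the values follow the enum in the kernel's include/net/tcp_states.h as I understand it, so check them against your kernel version.

```shell
# Translate the two-digit hex code from the st column into a state name.
tcp_state_name() {
  case "$1" in
    01) echo ESTABLISHED ;;
    02) echo SYN_SENT ;;
    03) echo SYN_RECV ;;
    04) echo FIN_WAIT1 ;;
    05) echo FIN_WAIT2 ;;
    06) echo TIME_WAIT ;;
    07) echo CLOSE ;;
    08) echo CLOSE_WAIT ;;
    09) echo LAST_ACK ;;
    0A) echo LISTEN ;;
    0B) echo CLOSING ;;
    0C) echo NEW_SYN_RECV ;;
    *)  echo "UNKNOWN($1)" ;;
  esac
}

# Print the rows of a tcp or tcp6 file that are in CLOSE_WAIT.
# The header line is skipped; st is the fourth whitespace-separated field.
close_wait_rows() {
  awk 'NR > 1 && $4 == "08"' "$1"
}

tcp_state_name 01   # prints ESTABLISHED
tcp_state_name 08   # prints CLOSE_WAIT
```

With the process from this post, running close_wait_rows /proc/28559/net/tcp6 after the server is terminated would list the sockets the client has not closed yet.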

This simulates what happens in a real-life environment when servers are terminated and send shutdown signals to the applications running on them. The server initiates its shutdown sequence and closes all the sockets it has open, which sends each peer a FIN, in effect a message saying “I’m shutting down and you should too!”. The client's pooled connection, however, stays open on its side in CLOSE_WAIT, and a later request that reuses it gets no response. It was at this point that we learned why we would get so many of the Java NoHttpResponseException errors, and we started seeking out a way to prevent them.
