TCP Stream Stopping

Discussion in 'Computing and Networks' started by eeabe, Oct 26, 2015.

  1. eeabe

    Thread Starter Member

    Nov 30, 2013
    59
    9
    I'm using a Wiznet W5300 as a TCP server that waits for a client connection and then pushes data without receiving any. It runs for many hours and then a socket stops. The W5300 never sets the "SENDOK" flag, an internal flag indicating that the previous send was accepted (it may or may not have gone out physically) and that I can send more data. I'm not sure if there's a bug in the chip, but I'd like to understand at the network level what, if anything, could cause a TCP connection to stall.

    The W5300 reports the TCP connection as still established, and the client reports the connection as good, but doesn't receive any data anymore. The data is moving at maximum throughput, so I would expect buffers to fill up on a regular basis.

    If I understand correctly, every ACK includes a window size telling the other side how much more data the sender of that ACK can receive. Is it possible that the client buffer fills up, it sends an ACK with a window size of 0, then it reads some data and sends an ACK with a non-zero window size, but that second ACK gets lost? If that happened, what would ever cause a recovery? Might a keepalive (which is an option in the W5300) force an updated ACK with the current window size?
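Standard TCP has a mechanism for exactly this lost-window-update case: the persist timer. While the peer's advertised window is zero, the sender periodically sends a small window probe, and the ACK for each probe carries the receiver's current window. Whether the W5300's hardware stack implements this is not something the thread confirms. The Python sketch below only simulates the idea; the function name, the 1-second first interval, the doubling backoff, and the 60-second cap are illustrative assumptions, not values from any datasheet.

```python
def time_to_recover(receiver_window, lost_probe_acks=0,
                    first_interval=1.0, max_interval=60.0, max_probes=20):
    """Simulate a sender stuck on a zero window after the window update was lost.

    Returns (elapsed_seconds, window_the_sender_now_believes).
    All timing values are hypothetical and chosen only for illustration.
    """
    sender_view = 0            # sender still believes the window is closed
    elapsed = 0.0
    interval = first_interval
    for probe in range(1, max_probes + 1):
        elapsed += interval    # persist timer expires, a window probe is sent
        if probe > lost_probe_acks:
            # The ACK for this probe gets through and carries the current window.
            sender_view = receiver_window
            if sender_view > 0:
                break
        interval = min(interval * 2, max_interval)  # back off between probes
    return elapsed, sender_view


print(time_to_recover(8192))                     # first probe's ACK arrives
print(time_to_recover(8192, lost_probe_acks=2))  # first two probe ACKs are lost
```

On a typical stack a keepalive probe also elicits an ACK that carries the current window, so enabling keepalive may serve a similar purpose if the chip does not probe a zero window on its own.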

    Are there any other reasons a TCP connection might get held up? The W5300 seems to have a strange status in a reserved byte, and I'm inquiring with them if it might have some meaning I can use, but other than that, it seems like it just gets held up waiting for the previous send to go through.

    I've played a little with Wireshark, but I don't think it's a great way to debug this, since the failure happens seemingly at random after many hours and the amount of data collected would be many gigabytes.
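One way to keep a long capture manageable is a ring buffer: dumpcap (the capture tool shipped with Wireshark) can rotate through a fixed number of files and truncate each packet to its headers, so only the most recent traffic is kept. The Python sketch below just builds such a command line. The interface name, IP address, file sizes, and the helper name are placeholders to adapt, and the 96-byte snap length assumes that Ethernet, IP, and TCP headers are all that is needed to inspect sequence and window values.

```python
import shlex


def ring_capture_command(interface, peer_ip, out_path,
                         file_kb=100_000, file_count=10, snaplen=96):
    """Build a dumpcap command that keeps only the newest capture files.

    -b filesize:N  start a new file after roughly N kB
    -b files:N     keep at most N files, overwriting the oldest
    -s N           store only the first N bytes of each packet
    -f FILTER      capture filter limiting traffic to one peer
    """
    return [
        "dumpcap",
        "-i", interface,
        "-f", f"host {peer_ip} and tcp",
        "-s", str(snaplen),
        "-b", f"filesize:{file_kb}",
        "-b", f"files:{file_count}",
        "-w", out_path,
    ]


# Placeholder values; substitute the real interface and the unit's address.
cmd = ring_capture_command("eth0", "192.168.1.50", "w5300.pcapng")
print(shlex.join(cmd))
```

Stopping the capture shortly after a unit freezes should leave the interesting packets in the last file or two.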

    Thanks for any ideas.
     
    Last edited: Oct 28, 2015
  2. eeabe

    Update: I was able to duplicate the failure with 3 units running overnight. For the "frozen" units, I used Wireshark to analyze any traffic and found the following:

    The W5300 is retransmitting the same segment over and over with a sequence number near the 32-bit rollover, and the ACK that comes back has a non-matching acknowledgment number. For example, the retransmission is 1460 bytes with sequence 0xFFFFFA70, and the ACK comes back with 0x000000DC, which is the original sequence offset by 1644 instead of 1460.

    The other two units failed in similar fashion:
    - one retransmitting 1460 bytes with sequence 0xFFFFFAD8, and the ACK coming back as 0x000001C4 (off by 1772 instead of 1460)
    - one retransmitting 1460 bytes with sequence 0x0000088A, and the ACK coming back as 0x00000FF6 (off by 1900 instead of 1460)
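The offsets quoted above can be checked with modulo-2^32 arithmetic, which is how TCP sequence numbers behave across the rollover. This short Python check uses only the values reported from the three captures; the helper name is made up for the example.

```python
MOD = 1 << 32  # TCP sequence numbers wrap at 2**32


def seq_delta(seq, ack):
    """Number of bytes from seq forward to ack, modulo 2**32."""
    return (ack - seq) % MOD


SEGMENT_LEN = 1460
captures = [
    (0xFFFFFA70, 0x000000DC),
    (0xFFFFFAD8, 0x000001C4),
    (0x0000088A, 0x00000FF6),
]

for seq, ack in captures:
    delta = seq_delta(seq, ack)
    print(f"seq=0x{seq:08X} ack=0x{ack:08X} "
          f"delta={delta} beyond_segment={delta - SEGMENT_LEN}")
```

This gives deltas of 1644, 1772, and 1900 bytes. In every case the acknowledgment number lies past the end of the 1460-byte segment being retransmitted, so the PC appears to be acknowledging data that the W5300 is still trying to resend.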

    I looked at some of the normal data stream before the freeze up, and it looks like the packets were most often 1460 bytes, with some occasional smaller packets of sizes: 340, 416, 1028, 1112, 1368, 1452. It also seems common for the W5300 to send multiple packets at high speed and only get a single ACK.

    I was wondering if it's possible that the W5300 is sending multiple packets around the time the sequence rolls over, and the PC responds with a single ACK that is not being processed correctly. I've requested a response from Wiznet, but haven't heard yet. Any thoughts or workaround ideas?
     
  3. eeabe

    Update 2:

    I was able to capture one of the failures with Wireshark. It looks like the problem starts when the PC sends an ACK whose acknowledgment number covers only part of a packet. Does anyone know if that is acceptable per the TCP standard?

    I think there may be a problem with LabVIEW or Windows 7 sharing buffer space, because I see the window size between ACKs shrink by more than the amount of data that comes through that socket. I'm wondering if the window buffers are shared and data on another socket is filling up the space that was supposed to be available.

    I find it strange that this only occurs when the sequence number is near the 32-bit wraparound value. It seems to be correlated somehow but I don't know for sure. There may or may not be other occurrences of ACKs that only accept partial packets, but I haven't seen one in browsing some of the data so far.
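One class of bug that would show up only near the rollover is comparing sequence numbers as plain unsigned integers instead of using modulo-2^32 ("serial number") comparison. How the W5300 actually compares them is unknown, so the Python sketch below is purely a hypothetical illustration; `snd_una` is the usual TCP name for the oldest unacknowledged sequence number, and both function names are invented for the example.

```python
MOD = 1 << 32
HALF = 1 << 31


def naive_ack_is_new(snd_una, ack):
    """Plain integer comparison: breaks when ack has wrapped past zero."""
    return ack > snd_una


def wrap_safe_ack_is_new(snd_una, ack):
    """True if ack is ahead of snd_una by less than half the sequence space."""
    return ack != snd_una and ((ack - snd_una) % MOD) < HALF


# Values from the first frozen unit: the ACK has wrapped, the segment has not.
snd_una, ack = 0xFFFFFA70, 0x000000DC
print(naive_ack_is_new(snd_una, ack))      # False: the ACK looks "old"
print(wrap_safe_ack_is_new(snd_una, ack))  # True: the ACK is 1644 bytes ahead
```

A naive comparison alone would not explain the unit that froze at sequence 0x0000088A, where both numbers are already past the wrap, so this is at most one candidate cause.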
     
  4. eetech00

    Active Member

    Jun 8, 2013
    650
    112
     
  5. eeabe

    The previous "normal" ack had ack number of 0x00000A6E, and a window size of 2208.

    Then, the W5300 unit sends a packet of 1460 bytes.

    Then, the PC replies with an ack that has ack number of 0x00000C96, and a very large window size of 243072. If it had accepted the entire packet, the ack number should have been 0x00001022. Are you saying that this is incorrect behavior per the TCP standard? That would seem reasonable since you shouldn't advertise a window size and then not be able to take that much data.
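For reference, TCP acknowledges bytes rather than packets, so a cumulative ACK that lands in the middle of a segment is permitted by the protocol, and a sender is expected to keep only the unacknowledged tail for retransmission. The Python sketch below works through the numbers from this capture; `apply_ack` is a made-up helper showing one way a sender could handle it, not a description of what the W5300 does.

```python
MOD = 1 << 32
HALF = 1 << 31


def apply_ack(seg_seq, seg_len, ack):
    """Apply a cumulative ACK to one in-flight segment.

    Returns (acked_bytes, remaining_seq, remaining_len).
    """
    acked = (ack - seg_seq) % MOD
    if acked >= HALF:          # ACK is older than the segment start: nothing new
        acked = 0
    acked = min(acked, seg_len)
    return acked, (seg_seq + acked) % MOD, seg_len - acked


# Segment of 1460 bytes starting at 0x00000A6E, ACK number 0x00000C96.
acked, rest_seq, rest_len = apply_ack(0x00000A6E, 1460, 0x00000C96)
print(acked, hex(rest_seq), rest_len)
```

With these values the PC acknowledged 552 of the 1460 bytes, leaving 908 bytes starting at 0x00000C96 to be retransmitted. A full acknowledgment would have carried 0x00001022, as stated above.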

    Anyways, at that point, the W5300 sends another data packet and then a FIN and then gets weird, but I'm hoping we can fix the PC/LabVIEW side so this just doesn't happen.

    I think some other process may have used part of the 2208 bytes of buffer space in the meantime, so this socket could only accept part of the 1460 bytes. Then the processes seem to have consumed a bunch of data to make more room.

    We can probably adjust the timing somewhat to try to avoid buffer overruns, but I'm also interested in learning whether LabVIEW might share buffers somehow, and whether we could change that behavior. The system uses multiple units, and multiple sockets per unit, so maybe there is some resource sharing that is biting us.
     