The file sfdp_types.api already defines the SFDP_API_SESSION_STATE_TIME_WAIT value, but no code currently sets it, which points to either a bug or a missing feature.
Let me describe how this affects the upcoming SFDP performance tests in CSIT. Packet loss there usually happens as an RX miss on VPP, but it can also happen as an RX miss on TRex (which simulates the responder TCP stack). If the lost packet is the last ACK, the responder retransmits its FIN+ACK; if VPP has already deleted the session by then, this FIN+ACK creates a new session in the opposite direction. I am not sure what the TIME_WAIT value is on the TRex simulated initiator TCP stack, but CSIT forcefully aborts all traffic at the end of the trial, so with unfortunate timing the initiator stack never generates the RST packet that would notify VPP that the reverse session is wrong. When the next trial starts in a TPUT test, this reverse-direction session is observed to prevent forwarding in the old forward (new reverse) direction. That causes unavoidable losses even at low loads, and so the performance test fails.
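To make the sequence easier to follow, here is a toy Python model of a direction-aware session table. It is not SFDP code: the class ToySessionTable, its methods and the endpoint strings are all hypothetical, and it only shows how a retransmitted FIN+ACK can flip the recorded direction once the session is deleted immediately on close. What the SFDP services then do with a packet classified as "reverse" is not modelled.

```python
class ToySessionTable:
    """Toy stand-in for a flow table keyed by an unordered endpoint pair.

    Hypothetical model for illustration only; not the SFDP implementation.
    """

    def __init__(self):
        # unordered endpoint pair -> (src, dst) of the packet that created the session
        self.sessions = {}

    def classify(self, src, dst):
        key = frozenset((src, dst))
        if key not in self.sessions:
            # The first packet seen defines the forward direction.
            self.sessions[key] = (src, dst)
            return "forward (new session)"
        return "forward" if self.sessions[key] == (src, dst) else "reverse"

    def delete(self, src, dst):
        self.sessions.pop(frozenset((src, dst)), None)


INITIATOR = "10.0.0.1:1024"  # assumed addresses, purely illustrative
RESPONDER = "20.0.0.1:80"

table = ToySessionTable()

# Trial 1: the initiator's SYN creates the session in the expected direction.
first_syn = table.classify(INITIATOR, RESPONDER)

# The connection closes and the session is deleted right away (no time-wait).
table.delete(INITIATOR, RESPONDER)

# The last ACK was lost, so the responder retransmits its FIN+ACK.
# With no session left, that packet creates one in the opposite direction.
stray_fin_ack = table.classify(RESPONDER, INITIATOR)

# Trial 2: the same flow starts again and its SYN now matches the stale
# session in the reverse direction.
second_syn = table.classify(INITIATOR, RESPONDER)

print(first_syn, "|", stray_fin_ack, "|", second_syn)
```

In this model the second SYN lands on the stale session as a reverse-direction packet, which corresponds to the situation described above where the old forward direction is no longer forwarded.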
I tried to create a Python test (for make test) that replicates the situation, and I am also trying to sketch a fix that adds a "connection tracking time-wait timeout" to SFDP. The fix would live in the sfdp-tcp-check service, be triggered by the sfdp-l4-lifecycle service, and add one more per-tenant configurable timeout value.
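The idea behind such a time-wait timeout can be sketched with another toy Python model. Again, this is not the proposed patch: ToyTimeWaitTable, its state names, the explicit clock argument and the 2-second timeout are assumptions made only for illustration. The point is that a closed session lingers for a while, so a late FIN+ACK matches it instead of creating a reversed session.

```python
class ToyTimeWaitTable:
    """Toy flow table that keeps closed sessions for a time-wait period.

    Hypothetical model for illustration only; not the SFDP implementation.
    """

    def __init__(self, time_wait_timeout):
        self.time_wait_timeout = time_wait_timeout  # seconds, assumed configurable
        self.sessions = {}  # unordered endpoint pair -> session dict

    def _expire(self, now):
        # Drop sessions whose time-wait period has elapsed.
        expired = [k for k, s in self.sessions.items()
                   if s["expires"] is not None and s["expires"] <= now]
        for key in expired:
            del self.sessions[key]

    def classify(self, src, dst, now):
        self._expire(now)
        key = frozenset((src, dst))
        session = self.sessions.get(key)
        if session is None:
            self.sessions[key] = {"forward": (src, dst),
                                  "state": "ACTIVE", "expires": None}
            return "forward (new session)"
        return "forward" if session["forward"] == (src, dst) else "reverse"

    def close(self, src, dst, now):
        # Instead of deleting the session, park it in TIME_WAIT.
        session = self.sessions[frozenset((src, dst))]
        session["state"] = "TIME_WAIT"
        session["expires"] = now + self.time_wait_timeout


INITIATOR = "10.0.0.1:1024"  # assumed addresses, purely illustrative
RESPONDER = "20.0.0.1:80"

table = ToyTimeWaitTable(time_wait_timeout=2.0)

first_syn = table.classify(INITIATOR, RESPONDER, now=0.0)
table.close(INITIATOR, RESPONDER, now=1.0)

# The retransmitted FIN+ACK arrives while the session is still in TIME_WAIT,
# so it matches the existing session instead of creating a reversed one.
late_fin_ack = table.classify(RESPONDER, INITIATOR, now=1.5)

# By the next trial the time-wait period has expired, and the SYN creates
# a fresh session in the expected direction.
second_syn = table.classify(INITIATOR, RESPONDER, now=10.0)

print(first_syn, "|", late_fin_ack, "|", second_syn)
```

The model leaves open what happens if the same flow restarts before the timeout expires; that depends on how the real services treat a SYN that hits a TIME_WAIT session.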