
Bug#754294: Regression: While routing Kernel chokes on spurious "too big" IP packets



Package: linux
Version: 3.2.60-1+deb7u1
Severity: important

Dear Maintainer,

The kernel upgrade distributed via Debian Security on Friday 2014-07-04 partially broke routing (in this case with NAT).

*High level description*
Users experience this as "very slow network access" to some servers, and only from some client computers.

*Setup*
A Linux gateway running Debian stable (amd64) and doing routing and NAT was updated from linux-image-3.2.0-4-amd64:amd64 version 3.2.57-3+deb7u2 to version 3.2.60-1+deb7u1.

The gateway is supposed to route and NAT traffic from a private network to the public internet, translating RFC1918 client source addresses to public addresses.

The following tcpdump excerpts were captured in a live environment. For privacy reasons I have replaced the RFC1918 address of a client with "CLIENT", the public IP address of the relevant gateway with "NAT" and the public IP address of a server on the Internet with "SERVER".

*Problem description*
After the update the kernel refuses to forward IP packets that are apparently "too big" for the MTU and answers them with an ICMP "need to frag" error instead:

17:11:00.917355 IP SERVER.80 > NAT.44991: Flags [.], seq 1:2921, ack 98, win 5840, length 2920
17:11:00.917384 IP NAT > SERVER: ICMP NAT unreachable - need to frag (mtu 1500), length 556

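Note the large IP packet: 2920 bytes of TCP payload plus 20 bytes each of IP and TCP header (assuming no IP or TCP options) make 2960 bytes, well above the MTU of 1500. It has the DF bit set. The packet cannot have arrived on the wire in this form, though, as the link is an Ethernet with an MTU of 1500, so this is odd. *1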

*Workaround*
The problem disappears when GRO is deactivated:

ethtool -K eth0 gro off

The kernel then receives only valid packets of up to MTU in size:

17:14:53.288712 IP SERVER.80 > NAT.44996: Flags [.], seq 1:1461, ack 98, win 5840, length 1460
17:14:53.288730 IP SERVER.80 > CLIENT.44996: Flags [.], seq 1:1461, ack 98, win 5840, length 1460
17:14:53.288735 IP SERVER.80 > NAT.44996: Flags [.], seq 1461:2921, ack 98, win 5840, length 1460
17:14:53.288887 IP SERVER.80 > CLIENT.44996: Flags [.], seq 1461:2921, ack 98, win 5840, length 1460

GRO (Generic Receive Offload) is a performance optimization in which the receive path (kernel and NIC driver) merges consecutive packets of the same flow into larger packets to reduce per-packet processing overhead. GRO defaults to on (on this hardware).

*Regression*
The problem did not exist in 3.2.57-3+deb7u2. In that version the Kernel forwards those big packets as many smaller packets of up to MTU size:

16:23:01.394351 IP SERVER.80 > NAT.44943: Flags [.], seq 1:2921, ack 98, win 5840, length 2920
16:23:01.394375 IP SERVER.80 > CLIENT.44943: Flags [.], seq 1:1461, ack 98, win 5840, length 1460
16:23:01.394525 IP SERVER.80 > CLIENT.44943: Flags [.], seq 1461:2921, ack 98, win 5840, length 1460

Note this is not IP fragmentation, as the smaller packets contain one TCP segment each.
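
To illustrate the difference: resegmentation splits the merged TCP payload at segment boundaries, so every resulting packet carries its own TCP header and its own sequence range, whereas IP fragmentation would cut one IP packet into pieces that share a single TCP header. The following C sketch is only an illustration, not kernel code. The names split_into_segments and struct segment are hypothetical, and the segment size of 1460 assumes an MTU of 1500 with 20-byte IP and TCP headers and no options.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One TCP segment the way tcpdump prints it: "seq start:end, length n". */
struct segment {
    uint32_t seq_start; /* relative sequence number of the first byte */
    uint32_t seq_end;   /* relative sequence number after the last byte */
    uint32_t length;    /* TCP payload bytes in this segment */
};

/*
 * Split a merged TCP payload of payload_len bytes into segments of at
 * most mss bytes each. Returns the number of segments written to out,
 * or 0 if out_cap is too small to hold them all.
 * Hypothetical helper for illustration; this is not a kernel function.
 */
size_t split_into_segments(uint32_t seq_start, uint32_t payload_len,
                           uint32_t mss, struct segment *out,
                           size_t out_cap)
{
    size_t n = 0;
    uint32_t off = 0;

    assert(mss > 0);
    assert(out != NULL);

    while (off < payload_len) {
        uint32_t chunk = payload_len - off;

        if (chunk > mss)
            chunk = mss;
        if (n == out_cap)
            return 0; /* caller's array is too small */

        out[n].seq_start = seq_start + off;
        out[n].seq_end = seq_start + off + chunk;
        out[n].length = chunk;
        n++;
        off += chunk;
    }
    return n;
}
```

Called as split_into_segments(1, 2920, 1460, segs, 4), this yields two segments, 1:1461 and 1461:2921, of 1460 bytes each, which matches the two packets forwarded to CLIENT in the capture above.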

*Possible causes*

I suspect the error shows up for end users in this way ("very slow network access" to some servers, and only from some client computers) because whether GRO merges packets depends on the NIC/driver, the timing of the packet flow, and the IP/TCP options in use (which in turn depend on the OS and configuration of both client and server). In addition, the server's retransmit behaviour may cause single packets to be sent, which GRO does not merge and which can therefore be forwarded to clients successfully, although only very slowly.

There are two changes between 3.2.57-3+deb7u2 and 3.2.60-1+deb7u1 that look related, because they were supposed to fix a similar issue with IP packets that arrive fragmented but have the DF bit set:

In the Debian-specific patch set, patches/bugfix/all/netfilter-ipv4-defrag-set-local_df-flag-on-defragmen.patch:
[quote]
From: Florian Westphal <fw@strlen.de>
Date: Fri, 2 May 2014 15:32:16 +0200
Subject: netfilter: ipv4: defrag: set local_df flag on defragmented skb
Origin: https://git.kernel.org/linus/895162b1101b3ea5db08ca6822ae9672717efec0

else we may fail to forward skb even if original fragments do fit
outgoing link mtu:

1. remote sends 2k packets in two 1000 byte frags, DF set
2. we want to forward but only see '2k > mtu and DF set'
3. we then send icmp error saying that outgoing link is 1500

But original sender never sent a packet that would not fit
the outgoing link.

Setting local_df makes outgoing path test size vs.
IPCB(skb)->frag_max_size, so we will still send the correct
error in case the largest original size did not fit
outgoing link mtu.

Reported-by: Maxime Bizon <mbizon@freebox.fr>
Suggested-by: Maxime Bizon <mbizon@freebox.fr>
Fixes: 5f2d04f1f9 (ipv4: fix path MTU discovery with connection tracking)
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/ipv4/netfilter/nf_defrag_ipv4.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/netfilter/nf_defrag_ipv4.c b/net/ipv4/netfilter/nf_defrag_ipv4.c
index 12e13bd..f40f321 100644
--- a/net/ipv4/netfilter/nf_defrag_ipv4.c
+++ b/net/ipv4/netfilter/nf_defrag_ipv4.c
@@ -22,7 +22,6 @@
 #endif
 #include <net/netfilter/nf_conntrack_zones.h>
 
-/* Returns new sk_buff, or NULL */
 static int nf_ct_ipv4_gather_frags(struct sk_buff *skb, u_int32_t user)
 {
 	int err;
@@ -33,8 +32,10 @@ static int nf_ct_ipv4_gather_frags(struct sk_buff *skb, u_int32_t user)
 	err = ip_defrag(skb, user);
 	local_bh_enable();
 
-	if (!err)
+	if (!err) {
 		ip_send_check(ip_hdr(skb));
+		skb->local_df = 1;
+	}
 
 	return err;
 }
[end quote]

and in the vanilla kernel diff:
[quote]
diff -r -u linux-3.2.57/net/ipv4/ip_forward.c linux-3.2.60/net/ipv4/ip_forward.c
--- linux-3.2.57/net/ipv4/ip_forward.c  2014-04-09 03:20:47.000000000 +0200
+++ linux-3.2.60/net/ipv4/ip_forward.c  2014-06-09 14:29:18.000000000 +0200
@@ -42,12 +42,12 @@
 static bool ip_may_fragment(const struct sk_buff *skb)
 {
        return unlikely((ip_hdr(skb)->frag_off & htons(IP_DF)) == 0) ||
-              !skb->local_df;
+               skb->local_df;
 }

 static bool ip_exceeds_mtu(const struct sk_buff *skb, unsigned int mtu)
 {
-       if (skb->len <= mtu || skb->local_df)
+       if (skb->len <= mtu)
                return false;
[end quote]
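
The effect of the quoted ip_forward.c change on a packet like the one in the first capture can be sketched in plain C. This is a simplified model under stated assumptions, not the kernel code and not a confirmed root cause. struct pkt and the function names are hypothetical. The remainder of ip_exceeds_mtu() is cut off in the diff above, so the model simply treats every over-MTU packet that falls through as exceeding the MTU. It is also an assumption that the caller rejects a packet (ICMP "need to frag") when it may not be fragmented and exceeds the MTU, and that a GRO-merged packet arrives with local_df == 0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the sk_buff fields tested in the quoted diff. */
struct pkt {
    unsigned int len; /* total IP packet length in bytes (skb->len) */
    bool df;          /* DF bit set in the IP header */
    bool local_df;    /* skb->local_df */
};

/* 3.2.57: the '-' lines of the quoted diff. */
bool may_fragment_old(const struct pkt *p)
{
    return !p->df || !p->local_df;
}

bool exceeds_mtu_old(const struct pkt *p, unsigned int mtu)
{
    if (p->len <= mtu || p->local_df)
        return false;
    return true; /* ASSUMPTION: rest of the real function is not shown */
}

/* 3.2.60: the '+' lines of the quoted diff. */
bool may_fragment_new(const struct pkt *p)
{
    return !p->df || p->local_df;
}

bool exceeds_mtu_new(const struct pkt *p, unsigned int mtu)
{
    if (p->len <= mtu)
        return false;
    return true; /* ASSUMPTION: rest of the real function is not shown */
}

/* ASSUMPTION: forwarding rejects when !may_fragment && exceeds_mtu. */
bool rejected_old(const struct pkt *p, unsigned int mtu)
{
    assert(p != NULL);
    return !may_fragment_old(p) && exceeds_mtu_old(p, mtu);
}

bool rejected_new(const struct pkt *p, unsigned int mtu)
{
    assert(p != NULL);
    return !may_fragment_new(p) && exceeds_mtu_new(p, mtu);
}
```

In this model a packet with len = 2960, DF set and local_df = 0 is not rejected by the 3.2.57 variant but is rejected by the 3.2.60 variant at an MTU of 1500, which is consistent with the captures above. A defragmented packet with local_df = 1, as set by the netfilter patch, is rejected by neither variant.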

