
Re: Network Performance Degrading over random amount of time



On 2014-04-22 10:38, h@xx0r.eu wrote:
On 2014-04-20 23:49, Karl E. Jorgensen wrote:
Hi

On Sun, Apr 20, 2014 at 01:01:53PM +0200, h@xx0r.eu wrote:
Hi List,
maybe you have a clue about the issue I've been having for several months.
My home server is running Debian jessie right now; the network issues
were there with wheezy as well.
After a fresh boot my network behaves as it should, achieving near
gigabit speeds, which is nice. After a random amount of uptime, though, my
throughput degrades to below 100 Mbit speeds (about 3.5 MB/s);
I measured using iperf.
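
For scale, the degraded figure can be converted to megabits per second. This is a rough sketch, assuming the 3.5 MB/s above means decimal megabytes (if iperf reported binary units the result shifts slightly); the function name is made up for illustration.

```python
def mbytes_to_mbits(mbytes_per_s: float) -> float:
    """Convert megabytes per second to megabits per second (1 byte = 8 bits)."""
    return mbytes_per_s * 8


# Degraded throughput reported in the post (assumed to be decimal MB/s).
observed = mbytes_to_mbits(3.5)
print(f"{observed:.0f} Mbit/s, {observed / 1000:.1%} of a 1000 Mbit/s link")
```

So the degraded state is not merely "100 Mbit instead of gigabit"; it sits well below what even a 100 Mbit link would deliver.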

You don't explicitly say... Does a reboot "cure" the problem (temporarily?)

Yep, that's exactly what a reboot does for me. I tend to reboot about
once every 2-3 days because of this issue, not something you would
expect from a Unix OS :D


If so, does a "ifdown eth0"[1] + "ifup eth0" have the same effect? (if
necessary: Unplug and re-plug the cable between "ifdown" and
"ifup"...)  [A full reboot is a bit like a sledge hammer... very
crude]

I have yet to try this; I will report back when I have the performance
problem again and can try it.


Just got the chance to try, and yes, an ifdown eth0 -> cable replug -> ifup eth0 also cures the problem.


From the point-of-view of the switch, this should be almost
indistinguishable from a full reboot...

Anything in the kernel message log? (e.g. output of "dmesg" or
/var/log/kern.log) It would be interesting if the kernel spat out some
messages around the time of the degradation...  E.g. link-level
renegotiation or similar.

Also: Anything interesting in the output of "ifconfig eth0" ?  I'm
particularly interested in the counters for errors, dropped, overruns,
frame/carrier counts: These counters may show interesting changes
around the time of the degradation...


I will note this down for the next performance degradation as well.

Output of dmesg looks a bit suspicious:

[40886.039833] irq 16: nobody cared (try booting with the "irqpoll" option)
[40886.039963] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.13-1-amd64 #1 Debian 3.13.7-1
[40886.039965] Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3806 08/20/2012
[40886.039967] ffff88040bdca4bc ffffffff814a1327 ffff88040bdca400 ffffffff810aa4a8
[40886.039970] ffff88040bdca400 0000000000000010 0000000000000000 ffffffff810aa93a
[40886.039973] 0000000000000000 0000000000000000 0000000000000010 0000000000000000
[40886.039976] Call Trace:
[40886.039977]  <IRQ>  [<ffffffff814a1327>] ? dump_stack+0x41/0x51
[40886.039988]  [<ffffffff810aa4a8>] ? __report_bad_irq+0x28/0xc0
[40886.039991]  [<ffffffff810aa93a>] ? note_interrupt+0x1ba/0x210
[40886.039994]  [<ffffffff810a8471>] ? handle_irq_event_percpu+0xc1/0x1b0
[40886.039997]  [<ffffffff810a8593>] ? handle_irq_event+0x33/0x50
[40886.040000]  [<ffffffff810ab358>] ? handle_fasteoi_irq+0x58/0x100
[40886.040004]  [<ffffffff81014388>] ? handle_irq+0x18/0x30
[40886.040007]  [<ffffffff81013f20>] ? do_IRQ+0x40/0xb0
[40886.040011]  [<ffffffff814a6ead>] ? common_interrupt+0x6d/0x6d
[40886.040012]  <EOI>  [<ffffffff8107eb47>] ? __hrtimer_start_range_ns+0x1b7/0x3e0
[40886.040019]  [<ffffffff81388c9a>] ? cpuidle_enter_state+0x4a/0xc0
[40886.040022]  [<ffffffff81388db9>] ? cpuidle_idle_call+0xa9/0x1d0
[40886.040025]  [<ffffffff8101adb5>] ? arch_cpu_idle+0x5/0x30
[40886.040028]  [<ffffffff810a777e>] ? cpu_startup_entry+0xbe/0x280
[40886.040032]  [<ffffffff8103c484>] ? start_secondary+0x1d4/0x230
[40886.040034] handlers:
[40886.040173] [<ffffffffa010a7a0>] ata_bmdma_interrupt [libata]
[40886.040336] [<ffffffffa00a5450>] e1000_intr [e1000]
[40886.040506] Disabling IRQ #16
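
To line these events up with the moment throughput drops, the kernel log can be scanned for the two telltale messages. This is only a sketch: the regular expressions are modelled on the exact wording in the log above, and the function name is made up for illustration; other kernel versions may word these messages differently.

```python
import re

# Patterns modelled on the two messages seen in the log above (an assumption
# about their format; adjust if your kernel prints them differently).
NOBODY_CARED = re.compile(r"\[\s*(\d+\.\d+)\]\s+irq (\d+): nobody cared")
DISABLING = re.compile(r"\[\s*(\d+\.\d+)\]\s+Disabling IRQ #(\d+)")


def find_irq_events(log_text):
    """Return (seconds_since_boot, irq, kind) tuples for spurious-IRQ events."""
    events = []
    for line in log_text.splitlines():
        for kind, pattern in (("nobody cared", NOBODY_CARED),
                              ("disabled", DISABLING)):
            match = pattern.search(line)
            if match:
                events.append((float(match.group(1)), int(match.group(2)), kind))
    return events


sample = """[40886.039833] irq 16: nobody cared (try booting with the "irqpoll" option)
[40886.040506] Disabling IRQ #16"""

for seconds, irq, kind in find_irq_events(sample):
    print(f"{seconds / 3600:.1f} h after boot: IRQ {irq} {kind}")
```

Feeding it the saved output of dmesg (or /var/log/kern.log) instead of the sample string would show how long after boot the IRQ was disabled, which can then be compared with when the slowdown was first noticed.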


IRQ 16 is used by eth0 (shared with pata_via) according to /proc/interrupts:

16: 3164992 3462922 0 0 IO-APIC-fasteoi pata_via, eth0
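
The line above shows two devices on the same interrupt, which matches the two handlers (ata_bmdma_interrupt and e1000_intr) listed in the dmesg output. Here is a small sketch for picking a /proc/interrupts line apart, assuming the column layout shown here (per-CPU counts, then one controller column, then a comma-separated device list); newer kernels may add columns, and the function name is invented for illustration.

```python
def parse_interrupt_line(line):
    """Split one /proc/interrupts line into IRQ, per-CPU counts and devices.

    Assumes the layout seen in the thread: "<irq>: <count>... <controller> <dev>, <dev>".
    """
    irq, _, rest = line.partition(":")
    tokens = rest.split()
    counts = []
    # Leading numeric tokens are the per-CPU interrupt counts.
    while tokens and tokens[0].isdigit():
        counts.append(int(tokens.pop(0)))
    controller = tokens.pop(0) if tokens else ""
    # Whatever remains is the comma-separated list of devices on this IRQ.
    devices = [d.strip() for d in " ".join(tokens).split(",") if d.strip()]
    return {"irq": irq.strip(), "counts": counts,
            "controller": controller, "devices": devices}


info = parse_interrupt_line(
    "16: 3164992 3462922 0 0 IO-APIC-fasteoi pata_via, eth0")
if len(info["devices"]) > 1:
    print(f"IRQ {info['irq']} is shared by: {', '.join(info['devices'])}")
```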


The output of ifconfig looks unremarkable, with a few dropped packets but nothing major:

eth0      Link encap:Ethernet  HWaddr 00:0e:0c:b9:5e:1d
          inet addr:192.168.1.20  Bcast:192.168.1.255  Mask:255.255.255.0
          inet6 addr: fda3:32bd:abab:0:20e:cff:feb9:5e1d/64 Scope:Global
          inet6 addr: fe80::20e:cff:feb9:5e1d/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:11881446 errors:0 dropped:882 overruns:0 frame:0
          TX packets:29392900 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:7149517599 (6.6 GiB)  TX bytes:69090488843 (64.3 GiB)
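
A single snapshot like this says little; what matters is which counters move around the time of the degradation. As a sketch, two snapshots can be taken and compared. It assumes the per-interface counters that Linux exposes as one file each under /sys/class/net/<interface>/statistics/, and the function names are made up for illustration.

```python
from pathlib import Path


def read_counters(stats_dir):
    """Read every counter file in a statistics directory into a dict."""
    counters = {}
    for path in Path(stats_dir).iterdir():
        if path.is_file():
            counters[path.name] = int(path.read_text().strip())
    return counters


def changed_counters(before, after):
    """Return {name: delta} for counters that differ between two snapshots."""
    return {name: after[name] - before[name]
            for name in before
            if name in after and after[name] != before[name]}
```

Taking one snapshot of /sys/class/net/eth0/statistics while the link is fast and another once it has slowed down, then passing both to changed_counters, would show whether errors, drops or overruns grew in between (eth0 is assumed here, as in the rest of the thread).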


Current Hardware:

- Asus P8H67-M PRO
- Intel(R) Core(TM) i3-2100 CPU @ 3.10GHz
- 16 GB DDR3 RAM (2x 8 GB Kingston)
- Intel Corporation 82541PI Gigabit Ethernet Controller
- TP-Link 8-port gigabit switches (two of them between home server and clients)

*two* switches between server and clients?  Sounds a bit unusual - at
least for a home set-up...

Well, I'm a console collector and my Linux server sits right behind my
home entertainment area in the living room. That's where the first
switch is located, hooking all the entertainment gear up to the LAN.
From there, one uplink cable runs to the other end of the room where
the second switch is located, connecting my desktop PC, laptop, printer,
Wi-Fi AP and internet gateway to the LAN as well.


I've tried different things so far:

- Switched from a switched cabling setup to Crosslink.

Hm... AFAIK modern network cards tend to adjust themselves to both
"normal" and cross-over cables (which I believe that "crosslink"
means).


Yep, I wasn't clear enough here, I guess. I connected the clients (one at
a time) directly to the server using a normal off-the-shelf patch
cable; I just call it crosslinking because there is no switch in between.

- Swapped out the cheap ASRock motherboard for an Asus one
- Changed from the onboard Realtek network chip to a PCI Intel gigabit card

Hm.. That would likely rule out any network card issues.

- Reinstalled OS several times

.. which would most likely rule out any OS bugs. But not administrator
configuration mistakes...

Yep, I guess (wheezy and jessie).


- Testing from different clients (Win 7, Linux Mint, Debian, Ubuntu)

... which would then most likely rule out administrator mistakes: Win7
is sufficiently different from anything else to make it difficult to
make the same mistake across platforms.


And hardware issues on my clients as well, since they all have
network hardware with different chipsets.

- Downloading vendor drivers and using them instead of the in-kernel
ones

Nothing so far has worked to keep my gigabit speeds stable over a few days.

When you measure the speed, between which two points do you measure
the speed?

client -> tp-link -> tp-link -> server

or, with a direct connection bypassing the switches:

client -> server


I'm concerned about the TWO TP-Link switches: The diagnostics you have
done so far do not appear to rule them out.... Does your traffic
have to pass through both of them?  If so, how are the switches
connected?


IMHO I ruled them out by directly connecting my client(s), one at a
time, to the server using a patch cable, bypassing the switches
in question. The reported degradation in network speed happens there
as well.

Based on what you have written, my main suspects would be the two
switches - with a focus on the "nearest" switch...

I'm open to ANY suggestions here, even if they involve building a
custom kernel or other magical hackery ;D

Well - it looks like you have put a fair amount of effort into solving
this.... But until the problem is narrowed down, this would probably
be as likely to resolve the problem as a goat sacrifice ... You
haven't got a spare goat[2], have you? :-)

Mhm, I have a few goats (Long Goat, Feather Goat etc...).
Dunno if they count as spares?


Hope this helps

Yep, definitely: a few more ideas, and the things I should gather to help
debugging are now pinpointed :)


[1] I'm assuming eth0 here....
[2] A live one would constitute a "hot spare", right?  Yeah. Tangent.

--
Karl E. Jorgensen


Lukas Wingerberg

Lukas Wingerberg ²

