
Re: Ethernet bonding mode 5 only using one Slave adapter.



On 10/9/2013 5:51 AM, Muhammad Yousuf Khan wrote:
> [cut]...........
> 
> 
>> What workload do you have that requires 400 MB/s of parallel stream TCP
>> throughput at the server?  NFS, FTP, iSCSI?  If this is a business
>> requirement and you actually need this much bandwidth to/from one
>> server, you will achieve far better results putting a 10GbE card in the
>> server and a 10GbE uplink module in your switch.  Yes, this costs more
>> money, but the benefit is that all client hosts get full GbE bandwidth
>> to/from the server, all the time, in both directions.  You'll never
>> achieve that with the Linux bonding driver.
> 
> I appreciate your detailed email. It clears up a lot of the confusion
> in my mind.
> The reason for increasing bandwidth is to test clustering / VM hosting
> on NFS, and VM backups. My company is about to host their product at
> our foreign office premises and I will be maintaining those servers
> remotely, therefore I need to consider high availability of our
> service, and that's why I am testing different technologies that can
> fulfill our requirement.
> 
> Specifically, I am testing Ceph clustering for hosting purposes, and
> for backing up my VMs. As you know, VMs are huge and moving them around
> on a 1 Gb crossover point-to-point link takes time, so I thought I
> could increase the bandwidth and use link aggregation to avoid a single
> point of failure.
> 
> I agree with you on buying 10GbE NICs, but unfortunately, as I am
> testing this stuff very far away from the US, these cards are not
> easily available in my country and are thus unnecessarily expensive.

Are dual and quad port Intel NICs available in your country?

> If you still have any advice for a scenario such as mine, I will be
> glad to have it.

Before a person makes a first attempt at using the Linux bonding driver,
s/he typically thinks that it will magically turn 2/4 links of Ethernet
into one link that is 2/4x as fast.  This is simply not the case, and is
physically impossible.  The 802.3xx specifications neither enable nor
allow this, and TCP is not designed for it.  All of the bonding modes
are designed first for fault tolerance, and second for increasing
aggregate throughput, but the latter only from one host with bonded
interfaces to many hosts with single interfaces.

There is only one Linux bonding driver mode that can reliably yield
more than 1 link's worth of send/receive throughput between two hosts,
and that is balance-rr.  To get it working without a lot of headaches
requires a specific switch topology, and its throughput will not scale
linearly with the number of links.  The reason is that you're splitting
a single TCP session into 2 or 4 streams of Ethernet frames, each
carrying part of that one TCP stream.  This can break many of the TCP
stack optimizations such as window scaling, etc.  You may also get out
of order packets, depending on the NICs used and how much buffering
they do before generating an interrupt.  Reordering of packets at the
receiver decreases throughput.  Thus each link will have less throughput
than when running in standalone mode.  Most of the above information is
covered in the kernel bonding documentation.
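
The kernel bonding documentation mentions one knob for the reordering
issue: net.ipv4.tcp_reordering, which controls how much reordering TCP
tolerates before it assumes packets were lost.  A minimal example, with
a purely illustrative value (the kernel default is 3; tune it for your
own workload):

  # let TCP tolerate more out-of-order segments before treating
  # them as loss (default is 3)
  sysctl -w net.ipv4.tcp_reordering=127

  # to make it persistent, put the same setting in /etc/sysctl.conf:
  #   net.ipv4.tcp_reordering = 127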

The primary driving force you mentioned behind needing more bandwidth is
backing up VM images.  If that is the case, increase the bandwidth only
where it is needed.  Put a 4 port Intel NIC in the NFS server and a 4
port Intel NIC in the backup server.  Use 4 crossover cables.  Configure
balance-rr and tweak bonding and TCP stack settings as necessary.  Use a
different IP subnet for this bonded link and modify the routing table
as required.  If you use the same subnet as the regular traffic you must
configure source based routing on these two hosts, and this is a big
PITA.  Once you get this all set up correctly, it should yield somewhere
between 1-3.5 Gb/s of throughput for a single TCP stream and/or multiple
TCP streams between the NFS and backup servers.  No virtual machine host
should require more than 1 Gb/s throughput to the NFS server, so this is
the most cost effective way to increase backup throughput and decrease
backup time.
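
As a rough sketch of the Debian side of that, with the ifenslave
package installed, the bond on each of the two hosts would look
something like this in /etc/network/interfaces (interface names,
addresses and the subnet are only examples):

  auto bond0
  iface bond0 inet static
      address 10.99.0.1        # use 10.99.0.2 on the backup server
      netmask 255.255.255.0
      bond-slaves eth2 eth3 eth4 eth5
      bond-mode balance-rr
      bond-miimon 100

Because the bond lives in its own subnet the connected route is added
automatically; you just have to make sure the backup traffic is
addressed to the 10.99.0.x addresses rather than to the hosts' regular
addresses.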

WRT Ceph, AIUI, this object based storage engine does provide a POSIX
filesystem interface.  How complete the POSIX implementation is I do not
know.  I get the impression it's not entirely complete.  That said, Ceph
is supposed to "dynamically distribute data" across the storage nodes.
This is extremely vague.  If it actually spreads the blocks of a file
across many nodes, or stores a complete copy of each file on every node,
then theoretically it should provide more than 1 link of throughput to a
client possessing properly bonded interfaces, as the file read is sent
over many distinct TCP streams from multiple host interfaces.  So if you
store your VM images on a Ceph filesystem you will need a bonded
interface on the backup server using mode balance-alb.  With balance-alb
properly configured and working on the backup server, you will need at
minimum 4 Ceph storage nodes in order to approach 400 MB/s file
throughput to the backup application.
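
Purely as an illustration (names and addresses are again made up), the
backup server's bond would differ from the balance-rr sketch above only
in the mode:

  auto bond0
  iface bond0 inet static
      address 192.168.50.10    # address on the Ceph-facing network
      netmask 255.255.255.0
      bond-slaves eth1 eth2 eth3 eth4
      bond-mode balance-alb
      bond-miimon 100

After bringing it up, "cat /proc/net/bonding/bond0" shows whether all
four slaves are active and which mode is in effect.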

Personally I do not like non-deterministic throughput in a storage
application, and all distributed filesystems exhibit non-deterministic
throughput, especially so with balance-alb bonding on the backup server.

Thus, you may want to consider another approach:  build an NFS
active/stand-by heartbeat cluster using two identical server boxes and
disks, active/active DRBD mirroring, and GFS2 as the cluster filesystem
atop the DRBD device.  In this architecture you would install a quad
port Intel NIC in each NFS server and one in the backup server, and
connect all 12 ports to a dedicated switch.  Configure balance-rr
bonding on each of the 3 machines, again using a separate IP network
from the "user" network, and again configuring the routing table
accordingly.

In this scenario, assuming you do not intend to use NFS v4 clustering,
you'd use one server to export NFS shares to the VM cluster nodes.  This
is your 'active' NFS server.  The stand-by NFS server would, during
normal operation, export the shares only to the backup server.
Since both NFS servers have identical disk data, thanks to DRBD and
GFS2, the backup server can suck the files from the stand-by NFS server
at close to 400 MB/s, without impacting production NFS traffic to the VM
hosts.
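
In /etc/exports terms that split might look something like this, with
placeholder paths and networks:

  # active NFS server: VM store exported to the VM cluster nodes
  /srv/vmstore   192.168.10.0/24(rw,sync,no_subtree_check)

  # stand-by NFS server: the same GFS2 mount exported, read-only,
  # to the backup server over the bonded storage network
  /srv/vmstore   10.88.0.3(ro,sync,no_subtree_check)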

If the active server goes down the stand-by server will execute scripts
to take over the role of the active/primary server.  So you have full
redundancy.  These scripts exist and are not something you must create
from scratch.  This clustered NFS configuration w/DRBD and GFS2 is a
standard RHEL configuration.
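
To give an idea of the DRBD piece of that configuration, a dual-primary
resource looks roughly like the following.  Hostnames, devices and
addresses are placeholders, and exact option names differ between DRBD
8.3 and 8.4, so treat this as a sketch rather than a drop-in config:

  resource r0 {
      protocol C;              # synchronous replication, needed for GFS2
      net {
          allow-two-primaries; # both nodes primary so GFS2 mounts on both
      }
      startup {
          become-primary-on both;
      }
      on nfs-a {
          device    /dev/drbd0;
          disk      /dev/sdb1;
          address   10.88.0.1:7788;   # replicate over the bonded network
          meta-disk internal;
      }
      on nfs-b {
          device    /dev/drbd0;
          disk      /dev/sdb1;
          address   10.88.0.2:7788;
          meta-disk internal;
      }
  }

GFS2 itself additionally needs a cluster manager and fencing configured
on top of this, which is beyond a quick sketch.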

With Ceph, or Gluster, or any distributed storage, backup will always
impact production throughput.  Not from a network standpoint, as you
could add a dedicated network segment for the Ceph storage nodes to
mitigate that.  The problem is disk IOPS.  With Ceph your production VMs
will be hitting the same disks the backup server is hitting.


So after all of that, the takeaway here is that bonding is not a general
purpose solution, but a very application-specific one.  It has a very
limited, narrow use case.  You must precisely match the number of ports
and bonding mode to the target application/architecture.  Linux bonding
will NOT allow one to arbitrarily increase application bandwidth on all
hosts in a subnet simply by slapping in extra ports and turning on a
bonding mode.  This should be clear to anyone who opens the kernel
bonding driver how-to document I linked.  It's 42 pages long.  If
bonding were general purpose, easy to configure, and provided anywhere
close to the linear speedup lay people assume, then this doc would be
2-3 pages, not 42.

-- 
Stan



