
Re: Beowulf Cluster is very slow. Suggestions needed to increase the speed.



Hello again,

Well.. 10Gbit is overkill for this hardware lab :D
Consider that a good 48-port 10 Gigabit switch costs around 25-30k US $ (that is, thousands). You'll also need 10 Gigabit cards (let's say 500 US $ each). I haven't seen an 8-port 10GbE switch anywhere; if you find one, please let me know! :D

Shouldn't you stay at Gigabit, but with a high-performance switch?
That means high values of:
- forwarding rate (in Mpps)
- switching capacity / backplane bandwidth (in Gbps)
and a low value of:
- switching latency (typically a few microseconds)
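
As a rough illustration of the two throughput figures listed above, the sketch below computes what a non-blocking Gigabit switch would need on paper. It assumes standard Ethernet framing (64-byte minimum frames plus 20 bytes of preamble and inter-frame gap); the 8-port count is only an example and the function names are made up for this sketch.

```python
# Back-of-the-envelope switch requirements, assuming standard Ethernet framing.
PREAMBLE_AND_GAP_BYTES = 20   # 7 preamble + 1 start-of-frame + 12 inter-frame gap
MIN_FRAME_BYTES = 64          # smallest Ethernet frame (worst case for Mpps)


def line_rate_pps(link_bps, frame_bytes=MIN_FRAME_BYTES):
    """Packets per second one port carries at full line rate."""
    bits_on_wire = (frame_bytes + PREAMBLE_AND_GAP_BYTES) * 8
    return link_bps / bits_on_wire


def non_blocking_requirements(ports, link_bps):
    """Return (forwarding rate in Mpps, switching capacity in Gbps)."""
    forwarding_mpps = ports * line_rate_pps(link_bps) / 1e6
    capacity_gbps = ports * link_bps * 2 / 1e9   # x2 because links are full duplex
    return forwarding_mpps, capacity_gbps


mpps, gbps = non_blocking_requirements(ports=8, link_bps=1_000_000_000)
print(f"8-port Gigabit switch: {mpps:.2f} Mpps, {gbps:.0f} Gbps")
```

A switch whose datasheet figures fall well below these numbers cannot keep every port busy at once, which is the situation a cluster under load tends to create.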

A less expensive way is an InfiniBand network. It usually starts at 10 Gbit speeds! There are InfiniBand adapters and switches available on eBay (I own two of them) with a PCI-Express x4 interface. Some, but not all, support TCP/IP networking like ordinary NICs. This interconnect is designed specifically for clusters because of its very low latency and low CPU usage at high speeds. But all cluster nodes have to be close to each other, since the cables ARE expensive :)
They also support more than networking (for example: storage).
But first make sure that the InfiniBand card is well supported under Linux; mine isn't.

Regards,
TooMeeK




On 2014-10-10 05:34, suresh kannan wrote:
Hi,

The link you provided was very helpful. I had not prepared myself
before starting with clusters. A few years back I built a cluster of
three systems using PVM for a different application, namely
protein-ligand docking. With three systems it took about 2 hours to
finish a particular library screening. Since it was only a few hours, I
was happy and never checked the speed of the cluster. I thought I had
learned a clustering technique. So this time I started blindly, without
preparing, on a cluster for a different application that takes months
to complete a single job. I have now learned that the requirements
depend on the purpose of the cluster.

I learned that network speed is the biggest bottleneck for
clustering, especially for our needs. In our lab I found a 1000 Mbps
switch, which saves some time compared to 100 Mbps, but it is still
not efficient; jobs still take a lot of time. I assume we need a
10 Gigabit switch. I was not aware of the price of these "Gigabit
routing switches supporting layer 3". It seemed to me that even an
8-port 10 Gigabit switch costs approx. $800. I am still hesitant to ask
my mentor for a 10 Gigabit switch. Since I don't have any experience, I
wonder whether I am wrong somewhere, given that a Gigabit switch ought
to work. Now I am looking for a 10 Gigabit switch in nearby labs so that
I can connect it and check whether it is efficient enough for our job;
then I can ask my mentor with some confidence.

I also learned from someone that we can use a cluster OS designed
specifically for this purpose, for instance PelicanHPC
http://pareto.uab.es/mcreel/PelicanHPC/. I will install that cluster OS
and check the performance too. I am reading material on how to increase
the speed and performance of the cluster in terms of both hardware and
software. I will tune our application according to our needs.

Also, in our lab room we have dual-boot systems for 8 people
(Windows with either Linux Mint, Ubuntu, or Fedora). At any given time
those systems will be running either Windows or Linux, depending on
our work. For 6 hours daily (at night) and on Sundays our systems are
idle. I am not able to use that free computer time to our advantage, due
to my lack of skill and time. I guess we could use this computer time if
we had a Gigabit switch and networking skills. If someone has done
something similar, please document it in your blog or on a mailing list
so that users like me can benefit.

Although it took time, I learned a lot while building this cluster.
Thank you for your time.


regards
Suresh




On Wed, Oct 8, 2014 at 7:38 AM, Tomcio <toomeek_85@o2.pl> wrote:

    Hello,

    I've been here for years, but there are very few questions, so I'm
    happy to see a new thread here ;)

    I'm not a Beowulf cluster admin but..

    I believe your network is slow because this routing device is
    internally limited to low throughput. You need a Gigabit-capable
    routing device (read this as: a Gigabit routing switch supporting
    layer 3).
    I suspect all your classrooms are in different IP subnets?
    If so, then the routing device is the bottleneck..

    Even a modern Core i5 based Linux router will add routing delay
    on a cluster network, at least under high load at high throughput
    (which clustering requires). So simple, raw networking is the best
    option.

    Example switch models capable of routing are:
    Cisco SG500X
    Cisco Catalyst series
    Dell PowerConnect 7000 series (or lower)
    Of course, they could be too expensive, so just look around for a
    switch that can do IP routing. There are even 8-port Gigabit managed
    switches available on the market.

    If your sys admin isn't blocking other network subnets, you could set
    static IPs on all your cluster nodes in a different subnet
    (let's say.. 192.168.99.1-192.168.99.10 with netmask 255.255.255.0),
    check whether they can see each other (this probably isn't possible),
    and use one of them (the master node?) with 2 Gigabit cards to
    provide Internet access.
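
As a small sanity check for the addressing suggested above, the sketch below uses Python's standard `ipaddress` module to confirm that 192.168.99.1 through 192.168.99.10 all fall inside one subnet with netmask 255.255.255.0. The helper name `same_subnet` is made up for this example; it only validates the plan on paper and does not test whether the nodes can actually reach each other.

```python
import ipaddress

# The subnet and node range suggested in the message above.
network = ipaddress.ip_network("192.168.99.0/255.255.255.0")
nodes = [ipaddress.ip_address(f"192.168.99.{i}") for i in range(1, 11)]


def same_subnet(addresses, net):
    """True if every address belongs to the given network."""
    return all(addr in net for addr in addresses)


print(network.with_prefixlen)        # the same network in CIDR notation
print(same_subnet(nodes, network))   # all ten node addresses are inside it
```

Nodes in the same subnet talk to each other directly through the switch, so their traffic never has to pass through the routing device.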

    USB Gigabit cards mostly top out at ~480 Mbit of throughput, since
    they're USB 2.0. That's half of the needed performance, and beware of
    missing Linux drivers.. ;)
    I see all your hardware is Gigabit, so I believe you need a Layer 3
    Gigabit switch. Also, please check whether your NICs support offload
    functions; they can help under high network load.
    I personally use these in /etc/rc.local:
    echo "Setting offload functions on Intel PRO/1000 NICs..."
    ethtool -K eth1 rx on tx on sg on gso on gro on tso on
    ethtool -K eth2 rx on tx on sg on gso on gro on tso on
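
To check which of these offloads are currently enabled, one option is to parse the output of `ethtool -k <interface>` (lowercase `-k` queries, uppercase `-K` sets). The sketch below is only an illustration: the sample text is an abbreviated, made-up example of what that output looks like, the interface name is whatever your system uses, and the helper names are invented for this example.

```python
import subprocess

# Long feature names corresponding to: rx tx sg gso gro tso
WANTED = {
    "rx-checksumming",
    "tx-checksumming",
    "scatter-gather",
    "generic-segmentation-offload",
    "generic-receive-offload",
    "tcp-segmentation-offload",
}


def parse_features(text):
    """Turn `ethtool -k` output into {feature name: True if on}."""
    features = {}
    for line in text.splitlines():
        name, sep, value = line.strip().partition(":")
        if not sep or not value.strip():
            continue  # skips the "Features for ethX:" header and blank lines
        features[name] = value.split()[0] == "on"
    return features


def disabled_offloads(text):
    """Names from WANTED that are off (or not reported at all)."""
    features = parse_features(text)
    return sorted(name for name in WANTED if not features.get(name, False))


def read_features(iface):
    """Run ethtool on a real interface (requires ethtool to be installed)."""
    result = subprocess.run(["ethtool", "-k", iface],
                            capture_output=True, text=True, check=True)
    return result.stdout


# Illustrative sample only, not captured from the hardware in this thread.
SAMPLE = """Features for eth1:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp-segmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: off [fixed]
"""

print(disabled_offloads(SAMPLE))
```

On a real node, `disabled_offloads(read_features("eth1"))` would list the offloads still switched off; a feature marked `[fixed]` cannot be changed by the driver.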

    And there are tips here:
    http://cs.boisestate.edu/~amit/research/beowulf/beowulf-setup.pdf
    See section 1.2, Networking Hardware.

    Hope this helps.

    Cheers,
    TooMeeK




    On 2014-10-07 05:50, suresh kannan wrote:

        I am an Indian student in Suwon, Korea. With the help of good
        tutorials, I built a Beowulf cluster (system information below)
        from four systems in our lab for our simulation work. Those
        tutorials say that all the systems should have static IP
        addresses. Unfortunately, all our labs are provided with dynamic
        IP addresses [5 IPs for 15 members in three separate labs]. I
        requested four more IPs from our university system admin. Due to
        language problems, I conveyed the requirement through my Korean
        lab mate, and I don't know why the admin denied us the static
        IPs. So I found another way to skip this step:
        http://www.reddit.com/r/linuxquestions/comments/2gubad/why_static_ip_address_is_necessary_for_linux/


        Someone suggested using a router (with one static IP) and setting
        static IPs for the four computers through the router. I did that
        and it worked. However, the cluster is very slow. For instance, if
        I submit my simulation job on a single computer [4-core
        processor], it takes 2 months to complete. If I connect the 4
        systems, however, the estimate is 6 months for the same job. The
        cluster is actually using 10 cores [3,3,2,2 at 100% each]; I used
        the top command to see how much CPU the head node and the other
        nodes are using. I used Open MPI to parallelize across the
        systems. I am using GROMACS (MPI-based parallelization is part of
        this software). I set up a parallel configuration for GROMACS
        with the help of this tutorial:
        http://flakrat.blogspot.kr/2013/04/how-to-compile-gromacs-461-with-openmpi.html
        After reading a few posts, such as
        http://www.reddit.com/r/linuxquestions/comments/2gbgbg/what_would_be_the_best_linux_distro_for_folding/
        I suspected the network router might be the issue.
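
The figures reported above (2 months on one 4-core machine, 6 months on 10 cores across four machines) can be turned into a speedup and a scaling efficiency. This is only a sketch of the arithmetic, assuming the two estimates refer to the same job; the function names are made up for this example.

```python
def relative_speedup(t_base, t_new):
    """How many times faster the new setup is (values below 1 mean slower)."""
    return t_base / t_new


def scaling_efficiency(t_base, cores_base, t_new, cores_new):
    """Fraction of ideal scaling achieved when going from cores_base to cores_new."""
    return (t_base * cores_base) / (t_new * cores_new)


def ideal_time(t_base, cores_base, cores_new):
    """Run time if the job scaled perfectly with core count."""
    return t_base * cores_base / cores_new


# Numbers from the message: months of wall-clock time and cores in use.
single_months, single_cores = 2, 4
cluster_months, cluster_cores = 6, 10

print(f"speedup:    {relative_speedup(single_months, cluster_months):.2f}x")
print(f"efficiency: {scaling_efficiency(single_months, single_cores, cluster_months, cluster_cores):.1%}")
print(f"ideal time: {ideal_time(single_months, single_cores, cluster_cores):.1f} months")
```

A speedup below 1 means the cores spend most of their time waiting, usually on communication, which is consistent with the suspicion that the network path is the limiting factor.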

        Can you suggest how I can troubleshoot this problem? Someone
        suggested using 2 network ports, making a Linux machine the
        router, and using a Gigabit switch to get the speed. However, we
        don't have a system with 2 network ports. If this is required, I
        can buy network adapters (USB ones).

        Where do i start now?

        Can I make my head node a router, use a USB network adapter
        (as the second network port), and connect it to a Gigabit switch
        (any model suggestion?) to connect the other nodes? I don't know
        much about networking. It would be helpful if any experts could
        suggest how to troubleshoot this issue.

        Thank you for your time.

        regards

        Suresh


        System Information

        Head node Processor : Intel core i3
        RAM : 1 GB
        No. of processor : 4
        Network cards : 03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 06)
        System company : Samsung
        Architecture : x86_64
        OS flavour : Linux Mint 17 Qiana

        Node1 Processor : Intel Quad core
        RAM : 3 GB
        No. of processor : 4
        Network cards : 03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 02)
        System company : TG DREAMSYS
        Architecture : x86_64
        OS flavour : Linux Mint 17 Qiana

        Node2 Processor : Intel core i3
        RAM : 1 GB
        No. of processor : 4
        Network cards : 03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 06)
        System company : Samsung
        Architecture : x86_64
        OS flavour : Linux Mint 17 Qiana

        Node3 Processor : Intel core i3
        RAM : 1 GB
        No. of processor : 2
        Network cards : 02:00.0 Ethernet controller: Qualcomm Atheros Attansic L2 Fast Ethernet (rev a0)
        System company : JOOYONTECH
        Architecture : x86_64
        OS flavour : Linux Mint 17 Qiana

        Router Company : ipTIME N604R
        Maximum speed : 160Mbps (LAN to WAN)
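
To put the link speeds mentioned in this thread side by side, the sketch below estimates the best-case time to move a fixed amount of data at each nominal rate. The 1000 MB payload is a hypothetical figure chosen only for illustration, and the calculation ignores protocol overhead and latency, both of which matter a great deal for MPI traffic.

```python
# Nominal link rates taken from the thread, in Mbit/s.
LINK_RATES_MBIT = {
    "Fast Ethernet (100 Mbit/s)": 100,
    "ipTIME N604R, LAN to WAN (160 Mbit/s)": 160,
    "USB 2.0 adapter (~480 Mbit/s)": 480,
    "Gigabit Ethernet (1000 Mbit/s)": 1000,
}

PAYLOAD_MEGABYTES = 1000  # hypothetical amount of data, for illustration only


def transfer_seconds(megabytes, rate_mbit):
    """Best-case transfer time: megabytes -> megabits, divided by the link rate."""
    return megabytes * 8 / rate_mbit


def path_rate(rates_mbit):
    """A path is only as fast as its slowest link."""
    return min(rates_mbit)


for name, rate in LINK_RATES_MBIT.items():
    print(f"{name}: {transfer_seconds(PAYLOAD_MEGABYTES, rate):.1f} s")
```

Because of `path_rate`, a single slow hop between two Gigabit NICs limits the whole exchange to that hop's rate.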



    --
    To UNSUBSCRIBE, email to debian-beowulf-REQUEST@lists.debian.org
    with a subject of "unsubscribe". Trouble? Contact
    listmaster@lists.debian.org
    Archive: https://lists.debian.org/54346B82.6000500@o2.pl



