
OpenMPI ORTE/HNP TCP Error & bcmgenet_xmit() tx ring full Error



Hi

I have generated this problem myself by tweaking the MTU of my 8-node Raspberry Pi 4 cluster to 9000 bytes, but I would be grateful for any ideas/suggestions on how to relate the Open MPI ORTE message to my tweaking.

When I run HPL Linpack on my “improved” cluster, it runs quite happily for 2 hours with P=1 & Q=32 using 80% of memory, and this gives me a 7% performance increase, to 97 Gflops. I can also iperf 1 GB of data between nodes at an improved bandwidth of 980 Mb/s. So the MTU tweak appears to be relatively robust.
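
For reference, the whole run is driven from a single HPL.dat containing two process grids; the relevant lines look something like this (a sketch of just the process-grid section, with N, NB and the rest of my parameters omitted):

2            # of process grids (P x Q)
1 2          Ps
32 16        Qs

HPL works through the grid combinations in order, so the P=1/Q=32 pass completes before the P=2/Q=16 pass begins.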

However, as soon as the run moves on to the second grid, P=2 & Q=16, I get the following message...

--------------------------------------------------------------------------
ORTE has lost communication with a remote daemon.

 HNP daemon   : [[19859,0],0] on node node1
 Remote daemon: [[19859,0],5] on node node6

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------



…after which the affected node becomes uncontactable and /var/log/kern.log is flooded with…

bcmgenet fd580000.ethernet eth0: bcmgenet_xmit: tx ring 1 full when queue 2 awake


The bcmgenet.c driver code that generates this message is...

/* Not enough free buffer descriptors (BDs) for this skb: it needs one
 * descriptor for its linear part plus one per page fragment. */
if (ring->free_bds <= (nr_frags + 1)) {
        /* Stop the queue; the netdev_err() below is the message that
         * floods kern.log whenever the queue is still marked awake. */
        if (!netif_tx_queue_stopped(txq)) {
                netif_tx_stop_queue(txq);
                netdev_err(dev,
                           "%s: tx ring %d full when queue %d awake\n",
                           __func__, index, ring->queue);
        }
        /* Tell the network stack to retry this packet later. */
        ret = NETDEV_TX_BUSY;
        goto out;
}


I’m thinking the Open MPI message sizes with P=2 & Q=16 don’t play well with my imperfect MTU tweak, and I’m corrupting the TCP stack somehow. But I don’t know how this relates to the code’s free_bds or nr_frags, etc.
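
As far as I can tell from the quoted code, each outgoing skb consumes one buffer descriptor for its linear part plus one per page fragment, and the driver stops the queue when fewer than nr_frags + 1 descriptors remain free. A trivial stand-alone sketch of that condition (my own illustration with made-up example values, not driver code):

#include <stdio.h>

int main(void)
{
        unsigned int free_bds = 4;  /* descriptors still free in the tx ring (example value) */
        unsigned int nr_frags = 5;  /* page fragments in the outgoing skb (example value) */

        /* Mirrors the free_bds check in bcmgenet_xmit() above. */
        if (free_bds <= nr_frags + 1)
                printf("tx ring full: need %u descriptors, only %u free\n",
                       nr_frags + 1, free_bds);
        else
                printf("ok to queue\n");
        return 0;
}

If I read it correctly, the message means the ring ran out of descriptors while the queue was still marked awake, which is a flow-control symptom rather than direct evidence of corruption.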


My tweak consisted of the following kernel changes (a quick sanity check of the arithmetic follows the list):

1.) include/linux/if_vlan.h

#define VLAN_ETH_DATA_LEN 9000
#define VLAN_ETH_FRAME_LEN 9018

2.) include/uapi/linux/if_ether.h

#define ETH_DATA_LEN 9000
#define ETH_FRAME_LEN 9014

3.) drivers/net/ethernet/broadcom/genet/bcmgenet.c

#define RX_BUF_LENGTH 10240

The Raspberry Pi 4 Ethernet driver does not expose many knobs to turn: most ethtool options are not available, and there is no publicly available NIC documentation, so my tweaks are educated guesswork based on Raspberry Pi forum threads.

Any ideas/suggestions would be much appreciated. With P=2 & Q=16 prior to my tweak I can achieve 100 Gflops, so a potential increase to 107 Gflops is not to be sniffed at.

Kind regards

