The Mysterious Dropped Jumbo Frames in ESXi (solved)

I admin a handful of blades in an HP BladeSystem. We have recently transitioned from using HP 6120xg switches for uplinks to a pair of Cisco B22HP Fabric Extenders connected to parent Nexus 93180YC-EX switches. Each blade is running VMware vSphere 6.0 Enterprise Plus, with virtual Distributed Switches (vDS). Most of the blades are HPE BL460c Gen8. We recently purchased a BL460c Gen9 and have run into problems adding it into the cluster.

Hardware note: the Gen9 is outfitted with onboard 536FLB (Broadcom 57840 based) and 534M Mezzanine (Broadcom 57810) Converged Network Adapters (CNAs). My problem was with the 534M, but it’s possible that the 536FLB would have the same issue detailed here.

The problem:

We run jumbo frames (9000-byte MTU) for our iSCSI SAN. The MTU on the parent Nexus 9k and the FEX ports is set to 9216 end-to-end. All blades except the Gen9 can successfully pass 9k frames to the NetApp. The Gen9 server drops pings with an ICMP payload larger than 2344 bytes when the DF bit is set.

Troubleshooting:

Before ESXi was installed on the new blade, Windows Server 2016 was installed for Hyper-V testing. During this test phase, 9k jumbo frames passed successfully, as verified with:

ping -l 8972 -f netapp.example.org
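The 8972 in that command is the largest ICMP payload that fits in a single 9000-byte IP packet: the MTU minus a 20-byte IPv4 header (no options) and an 8-byte ICMP echo header. A minimal Python sketch of that arithmetic follows; the function and constant names are mine, chosen for illustration.

```python
# Header sizes that sit between the IP MTU and the ping payload.
IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo request/reply header


def max_ping_payload(mtu: int) -> int:
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


if __name__ == "__main__":
    print(max_ping_payload(9000))  # jumbo MTU    -> 8972
    print(max_ping_payload(1500))  # standard MTU -> 1472
```

The same payload size applies to the vmkping tests later in this post, since the -s option also sets the ICMP payload size.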

Once ESXi was installed, the blade was added to the vDS and a VMkernel interface (vmk) was created on the iSCSI network. The vmk MTU was raised from the standard 1500 to 9000. The uplink interfaces should report an MTU of 9000 as well (the physical NIC MTU follows the virtual switch setting rather than the vmk setting):

[root@blade:~] esxcli network nic list
Name   PCI Device   Driver Admin Status Link Status Speed Duplex MAC Address       MTU  Description 
------ ------------ ------ ------------ ----------- ----- ------ ----------------- ---- ----------------------------------------------------------------------------
vmnic2 0000:09:00.0 bnx2x  Up           Up          10000 Full   00:00:00:00:00:01 9000 Broadcom Corporation QLogic 57810 10 Gigabit Ethernet Multi Function Adapter
vmnic3 0000:09:00.1 bnx2x  Up           Up          10000 Full   00:00:00:00:00:02 9000 Broadcom Corporation QLogic 57810 10 Gigabit Ethernet Multi Function Adapter

Looks good. Test with vmkping:

[root@blade:~] vmkping -I vmk1 192.168.1.10 -d -s 8972
PING 192.168.1.10 (192.168.1.10): 8972 data bytes

--- 192.168.1.10 ping statistics ---
3 packets transmitted, 0 packets received, 100% packet loss

No-go. A few other payload sizes were tested until the largest one that worked was found:

[root@blade:~] vmkping -I vmk1 192.168.1.10 -d -s 2344
PING 192.168.1.10 (192.168.1.10): 2344 data bytes
2352 bytes from 192.168.1.10: icmp_seq=0 ttl=64 time=0.177 ms
2352 bytes from 192.168.1.10: icmp_seq=1 ttl=64 time=0.163 ms

--- 192.168.1.10 ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss

I’ve omitted the output here, but a ping with 2345 bytes also failed. Given that the FEX Host Interface (HIF) was configured via a port-profile, that the same port-profile works for the other blades, and that 9k frames worked under Windows, where’s the problem?
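Finding the exact cutoff by hand takes a fair number of pings. The search can be sketched as a bisection in Python. This is only an illustration: probe is a hypothetical callable that returns True when a ping of the given payload size gets a reply (on ESXi it might wrap vmkping with -d and -s), and the simulated probe below simply mimics the behaviour seen on the Gen9 blade.

```python
from typing import Callable


def largest_passing_payload(probe: Callable[[int], bool], low: int, high: int) -> int:
    """Return the largest size in [low, high] for which probe(size) is True.

    Assumes probe(low) succeeds and that results are monotonic: every size
    up to some threshold passes and every larger size fails.
    """
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always makes progress
        if probe(mid):
            low = mid        # mid passes, so the answer is mid or larger
        else:
            high = mid - 1   # mid fails, so the answer is smaller
    return low


def simulated_probe(size: int) -> bool:
    """Stand-in for a real ping: passes up to 2344 bytes, as observed above."""
    return size <= 2344


if __name__ == "__main__":
    print(largest_passing_payload(simulated_probe, 64, 8972))  # -> 2344
```

With a range of 64 to 8972 bytes this needs about 14 probes, against potentially thousands when stepping one byte at a time.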

By this time, HPE had sent replacement CNAs (FLB and Mezz card), so the hardware should be good. The server wasn’t in production, so time wasn’t a critical factor. What about different operating systems?

- Linux (CentOS) was installed. Pings of 8972 bytes with the DF bit set were successful.
- ESXi 6.5 was installed. The 8972-byte ping failed.
  - Changed the driver used in ESXi 6.5 from the Broadcom bnx2x driver to the native qfle3 driver in ESXi. 9k packets flowed as expected!
- Back to ESXi 6.0 with the bnx2x driver (the qfle3 driver is not listed as supported in 6.0, and we are running vCenter 6.0, which limits the highest ESXi version we could use on the blade). As expected, the 9k frames failed again.

By now, it’s confirmed without a doubt that the hardware can pass 9k frames. It seems to be an issue with the ‘bnx2x’ driver. More testing to confirm. On the Nexus 9k parent, we can see if anything is captured with ethanalyzer:

ethanalyzer local interface inband display-filter "eth.addr==00:00:00:00:00:01" limit-captured-frames 0

Attempting to ping the SVI on the parent Nexus 9k with 9k frames resulted in no output. Attempting to ping the SVI with standard (64 byte) frames resulted in:

Capturing on inband
2018-06-01 15:51:56.995468 192.168.1.254 -> 192.168.1.20 ICMP Echo (ping) reply
2018-06-01 15:51:57.995994 192.168.1.254 -> 192.168.1.20 ICMP Echo (ping) reply
2018-06-01 15:51:58.998428 192.168.1.254 -> 192.168.1.20 ICMP Echo (ping) reply

So, the 9k jumbo frames aren’t even seen on the Nexus 9k – it looks like they never leave the blade. Frames sent with packet size of 2344 bytes ARE seen in the ethanalyzer output, though. It’s looking more and more like an issue with the ‘bnx2x’ driver in ESXi for the installed Broadcom 57810 CNAs.

The Solution:

In one of many searches trying to find an answer to this problem, I began reading the installation guide for QLogic BR-Series Adapters. These are not at all the same adapters, but a search in the guide for “jumbo” brought me to the following note:

“NOTE: The jumbo frame size set for the network driver cannot be greater than the setting on the attached switch that supports Data Center Bridging (DCB) or the switch cannot accept jumbo frames.”

Could that be the problem?

Rebooted the blade, entered the setup utility, and checked the hardware configuration for each converged NIC. DCB was enabled for each adapter. Changed the setting to disabled, restarted, and tested a ping to the NetApp:

[root@blade:~] vmkping -I vmk1 192.168.1.10 -d -s 8972
PING 192.168.1.10 (192.168.1.10): 8972 data bytes
8980 bytes from 192.168.1.10: icmp_seq=0 ttl=64 time=0.177 ms
8980 bytes from 192.168.1.10: icmp_seq=1 ttl=64 time=0.163 ms

--- 192.168.1.10 ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss

You’ve got to be kidding! I was pretty sure that during the whole testing process, DCB was set to disabled… but apparently this was not the case. Doing a little reading in the Cisco Nexus 9000 Series NX-OS System Management Configuration Guide, I came across this:

“Note: Only front-panel fixed ports are supported with DCBXP. FEX ports are not supported.”

If FEX ports are not supported, and the blade CNAs are connected to a FEX in the form of a B22HP, how was DCB being negotiated, and why did the resulting MTU drop to roughly 2.5k? This sounds a lot like the MTU for Fibre Channel over Ethernet (FCoE)… but the storage personality for the CNA was set to iSCSI. Furthermore, why didn’t this affect other operating systems or the native driver (qfle3) supplied with ESXi 6.5? I don’t have answers to these questions yet, and I don’t know whether I will be able to dedicate the time for much more research on this subject. Too much time has been spent on this already, and there are pending projects. If I learn the answer, I will post an edit or a follow-up.
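For reference, the "roughly 2.5k" figure can be derived from the largest ping that worked. The Python sketch below adds the standard header sizes back onto the 2344-byte payload. Whether the frames on this link carried an 802.1Q tag is an assumption I have not verified, so both cases are shown. This only shows where the effective limit sat; it does not explain where the limit came from.

```python
ICMP_HEADER_BYTES = 8       # ICMP echo header
IPV4_HEADER_BYTES = 20      # IPv4 header without options
ETHERNET_HEADER_BYTES = 14  # destination MAC + source MAC + EtherType
DOT1Q_TAG_BYTES = 4         # optional 802.1Q VLAN tag (assumption: may or may not be present)
FCS_BYTES = 4               # Ethernet frame check sequence


def frame_sizes(ping_payload: int, tagged: bool = False) -> dict:
    """Sizes, in bytes, of the packet and frame that carry a ping of the given payload."""
    ip_packet = ping_payload + ICMP_HEADER_BYTES + IPV4_HEADER_BYTES
    l2_header = ETHERNET_HEADER_BYTES + (DOT1Q_TAG_BYTES if tagged else 0)
    return {
        "ip_packet": ip_packet,
        "frame_without_fcs": ip_packet + l2_header,
        "frame_with_fcs": ip_packet + l2_header + FCS_BYTES,
    }


if __name__ == "__main__":
    print(frame_sizes(2344))               # untagged: 2372 / 2386 / 2390
    print(frame_sizes(2344, tagged=True))  # tagged:   2372 / 2390 / 2394
```

So the largest packet that made it off the blade was 2372 bytes at the IP layer, or a little under 2400 bytes on the wire.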

In the meantime, I hope this post can help someone experiencing the same scenario. The end result is that 9k jumbo frames are being passed properly, the iSCSI vmk interfaces have been attached to the BCM 57810 dependent iSCSI adapters, and the blade has been successfully added to the cluster.