vSphere 6.5 with Intel X710 network adapter

Published: Wednesday, 28 March 2018 09:04

In the past I was really impressed by Intel's product stability and support. You could take any piece of Intel hardware, put it into your server, and every application and operating system supported it instantly. Driver support was excellent, so there was no reason to bother with the sometimes unstable chipsets from QLogic, NetXen, Emulex or even Broadcom. With the release of the X710 chipset this success story was supposed to continue. Don't ask me why, but only recently did I install my first Intel X710 NIC, in a new server intended for vSphere 6.5. I installed ESXi 6.5 and, as expected, the hardware was correctly recognized.

After the base configuration of two of these ESXi hosts, I connected them to vCenter. The first one was added successfully; the second one could not connect. I double-checked all settings like IP address, VLAN, password, etc. The host was even pingable. The vCenter error pointed to management agents not running on the ESXi host. Well, this was a brand-new installation, and I had never needed to restart the host or the management agents before adding a host to vCenter. Nevertheless, I restarted the management agents and, voilà, the second ESXi host connected to vCenter.
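If you prefer the shell over the DCUI (Troubleshooting Options > Restart Management Agents), restarting the agents looks roughly like the sketch below. It assumes SSH is enabled on the host; hostd and vpxa are the standard ESXi management agents, while the run wrapper and the DRY_RUN switch are my own additions, so the script only prints the commands until you set DRY_RUN=0.

```shell
#!/bin/sh
# Sketch: restart the ESXi management agents over SSH.
# DRY_RUN defaults to 1, so the commands are only printed; set DRY_RUN=0 to execute them.
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise execute it
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

restart_mgmt_agents() {
    run /etc/init.d/hostd restart   # host agent
    run /etc/init.d/vpxa restart    # agent that vCenter talks to on this host
}

restart_mgmt_agents
```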

Over the next few minutes I configured host 2 to match my standard. Suddenly host 1 disconnected from vCenter. The host was still pingable, but reconnecting didn't work. Restarting the management agents did the job once again. No more disconnects occurred during the rest of the day, so I didn't dig any deeper into the problem. Then the weekend came, and on Monday the customer told me that all of his VMs were in suspend mode and both ESXi hosts were disconnected from vCenter. Sh.....!

Once again, checking and restarting the management agents caused both ESXi hosts to reconnect. The VMs had been shut down because of an HA isolation event. Nothing special here: HA simply did what the configured host isolation response told it to do.

As I was convinced that all network settings were correct, I did a quick internet search for this kind of problem. You can't imagine what I found. Disconnects, PSODs and strange behavior on servers using this NIC are quite "common". One user had already reported the problem two (!) years ago. Most reports come from Dell users, since the X710 is evidently a standard NIC in PowerEdge systems, but in general the issues are always tied to Intel's X7xx chipset. Neither Intel nor VMware nor the hardware vendors were able to fix the problem, and they kept pointing at each other. Not a good sign of teamwork, especially when big players like Intel and VMware are involved.

Everyone reported trying other drivers for the X710 card. Normally ESXi loads the i40en driver for this chipset, but some users reported being able to load the i40 driver instead. With that driver the disconnects were nearly gone, but PSODs were reported more often. Not a good trade, but I gave it a try. You have to manually uninstall the i40en VIB from the installed ESXi image and reboot the host; disabling the driver does not work, as ESXi still loads it on the next reboot. My problem was that the i40 driver was not suitable for the X710 chipset, so I was left with no network card at all after the reboot. As you can't reinstall the driver without copying the VIB to the ESXi host, I rebooted once again and loaded the previous image from the recovery partition (Shift+R at the boot screen).
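For reference, the driver removal step looks roughly like this over SSH. This is a minimal sketch, not a tested procedure: the VIB name i40en is an assumption you should confirm with "esxcli software vib list" first, the module-disable command in the comment is my guess at what "disabling the driver" refers to, and the run/DRY_RUN wrapper is hypothetical scaffolding that only prints the commands by default.

```shell
#!/bin/sh
# Sketch of the driver swap attempt (ESXi shell over SSH).
# Assumption: the native driver VIB is named "i40en"; confirm the exact name first with
#   esxcli software vib list | grep i40
# Warning: if no other driver claims the X710 afterwards, the host comes back without
# any network, so make sure you have console access before trying this.
DRY_RUN=${DRY_RUN:-1}
VIB_NAME=${VIB_NAME:-i40en}

# Print the command in dry-run mode, otherwise execute it
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

remove_native_driver() {
    # Disabling the module instead (assumed: esxcli system module set --enabled=false --module=i40en)
    # did not survive a reboot, so the VIB itself is removed.
    run esxcli software vib remove -n "$VIB_NAME"
    # The removal only takes effect after a reboot.
    run reboot
}

remove_native_driver
```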

Some of the posts about the problem were quite old and referred to driver versions 1.4 or lower. With all patches applied, ESXi 6.5 still used driver version 1.3.1, and our NIC firmware was 5.05. As this looked like a driver issue, I searched the internet for newer drivers. Surprise: version 1.5.8 is available on VMware's website. Digging a bit deeper, I found a post on Intel's forum saying that version 1.5.8 combined with the recent firmware 6.01 solves the problems with this chipset. So I downloaded the driver from here: https://my.vmware.com/de/web/vmware/details?downloadGroup=DT-ESXI65-INTEL-I40EN-158&productId=614 and the firmware from Intel's website (https://downloadcenter.intel.com/download/25796/Non-Volatile-Memory-NVM-Update-Utility-for-Intel-Ethernet-Adapters-VMware-ESX-?product=82947).

First install the new driver: upload the zip to the ESXi host, unzip its content to a temporary folder like /tmp/Intel and install the depot (offline bundle) file with "esxcli software vib install -d /tmp/Intel/<depot file>.zip". Reboot the server.

After the reboot, upload the firmware package, unpack it with "tar -zxvf xxx.tar.gz", change to the directory containing the nvmupdate64en file, make it executable with "chmod 755 nvmupdate64en" and run it with "./nvmupdate64en". The tool first scans your system for supported controllers and lists them. Choose which ones you want to upgrade (I would recommend upgrading all compatible adapters), let the upgrade run and restart your host when it has finished.
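Put together, the two stages look roughly like the following sketch. The file and directory names (DEPOT, NVM_TGZ, EXTRACT_DIR, NVM_TOOL_DIR and their defaults) are hypothetical placeholders for whatever your downloads are actually called; only the esxcli, tar and chmod calls come from the steps above. Call it with "driver" first, let the host reboot, then call it with "firmware". With the default DRY_RUN=1 it only prints what it would do.

```shell
#!/bin/sh
# Sketch of the two-stage update (ESXi shell over SSH).
# All paths below are hypothetical placeholders; adjust them to your downloads.
DRY_RUN=${DRY_RUN:-1}
DEPOT=${DEPOT:-/tmp/Intel/i40en-offline-bundle.zip}   # offline bundle inside VMware's download
NVM_TGZ=${NVM_TGZ:-/tmp/nvm.tar.gz}                   # Intel NVM update package
EXTRACT_DIR=${EXTRACT_DIR:-/tmp/nvm}
NVM_TOOL_DIR=${NVM_TOOL_DIR:-$EXTRACT_DIR}            # directory that contains nvmupdate64en

# Print the command in dry-run mode, otherwise execute it
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Stage 1: install the new driver from the offline bundle, then reboot.
install_driver() {
    # -d takes the absolute path of the depot zip, not a directory
    run esxcli software vib install -d "$DEPOT"
    run reboot
}

# Stage 2 (after the reboot): unpack and run Intel's NVM update utility, then reboot.
update_firmware() {
    run mkdir -p "$EXTRACT_DIR"
    run tar -zxvf "$NVM_TGZ" -C "$EXTRACT_DIR"
    # Start the tool from its own directory, as in the steps above
    run cd "$NVM_TOOL_DIR"
    run chmod 755 ./nvmupdate64en
    # Interactive: scans for supported adapters and asks which ones to update
    run ./nvmupdate64en
    run reboot
}

case "${1:-}" in
    driver)   install_driver ;;
    firmware) update_firmware ;;
    *)        echo "usage: $0 driver|firmware" >&2 ;;
esac
```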

After the next reboot, connect via SSH once again and list your NICs with "esxcli network nic list". Note the name (vmnicX) of one of your X710 NICs and run "esxcli network nic get -n vmnicX" to get details about the driver and firmware. The output should show driver version 1.5.8 and firmware 6.01. Now you're done, and this story hopefully has a happy ending.
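If you have several hosts to check, the two commands can be combined into a small check. This sketch assumes that the driver name is the third column of "esxcli network nic list" and that "esxcli network nic get" prints "Version:" and "Firmware Version:" lines in its Driver Info section; verify both against your own output before relying on it. The expected versions 1.5.8 and 6.01 are the ones from this post; the function names are my own.

```shell
#!/bin/sh
# Sketch: find i40en-driven NICs and compare driver/firmware against the expected versions.
# Assumed output formats (verify on your host): driver name in column 3 of
# `esxcli network nic list`; "Version:" and "Firmware Version:" lines in `nic get`.
EXPECTED_DRIVER=${EXPECTED_DRIVER:-1.5.8}
EXPECTED_FW=${EXPECTED_FW:-6.01}

# Reads `esxcli network nic list` output on stdin, prints NIC names using i40en
i40en_nics() {
    awk '$3 == "i40en" { print $1 }'
}

# Reads `esxcli network nic get -n vmnicX` output on stdin; exit status 0 if both match
check_versions() {
    awk -v drv="$EXPECTED_DRIVER" -v fw="$EXPECTED_FW" '
        $1 == "Firmware" && $2 == "Version:" { fwv = $3 }
        $1 == "Version:"                     { drvv = $2 }
        END {
            printf "driver %s, firmware %s\n", drvv, fwv
            # Compare as strings so that e.g. 6.1 and 6.10 are not treated as equal
            if ((drvv "") == (drv "") && (fwv "") == (fw "")) exit 0
            exit 1
        }'
}

# Only query esxcli when actually running on an ESXi host
if command -v esxcli >/dev/null 2>&1; then
    for nic in $(esxcli network nic list | i40en_nics); do
        printf '%s: ' "$nic"
        esxcli network nic get -n "$nic" | check_versions || echo "  -> not at the expected versions"
    done
fi
```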

Besides the fact that it took over two years to find a solution to this problem, it's a shame that this adapter is still on the HCL of supported NICs. There is no note about issues with the driver that ships with the install image and no link to newer drivers. I used an OEM-customized ESXi 6.5 Update 1 image from FTS, and even this customized ISO still includes the unstable driver version 1.3.1. VMware normally has very strict requirements for all kinds of hardware, but this time the problem was obviously ignored. I was lucky enough to install my first X710 adapter AFTER the error-free driver and firmware came out, but I really don't envy the customers out there who struggled with this issue for months or even years. Hopefully, they ripped out these adapters and replaced them with "unstable" adapters from QLogic, NetXen, Emulex or Broadcom.