It is not news, of course, that all software suffers from bugs. It is safe to say, however, that most bugs in generally available (GA) products are innocuous, thanks to the extensive testing done before release. So you might rightly assume that if I am calling one out in a post, it must not be so innocent. And that is the case here. As the title suggests, the bug concerns path management – in particular, path failover with VMware’s Native Multipathing Plugin (NMP) on vSphere 6.
Let’s start with a quick refresher on losing access to storage from an ESXi host – in particular, All Paths Down (APD) and Permanent Device Loss (PDL).
All Paths Down, or APD, occurs on an ESXi host when a storage device is removed from the host in an uncontrolled manner (or the device fails) and the VMkernel core storage stack does not know how long the loss of access will last. VMware, however, assumes the condition is temporary. A typical way to get into APD would be removing the zoning.
Permanent Device Loss, or PDL, is similar to APD except that it represents an unrecoverable loss of access to the storage: VMware assumes the storage is never coming back. Removing the device backing the datastore from the storage group would produce this condition.
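If you want to see how an ESXi host currently views a device and its paths, the standard esxcli namespaces cover it. A minimal sketch, assuming you are in an ESXi shell (or connected via SSH) and that naa.xxx stands in for your device identifier:

```shell
# List all storage devices the host sees; the Status field reflects
# the device state the VMkernel has recorded
esxcli storage core device list

# Narrow to a single device (naa.xxx is a placeholder for your LUN's identifier)
esxcli storage core device list -d naa.xxx

# Show every path to that device and each path's state
esxcli storage core path list -d naa.xxx
```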
The condition this particular bug impacts is PDL. VMware relies on the storage array to send the correct SCSI sense code when there is an issue accessing a device on a particular path. For PDL, the sense code the storage array sends is 0x5 0x25 0x0 – ILLEGAL REQUEST – LOGICAL UNIT NOT SUPPORTED. If there are multiple paths to the device and VMware receives this sense code on one of them, it should retry on one of the other paths; VMware will not declare a PDL unless every path to the device returns the sense code.

But here is where the bug strikes. If you are using NMP on vSphere 6 (U1 and earlier) and VMware receives the sense code on one of the paths, it takes some time before VMware figures out the path is dead, so it does not immediately try another path. In the meantime the VM sees a device loss, and the VM and the applications running in it may act accordingly. For instance, in my basic testing with a Linux VM, my file system went read-only immediately, which required that I reset the VM. Disruptive to say the least. When NMP eventually recognizes the dead path, it marks that path dead, selects one of the remaining active paths, and continues I/O. Whether the VM and/or applications can survive this delay will vary.
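You can spot the PDL sense code yourself in /var/log/vmkernel.log on the host. The log line below is a hypothetical excerpt (the command address and device name are invented for illustration), but the sense-data portion is the signature to look for, and a simple grep isolates it:

```shell
# Hypothetical vmkernel.log excerpt; only the "Valid sense data" portion
# is what you would actually match on
logline='ScsiDeviceIO: Cmd 0x28 to dev "naa.xxx" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x25 0x0.'

# 0x5 = ILLEGAL REQUEST (sense key); 0x25/0x0 = LOGICAL UNIT NOT SUPPORTED (ASC/ASCQ)
echo "$logline" | grep -o 'Valid sense data: 0x5 0x25 0x0'
```

On a live host you would grep the log file itself, e.g. `grep '0x5 0x25 0x0' /var/log/vmkernel.log`.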
Note that our testing was done with VMAX, where we found some resiliency in path acquisition, though VMware indicates in its KB (below) that there are cases where VMware will not find the other active path and a reboot of the ESXi host will be required.
This bug is VMware specific, and as such it impacts many storage vendors and their arrays. On the VMAX I personally have not seen this issue in the field or even in my labs (until I forced it by, for example, removing a port from my port group). I think there are a number of reasons for this. The foremost is that the VMAX is resilient and we simply don’t often hit issues that would expose the bug. Another is that our recommended multipathing software for VMware, PowerPath/VE, does not experience this issue; PP/VE handles events like port losses perfectly fine. A third is that the bug only affects PDL events, not APD. Other vendors and arrays have not been so fortunate, which is why I think it is important to know about this bug.
In any case, if you are using NMP with vSphere 6 (the PSP does not matter – Fixed or Round Robin; vSphere 5 is not impacted), you may wish to obtain and apply the patch, or in the near future upgrade to a release that includes it (the next major release of ESXi 6, which I believe is due this quarter). As I noted, we have not seen this on the VMAX except in our targeted testing, so the risk to our array is minimal.
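Before deciding whether the patch applies to you, it is worth confirming which plugin and PSP each device is actually using, and what build you are running. A sketch, again assuming an ESXi shell:

```shell
# Each NMP-claimed device is listed with its Path Selection Policy,
# e.g. VMW_PSP_RR (Round Robin) or VMW_PSP_FIXED (Fixed)
esxcli storage nmp device list

# Report the ESXi version and build number, to compare against the
# patched build called out in the VMware KB
esxcli system version get
```

A device claimed by PowerPath/VE rather than NMP will not appear under the nmp namespace, which is one quick way to tell which plugin owns it.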
Details on the bug itself are here: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2144657
And the process for obtaining a patch can be found in this KB: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1003433
The relevant section is found under item number 6: