VMware vSphere 6 NMP PDL bug worth noting

It is not news, of course, that all software suffers from bugs. I think it is safe to say, however, that most bugs in GA products are innocuous thanks to the extensive testing done before release. So you might rightly assume that if I am calling one out in a post, it must not be so innocent. And that is the case with this one. As the title suggests, the bug concerns path management, and in particular path failover, with VMware’s Native Multipathing Plugin (NMP) on vSphere 6.

Let’s start with a quick refresher about losing access to storage from the ESXi hosts – in particular All Paths Down (APD) and Permanent Device Loss (PDL).

All Paths Down, or APD, occurs on an ESXi host when a storage device is removed from the host in an uncontrolled manner (or the device fails), and the VMkernel core storage stack does not know how long the loss of device access will last. VMware, however, assumes the condition is temporary. A typical way of getting into APD is removing the zoning.

Permanent Device Loss, or PDL, is similar to APD except that it represents an unrecoverable loss of access to the storage. VMware assumes the storage is never coming back. Removing the device backing the datastore from the storage group would produce a PDL.

The condition that this particular bug impacts is PDL. VMware relies on the storage array to send the correct SCSI sense code when there is an issue accessing a device on a particular path. For PDL, the sense code the storage array sends is 0x5 0x25 0x0 – ILLEGAL REQUEST – LOGICAL UNIT NOT SUPPORTED. Now if there are multiple paths to the device and VMware receives this sense code on one of those paths, it should then retry on one of the other paths. VMware will not declare a PDL unless all paths that access the device return the sense code.

But here is where the bug strikes. If you are using NMP and vSphere 6 (U1 and earlier) and VMware receives the sense code on one of the paths, it will take some time before VMware figures out the path is dead, and therefore it won’t immediately try another path. In the meantime the VM sees a device loss, and the VM and the applications running in it may act accordingly. For instance, in my basic testing of a Linux VM, my file system went read-only immediately, which required that I reset the VM. Disruptive to say the least.

When NMP eventually recognizes the dead path and the remaining active path(s), it marks the failed path as dead, selects an active path, and continues I/O. Whether the VM and/or applications can survive this delay will vary.
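To make the expected (non-buggy) behavior concrete, here is a minimal Python sketch of the failover rule described above: retry on another path when one path returns the PDL sense code, and declare PDL only when every path returns it. This is a toy model and not VMware’s NMP code; the `Path` class, the `issue_io` function, and the path names are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple

# Sense data quoted in the post: sense key 0x5 (ILLEGAL REQUEST),
# ASC 0x25 / ASCQ 0x0 (LOGICAL UNIT NOT SUPPORTED).
PDL_SENSE = (0x5, 0x25, 0x0)


class DeviceState(Enum):
    OK = "ok"
    PDL = "permanent device loss"


@dataclass
class Path:
    name: str                              # hypothetical path label
    sense: Optional[Tuple[int, int, int]]  # None means I/O succeeds on this path


def issue_io(paths: List[Path]):
    """Try each path in order; declare PDL only if every path reports it.

    Returns (device state, name of the path that served the I/O or None,
    names of the paths that returned the PDL sense code).
    """
    dead = []
    for path in paths:
        if path.sense == PDL_SENSE:
            # Expected behavior: mark this path dead and fail over at once.
            dead.append(path.name)
            continue
        return DeviceState.OK, path.name, dead
    # No path served the I/O: all of them returned the PDL sense code.
    return DeviceState.PDL, None, dead
```

In this model, a device with one failed path and one healthy path stays `OK` and the I/O is served by the healthy path. The bug described in this post amounts to the failover step being delayed, so the guest sees an error in the interim even though a healthy path exists.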

Note that our testing was done with VMAX, where we found some resiliency in path acquisition, though VMware indicates in their KB (below) that there are cases where VMware will not find the other active path and a reboot of the ESXi host will be required.

This bug is VMware specific and as such it impacts many storage vendors and their arrays. On the VMAX, personally I have not seen this issue in the field or even in my labs (until forced by removing a port from my port group, for example). I think this could be for a number of reasons. The foremost reason is that the VMAX is resilient and we just don’t often have issues that would expose the bug. Another is that our recommended multipathing software to use with VMware, PowerPath/VE, does not experience this issue. PP/VE handles things like port losses perfectly fine. A third reason might be that this only affects PDL events, not APD. Other vendors and arrays have not been so fortunate in their experiences, which is why I think it is important to know about this bug.

In any case, if you are using NMP with vSphere 6 (the PSP doesn’t matter, whether Fixed or Round Robin; vSphere 5 is not impacted), you may wish to obtain and apply the patch, or in the near future upgrade to a release with the patch included (the next major release of ESXi 6, which I believe is this quarter). As I noted, we haven’t seen this on the VMAX except in our targeted testing, so the risk to our array is minimal.

Details on the bug itself are here:  http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2144657

And the process for obtaining a patch can be found in this KB:  http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1003433

The relevant section is found under item number 6 of that KB.


3 thoughts on “VMware vSphere 6 NMP PDL bug worth noting”

  1. I’ve been to both of the articles listed and neither of them states that there is a patch available for download. I’m experiencing this now in my environment of 70+ ESX hosts and it’s causing some major issues.

    • Adam,

      There is no released, posted patch for the problem so you’ll need to open an SR with VMware. It is a poorly worded statement but here is where VMware says this in KB 2144657: “If this issue is leading to a critical condition in your environment, or if you think you are likely to encounter this issue due to pending upgrades/updates to the storage infrastructure, please file a Support Request with VMware Support to determine your exposure and discuss potential options. For more information, see How to file a Support Request in My VMware (2006985)”

      VMware changed the phrasing recently which has contributed to the confusion. VMware has patches for the various vSphere 6 releases.
