Here’s one of those old issues that seems to keep coming up – consistent LUN IDs with VMware. As this discussion has been around since VMware introduced vMotion, I’ll be brief with the history.
I'm only going to talk about LUN IDs from the VMware side here. On the PowerMax you can use something called dynamic LUNs to force VMware to use a particular LUN ID, but that is typically reserved for advanced corrective procedures.
What is the LUN ID?
When you present a device to an ESXi host, it gets assigned a LUN ID – a device identifier. For example, here is a list of devices from one of my hosts, dsib0180 – the LUN column is the LUN ID.
The first thing you probably notice is that I have three LUN IDs of 1. How is that possible? Each array gets its own list of LUN IDs and I have two arrays: the one in blue is xxx1883 and the one in red is xxx1879. On close inspection you will see that the device in purple is also on array xxx1879. That one is an iSCSI device, though, and because it is on a different adapter (software iSCSI) it also gets its own LUN ID list. And yes, if I presented an iSCSI device from xxx1883 it could also be LUN ID 1. Granted, a little confusing, but understandable I hope. These LUN IDs continue to increment as I present storage from my arrays to the ESXi hosts, up to whatever that vSphere version supports.
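To make the scoping concrete, here is a minimal Python sketch. The device tuples, field names, and device labels are made up for illustration (this is not output from any VMware tool), and it simply models the description above: a LUN ID is unique within an adapter-and-array pair, not across the whole host.

```python
from collections import defaultdict

# Hypothetical inventory for one host: (adapter, array, lun_id, device label).
# All values are illustrative assumptions, not real host output.
devices = [
    ("fc", "xxx1883", 1, "dev_a"),
    ("fc", "xxx1879", 1, "dev_b"),
    ("iscsi", "xxx1879", 1, "dev_c"),
    ("fc", "xxx1883", 2, "dev_d"),
]

def lun_id_scopes(devs):
    """Group devices by (adapter, array); each pair keeps its own LUN ID list."""
    scopes = defaultdict(dict)
    for adapter, array, lun_id, label in devs:
        scope = scopes[(adapter, array)]
        if lun_id in scope:
            # Within a single scope a LUN ID can only be used once.
            raise ValueError(f"LUN ID {lun_id} reused within {adapter}/{array}")
        scope[lun_id] = label
    return dict(scopes)

scopes = lun_id_scopes(devices)
# Count how many separate scopes contain a device at LUN ID 1.
lun_1_count = sum(1 for scope in scopes.values() if 1 in scope)
```

With this sample data, LUN ID 1 appears once in each of the three scopes, which is the situation in the device list above.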
Consistent LUN ID
So with this information, we can then ask: what is a consistent LUN ID? This requires a second host to make sense. Consistent LUN ID means that every host in my ESXi cluster has the same storage device assigned to the same LUN ID. For example, here is another host in my cluster, dsib0182. I’ve highlighted in blue the devices that have the same Network Address Authority identifier (naa.) as on dsib0180, and as you can see they also have the same LUN ID, 1.
Because the LUN IDs match for each device, we are said to have consistent LUN IDs.
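If you wanted to check this across a cluster, the comparison could be sketched as follows in Python. The host-to-device mapping is hypothetical (placeholder naa values and LUN IDs, typed in by hand rather than pulled from vCenter), so treat it as an illustration of the definition and not as a VMware API.

```python
from collections import defaultdict

# Hypothetical per-host view: host name -> {naa identifier: LUN ID}.
# The naa strings and LUN IDs are placeholders, not real inventory.
hosts = {
    "dsib0180": {"naa.example0001": 1, "naa.example0002": 2},
    "dsib0182": {"naa.example0001": 1, "naa.example0002": 3},
}

def inconsistent_luns(host_views):
    """Return {naa: {host: lun_id}} for devices whose LUN ID differs between hosts."""
    seen = defaultdict(dict)
    for host, view in host_views.items():
        for naa, lun_id in view.items():
            seen[naa][host] = lun_id
    return {
        naa: by_host
        for naa, by_host in seen.items()
        if len(set(by_host.values())) > 1
    }

mismatches = inconsistent_luns(hosts)
```

Here `mismatches` would contain only `naa.example0002`, because the two hosts disagree on its LUN ID. An empty result means the cluster has consistent LUN IDs for every device in the input.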
Why does it matter?
And now we’ve reached the big question. Before vSphere 5.5, VMware required consistent LUN IDs in order to vMotion a VM which had RDMs (not VMFS) from one host to another. If you did not have them, VMware would fail the compatibility check and you would be prevented from doing a vMotion. In vSphere 5.5, this restriction was lifted. Cormac Hogan, who works for VMware, has an old blog post on it which explains it well here. Basically VMware stopped using the LUN ID and instead uses the Network Address Authority identifier, or naa (or alternatively the eui). Since the naa for a device does not change based on the order in which the device is presented to the host, each host sees the same one no matter what LUN ID is assigned. For the PowerMax, the naa is the WWN of the device. So if we take the 6 TB device in the picture above as an example, the naa is 60000970000197601883533030304533, and if we look on the array at that 6 TB device it is the same as the WWN. [Note this also holds true if you use the mobility ID; I’m only showing the compatibility ID here.]
So as everyone is at least at vSphere 5.5 (OK, not everyone, but let’s agree most), what’s the big deal? The problem is Microsoft Clustering (MSCS) with SCSI-3 reservations, which required pRDMs before VMFS clustering arrived in 7.0. This application must have consistent LUN IDs to vMotion. Because of this, VMware continues to call out consistent LUN IDs in its documentation, which in my opinion only serves to confuse customers. I think it would be far better to explain the corner cases that need it, but there it is.
But to be clear, you don’t need consistent LUN IDs to vMotion a VM with an RDM if not using MSCS. Here’s a quick example GIF just to prove the point. Just click on it if it is not resolving.
Device expansion
One quirky issue which I’ve seen is the inability to expand a datastore in vCenter, after expanding the device on the array, when the LUN IDs are different across hosts in a cluster. This is a strange bug because, again, VMware should be using the WWN and not the LUN ID. Perhaps it is old code they never fixed, but if you try to expand a datastore and your resized device is not showing up in the wizard, check whether the LUN IDs differ between hosts in the cluster. The workaround is easy enough: log in to an ESXi host in the cluster directly (vSphere Client) and expand it there. Then rescan the cluster in vCenter and all hosts will see the new size. Datastore expansion is such a common problem that VMware has a KB on it here: https://kb.vmware.com/s/article/1017662, but that KB doesn’t talk about LUN ID.
Best practice
The use of consistent LUNs is still a good idea if you can do it. On the PowerMax side you can set initiator groups to use consistent LUNs so that when you provision to ESXi clusters they also have consistent LUN IDs. However, one use case where that is problematic is SRDF/Metro with a uniform configuration – presenting both the R1 and R2 to all ESXi hosts. The arrays cannot coordinate LUN IDs so unless you force dynamic LUNs as I mentioned in the aside at the beginning, they will be different. Case in point, that same 6 TB device above on my host is in an SRDF/Metro uniform config and note here how I have two LUN IDs, 1 and 2.
Again, this is not an issue, because the naa for the R1 and R2 is the same and that is what VMware uses. The exception is MSCS with pRDMs on SRDF/Metro uniform.
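As a rough illustration of the uniform case, the Python sketch below groups a host's paths by naa and lists the LUN IDs seen for each device. The path tuples and array labels are invented for the example; the only thing taken from the text is that the R1 and R2 share one naa while each array hands out its own LUN ID.

```python
from collections import defaultdict

# Hypothetical path list for a single host: (naa, presenting array, LUN ID).
# "R1-array" / "R2-array" and the naa strings are placeholders.
paths = [
    ("naa.example0001", "R1-array", 1),
    ("naa.example0001", "R2-array", 2),
    ("naa.example0002", "R1-array", 3),
]

def lun_ids_per_device(path_list):
    """Map each naa to the sorted list of LUN IDs it is presented under."""
    ids = defaultdict(set)
    for naa, _array, lun_id in path_list:
        ids[naa].add(lun_id)
    return {naa: sorted(values) for naa, values in ids.items()}

by_device = lun_ids_per_device(paths)
# Devices that show up under more than one LUN ID on this host.
multi_id = [naa for naa, values in by_device.items() if len(values) > 1]
```

A device that lands in `multi_id` is the one to look at if MSCS with pRDMs is involved; for everything else the shared naa is what matters.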
Future
This is an issue that will work itself out over time because now that VMware supports clustered VMFS for MSCS, soon the pRDMs will no longer be needed.
IMPORTANT
Just a quick final note that while VMware does not require the LUN ID to be the same, I can't speak to all arrays out there and what they say is needed. For example, the Unity platform requires that when presenting devices to ESXi from the array, you use consistent LUNs so that each ESXi host has the same LUN ID. That holds true for all devices, both VMFS and RDM. It doesn't break VMware if you don't have the same LUN ID, but it might impact your optimized path and that is bad for Unity's performance. VMAX/PowerMax have no such issue as our architecture does not use SP-based LUN ownership. So just be sure to check with your array vendor.
In a Unity support case we were told inconsistent IDs across hosts would cause excessive trespass across the SPs. I’ve yet to find any kb articles or other documentation to support this.
Unfortunately I don’t know Unity well, but as I recall an SP owns a particular LUN no matter which hosts you present it to. So what LUN ID VMware assigns seems irrelevant to me in terms of IO to the device as you are going to that SP. Anyway I can poke around internally and see if someone can explain it. If you have the case number you could email it to me so I can provide reference to whomever I find.
What I did find is that consistent lun is a requirement with Unity arrays. The connectivity guide states: “Each LUN must present the same LUN ID to all hosts.” I am guessing that failure to do so means that the unoptimized paths (to the non-owner SP) get used more frequently on some ESXi hosts and then you get the trespassing. I’ll make a note in the blog post.
Thanks Drew for making me aware of “clustered VMFS for MSCS” 🙂