NVMe/TCP on the PowerMax

With our new platform we’ve introduced another NVMeoF solution – NVMe/TCP. We will not be supporting FC-NVMe on the PowerMax 2500/8500 in the GA release, though I’d guess it will come in the future. The reason for the focus is simple: the future is in TCP. IDC projects that TCP, not FC, will see the majority of NVMe growth. Yes, it would be fair to say IDC isn’t always right – see iSCSI for reference – but they do set the tone for the industry, so we ignore them at our own peril. But I digress, as I’m here to show you how it works on PowerMax and VMware. Its use, implementation, and future are completely up to you.

The first thing to know about NVMe/TCP is that it’s new. Right now the only OS we have fully certified is vSphere 7 U3+, which is the first vSphere release to support NVMe/TCP.

As NVMe/TCP is a flavor of NVMe over fabrics, it’s useful to return to one of my previous posts on FC-NVMe, because most of the caveats, restrictions, etc., I wrote about still hold true for TCP. NVMeoF is still not a fully realized end-to-end solution. Just like with FC-NVMe, SCSI translations are required and the limited VAAI command set (no XCOPY) is still in place. I’m going to borrow liberally from my other post because I don’t want to simply redirect you since we are talking about a different protocol.

To use NVMe/TCP you need the following hardware:

  • A 25 Gb Clearsky SLIC on the PowerMax 2500 or 8500 array. This is the same SLIC that supports iSCSI and SRDF GigE; I run both iSCSI and NVMe/TCP on mine.
  • A host NIC in the ESXi server that supports NVMe/TCP – be sure you have the latest firmware and driver supported with vSphere 7 U3.

You can, as I do, run a different speed on the host. My lab only has 10 Gb, so the switch does the translation work between the speeds. This is not recommended since performance is obviously impacted, but we work with what we’ve got.

The following sections cover the important VMware/Dell/General restrictions around NVMe/TCP. Note that this is not an exhaustive list, but these are the restrictions most likely to be of interest to the user. Expect them to be lifted over time.

VMware restrictions

  • vVols are not supported with NVMeoF. The Protocol Endpoint can only be presented via FC or iSCSI. Support is on the road map, but it will require development on both sides – VMware and the storage vendors, so it will take some time.
  • RDMs are not supported on NVMeoF, only VMFS, and note that there is no intent to change this. For those using solutions like Microsoft Clustering that have relied on RDMs, expect that VMware will support clustered VMFS with NVMeoF in the future. If you are using RDMs for, say, an Oracle database, VMware’s position is that you should use vVols instead.
  • No Site Recovery Manager (SRM) support on NVMeoF yet.
  • No SCSI-2 reservations.
  • Only 4 paths per namespace (device) and only 32 namespaces (devices) per host – so 128 total paths. (This is perhaps the most limiting of the restrictions.)
  • VMware only supports ALUA with NVMeoF. Our devices advertise ALUA so this is not an issue.
  • No NMP support – all NVMeoF devices use the VMware HPP plugin, which serves the same role as NMP.
Dell restrictions

  • You can only present a device with one protocol.
  • There is no support for SRDF/Metro. Future support will require work from both VMware and Dell, so it is not a small undertaking.
  • Both PowerPath/VE and VMware’s HPP plugin support NVMe/TCP with VMware (the NMP plugin is not supported). We strongly recommend PP/VE. I have examples of both below.
General restrictions

  • There are no adjustable queues with NVMeoF because, well, the technology works differently. In other words, don’t worry about it.
  • VAAI Primitives
    • ATS Compare and Write supported.
    • UNMAP (deallocate in NVMe) supported.
    • XCOPY not supported. This particular primitive has not been ported over to the NVMe command set yet.
    • Block zero or WRITE SAME (Write Zeroes in NVMe) supported.


Let’s walk through deploying NVMe/TCP in your VMware environment. Unlike FC-NVMe, which works through an attribute of the HBA, NVMe/TCP needs a software adapter, akin to iSCSI. So navigate to Host -> Configure -> Storage Adapters -> ADD SOFTWARE ADAPTER.
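If you prefer the command line, the same software adapter can be created with esxcli. A quick sketch, with the assumption that vmnic1 is the uplink you are dedicating to NVMe/TCP (substitute your own):

```shell
# Create a software NVMe/TCP adapter bound to a physical NIC
# (vmnic1 is an assumption - use your own uplink)
esxcli nvme fabrics enable --protocol TCP --device vmnic1

# The new vmhba should appear in the adapter list
esxcli nvme adapter list
```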

In step 2 select Add NVMe over TCP adapter.

In step 3 select the correct NIC. This NIC should have a VMkernel port configured with the NVMe over TCP service enabled, as shown below. Failure to enable the service will leave you scratching your head when you can’t discover any controllers even though they are perfectly pingable from the host. You’ll get the error: “The object or item referred to could not be found.” Heaven forbid VMware simply say you don’t have the service enabled – I know, too much to ask. Anyway, onward to the MTU. If you are using Jumbo Frames it is particularly important that the networks and MTUs are all set to 9000+. If there is a size mismatch anywhere, vSphere will have trouble discovering the controllers – all in all, strange behavior. We also support VLANs, as I am using below.
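The VMkernel tagging and MTU checks can also be done from the CLI. A sketch, with the assumptions that vmk2 is the VMkernel port you created and the array port IP below is a placeholder:

```shell
# Enable the NVMe/TCP service on the VMkernel port
# (vmk2 is an assumption - use your own vmk)
esxcli network ip interface tag add -i vmk2 -t NVMeTCP

# Check the MTU on the interface
esxcli network ip interface list

# With Jumbo Frames, verify a 9000 MTU end to end. 8972 bytes of
# payload plus the IP/ICMP headers equals 9000; -d disallows fragmentation.
vmkping -I vmk2 -s 8972 -d 192.168.1.50
```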


On the array side we need to configure the TCP interfaces, which we will then discover as controllers from vSphere. Within Unisphere, navigate to System -> iSCSI + NVMe -> NVME tab. Note that you can configure NVMe with either embedded or external Unisphere; file and vVols are the only features that require embedded.

In step 1, start by launching the NVME/TCP CONFIGURATION WIZARD.

In step 2 select one of the directors and then put a value in the Network ID. This network ID is used for identification purposes only, and the port will default. The Endpoint Name is optional; in this example I’ve used the array ID-director-network ID as my name. You won’t have to supply this name in vSphere, so feel free to let it default to a system-generated name if you wish. As I use multiple arrays in the same vSphere environment, this naming makes it easier for me to keep track.

In step 3 supply an IP address, prefix, and, if necessary, a VLAN ID. As I mentioned, if you want to use Jumbo Frames check the box, which will set the MTU to 9000 – be sure vSphere matches. Leave the Centralized Discovery Controller at the default.

Review the changes in step 4 and then Run Now.

Follow the same process for all other directors/ports. I have my second port below. Note my different network ID and custom Endpoint Name.

OK, now we have NVMe/TCP configured on the array. For the remainder of the steps I’ve recorded a demo. It walks through:

  • Adding the array controller to vSphere
  • Creation of autoprovisioning objects, including presenting three devices to the cluster
  • Creation of datastores on the TCP devices
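For those who prefer the CLI, the controller discovery and connection that the demo performs in the vSphere Client can be sketched with esxcli as well. Assumptions here: vmhba65 is the software NVMe/TCP adapter, 192.168.1.50 is one of the array endpoint IPs configured above, and the subsystem NQN is a placeholder you would take from the discover output:

```shell
# Discover the NVMe/TCP controllers behind an array endpoint
esxcli nvme fabrics discover -a vmhba65 -i 192.168.1.50 -p 4420

# Connect to a discovered subsystem (the NQN below is a placeholder -
# copy the real one from the discover output)
esxcli nvme fabrics connect -a vmhba65 -i 192.168.1.50 -p 4420 -s nqn.example:placeholder

# Verify what the host now sees
esxcli nvme controller list
esxcli nvme namespace list
```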


Native Multipathing plug-ins

By default, the native multipathing plug-in (NMP) supplied by VMware is used to manage I/O for non-NVMeoF devices. NMP can be configured to support fixed and round robin (RR) path selection polices (PSP). In addition, Dell supports the use of ALUA (Asymmetrical Logical Unit Access) only with the Mobility ID for non-NVMeoF devices.

NMP is not supported for NVMe/TCP. VMware uses a different plug-in called the High-Performance Plug-in, or HPP. This plug-in was developed specifically for NVMe devices, though it is the default only for NVMeoF devices; for local NVMe devices NMP remains the default, though that can be changed through claim rules. HPP only supports ALUA with NVMeoF devices, but unlike with NMP it is unnecessary to create a separate claim rule for these devices, as HPP is designed for ALUA. To support multipathing, HPP uses Path Selection Schemes (PSS) when selecting physical paths for I/O requests. HPP supports the following PSS mechanisms:

  • Fixed
  • LB-RR (Load Balance – Round Robin)
  • LB-IOPS (Load Balance – IOPs)
  • LB-BYTES (Load Balance – Bytes)
  • LB-Latency (Load Balance – Latency)

HPP Path Selection Schemes (PSS)

The High-Performance Plug-in uses Path Selection Schemes (PSS) to manage multipathing just as NMP uses PSP. As noted above, HPP offers the following PSS options:

  • Fixed – Use a specific preferred path
  • LB-RR (Load Balance – Round Robin) – the default PSS. After 1,000 I/Os or 10,485,760 bytes (whichever comes first), the path is switched in round-robin fashion. This is the equivalent of the NMP PSP RR.
  • LB-IOPS (Load Balance – IOPs) – when 1,000 I/Os are reached (or a set number), VMware switches to the path with the fewest outstanding I/Os.
  • LB-BYTES (Load Balance – Bytes) – when 10 MB are reached (or a set number), VMware switches to the path with the fewest outstanding bytes.
  • LB-Latency (Load Balance – Latency) – the same mechanism available with NMP; VMware evaluates the paths and chooses the one with the lowest latency.

Because LB-IOPS, LB-BYTES, and LB-Latency add intelligence to path selection, they are superior to LB-RR or Fixed. As performance is paramount for NVMeoF, Dell recommends using LB-Latency, which offers the best chance at uniform performance across the paths.

To set the PSS on an individual device, issue the following:

esxcli storage hpp device set -P LB-Latency -d eui.04505330303033380000976000019760
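To confirm the change took effect, you can list the device again (same example eui as above):

```shell
# Shows the HPP configuration for the device, including the active PSS
esxcli storage hpp device list -d eui.04505330303033380000976000019760
```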

You can add a claim rule so that this PSS is used for every NVMe/TCP device at reboot. Note that we cannot pass the usual “model” flag because that field is restricted to 16 characters and our model names are 17 characters (EMC PowerMax_8500, EMC PowerMax_2500). For cases like these VMware offers the --nvme-controller-model flag. Here is an example of adding a claim rule for the 8500; if you have a 2500, just change the number.

esxcli storage core claimrule add -r 914 -t vendor --nvme-controller-model='EMC PowerMax_8500' -P HPP --config-string "pss=LB-Latency"
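As an aside, the length problem is easy to confirm on any box; both model strings are one character over the 16-character limit of the model field:

```shell
# "EMC PowerMax_8500" and "EMC PowerMax_2500" are 17 characters each,
# one over the 16-character limit of the claimrule "model" field -
# hence the --nvme-controller-model flag.
printf 'EMC PowerMax_8500' | wc -c
printf 'EMC PowerMax_2500' | wc -c
```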

If you want the rule to take effect immediately, for any devices added before the next reboot, issue:

esxcli storage core claimrule load

Latency threshold setting

By default, every I/O that passes through ESXi goes through the I/O scheduler. Because of the speed of NVMe, using the scheduler might create internal queuing, slowing down the I/O. VMware offers the ability to set a latency threshold so that any I/O with a response time below the threshold bypasses the scheduler. When this mechanism is enabled, and the I/O latency is below the threshold, the I/O passes directly from PSA through the HPP to the device driver.

For the mechanism to work, the observed average I/O latency must be lower than the set threshold. If the I/O latency exceeds the threshold, I/O temporarily returns to the I/O scheduler; the bypass resumes once the average I/O latency drops back below the threshold.

There are a couple different ways to set the latency threshold. To list the existing thresholds, issue:

esxcli storage core device latencythreshold list

To set the latency at the device level issue:

esxcli storage core device latencythreshold set -d eui.36fe0068000009f1000097600bc724c2 -t 10

To set it for all Dell NVMe/TCP devices issue:

esxcli storage core device latencythreshold set -v 'NVMe' -m 'EMC PowerMax_8500' -t 10
esxcli storage core device latencythreshold set -v 'NVMe' -m 'EMC PowerMax_2500' -t 10

These settings persist across reboots, but latencythreshold must be set on any newly added devices. Dell makes no specific recommendation around latencythreshold, as VMware does not. There has been no scale testing to date that provides data on the value of this parameter; however, Dell supports its use if desired.

Managing HPP in vSphere Client

Claim rules and claiming operations must all be done through the CLI, but choosing the HPP multipathing policy for NVMe/TCP devices can be done in the vSphere Client itself. By default, PowerMax NVMe/TCP devices managed by HPP use the “LB-RR” PSS. As this is not a best practice, change the PSS to LB-Latency either through the CLI or the vSphere Client. Through the CLI, for each device execute:

esxcli storage hpp device set -P LB-Latency -d <device_id>

Alternatively, each device can be changed manually in the vSphere Client, as shown below.

PowerPath/VE

PowerPath fully supports vSphere 7 U3 with NVMe/TCP. Once installed, no further configuration is required. PP/VE automatically recognizes NVMe/TCP devices as ALUA, whether the Mobility ID or Compatibility ID is used. You can see below how PP recognizes the devices and applies the ALUA policy.

SmartFabric Storage Software

As there will be other coverage of our SmartFabric Storage Software (SFSS), I’m not going to spend much time on it here. It is new software we offer that helps automate the discovery portion of the steps I included above. It is delivered as an OVA file (vApp). With it you can configure zones and policies for your NVMe/TCP environment. Here is a screenshot of an environment I set up.

I think customers might find it useful in very large environments. One restriction with the first release is that it expects two VLANs in your TCP environment. I only had one, which restricted me to only one of my paths.

Migration

VMware’s current migration strategy to move from SCSI (FC or iSCSI) to NVMe/TCP is Storage vMotion (SvMotion); there is no other supported methodology. The array, of course, has no issue with simply removing the current SCSI masking view and creating one with the TCP adapters, but VMware does, so you can’t simply shut down your VMs and change protocols. The reason is that VMFS has a signature associated with it that is generated from the device ID. For SCSI the WWN is part of that signature, but NVMe/TCP uses the NGUID instead. So if you present that former SCSI datastore over TCP, VMware will see a signature mismatch. VMware will probably address this in the future via a resignature and some other steps, but right now SvMotion (hot or cold) is the only choice.


Finally, in keeping with my warts-and-all policy, I have a bug to share. In testing I had to remove the TCP adapter and add it back a number of times. Sometimes the vmnic released, sometimes it did not, which was strange in and of itself. But on more than one occasion, and seen by more than one engineer, the purple screen of death made an appearance upon reboot. In all cases it looked exactly like this:

As far as I can tell this is a known issue at VMware, and they mentioned they are fixing it. You’ll have to warm or cold boot to get out of it. I should add that I’ve also had it occur for no reason at all, not just when removing an adapter.


It seems safe to say NVMeoF in some flavor is the future of storage access. We aren’t there yet, but Dell is certainly ahead of the game here, awaiting support from the operating systems. VMware will add more support with each release, moving closer to FC/iSCSI parity. I think companies will take things slowly, particularly those looking at TCP, as for most enterprises it will be a rip-and-replace move from FC to IP. For those already running IP (iSCSI) – and in my experience that is a small number – the transition will be easier.


