NVMe/TCP with PowerFlex 4.0

Yes, that’s not a typo: PowerFlex, not PowerMax. In truth I work on both PowerMax and PowerFlex, but going forward I expect to spend more time on PowerFlex, so more blog posts on that platform will be inevitable. Fortunately there’s a good deal of overlap between the platforms’ features, like our topic today, so it seemed like a good one to start with.

Since this is a different platform, I don’t want to point you back to a PowerMax post for NVMeoF base information so I’m going to repeat some of it here for PowerFlex. If you don’t use PowerMax, then it will all be new to you anyway.

BTW for those who may be new to my blog – perhaps since you use PowerFlex not PowerMax – you won’t often find me getting lost in technical weeds. I present technology as I understand it, usually from a view rather than the trail. While you’ll find lots of information, I try not to duplicate available documentation, which you can peruse at your leisure. I hope to impart how things work, a bit of why they work that way, and with any luck, give you something to take away to your own environment. I’m also incorrigibly sarcastic (or cheeky, take your pick), straightforward, and don’t do marketing. So if that didn’t put you off, carry on.


PowerFlex

I’m going to give the briefest of descriptions of what the PowerFlex system (array isn’t really the correct term here) is and is not, and rely on your curiosity to fill in the gaps (doc link here!). PowerFlex (PF) is actually a rebranding of the ScaleIO system. PF is a software or prebuilt software/hardware solution (software-defined if you like) that consumes the local storage of a server and creates a virtual SAN. There is even an option in the cloud (these days of course). PF is IP-based and aggregates the devices into shared block storage.

PowerFlex software is installed on nodes and communicates over IP to handle the application I/O requests sent to the shared storage. As its original name suggests, PF can massively scale via these nodes but is entirely flexible (new name) in its deployment. There are a few flavors of PF, shown in this picture from the far-left rack to the fully customized solution on the right.

PF can be implemented as a VMware-based solution or on bare metal using a variety of different operating systems. Custom solutions can even be a combination of ESXi and bare metal, as my development environment is. The Management Plane mentioned in the image is where the GUI lives. It’s built on a three-node Kubernetes cluster. The GUI looks like this:

I want to define a few of the most important software components of the PowerFlex so my pictures below make sense.

MDM (Metadata Manager) – This is the management and configuration agent of PowerFlex – think rebuilds, migration, system stuff, etc. There are usually two or more MDMs plus a tiebreaker, though it can be run in single-server mode.

SDS (Storage Data Server) – This software is what controls all the aggregate storage and receives the requests from SDC or SDT. It is installed only on servers that have local storage to be used in the virtual SAN.

SDC (Storage Data Client) – This is a device driver that exposes the PowerFlex volumes as block devices. In the ESXi space, you will see a generic storage adapter (vmhbaxx) that is used for device exposure. In a PowerFlex environment you might only have this driver installed on a node (ESXi or bare metal) and nothing else.

SDT (Storage Data Target) – An NVMe target. It is installed with the SDS since it must communicate directly with it. The NVMe initiator from the host (ESXi or otherwise) talks to this SDT. And if you are only using NVMe/TCP and not SCSI, you actually do not require any PowerFlex components on your host.

There are a few other software pieces – for replication and NFS – but they are optional and don’t concern us here. I should mention the SDT is technically optional too since it is only needed for NVMe/TCP, but that’s our topic so it’s not optional for us 🙂
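
If you want to see these pieces in your own system, the PowerFlex CLI (scli) on an MDM node can list them. A quick sketch, assuming the standard scli queries (you’ll need to log in with scli --login first, and your output will obviously look different than mine):

# MDM cluster state – primary, secondary and tiebreaker members
scli --query_cluster

# Every SDS contributing local storage to the virtual SAN
scli --query_all_sds

# Every SDC (client) known to the system
scli --query_all_sdc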

With that intro let’s look at the difference between data flow with SCSI (SDC) and TCP (SDT). We’ll start with the environment.

Environment

To set the stage, my environment is a Frankenstein monster of sorts. It’s a lab, so I’m permitted some indiscretion with the configuration – in other words, don’t do this at home. The consumption model (the pic above) is the SDS one, meaning it is a fully customized, software setup. I have three RHEL 8 nodes which have all the components above installed: MDM, SDS, SDC and SDT. The SDC is actually optional since I don’t plan on using storage on the nodes, but I put it on so I could easily test. PowerFlex is best with a robust, redundant network – multiple switches, NICs, ports, paths, etc. I have a single switch, port, and NIC. Yeah, not great, but, well, my lab (you can see my connectivity complaints in the dashboard image above). My GUI Kubernetes cluster is installed on three SLES VMs. In addition, I’ve installed the SDC on a four-node, 7.0U3 vSphere cluster. We currently support up to vSphere 7.x. vSphere 8 support is coming.

So how do we access the storage? Well, that depends on the protocol, so let’s see how data access flows in SCSI and NVMe/TCP.

Data Flow

SDC

This image is a good depiction of my environment (save for the extra storage node). Each SDC on my four ESXi hosts can talk to every SDS in the cluster. This is critical since my data might live on any of those nodes (remember disk is local, but aggregated into the virtual SAN).

Storage is mapped to ESXi via IP (yes, TCP like the picture shows, but I’m trying not to confuse it with NVMe/TCP). The SDC, however, exposes the device as a block volume, much like a traditional FC device might look in VMware. In fact you will see the SDC represented as an FC storage adapter with no associated model name.
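
If you want to spot the SDC on an ESXi host from the command line, listing the storage adapters is enough. A quick sketch using the stock esxcli call, nothing PowerFlex-specific:

# The SDC shows up as a vmhba with a blank model/description
esxcli storage core adapter list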

SDT

Now in an NVMe/TCP environment, no software is installed on the ESXi hosts. The SDT, which represents the NVMe target, is installed on the nodes with storage, i.e. those with SDS. The initiator on the VMware side talks to the SDT, which then communicates with the SDS. So when running NVMe/TCP in your VMware environment with PowerFlex, no additional components are necessary on ESXi. The one drawback with NVMe/TCP in this architecture is that since ESXi talks to the SDT, which then talks to the SDS, there is an extra hop that the SDC (SCSI) doesn’t have to take since it talks directly to the SDS. Therefore NVMe/TCP performs just a tad slower than SDC. This will be resolved in a future release.

On ESXi, the initiator is represented as a software-based storage adapter. This TCP adapter is tied to a NIC (vmnic) on the ESXi host that has the driver to support NVMeoF (which most do these days). I’ll show you how to add it below.

VMware and NVMe/TCP

With the basics of PowerFlex behind us, I want to discuss a bit more about how NVMe/TCP works on VMware (this being a VMware blog). The most important prerequisite is that NVMe/TCP is only available in vSphere 7.0U3 and higher. If you aren’t at 7.0U3 (or the many patch levels beyond it), you will be unable to use this protocol. Beyond the vSphere level, however, NVMe/TCP has, well, lots of restrictions and caveats. I am simply going to list them below.

The following sections cover the important VMware/Dell/General restrictions around NVMe/TCP. Note that this is not an exhaustive list; however, these are the ones most likely to be of interest to the user. These restrictions will be lifted over time (releases).

Restrictions:

VMware

  • RDMs are not supported on NVMeoF. Note that there is no intention of this changing. For those using solutions like Microsoft Clustering that have been reliant on RDMs, expect that VMware will support clustered VMFS with NVMeoF in the future.
  • No Site Recovery Manager (SRM) support on NVMeoF.
  • No SCSI-2 reservations.
  • Only 4 paths per namespace (device) and only 32 namespaces (devices) per host – so 128 total paths. This is perhaps the most limiting of the restrictions. It has been increased significantly in vSphere 8, but we’re not there yet.
  • VMware only supports ALUA with NVMeoF.
  • No NMP support as all NVMeoF devices use the VMware HPP plugin, which is essentially equivalent to NMP. For PowerFlex, devices will be claimed automatically by the HPP rule (a quick way to verify this is shown right after this list).
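
As promised, here’s a quick way to verify the HPP claiming on an ESXi host. This is just a sketch with standard esxcli calls, not a PowerFlex-specific procedure:

# Devices claimed by HPP and the path selection scheme in use
esxcli storage hpp device list

# The claim rules that decide which plugin owns which devices
esxcli storage core claimrule list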

Dell

  • You can only present a volume with one protocol at one time. A volume has no attributes that make it SCSI or NVMe/TCP so both are possible, it’s just that you can only use one at a time.
  • Both PowerPath/VE and VMware’s HPP plugin support NVMe/TCP with VMware (the NMP plugin is not supported). We strongly recommend PP/VE.
  • No vVol support for NVMeoF on the PowerFlex currently (VMware only supports FC-NVMe in vSphere 8 anyway).

General

  • There are no adjustable queues with NVMeoF because, well, the technology works differently. In other words, don’t worry about it.
  • VAAI Primitives have limited support.

Adding NVMe/TCP adapter to VMware

Let’s walk through deploying NVMe/TCP in your VMware environment. In step 1 navigate to Host -> Configure -> Storage Adapters -> ADD SOFTWARE ADAPTER.

In step 2 select Add NVMe over TCP adapter.

In step 3 select the correct NIC. This NIC should have a VMkernel port configured that has the NVMe over TCP service selected, as shown below in the second image. Failure to add the service will leave you scratching your head when you can’t add any controllers even though they are perfectly pingable from the host. You’ll get the error: “The object or item referred to could not be found.” This example shows a standard switch, but most customers use distributed switches.
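
If you prefer the CLI (or want to script the same thing across hosts), both pieces – the VMkernel service and the software adapter – can be handled with esxcli. Treat this as a sketch: the vmk1/vmnic1 names are just examples, and you should double-check the syntax against your ESXi build:

# Tag the VMkernel port for the NVMe over TCP service
esxcli network ip interface tag add -i vmk1 -t NVMeTCP

# Create the software NVMe/TCP adapter on the chosen uplink
esxcli nvme fabrics enable --protocol TCP --device vmnic1

# Confirm the new vmhba shows up
esxcli nvme adapter list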

Adding NVMe hosts to PowerFlex

Now on the PowerFlex side, we need to add each of these ESXi hosts as NVMe hosts in the GUI. In a PowerMax world this would be creating the host group. PowerFlex will automatically recognize the NVMe targets (SDT) from the installation, so we don’t have to add those:

But we do need to tell PowerFlex about our NVMe hosts, or initiators. To do that we first need the NQN of the ESXi host. That can be found by highlighting the TCP adapter we just added from above and accessing the Add Controller screen in step 1. The Host NQN is listed there in step 2:
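
If you’d rather grab the host NQN from the command line (handy when you’re adding several hosts), esxcli reports it directly:

# Prints the Host NQN for this ESXi host
esxcli nvme info get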

Using that information, we can create an NVMe host in PowerFlex. Access the Hosts screen through the Block menu. Then in step 1 select +Add Host and add a new host in step 2 with the host NQN you just collected. The Host Name is completely arbitrary.

The final step then is to add the SDT, i.e., the NVMe target, in VMware. Normally, this can be done with the same controller discovery process above where we got the host NQN, but I was unable to get that to work in PowerFlex for the initial IP address. I’m unclear if that is intended or my environment is not quite there. In any case, VMware permits adding the controller automatically, manually (in the same screen below), or via CLI. Therefore I added the target through the CLI (esxcli), which is perfectly fine. Once I did that it allowed me to use the GUI to add the other paths (which for me is irrelevant since I have one port with multiple VLANs). Strange I know. Here is the esxcli for adding the target:

esxcli nvme fabrics connect -a vmhba68 -i 192.168.150.31 -s nqn.1988-11.com.dell:powerflex:00:09ea5e153fa91c0f

The one thing I am not showing here is how to get the NQN (nqn.1988-11.com.dell:powerflex:00:09ea5e153fa91c0f) of the PowerFlex system. Unfortunately, that must be obtained using the PowerFlex CLI (scli --query_properties --object_type SYSTEM --properties NQN --flat_list --all_objects). Once you add the first target manually, CLI or in the interface, auto discovery will find all the other paths which you can then add:
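
Once the paths are added, a couple of quick checks on the ESXi side confirm the host is actually talking to the PowerFlex targets – again standard esxcli, nothing vendor-specific:

# Connected NVMe controllers – one per SDT path
esxcli nvme controller list

# Namespaces (i.e. PowerFlex volumes) visible over those controllers – empty until you map something
esxcli nvme namespace list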

Now you’re ready to provision storage to the vSphere cluster. Because I cover this in the demo on migrations, let’s move right along.

Migrations

So if NVMe/TCP is the future (certainly up for debate), how do we move our environment from SCSI to NVMeoF? The first thing to remember is that a device can only be presented by one protocol at a time. This holds true for SCSI and NVMeoF as it does for FC and iSCSI. One customer at a time. Logically, therefore, we have two options to move our VMs: online or offline. That’s just logically speaking, so let’s talk reality. As of today, VMware only supports online migration. Online migration is accomplished via the universal mover technology, Storage vMotion. Dell is working with VMware directly on the offline procedure. I’ve done quite a bit of testing for both PowerMax and PowerFlex and I’m confident VMware will support a process soon.

Why not just use SvMotion for everything? Well, some customer environments are huge. We’ve done some basic calculations with customer data and, given the concurrent SvMotion limitations, they would be looking at weeks to complete the migration. True, some customers might be OK with the slow approach, but most want a migration that could be accomplished in a weekend.

As not all VMs are the same, the vast majority of customers can categorize them into 24/7 versus those with some downtime. The offline process, as currently tested, is relatively quick, and there is no concurrency limit to the number of datastores you could convert in one go. Therefore we envision a dual migration of online/offline. But let’s not get ahead of ourselves since VMware is the final word on the offline supportability.

The other thing to remember about migrations is the SvMotion restrictions in general. You can’t move a VM with shared disks or SCSI bus sharing, and obviously no RDMs. And as I mentioned, there is no clustered VMDK support (an attribute of VMFS) with NVMeoF yet, so that won’t work either. Also be cognizant of VMs that span more than one datastore in case you want to keep the vmdks on different NVMe/TCP datastores.

Snapshots

If you are really nervous about moving, feel free to take a snapshot of the SCSI device right before you start.
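
For reference, here is roughly what that might look like with the PowerFlex CLI – a sketch only, the volume and snapshot names are made up, and the GUI works just as well:

# Take a point-in-time snapshot of the source volume before the migration
scli --snapshot_volume --volume_name scsi_ds_vol --snapshot_name scsi_ds_vol_premigration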

Demo

I put together a demo of the process to move a VM from SCSI to NVMe/TCP. It’s about 12 minutes because I narrate it and I get a bit wordy. I usually stick to callouts in demos instead of narration, but it would have looked like the chalkboards from Interstellar so I went with voice. You can always do what my boss does, set the playback to 2x 🙂

Wrap

As we just released PowerFlex 4.0, I don’t suspect we have many customers moving to NVMe/TCP yet. It is a nascent technology that has some way to go, so adoption will be slow-going for some time. But it’s always good to be aware of the direction of the technology so you can be prepared.
