Native Multipathing Plugin (NMP) with VMAX3

* Note: a special caveat to this post about IOPS with RR is that it does NOT apply to SRDF/Metro environments running uniform (cross-connect) configurations (supported via RPQ).  If your environment has been approved for cross-connect, the IOPS setting should remain at the default of 1000.  Changing it could result in significant response time delays, depending on the distance between arrays.


In addition to testing all our exciting new features that integrate VMAX and VMware, I frequently return to existing functionality to validate performance with new code releases of the platform.  Most of my time in this area is taken up with VAAI, but recently I decided to embark on NMP testing with the VMAX3.  As I am sure most of you are aware, VMware provides a generic Multipathing Plugin (MPP) called the Native Multipathing Plugin (NMP) which, among many other important functions, processes I/O requests to logical devices (LUNs) by selecting an optimal physical path for each request (load balancing).  The path selection is based on the policy set for that particular logical device.

Although EMC recommends PowerPath/VE as the preferred multipathing software, many customers use VMware’s NMP functionality.  When using NMP, EMC recommends the Round Robin path selection policy (PSP), and in fact that policy has been the default since vSphere 5.1.  Round Robin does exactly what it says: it switches between the existing paths to the LUN at a regular interval.  By default, this interval is based on the number of IOs sent down a path (default 1000), though other criteria, such as the number of bytes, can be used instead.  We (and VMware too) found out a long time ago that adjusting this IO interval (iops) to 1 greatly improved performance on the VMAX2 platform.  We have been recommending it ever since in our best practices (VMAX BP on VMware) TechBook.  When the VMAX3 was released, the change in architecture, specifically the sharing of CPUs across FAs, led to a different iops recommendation for those LUNs: 4.  This value delivered more balanced performance across all types of workloads.
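To make the iops setting concrete, here is a minimal Python sketch of the switching behavior just described.  It is a simplified model, not VMware's actual NMP implementation: the function names, path names and burst size are all hypothetical, and it ignores queueing, path state and the bytes-based limit.

```python
from collections import Counter
from itertools import cycle


def round_robin_paths(paths, iops_limit, total_ios):
    """Yield the path used for each I/O, moving to the next path
    once iops_limit I/Os have been sent down the current one."""
    path_cycle = cycle(paths)
    current = next(path_cycle)
    sent_on_current = 0
    for _ in range(total_ios):
        if sent_on_current == iops_limit:
            current = next(path_cycle)
            sent_on_current = 0
        sent_on_current += 1
        yield current


def ios_per_path(paths, iops_limit, total_ios):
    """Count how many I/Os landed on each path (paths with none are omitted)."""
    return dict(Counter(round_robin_paths(paths, iops_limit, total_ios)))


# Hypothetical path names for a device with four paths.
PATHS = ["path0", "path1", "path2", "path3"]

if __name__ == "__main__":
    # A short burst of 8 I/Os under three different iops settings.
    for limit in (1, 4, 1000):
        print(limit, ios_per_path(PATHS, limit, 8))
```

In this toy model a burst of 8 I/Os is spread over all four paths with an iops of 1, over two paths with an iops of 4, and stays on a single path with the default of 1000.  That is only an illustration of the mechanism; it says nothing by itself about response time on a real array.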

After the release of our 5977 2015 Q3 SR, which had some performance changes, and in consultation with our gurus, I thought it was worthwhile to re-test the iops setting with a variety of values.  The end result of these tests is that we are now going to recommend that, for both VMAX2 and VMAX3 arrays, the iops be set to 1.  Having done all the testing though, I might as well share some of it 🙂

Just a quick rundown of the process.  I used Iometer to generate the load against the VMAX3 array, utilizing 16 VMs (Windows and Linux) accessing the same device.  I examined the IO throughput at different points of the solution stack, namely the virtual machines, the ESXi hosts and the VMAX3, while tuning the Round Robin I/O operation limit parameter to different values between 1 and 1000.  The following I/O parameters were changed in Iometer to generate a variety of workloads:

  • Block Size – Small Block (4K), Large Block (32K or 64K)
  • I/O type – Read, Write and Distribution of Reads/Writes in any workload
  • Workload Type – Random, Sequential, Custom (OLTP)
  • Burstiness – I/O burst injection into workloads

The resulting combinations were:

[Image: table of workload combinations]

Let me show you some of the results from the testing (and a quick disclaimer: the lab, setup, etc. are all my own, so this isn’t some benchmark test).  Before I do, when you look at the graphs be sure to examine the values on the vertical axis closely when comparing the iops values.  You’ll see that for the most part we aren’t talking about massive differences between the settings, in particular between an iops of 1 and 4; but across the various workloads an iops of 1 generally has better results than 4, particularly for sequential and bursty workloads.

For example, the results for a 4k read sequential workload (first IOs then response time) are:

[Image: seq_read_IOs, 4k sequential read, IOs]

[Image: seq_read_time, 4k sequential read, response time]

For larger IOs, here are 64k write and read results for a bursty workload:

[Image: bursty, 64k write and read, bursty workload]

And to be fair, although 1 is the best overall, as I noted, there are use cases, such as a 4k random read workload, where an iops of 4 just edges out 1:

[Image: random_read_IOs, 4k random read, IOs]

A couple of final thoughts.  First, and once again, note that the differences between the various iops values, particularly 1, 4 and 1000, are not large.  Though we are recommending a change in iops to 1 in 5977 2015 Q3 SR (and beyond), it is not going to make or break your performance objectives if you are running the default or 4.  Second, I have not updated my TechBook for the Q3 SR, so the recommendation for VMAX3 still says 4 (which is accurate for the releases covered by the TB).  The TB also covers how to set the iops for individual devices or for the array as a whole.  As I’ve said before, one of the reasons I keep this blog is to get the information out early, even if I can’t get the documentation updated immediately.
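For the per-device case, the setting is normally changed with esxcli on the ESXi host.  The Python sketch below only builds and prints that command line so it can be reviewed before being run; the helper name and the device identifier are hypothetical placeholders, and the TechBook remains the reference for the exact procedure, including the array-wide method.  Remember the caveat at the top of this post: cross-connect SRDF/Metro environments should stay at the default of 1000.

```python
import shlex


def rr_iops_command(device, iops=1):
    """Build the esxcli argument list that sets the Round Robin
    I/O operation limit (iops) for a single device."""
    if iops < 1:
        raise ValueError("iops must be at least 1")
    return [
        "esxcli", "storage", "nmp", "psp", "roundrobin", "deviceconfig", "set",
        f"--device={device}",
        "--type=iops",
        f"--iops={iops}",
    ]


# Hypothetical device identifier; substitute the NAA ID of your own LUN.
EXAMPLE_DEVICE = "naa.60000970000000000000000000000001"

if __name__ == "__main__":
    # Print the command for review rather than executing it.
    print(shlex.join(rr_iops_command(EXAMPLE_DEVICE, iops=1)))
```

Returning an argument list rather than a string keeps the sketch easy to hand to something like subprocess.run on a host where esxcli is available, without worrying about shell quoting.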

