DTA 537000 – VMFS corruption with XCOPY

*** Update 2-27-2020 ***

The DTA for this issue has now been updated and includes the relevant patch number for the array. Please see article 000537000 below for details.

******

******* 11/6/19 update
I’ve been asked to provide any update I can on this DTA since it has been over a month, so here it is. Both VMware and Dell EMC are actively testing and attempting to reproduce this issue. Customers who experienced the problem are running special code to help us diagnose it if it recurs. In addition, our labs at both companies have been testing non-stop. Despite all that coverage, we have not seen the issue again at a customer or internally. A reproduction is going to be essential before we can determine the underlying cause and potentially produce a fix, so we are continuing the work. I wish I had a more concrete update, but rest assured we are doing all we can.
*******

I’ve had lots of questions about this particular Dell EMC Technical Advisory (DTA), so although it is not resolved, I wanted to explain what we currently know. Let me start by saying that only a small number of our customers have experienced this issue: fewer than 1% of those who match the environment. However, the nature of the bug and its ability to cause data loss have naturally left customers concerned.

Here is the issue as Dell EMC documents it:

537000 : DTA 537000: VMAX AFA, PowerMax: After Storage vMotion, VM Clone or any activity which invokes XCOPY in VMware ESXi 6.x, the target datastore VMFS metadata may be completely overwritten https://support.emc.com/kb/537000

And VMware’s KB on this is:

EMC VMAX and PowerMax arrays experience VMFS corruption when using VAAI XCOPY on vSphere 6.x (74595) https://kb.vmware.com/s/article/74595

The basic issue is that during a VMware task that invokes XCOPY, the target datastore header is overwritten with the header from the source datastore. That results in the corruption of the target datastore and loss of VMs. Dell EMC and VMware engineering have been working for quite some time to determine the cause. Each company is working under the assumption that it is their bug so they can work independently to resolve it, because one of two things must be true: either VMware is sending us the wrong LBAs to copy, or VMware is sending us the correct LBAs to copy but we are copying the wrong ones anyway.

So far we’ve only seen the issue with vSphere 6.x and VMAX All Flash and PowerMax arrays with PowerMaxOS 5978.144.144 or 5978.221.221, running a task that invokes XCOPY (e.g. clone, SvMotion (manual or SDRS)). And again, only in a small number of customers. I say so far because we can’t rule out what we haven’t seen, though experience and logic would suggest that earlier code releases, which have been out much longer with no reports of this issue, are unlikely to have the problem. I want to also note that if you use VPLEX in front of these arrays you are not subject to the issue, because VPLEX has its own XCOPY implementation and does not issue the SCSI commands to the VMAX/PowerMax arrays.
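The matching conditions above can be written down as a small check. This is only a sketch: the function name Test-Dta537000Match and its parameters are hypothetical (not part of PowerCLI or any Dell EMC tool), the version strings are supplied by you rather than queried from the host or array, and a result of False only means the environment does not match what has been seen so far.

```powershell
# Hypothetical helper: the name and parameters are illustrative only.
function Test-Dta537000Match {
    param(
        [Parameter(Mandatory)] [string] $EsxiVersion,      # e.g. "6.7.0"
        [Parameter(Mandatory)] [string] $PowerMaxOsLevel,  # e.g. "5978.221.221"
        [switch] $BehindVplex                               # set if VPLEX fronts the array
    )

    # Code levels on which the issue has been seen so far, per the DTA.
    $affectedLevels = @('5978.144.144', '5978.221.221')

    # VPLEX has its own XCOPY implementation, so those hosts are not exposed.
    if ($BehindVplex) { return $false }

    $isVsphere6     = $EsxiVersion -like '6.*'
    $isAffectedCode = $affectedLevels -contains $PowerMaxOsLevel
    return ($isVsphere6 -and $isAffectedCode)
}

Test-Dta537000Match -EsxiVersion '6.7.0' -PowerMaxOsLevel '5978.221.221'   # True
```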

The DTA provides instructions for users who hit the problem, which involve implementing some special debug code, but to avoid the issue the current remedy is to disable XCOPY on the ESXi hosts. This is done easily, and online, through the GUI or CLI. You can find instructions in my WP here. You can also use PowerCLI to disable XCOPY; here’s an example script which would turn off XCOPY for all hosts in the Datacenter.

# Change the $DatacenterName variable to the vCenter Datacenter you wish to set.
# Replace the server name and credentials with your own.
Connect-VIServer -Server "server.lss.emc.com" -User "administrator" -Password "password"
$DatacenterName = "Datacenter"
$AdvancedSettings = "DataMover.HardwareAcceleratedMove"

# Set the VAAI XCOPY setting on every host in the Datacenter.
# Be sure to validate the Set-AdvancedSetting -Value (0 or 1) before running.
Write-Host "Setting $AdvancedSettings on hosts in $DatacenterName. Value of 0 is Disabled; Value of 1 is Enabled."
Get-Datacenter -Name $DatacenterName | Get-VMHost | Get-AdvancedSetting -Name $AdvancedSettings | Set-AdvancedSetting -Value 0 -Confirm:$false | Select-Object @{N="Host";E={$_.Entity.Name}},Name,Value | Format-Table -AutoSize
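If you would rather look before you change anything, or confirm the result afterwards, a read-only variant of the same pipeline should do it. This is a sketch that assumes you are already connected with Connect-VIServer; "Datacenter" is a placeholder name, and nothing is modified.

```powershell
# Assumes an existing Connect-VIServer session; "Datacenter" is a placeholder name.
$DatacenterName = "Datacenter"
$SettingName    = "DataMover.HardwareAcceleratedMove"

# Read-only: list each host with its current value (0 = XCOPY disabled, 1 = enabled).
Get-Datacenter -Name $DatacenterName |
    Get-VMHost |
    Get-AdvancedSetting -Name $SettingName |
    Select-Object @{N="Host";E={$_.Entity.Name}}, Name, Value |
    Sort-Object Host |
    Format-Table -AutoSize
```

Any host still showing a Value of 1 has XCOPY enabled.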

How does disabling XCOPY impact the environment? Well, it will require a few more host resources because the host now moves the data itself rather than offloading the copy to the array, though given the size of hardware resources these days this is of little concern. The more noticeable change is the time it takes to complete an associated VMware task. So, for instance, a clone is likely to take longer than before. There is no difference in functionality, however.

Now would I proactively disable XCOPY if I match these current code levels? Yes, I would. I think the chances of hitting the issue are slim, but I also think the impact of disabling XCOPY is minimal while not doing it could mean corruption.

I will be sure to update the post as soon as we have a root cause, which I suspect will be followed by a patch either from us or VMware.
