SRM reprotect fails with SRDF/A

Issue

We’ve had a few customers (fewer than five so far) run into an issue with VMware SRM reprotect operations that warranted a post, because I suspect more may hit it. The issue is that reprotect fails because invalid tracks accumulate on the R1 during failover, leaving the pair inconsistent, so the reprotect cannot succeed. Invalid tracks are not an unknown occurrence in normal SRDF operations, though usually they are associated with something like a network failure. In these SRA cases, the customer is doing a normal failover/reprotect for testing purposes (not an SRM test failover, mind you, but a real failover done for test reasons), so the expectation is that it should succeed.

The first customer to face this was back in November 2022, running SRDF/Asynchronous with SRDF SRA 10.0.0. Asynchronous replication mode is the common factor across all affected customers. This particular customer had significant lag across the network, which appeared to contribute to the invalids. That is perfectly acceptable, of course, and SRDF was working entirely normally; it was the SRDF SRA that was not set up to handle the scenario. These corner cases do come up, and testing simply can’t catch absolutely everything. I say corner case because, again, fewer than five customers have hit it in two years. Most customers with SRDF have robust, low-latency networks, even when using SRDF/A, because they want their DR copy to be as close to production as possible. That probably accounts for the limited exposure here.

Since that first customer we have also seen the issue in SRDF SRA 10.1.0, because by the time the hot fix was produced for 10.0.0, it was too late to get into the 10.1.0 code. It will be included in the next release, though.

Diagnosis

Now, there are plenty of reasons why a reprotect can fail, so how do you check whether you have this particular problem? Easy enough: the SRA log file. Assuming you grab the log right after the failure, open it and search upward from the bottom for “RESUME”, which is part of the reprotect operation. If you have the issue, you will see a block similar to the one below, reporting that there are conflicting invalid tracks and listing the counts per device pair:

[20221123150941 406 6552 DoSraRdfAction@SraReplicationGroup.cpp] Performing RDF [RESUME -FORCE] on DG [Test]
[20221123150941 406 6555 DoSraRdfAction@SraReplicationGroup.cpp] Performing SymDgRdfControl() for the RDF action [RESUME -FORCE]
[20221123150941 406 6628 DoSraRdfAction@SraReplicationGroup.cpp] Performing SymDgRdfControl() RESUME -FORCE for the RDF group [Test]
[20221123150944 406 6770 DoSraRdfAction@SraReplicationGroup.cpp] [ERROR]: Failed to perform RDF operation [RESUME -FORCE] on DG [Test], Symm [000120200599]. 
[ERROR]: [SYMAPI_C_RDF_CONF_INV_TRACKS : Cannot proceed because conflicting invalid tracks were found in the device group]
[20221123150944 406 344 setMarker ] _______Backtrace begin_______
Device pair: (0014E, 00146)
RDF pair state: Suspended
RDF mode: Asynchronous
RDF type: R1
RDF consistency: Disabled
Local device state: Ready
Local R1 invalid tracks: 0
Local R2 invalid tracks: 115
Remote R1 invalid tracks: 2
Remote R2 invalid tracks: 0
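If you want to check a log quickly rather than eyeball it, the pattern above is easy to scan for programmatically. The following is a minimal sketch, not anything shipped with the SRA: the embedded excerpt mirrors the log sample above, and the helper simply looks for the SYMAPI error string and pulls out the per-pair invalid-track counts.

```python
# Hypothetical helper: scan an SRDF SRA log excerpt for the conflicting
# invalid-tracks failure described above. The field names match the log
# sample; the parsing logic itself is illustrative, not part of the SRA.
import re

SAMPLE = """\
[ERROR]: Failed to perform RDF operation [RESUME -FORCE] on DG [Test], Symm [000120200599].
[ERROR]: [SYMAPI_C_RDF_CONF_INV_TRACKS : Cannot proceed because conflicting invalid tracks were found in the device group]
Device pair: (0014E, 00146)
RDF pair state: Suspended
Local R1 invalid tracks: 0
Local R2 invalid tracks: 115
Remote R1 invalid tracks: 2
Remote R2 invalid tracks: 0
"""

def find_invalid_tracks(log_text):
    """Return (hit, counts): hit is True if the SYMAPI invalid-tracks error
    appears, and counts maps each 'invalid tracks' field to its value."""
    hit = "SYMAPI_C_RDF_CONF_INV_TRACKS" in log_text
    counts = {m.group(1): int(m.group(2))
              for m in re.finditer(
                  r"((?:Local|Remote) R[12]) invalid tracks: (\d+)", log_text)}
    return hit, counts

hit, counts = find_invalid_tracks(SAMPLE)
print(hit, counts)
# → True {'Local R1': 0, 'Local R2': 115, 'Remote R1': 2, 'Remote R2': 0}
```

Point it at the full SRA log instead of the sample and any nonzero counts alongside the SYMAPI error confirm you are looking at this issue.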

Resolution

If you do diagnose the issue, you can go about obtaining the hot fix. The fixed code now checks for invalids before running the resume and rectifies them before the reprotect is issued. All of this happens after the user runs the reprotect in SRM, so they are none the wiser. I wrote a KB article which you need to reference when you open a support ticket; the fix is not posted to our normal support site, so the KB tells the analyst how to get it to you. For either 10.0.0 or 10.1.0 it is a normal upgrade process.
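To make the behavior change concrete, here is an illustrative sketch of the control flow the hot fix adds, as described above. Every name here is a hypothetical stand-in; the real SRA is C++ and drives Solutions Enabler, not these stubs.

```python
# Illustrative sketch only: the pre-check the hot fix adds before RESUME.
# All classes and functions are hypothetical stand-ins for SRA internals.

class Group:
    def __init__(self, invalid_tracks):
        self.invalid_tracks = invalid_tracks
        self.log = []

def count_invalid_tracks(g):
    return g.invalid_tracks

def rectify_invalid_tracks(g):
    # Stand-in for clearing the conflicting invalids before resuming.
    g.log.append("rectify")
    g.invalid_tracks = 0

def resume(g):
    # The old code went straight here; with invalids present, the
    # RESUME -FORCE failed with SYMAPI_C_RDF_CONF_INV_TRACKS.
    if g.invalid_tracks:
        raise RuntimeError("SYMAPI_C_RDF_CONF_INV_TRACKS")
    g.log.append("resume")

def reprotect(g):
    if count_invalid_tracks(g) > 0:   # new pre-check added by the hot fix
        rectify_invalid_tracks(g)
    reprotect_ok = resume(g)          # then resume proceeds cleanly

g = Group(invalid_tracks=115)
reprotect(g)
print(g.log)
# → ['rectify', 'resume']
```

With the pre-check in place, the same group that previously raised the SYMAPI error now resumes cleanly, which matches what the user sees: the reprotect in SRM just succeeds.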
