Linux KVM DR failover with PowerMax

The final feature in my work with KVM I wanted to discuss is disaster recovery. DR is essential for any enterprise customer, and if you are considering moving off VMware to a more open-source solution like KVM, you’ll need to understand how it works and how it differs from Site Recovery Manager (SRM). I’m going to demonstrate a failover, leaving the failback aside for now. I intend to cover DR more fully in a future paper using Oracle Linux Virtualization Manager, which is essentially another flavor of oVirt.

oVirt supports two types of DR, active-active and active-passive. As I am working with PowerMax, active-active would require SRDF/Metro, and active-passive would be any of the other modes: asynchronous, synchronous, or adaptive copy. In active-active configurations, oVirt can simply move VMs from one site to the other since the storage is already read/write on both sides. In active-passive configurations, the storage domain must be imported into the DR site and then the VMs imported/registered. I’m going to cover active-passive here using SRDF/S.

Prerequisites for active-passive environments

The following items must be in place to run a failover. Note they are basically the same as the prerequisites I covered for importing a storage domain, save for the last one.

      • An active oVirt Engine.
      • A data center and clusters.
      • Networks with the same general connectivity as the primary site.
      • Active hosts capable of running critical virtual machines after failover.
      • A separate host running the Red Hat Ansible Engine (preferably with HA).

Like VMware SRM, DR in oVirt requires manual execution. The system cannot detect a site failure and initiate failover on its own; the user decides when to execute it. SRM uses recovery plans; oVirt uses Ansible playbooks. And while SRM is a productized solution, with all the orchestration pre-coded and a detailed user interface for entering relationships between source and target (e.g., network mappings, folders, etc.), oVirt requires the user to create both the mapping files and any customized playbooks. On the storage side, SRM offers Storage Replication Adapters like the Dell SRDF SRA, which handles the array failover orchestration. oVirt has no such component, so the storage actions must either be executed manually or through a script you create. That script could then become part of the Ansible playbook if desired to complete the solution. This is a high-level look at the architecture.
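
On the storage-script point, here is a minimal sketch of what such a wrapper could look like in an SRDF environment. The script name, SID, RDF group, and device-pairs file are placeholders of my own, not part of oVirt, and the symrdf syntax should be verified against your Solutions Enabler version:

#!/bin/bash
# srdf_failover.sh (hypothetical) - make the R2 devices read/write at the
# recovery site before the ovirt-dr failover playbook runs.
SID=<recovery_array_sid>              # placeholder array serial
RDFG=<rdf_group_number>               # placeholder RDF group
PAIRS=/root/dr_device_pairs.txt       # placeholder R1:R2 device-pairs file

# 'failover' write-enables the R2 side; 'split' could be used instead for a
# DR test that leaves production running.
symrdf -sid "$SID" -rdfg "$RDFG" -f "$PAIRS" failover -noprompt

A task in the playbook could then call a script like this (for example with Ansible's script or command module) ahead of the storage domain import.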

Ansible Engine

The Ansible Engine should be installed on Red Hat Enterprise Linux 7 or 8; in my setup I used 8. The installation itself is a matter of adding repositories and then installing Ansible. Which repositories you enable depends on whether you have a subscription to the Ansible platform or just limited support. After adding the appropriate repositories, install Ansible:

yum install ansible

This Red Hat article can be helpful.
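
For example, on RHEL 8 the repository can be enabled with subscription-manager before running the yum install above. The repository ID below assumes an Ansible Engine 2.9 entitlement and may differ for your subscription level:

subscription-manager repos --enable ansible-2.9-for-rhel-8-x86_64-rpms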

Scripts

You will find the DR scripts installed here:

/usr/share/ansible/collections/ansible_collections/ovirt/ovirt/roles/disaster_recovery

From here, change directory into the files folder. A single script, ovirt-dr, can handle all the Ansible tasks:

  • Generation of the mapping file
  • Validation of the mapping file
  • Failover
  • Failback

Usage:

# ./ovirt-dr generate/validate/failover/failback
[--conf-file=dr.conf]
[--log-file=ovirt-dr-log_number.log]
[--log-level=DEBUG/INFO/WARNING/ERROR]

The script uses the parameters in the dr.conf file found in this same files directory. The file used in this configuration is below; each ovirt-dr action has its own section. To run the failover, you first need to generate the mapping file, which is done via the generate action.

[root@dsib2011 files]# more dr.conf
[log]
log_file=/tmp/ovirt-dr-{}.log
log_level=DEBUG

[generate_vars]
site=https://dsib2010.drm.lab.emc.com/ovirt-engine/api
username=admin@ovirt@internalsso
password=
ca_file=/etc/pki/ovirt-engine/ca2010.crt
output_file=../examples/disaster_recovery_vars.yml
ansible_play=../examples/dr_play.yml

[validate_vars]
var_file=../examples/disaster_recovery_vars.yml

[failover_failback]
dr_target_host=secondary
dr_source_map=primary
vault=../examples/ovirt_passwords.yml
var_file=../examples/disaster_recovery_vars.yml
ansible_play=../examples/dr_play.yml
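
The [failover_failback] section also points to an Ansible Vault file holding the engine admin passwords. If you do not already have one, it can be created with the standard ansible-vault command; the variable names it must contain (for example dr_sites_primary_password and dr_sites_secondary_password) should be confirmed against the oVirt disaster_recovery role documentation for your version:

ansible-vault create ../examples/ovirt_passwords.yml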

The [generate_vars] section should contain references to the production environment. The script pulls the information from there and generates the output file, disaster_recovery_vars.yml. Note that the ca_file references a local cert, not a cert on the production environment. Therefore you will need to copy it from the production host, or pull it via a curl command.
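
One way to pull the certificate, assuming the standard oVirt PKI resource endpoint on the production engine (the FQDN and local file name are simply reused from the dr.conf above):

curl -k 'https://dsib2010.drm.lab.emc.com/ovirt-engine/services/pki-resource?resource=ca-certificate&format=X509-PEM-CA' -o /etc/pki/ovirt-engine/ca2010.crt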

Generate

Execute the ovirt-dr script and pass the generate command to create the mapping file. Since the script only queries the production environment, it will duplicate the entries for both the primary (production) and secondary (recovery) locations; it is up to the user to manually change the mapping file.
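
For example, from the files directory (the validate action can be re-run after you have edited the generated file):

# ./ovirt-dr generate --conf-file=dr.conf
# ./ovirt-dr validate --conf-file=dr.conf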

To be sure the mapping is done correctly for most variables, you can change dr.conf to point to the recovery site and re-run the ovirt-dr script, then modify the generated file with the appropriate recovery-site information.

The oVirt documentation provides an appendix that details the different sections in the mapping file. One important difference you can see between the generated file and my updated one is that I use quotes around the string values; the appendix examples do not include them either. If you do not use quotes for the string values, the VMs will fail to import on the recovery side, even if the storage domain import succeeds. You will get string-value errors in the log file like the following, which is not very informative:

TypeError: The 'value' parameter must be a string
fatal: [localhost]: FAILED! => {

Adding the quotes will ensure the VMs are imported.

Storage domains

The generate script will pull the storage domains from the primary site, except the one used for the oVirt Engine, since each site runs its own engine (think of this in the same way you need two vCenters with SRM for DR). Note that if you are just testing failover and leaving production up, you should remove the NFS storage domains from the mapping file unless you plan on replicating them and creating a separate export and/or IP. In my example I stuck with a single FC storage domain replicated with SRDF/S. Here is what the generate script created, though note I had to change dr_secondary_dc_name from Production to Recovery:

dr_import_storages:
- dr_domain_type: "fcp"
  dr_wipe_after_delete: False
  dr_backup: False
  dr_critical_space_action_blocker: 5
  dr_storage_domain_type: "data"
  dr_warning_low_space: 10
  dr_primary_name: "FC-TEST"
  dr_primary_master_domain: False
  dr_primary_dc_name: "Production"
  dr_discard_after_delete: False
  dr_domain_id: "4157d061-1fe0-419f-918b-a59fdb09d3bb"
  # Fill in the empty properties related to the secondary site
  dr_secondary_name: "FC-TEST"
  dr_secondary_master_domain: False
  dr_secondary_dc_name: "Recovery"

Unfortunately, remember there is no intelligence here with Ansible; it can only do what you tell it. To that end, simply listing this storage domain is not enough for failover to succeed. Next we need to tell oVirt how the individual LUN underlying the storage domain maps between the sites, i.e., which device is the R1 and which is the R2.

LUN mapping

The LUN section of the mapping file is completely empty. The appendix has an example of how to construct it for iSCSI, and there are other examples you can Google. In my environment since I am using FC, there are only two required variables: the unit ID and the storage type. The unit ID is the WWN, prefixed by the Linux type ‘3’. Below I list out my SRDF pair and then the WWNs queried separately.
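
If you prefer to confirm the full unit ID from the Linux host rather than from the array, the multipath WWID shows the same '3'-prefixed value. The device name and WWN fragment below are only illustrative:

# multipath -ll | grep 36000097
# /usr/lib/udev/scsi_id --whitelisted --device=/dev/sdb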

With this information we can create the LUN mapping section in the file:

# Mapping for external LUN disks
dr_lun_mappings:
- primary_logical_unit_id: "360000970000120001473533030313532"
  primary_storage_type: "fcp"
  secondary_storage_type: "fcp"
  secondary_logical_unit_id: "360000970000120001672533030313236"

With this final piece in place, we can run the failover test.

Failover test

As I haven’t done a video in a while, I decided this would be a good section to use one in. Here is what the video demonstrates:

  • A single FC storage domain replicated by SRDF/S, the R1 presented to production, the R2 to recovery. The domain has two VMs.
  • The SRDF pair is split, so that the R2 becomes R/W and importable as a storage domain on the recovery site.
  • The Ansible ovirt-dr script is executed for failover which then:
    • Imports the FC-TEST storage domain into recovery
    • Registers the two VMs and powers them on

As this is a test failover, I do not run the failback, nor do I include the cleanup of the recovery site.
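
For reference, the test boils down to two commands: a symrdf split of the replicated devices (along the lines of the wrapper sketched earlier) followed by the ovirt-dr failover action. The SID, RDF group, and device file are placeholders, and the symrdf syntax should be checked against your Solutions Enabler version:

# symrdf -sid <recovery_array_sid> -rdfg <rdf_group_number> -f /root/dr_device_pairs.txt split -noprompt
# ./ovirt-dr failover --conf-file=dr.conf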
