
Monday, May 4, 2015

Azure Site Recovery: Generation 2 VM support

Almost a year ago, Microsoft announced the preview of a cloud service that has turned out to be the leading star for out-of-the-box Hybrid Cloud scenarios from Microsoft.

Microsoft Azure Site Recovery lets customers extend their datacenter solutions to the cloud to ensure business continuity and availability on demand.
The solution itself is state of the art and covers many different scenarios – it can be seen as Microsoft’s “umbrella” for availability and recovery in the cloud, as it has several different offerings in different flavors under its wings.

Besides supporting DR protection of VMware and physical computers (newly announced), Azure Site Recovery is considered mandatory for organizations that need DR for their Hyper-V environments, regardless of whether the cloud or a secondary on-prem location is the actual DR target.

Just recently, Microsoft announced support for protecting Generation 2 virtual machines to Azure.
This is fantastic news and shows that the journey towards cloud consistency is well underway.

Let me add some context before we look into the details.

I’ve been working with the brilliant Azure Site Recovery Product Group at Microsoft for a long time now, and I have to admit that these guys are outstanding. Not only do they ship extremely high-quality code, but they also listen to feedback. And when I say listen, they actually engage with you and really try to understand your concern. At the end of the day, we are all on the same team, working towards the best experience and solution possible.

During TechEd in Barcelona, I was co-presenting “Microsoft Azure Site Recovery: Leveraging Azure as your Disaster Recovery Site” (http://channel9.msdn.com/Events/TechEd/Europe/2014/CDP-B314) together with Manoj, and this is when our real discussion started.
Using Azure as the secondary site for DR scenarios makes perfect sense, and many customers would like to benefit from this as soon as possible. However, we often saw that these customers had deployed their virtual machines as Generation 2 VMs – which weren’t supported on the Azure platform. This was a blocker, and the number of Gen2 VMs was increasing every day.

In January this year, I ran a community survey on the topic, and the result was very clear:

Yes – people would love to use Azure as their secondary site, if there was support for Generation 2 VMs in the cloud.

I am glad to say that the Product Group listened and now we can start to protect workloads on Gen2 VMs too.
But, how does this work?

When you enable a VM for protection, the data is sent to an endpoint in Azure – nothing special happens at this stage.

However, the ASR service performs a conversion to Generation 1 within the service at the time of failover.

What?

Let me explain further.

In case of a disaster where you need to perform a failover to Azure, the VM(s) are converted and started as Gen1, running in Azure.
The ASR backend services used during failover contain the conversion logic. At failover time, the backend service reads the Gen2 OS disk and converts it to a Gen1 OS disk (hence the requirements on the OS disk in Azure).
If you need/want/have to fail back to your on-prem Hyper-V environment, the VM will of course be converted back to Gen2.
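If you want a quick inventory of which of your on-prem VMs are actually Generation 2 before you enable protection, the Hyper-V PowerShell module can tell you. This is just a minimal sketch run against a single host; the host name is a placeholder:

# List VM generation per VM on a Hyper-V host ('hv01.domain.local' is a placeholder)
Get-VM -ComputerName 'hv01.domain.local' |
    Select-Object Name, Generation, State |
    Sort-Object Generation -Descending |
    Format-Table -AutoSize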

For more details – check out the official blog post by one of the PMs, Anoob Backer.


Monday, January 19, 2015

Business Continuity with SCVMM and Azure Site Recovery

Business Continuity for the management stamp

Back in November, I wrote a blog post about the DR integration in Windows Azure Pack, where service providers can provide managed DR for their tenants - http://kristiannese.blogspot.no/2014/11/windows-azure-pack-with-dr-add-on-asr.html

I’ve been working with many service providers over the last months, where both Azure Pack and Azure Site Recovery have been critical components.

However, given the relatively big footprint of the DR add-on in Update Rollup 4 for Windows Azure Pack, organizations have started at the other end in order to bring business continuity to their clouds.

For one of the larger service providers, we had to dive deep into the architecture of Hyper-V Replica, SCVMM and Azure Site Recovery before we knew how to design the optimal layout to ensure business continuity.

In each and every ASR design, you must look at your fabric and management stamp and start with the recovery design before you create the disaster design. Did I lose you there?

What I’m saying is that it’s relatively easy to perform the heavy lifting of the data, but once the shit hits the fan, you’d better know what to expect.

In this particular case, we had a common goal:

We want to ensure business continuity for the entire management stamp with a single click, so that tenants can create, manage and operate their workloads without interruption. This should be achieved in an efficient way with a minimal footprint.

When we first saw the release of Azure Site Recovery, it was called “Hyper-V Recovery Manager” and required two SCVMM management stamps to perform DR between sites. The feedback from potential customers was quite loud and clear: people wanted to leverage their existing SCVMM investment and perform DR operations with a single SCVMM management stamp. Microsoft listened, and now lets us perform DR between SCVMM clouds using the same SCVMM server.

Actually, it’s over a year since they made this available, and diving into my archive I managed to find the following blog post: http://kristiannese.blogspot.no/2013/12/how-to-setup-hyper-v-recovery-manager.html

So IMHO, using a single SCVMM stamp is preferred whenever possible, so that was also my recommendation for the initial design in this case.

In this blog post, I will share my findings and workaround for making this possible, ensuring business continuity for the entire management stamp.

The initial configuration

The first step when designing the management stamp was to plan and prepare for SQL AlwaysOn Availability Groups.
System Center 2012 R2 – Virtual Machine Manager, Service Manager, Operations Manager and Orchestrator all support AlwaysOn Availability Groups.

Why plan for SQL AlwaysOn Availability Groups when we have the traditional SQL Cluster solution available for High-Availability?

This is a really good question – and also very important, as this is the key to realizing the big goal here. AlwaysOn is a high-availability and disaster recovery solution that provides an enterprise-level alternative to database mirroring. The solution maximizes the availability of a set of user databases and supports a failover environment for those selected databases.
Compared to a traditional SQL cluster – which can also use shared VHDXs – this was a no-brainer. A shared VHDX would have given us a headache and increased the complexity with Hyper-V Replica.
SQL AlwaysOn Availability Groups let us use local storage for each VM within the cluster configuration and enable synchronous replication of the selected user databases between the nodes.
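To illustrate the idea, here is a minimal sketch of how an availability group for a System Center database could be created with the SQL Server PowerShell module. Instance names, endpoint URLs, the group name and the database name are placeholders, and the WSFC cluster, mirroring endpoints and a backup/restore of the database to the secondary are assumed to already be in place.

# Assumptions: SQLPS module available, WSFC cluster between SQL01/SQL02,
# mirroring endpoints on port 5022, database restored (NORECOVERY) on SQL02.
Import-Module SQLPS -DisableNameChecking

# Define the two replicas with synchronous commit and automatic failover
$primary   = New-SqlAvailabilityReplica -Name 'SQL01' -EndpointUrl 'TCP://SQL01.domain.local:5022' `
               -AvailabilityMode SynchronousCommit -FailoverMode Automatic -AsTemplate -Version 11
$secondary = New-SqlAvailabilityReplica -Name 'SQL02' -EndpointUrl 'TCP://SQL02.domain.local:5022' `
               -AvailabilityMode SynchronousCommit -FailoverMode Automatic -AsTemplate -Version 11

# Create the availability group on the primary with a selected user database
New-SqlAvailabilityGroup -Name 'SC-AG' -Path 'SQLSERVER:\SQL\SQL01\DEFAULT' `
    -AvailabilityReplica @($primary, $secondary) -Database 'VirtualManagerDB'

# Join the secondary instance and add the database to the group there
Join-SqlAvailabilityGroup -Path 'SQLSERVER:\SQL\SQL02\DEFAULT' -Name 'SC-AG'
Add-SqlAvailabilityDatabase -Path 'SQLSERVER:\SQL\SQL02\DEFAULT\AvailabilityGroups\SC-AG' -Database 'VirtualManagerDB'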

Alright, the SQL discussion is now over, and we proceeded to the fabric design.
In total, we would have several Hyper-V clusters for different kinds of workloads, such as:

·       Management
·       Edge
·       IaaS
·       DR


Since this was a greenfield project, we had to deploy everything from scratch.
We started with the Hyper-V management cluster, and from there we deployed two VM instances in a guest cluster configuration, installed with SQL Server for AlwaysOn Availability Groups. Our plan was to put the System Center databases – as well as the WAP databases – onto this database cluster.

Once we had deployed a highly available SCVMM solution, including an HA library server, we performed the initial configuration on the management cluster nodes.
As stated earlier, this is really a chicken-and-egg scenario. Since we are working with a cluster, it’s straightforward to configure the nodes one at a time: put one node in maintenance mode, move the workload, and repeat the process on the remaining node(s). Our desired state at this point is to deploy the logical switch with its profile settings to all nodes, and later provision more storage and define classifications within the fabric.
The description here is relatively high-level, but to summarize: we do the normal fabric work in VMM at this point, and prepare the infrastructure to deploy and configure the remaining hosts and clusters.
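As a rough sketch of that node-by-node flow with the VMM cmdlets (the VMM server name and host names are placeholders, and the actual logical switch/host network configuration steps depend on your uplink port profiles):

# Assumes the VMM console/PowerShell module is installed on the machine running this
Import-Module VirtualMachineManager
Get-SCVMMServer -ComputerName 'vmm01.domain.local' | Out-Null

foreach ($nodeName in 'mgmt-hv01','mgmt-hv02') {
    $node = Get-SCVMHost -ComputerName $nodeName

    # Put the node in maintenance mode and live migrate its workload within the cluster
    Disable-SCVMHost -VMHost $node -MoveWithinCluster

    # ... apply the logical switch / host network configuration for this node here ...

    # Bring the node back into production
    Enable-SCVMHost -VMHost $node
}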

For more information around the details about the design, I used the following script that I have made available that turns SCVMM into a fabric controller for Windows Azure Pack and Azure Site Recovery integration:


Once the initial configuration was done, we deployed the NVGRE gateway hosts, DR hosts, IaaS hosts, Windows Azure Pack and the remaining System Center components in order to provide service offerings through the tenant portal.

If you are keen to know more about this process, I recommend reading our whitepaper, which covers this end to end:



Here’s an overview of the design after the initial configuration:





If we look at this from a different – and perhaps a more traditional perspective, mapping the different layers with each other, we have the following architecture and design of SCVMM, Windows Azure Pack, SPF and our host groups:



So far so good. The design of the stamp was finished and we were ready to proceed with the Azure Site Recovery implementation.

Integrating Azure Site Recovery

To be honest, at this point we thought the hardest part of the job was done – ensuring HA for all the workloads, integrating NVGRE into the environment, spinning up complex VM Roles to improve the tenant offerings, and so on and so forth.
We added ASR to the solution and were quite confident that this would work like a charm, since we had SQL AlwaysOn as part of the solution.

We soon found out that we had to do some engineering before we could celebrate.

Here’s a description of the issue we encountered.

In the Microsoft Azure portal, you configure ASR and perform the mapping between your management servers and clouds and also the VM networks.

As I described earlier in this blog post, the initial design of Azure Site Recovery in an “Enterprise to Enterprise” (on-prem to on-prem) scenario was to leverage two SCVMM management servers. The administrator then had the opportunity to duplicate the network artifacts (network sites, VLANs, IP pools etc.) across sites, ensuring that each VM could be brought online on the secondary site with the same IP configuration as on the primary site.

Sounds quite obvious and really something you would expect, yeah?

Moving away from that design and instead using a single SCVMM management server (a single, highly available management server is not the same as two SCVMM management servers) gave us some challenges:

1)      We could (of course) not create the same networking artifacts twice within a single SCVMM management server.
2)      We could not create an empty logical network and map the primary network to it. This would throw an error.
3)      We could not use the primary network as our secondary as well, as this would give the VMs a new IP address from the IP pool.
4)      Although we could update IP addresses in DNS, the customer required the exact same IP configuration on the secondary site post failover.


Ok, what do we do now?
At that time it felt a bit awkward to say that we were struggling to keep the same IP configuration across sites.

After a few more cups of coffee, it was time to dive into the recovery plans in ASR to look for new opportunities.

A recovery plan groups virtual machines together for the purposes of failover and recovery, and it specifies the order in which groups of VMs should fail over. We were going to create several recovery plans, so that we could easily and logically group different kinds of workloads together and perform DR in a trusted way.

Here’s what the recovery plan for the entire stamp looks like:



So this recovery plan would power off the VMs in a specific order, perform the failover to the secondary site and then power on the VMs again in a certain order specified by the administrator.

What was interesting for us to see was that we could leverage our PowerShell skills as part of these steps.

Each step can have an associated script and a manual task assigned.
We found out that the first thing we had to do, before even shutting down the VMs, was to run a PowerShell script to verify that the VMs were connected to the proper virtual switch in Hyper-V.

Ok, but why?

Another good question. Let me explain.

Once you are replicating a virtual machine using Hyper-V Replica, you have the option to assign an alternative IP address to the replica VM. This is very useful when you have different networks across your sites, so that the VMs can be online and available immediately after a failover.
In this specific customer case, the VLAN(s) were stretched and made available on the secondary site as well, hence the requirement to keep the exact network configuration. In addition, all of the VMs had static IP addresses assigned from the SCVMM IP pools.

However, since we didn’t do any mapping in the portal – precisely to avoid the errors and the wrong outcome – we decided to handle this with PowerShell.

When enabling replication on a virtual machine in this environment, without mapping it to a specific VM network, the replica VM would get the following configuration:



As you can see, we are connected to a certain switch, but the “Failover TCP/IP” checkbox was enabled with no info. You probably know what this means? Yes, the VM will come up with an APIPA configuration. No good.

What we did

We created a PowerShell script (sketched below) that:

a)       Detected the active replica hosts before failover (using the Hyper-V PowerShell module)
b)      Ensured that the VM(s) were connected to the right virtual switch in Hyper-V (using the Hyper-V PowerShell module)
c)       Disabled the Failover TCP/IP settings on every VM

If all of the above were successful, the recovery plan could continue to perform the failover; if any of them failed, the recovery plan was aborted.
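Here is a minimal sketch of what such a script could look like, run per recovery Hyper-V host as a step in the recovery plan. This is not the exact script we used: the virtual switch name is a placeholder, the real script cleared the Failover TCP/IP settings rather than just validating them, and the failover-configuration cmdlet and property names should be verified against your Hyper-V module version.

# Minimal sketch - assumptions: runs on each recovery Hyper-V host,
# 'LS-Production' is the virtual switch the VMs must be connected to.
$targetSwitch = 'LS-Production'

# a) Detect the replica VMs on this host (the passive copies before failover)
$replicaVMs = Get-VM | Where-Object { $_.ReplicationMode -eq 'Replica' }

foreach ($vm in $replicaVMs) {
    foreach ($nic in Get-VMNetworkAdapter -VM $vm) {

        # b) Ensure the adapter is connected to the expected virtual switch
        if ($nic.SwitchName -ne $targetSwitch) {
            Connect-VMNetworkAdapter -VMNetworkAdapter $nic -SwitchName $targetSwitch
        }

        # c) Verify that no Failover TCP/IP address is injected, so the VM keeps
        #    the static IP it already has from the SCVMM IP pool after failover
        $failoverCfg = Get-VMNetworkAdapterFailoverConfiguration -VMNetworkAdapter $nic
        if ($failoverCfg.IPv4Address -or $failoverCfg.IPv6Address) {
            Write-Error "Failover TCP/IP settings still present on $($vm.Name)"
            exit 1   # a non-zero exit marks this recovery plan step as failed
        }
    }
}
exit 0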


For this to work, you have to ensure that the following pre-reqs are met:

·        Ensure that you have at least one library server in your SCVMM deployment
·        If you have an HA SCVMM deployment as we had, you also have a remote library share (example: \\fileserver.domain.local\libraryshare ). This is where you store your PowerShell script (nameofscript.ps1). Then you must configure the share as follows (a PowerShell equivalent is sketched after this list):
a.       Open the Registry Editor
b.       Navigate to HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft System Center Virtual Machine Manager Server\DRAdapter\Registration
c.        Edit the value ScriptLibraryPath
d.       Place the value as \\fileserver.domain.local\libraryshare\. Specify the fully qualified domain name (FQDN).
e.       Provide permission to the share location
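For reference, the same registry change done with PowerShell on the VMM server could look like this (adjust the share path to your own library share):

# Point the ASR/DR adapter's script library path at the remote VMM library share
Set-ItemProperty `
    -Path 'HKLM:\SOFTWARE\Microsoft\Microsoft System Center Virtual Machine Manager Server\DRAdapter\Registration' `
    -Name 'ScriptLibraryPath' `
    -Value '\\fileserver.domain.local\libraryshare\'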

This registry setting will replicate across your SCVMM nodes, so you only have to do this once.

Once the script has been placed in the library and the registry changes are implemented, you can associate the script with one or more tasks within a recovery plan, as shown below.



Running the recovery plan(s) now ensured that every VM that was part of the plan was brought up at the recovery site with the same IP configuration as on the primary site.

With this, we had a “single-button” DR solution for the entire management stamp, including Windows Azure Pack and its resource providers.


-kn

Thursday, January 1, 2015

Azure Site Recovery - Survey

Happy New Year!

Now, let us get back to work.

I have made a very short survey just to get a better understanding of the potential DR scenarios with Microsoft Azure Site Recovery.

As you already know, Azure can be your DR site today, where you can have ongoing replication from your private cloud(s) to Azure, which eliminates the need for a secondary site that you have to manage and operate yourself.

However, there are some limitations when using Azure, such as the lack of support for Generation 2 VMs and advanced usage of VHDX.

Please take 30 seconds to complete this short survey - and I will be very grateful.

https://no.surveymonkey.com/s/KWSCD6W

-kn

Wednesday, September 24, 2014

Providing Disaster Recovery for Windows Azure Pack tenants


In today’s datacenters, where service providers are using Hyper-V, System Center and Windows Azure Pack, we find a wide diversity of technologies and many moving parts that together deliver one or more sophisticated solutions.

To deliver cloud computing at scale, the most important thing is an understanding of the solutions as well as the preferred design to meet these goals.

In a nutshell, the holy grail of disaster recovery for tenants is something like this:

“A failover should occur with minimal or no interaction from the tenants”

This one is hard to achieve, to be honest, and the underlying design of the management stamp and the rest of the topology needs to support it in order for this to work.

Let us just look at some of the dependencies when a service provider is offering a complete IaaS solution to their tenants, leveraging the VM Cloud Resource Provider in Azure Pack.

·         VMM Management Stamp
o   The management stamp is the backend, the fabric controller of your VM Clouds in WAP
o   VMM manages the scale-units as well as the clouds abstracted in to WAP.
o   Networking is also a critical part of VMM once NVGRE is implemented. Since VMM acts as a network controller in this context, we have in essence a boundary for the managed virtual machines
o   Virtualization gateways are also managed by the stamp, within the single boundary
o   Hyper-V Replica, managed through ASR and VMM
·         SPF
o   Service Provider Foundation exposes the IaaS capabilities in VMM to Azure Pack, and is the endpoint used by the VM Cloud Resource Provider in WAP
·         Active Directory
o   A service provider datacenter may contain several domain controllers and forests to manage their solutions, and hence a critical part of the offerings
o   ADFS for federation using multiple services can be considered as a key component
·         SQL
o   Almost everything in the Cloud OS stack on-premises has a dependency on SQL Server instances and databases
·         Windows Azure Pack
o   The APIs, the portals and the extensions are all there to deliver the solutions to the tenants, providing Azure-like services based on the service provider’s own fabric resources, such as IaaS and PaaS.
§  VM Cloud Resource Provider
·         Remote Console
·         VMM library with VM templates and VM Roles
·         Tenants
·         VMs
·         VM Roles
·         Virtual Networks
·         Usage
o   Other resource providers, such as Automation with SMA, Web Site Clouds, SQL Clouds and more, are all part of the solution too
·         Scale-units
o   Compute
o   Storage
o   Networking

In order to solve this, and address many of the challenges that you will be facing, it is really important to be aware of the structure of these components, how they scale, how they can survive and what it takes to bring them back online again.

We have been working on some designs that may fit the majority of the organizations out there, each of them with pros and cons.

Within the next few weeks, I will publish our findings, which will give you a better understanding of what is possible and what may be the best solution to ensure business continuity for both the service provider and the tenants.

Tuesday, December 31, 2013

How to Setup Hyper-V Recovery Manager with a Single VMM server topology

Hyper-V Recovery Manager with single VMM server topology

Recently, Microsoft announced that a single VMM server is sufficient to take advantage of Hyper-V Recovery Manager – a software-as-a-service offering in Windows Azure that orchestrates Hyper-V Replica DR workflows in your on-premises cloud infrastructure, managed by System Center 2012 R2 – Virtual Machine Manager.
This is a huge step in the right direction to ensure HVR adoption among customers and partners.
The requirement of having two VMM infrastructures would not only be an additional cost, but would also lead to administrative overhead and complexity, since a Hyper-V host can only be managed by a single VMM management server at a time.


This blog post will focus on:

·         Setup of the HVR agent on the VMM Management server
·         Creation of DR Cloud within VMM
·         Configuration of DR in HVR
·         Orchestration with HVR and VMM

Setup of the HVR agent on the VMM Management server

Before we can go ahead and deploy HVR into our environment, the following requirements must be met.

Hyper-V Recovery Manager prerequisites:

·         Windows Azure account. You will need an Azure account with the recovery services feature enabled.
·         .CER certificate that must be uploaded as a management certificate containing the public key to the Hyper-V Recovery vault, so that the VMM server can be registered with this vault. Each vault has a single .cer certificate that complies with the certificate prerequisites (a minimal example of creating one follows this list).
·         .PFX file. The .cer certificate must be exported as a .PFX file (with the private key), and you will import it on each VMM server that contains virtual machines that you want to protect. This blog post will only use a single VMM server.
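As a hedged sketch, a self-signed certificate that meets the usual vault requirements can be created with makecert.exe and then exported to a .PFX for the VMM server. The certificate name and expiry date below are placeholders; check the vault’s certificate prerequisites for the exact requirements.

# Create a self-signed certificate in the local machine store (requires makecert.exe from the Windows SDK)
& makecert.exe -r -pe -n "CN=HVRVaultCert" -ss my -sr localmachine `
    -eku 1.3.6.1.5.5.7.3.2 -len 2048 -e 01/01/2016 HVRVaultCert.cer

# Export the certificate (with its private key) to a .PFX for import on the VMM server
$cert = Get-ChildItem Cert:\LocalMachine\My | Where-Object { $_.Subject -eq 'CN=HVRVaultCert' }
$pfxPassword = Read-Host -AsSecureString -Prompt 'PFX password'
Export-PfxCertificate -Cert $cert -FilePath .\HVRVaultCert.pfx -Password $pfxPassword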

VMM server prerequisites:

·         At least one VMM server running System Center 2012 SP1 or System Center 2012 R2 (this blog post will demonstrate 2012 R2)
·         If you are running one VMM server, it will need two clouds configured (where the DR will occur between the clouds). If you have two or more VMM servers, at least one cloud should be configured on the source VMM server you want to protect, and one cloud on the destination VMM server that you will use for recovery. The primary cloud you want to protect must contain the following:
o   One or more VMM host groups
o   One or more Hyper-V host servers in each host group
o   One or more Hyper-V virtual machines on each Hyper-V host
·         If you want virtual machines to be connected to a VM network after failover, you configure network mapping in Hyper-V Recovery Manager.

Once the certificate is uploaded to HVR, you can download the latest provider, which you should install on your VMM management server.



The installation process requires that you stop the System Center Virtual Machine Manager service prior to install, as there will be changes made to the GUI as well as extra functionality on the server.

During the installation, you must point to the .pfx file of your .cer certificate and map it with the vault created in Windows Azure Hyper-V Recovery Manager.




Specify the VMM server name, and enable ‘Synchronize cloud data with the vault’. For your information, only metadata is shipped from VMM to Windows Azure.

Once the installation has completed, the setup can start the VMM service again, and you can open the VMM console.


The next thing we will do is create clouds in VMM.

Creation of DR Cloud within VMM

A cloud is an abstraction of your physical fabric resources, like virtualization hosts (host groups), networks, storage, library resources, port classifications, load balancers and eventually the user actions that you permit.

Create at least two clouds (one for production and one for DR) and enable DR on both of them. This option is available when you assign a cloud name and a description.



Also, please note that the capability profile that contains ‘Hyper-V’ should be selected as part of the cloud. This is a requirement so that only virtual machines tagged for Hyper-V can participate in the DR workflows, which depend solely on Hyper-V as the hypervisor.


Now, if we look at the HVR service in Windows Azure again, under protected items, we should see both of our clouds listed.


Note that there are currently no virtual machines enabled for protection, although there could be virtual machines running in these clouds.
If we check the clouds in VMM, we can see that the protection status shows ‘Disabled’.


Configuration of DR in HVR

To complete the configuration of the HVR service, we must continue to work in the Windows Azure Portal.
Click on the cloud under protected items that should be treated as the primary cloud (running the primary workload).


In order to complete the configuration, click configure protection settings.





This will let you configure the replication location and frequency.
If you are familiar with Hyper-V Replica, you will recognize the options here.

Target location: this will be your VMM server

Target cloud: this will be the DR cloud you created in VMM, that will receive replication from the primary cloud, running the primary workload.

Copy frequency: Choose between 5 minutes (default), 30 seconds and 15 minutes – the latter two options were introduced with Windows Server 2012 R2 Hyper-V.

Additional recovery points: Default is zero, but you can have up to 15 recovery points in total.

Frequency of application-consistent snapshots: Hyper-V Replica also supports app-consistent snapshots in addition to crash-consistent snapshots. This is ideal for SQL Server and other critical applications enabled for DR with HVR.

Data transfer compression: default is ON, so that the data is compressed during replication.

Authentication: Certificate and Kerberos are the options. HVR lets you use certificates so that you can replicate between different domains if you would like, without any trust.

Port: 8084 is the default port, and a firewall rule will be enabled on the Hyper-V hosts in the primary and recovery clouds to allow access to this port.

Replication Method: Over the network is default – and recommended, but offline is also an option.

Replication start time: Immediately – which is good when you have the bandwidth. An initial replication will copy and replicate the entire virtual machine (with its virtual hard disks) to the recovery site. A good idea might be to schedule this to happen overnight, for example. (A hedged per-VM equivalent of these settings, using the Hyper-V cmdlets, is sketched right after this list.)
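These cloud-level settings map to the per-VM Hyper-V Replica settings that HVR configures for you through VMM. As an illustration only – you would not normally do this by hand in an HVR-managed environment – the equivalent manual configuration could look roughly like this; the VM and recovery host names are placeholders:

# Rough per-VM equivalent of the cloud protection settings above
# (normally configured for you by HVR through VMM)
Enable-VMReplication -VMName 'Tenant-VM01' `
    -ReplicaServerName 'recovery-hv01.domain.local' `
    -ReplicaServerPort 8084 `
    -AuthenticationType Kerberos `
    -CompressionEnabled $true `
    -RecoveryHistory 15 `
    -VSSSnapshotFrequencyHour 4 `
    -ReplicationFrequencySec 300      # 30, 300 or 900 seconds on 2012 R2

# Kick off the initial replication over the network right away
Start-VMInitialReplication -VMName 'Tenant-VM01'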

Once you have completed the configuration, click ‘Save’.

This will initiate a job in your VMM and Hyper-V infrastructure that will pair clouds, prepare the VMM server(s) and clouds for protection configuration, and configure the settings for the clouds to start protecting virtual machines.

Once the job has completed, go back to protected items in the Azure portal and verify that DR is enabled for your clouds.


We must also map some resources in order to streamline the potential failovers between our clouds.
If you have worked with Hyper-V Replica, you may remember that after you have enabled initial replication on a new virtual machine, the wizard sends you to the virtual NIC interface on the hardware profile, so that you can configure an alternative IP configuration for the VM.
This setting in HVR lets us do this at scale, so that network A on the primary cloud can always be mapped to network A2 on the DR cloud, for instance.

Click on ‘resources’ in the portal, and map your networks.
It is important that these networks are available in the cloud configuration in VMM in order to show up here.





Next, let us enable DR on our virtual machines running in the primary cloud.
In VMM, we will notice a new option under ‘Advanced’ on the hardware tab on the virtual machines.
The screenshot below shows a virtual machine running in my ‘Service Provider Cloud’ which is the primary cloud, where I enable DR.



Once this has completed, the virtual machine’s metadata should be exposed in HVR and ready to use in a recovery plan.

Note: if DR should be considered mandatory in your environment, a good tip would be to tag the hardware profiles on your templates with the Hyper-V capability profile, as well as enabling DR under Advanced. Then all newly created virtual machines based on your templates will be available in the recovery plans in HVR. Also note that if the Hyper-V Replica Broker is in use (in a Hyper-V cluster), you can’t enable protection on VMs that are not configured as highly available and are running locally on one of the nodes.

Back in the portal, we must create a recovery plan.

Creating Recovery Plans in HVR

Now that we have a VM enabled for protection, it is time to create one or several recovery plans.
A recovery plan gathers virtual machines into groups and specifies the order in which the groups fail over. Virtual machines you select will be added to the default group (Group 1). After you create the recovery plan, you can customize it and add additional groups.
This is very useful if you have distributed applications (everyone has these!) or a specific workload you would like to group. The power of HVR is the ability to orchestrate and facilitate the failovers.

Click on recovery plans in the portal, and start the wizard to create a new one.
First, you must select source and target. In my example, since I’m using only a single VMM server, I can use the same server as both source and target. Specify a name and continue.



Select virtual machines that should participate in the recovery plan. We can see the VM I enabled previously at this stage.


Once the job has completed, you should have successfully enabled a recovery plan for the virtual machine(s) and be able to perform workflows like failover (planned, unplanned) and test failover.



Thanks for reading – and in the next blog post or so, we will look closer at DR operations at scale and how to use groups together with recovery plans to meet critical business requirements.

Happy new year!