
Setup a Windows Failover Cluster in vSphere August 10, 2010

Posted by General Zod in Microsoft, Tech, VMware.

Recently, I was discussing VMware capabilities with a colleague, and the idea of using shared storage between virtual machines for Microsoft failover clustering came up.  I’d never thought to try this before, so I was intrigued by the idea.  I started to quietly do some research into the possibilities.

Then by sheer coincidence, just 4 days later, I was asked by my Supervisor if doing so was possible.  I told him it’d take me a couple of days to research the topic and threw myself into it with even greater interest than before.

For your information, my VMware environment consists of a cluster of four (4) ESX 4.0 Update 2 Host servers leveraging shared SAN storage over fibre channel.  This project would require me to construct two (2) VMs running Windows Server 2008 R2 (64-bit) Enterprise with shared storage for Failover Clustering services.

So after some struggle, I came up with the guidelines you’ll see below.

Please note that I’m not going to give you exact step-by-step instructions.  I’m going to assume that you know something about VMware and Microsoft Clustering (or, at the very least, can look the instructions up online), and just give you the details I think are necessary to point out.

So here’s what I documented…

First, build your Cluster node Servers according to your usual standard build procedure.  Make certain to deploy the same updates to all nodes.  Don’t worry about creating the shared storage just yet; we’ll take care of that after the initial servers are up and running.

Next, create the shared storage.  As I see it, there are a couple options to choose from when creating shared storage.  I will discuss each option as we continue…

Solution #1:  Present a new LUN to the ESX Host, then add the new “Hard Disk” to each of the VMs using the “Raw Device Mappings” disk type.  (For your future reference in the event of trouble, it’s in your best interest to document which LUN is mapped to this VM.)

This is actually my preferred solution even though expanding the drive in the future will require additional SAN work.  (I usually try to keep SAN changes to a minimum… but I’d rather have a little additional SAN work once in a long while than to have the VMotion complications that the latter solution will bring.)

The only significant drawback to this solution will come when it becomes necessary to migrate to an alternate remote storage solution.  Since the “disk” is a LUN (and not a VMDK), you can’t simply copy it to a new SAN device.
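(In case it helps: if you’d rather create the RDM mapping file from the ESX console instead of through the vSphere Client, something like the following should work.  This is only a rough sketch… the device identifier and the datastore/folder/file names are placeholders from an assumed environment, and the -z flag gives you a physical-compatibility mapping, which is typically what cluster-across-boxes requires.)

# list devices to find the naa identifier of the newly presented LUN
esxcfg-scsidevs -l

# create a physical-compatibility RDM pointer file on the VM's datastore (paths are placeholders)
vmkfstools -z /vmfs/devices/disks/naa.XXXXXXXXXXXXXXXX /vmfs/volumes/MyDatastore/ClusterNode1/shared_rdm.vmdk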

OR

Solution #2:  Create an independent VMDK apart from either VM, and leverage that as your shared storage device.

Power OFF the VMs.  Then, use an SSH client (such as PuTTY) to log in to one of the ESX Hosts.  Create a new folder on the datastore where your VMs are housed, and then invoke a command similar to the following to create your shared VMDK.  (Adjust the size and file name to fulfill your requirements.)

vmkfstools -c 10G -d eagerzeroedthick shared.vmdk
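(For context, the whole sequence on the ESX console might look something like this.  The datastore and folder names here are made up, so substitute your own, and size the disk to suit your data.)

# create a folder for the shared disk on the datastore that houses your VMs
mkdir /vmfs/volumes/MyDatastore/SharedDisks
cd /vmfs/volumes/MyDatastore/SharedDisks

# create the eager-zeroed thick shared disk
vmkfstools -c 10G -d eagerzeroedthick shared.vmdk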

Next, edit the settings on each VM and add the new VMDK as an existing drive.  Select a new Virtual Device Node (such as “SCSI (1:0)”).  I recommend that you enable “Independent” mode (with the Persistent option) on the drive so that it will NOT be affected by snapshots.  (If someone starts taking snapshots on one cluster node, then the other will probably start encountering difficulties.)  After adding the drive to the VM settings, edit the new SCSI Controller settings to change the SCSI Bus Sharing option to “Physical” so that the drive can be shared by multiple VMs on separate Host servers.
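(For reference, once those changes are made, the relevant entries in each VM’s .vmx file should look roughly like the lines below.  The controller type, device node, and path are just examples from an assumed setup; yours may differ, e.g. lsisas1068 if you chose LSI Logic SAS.)

scsi1.present = "TRUE"
scsi1.virtualDev = "lsilogic"
scsi1.sharedBus = "physical"
scsi1:0.present = "TRUE"
scsi1:0.fileName = "/vmfs/volumes/MyDatastore/SharedDisks/shared.vmdk"
scsi1:0.mode = "independent-persistent"
scsi1:0.deviceType = "scsi-hardDisk"

The lines that matter most are sharedBus (that’s the “Physical” bus sharing option) and mode (that’s Independent + Persistent).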

VMotion between Hosts:

The biggest challenge with the shared VMDK solution is that you lose the ability to VMotion a running VM from Host to Host.

When attempting to do so, you’ll be presented with an error message that reads as follows:

Virtual machine is configured to use a device that prevents the operation: Device ‘SCSI controller 1’ is a SCSI controller engaged in bus-sharing.

Some sites tell you that you can get around this by configuring the SCSI Bus Sharing setting to “Virtual” instead of Physical.  However, I’ve never actually gotten that to work right, so I just leave it set to Physical.  Plus, if it’s set to Virtual, then you’d need to set up a DRS Rule in your Cluster settings to ensure that all of your Nodes are always hosted on the same ESX box.  (Of course, if you leave the SCSI Bus setting as Physical, then you should probably set up a DRS Rule on your Cluster to keep the VMs separated on different Hosts.)

Now, you don’t completely lose the ability to VMotion the VMs, but you do have to power them off first.  Luckily, you’re setting up a failover cluster, so move each of them one at a time… making sure there is at least one VM online at all times.

Storage VMotion for the VM:

This is another challenge.  You will NOT be able to migrate the VM to a new datastore while it is powered on.

To move the VM, you’ll need to power it OFF, and then remove the shared VMDK from the VM settings.  Migrate the VM to a new datastore using Storage VMotion.  Then, reattach the shared VMDK to the VM as above.  Afterwards, power the VM back ON.

Relocating the Shared VMDK:

This last challenge probably won’t become relevant to you unless you’re looking to change your remote storage solution.  Please note that you will be REQUIRED to power off all of the cluster node VMs during this process as you must ensure that no changes are made to the shared VMDK while it is being relocated.

To accomplish this task, connect the ESX Host(s) to both your old and new SAN devices simultaneously (if applicable).  Then, power OFF all of the cluster node VMs.  Remove the shared VMDK from the Settings of each VM.  Then, log in to one of the ESX Hosts via an SSH client and CLONE a copy of the VMDK to its target destination (that way your original is still intact as a back-out solution) by using the following command:

vmkfstools -i /source-path/shared.vmdk /target-path/shared.vmdk

Next, attach the newly cloned VMDK to the VMs (as mentioned above).  Power on the VMs, and confirm functionality.  Once you are confident that all is working correctly, then feel free to delete the original VMDK from the datastore using this command:

vmkfstools -U /source-path/shared.vmdk
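(If you want a quick sanity check that none of the VMs still reference the old copy before you pull the trigger on that delete, a simple grep across the datastores from the ESX console does the trick.  Adjust the path to match your actual source location; this assumes the standard /vmfs/volumes layout.)

# list any .vmx files still pointing at the old location
grep -l "/source-path/shared.vmdk" /vmfs/volumes/*/*/*.vmx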

Anyway… once you’ve got the shared storage ready to use, then power on ONE of the VMs.  Use Disk Management to initialize the drive, then proceed with partitioning and formatting.  Then, bring the second VM online, and use Disk Management to bring the shared disk online.

Then, you can proceed with installing the Failover Cluster feature, running the Validation tests, and all that jazz.  I’ll let you go hunt up the rest of your needed instructions from Microsoft on your own.

Comments

1. Leader Desslok - September 14, 2010

So I set up a Windows cluster on a vmware host using basically these instructions. It’s all on a single non-clustered host, but it still worked the same.
There was one thing I noticed when I did a migration from one server to another. Both servers tried to migrate with their own copies of the shared drive.
So, to do the migration this way, simply shut down all the virtuals in the Windows cluster. Then disconnect the shared drives from all but one, and perform all the migrations.
Then, when you go to reconnect the shared drives, you’ll find they’ve been moved to the one virtual’s folder, and possibly renamed. I simply moved the shared drives to a new folder in the data store, and then reconnected the virtuals to them, taking care to make sure I reconnected to the same SCSI target IDs and in the same order. Everything started back up again just fine.

2. General Zod - September 14, 2010

Good catch… that is an important point to note.

3. Leader Desslok - November 6, 2010

I have noticed one thing when using this setup.

I have a 2-node Windows cluster running SQL, and anytime DRS moves the active SQL node, there is one issue. If the database is actively doing something when the DRS migration occurs, the “new” node that appears after the move sees that the “old” node somehow still has the files on the cluster disk resource open, and it fails to open the SQL database file. So you have to manually go in and close and re-open the database.

