Sunday, September 19, 2010

Mindy 3.0 – Open the Windows!

So, in what appears to be an annual event, I've had to rebuild Mindy yet again.  This time it wasn't a complete hardware failure that caused the refresh, though there was a minor hardware problem involved.  It turns out that when I rebuilt her last year, I picked up three Western Digital WD1001FAYS 1TB hard drives to use as part of the RAID configuration.  Unfortunately, those are desktop drives, and they appear to be basically incompatible with any kind of redundant RAID (RAID-1, RAID-5, etc.).  I kept seeing disk timeout errors from my RAID controller, which caused system stalls (up to 60 seconds) and repeated verification runs on the array.  Besides being annoying (nothing like having the music you're listening to just stop mid-stream, followed by a very annoying repeating click; love the SoundBridge, but it doesn't handle that gracefully), I also saw file corruption on my primary file server.  Fortunately it was confined to the logs, but it was no good nonetheless.

The other issue I was seeing was an almost complete failure of the VMware web UI (MUI).  When I first rebuilt her, I installed Ubuntu Linux Server 8.04.2 LTS.  My hope was to avoid doing a rebuild for a while, plus I'd worked with Ubuntu for a while and really liked it.  Unfortunately, it appears that Ubuntu + VMware Server 2.x (at least for me) is not a good combination.  For the first couple of months all was well, but then the MUI started failing to connect to the back-end server.  At this point, I can't tell if it was simple degradation of the system or the result of an update (I suspect the latter).  Regardless, by about two months ago I would restart the UI, connect once or twice, and that was it.  I could never reconnect, so no starting VMs, stopping VMs, etc.  It became basically unmanageable.

At that point, I decided it was time to fix the drive problems by replacing the desktop drives with three Western Digital WD2002FYPS RE4 2TB drives.  These are RAID Edition drives, which basically means they have TLER (Time Limited Error Recovery) enabled.

Time Limited Error Recovery (TLER) instructs the drive to limit the amount of time it spends trying to recover from a read or write error, typically to about 7 seconds.  That way the RAID controller can catch the issue and repair it from the array's redundancy, rather than assuming the drive has timed out and dropping it from the array.
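
To make that timing race concrete, here's a toy sketch.  The numbers are typical figures I've seen quoted, not measurements from my hardware: a drive only stays in the array if it reports the error before the controller gives up waiting on it.

```python
# Illustrative only: why a desktop drive's deep error recovery gets it
# dropped from an array while a TLER drive stays in. All timings are
# assumed "typical" values, not specs from my particular drives/card.
CONTROLLER_TIMEOUT = 10.0  # seconds a RAID controller waits before failing a drive
DESKTOP_RECOVERY = 60.0    # a desktop drive may grind on a bad sector for a minute+
TLER_RECOVERY = 7.0        # RE-series drives give up and report the error after ~7s

def stays_in_array(recovery_time: float, timeout: float = CONTROLLER_TIMEOUT) -> bool:
    """The drive survives only if it reports the error before the
    controller declares it dead and degrades the array."""
    return recovery_time < timeout

print(stays_in_array(DESKTOP_RECOVERY))  # False: drive dropped, array degrades
print(stays_in_array(TLER_RECOVERY))     # True: controller fixes the sector from redundancy
```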

Not only do they have TLER, they also have a five-year warranty and a 64MB cache, so they are blazingly fast.  More on that in a moment.

So, about $800US later, I had three new hard disks and the battery backup unit (BBU) for my LSI / 3ware 9650SE RAID card.  (I'm starting to think this machine is worse than a child, but I digress.)  I proceeded to back up all the data (no small feat with 3TB online and only 1.5TB of spare space around) and started the rebuild.  The first thing I did was pull the six hard disks the system had before (3x1TB and 3x500GB), attach the BBU, and install the three new RE4 drives.  Then I had to determine which OS to use for the new system.

My original plan was to move to VMware ESXi 4.1.0 so I could continue to leverage my investment in VMware (both as a desktop system and on the server).  So that was my first move.  I built the array and then tried installing ESXi.  No dice.  It would load the bootstrap and then, just before starting the configuration UI, it would freeze (at the "Booting: MBI=0x01100db, entry=0x00100212" step).  I couldn't figure out what was happening.  Nothing online seemed to explain the error, and I had a hard time determining whether my motherboard (ASUS P6T Deluxe V2) was supported or not.  VMware officially supports only a small number of motherboards, and the ASUS was definitely not on the list.  I had found a site that said the P6T (no additional markings) had been made to work, so I didn't assume it was a complete hardware incompatibility.

I was stumped, so I turned to Windows Server 2008 R2.  I have an MSDN account, so I pulled down the server installer just to see if it worked.  Funny enough, it failed too.  The installer would start and then almost immediately give me a black screen and an error code (—insert code here—when I find it).  I dug around and was finally able to determine that both Windows and ESXi failed to start because I had disabled ACPI in the BIOS (a remnant of my attempts to solve the drive timeout problem mentioned earlier).  Re-enabling ACPI allowed both to boot.

Now I had a decision to make: do I go with ESXi or Hyper-V?  My preference was definitely VMware's solution, as I've used Server and Workstation for quite some time and have generally had good luck with them.  In fact, I use Workstation on my work laptop on a regular basis.  So it was back to trying to get VMware to boot.  The first thing I discovered is that VMware ESXi (unlike ESX) cannot be installed to an array on a RAID card like mine; it needs some sort of plain, non-RAID storage to install to.  So I ordered a CompactFlash-to-SATA adapter.  Once that came in, I tried again.  Still no dice.  Now I got an error saying the vmfs3 module wouldn't load.  Back to Google.

After a bit of searching, I think I've determined that ESXi isn't compatible with my network card, and apparently a supported NIC is a requirement for the install to complete.  I thought about ordering a new network card, but at this point I was tired of spending money.  Besides, while the CF-to-SATA adapter was being delivered, I had tried Windows 2008 and finally gotten it to a point where it looked usable.  So I ended up installing Windows Server 2008 R2 and Hyper-V.  (Though, in retrospect, I probably should have gone with the standalone, free Hyper-V Server product … and if I had known about it, I would have.)

All wasn't completely rosy, but I'd made my decision.  There were, however, a couple of key tweaks I had to perform before the system worked as it should.  I ended up having to disable IPv4 Large Send Offload on all the systems before I got proper network performance.  Actually, on my Server 2008 client OS install, I just disabled all offloading.  Now I get around 40-60Mb/sec transfer rates to and from the VM.  That's what I like to see, and I was never quite seeing it with the older VMware/Linux setup (though I didn't have the BBU for my card then, and I think the new drives have twice the cache, so it's not an apples-to-apples comparison).
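
If you'd rather script the "disable all offload" step than click through the adapter's Advanced properties, here's a minimal sketch.  It sets the documented global DisableTaskOffload value under Tcpip\Parameters, which is a blunter instrument than the per-adapter Large Send Offload switch described above; treat it as one alternative route, not exactly what I ran.  It needs an elevated prompt and a reboot to take effect.

```python
# A minimal sketch (an alternative to the per-NIC driver setting):
# disable TCP/IP task offloading globally via the documented
# DisableTaskOffload registry value. Run elevated; reboot to apply.
import winreg

PARAMS = r"SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"

with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, PARAMS, 0,
                    winreg.KEY_SET_VALUE) as key:
    # 1 = disable checksum/large-send task offloads for the whole stack
    winreg.SetValueEx(key, "DisableTaskOffload", 0, winreg.REG_DWORD, 1)

print("DisableTaskOffload set; reboot for it to take effect.")
```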

No matter; I now have Mindy 3.0 up and running and she's doing quite well.  I've even used the Vmdk2Vhd converter to move VMs from VMware to Hyper-V.  One piece of advice for the transition, though: before moving any Windows VMs to Hyper-V, attach an IDE hard disk, boot the VM, and uninstall the VMware Tools.  That makes sure that (a) you can boot the Windows VM on Hyper-V (as it only offers IDE disks for conversions like this) and (b) you actually get the VMware Tools uninstalled.  They don't like being uninstalled on a non-VMware hosted system.
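
As a belt-and-suspenders check before converting, you can confirm from inside the guest that the Tools are really gone.  This is a hypothetical helper I'm including for illustration, not part of the Vmdk2Vhd workflow itself; it just walks the standard Uninstall registry keys looking for a VMware Tools entry.

```python
# Hypothetical pre-conversion check: scan the guest's Uninstall keys to
# confirm VMware Tools was actually removed before running Vmdk2Vhd.
import winreg

UNINSTALL = r"SOFTWARE\Microsoft\Windows\CurrentVersion\Uninstall"

def vmware_tools_installed() -> bool:
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, UNINSTALL) as root:
        for i in range(winreg.QueryInfoKey(root)[0]):  # number of subkeys
            with winreg.OpenKey(root, winreg.EnumKey(root, i)) as entry:
                try:
                    name = winreg.QueryValueEx(entry, "DisplayName")[0]
                except OSError:  # this entry has no DisplayName value
                    continue
                if "VMware Tools" in name:
                    return True
    return False

if vmware_tools_installed():
    print("VMware Tools still present -- uninstall before converting!")
else:
    print("Clear to convert.")
```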