A Bumpy Road Leads to A Rocky Start


To say that I’ve been busy the past few weeks is a bit of an understatement. I am relieved and proud to say that as of today we had 193 desktops of 861 in use. We survived a simultaneous log on / log off of 50 users, and things have gone pretty smoothly so far. I got some breathing room to troubleshoot some of the minor one-off issues affecting individual users, along with some cosmetic cleanup issues. Here’s a rundown of what happened over the past few weeks.

 

In early August our target was set at 770 desktops for the start of school on September 4th. As part of this expansion we were planning to switch the storage to Atlantis ILIO Diskless VDI (for non-persistent desktops). The figure of 77 VMs per host was derived from a specification provided by Atlantis for a 64 GB ILIO VM on a 256 GB host at 2 GB per VM, leaving room for approximately 79 VMs. We opted for 77, since it fit our model of 770 desktops.

Around August 20th, we were told we had another budgetary boost to provide substantial, noticeable improvements in delivering services to students. As part of that, we opted to expand our infrastructure by 308 licenses on the assumption that we’d be working with 77 VMs per host. We purchased the 4 additional hosts and licenses. As the other hardware arrived, I began installing and configuring the hosts. The setup time for each Hyper-V host was substantial.

On Friday, August 23rd, the HP Universal Print Driver was updated on the print server, causing various printer installation issues on machines with corrupted registry keys. This included the existing XenDesktop 5.6 image, which did not have a fix applied to it. Users (apparently) reported it, but the reports were not routed to the appropriate queue in a timely fashion.

With the new XenDesktop 7 infrastructure in place and functionally tested, we were ready to schedule a cutover. However, one of our biggest risks was the EMC Celerra CIFS server, which had previously had significant CPU issues for no apparent reason. EMC was unable to resolve the case, so we decided to abandon the Celerra platform and instead opt for a Windows Server 2012 file server. On Wednesday, August 28th, I made the final cutover from the Celerra to the Windows file server (~25-30 million files, approximately 18 TB before de-duplication).

On Thursday, August 29th, I was made aware of persistent (and rather angry) reports that users “could not print anything since last Friday” (I won’t go into the fact that users have 5-40 printers to choose from at their site from a variety of manufacturers). As the day was halfway done and the cutover was scheduled for that evening, we held out for Friday. The cutover on Thursday evening went somewhat smoothly, but I noticed very late (~1:30 AM) that Office wasn’t activated. I had to put the disk into private mode to fix that. Afterwards, I accidentally neglected to re-select Cache on Device Hard Drive, leaving the default of cache on server.

On Friday, we had reports of horrific performance. Given that we were using ILIO, we were quite surprised. As it turned out, however, the devices were caching on the PVS server. After a rather troubling day, we got through it, and I changed the vDisk over the weekend, working tirelessly to provision additional VMs on Hyper-V hosts (a follow-up post will be available about Hyper-V/SCVMM). On Tuesday, with 1 day to go, it was clear I would not be able to provision VMs fast enough to come anywhere near our expected roll-out. To complicate matters, early Tuesday morning a server became completely unresponsive, knocking off 12 users. I immediately contacted Atlantis support, who got back to me, and we discovered that the ramdisk had filled to capacity. While I was on the phone, I discovered that Hyper-V allocates BIN files regardless of any user preferences for memory swapping and that there is no way to turn it off, which drastically reduced our storage capacity in the ILIO ramdrive (~1-2 GB per VM before deduplication). Around noon, the remaining hosts began to fill to capacity, and as the 90 or so users bounced around they filled up the additional servers and everything went down.

We immediately began trying to find out why the write cache was filling so quickly with so few users. However, we were hamstrung by Hyper-V’s hyper-slow provisioning through the XenDesktop Setup Wizard in Provisioning Services, so even getting basic service restored was extremely difficult. By Wednesday at 2:30 AM, I had brought 20 VMs per host online, and I worked to reconfigure the Hyper-V hosts to attach via iSCSI to store the configuration files as well as the .BIN swap file. On Wednesday it became clear that the slowness of provisioning in Hyper-V and SCVMM, as well as general failings with SCVMM, were going to be our death knell, so I began requesting quotes on ESXi standard pricing.

On Thursday, September 5th, I was working with Atlantis support on optimizations for our image to reduce write cache usage. Atlantis also suggested using SDELETE to write zeros to the free space of the virtual desktops in order to free space on the RAM drive. We configured this to run as a shutdown script due to our high log on and log off rate. On Thursday evening, I provisioned as many desktops as I could, increasing the number from 20 per host after running semi-stable for a day. I enabled some hosts with additional VMs (some with 30, some with 40), which brought us to a total of 180. Additionally, I set up one host as ESXi and installed ILIO Center to monitor the ILIO instances, but did not have any online before Friday morning. I was scheduled to be out of the office on Friday for my son’s birthday. I went to sleep (again) at about 3 AM for the 7th or 8th time in 2 weeks after verifying we had the 180 VMs online and ready.
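For anyone curious what the SDELETE step might look like, here is a minimal sketch, not the actual script we deployed. It assumes the Sysinternals `sdelete` executable is on the PATH of the desktop image, uses its `-z` switch (zero free space), and passes `-accepteula` so the license prompt does not block an unattended shutdown. The function names and the default drive letter are hypothetical.

```python
import shutil
import subprocess


def build_sdelete_command(drive: str = "C:") -> list[str]:
    """Build the SDelete command line that zeros the free space on `drive`."""
    # -z zeros free space; -accepteula skips the first-run license prompt.
    return ["sdelete", "-accepteula", "-z", drive]


def zero_free_space(drive: str = "C:") -> bool:
    """Run SDelete if it is available; return True only when it exits cleanly."""
    # Assumption: sdelete is on PATH inside the image. If not, do nothing.
    if shutil.which("sdelete") is None:
        return False
    result = subprocess.run(build_sdelete_command(drive), check=False)
    return result.returncode == 0
```

A shutdown script would simply call `zero_free_space("C:")`. Zeroed blocks deduplicate well, which is presumably why this frees space on the RAM drive, but zeroing takes time on every shutdown, so it is worth measuring against your own log off rate.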

Friday morning I woke up at about 6:45 to check the status of things. To my horror, I discovered all desktops were down. PVS was not booting devices properly. A quick CDF trace revealed there was a date stamp difference between the versions (why, I don’t know; I used robocopy to copy it). Recopying didn’t work, so I just deleted the version, re-made the changes in a maintenance version, and recopied as quickly as possible. By 8:15, I had about 60 VMs online with more booting every few minutes. I continued working to get the environment stabilized and to monitor the write cache before departing at about 11:30 to hang out with my son. By 3 PM, they were off to grandma’s house and I helped out with the (now daily) VDI status meeting. The plan was to expand the VMware platform and get some Hyper-V hosts closer to capacity.

I worked Friday evening to cover my hours and into overtime, setting up ESXi and reverse imaging our Windows 7 image for import into VMware. On Saturday, I worked hard to clean up the image and stress-test the VMware deployment with ILIO. The results were so fantastic that I called my supervisor to ask whether, instead of doing only 1-2 hosts on VMware, I could modify the plan and do the rest as well. We agreed to keep 1 Hyper-V host, and the remaining 13 would be VMware. It truly was pleasing to have the OS rapidly deployed without cumbersome imaging and annoying configuration requirements (NIC teaming on 2008 R2…looking at you). The ILIO VMs were provisioned with 90 GB of RAM and in very short order I had 13 hosts ready to go…then the moment of truth. I had 12 hosts remaining to provision. At about 1:00 AM on Sunday, I opened 12 instances of the PVS Console and was able to start all of them, each for 77 VMs per host. It took about 37 minutes. In SCVMM, even with the “fast” workaround of enabling dynamic disks, that would have taken about 26 hours (again, assuming random SCVMM failures didn’t happen!). On Sunday, I successfully booted (and rebooted) 861 VMs a few times and did further image cleanup. Antivirus was removed because the policies weren’t applying properly, and the SCCM client was removed because various “idle” tasks were eating away at our CPU. Finally, late Sunday I was content that the VMs were stable.
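To put the provisioning numbers above in perspective, here is a quick back-of-the-envelope calculation. It assumes the 37 minutes and the estimated 26 hours both refer to the same job (12 hosts at 77 VMs each), and the per-VM figures are aggregate throughput, since the 12 PVS Console instances ran in parallel. The variable names are mine.

```python
# Figures taken from the post; the 26 hours is an estimate, not a measurement.
HOSTS = 12
VMS_PER_HOST = 77
PVS_MINUTES = 37     # XenDesktop Setup Wizard against VMware
SCVMM_HOURS = 26     # estimated for the same job through SCVMM on Hyper-V

total_vms = HOSTS * VMS_PER_HOST

# Aggregate wall-clock seconds per VM for each platform.
pvs_seconds_per_vm = PVS_MINUTES * 60 / total_vms
scvmm_seconds_per_vm = SCVMM_HOURS * 3600 / total_vms

# How many times longer the SCVMM route would have taken.
speedup = (SCVMM_HOURS * 60) / PVS_MINUTES

print(f"VMs provisioned:      {total_vms}")
print(f"VMware, s per VM:     {pvs_seconds_per_vm:.1f}")
print(f"SCVMM, s per VM:      {scvmm_seconds_per_vm:.1f}")
print(f"Approximate speedup:  {speedup:.0f}x")
```

Under those assumptions the gap works out to roughly a factor of 40, which is the difference between a late night and a lost weekend.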

Monday morning I woke up and immediately checked my phone…AND NO UNEXPECTED ALERTS! I was thrilled. Throughout the day I was monitoring the write cache sizes as well as the ILIO datastores. Peak usage in the ILIO datastore was about 15 GB, while some hosts barely scratched 6 GB. Now that the ILIO instances are on VMware, we don’t have a 64 GB limit and can allocate 90 GB per ILIO, leaving us with a 110 GB NFS store.

In the VDI status meeting it was evident that VMware was the clear choice moving forward. Hyper-V’s TCO was simply blown out of the water by the stability and speed of the tools at hand. On Tuesday, 9/10/2013, we will commence testing, pushing our limits and trying to get to capacity on a host, that is, 77 VMs in use. We’re not out of the woods yet, but I’m sitting back enjoying a beer after what I consider to be a successful day (albeit a bit late!) and I am looking forward to making sure that our VDI environment is absolutely awesome for the users.

 



0 thoughts on “A Bumpy Road Leads to A Rocky Start”

    • Atum Post author

      What flavors are you using? I’m really searching for other souls out there who are suffering on Hyper-V. After a year of that experience, I have a hard time believing that there are really users running it in production.