I recently spoke with someone and realized I hadn’t written a post about this yet, much to my disappointment, as solving the issue was a major accomplishment. At my previous employer (K-12 education) the environment was set up as follows:
- Approximately 1100 pooled desktops from a single PVS image.
- 600 powered on desktops from 6am to 5pm
- 50% PeakBufferPercent
- The majority of users were in library or lab environments where 10-30 users would log on and off within a fairly short time
- Average concurrent utilization, analyzed over a period of two months, was between 350 and 450 users during peak hours (~10am to noon, Monday through Friday)
On February 28th, there was a brief “outage” period where users were reporting “No Desktops Available.” I frantically scrambled to find out why – there were only about 400 users logged in so we should have had plenty of reserve available and powered on or at the very least more machines powering on. It wasn’t every login that was failing, either. During my troubleshooting, school ended and I was eventually unable to duplicate the issue, despite having changed nothing. Since I couldn’t duplicate, I looked a bit longer for a cause then chalked it up to a ghost I’d have to watch out for in the future.
The same thing happened on March 5th, at which time my alarms went into overdrive. We didn’t have support, so it was definitely time to call in reinforcements. A quick check in #Citrix proved unfruitful at the time and it seemed we were on our own. We had previously had difficulties getting satisfactory results out of support so we were reluctant to pay per-hour support from a partner. As before, while troubleshooting the issue, it went away seemingly on its own. It was at an earlier time of day, too, which proved strange.
On March 12th and 13th the same symptoms occurred, but despite our extra vigilance the reports didn’t make it from the helpdesk to the right channels. On the 21st, however, it happened again. Since we were on XenDesktop 7 (which was being serviced only through private, non-publicized hotfixes), we decided an “emergency” change request to upgrade to XenDesktop 7.1 was in order. The update proved unfruitful, as the issue recurred on the 26th. Between the 21st and 26th, however, we had arranged for a Citrix case to be opened through a partner.
On the 26th, with support on the line, we captured the standard stuff – Scout, CDF traces, event logs, etc. of every conceivable component that support could muster – including powering on all 1100 VMs and collecting their broker agent logs – but again, the issue went away on its own. Since support seemed pretty lost (which turns out to be somewhat common, unfortunately), I decided to give #Citrix another whirl. A Citrix Support staff member who participates in the channel reviewed my CDF trace and uploads. The next day I received a nice PM, paraphrased below:
“We looked over your logs. An incredibly smart guy on my team has a hunch. You’re using the default power settings, yes?”
Since I had a support case that was being escalated at this point, I passed this tidbit along to support, who shrugged it off even though I emphasized that the recommendation came from another unit within support. Given the visibility this case had within the organization, we needed it resolved and we needed an official answer. There were weeks of inactivity because the issue didn’t recur: I had set the power settings to keep all 1100 VMs powered on in order to mitigate the risk during a critical testing time. For the better part of a month I harped on support about that setting but was ignored again and again. Finally, reflecting on my earlier work (and post) digging around in the XenDesktop database, I decided to go find out where the Power Actions were stored. I initially didn’t find the Get-BrokerHostingPowerAction cmdlet (I didn’t look very hard, because support told me the answer from development was that there was “No way to find this information, we don’t expose it”). So instead, I turned to the VirtualCenter database, which logs the tasks. My fears were immediately confirmed when I saw the pattern in the tasks submitted from svc-xendesktop.
At first a large number of commands were sent at once, then they slowed to 10 per minute: every 60 seconds there were 10 new commands, indicating that the commands were queuing and then executing. I submitted this database dump to support, reiterated that changing MaxPowerActionsPerMinute had already been suggested by someone within support, and told them we were not going to renew SA because we weren’t getting satisfactory results. Within 6 hours I got an e-mail with the “Use Get-BrokerHostingPowerAction” answer (eye roll, good job there guys).
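The throttling pattern described above is easy to spot once the task timestamps are bucketed by minute. The following is a minimal Python sketch, not the query that was actually run against the VirtualCenter database: the function names, the sample timestamps, and the date are all made up for illustration, and it assumes the task start times have already been exported as datetime values.

```python
from collections import Counter
from datetime import datetime, timedelta


def actions_per_minute(timestamps):
    """Count how many tasks started in each wall-clock minute."""
    counts = Counter(ts.replace(second=0, microsecond=0) for ts in timestamps)
    return dict(sorted(counts.items()))


def throttled_minutes(per_minute, limit=10):
    """Return the minutes whose task count sits exactly at the limit."""
    return [minute for minute, n in per_minute.items() if n == limit]


# Hypothetical sample data: a burst of 25 tasks in the first minute,
# followed by three minutes of exactly 10 tasks each.
start = datetime(2000, 1, 1, 10, 0, 0)  # arbitrary date, illustration only
sample = [start + timedelta(seconds=2 * i) for i in range(25)]
for m in range(1, 4):
    sample += [start + timedelta(minutes=m, seconds=5 * i) for i in range(10)]

per_minute = actions_per_minute(sample)
for minute, n in per_minute.items():
    print(minute.strftime("%H:%M"), n)
print("minutes pinned at the limit:", len(throttled_minutes(per_minute)))
```

A long run of consecutive minutes pinned at exactly the configured limit is the signature of a queue that is draining at its cap rather than keeping up with demand.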
Get-BrokerHostingPowerAction did exactly what I wanted: it listed the pending actions, which immediately revealed that during peak hours I had 250-350 queued commands. Over a period of logons and logoffs, that backlog meant too few VDAs were registered and more could not be powered on in a timely fashion, resulting in “No desktops available”.
Finally, after nearly 2 months, I was almost free. I had the answer, and I had proven to support and escalation support that the issue was indeed power management and that the advanced setting “MaxPowerActionsPerMinute” had to be changed in our environment, so we asked for the official recommendation. After a week or two of waiting, they finally came back with (paraphrased), “we have no official answer, you will need to test it in your environment.” That’s exactly what I did. I found that with my storage configuration, the CPU and RAM of my VDI hosts, the hypervisor (vSphere Standard) and solid PVS servers, I could safely execute about 60 power actions per minute without causing the host CPUs to go hog-wild and degrade the user experience while a large number of VMs booted. (The actual ceiling in that environment was closer to 100, but I intentionally set it much lower to provide a cushion during the daytime.)
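To put the backlog in perspective, here is a rough back-of-the-envelope calculation in Python. It is a simplification, not how the broker actually schedules work: it assumes a constant rate of new power actions arriving while the queue drains, and the function name and the example arrival rate of 12 per minute are hypothetical.

```python
import math


def minutes_to_drain(queued, max_per_minute, new_per_minute=0.0):
    """Minutes needed to empty the queue, or None if it never drains.

    Assumes a steady arrival rate, which real logon/logoff bursts are not.
    """
    net = max_per_minute - new_per_minute
    if net <= 0:
        return None  # new actions arrive at least as fast as they execute
    return math.ceil(queued / net)


# 300 queued actions (mid-range of the 250-350 observed) with no new arrivals
for rate in (10, 60):
    print(f"{rate} per minute -> {minutes_to_drain(300, rate)} minutes")

# With a hypothetical 12 new actions per minute, a limit of 10 never catches up
print(minutes_to_drain(300, 10, new_per_minute=12))
```

Even in the best case, with nobody else logging off, a 300-deep queue at 10 actions per minute means a desktop can wait half an hour for its power action, versus about five minutes at 60 per minute.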
So why was this an issue?
In a pooled VDI environment, delivery groups are set, by default, to power off machines after use. This behavior can be configured with the following PowerShell commands:
Set-BrokerDesktopGroup -Name "Desktop Group Name" -ShutdownDesktopsAfterUse $True
Set-BrokerDesktopGroup -Name "Desktop Group Name" -ShutdownDesktopsAfterUse $False
When a logoff occurs, XenDesktop sends a graceful shutdown command (i.e. via VMware Tools) through the hypervisor connection associated with that VM. This task is limited by the “maximum new actions per minute” setting (10 in our case) on the hypervisor connection in the site configuration, shown below.
In 7.6, the screen changes a bit. Read about the changes here.
The “maximum new actions per minute” is what was adjusted, but why was it needed? Well, as it turns out the primary use case was library and lab scenarios with approximately 1800 endpoints (thin clients or repurposed machines), which involved either a large number of students logging off at once (and thus generating approximately 30 power actions all at once) or a high volume of users logging in and out, say when looking up a library book. Given that these labs were spread among approximately 30 sites, the queue could build up, and once it was deep enough it couldn’t recover until after the school day was out.
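The way such a queue builds up and never recovers can be illustrated with a small minute-by-minute simulation. This is a sketch under loudly stated assumptions, not a model of the broker's real scheduler: it assumes one power action per logoff, and the arrival pattern (a lab of 30 logging off every third minute somewhere across the sites, plus a trickle of 4 library logoffs per minute) is invented for illustration.

```python
def simulate_queue(arrivals, max_per_minute):
    """Return the queue depth at the end of each minute."""
    depth = 0
    history = []
    for new_actions in arrivals:
        depth += new_actions                  # logoffs enqueue power actions
        depth -= min(depth, max_per_minute)   # at most N are executed per minute
        history.append(depth)
    return history


# Hypothetical hour: 4 library logoffs every minute, plus a lab of 30
# students logging off at once every third minute.
arrivals = [4 + (30 if minute % 3 == 0 else 0) for minute in range(60)]

at_10 = simulate_queue(arrivals, 10)
at_60 = simulate_queue(arrivals, 60)
print("depth after one hour at 10/min:", at_10[-1])
print("peak depth at 60/min:", max(at_60))
```

With these made-up numbers the average demand is 14 actions per minute, so a limit of 10 falls behind by 4 actions every minute and the queue is 240 deep after an hour, while a limit of 60 absorbs every burst within the minute it arrives.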
Could this have been avoided by not using power management?
No. Power management itself wasn’t the cause of the issue. Although we didn’t see the issue with all 1100 VMs powered on, that was only because our reserve was able to meet the demand while the power action queue was still very large. With higher concurrent usage at the time (say, 600 or 700 users) we’d have seen the same problem, and likely sooner, given that a greater number of people would be logging in and out.
What lessons can be learned from this?
The first lesson I’d say is be sure to use all resources available to you. It’s no secret I’m a fan of #Citrix IRC. Since joining in 2011 the knowledge I’ve gained from simply listening and watching or participating in conversations has accelerated my understanding of Citrix technologies tremendously. The Citrix discussion forums can also be helpful, as well as other user communities.
Secondly, although having support is a nice CYA, don’t take what they say as gospel, even if it is from an escalation engineer supposedly speaking with development. I could have had my “smoking gun” for my issue weeks earlier had I found the Get-BrokerHostingPowerAction cmdlet on my own, but I took the escalation engineer’s word that the functionality, after checking with development, was simply not exposed. I knew something was off with that statement since I had seen the advanced connection properties dialog and knew it must be related.
Third, test. One might be tempted to set the max actions per minute to a high value like 200, but keep in mind what happens to your infrastructure and hosts if you boot 200 VMs at once. Even though my PVS and storage could handle it, the CPUs in the host (2 x 6 core) simply maxed out much lower. Additionally, many PVS servers are substantially “smaller” in resources than the ones I was using because I had tuned them for the needs of the environment.
Finally, realize that you might be a corner case. While Citrix marketing makes it seem as though everyone is rolling XenDesktop with tens of thousands of VMs and therefore it must be a common scenario, the exact combination of factors that makes this particular issue occur might actually be quite rare. For example, with session hosts this would never be an issue, as they wouldn’t have power actions on logoff. Likewise, in a non-lab or otherwise less rotational setup these bursts might have been handled normally by the power-on buffer.