An issue presented itself while load balancing Exchange 2013 servers with a Citrix NetScaler VPX (virtual appliance): the virtual servers and service groups configured on the NetScaler showed some services on one of the two Exchange servers as down.
The configured monitors check for the presence of the healthcheck.htm file. With Health Probe Checking, Microsoft has given us a way to check each virtual directory (vdir) in Exchange 2013 for availability and health. Healthcheck.htm is served for a vdir when it is healthy and is not served when it is unhealthy.
Each of the monitors configured in the NetScaler relies on /healthcheck.htm to determine whether an Exchange service area is available, based on the state of the component associated with a specific protocol. It is important to note that healthcheck.htm exists only in memory, and only while the component state is active for its associated protocol.
Below are the Exchange 2013 vdirs and the associated component protocol URLs available to perform cursory health checks of Exchange.
NOTE: Replace “IPADDRESS” with the internal IP address, the NetBIOS name or the internal FQDN of an Exchange 2013 server. In a load balanced scenario, the internal IP address or NetBIOS name will bypass any DNS records or VIPs that direct traffic through the load balancer.
https://IPADDRESS/Autodiscover/healthcheck.htm
https://IPADDRESS/ecp/healthcheck.htm
https://IPADDRESS/EWS/healthcheck.htm
https://IPADDRESS/mapi/healthcheck.htm
https://IPADDRESS/Microsoft-Server-ActiveSync/healthcheck.htm
https://IPADDRESS/OAB/healthcheck.htm
https://IPADDRESS/owa/healthcheck.htm
https://IPADDRESS/PowerShell/healthcheck.htm
https://IPADDRESS/Rpc/healthcheck.htm
When checking each of those URLs, we should receive a “200 OK” message with the internal FQDN of the server we are targeting when the health of the vdir is good…
…And if the health of the vdir is bad, the result will show “This page can’t be displayed”.
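The manual check above can also be scripted. What follows is only a minimal sketch in Python, not anything the NetScaler or Exchange ships with: the function names (healthcheck_urls, probe, fetch_status_https) are hypothetical, the vdir list is copied from the URLs above, and treating every response other than 200 (including a failed connection) as unhealthy is my assumption about how to read the results. Certificate validation is switched off on the assumption that the probe targets an internal IP address or NetBIOS name that will not match the certificate.

```python
import ssl
import urllib.request

# Virtual directories taken from the URL list above.
VDIRS = [
    "Autodiscover", "ecp", "EWS", "mapi", "Microsoft-Server-ActiveSync",
    "OAB", "owa", "PowerShell", "Rpc",
]


def healthcheck_urls(host):
    """Build the healthcheck.htm URL for every vdir on one server."""
    return {vdir: f"https://{host}/{vdir}/healthcheck.htm" for vdir in VDIRS}


def fetch_status_https(url, timeout=5):
    """Return the HTTP status code for url (hypothetical default fetcher).

    Assumption: certificate checks are disabled because the probe goes
    straight to a server by IP or NetBIOS name, bypassing the load balancer.
    """
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(url, context=ctx, timeout=timeout) as resp:
        return resp.status


def probe(host, fetch_status=fetch_status_https):
    """Map each vdir to 'healthy' (status 200) or 'unhealthy' (anything else)."""
    results = {}
    for vdir, url in healthcheck_urls(host).items():
        try:
            status = fetch_status(url)
        except OSError:
            # Covers refused connections, timeouts and HTTP error responses.
            status = None
        results[vdir] = "healthy" if status == 200 else "unhealthy"
    return results
```

Calling probe("IPADDRESS") for each Exchange server and comparing the two results side by side should show which vdirs the load balancer would mark as down.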
NOTE: At the time this issue presented itself, the Exchange 2013 servers were running CU7. Additionally, there are 2 multi-role servers which are members of a database availability group (DAG) hosting 1 active database each.
The “problem” server was investigated for health issues using Get-HealthReport “SERVERNAME”, which identified the AlertValue of several HealthSets as “unhealthy”. The databases were even moved back and forth in an attempt to “shake something loose”. Further troubleshooting proved fruitless and the server was ultimately restarted. However, the restart did nothing to resolve the issue either; running Get-HealthReport provided the same results. At this point, the server was determined to be working normally, as users could still access mailboxes via Outlook, OWA and smart devices. Therefore, it was decided to schedule another day to dig into the issue.
Along comes CU8… On Tue 17 Mar 2015, with the release of CU8 for Exchange 2013, I was all over updating the “problem” server as a possible resolution. The server was updated and restarted. Sure enough the NetScaler was happy again and reported all target services up.
However, that was very short-lived. Apparently, the “problem” server showed healthy from the NetScaler’s perspective for just a brief moment and then showed the same services on the same server being inaccessible.
I really was at a loss and didn’t know what to search for. As we all do, I made a request to the community on Twitter and then went to Bing searching for anything that referenced “healthcheck.htm”. Findings were minuscule and hope was all but lost when a crumb was found in the TechNet Exchange Server Forums that provided some welcome guidance.
As was suggested, Get-ServerComponentState “SERVERNAME” was run against the “problem” server, which showed that 5 of the 7 components the NetScaler was monitoring were in an “inactive” state. Ah ha! And when the same command was run on the other Exchange server, the state of all the components showed active.
Though a restart of the server did not address the issue previously, I went ahead and restarted the ‘Microsoft Exchange Health Manager’ service. That did not help either. I went as far as attempting to force the components into an active state with the following command. Still, no joy.
Set-ServerComponentState -Identity "SERVERNAME" -Component "COMPONENTNAME" -State Active -Requester Maintenance
Next, looking in ADUC at the ‘Microsoft Exchange System Objects/Monitoring Mailboxes’ OU, twenty-six (26) “HealthMailboxes” were found. This seemed like a really high number of mailboxes given what was stated in the forum post. However, with an article posted by the Exchange Team on Fri 20 Mar 2015, it appeared that the number of monitoring mailboxes was about right.
NOTE: Based on the Exchange Team article as I understand it, my environment (with 2 multi-role servers and 2 mailbox databases [with DAG copies]) should have 24 monitoring mailboxes. But there were 26.
We can also use EMS to get a list of the monitoring mailboxes we have using one of these two commands…
Get-Mailbox -Monitoring
Get-Mailbox -Monitoring | ft -auto
…or get a count of the number of monitoring mailboxes we have with this command…
(Get-Mailbox -Monitoring).count
Moving on…
I deleted all 26 of the monitoring mailboxes and, as expected, new monitoring mailboxes were created once the ‘Microsoft Exchange Health Manager’ service was restarted. For good measure, the “problem” server was restarted. After the restart, interestingly enough, only 24 monitoring mailboxes were created. That leads me to believe there must have been a few corrupted monitoring mailboxes. I also restarted the other DAG member (the non-problematic Exchange server).
Now, for the moment of truth…
{Drum roll}
Get-ServerComponentState was run against the “problem” server again. This time, all components that previously were in an inactive state now showed active. Huzzah!!!
The NetScaler? … Currently, it’s as happy as I am.
And Get-HealthReport? … Every ‘HealthSet’ showed the ‘AlertValue’ of healthy.
Conclusion
This was a relatively easy fix for what was a couple days of research and frustration. Ultimately, we could not track down the source of why the problem began in the first place; but considering everything was fine the day before, I chalk it up to some database failover testing we were performing at the time the issue presented itself. I can safely say that many environments have this issue present but Exchange admins are unaware of it. You don’t need a load balancer to find out there is a problem. And it doesn’t only present itself when a DAG is configured.
Again, good luck and have fun! Happy troubleshooting!
Reference(s):
- Load Balancing in Exchange 2013 (The Exchange Team Blog; Refer to section on ‘Health Probe Checking’)
- healthcheck.htm (TechNet Forums)
- Exchange 2013 Corrupted Health Mailboxes
- Recreate Exchange 2013 Health Mailboxes (The UC Guy)
- How to Use Managed Availability in Exchange 2013 with your Load Balancer
- Exchange 2013 Health Check Monitors and Journaling (Jeff Guillet)
- Exchange 2013 Monitoring Mailboxes (The Exchange Team Blog)
- Get-HealthReport (TechNet)
- Get-ServerComponentState (TechNet)
- Set-ServerComponentState (TechNet)
- Managed Availability (TechNet)
- NetScaler Application Delivery Controller (Citrix)
Hi
Thanks for your valuable information. Just curious: do you have an Exchange Server configuration document for the Citrix LB? The official Citrix document is not very clear. ssadoglu at ereteam dot com
Thanks
IMO, the doc Citrix has on this subject is disjointed, incomplete and difficult to follow.
I am currently in the process of working on a document outlining the process; from the GUI perspective. But it may be a few weeks due to other priorities.
In the meantime, I will refer you to a blog post by Jetze Mellema, who is also working on documentation for using a NetScaler to load balance Exchange … http://jetzemellema.blogspot.com/2015/03/citrix-netscaler-configuration-notes.html
Hi,
Is the document done? If so, could you please send it to me?
Thanks,
Brilliant article! Fixed all my issues, thanks so much for taking the time to post a very clear and easy to follow article.
Thank you, Ziv.
Hi Todd
Is the document done? If so, could you please send it to me?
Thanks,
I don’t have it done, and the timeframe for completion is not near. Sorry.
A great article on Exchange LB for netscaler: http://danielruiz.net/2015/05/26/exchange-2013-layer-7-single-namespace-loadbalancing-with-citrix-netscaler/
On a side note, I’m still having trouble with the ECP service healthcheck after deleting health mailboxes and recreating. Anybody have other suggestions for this issue?
I am having the same problems. I have two servers in a DAG, both MSX2013 CU13. Both servers show identical unhealthy health sets. I tried deleting the monitoring mailboxes, but to no avail. Could there be an error with my split DNS configuration, or perhaps with the bindings in IIS? I am running out of options 😦
Just faced the exact same issue today. All I had to do was to restart MSExchangeHM (Microsoft Exchange Health Manager) and all HealthMonitors came back to green. Thanks!
Exchange 2019 and NetScaler VPX don’t work together: the OWA monitor reports “Failure – Time out during SSL handshake stage”.