Friday, November 07, 2014

BizTalk Host Polling Interval - Opportunities and Pitfalls

Scenario


In BizTalk circles it's well known that in order to achieve low latency it's useful to set the polling interval on the hosts to a low value (e.g. 50ms).  This means that the host instances poll for work more frequently and hence messages spend less time at rest in the Message Box.

However, increasing the frequency of polling means that the host instances place additional load on the Message Box through more frequent SQL queries.  This "parasitic" load can sometimes consume as much as 20% of the CPU in SQL for a large BizTalk group with many hosts.

Each host allows separate polling interval values to be configured for messages and for orchestrations.
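
As a rough rule of thumb, a message that is published at a random point within a polling cycle waits on average half the polling interval before the host picks it up, and it pays that cost at every Message Box hop that depends on polling.  The sketch below (Python, purely illustrative; the two-hop assumption is an example, not a measurement) shows why 50ms feels instantaneous while a 1-minute interval adds roughly 30 seconds per hop:

    # Rough model: a message lands at a random point in the polling cycle, so on
    # average it waits about half the polling interval at each hop where a host
    # must poll the Message Box before picking it up.
    def added_latency_ms(polling_interval_ms: float, polled_hops: int) -> float:
        return (polling_interval_ms / 2.0) * polled_hops

    # Illustrative only: assume two polled hops (orchestration pick-up and send
    # pick-up) in a receive -> orchestration -> send flow.
    for interval_ms in (50, 500, 60_000):
        print(f"{interval_ms:>6} ms polling -> ~{added_latency_ms(interval_ms, 2):,.0f} ms average added latency")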

Opportunity to Reduce SQL Load

The parasitic load can be reduced if the workload is split across several hosts, and if each host handles a specific type of work.

For example, consider the following scenario:  System A sends a message via MQ into BizTalk, which performs some processing via the orchestration engine.

Figure 1:  Example integration scenario using MQ

In this case the hosts have been split into ReceiveHost, ProcessingHost and SendHost.  The polling settings required are as follows:

Receive Host:  Does not need to poll the Message Box for either messages or orchestrations.  The threads in the receive adapter poll a message queue constantly, and the adapter only publishes to the Message Box; it never picks up from it.

ProcessingHost: Polls the Message Box for orchestrations but not for messages.

SendHost: Polls the Message Box for messages but not for orchestrations.

As you can see from this, if the polling settings are changed so that the hosts only poll as they need to, then 4 out of the 6 possible polls on the Message Box are not required.  This can alleviate load on SQL and contribute to a performance improvement.

In practice we can't stop hosts from polling the Message Box altogether, but by setting the interval for the polls we don't need to a high value (e.g. 1 minute) we can practically eliminate the polling load.
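
One way to make that tuning decision explicit is to write it down as a per-host polling plan: keep a low interval only where the host genuinely needs to poll, and push everything else up to a long interval.  A minimal sketch of that idea in Python (the host names and the 50ms / 1-minute values are simply the examples used above, not recommendations):

    # Which Message Box polls each host actually needs in the MQ scenario
    # (True = needed, False = can be tuned to a long interval).
    polling_needs = {
        "ReceiveHost":    {"messages": False, "orchestrations": False},
        "ProcessingHost": {"messages": False, "orchestrations": True},
        "SendHost":       {"messages": True,  "orchestrations": False},
    }

    LOW_MS, HIGH_MS = 50, 60_000  # low-latency poll vs. effectively tuned out

    plan = {
        host: {poll: (LOW_MS if needed else HIGH_MS) for poll, needed in needs.items()}
        for host, needs in polling_needs.items()
    }

    tuned_out = sum(not needed for needs in polling_needs.values() for needed in needs.values())
    total = sum(len(needs) for needs in polling_needs.values())
    print(plan)
    print(f"{tuned_out} of {total} polls tuned out")  # 4 of 6 in this scenario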

The Effect of Web Services

Now, imagine that the system is remodelled so that we have essentially the same business process, but System A sends messages into BizTalk via a WCF web service rather than via MQ.  

Figure 2:  Example integration scenario with message delivery via web service.

This changes the host configuration, as we now have an isolated host on the receive side instead of an in-process host.  This means that the receive processing is performed under an IIS process and not under a BizTalk Windows service, although the web application calls through to BizTalk assemblies in order to interact with the BizTalk engine.

Gotcha #1:  Web Services are Both Receive and Send

This time we have the following hosts:  ReceiveIsolatedHost, ProcessingHost and SendHost.  The polling settings required are as follows:

Receive Isolated Host:  IIS requests are published to the Message Box, but for two-way (request-response) web services the orchestration sends the response back via the Message Box; therefore isolated hosts need to poll for messages but do not need to poll for orchestrations.

ProcessingHost: Polls the Message Box for orchestrations but not for messages.

SendHost: Polls the Message Box for messages but not for orchestrations.

Interesting note:  Isolated hosts are only used for messaging running under IIS.  Although the BizTalk Admin Console allows an orchestration polling interval to be configured, an isolated host will not be able to run orchestrations.


This means that before we make any tuning adjustments we effectively have 5 host polls on the Message Box: 3 for messages and 2 for orchestrations.  If we filter out the polling we don't need by setting the polling interval to a high value, we end up with one poll per host, reducing the polling load by 2/5.

The use of an Isolated Host therefore reduces the scope to tune out polling.

Gotcha #2: Isolated Host Instances run under IIS


If we have a scaled-out BizTalk group with, say, 6 servers, we usually divide up the workload by allocating host instances to a subset of servers.  All BizTalk assemblies and artefacts are deployed on all boxes so that the hosts can be reconfigured in the event of server losses.

For example, the initial host configuration for the first scenario above may be as follows:

Receive Host: Server 1 + Server 2
ProcessingHost: Server 3 + Server 4
Send Host: Server 5 + Server 6

In this case we have two host instances running for each host, and only the Processing Host and the Send Host are polling, so we have 4 host instances polling regularly, from servers 3-6.

When we deploy isolated hosts we will also typically deploy the web application onto all of the servers, but use load balancing to control the servers that will be receiving messages.  This would, for example, be configured as follows:

Receive Isolated Host: Web app deployed on all servers, load balanced to Server 1 + Server 2
ProcessingHost: Server 3 + Server 4
Send Host: Server 5 + Server 6

In the event of a server failure the load balancer will be configured to redirect traffic to the active servers.  However, as the Receive Isolated Host runs under IIS, if the website and App Pool are running in IIS on every server then each of them will still be polling for messages, even though servers 3-6 receive no traffic.  In this case we have actually added 6 more processes polling the Message Box from IIS, in addition to the 4 host instances that were previously polling.  This is obviously a large increase in the parasitic load on the Message Box.

Possible Optimisation:  Stop the website / application pool on the servers with no traffic.  Note that operational procedures will then need to start these back up if the load is to be rebalanced.
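
That stop/start can be scripted so that it is repeatable and easy to reverse.  A minimal sketch using IIS's appcmd.exe, run locally on each standby server (the application pool name is a hypothetical placeholder; substitute whatever pool your isolated host's web application runs under):

    import subprocess

    # appcmd.exe ships with IIS on a default installation.
    APPCMD = r"C:\Windows\System32\inetsrv\appcmd.exe"
    APP_POOL = "BizTalkIsolatedReceivePool"  # hypothetical name - use your own

    def set_app_pool_state(start: bool) -> None:
        """Stop or start the isolated host's application pool on this server."""
        verb = "start" if start else "stop"
        subprocess.run([APPCMD, verb, "apppool", f"/apppool.name:{APP_POOL}"], check=True)

    # Stop the pool on a standby server so its isolated host stops polling the
    # Message Box; run it again with start=True when rebalancing the load.
    set_app_pool_state(start=False)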

Possible Optimisation:  Reduce the number of Isolated Hosts on your BizTalk Group (subject to performance constraints), and run more receive locations under each Isolated Host.

Possible Optimisation:  When publishing WCF services at development time, create them as one-way instead of two-way.  These services then return an HTTP 200 (success) once the message is published into the Message Box, without requiring a response message.  Isolated Hosts that publish only one-way messages can therefore be tuned to poll infrequently, as response messages are not expected.  A limitation of this is that no information can be passed back to the caller, such as an error code, so this may not suit many design scenarios.

Gotcha #3: Isolated Hosts don't pick up settings in the same way


When making changes to BizTalk settings it is common practice to restart host instances to make sure that settings (such as polling intervals) are picked up straight away.  However, Isolated Hosts do not pick up settings in the same way, as they do not run under BizTalk Windows services.  From experience, the polling interval is not refreshed until the IIS process is recycled, either through a machine reboot or a recycle of the application pool.

Solution:  When changing settings on an Isolated Host, recycle the relevant application pools in IIS to force a refresh of the settings (ideally this should be scripted into automated deployment processes).
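
For example, the tail end of a deployment script could recycle the relevant pools along these lines (again a sketch: appcmd.exe is the standard IIS command-line tool on the local server, and the pool name is a hypothetical placeholder):

    import subprocess

    APPCMD = r"C:\Windows\System32\inetsrv\appcmd.exe"

    # Recycle the pools hosting isolated receive locations so that updated
    # BizTalk settings (such as polling intervals) are picked up immediately.
    for pool in ("BizTalkIsolatedReceivePool",):  # hypothetical name(s)
        subprocess.run([APPCMD, "recycle", "apppool", f"/apppool.name:{pool}"], check=True)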

Gotcha #4: Getting it wrong

Remember that if you tune out polling on a host that actually needs to poll, the impact on performance can be quite severe but you may not actually experience any errors.  Effectively, messages and orchestrations queue up in the Message Box until the next polling interval, and they are then processed on as many threads as the host instance has available.

If you inadvertently set the polling interval to 1 minute to decrease polling load on a host that ought to be polling several times per second, the service instances will simply stay on the spool as active (ready) instances and wait for up to a minute to be processed.  When the host polls the Message Box these messages will be picked up and processed as if nothing had happened.

The side effects can be as follows:
  • Spool count climbs and potentially causes throttling
  • Latency increases (in the example of a 1-minute poll, on average latency will increase by 30 seconds)
  • Throughput suffers (thread starvation may occur if large numbers of service instances are on the spool when the host polls, and they cannot be processed concurrently).
Symptoms to watch for:
  • Spool count higher than expected, possibly associated with throttling.
  • Sawtooth pattern for the Spool Count performance counter in Perfmon.
  • Variation in end-to-end latency (don't just look at average latency stats, look at a plot of latency over time).
  • For Isolated Hosts that have not been set up to poll regularly, possible timeouts from the web service client, with response messages sitting in the Message Box waiting to be picked up.

Real Life Scenario


The above observations come from a real-life incident I know of, related to me by someone I used to work with.

An Isolated Host had its message polling interval increased to reduce Message Box load, and the change passed functional and performance testing.  It passed only because our "System A" had been replaced by a test injector that had unlimited connections and a timeout large enough to cope with the polling interval.

The resulting production issue came when the real-life System A began timing out when sending messages to BizTalk, because it was not getting responses in time.  To compound matters, the system had a limited number of connections, and there was a backlog of outbound work that needed to be sent.

To make matters even worse, the setting had been changed weeks earlier and had never been picked up by the Isolated Host, because the BizTalk host instances had been restarted but the app pools had not been recycled.

A regular patch window had caused the servers to be rebooted, finally picking up the new settings.  This led everyone to assume that the cause of the issue was the Windows Update, and not the fact that a setting change introduced weeks earlier had at last been activated.

The result was an incident that affected service by reducing throughput (without loss of service, and without generating meaningful errors).  Diagnosis was made more difficult because attention centred first on the update, and because the error reports gave average response times from the web service rather than latency over time, which initially led the investigators to look for something that was running slowly all of the time.

Summary


It is possible to reduce the parasitic load on the Message Box in BizTalk by tuning the polling settings, and hence improve overall performance.  

Be careful to performance test any changes properly, and make sure that all processes are refreshed whenever settings are updated.  This should be done in your automated deployment scripts to ensure that it happens every time.

Also, make sure that any test fixtures mimic the intended production systems in terms of client timeouts, number of connections and queueing mechanisms to get a realistic simulation.

When you have latency issues, make sure that you look at behaviour over time, preferably from raw data, and don't rely on single data points or averages, as these may be misleading.

Make sure that your test environments are patched and regression tested in the same way that you would do for application changes.  If server patches are known to pass regression test then they can be discounted as a root cause of production issues.