Andrew Rivers: 2014

Friday, November 07, 2014

BizTalk Host Polling Interval - Opportunities and Pitfalls

Scenario

In BizTalk circles it's well known that, to achieve low latency, it's useful to set the polling interval on the hosts to a low value (e.g. 50ms). This means that the host instances poll for work more frequently and hence messages spend less time at rest in the Message Box.

However, increasing the frequency of polling means that the host instances are necessarily placing additional load on the Message Box through increased SQL query loads. This "parasitic" load can sometimes consume as much as 20% of the CPU in SQL for a large BizTalk group with many hosts.

Each host allows the polling interval to be configured separately for messages and for orchestrations.

Opportunity to Reduce SQL Load

The parasitic load can be reduced if the workload is split across several hosts, and if each host handles a specific type of work.

For example, consider the following scenario: System A sends a message via MQ into BizTalk, which performs some processing via the orchestration engine.

Figure 1: Example integration scenario using MQ

In this case the hosts have been split into ReceiveHost, ProcessingHost and SendHost. The polling settings required are as follows:

ReceiveHost: Does not need to poll the Message Box for either messages or orchestrations. The threads in the receive adapter poll a message queue constantly, and the adapter only publishes to the Message Box; it never picks up from it.

ProcessingHost: Polls the Message Box for orchestrations but not for messages.

SendHost: Polls the Message Box for messages but not for orchestrations.

As you can see from this, if the polling settings are changed so that the hosts only poll as they need to, then actually 4 out of 6 polls on the Message Box are not required. This can alleviate load on SQL and contribute to performance improvement.

In practice we can't stop hosts from polling the Message Box altogether. Instead, by setting the polling interval to a high value (e.g. 1 minute) we can practically eliminate the polling load.
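As a rough illustration of the arithmetic, the sketch below compares the polling rate before and after tuning for the MQ scenario. It assumes one host instance per host and that every poll costs the Message Box the same amount, neither of which is guaranteed in a real group; the function and constant names are made up for this example.

```python
def polls_per_second(interval_ms: float) -> float:
    """Polls issued per second by one host instance for one poll type."""
    return 1000.0 / interval_ms


LOW_LATENCY_MS = 50      # example low-latency interval from the text
TUNED_OUT_MS = 60_000    # "1 minute", the high value used to tune a poll out

# Assumption: three hosts, one instance each, two poll types per host = 6 polls.
before = 6 * polls_per_second(LOW_LATENCY_MS)

# After tuning: the 2 required polls stay fast, the other 4 run once a minute.
after = 2 * polls_per_second(LOW_LATENCY_MS) + 4 * polls_per_second(TUNED_OUT_MS)

print(f"before: {before:.2f} polls/s, after: {after:.2f} polls/s")
```

Under these assumptions the four tuned-out polls contribute well under one poll per second between them, which is why a 1-minute interval is close enough to "off".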

The Effect of Web Services

Now, imagine that the system is remodelled so that we have essentially the same business process, but System A sends messages into BizTalk via a WCF web service rather than via MQ.

Figure 2: Example integration scenario with message delivery via web service.

This changes the host configuration, as now we have an isolated host on the receive side instead of an in-process host. This means that the receive processing is now performed under an IIS process and not under a BizTalk Windows service, albeit the web application will call through to BizTalk assemblies in order to interact with the BizTalk engine.

Gotcha #1: Web Services are Both Receive and Send

This time we have the following hosts: ReceiveIsolatedHost, ProcessingHost, SendHost. The polling settings required are as follows:

Receive Isolated Host: IIS requests are published to the Message Box, but for request-response (two-way) web services the orchestration sends its response back via the Message Box. Isolated hosts therefore need to poll for messages but do not need to poll for orchestrations.

ProcessingHost: Polls the Message Box for orchestrations but not for messages.

SendHost: Polls the Message Box for messages but not for orchestrations.

Interesting note: Isolated hosts are only used in messaging running under IIS. Although the BizTalk Admin Console allows an orchestration polling interval to be configured, the host will not be able to run orchestrations (see below).

This means that before we make any tuning adjustments we effectively have 5 host polls on the Message Box, 3 for messages and 2 for orchestrations. If we filter out the polling we don't need by setting the polling interval to a high value we will end up with one poll per host, reducing the polling load by 2/5.
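The poll counts for both scenarios can be checked with a small table. This is only a bookkeeping sketch: the dictionary layout and function name are invented here, and the isolated host is counted as having a message poll only, as in the text.

```python
# For each host: (polls that are active by default, polls that are actually needed).
SCENARIOS = {
    "MQ": {
        "ReceiveHost": ({"messages", "orchestrations"}, set()),
        "ProcessingHost": ({"messages", "orchestrations"}, {"orchestrations"}),
        "SendHost": ({"messages", "orchestrations"}, {"messages"}),
    },
    "WCF": {
        # Isolated hosts cannot run orchestrations, so only the message poll counts.
        "ReceiveIsolatedHost": ({"messages"}, {"messages"}),
        "ProcessingHost": ({"messages", "orchestrations"}, {"orchestrations"}),
        "SendHost": ({"messages", "orchestrations"}, {"messages"}),
    },
}


def redundant_polls(hosts):
    """Return (polls that can be tuned out, total polls) for one scenario."""
    total = sum(len(active) for active, _ in hosts.values())
    needed = sum(len(required) for _, required in hosts.values())
    return total - needed, total


for name, hosts in SCENARIOS.items():
    removable, total = redundant_polls(hosts)
    print(f"{name}: {removable} of {total} polls can be tuned out")
```

This reproduces the 4 out of 6 figure for the MQ scenario and the 2 out of 5 figure for the web service scenario.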

The use of an Isolated Host therefore reduces the scope to tune out polling.

Gotcha #2: Isolated Host Instances run under IIS

If we have a scaled-out BizTalk group with, say, 6 servers we usually divide up the workload by allocating host instances to a subset of servers. All BizTalk assemblies and artefacts are deployed on all boxes so that the host configuration can be reconfigured in the event of server losses.

For example, the initial host configuration for the first scenario above may be as follows:

Receive Host: Server 1 + Server 2

ProcessingHost: Server 3 + Server 4

Send Host: Server 5 + Server 6

In this case we have 2 host instances running on each host, and only the Processing Host and the Send Host are polling, therefore we are getting 4 host instances polling regularly, from servers 3-6.

When we deploy isolated hosts we will also typically deploy the web application onto all of the servers, but use load balancing to control the servers that will be receiving messages. This would, for example, be configured as follows:

Receive Isolated Host: Web app deployed on all servers, load balanced to Server 1 + Server 2

ProcessingHost: Server 3 + Server 4

Send Host: Server 5 + Server 6

In the event of server failure the load balancer will be configured to redirect traffic to active servers. However, as the Receive Isolated Host runs under IIS, if the website and App Pool in IIS are running on each server then they will still be polling for messages, even though servers 3-6 receive no traffic. In this case we have actually added 6 more processes polling the Message Box from IIS in addition to the 4 host instances that were previously polling. This is obviously a large increase in the parasitic load on the Message Box.
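The process count can be sketched as follows. The server names, and the assumption of exactly one polling IIS application pool per server for the Isolated Host, are illustrative only.

```python
# In-process host instances that poll regularly (from the example layout above).
BIZTALK_HOST_INSTANCES = {
    "ProcessingHost": ["Server3", "Server4"],
    "SendHost": ["Server5", "Server6"],
}
ALL_SERVERS = [f"Server{n}" for n in range(1, 7)]


def polling_processes(host_instances, iis_servers_running):
    """Count processes polling the Message Box.

    Assumption: one isolated-host app pool per server where IIS is running.
    """
    in_process = sum(len(servers) for servers in host_instances.values())
    isolated = len(iis_servers_running)
    return in_process + isolated


# Web app running on every server, regardless of where traffic is routed.
print(polling_processes(BIZTALK_HOST_INSTANCES, ALL_SERVERS))  # 10

# Web app running only on the two load-balanced servers.
print(polling_processes(BIZTALK_HOST_INSTANCES, ["Server1", "Server2"]))  # 6
```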

Possible Optimisation: Stop the website / application pool on the servers with no traffic. Note that operational procedures will then need to start these back up if the load is to be rebalanced.

Possible Optimisation: Reduce the number of Isolated Hosts on your BizTalk Group (subject to performance constraints), and run more receive locations under each Isolated Host.

Possible Optimisation: When publishing WCF services at development-time, create them as one-way instead of two-way. These services then return an HTTP 200 (success) when the message is published into the Message Box without requiring a response message. Isolated Hosts that publish one-way messages can therefore be tuned to poll infrequently as response messages are not expected. A limitation of this is that no information can be passed back to the caller, such as an error code, so this may not suit many design scenarios.

Gotcha #3: Isolated Hosts don't pick up settings in the same way

When making changes to BizTalk settings it is common practice to restart host instances to make sure that settings (such as polling intervals) are picked up straight away. However, Isolated Hosts do not pick up settings in the same way, as they do not run as BizTalk Windows services. From experience, the polling interval is not refreshed until the IIS process is recycled, either through a machine reboot or a recycle of the application pool.

Solution: When changing settings on an Isolated Host, recycle the relevant application pools in IIS to force a refresh of the settings (ideally this should be scripted into automated deployment processes).
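One way to script the recycle is to call the IIS appcmd tool from a deployment step. The sketch below is a minimal example rather than a tested deployment script: the app pool name is hypothetical, the appcmd location assumes a default IIS install, and it must be run with administrative rights on each server hosting the Isolated Host.

```python
import os
import subprocess


def appcmd_path() -> str:
    """Default location of appcmd.exe (assumes a standard IIS installation)."""
    windir = os.environ.get("WINDIR", r"C:\Windows")
    return windir + r"\System32\inetsrv\appcmd.exe"


def build_recycle_command(pool_name: str) -> list[str]:
    """Build the appcmd command line that recycles one application pool."""
    return [appcmd_path(), "recycle", "apppool", f"/apppool.name:{pool_name}"]


def recycle_app_pools(pool_names, runner=subprocess.run):
    """Recycle each pool in turn; raises if appcmd returns a non-zero exit code."""
    for name in pool_names:
        runner(build_recycle_command(name), check=True)


# Hypothetical pool name, shown only to illustrate the generated command.
print(build_recycle_command("BizTalkIsolatedHostAppPool"))
```

Calling recycle_app_pools with the real pool names straight after the settings change keeps the IIS processes in step with the restarted BizTalk host instances.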

Gotcha #4: Getting it wrong

Remember that if you tune out polling on a host that actually needs to poll, the impact on performance can be quite severe but you may not actually experience any errors. Effectively, messages and orchestrations queue up in the Message Box until the next polling interval, and they are then processed on as many threads as the host instance has available.

If you inadvertently set the polling interval to 1 minute to decrease polling load for a host that ought to be polling several times per second, this means that the service instances will simply stay on the spool as active (ready) instances and will wait for up to a minute to process. When the host polls the Message Box these messages will be picked up and processed as if nothing has happened.

The side effects can be as follows:

Spool count climbs and potentially causes throttling
Latency increases (in the example of a 1-minute poll, on average latency will increase by 30 seconds)
Throughput suffers (thread starvation may occur if large numbers of service instances are on the spool when the host polls, and they cannot be processed concurrently).
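The 30-second figure follows from messages arriving at random points within the polling interval. A minimal sketch, assuming evenly spread arrivals and that a message is picked up at the very next poll with no thread starvation:

```python
def wait_until_next_poll_ms(arrival_ms: int, interval_ms: int) -> int:
    """Time a message sits in the Message Box before the next poll picks it up."""
    return (-arrival_ms) % interval_ms


INTERVAL_MS = 60_000  # the 1-minute polling interval from the example

# Assumption: one message every 100 ms for 10 minutes, polls on the minute.
arrivals = range(0, 600_000, 100)
waits = [wait_until_next_poll_ms(t, INTERVAL_MS) for t in arrivals]

mean_wait_s = sum(waits) / len(waits) / 1000
max_wait_s = max(waits) / 1000

print(f"mean wait: {mean_wait_s:.2f}s, worst case: {max_wait_s:.1f}s")
```

The mean comes out at roughly half the polling interval, while individual messages wait anywhere from almost nothing to almost the full minute.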

Symptoms to watch for:

Spool count higher than expected, possibly associated with throttling.
Sawtooth pattern for the Spool Count performance counter in Perfmon.
Variation in end-to-end latency (don't just look at average latency stats, look at a plot of latency over time).
For Isolated Hosts that have not been set up to poll regularly, possible timeouts from the web service client, with response messages sat in the message box waiting to be picked up.
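The sawtooth shape is easy to reproduce with a toy model. The arrival rate below is invented, and the model assumes the host drains everything that is waiting at each poll, which is optimistic given the thread starvation mentioned above.

```python
def spool_counts(arrival_rate_per_s: int, poll_interval_s: int, duration_s: int):
    """Per-second count of items waiting for a host that polls infrequently."""
    counts = []
    waiting = 0
    for second in range(duration_s):
        if second % poll_interval_s == 0:
            waiting = 0  # the poll picks up everything that has queued (simplification)
        waiting += arrival_rate_per_s
        counts.append(waiting)
    return counts


# Hypothetical load: 10 messages per second against a 60-second polling interval.
counts = spool_counts(arrival_rate_per_s=10, poll_interval_s=60, duration_s=180)

# Sample every 15 seconds to show the climb and the drop after each poll.
for second in range(0, 180, 15):
    print(f"t={second:3d}s waiting={counts[second]}")
```

The count climbs steadily for a minute and then collapses, which is the pattern to look for in the Spool Count counter.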

Real Life Scenario

The above observations come from a real-life incident I know of, related to me by someone who I used to work with.

An Isolated Host had the message polling interval increased to reduce Message Box load, and the change had passed functional and performance testing. This was because our "System A" had been replaced by a test injector that had unlimited connections and a large enough timeout to cope with the polling interval.

The resulting production issue came when the real-life System A was timing out sending messages to BizTalk and was not getting a response in time. To compound matters, the system had a limited number of connections, and there was a backlog of outbound work that needed to be sent.

To make matters even worse the setting had been changed literally weeks earlier and had never been picked up by the Isolated Host, because the BizTalk host instances had been restarted but the app pools had not been recycled.

A regular patch window had caused the servers to be rebooted, picking up the new settings. This led everyone to assume that the cause of the issue was the Windows Update and not the fact that a setting change introduced weeks earlier had finally been activated.

The result was an incident that affected service by reducing throughput (without loss of service, and without generating meaningful errors). Diagnosis was made more difficult because attention centred first on the update, and then because the error reports gave average response times from the web service rather than latency over time, which led the investigators to look initially for something that could be running slowly all of the time.

Summary

It is possible to reduce the parasitic load on the Message Box in BizTalk by tuning the polling settings, and hence improve overall performance.

Be careful that you properly performance test any changes and make sure that all processes are properly refreshed if any settings are updated. This should be done in your automated deployment scripts to ensure that it happens every time.

Also, make sure that any test fixtures mimic the intended production systems in terms of client timeouts, number of connections and queueing mechanisms to get a realistic simulation.

When you have latency issues, make sure that you look at behaviour over time, preferably from raw data, and don't rely on any single data point or on averages, as these may be misleading.

Make sure that your test environments are patched and regression tested in the same way that you would do for application changes. If server patches are known to pass regression test then they can be discounted as a root cause of production issues.

Friday, October 10, 2014

Thoughts on inaugural #c9d9 online discussion panel

On Wednesday I took part in an online discussion on Agile / Continuous Integration / DevOps / Continuous Delivery organised by Electric Cloud (see http://electric-cloud.com/blog/2014/10/c9d9-continuous-discussions-episode-1-recap/).

I was really happy to be invited to talk as a panellist on the discussion, especially seeing the very high calibre of the other individuals on the panel. It was great to be included among such an insightful and talented group of people.

One of the things that struck me was that there was widespread agreement among the panellists of the benefits of Agile / CI / DevOps / CD, and many of the barriers that were encountered were also the same.

In the programme I work on at the moment I feel that we have quite a slick delivery machine that incorporates:

Development - making code changes
Automated CI build and scheduled build
"Touch of a button" scripted tear down / rebuild of test environments (both functional and performance), with full redeploy or upgrade of the software
Comprehensive integration tests
Ability to "ship" as soon as the software on test has passed QA

However, within all this goodness there are some significant barriers. I don't want to beat up my end customer too much, because I understand that if my software goes down for any length of time they're pretty much out of business. It's a big responsibility. However...

The end customer is ultimately responsible for integrating, final QA and release into production. The point was made on the discussion panel that if you look at your development process as a pipeline you can only move at the speed of the slowest stage. This is definitely an issue where we are at the moment: we have the ability to produce much more value, but we are constrained because the end of the pipe cannot absorb change quickly enough.
As a consequence of this, as releases become further and further apart they become larger and hence have a higher level of inherent risk. It is an irony that the risk-aversion that leads to endless test cycles of an integrated solution ends up slowing down the deployments so much that each deployment becomes ever riskier.
One of the keys to solving this is DevOps. Within my dev programme we've got great build and infrastructure guys embedded so we can all work together to create our slick process. At the end customer the various teams involved are all in different organisational silos (and often in different organisations altogether). The multi-functional teams required to make releases slick and frequent are not in place.
This makes for an interesting observation. In software, quality comes from automation, because only by using automation can we be truly repeatable, and a precondition of software quality is repeatability. But - again an excellent observation made on the panel - automation also frees software developers from mundane repetitive tasks so that they can focus on spending more time delivering solutions. We should embrace automation; it is our friend.
Also, and this is my own reflection on this, true agility (as opposed to agile methodologies, which I prefer to regard as strategies for managing change) comes from adopting best practices. It involves setting up the software development / test machine (note that these two cannot be separated) so that when we change the solution we get feedback on the level of quality quickly. As anyone with a knowledge of control circuits knows: if you want a stable system the speed of change has to be slower than the speed of feedback.
Which brings me back to where I was before. Long release cycles involve slow feedback. We try to get around this by delivering into a QA environment several times a day (fast feedback). However, we are at the mercy of integration issues at the other end of the pipeline, and there the feedback can take months. This causes us issues because we are regularly working on updated requirements to fix integration issues - not because this adds more business value, but because slow integration did not catch this early in the cycle.

A great discussion, much food for thought for me about where I want to take this. My current programme is winding down at the moment, and I'm thinking of what I want to work on next year. All I know for sure is that I want MORE CI, MORE DevOps and MORE Continuous Delivery!

And to work on big stuff.

Many thanks to everyone involved in #c9d9.

Friday, September 19, 2014

Details of next month's online discussion on Continuous Delivery

*I'm a panellist on the first "Continuous Discussion" community panel - you're invited!*

On October 8th I'll be joining an online panel of experts and practitioners in Agile development, DevOps, CI and Continuous Delivery, hosted by Electric Cloud, who have done large Continuous Delivery projects for organizations like SpaceX, Cisco, GE and E*TRADE. Other panellists include J. Randall Hunt, evangelist at AWS, Sriram Narayan, IT principal at ThoughtWorks, and Carlos Sanchez, Apache member and frequent speaker on CD/DevOps.

In the online panel, the other panellists and I will run through a few topics which relate to hands-on implementation of these methodologies in various technology and business environments. We'll share our real-life experiences, challenges and quick wins. It will be a "grass roots" discussion of Continuous Delivery and how/to what extent we can implement it.

From my perspective I'd like to recall a few of my own experiences that highlight the need for good DevOps and Continuous Integration. There have been some moments of disaster, some moments of success, some moments of epiphany.

Also, having worked on the same programme for the last 5 years I've been quite insulated from the outside world, and I'd really like to find out more about what other people are doing out there to improve the software development practices.

Join me at the online panel on October 8, 6-7pm BST.

Please follow this link for more details from the organisers:
http://electric-cloud.com/blog/2014/09/continuous-discussion-online-panel/

Add to your calendar:
http://electric-cloud.com/wp-content/uploads/2014/09/Continuous-Discussion-1-Community-Panel.ics

Got any questions, tips or things you’d like me to cover in the panel relating to Agile, CI, DevOps or Continuous Delivery? Don’t hesitate to comment or message me!

Wednesday, September 10, 2014

Time to Start Speaking Again!

I've been heads-down on project work for so long now that I haven't really got round to posting thoughts and sharing things I've learned. I've decided that it's time to change this now.

I'm really excited that I've been invited to talk on a panel about Agile, DevOps and Continuous Delivery, which are areas of my job that I've been trying to get better at for years and am still learning.

I will share more details of this session when they are confirmed, but it's on October 8th at 10am PDT (6pm in the UK). In the meantime I'll get some more thoughts together on what I've found interesting lately.

Andrew Rivers