Recent load testing using mod_jk 1.2.14 has uncovered some issues when using the load balance worker.
We have 1 web server running Apache httpd 2, dispatching requests to 4 app servers using mod_jk lb load balancer. The relevant workers.properties config looks like this:
This site is a failover/redirect page for a very popular and busy online web application. If the main site goes down, the error page asks people to check this site for updates on when it will be back. As such the usage pattern is rather odd: from close to no traffic it might suddenly jump to hundreds of concurrent requests.
The site has been carefully built to ensure that page load times are kept very low. On average the JSP pages (excluding static resources i.e. images, stylesheets, scripts) have a load time of around 3 ms.
Despite this, we recently experienced some problems when we suddenly faced a lot of traffic to the site. This prompted further investigation, and the results are described below.
To look at how the load balancer is working we're using the jkstatus feature of mod_jk:
worker.jkstatus.type=status
The web server is set to accept 500 concurrent connections, the tomcats are set to unlimited AJP connections.
I'm using "ab" which is a poor comparison to real life for many reasons, but I suspect that some of the issues we've experienced with real live load can be provoked using this tool.
Election method R to consider busy queue?
As it happens 2 of these backhosts are older/slower hosts and also are hosting a few more sites than the other two.
The "R" election method simply looks at a request counter for each worker and chooses the worker that has had the fewest requests sent to it. The result is a kind of round robin, where each app host receives the same number of requests.
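To make the behaviour described above concrete, here is a minimal sketch in Python (not mod_jk's actual C code). The Worker class, the worker names and the elect_by_requests function are hypothetical and only model the "fewest requests so far" rule as I understand it.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    name: str
    requests: int = 0  # total requests dispatched to this worker so far
    busy: int = 0      # requests currently in flight (ignored by this method)


def elect_by_requests(workers):
    """Pick the worker that has been sent the fewest requests so far."""
    chosen = min(workers, key=lambda w: w.requests)
    chosen.requests += 1
    return chosen


workers = [Worker("fast1"), Worker("fast2"), Worker("slow1"), Worker("slow2")]
for _ in range(1000):
    elect_by_requests(workers)

counts = {w.name: w.requests for w in workers}
print(counts)
```

Every worker ends up with an equal share of the requests, regardless of how quickly it is able to finish them, because the busy field never enters the decision.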
When loading the site heavily over a long period (i.e. ab -c 300 -n 1000000 http://my.site/page) this algorithm presents a slight issue with our two older hosts.
Looking at the status page, each host starts to show a constant figure of "Busy" workers, effectively a backlog of requests to go through. Each additional request makes the app server run a bit more slowly, so the average response time for all requests to that app server increases. At this point our two older hosts have a longer busy queue than our newer ones, which seems logical.
However, since the election algorithm does not appear to take the busy queue into consideration, we can gradually increase the load until our older hosts plunge into a downward spiral: each new request slows down the host, which makes the busy queue grow longer and longer until the host is completely stalled. This happens while the other hosts have shorter queues and could take the load instead.
The lbfactor setting for the workers is of little use to us. The overall load on our app hosts changes frequently and with the time of day, so tuning this value to achieve balanced busy queues for one particular load situation will not solve the problem.
The maxProcessors setting (tomcat 5.0) on the AJP connector doesn't seem to affect mod_jk behaviour. Limiting the requests to 125 here will just make tomcat give up sooner with an error message, however this does not stop mod_jk from continuing to dispatch more requests to the host.
Oct 26, 2005 5:41:27 AM org.apache.tomcat.util.threads.ThreadPool logFull
SEVERE: All threads (125) are currently busy, waiting. Increase maxThreads (125) or check the servlet status
The solution would appear to be to take the busy queue into consideration when electing a worker. The inverse of the length of the queue could be used as a weight factor in addition to the request counter.
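One possible reading of this suggestion, sketched in Python under stated assumptions: each election still picks the lowest counter, but the elected worker's counter grows by 1 + busy instead of 1, so a worker is elected roughly in proportion to 1 / (1 + busy). The score field, the cost formula and the backlog numbers are all hypothetical and are not taken from mod_jk.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    name: str
    busy: int = 0       # requests currently in flight on this worker
    score: float = 0.0  # weighted request counter used for the election


def elect_busy_aware(workers):
    """Pick the lowest score, then charge the winner more if it is backed up."""
    chosen = min(workers, key=lambda w: w.score)
    # A long busy queue makes each dispatched request "cost" more, so the
    # worker's effective weight is the inverse of (1 + queue length).
    chosen.score += 1 + chosen.busy
    return chosen


# Frozen snapshot of the backlog; the numbers are made up for illustration.
workers = [
    Worker("fast1", busy=1),
    Worker("fast2", busy=1),
    Worker("slow1", busy=9),
    Worker("slow2", busy=9),
]
picks = {w.name: 0 for w in workers}
for _ in range(1200):
    picks[elect_busy_aware(workers).name] += 1

print(picks)
```

With this frozen snapshot the two lightly loaded hosts are elected five times as often as the two backed-up ones. In a real balancer the busy figure would change on every dispatch and completion, which this sketch deliberately leaves out.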
Election method T under constant heavy load
Election method T elects a worker based on the amount of traffic sent to each worker. Again, we repeated the constant load test, this time using ab -c 150 -n 1000000 http://my.site/page. By refreshing the status page repeatedly we observed that this election method chooses the same worker for around a second or so, effectively putting a busy queue of 150 on one worker whilst the others are at 0. In other words, we're blitzing one worker at a time with the whole load.
I can only speculate as to why this is, but if, for instance, the traffic counters on the workers are only updated every second, then simply looking at those counters would elect the same worker for a second at a time. Needless to say, this would not work very well under high load.
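The speculation above can be illustrated with a small Python simulation. It assumes, purely hypothetically, that the balancer only sees a snapshot of the traffic counters that is refreshed once per interval; the worker names, request count and byte size are made up, and this is not a description of how mod_jk actually maintains its counters.

```python
def elect_by_traffic(snapshot):
    """Pick the worker with the lowest traffic figure in the given snapshot."""
    return min(snapshot, key=snapshot.get)


live = {"w1": 0, "w2": 0, "w3": 0, "w4": 0}  # bytes actually transferred
snapshot = dict(live)                         # stale copy the balancer reads

REQUESTS_PER_INTERVAL = 150  # hypothetical: matches ab -c 150
BYTES_PER_REQUEST = 10_000   # hypothetical response size

history = []
for interval in range(4):
    picked = set()
    for _ in range(REQUESTS_PER_INTERVAL):
        name = elect_by_traffic(snapshot)
        live[name] += BYTES_PER_REQUEST
        picked.add(name)
    history.append(sorted(picked))
    snapshot = dict(live)  # counters become visible only at the boundary

print(history)
```

In every interval exactly one worker receives all 150 requests, and the "victim" moves on to the next worker once the counters are refreshed, which matches what the status page showed.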
A better approach would be to attempt to even out the traffic across the workers over time. The traffic counters could be converted into load factors, so that all workers are used but those that have had less traffic are used more frequently. However, some thought needs to be put into such an algorithm, since the combination of counters being updated, say, every second and a naive algorithm might result in an "oscillating" spread of load.
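As a rough sketch of this idea in Python: the traffic counters are first turned into weights, and the requests within one refresh interval are then spread with a smooth weighted round-robin, which is a generic technique swapped in here for illustration and not something mod_jk is known to do. The weight formula 1 / (1 + traffic / mean) and all names are assumptions.

```python
def load_factors(traffic):
    """Turn traffic counters into weights: less traffic gives a larger weight."""
    mean = sum(traffic.values()) / len(traffic) or 1.0  # avoid dividing by 0
    return {name: 1.0 / (1.0 + seen / mean) for name, seen in traffic.items()}


def spread(weights, n):
    """Distribute n picks with a smooth weighted round-robin."""
    current = dict.fromkeys(weights, 0.0)
    total = sum(weights.values())
    picks = dict.fromkeys(weights, 0)
    for _ in range(n):
        for name, weight in weights.items():
            current[name] += weight
        chosen = max(current, key=current.get)
        current[chosen] -= total  # push the winner back so others get a turn
        picks[chosen] += 1
    return picks


# Stale snapshot in which one worker has taken all the traffic so far.
traffic = {"w1": 1_500_000, "w2": 0, "w3": 0, "w4": 0}
weights = load_factors(traffic)
picks = spread(weights, 160)
print(weights)
print(picks)
```

Every worker keeps receiving some requests, but the one that has already seen the most traffic receives the fewest. Because the weights are still recomputed only when the counters refresh, the oscillation concern remains and would need damping in any real implementation.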
Busy workers get "stuck"
I've been running a high load over a prolonged period, say ab -c 200 -n 1000000 http://my.site/page. The important thing here is to choose a concurrency level (-c 200 in this case) that is large enough to keep a constant busy queue on each worker, but not so high that it provokes the downward spiral described above.
After doing this for, say, 15-20 minutes and then stopping ab with ctrl-c, I have a whole bunch of busy workers that don't seem to go away.