What the “Failed Requests” counter in ARR really means

Tuesday, October 26, 2010


While troubleshooting an intermittent performance issue recently, the question came up: “What does the ‘Failed Requests’ counter in the Monitoring and Management feature in Application Request Routing (ARR) mean?”

For example, does a failed health check cause the counter to climb? What about 404 or 500 status codes? What happens if all nodes are unhealthy and a 502.4 error is thrown?

I didn’t know the answer to all of these, so I set out to test each case.

Microsoft’s Answer

First, what does the official documentation say?

Failed Requests

Displays the number of requests that failed, including requests that are a result of a connection error or a status code that matches a live traffic failure code.

That’s fairly helpful, but you have to read between the lines to know for sure.

Types of Errors

I would group errors into three categories for sites handled by ARR:

those that fail in ARR
those that hard fail on the web server
those that soft fail (very slow, or timeout)

Obviously a slow page that still works won’t trip any counters, so I won’t worry about that.

ARR itself will serve up one of two different types of errors:

502.3 – timeout on the page requested. Default timeout is 30 seconds.
502.4 – no web servers are available to take your request, i.e. the health test failed on all nodes, or all nodes have been manually disabled.

Additionally, the web server can send back a response that is passed right through to the user. Four common status codes are:

200 –> Success. Everything is good
302 –> Found. Basically a temporary redirect.
404 –> Page not found. That says it all.
500 –> Server error. Usually a code/application related error.

ARR’s “Failed Requests” Counter

After testing, here is what I concluded, which lines up with Microsoft’s documentation:

502.3 Timeout	This will increment the “Failed Requests” counter.
502.4 No healthy servers available	This will not increment any counter, including “Current Requests”. It’s a failure before it gets to ARR’s Monitoring and Management stats. I tested both with all servers manually disabled and with all servers marked as unhealthy by the health test. The results were the same.
Health test	The “URL Test” will not change Failed Requests (or any other request counter, for that matter).
500 status code from web node	This will increment the “Failed Requests” counter using the default settings. (more on this below)
404 status code from web node	This will not increment the “Failed Requests” counter with the default settings. (more on this below)

Going back to Microsoft’s documentation, they say this: “or a status code that matches a live traffic failure code.”

In ARR’s Health Test, there are 2 types of tests. The first is the URL test, and the second is the Live Traffic Test.

The URL test will make a call to your server at specified intervals and mark a server as unhealthy if it doesn’t receive a valid response. It will bring it online again after it receives a successful status.

The Live Traffic Test watches the traffic on the way through and can mark a server as unhealthy when it sees too many bad responses within a set timeframe.

By default, neither is set. I highly recommend always setting the URL Test. However, be careful with the Live Traffic Test: someone could find a page that is throwing a 500 error, hit it aggressively, and take all of your servers out of rotation. It’s an easy DoS attack. Additionally, if you have the Live Traffic Test enabled but don’t have a URL Test set, ARR won’t know when to mark the server as healthy again, since a server that is out of rotation no longer receives any live traffic to check.

So, 2 rules of thumb: A) don’t use the Live Traffic Test unless you are sure you need it, and B) if you do use the Live Traffic Test, never use it without also using the URL Test.

The Live Traffic Test is disabled by default since the Failover period (seconds) is 0. With that at zero, the live test doesn’t mark a server as unhealthy. However, it is still used for the Failed Requests counter.

Notice that the Failure Codes setting defaults to 500-. That means that all status codes 500 and higher are considered failures.

For testing, I dropped that to 400- and hit non-existent pages (404 status code) a few times. The Failed Requests counter climbed exactly along with my tests.
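To make the open-ended notation concrete, here is a minimal Python sketch of how a value such as 500- or 400- could be matched against a status code. The function name matches_failure_codes and the comma-separated syntax it accepts are assumptions for illustration only, not ARR's actual parser; only the open-ended forms 500- and 400- come from the tests described here.

```python
def matches_failure_codes(status: int, spec: str = "500-") -> bool:
    """Return True if status falls inside a Failure Codes style spec.

    Assumed syntax (hypothetical, for illustration): comma-separated
    entries, each one of "N" (single code), "N-M" (closed range),
    or "N-" (N and anything higher).
    """
    for entry in spec.split(","):
        entry = entry.strip()
        if not entry:
            continue  # tolerate stray commas
        low, sep, high = entry.partition("-")
        lower = int(low)
        if not sep:
            upper = lower      # "404": a single code
        elif high == "":
            upper = None       # "500-": open-ended, no upper bound
        else:
            upper = int(high)  # "500-599": closed range
        if status >= lower and (upper is None or status <= upper):
            return True
    return False
```

Under these assumptions, matches_failure_codes(404) is False with the default of 500-, while matches_failure_codes(404, "400-") is True, which is the same pattern the counter showed in testing.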

Interestingly enough, if the Failure Codes is set to 999-, then a 502.3 error still increments the counter. So a 502.3 error on the ARR node is an exception and will always increment the Failed Requests counter.

So, the Failure Codes value determines which status codes from the web server will increment the Failed Requests counter. By default it’s status codes 500 and greater.

Conclusion

Microsoft says it like so:

Displays the number of requests that failed, including requests that are a result of a connection error or a status code that matches a live traffic failure code.

To give a more verbose answer, I’ll conclude with the following:

The Failed Requests counter in ARR will increment if there is a connection or page timeout while ARR waits on the web server (502.3), or if the web server returns a status code that falls within the Failure Codes range in the Live Traffic Test (500 and higher by default). Health tests and “no servers available” errors (502.4) do not update the counters.
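The same rules can be condensed into a small Python model. This is only a sketch of the behavior observed in these tests, not ARR's implementation; the Outcome enum, the function name, and the failure_threshold parameter (standing in for an open-ended Failure Codes value such as 500-) are all made up for illustration.

```python
from enum import Enum, auto
from typing import Optional


class Outcome(Enum):
    ARR_TIMEOUT_502_3 = auto()  # ARR timed out waiting on the web server
    NO_SERVERS_502_4 = auto()   # no healthy or enabled servers in the farm
    URL_TEST_PROBE = auto()     # health check request, not user traffic
    SERVER_RESPONSE = auto()    # web server answered; judge by status code


def increments_failed_requests(outcome: Outcome,
                               status: Optional[int] = None,
                               failure_threshold: int = 500) -> bool:
    """Model of whether one event bumps the Failed Requests counter."""
    if outcome is Outcome.ARR_TIMEOUT_502_3:
        # Observed: counted even when Failure Codes was set to 999-.
        return True
    if outcome in (Outcome.NO_SERVERS_502_4, Outcome.URL_TEST_PROBE):
        # Observed: these never reach the request counters at all.
        return False
    if status is None:
        raise ValueError("status is required for SERVER_RESPONSE")
    # Pass-through responses count only when they match Failure Codes.
    return status >= failure_threshold
```

For example, increments_failed_requests(Outcome.SERVER_RESPONSE, 404) is False with the default threshold, and becomes True when failure_threshold is lowered to 400.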

Hi.

Great post.

>> The URL test will make a call to your server at specified intervals and mark a server as unhealthy if it doesn’t receive a valid response. It will bring it online again after it receives a successful status.

I'm having problems with 1001/1000/1002 ARR events, and while I was checking the IIS logs on the farm servers, I saw that they all showed 2 simultaneous requests from each ARR server at the interval I defined. Do you know if this is normal? Why am I getting 2 requests for each URL Test on each farm server?

John - Wednesday, May 25, 2011 10:08:58 PM

Hi John,

Actually that is to be expected. It's not the best behavior but it's what ARR does. For each active website on the ARR server, a health check is performed. So if you have 2 websites that are active websites (a worker process has started) then it would perform two health checks instead of one.

The reason is that ARR doesn't have its own worker process. It piggybacks on the existing app pool worker processes, and it doesn't know which one to use when there are multiple to choose from, so it does a health check per worker process.

As far as I know there isn't a way around that in the current versions. You need to either use a single site on the ARR server (which is normally fine unless you use custom error pages or something else unique like that), or just expect the web servers to have more health check traffic.

OWScott - Thursday, May 26, 2011 9:22:35 AM
