Clever Impression Tracking Technique

I mentioned earlier that I was at ApacheCon this week.  One of the best talks I saw there was from Michael Radwin of Yahoo, on "Cache-Busting for Content Publishers".  His slides can be found here.

As part of his talk, he walked through a very cool technique that Yahoo uses across their sites for impression/usage tracking on ads and other resources.  It is designed to maximize the accuracy of impression-tracking while minimizing bandwidth costs on the host.  It does this by faking out and then leveraging both intermediate and private proxy caches.  Below is a quick description of how it is done:

Scenario:

Assume you have advertisements stored as images on a server (ad0001.jpg, ad0002.jpg, ad0003.jpg, etc).  You then want to expose these advertisements in multiple places on multiple pages across a site.  You want to accurately count how many times each image has been seen by visitors so that you can appropriately bill your advertisers for those impressions.

Naive Implementation:
One simple but sub-optimal way of accomplishing this scenario would be to run code on the server every time the <img src=> tag is written into the page, and update a counter that keeps track of how many times that particular advertisement has been published.

This approach works, but has the downside that you have to update some counter on each request.  This can be a performance problem where there is lock contention at the store location (for example: a single row in a database) -- although there are ways to code around this (by keeping a local cache in the web server and then doing periodic flushes to a backing store).  The biggest performance issue is the fact that you always have to run code on the server in order to do some counting -- which means that you can't use features like output caching to just quickly send previously generated HTML results back down to the client.
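
As a rough illustration of this first approach (the AdImpressionCounter class and RenderAdTag method below are made-up names for the sketch, not an actual ASP.NET API), the per-request counting might look something like this in C#:

using System.Collections.Generic;

public static class AdImpressionCounter
{
    private static readonly object _lock = new object();
    private static readonly Dictionary<string, int> _counts = new Dictionary<string, int>();

    // Called from page code every time an ad <img> tag is rendered.
    public static string RenderAdTag(string adFileName)
    {
        lock (_lock)   // this lock (or a database row lock) is hit on every single request
        {
            int current;
            _counts.TryGetValue(adFileName, out current);
            _counts[adFileName] = current + 1;
        }
        return "<img src='" + adFileName + "' />";
    }
}

A real version would periodically flush these in-memory counts to a backing database, but the key point is that every page render has to execute this code.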

Less-Naive Implementation:

A slightly better implementation of ad tracking would be to not count advertisement impressions within your page's server code -- and instead rely on log analysis.  By simply analyzing the web server logs you can see how many requests there were for ad0001.jpg, and know that each one represents an ad impression.

The benefit of this approach is that it works well with server-side output caching (or pre-generated page content), so the load on the server stays low.  The downside, though, is that you will end up under-counting the total number of real-world impressions.  The reason for this is that proxies will cache the image and avoid forwarding the HTTP request on to the origin server if they already have the image in their cache.  For example: if browser A within a company hits Yahoo and is shown ad0001.jpg in a page -- the company's local proxy server will cache the image.  If browser B in the same organization hits Yahoo and is served the same ad -- the image will be fetched from the local proxy server without ever hitting Yahoo.  Because Yahoo never sees this image request, it isn't logged, and the advertiser can't be billed for it.

One way to fix this is to explicitly mark the advertisement images as requiring revalidation by sending a "Cache-Control: must-revalidate" HTTP header with each image.  Properly written downlevel caches should then honor this setting and call back to the origin server to check the last-modified date.  You can then count the 304 Not Modified entries in the log files for that particular ad and add them to the count of full responses served to get a better sense of total impressions.
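
Here is a rough sketch of what sending that revalidation header could look like from an ASP.NET handler (the class name and method are made up for illustration):

using System;
using System.Web;

public static class RevalidatingAdImage
{
    public static void Serve(HttpContext context, string virtualPath, DateTime lastModified)
    {
        context.Response.ContentType = "image/jpeg";
        context.Response.Cache.SetCacheability(HttpCacheability.Public);
        // Emits "Cache-Control: must-revalidate" so caches are supposed to check back
        context.Response.Cache.SetRevalidation(HttpCacheRevalidation.AllCaches);
        context.Response.Cache.SetLastModified(lastModified);
        // (A dynamic handler would also want to check the If-Modified-Since request
        // header itself and answer 304 Not Modified instead of re-sending the bytes.)
        context.Response.WriteFile(context.Server.MapPath(virtualPath));
    }
}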

Very Clever Implementation:

The above implementation will work in cases where downlevel caches/proxies correctly honor the HTTP header setting and re-validate when they see another request for a cached URL.  The downside is that some caches/proxies don't do this -- regardless of the setting, they never go back to the origin server to re-validate.  Content sites selling advertisements will lose money in these cases -- their customers are seeing advertisement impressions, but the sites don't get credit for them.

Michael then walked us through a clever technique that Yahoo uses to get around this issue.  In a nutshell, they follow the approach below:

1) Instead of rendering static <img src="ad0001.jpg"> tags in their HTML, they render some in-line client-side script that dynamically constructs an image tag pointing at the source URL -- and does so with a randomly generated querystring value appended to it that guarantees a unique URL.  For example:

<script type="text/javascript">
var r = Math.random();
var t = new Date();
document.write("<img width='109' height='52' src='http://ads.example.com/ad/foo/bar.gif?t=" + t.getTime() + ";r=" + r + "'>");
</script>

<noscript>
<img width="109" height="52" src="http://ads.example.com/ad/foo/bar.gif?js=0">
</noscript>

This code ensures that each visit to the HTML page generates a unique URL -- one that will avoid any cache hits in either a local browser cache or an intermediate proxy server.  As such, the browser will always end up hitting the server to request the image.  This guarantees a billable log entry that the content publisher can then use to charge an advertiser.

2) What is clever about Yahoo's approach is that when their server receives a request for an advertisement image with a querystring, it does not serve the image.  Instead, it automatically sends back an HTTP 302 response (a redirect) that points back at the same image URL without the querystring.

For example, a GET request for this url:

    http://ads.example.com/ad/foo/bar.gif?js=343434344343

would immediately get back a 302 redirect to this one:

    http://ads.example.com/ad/foo/bar.gif

The browser will then automatically follow the redirect URL, fetching and displaying the advertisement image.
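
A minimal sketch of this step (a hypothetical ASP.NET IHttpHandler, not Yahoo's actual code) might look like this:

using System.Web;

public class AdRedirectHandler : IHttpHandler
{
    public void ProcessRequest(HttpContext context)
    {
        // Any querystring means this was a cache-busted request generated by
        // the client-side script: log-worthy, but not worth serving bytes for.
        if (context.Request.QueryString.Count > 0)
        {
            // 302 back to the same path with no querystring.  The original
            // request (and its 302 response) still lands in the server logs.
            context.Response.Redirect(context.Request.Path, false);
            return;
        }

        // Otherwise serve the image itself with long-lived caching headers
        // (see the sketch in the next step).
    }

    public bool IsReusable { get { return true; } }
}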

3) When the non-querystring version of the image is requested, Yahoo will also add aggressive caching headers telling browsers and proxies to basically cache the file forever (specifically, they set an ETag, a Cache-Control header, and a 10-year Expires header).

The reason for this is to cause the image to be automatically stored in intermediate proxies as well as in the local browser cache. 
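
Filling in the "serve the image" branch of the sketch above, the long-lived caching headers could be set like this (the file path and ETag value are purely illustrative):

using System;
using System.Web;

public static class AdImageServer
{
    public static void ServeWithLongLivedCaching(HttpContext context)
    {
        context.Response.ContentType = "image/gif";
        context.Response.Cache.SetCacheability(HttpCacheability.Public);   // cacheable by browsers and proxies
        context.Response.Cache.SetExpires(DateTime.UtcNow.AddYears(10));   // ~10 year Expires header
        context.Response.Cache.SetMaxAge(TimeSpan.FromDays(3650));
        context.Response.Cache.SetETag("\"ad-foo-bar-v1\"");               // illustrative ETag value
        context.Response.WriteFile(context.Server.MapPath("~/ads/bar.gif"));
    }
}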

4) When another browser going through the same proxy server as a previous visitor hits Yahoo and is selected to see the same ad impression, the client-side JavaScript will again generate a unique URL for the image file.  This guarantees that the request bypasses the local cache and intermediate proxies, hits Yahoo again, and is then redirected back to the image without the querystring.  Yahoo's log files will automatically include this request and the 302 that was sent back.

Instead of hitting Yahoo again to download the image without the querystring, though, the second browser will this time be able to have the image served from an intermediate proxy (since the image had an aggressive caching policy set during the first browser's visit to the page).   The benefit from the customer's perspective is that this improves the perceived responsiveness of the site (since the file comes from a closer location).  The big benefit from Yahoo's perspective is that they don't have to pay the bandwidth cost of serving the image advertisement again (since the bytes aren't sent over their pipes for this second request).   Instead, they only pay the cost of the relatively small first GET request that returned the 302 response.

Yahoo then counts up the 302 redirect responses for an advertisement (normalizing the querystring in their log parser to just show the base filename), and has the exact number of impressions to bill an advertiser for.  Quite clever, I thought.
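
As a sketch of that billing-side log analysis (the log field positions below are assumptions -- adjust them for your log format):

using System.Collections.Generic;
using System.IO;

public class AdBillingReport
{
    public static Dictionary<string, int> CountBillableImpressions(string logPath)
    {
        var counts = new Dictionary<string, int>();
        foreach (string line in File.ReadAllLines(logPath))
        {
            if (line.StartsWith("#")) continue;        // skip log header lines
            string[] fields = line.Split(' ');
            string url = fields[4];                    // assumed position of the requested URL
            string status = fields[8];                 // assumed position of the HTTP status code
            if (status != "302") continue;             // only the redirect responses are billable

            // Normalize away the random querystring so every variation of the
            // URL collapses onto the base ad filename.
            int q = url.IndexOf('?');
            string baseUrl = (q >= 0) ? url.Substring(0, q) : url;

            int current;
            counts.TryGetValue(baseUrl, out current);
            counts[baseUrl] = current + 1;
        }
        return counts;
    }
}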

What Happened to Site Counters in Whidbey?

In ASP.NET Whidbey Beta1 we had a feature called "Site Counters".  It was designed to provide an easy way for developers to add page/image impression and link tracking, and provided a really nice developer experience for enabling this.  Specifically, you could just set a property on our AdRotator or navigation controls to automatically cause usage of them to update counters stored in a backend database.  A really nice and easy-to-use developer model.

The downside, though, as we started building real-world applications and getting customer feedback, was that although we had a really easy developer approach for usage counting, the implementation model we were using didn't take advantage of the tricks and real-world best practices that sites like Yahoo and others have pioneered -- and would likely have suffered in performance compared to other approaches.

As a result of this, Site Counters is a feature that we've decided to take out of Whidbey, and it will not be in Beta2.  Our team philosophy is that a half-baked feature can often be worse than not having the feature at all -- it is much better to postpone it and ensure it is super high quality in the future.  Site Counters will likely come back in a future ASP.NET release, and this time it will automatically take advantage of approaches like the one above (and others) to deliver a feature that is both easy to use and best-in-breed in performance.

In the meantime, you can manually take advantage of the approach described in Michael's talk to make your own ad impression system work really well.

6 Comments

  • Another downside is that the image doesn't then get served if the browser's Javascript is turned off or (more commonly) the user has installed a filter program that blocks code that looks like a web bug.



    A better solution would be to generate the unique tag on the server, using an algorithm that produces references like "/images/squrdlioup.jpg" (which can still trigger the redirect trick on any modern server like Apache). This would also allow for better tracking even if the user's got cookies disabled or blocked for image fetches.

  • Hi Pater,



    They actually do a clever trick above of having another image in a <noscript> block. This ensures that if JavaScript is turned off -- the ad image will still be displayed. According to Yahoo's research, the number of browsers where this is true is now less than 1%.



    As you suggest, you could alternatively generate the random url on the server -- although that would require server code to execute (eliminating the ability to output cache).

  • Scott,



    Good to hear you decided to pull Site Counters from Whidbey. Not that I'm happy to see a feature gone, but it's better not to rely on hacks in a product of such magnitude. Those who are willing can use the technique you've outlined on their own.

  • Awesome post, dude!!!

  • Google (Gmail) does it slightly differently but equally effectively.

    Instead of the querystring being ?r=somerandomnumber, they do ?abcd=, where abcd is the random part and there is no value being passed.

  • Is the random name "abcd=" with no value somehow less likely to be cached than "t=RandomString"?
