Cache sharing between sites

Saturday, April 5, 2008

There's been some debate recently about how we could enable web sites to share the browser cache in the future. The problem is that popular JavaScript frameworks currently end up being downloaded several times, once from each site that uses them, which is a great waste of resources. Of course, scripts can already be reused across sites today by hosting those frameworks in a central location, but that is an expensive thing to do for framework developers, most of whom run open source projects (it basically amounts to asking the framework developers to pay the hosting costs of everyone who uses their framework).

To summarize the debate, Doug Crockford has been mentioning a possible solution. He also wrote another piece on JavaScript that is disconnected from this debate.

Brendan answered on what really happened back in the Netscape days and mentioned in passing that he didn't like Doug's proposal and that he preferred another approach.

I have to admit I wasn't captivated by the whole debate about the qualities (or lack thereof) of JavaScript and regretted that the debate around such an important feature would be drowned in that. So let me summarize the interesting part...

Doug wants all elements that have a "src" or "href" attribute to also have an optional "hash" attribute that is computed from the contents of that file with a well-defined cryptographic hash algorithm. This way, when the browser encounters another tag that has the same hash value, and it already has a cache entry with that hash, it would just get the resource from the cache without looking at the remote file.
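To make the mechanism concrete, here is a minimal sketch of such a hash-keyed cache. The proposal does not name a hash algorithm, so SHA-256 is an assumption here, and the names HashKeyedCache, load_resource and fetch are hypothetical, made up for this illustration. The important detail is that the cache key is always computed locally from the bytes that were actually downloaded, never taken from the attribute on the page.

```python
import hashlib
from typing import Callable, Dict, Optional


class HashKeyedCache:
    """Cache whose keys are digests computed by the browser itself."""

    def __init__(self) -> None:
        self._entries: Dict[str, bytes] = {}

    def store(self, content: bytes) -> str:
        # The key comes from the content, not from anything the page declared.
        digest = hashlib.sha256(content).hexdigest()  # SHA-256 is an assumption
        self._entries[digest] = content
        return digest

    def lookup(self, declared_hash: str) -> Optional[bytes]:
        return self._entries.get(declared_hash)


def load_resource(
    cache: HashKeyedCache,
    src: str,
    declared_hash: Optional[str],
    fetch: Callable[[str], bytes],
) -> bytes:
    """Resolve a tag that has a src and an optional hash attribute."""
    if declared_hash is not None:
        cached = cache.lookup(declared_hash)
        if cached is not None:
            return cached  # same hash already cached: no network access
    content = fetch(src)  # cache miss: download from the page's own src
    cache.store(content)  # filed under its real digest
    return content
```

With this layout, a page that declares a popular framework's hash but serves something else only ever adds an entry under the digest of what it really served, so it cannot replace the framework's entry.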

Brendan doesn't like this because crypto hashes are not perfectly secure, in that it is possible (but highly unlikely) to build a different, malicious file that has the same hash, and also because a crypto hash in otherwise clean HTML would look weird and out of place.

He proposes an alternate approach where the tag has a readable "shared" attribute that would typically be a url. The mechanism is pretty much the same as the hash, except that it's readable.

I don't know if it's Brendan or me who is missing something here, but his proposal looks a lot less secure than Doug's. Here's how an attacker would compromise that system:

1. EvlH4ckr666 sends spam with links to his new cute penguin image site.
2. As everyone loves cute penguin images, a large number of people go to http://cutepenguinpictures.com (not a real site as I'm writing this), some of them with an empty cache.
3. Our cute penguin site contains (in addition to cute penguin images) a script tag with src="evil.js" and shared=http://sharedscripthosting.com/pasteYourFavoriteFramework.js (also not a real site as I write this).
4. A while later, some of those users will visit another web site that references a legitimate copy of pasteYourFavoriteFramework.js, but as it has the same shared value that evil.js maliciously used, the browser will use what it believes is a legitimate script, but that is in fact evil.js.
5. Chaos ensues.
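The scenario can be sketched in a few lines. This models the shared attribute as it is read at this point in the post (the UPDATE further down revises that reading): the cache is keyed by the shared value, but a miss is filled from whatever src the page points at. The function names and the legitimate.example host are hypothetical, and a plain dictionary stands in for the network.

```python
from typing import Callable, Dict

Fetch = Callable[[str], bytes]


def load_with_shared(cache: Dict[str, bytes], src: str, shared: str, fetch: Fetch) -> bytes:
    """Naive @shared handling: key by the shared value, fill from src."""
    if shared in cache:
        return cache[shared]
    content = fetch(src)
    cache[shared] = content  # nothing ties this content to the shared URL
    return content


def simulate_poisoning() -> bytes:
    """Replay the attack: the evil site is visited first, the honest one second."""
    shared = "http://sharedscripthosting.com/pasteYourFavoriteFramework.js"
    # Hypothetical stand-in for the network: URL -> bytes served.
    network = {
        "http://cutepenguinpictures.com/evil.js": b"/* evil.js */",
        "http://legitimate.example/pasteYourFavoriteFramework.js": b"/* real framework */",
    }
    cache: Dict[str, bytes] = {}
    # Visit to the penguin site, with an empty cache.
    load_with_shared(cache, "http://cutepenguinpictures.com/evil.js", shared, network.__getitem__)
    # Later visit to a legitimate site that uses the same shared value.
    return load_with_shared(
        cache,
        "http://legitimate.example/pasteYourFavoriteFramework.js",
        shared,
        network.__getitem__,
    )
```

Under these assumptions, the second call hands back the bytes of evil.js even though the legitimate site's src points at the real framework.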

Really, am I missing something here?

Another variation on those ideas, a little chattier but one that would keep the HTML clean and could probably be more secure, would be to keep the shared attribute and add another attribute that points to a hashing web service. Here's how that could work:

1. When the browser sees a tag with a shared attribute and already has a cache entry for that shared value, it generates a public key and sends it to the validation service url, challenging the service to return a hash of the script computed with that key.
2. The browser receives the response to its challenge in the form of a hash. It performs the same hashing with the same public key on the cached version and compares the result with what the validation service returned. If they are the same, it uses the cache entry; otherwise it hits the src or href.
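Here is a rough sketch of that exchange. It rests on several assumptions: the key the browser generates is treated as a one-time random value, the keyed hash is HMAC-SHA256, and the validation service is modeled as a local function that holds the canonical copy of the script. None of those choices come from the proposals themselves, and all the function names are hypothetical.

```python
import hashlib
import hmac
import secrets
from typing import Callable

ChallengeService = Callable[[bytes], bytes]


def keyed_hash(key: bytes, content: bytes) -> bytes:
    # HMAC-SHA256 is an assumption; the idea only needs some keyed hash.
    return hmac.new(key, content, hashlib.sha256).digest()


def make_validation_service(canonical_script: bytes) -> ChallengeService:
    """Stand-in for the remote service, which holds the canonical script."""

    def answer_challenge(key: bytes) -> bytes:
        return keyed_hash(key, canonical_script)

    return answer_challenge


def cached_copy_is_valid(cached_script: bytes, service: ChallengeService) -> bool:
    """Challenge the service with a fresh key and compare the two hashes."""
    key = secrets.token_bytes(32)  # fresh per check, so answers cannot be replayed
    expected = service(key)
    actual = keyed_hash(key, cached_script)
    return hmac.compare_digest(expected, actual)
```

If the check returns True the browser would use its cache entry; otherwise it would fall back to src or href. Only the key and one digest cross the network, whatever the size of the script.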

Of course, this is less simple than the other approaches, but I think it's more secure than both and still avoids sending redundant copies of the same potentially huge scripts. Instead, there is a short negotiation whose network payload should be fairly small.

What are your thoughts on this? Worth the trouble?

Doug's post: http://blog.360.yahoo.com/blog-TBPekxc1dLNy5DOloPfzVvFIVOWMB0li?p=789

Brendan's answer to Doug: http://weblogs.mozillazine.org/roadmap/archives/2008/04/popularity.html

UPDATE: I had a mail exchange with Brendan and it seems that what he meant was that it's the shared attribute url that is hit when present. That sure removes any reasonable possibility of poisoning the cache, but I don't see what value it brings: it just seems to replace src and to have exactly the same pros and cons. In particular, it still puts the burden of shared hosting on the script author, whereas Doug's proposal (and mine) distributes this burden across all user sites.
Also removed a word that he found abusive.

UPDATE 2: So apparently the only thing @shared brings when compared with the regular @src is that src can be used as a fallback if @shared is unavailable. The shared url is still queried every time the cache doesn't contain it, which means that it still requires some massive hosting capabilities. There is no distribution of the burden. Brendan even suggested that for performance reasons, both urls get queried whenever the cache is empty! But of course, anyone who gave it some thought would have inferred all this from the following ;) :
"If the browser has already downloaded the shared URL, and it still is valid according to HTTP caching rules, then it can use the cached (and pre-compiled!) script instead of downloading the src URL. This avoids hash poisoning concerns. It requires only that the content author ensure that the src attribute name a file identical to the canonical ("popular") version of the library named by the shared attribute. [...] only the @shared value would be shared among script tags. The @src would be loaded only if there was no cache entry for @shared."
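As I now understand it, the clarified behavior could be sketched like this. The function name is hypothetical, an OSError stands in for "the shared url is unavailable", and the choice not to file fallback content under the shared key is my assumption rather than something stated in the exchange. HTTP cache expiry is left out for brevity.

```python
from typing import Callable, Dict

Fetch = Callable[[str], bytes]


def load_script(cache: Dict[str, bytes], src: str, shared: str, fetch: Fetch) -> bytes:
    """Clarified @shared handling: the shared URL itself is what gets fetched."""
    if shared in cache:
        return cache[shared]  # entry can only have come from the shared URL
    try:
        content = fetch(shared)  # every cache miss hits the shared host
    except OSError:
        # Shared host unavailable: fall back to the page's own copy.
        # Assumption: this copy is not stored under the shared key.
        return fetch(src)
    cache[shared] = content
    return content
```

Because the entry for a shared value is only ever filled from the shared URL, the penguin attack no longer works, but every cold cache still lands on the shared host.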

5 Comments

Brendan responded to your question in his blog. For your readers' convenience:

"Only the @shared value would be shared among script tags. The @src would be loaded only if there was no cache entry for @shared."

Joe Chung - Saturday, April 5, 2008 5:18:17 AM

I don't see the security difference between the two plans: how do you guarantee that the hash "generated with a well-defined cryptographic hash algorithm" really matches the javascript file referenced by the src attribute? To do that, you would have to have the browser compute the hash after receiving the javascript file (and then, why are you specifying it in the html in the first place?).

The two plans are equally insecure and share the attack that you describe.

Philip - Saturday, April 5, 2008 6:51:45 PM

@Joe: yes, I actually read Brendan's answer before I wrote that post, but I don't see how it answers my question. I reformulated with another comment that for some reason he chose not to publish. That's why I posted my comments on my own blog ;) The attack scenario that I describe in this post, unless I'm mistaken, is not mitigated by his answer in any way as the src value would never be hit if something is already in the cache for the shared value, which may be a malicious script.
@Philip: I think the third solution that I describe doesn't have this flaw at all. But Doug's approach is much more difficult to compromise than Brendan's: with his approach, the cache is organized as a dictionary of hashes to scripts. When you visit a page, if there is a script with a hash attribute, the browser looks up that hash and, if it finds it, uses the cached script. But the way the cache is constructed is what guarantees the integrity of the hash: the first time a script is loaded, the browser computes the hash locally from the contents and uses the computed hash as the key in the dictionary. Does that clarify?

Bertrand Le Roy - Saturday, April 5, 2008 9:52:17 PM

@Kurisu: thanks for the pointers. I agree that crypto hash attacks are blown way out of proportion by Brendan, who tends to generate a quite powerful RDF. That he would be using a security argument to dismiss Doug's approach and propose his own approach which has a much bigger security problem is quite puzzling. And yes, Doug's post wasn't limited to script and neither should any implementation of that stuff. CSS and images would benefit from the same optimizations. To cite the post above, "all elements that have a "src" or "href" attribute".

Bertrand Le Roy - Tuesday, April 8, 2008 9:14:49 PM

@Jonah: Some have expressed privacy concerns over having scripts hosted by Google (which then gets a lot of free information about people browsing your site through the referrer header). You also have to trust the central location to always be available. But yes, centralized hosting by Google and others is a step in the right direction.
I would prefer a solution such as the ones described here because there is no such reliance on a centralized location, yet the scripts get cached across sites and the load is naturally distributed.

Bertrand Le Roy - Wednesday, May 28, 2008 11:29:43 PM
