Tales from the Evil Empire

Bertrand Le Roy's blog

Cache sharing between sites

There's been some debate recently about how web sites could share the browser cache in the future. The problem is that popular JavaScript frameworks currently end up being downloaded several times, once from each site that uses them, which is a great waste of resources. Of course, scripts can already be reused across sites today by hosting those frameworks in a central location, but that is expensive for framework developers, most of whom work on open source projects (it basically amounts to asking the framework developers to pay the hosting costs of everyone who uses their framework).

To summarize the debate, Doug Crockford has been mentioning a possible solution. He also wrote another piece on JavaScript that is disconnected from this debate.

Brendan answered on what really happened back in the Netscape days and mentioned in passing that he didn't like Doug's proposal and that he preferred another approach.

I have to admit I wasn't captivated by the whole debate about the qualities (or lack thereof) of JavaScript, and I regretted that the discussion of such an important feature got drowned in it. So let me summarize the interesting part...

Doug wants all elements that have a "src" or "href" attribute to also have an optional "hash" attribute that is computed from the contents of that file with a well-defined cryptographic hash algorithm. This way, when the browser encounters another tag that has the same hash value, and it already has a cache entry with that hash, it would just get the resource from the cache without looking at the remote file.
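
To make the mechanism concrete, here is a minimal Python sketch of a cache keyed by content hash. It is only an illustration: the class, function and host names are made up, SHA-256 is an assumed choice (the proposal only asks for a well-defined cryptographic hash), and fetch stands in for a real network request. Note that the browser computes the key itself from the bytes it downloaded, so a page cannot file content under a hash it merely claims.

```python
import hashlib


class HashKeyedCache:
    """Toy browser cache whose keys are digests of the cached content."""

    def __init__(self):
        self._entries = {}

    def store(self, content: bytes) -> str:
        # The key is computed locally from the bytes actually received,
        # never taken from the page's "hash" attribute.
        digest = hashlib.sha256(content).hexdigest()
        self._entries[digest] = content
        return digest

    def lookup(self, declared_hash: str):
        return self._entries.get(declared_hash)


def load_resource(cache, src, declared_hash, fetch):
    """Return the resource for a tag carrying src and a hash attribute."""
    cached = cache.lookup(declared_hash)
    if cached is not None:
        return cached          # cache hit: no network request at all
    content = fetch(src)       # cache miss: download from this site's src
    cache.store(content)       # filed under its real digest
    return content


# Two hypothetical sites referencing the same framework bytes.
downloads = []


def fetch(url: str) -> bytes:
    downloads.append(url)
    return b"/* framework source */"


framework_hash = hashlib.sha256(b"/* framework source */").hexdigest()
cache = HashKeyedCache()
first = load_resource(cache, "http://site-a.example/fw.js", framework_hash, fetch)
second = load_resource(cache, "http://site-b.example/fw.js", framework_hash, fetch)
print(len(downloads))  # 1
```

Only the first site pays for the download; the second visit is served from the cache even though its src points somewhere else.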

Brendan doesn't like this because crypto hashes are not perfectly secure: it is possible (though highly unlikely) to build a different, malicious file that has the same hash. He also finds that a crypto hash would look weird and out of place in otherwise clean HTML.

He proposes an alternative approach where the tag has a readable "shared" attribute that would typically be a URL. The mechanism is pretty much the same as the hash, except that it's readable.

I don't know whether it's Brendan or me who is missing something here, but his proposal looks a lot less secure than Doug's. Here's how an attacker would compromise that system:

  • EvlH4ckr666 sends spam with links to his new cute penguin image site.
  • As everyone loves cute penguin images, a large number of people go to http://cutepenguinpictures.com (not a real site as I'm writing this), some of them with an empty cache.
  • Our cute penguin site contains (in addition to cute penguin images) a script tag with src="evil.js" and shared="http://sharedscripthosting.com/pasteYourFavoriteFramework.js" (also not a real site as I write this).
  • A while later, some of those users will visit another web site that references a legitimate copy of pasteYourFavoriteFramework.js, but as it has the same shared value that evil.js maliciously used, the browser will use what it believes is a legitimate script, but that is in fact evil.js.
  • Chaos ensues.
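
The scenario above can be replayed in a few lines of Python. This is a toy model that assumes the reading of the proposal used in the list: the cache is keyed by the shared value, and whatever was downloaded from src gets stored under it. The class name and the legitsite.example host are hypothetical, and a dictionary stands in for the network.

```python
class SharedKeyedCache:
    """Toy cache keyed by the page-supplied 'shared' value."""

    def __init__(self):
        self._entries = {}

    def load(self, src, shared, fetch):
        if shared in self._entries:
            return self._entries[shared]   # src is never looked at
        content = fetch(src)
        # The key comes from the page, not from the content: this is
        # the step that lets a malicious page poison the cache.
        self._entries[shared] = content
        return content


SHARED = "http://sharedscripthosting.com/pasteYourFavoriteFramework.js"

# Stand-in for the network: URL -> content.
web = {
    "http://cutepenguinpictures.com/evil.js": b"evil()",
    "http://legitsite.example/framework.js": b"framework()",
}

cache = SharedKeyedCache()

# 1. The user visits the penguin site with an empty cache.
cache.load("http://cutepenguinpictures.com/evil.js", SHARED, web.__getitem__)

# 2. Later, a legitimate site references the real framework
#    with the same shared value.
victim_script = cache.load(
    "http://legitsite.example/framework.js", SHARED, web.__getitem__
)
print(victim_script)  # b'evil()'
```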

Really, am I missing something here?

Another variation on those ideas, which would be a little chattier but would keep the HTML clean and could probably be more secure, would be to keep the shared attribute and add another attribute that points to a hashing web service. Here's how that could work:

  • When the browser sees a tag with a shared attribute and it has a cache entry with that shared value, it would generate a public key and send it to the validation service URL, challenging the service to return a hash of the script computed with the provided public key.
  • The browser receives the response to its challenge in the form of a hash. It performs the same hashing with the same public key on the cached version and compares the result with what the validation service returned. If they are the same, it uses the cache entry; otherwise it hits the src or href.
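
Here is one way the exchange could look, as a rough Python sketch rather than a specification. It takes a liberty with the description above: instead of a "public key", the browser generates a random one-time value and uses it as the key of an HMAC-SHA256, which is an assumed substitute that plays the same role as the challenge. ValidationService is a hypothetical in-process stand-in for the remote hashing service.

```python
import hashlib
import hmac
import secrets


def keyed_hash(key: bytes, content: bytes) -> str:
    """Hash of the content that depends on the one-time challenge key."""
    return hmac.new(key, content, hashlib.sha256).hexdigest()


class ValidationService:
    """Hypothetical service that holds the canonical script."""

    def __init__(self, canonical_script: bytes):
        self._canonical = canonical_script

    def answer_challenge(self, key: bytes) -> str:
        return keyed_hash(key, self._canonical)


def cache_entry_is_valid(cached_script: bytes, service) -> bool:
    # A fresh random key per check means an old answer cannot be replayed.
    key = secrets.token_bytes(32)
    expected = service.answer_challenge(key)   # small network round trip
    actual = keyed_hash(key, cached_script)    # computed locally
    return hmac.compare_digest(expected, actual)


service = ValidationService(b"framework()")
print(cache_entry_is_valid(b"framework()", service))  # True
print(cache_entry_is_valid(b"evil()", service))       # False
```

Only the key and a digest travel over the wire, whatever the size of the script. If the check fails, the browser would fall back to downloading src or href as described above.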

Of course, this is less simple than the other approaches, but I think it's more secure than both and still avoids sending redundant copies of the same potentially huge scripts. Instead, there is a short negotiation whose network payload should be small.

What are your thoughts on this? Worth the trouble?

Doug's post: http://blog.360.yahoo.com/blog-TBPekxc1dLNy5DOloPfzVvFIVOWMB0li?p=789

Brendan's answer to Doug: http://weblogs.mozillazine.org/roadmap/archives/2008/04/popularity.html

UPDATE: I had a mail exchange with Brendan and it seems that what he meant was that the shared attribute URL is the one that is hit when present. That sure removes reasonable possibilities of poisoning the cache, but I don't see what value it brings: it just seems to replace src and to have exactly the same pros and cons. In particular, it still puts the burden of shared hosting on the script author, whereas Doug's proposal (and mine) distributes this burden across all user sites.
Also removed a word that he found abusive.

UPDATE 2: so apparently the only thing @shared brings when compared with the regular @src is that src can be used as a fallback if @shared is unavailable. The shared URL is still queried every time the cache doesn't contain it, which means that it still requires some massive hosting capabilities. There is no distribution of the burden. Brendan even suggested that for performance reasons, both URLs get queried whenever the cache is empty! But of course, anyone who gave thought to it would have inferred all this from the following ;) :
"If the browser has already downloaded the shared URL, and it still is valid according to HTTP caching rules, then it can use the cached (and pre-compiled!) script instead of downloading the src URL. This avoids hash poisoning concerns. It requires only that the content author ensure that the src attribute name a file identical to the canonical ("popular") version of the library named by the shared attribute. [...] only the @shared value would be shared among script tags. The @src would be loaded only if there was no cache entry for @shared."

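As I understand the clarified semantics from the two updates, they could be sketched like this in Python. The sketch rests on several assumptions: the cache is a plain dictionary keyed by the shared URL, only content downloaded from the shared URL itself is stored under that key, src is used only when the shared URL cannot be reached, and HTTP cache validity is ignored. The function name and the .example hosts are made up.

```python
def load_script(cache, src, shared, fetch):
    """Load a script tag that has both src and shared attributes."""
    if shared in cache:
        return cache[shared]
    try:
        content = fetch(shared)   # the shared host is hit on every miss
    except LookupError:
        # Fallback when the shared URL is unavailable. Assumption: the
        # result is not stored under the shared key, so it cannot poison it.
        return fetch(src)
    cache[shared] = content
    return content


SHARED = "http://sharedscripthosting.com/pasteYourFavoriteFramework.js"

# Stand-in for the network: URL -> content.
web = {
    SHARED: b"framework()",
    "http://cutepenguinpictures.com/evil.js": b"evil()",
    "http://legitsite.example/framework.js": b"framework()",
}
requests = []


def fetch(url):
    requests.append(url)
    return web[url]   # raises KeyError (a LookupError) if unreachable


cache = {}
first = load_script(cache, "http://cutepenguinpictures.com/evil.js", SHARED, fetch)
second = load_script(cache, "http://legitsite.example/framework.js", SHARED, fetch)
print(first, second)  # b'framework()' b'framework()'
print(requests == [SHARED])  # True
```

The evil.js from the penguin site never reaches the cache, but every cold cache on the planet ends up requesting the shared URL, which is the hosting burden discussed above.
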
Comments

Joe Chung said:

Brendan responded to your question in his blog.  For your readers' convenience:

"Only the @shared value would be shared among script tags. The @src would be loaded only if there was no cache entry for @shared."

# April 5, 2008 1:18 AM

Philip said:

I don't see the security difference between the two plans: how do you guarantee that the hash "generated with a well-defined cryptographic hash algorithm" really matches the javascript file referenced by the src attribute? To do that, you would have to have the browser compute the hash after receiving the javascript file (and then, why are you specifying it in the html in the first place?).

The two plans are equally insecure and share the attack that you describe.

# April 5, 2008 2:51 PM

Bertrand Le Roy said:

@Joe: yes, I actually read Brendan's answer before I wrote that post, but I don't see how it answers my question. I reformulated with another comment that for some reason he chose not to publish. That's why I posted my comments on my own blog ;) The attack scenario that I describe in this post, unless I'm mistaken, is not mitigated by his answer in any way as the src value would never be hit if something is already in the cache for the shared value, which may be a malicious script.

@Philip: I think the third solution that I describe doesn't have this flaw at all. But Doug's approach is much more difficult to compromise than Brendan's: the scenario with his approach is that the cache is organized as a dictionary of hashes to scripts. When you visit a page, if there is a script with a hash attribute, the browser looks up that hash and, if it finds it, uses the script. But the way the cache is constructed is what guarantees the integrity of the hash: the first time a script is loaded, the browser computes the hash locally from the contents and uses the computed hash as the key in the dictionary. Does that clarify?

# April 5, 2008 5:52 PM

Kurisu said:

A lot of storage systems use exactly this concept, i.e., storing hashes of files or chunks thereof along with the files themselves, and use this to avoid transferring or sometimes even storing the same data more than once. Not surprisingly, very similar ideas have been around for years and decades. Consider the ETag header, the Content-MD5 header or, much closer, RFC 2169:

www.faqs.org/.../rfc2169.html

The Gnutella network takes advantage of this RFC.

Calling cryptographic hashes too weak for this purpose borders on a reality distortion field. How do these people think cryptographic signatures or SSL/TLS work? Not to mention that attacking the hashes by trying to generate evil twins is the least probable attack vector. Anyway, in this context though I'd consider it a total pain for maintenance. At the very least it would require dedicated tools to keep everything consistent during updates. Also if I was going through all this hassle I'd extend this to all files, not just JavaScript. For browsers or proxies that already implement caching this would be low-hanging fruit anyway. All they need to add is an index with the hashes for the cached files.

# April 8, 2008 12:55 PM

Bertrand Le Roy said:

@Kurisu: thanks for the pointers. I agree that crypto hash attacks are blown way out of proportion by Brendan, who tends to generate a quite powerful RDF. That he would be using a security argument to dismiss Doug's approach and propose his own approach which has a much bigger security problem is quite puzzling. And yes, Doug's post wasn't limited to script and neither should any implementation of that stuff. CSS and images would benefit from the same optimizations. To cite the post above, "all elements that have a "src" or "href" attribute".

# April 8, 2008 5:14 PM

Kurisu said:

I hadn't actually read the linked articles before commenting. Now I see he even mentions base32 encoding of the hash which kinda shows where Doug got this idea from.

I think what Brendan meant is similar to DTD URIs in XML. These can be normal HTTP URLs and it's certainly useful to avoid namespace clashes. A few parsers do/did really try to fetch the DTD from such URLs if it wasn't stored locally. This can easily cause an unintended DDoS. Likewise it's only safe if you can trust DNS and the server. So signing the scripts using PKI and/or use of TLS is probably implicitly given. Anything else would mean someone misunderstood cross-site scripting. If you use only signatures, i.e., you trust everything signed with certain keys but use no explicit hashes, that means of course that the script can be modified by the key owner, for good or for bad.

The DDoS issue could be circumvented by using Coral: www.coralcdn.org

All in all, I believe both suggestions are actually complementary.

This article is also related and somewhat interesting:

changelog.ca/.../gnutella_does_not_need_the_x-alt_http_header

# April 9, 2008 7:13 PM

Jonah Dempcy said:

Why not just rely on Google to provide the bandwidth and simply have them host all the major JS frameworks? They are already hosting plenty of code snippets on Google Code and they offer hotlinks (the link actually has the text "Hotlink/Download" so they intend for you to use it directly if you so desire).

For example, I have used Dean Edwards' IE7.js library. Normally I would concatenate it together with all the other JS as part of the build process to minimize I/O traffic, but I'm wondering if I'd be better off just hotlinking to Google Code, on the off-chance that other sites are doing the same and the download would be cached:

ie7-js.googlecode.com/.../2.0(beta)/IE7.js

For now, I will keep doing my concatenation thing because minimizing the amount of separate JS files seems to have the greatest benefit (and I doubt many people have the IE7.js file cached). But, if Google Code were to host MooTools, jQuery or Prototype I'd be all over it in a flash. I'd stop having to write build scripts that combine those files with the rest of the site JS and be able to just hotlink from Google, basking in the benefits of universal caching for popular pieces of code.

# May 7, 2008 10:49 PM

Bertrand Le Roy said:

@Jonah: Some have expressed privacy concerns over having scripts hosted by Google (which then gets a lot of free information about people browsing your site through the referrer header). You also have to trust the central location to always be available. But yes, centralized hosting by Google and others is a step in the right direction.

I would prefer a solution such as the ones described here because there is no such reliance on a centralized location, yet the scripts get cached across sites and the load is naturally distributed.

# May 28, 2008 7:29 PM