304 Your images from a database

Wednesday, July 1, 2009

I was reading somewhere about some anecdotal evidence that Google doesn't like to index images that don't have some kind of modification time on them. When I relaunched CoasterBuzz last year, I moved all of my coaster pr0n to the database, and I've since noticed that none of the images are in fact indexed. Bummer.

This also pointed out to me that I was doing something annoying: I was reading the data out of the database for every single image request. Not exactly the most efficient use of resources. Static files come down with a header indicating when they were last modified (IIS does this, and presumably any other Web server does too). The next time the browser makes the request, it sends that time back, the server compares the time in the request header with that of the file, and if nothing has changed it returns a 304 "not modified" response and no file.

That seemed like an obvious thing to do, even if it has no impact on Google indexing. Fortunately, it just required some refactoring of the IHttpHandler I had doing the work.

Sidebar: This is probably the point at which some people will make a big stink about serving images out of a database, and how it's bad for performance or scalability. That's a fine argument to make, but outside of doing obviously stupid things, this is not an issue here. I'd prefer to address performance and scalability problems if I have them, not when I might have them, or never have them. Seriously, this is a site that does somewhere between a half-million and a million page views a month depending on the season. There are no performance issues here.

So anyway, assuming for a moment that "photo" is a business object in this code, and that it was looked up from a query string value passed to the handler, this is the meaty part of the ProcessRequest() method of the handler:

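// Assumes: using System; using System.Globalization; using System.Web;
string ifModifiedSince = context.Request.Headers["If-Modified-Since"];
DateTime lastMod;
// TryParseExact instead of ParseExact, so a malformed header falls through to a normal response.
if (!String.IsNullOrEmpty(ifModifiedSince) &&
    DateTime.TryParseExact(ifModifiedSince, "r", CultureInfo.InvariantCulture, DateTimeStyles.None, out lastMod))
{
    // The header only has one-second resolution, so strip the stored milliseconds before comparing.
    if (lastMod == photo.SubmitDate.AddMilliseconds(-photo.SubmitDate.Millisecond))
    {
        // Nothing changed: send the 304 and skip the database read entirely.
        context.Response.StatusCode = 304;
        context.Response.StatusDescription = "Not Modified";
        return;
    }
}
// Full response: write the bytes, then mark them cacheable with a modification time.
byte[] imageData = GetImageData(photo);
context.Response.OutputStream.Write(imageData, 0, imageData.Length);
context.Response.Cache.SetCacheability(HttpCacheability.Public);
// SubmitDate is stored as UTC; say so explicitly so SetLastModified() doesn't adjust it.
var adjustedTime = DateTime.SpecifyKind(photo.SubmitDate, DateTimeKind.Utc);
context.Response.Cache.SetLastModified(adjustedTime);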

Yes, it probably needs to be refactored, and yes, it should probably be used in an IHttpAsyncHandler. But let's go through what's happening, starting at the bottom.

The last few lines write out the actual bytes of the image (MIME type was set in previous code), then set the cacheability and the modification time of the image, which in my case is stored with the bits. The goofy part is where we create a new DateTime to make its kind known. If you don't explicitly state that it's a UTC time, the SetLastModified() method apparently adjusts it. I happen to store most times as UTC, so that was one less thing to worry about. This adds a header in the response called Last-Modified, and gives it a value that looks something like "Sun, 22 Jun 2003 16:27:19 GMT" (note that it truncates milliseconds and ticks, as you may expect).
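
To make the truncation and the UTC business a little more concrete, here's a minimal sketch that uses nothing but the base class library. The class and method names (LastModifiedFormat, ToHttpDate, AsUtc) are made up for illustration and aren't part of the handler, and it assumes the stored time is already UTC. It isn't what SetLastModified() does internally, just the same header format produced by hand.

```csharp
using System;
using System.Globalization;

// Hypothetical helper, for illustration only.
public static class LastModifiedFormat
{
    // Marks a value that is already UTC as such, without shifting the clock time.
    public static DateTime AsUtc(DateTime storedUtc)
    {
        return DateTime.SpecifyKind(storedUtc, DateTimeKind.Utc);
    }

    // Formats a timestamp the way an HTTP date header expects it.
    // The "r" (RFC 1123) specifier drops fractional seconds and appends "GMT",
    // but it does not convert the value, so the caller has to pass UTC.
    public static string ToHttpDate(DateTime utc)
    {
        return utc.ToString("r", CultureInfo.InvariantCulture);
    }
}
```

Feeding it a DateTime for June 22, 2003 at 16:27:19.123 gives "Sun, 22 Jun 2003 16:27:19 GMT", with the milliseconds gone.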

Now, on subsequent requests for the same image, the browser adds an If-Modified-Since header to the request, with the same date and time as the value. Here we're checking to see if the value is present on the request, and if so, let's see if we should do a 304. If it's there, we parse it into a DateTime and compare the time with the one stored in the business object. We're stripping off the milliseconds because the database will fill them in on our DateTime, and the incoming request doesn't have the same high resolution. If we have a match, we send out the 304 and return, not sending any more data or reading the bytes from the database.
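
If you wanted to pull that check out of the handler, it might look something like the sketch below. Treat it as one possible shape rather than the code I'm running: ConditionalGet, TruncateToSeconds and IsNotModified are hypothetical names, the stored timestamp is assumed to be UTC, and it swaps my exact-match comparison for "not newer than", which is closer to what If-Modified-Since literally asks. It also truncates to whole seconds instead of just removing milliseconds, so any leftover ticks can't spoil the comparison.

```csharp
using System;
using System.Globalization;

// Hypothetical helper, for illustration only.
public static class ConditionalGet
{
    // Drops everything below one second, since HTTP dates carry whole seconds only.
    public static DateTime TruncateToSeconds(DateTime value)
    {
        long wholeSeconds = value.Ticks - (value.Ticks % TimeSpan.TicksPerSecond);
        return new DateTime(wholeSeconds, value.Kind);
    }

    // Returns true when the stored UTC timestamp is no newer than the
    // If-Modified-Since value, meaning a 304 would be appropriate.
    // A missing or malformed header returns false, so the caller sends the full image.
    public static bool IsNotModified(string ifModifiedSince, DateTime lastModifiedUtc)
    {
        if (String.IsNullOrEmpty(ifModifiedSince))
            return false;

        DateTime since;
        bool parsed = DateTime.TryParseExact(
            ifModifiedSince, "r", CultureInfo.InvariantCulture,
            DateTimeStyles.AssumeUniversal | DateTimeStyles.AdjustToUniversal,
            out since);
        if (!parsed)
            return false;

        return TruncateToSeconds(lastModifiedUtc) <= since;
    }
}
```

The handler would then call ConditionalGet.IsNotModified(context.Request.Headers["If-Modified-Since"], photo.SubmitDate) and send the 304 when it returns true.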

You can do this pretty easily in ASP.NET MVC as well.

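// Assumes: using System; using System.Globalization; using System.IO; using System.Web; using System.Web.Mvc;
public ActionResult Image(int id)
{
    var image = _imageRepository.Get(id);
    if (image == null)
        throw new HttpException(404, "Image not found");
    string ifModifiedSince = Request.Headers["If-Modified-Since"];
    DateTime lastMod;
    // TryParseExact instead of ParseExact, so a malformed header falls through to a normal response.
    if (!String.IsNullOrEmpty(ifModifiedSince) &&
        DateTime.TryParseExact(ifModifiedSince, "r", CultureInfo.InvariantCulture, DateTimeStyles.None, out lastMod))
    {
        // The header is GMT; ToLocalTime() implies TimeStamp holds local time in this prototype.
        // Milliseconds are stripped because the header only has one-second resolution.
        if (lastMod.ToLocalTime() == image.TimeStamp.AddMilliseconds(-image.TimeStamp.Millisecond))
        {
            Response.StatusCode = 304;
            Response.StatusDescription = "Not Modified";
            return Content(String.Empty);
        }
    }
    // Full response: mark it cacheable, stamp it, and stream the bytes.
    var stream = new MemoryStream(image.GetImage());
    Response.Cache.SetCacheability(HttpCacheability.Public);
    Response.Cache.SetLastModified(image.TimeStamp);
    return File(stream, image.MimeType);
}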

Let me start by saying that this was something I just prototyped. It's in dire need of refactoring, as much of the logic isn't stuff you'd normally put in a controller action. I think there's a method for returning nothing on the Controller base, but I don't remember off the top of my head. If there is, you'd use that instead of Content() in the 304 case.

I hope this helps someone out!

Why would you want to store uploaded images in a database? I don't understand the benefit of doing it this way over storing them as actual files on the server.

Can you explain?

Rick H. - Wednesday, July 1, 2009 1:26:42 PM

Why not? I mean, the image is just data, so why not store the bytes alongside the data that describes it? You also don't have the strange issue that crops up of files without data or data without a file. The biggest benefit to me, however, is fewer files. The database is backed up as one file, which is a lot easier to move, if need be, than a bazillion smaller ones.

Jeff - Wednesday, July 1, 2009 3:09:08 PM

I can name another scenario where storing images in the database is a huge benefit. We have a cluster of Oracle Application Server instances set up with a load balancer in front of them all. Storing the images in the database ensures that no matter what server the user is hitting, the images will be available to that user.

Also when I retrieve images that are stored in the database, I make sure that I set the last-modified, cache-control, and expires response header attributes. This allows browsers to cache the images, thus making this very efficient.

Java example:

response.addHeader("Last-Modified", formattedLastModifiedDate);
response.addHeader("Cache-Control", "public");
response.addHeader("Expires", formattedExpirationDate);
response.setBufferSize(imageByteArray.length);
response.setContentType(mimeType);
response.setContentLength(imageByteArray.length);

Dan - Thursday, July 2, 2009 8:37:49 AM

Thanks, Dan. The server farm thing was too obvious for me to think about!

The browser will cache the image with what I have, as the methods off of HttpContext.Response.Cache set these headers for you. I haven't set the expiration in my case, and I'm not sure if the browser then restricts the caching just to the session.

Jeff - Thursday, July 2, 2009 9:59:33 AM

re: images in the database. YMMV - it depends on your usage. I grok the desire to keep the bytes co-located with their metadata, and the ease of central access in a server farm, but it gets trickier at scale - I inherited a large, client-deployed application that used the database for file storage. Sysadmins almost universally hate this, mainly because of the extra strain it puts on the database - these are systems with 24/7 uptime requirements and moderate to high peak concurrency (50,000+ concurrent sessions is not unusual). Admins also just like having direct access to file content.

You are addressing some of the perf strain here by using 304, but by replicating functionality that should be available directly in the web server. Ultimately, both techniques can be valid, it's just a matter of understanding the tradeoffs.

Bob - Thursday, July 2, 2009 11:41:58 AM

Agreed. I'm not dogmatic about it at all, but my little disclaimer was for those who are. Some people make it a religious issue!

Sort of related, but I'm fascinated to read up on how Facebook handles their photos. It seems to not work about 2% of the time for me, but it's impressive given the scale.

Jeff - Thursday, July 2, 2009 11:22:46 PM

Awesome story, I literally googled for the exact title of this. LoL.

FYI - another good reason for doing it this way is when you are on shared hosting that doesn't allow your web application to write files to the file system.

I'd be interested in reading that Facebook story you mentioned if you have the URL.

Joe - Sunday, May 30, 2010 3:02:23 AM

I wish ASP.NET had an option to set all the caching headers in one simple easy step. Something like Response.CacheFor(20), or stuff like that.

Saeed Neamati - Sunday, June 3, 2012 1:01:44 PM

8 Comments