Building for Web scale is a different skill

Tuesday, May 7, 2013

There are a lot of things that one can find satisfying about building stuff for the Web. For a lot of people, it's probably just the act of building something cool, pretty and useful. These are certainly things to strive for, but for me, the interesting thing has always been to build something that can scale.

Like so many things in life, this particular desire grew out of experience. Very early on, before I was technically getting paid to be a software developer, I learned about scale problems. In the wild west of 2000, I launched CoasterBuzz and did some advertising for it. I was on a shared hosting plan, and the site started to get slow in a hurry. There were a number of things I did poorly, including some recursive database queries, and worse, fetching more data than I needed. You live and learn, as they say, and I got better at it over time.

Many years later, I would have the chance to work on the MSDN/TechNet forums, which served well over 45 million pages per month. It's not lost on me how rare it is for anyone to get to work on a Web app that has to scale to that size. My team was actually there to try to rein it in a little, because it required a huge number of servers to run. There was a lot of low-hanging fruit, and some really hard things to do as well. I didn't directly do a lot of the performance-enhancing work (though I did pair on it), but I still took a lot away from that experience.

My own sites collectively do 12 to 15 million pages per year, depending on what's going on that year. Respectable, but under any normal circumstances, not a lot. At peak times, that works out to between 6 and 10 pages per second, and less than 1 in off-peak times. It's very rare that my server ever gets pushed beyond 25% CPU usage (it doesn't hurt that it's total overkill, with four fast cores).
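For a sense of proportion, here is a quick back-of-the-envelope sketch of what those yearly totals mean as an average rate. The function name is made up for illustration, and it assumes traffic spread evenly over a 365-day year, which real traffic never is. That is exactly why the peak numbers are so much higher than the average.

```python
# Back-of-the-envelope: convert a yearly page count into an average rate.
# Assumes a 365-day year and perfectly even traffic (an illustration only).
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000


def average_pages_per_second(pages_per_year: float) -> float:
    """Average pages served per second for a given yearly total."""
    return pages_per_year / SECONDS_PER_YEAR


for pages in (12_000_000, 15_000_000):
    rate = average_pages_per_second(pages)
    print(f"{pages:,} pages/year is about {rate:.2f} pages/second on average")
```

Both figures come out well under one page per second, which lines up with the off-peak number above.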

Still, I've noticed that people who work on Web applications don't always think in Web terms. By that, I mean it's not uncommon for them to think in "offline" terms, where time is not nearly as critical. For example, someone who works a typical job doing line-of-business applications doesn't care if they build a SQL query that has a ten-way join over two views. It might take a few seconds (or minutes) to get results, but it doesn't matter for the report it's going to generate. For the Web, that timing matters.

So here are a few of the things that I think people building apps for the Web need to think about. If there are others you can think of, I'd love to hear them! This is not an exhaustive list...

Denormalize! Disk space is cheap, and disks are huge. Really, it's OK to duplicate data if it means you don't have to do a bunch of expensive database joins. This is even more important now, in an age where we might use different kinds of data storage, like the various flavors of table and blob storage.
Calculate once. This is perhaps the biggest sin I've seen. You might have a large set of rules on whether or not you should display some piece of data. You have two choices: You can make those calculations every time the data is requested, or you can do it once and store the outcome of that decision. Which is going to be faster? Calculating once, probably in an infrequent data writing situation, or calculating every time, in a frequent read-only situation? I think the answer is pretty obvious.
Use caching, but only when it makes sense. Slapping an extra box with a bunch of memory on your network to store data is a pretty quick way to boost performance. There are some pitfalls to avoid, however. If the data changes frequently, make sure your code to invalidate the cache is well tested. Beware giant object graphs that serialize into gigantic objects that are many times larger than their binary counterparts. If you're caching because of expensive data querying or composition, fix that problem first.
Don't wait until the end to understand performance. I'll be honest, premature optimization annoys the crap out of me. Developers who waste time on what-ifs and try to code for them drive me nuts. That said, you can't pretend that performance is a last-mile consideration. Fortunately, most shops these days are working with continuous integration environments at least as far as staging or testing, so problems should become apparent early on.
Use appropriate instrumentation. I worked with one company that had a hard time finding the weak spots in its system, because it wasn't obvious where the problems were. Big distributed systems can have a lot of moving parts, and you need insight into how each part talks to the other parts. For that company, I insisted that we have a dashboard to show the average times and failure rates for calls to an external system. (I also wanted complete logging of every interaction, but didn't get it.) Sure enough, one system was choking on requests at one point every day, and we could address it.
Remember your HTTP basics. I'm being intentionally broad here. Think about the size and shape of scripts and CSS (minification and compression), the limits in the number of connections a browser has to any one host, cookies and headers, the very statelessness of what you're doing. The Web is not VB6, regardless of the layers of abstraction you pile on top of it.
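
To make the "calculate once" idea a little more concrete, here is a minimal sketch. Everything in it is hypothetical: the Post class, its fields and the three display rules are invented for illustration, and a real app would persist the stored outcome in its database rather than in memory. The point is only that the rules run on the infrequent write path, and the frequent read path checks a stored result.

```python
from dataclasses import dataclass, field


@dataclass
class Post:
    """Hypothetical forum post; the fields and rules are illustrative only."""

    author_is_banned: bool
    flag_count: int
    approved: bool
    # Outcome of the display rules, stored alongside the data.
    is_visible: bool = field(init=False)

    def __post_init__(self) -> None:
        self.recalculate_visibility()

    def recalculate_visibility(self) -> None:
        """Run the display rules once; called whenever the data is written."""
        self.is_visible = (
            self.approved and not self.author_is_banned and self.flag_count < 3
        )

    def add_flag(self) -> None:
        """A write path: change the data, then store the new outcome."""
        self.flag_count += 1
        self.recalculate_visibility()


def visible_posts(posts: list[Post]) -> list[Post]:
    """The read path only checks the stored outcome; no rules are evaluated."""
    return [post for post in posts if post.is_visible]


posts = [Post(False, 0, True), Post(True, 0, True), Post(False, 2, True)]
print(len(visible_posts(posts)))  # two posts pass the rules
posts[2].add_flag()               # a write re-runs the rules for that post
print(len(visible_posts(posts)))  # one post is left
```

The trade-off is the usual one for stored results: every write path has to remember to recalculate, so that code deserves the same testing attention as cache invalidation.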

These are mostly off the top of my head, but I'd love to hear more suggestions.

There are lots of distributed and redundant cache solutions that allow you to store object graphs. There's really no need to denormalize as a result. Memory is cheap.

123456 is my password - Tuesday, May 7, 2013 12:33:20 PM

Yes, and shuttling half-gig object graphs (true story) around between the cache and the front-end servers is hardly efficient, or good for bandwidth.

Jeff - Tuesday, May 7, 2013 4:13:51 PM

2 Comments