One year with sites in Azure
My sites (CoasterBuzz and PointBuzz) have now been in Azure for about a year. It has been interesting, for sure. From a high altitude perspective, I can tell you that I've saved money when compared to renting a dedicated server, but there have been challenges in terms of reliability.
Because of my early work using Azure when I was working at Microsoft (MSDN stuff), I looked forward to the day when I could move my apps there. Back then, in 2010, all we had were worker/web roles and storage, and SQL Azure was, uh, not what I'd call robust. A year ago, the economics made sense, the feature set was amazing, and here I am.
In terms of cost, the combination of resources I use has been 25% cheaper than the dedicated server (which I paid $167 for each month at SoftLayer). Mind you, I'm talking only about monthly consumption, because the dedicated server also involved licensing for different things over the years, most notably a SQL Server license. So that makes me generally happy, especially because ad revenue has really sucked for the last year.
Performance has been a mixed bag, but appears to be getting better for a number of reasons. Both of the big sites are now running on the newest everything, including POP Forums, which makes a difference in terms of computing performance. But I also think that they've changed some things in the background, or some magic has happened where I'm on less stressed hardware... who knows. Pages are now being served in less than 50 ms in some cases, as viewed from Azure's Virginia monitoring point. More importantly, Google is now fetching pages in around 100 ms, which is good for search juice. I had fast rendering on the dedicated hardware, but SoftLayer's connectivity seemed to have more latency.
In my case, I'm running the sites on a standard small instance (1 core, 1.75 gigs of RAM), at $74 per month. My apps can't do multi-instance, because they use in-memory cache (HttpRuntime.Cache, for the nerds). I have done some experimentation with Redis cache, and I could probably set it up in production if I had to because the caching code is easily swapped out (use your dependency injection, kids). I just don't want to spend twice as much. Yet. I would love to have the simple redundancy.
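To illustrate the swap-friendly caching described above, here's a minimal sketch. The interface and class names are hypothetical, not POP Forums' actual code: the idea is that the app codes against a cache abstraction, with the in-memory implementation registered today and a Redis-backed one registered instead if multi-instance ever becomes worth the money.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical cache abstraction; the real interface in the app may differ.
public interface ICacheProvider
{
    void Set(string key, object value);
    T Get<T>(string key) where T : class;
}

// In-memory implementation, standing in for HttpRuntime.Cache.
// Only works on a single instance, since each VM has its own memory.
public class InMemoryCacheProvider : ICacheProvider
{
    private readonly ConcurrentDictionary<string, object> _store =
        new ConcurrentDictionary<string, object>();

    public void Set(string key, object value)
    {
        _store[key] = value;
    }

    public T Get<T>(string key) where T : class
    {
        object value;
        return _store.TryGetValue(key, out value) ? value as T : null;
    }
}

// A Redis-backed implementation (e.g., built on StackExchange.Redis)
// would implement the same interface and be registered with the DI
// container in place of this one, which is what makes scaling out
// to multiple instances possible.
```

With the abstraction in place, moving to Redis is a one-line change in the container registration rather than a hunt through the codebase, which is exactly why the dependency injection advice matters.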
Here's the problem though... I'm always battling memory pressure. The two apps together only consume about 600 MB on average, but apparently there can be a good gig of overhead from the OS running the VM. And those nifty diagnostic bits (the Kudu stuff on the "scm" subdomains) can be a big memory hog as well. Probably most importantly, you should turn off your other deployment slots if you're not using them. As such, I only keep my staging sites running when I'm in the process of deploying new code.
Frustratingly, the monitoring for websites, er, Web Apps now, doesn't look at the underlying virtual machine as a unit. Well, it sort of does. If you're willing to use the "preview" portal, which has been in preview for more than a year now, you can take an awkward path to the instance via one of the individual sites, er, apps, and see the memory and CPU graphs. It's not at all contextual, because the data appears in a box called "quotas," and unless you know better, you don't realize it's the sum total for all apps on the VM. And that's all assuming the preview portal works at all. Over the last year I've seen enough stuff not work, and clicked in vain enough, to just leave it alone.
That frustration was at its worst in the first few months, when I couldn't understand the memory pressure problem because of the lack of context. I get it now, and outside of the outages, things are pretty stable.
SQL Azure performance has been totally predictable from the start. I'm running standard S0 databases for both sites, and my DTU percentage has rarely gone higher than 20%. Since they switched the pricing model to performance tiers instead of driving it by database size, it has been a huge value. Each database costs $15 per month.
Of course, I can't not talk about The Great Azure Outage of 2014, or the little one that happened just a few weeks ago in the East region. The sites were off the air for about two hours in November due to the deployment of some storage-related updates. Since everything uses storage, obviously that was bad news. It wasn't the downtime that bothered me in that case, but rather the awful communication around it. When the status page says everything is green, and your stuff is still down, that's not good. They eventually published a big post-mortem about what happened, but in the moment, it was a terrible experience.
Then just a few weeks ago, something went down in the east region. It was again a storage issue, more limited in scope, but the communication was far worse. In fact, there was no public disclosure about what went down, but in the comments of an official Azure blog post, they said you could request the RCA (which after some searching I learned stood for "root cause analysis"). That struck me as completely absurd. In any case, with both outages, I did get service credits.
Support is an issue as well. I've run into a lot of issues with the platform itself, I suppose because my apps were being memory hogs, but with no way to diagnose that at the time, I was stuck. I'm not going to lie: since support only comes if it's paid for (unless it's for billing), I email people I've had contact with in the organization and hope they'll forward me to the right people. For example, I recently had an issue with phantom metrics in my monitoring dashboard for one of the apps that caused an error. I wouldn't have had any recourse through standard support means to get that fixed, and that has to change.
It sounds a little negative, but outside of the two outages, I can't say that things have been unstable on average. The first four or five months were not smooth because of the memory pressure, but now that I can manage it, things have been super stable. The value for what I'm paying is outstanding (all of the redundancy alone is more or less "free"), and performance is what I expect. I never have to upgrade software or install patches or truncate logs or any of that IT nonsense. There's a lot of value in that as well. Plus I can spin up and tear down stuff on the fly that would have been a pain in the ass before. Table storage, queues, CDNs, service buses, caches and the like are fantastic to have outside of the context of specific server hardware, or even VMs. As a software nerd, it's like being at Walt Disney World. (And I would know... I live next door to Magic Kingdom.)