About a month ago, I wrote all about my experience migrating my sites off of dedicated hardware and into Azure. I figured I would wait a while before writing about the daily operation of those sites, so I could gather enough experience to make a meaningful assessment. As I said in the previous post, this is a move I had been looking forward to making for a good three years, ever since I actually worked with Azure from within Microsoft. The pricing finally came down to a point where it made sense for an indie publisher, and here we are.
A lot has happened in the last month, which is remarkable when you think about it. They're moving pretty quickly to improve the service. They even fixed the scheduler problem I described last time (it turned out to be a problem with the portal). In a general sense, I've found it very stable outside of the problem I'll describe below, but the database billing is screwed up. I'm finally at a point where I can stop watching it and trust that it works as intended, and there are definitely some pluses and unexpected savings.
So let's get the negatives out of the way. PointBuzz was crashing in a completely weird way. Basically, it would just stop responding entirely. The logging didn't show anything, there weren't any 500 errors, and the browser would just sit there, never getting anything back from the request. Also weird: while it would die at any time of day (including during the great water main break at Cedar Point, a key time for the site), I saw it die several times during the 11 p.m. hour, which seemed unlikely to be a coincidence. The only thing I could think of that made the site different was that it ran on v3.5 of .NET instead of v4.5. After being frustrated with the people who handle billing support (tech support is an upcharge, which is a real problem when there's a real platform issue), I expressed my frustration to someone very high up the chain at Microsoft, because I didn't know what else to do. He put me in touch with some people who looked deeper. They didn't find anything either, though they did observe that turning on the diagnostic functionality didn't work on any v3.5 site. That reinforced my theory that the framework version had something to do with the problem.
While the folks at Microsoft were looking into things, I refactored the site to run on v4.5. It took a few hours, but I eventually got it working. I redeployed, configured it for v4.5, and it hasn't had a single issue since. There's no resolution to the v3.5 problem that I'm aware of, but if you have something you want to put in Azure Web Sites on that version, I wouldn't recommend it.
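For what it's worth, the app-side half of that move is mostly a web.config change, which looks something like this (a minimal sketch; your actual compilation settings, assembly bindings, and site name will vary):

```xml
<!-- web.config: target the v4.x CLR instead of v2.0/v3.5 -->
<configuration>
  <system.web>
    <compilation targetFramework="4.5" />
    <httpRuntime targetFramework="4.5" />
  </system.web>
</configuration>
```

The other half is flipping the site's .NET framework version setting to v4.5 in the Azure portal's configuration page, so the app pool itself runs the newer runtime.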
The other problem I can't explain is the database pricing. It's bad enough that they bill in a completely arbitrary "DB Unit," but what's really frustrating is that with the deprecated-next-year Web/Business tiers, the amount they're charging me doesn't match the published pricing. As you may recall from the previous post, I originally tried to import the data into the newer Standard tier, but after several hours on a test run, it was getting nowhere. I settled for the old tiers, but the pricing makes no sense. According to the pricing details, CoasterBuzz should be priced for 10 gigs, at $45.96 per month. And yet, after 24 days, it has already billed $67! What's that about? I'm going to file a ticket for it, but I don't expect a positive outcome.
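To put a number on how far off that is, here's a back-of-the-envelope check. It assumes the published flat monthly rate is simply prorated by day over a 30-day month, which may not be how Azure actually meters "DB Units":

```python
# Back-of-the-envelope check: prorate the published monthly rate by day.
# Assumes a flat $45.96/month for the 10 GB Web/Business tier and a
# 30-day month; actual Azure metering may work differently.
monthly_rate = 45.96
days_elapsed = 24
expected = monthly_rate * days_elapsed / 30
print(f"expected after {days_elapsed} days: ${expected:.2f}")  # $36.77
```

Roughly $37 expected versus $67 billed, so the charge is nearly double what a straight proration of the listed price would produce.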
Two good things since the SQL migration: they've improved the performance of all of the new tiers by a factor of five, and just as I started to write this, I noticed that you can finally migrate the old Web/Business tiers to the new Basic/Standard/Premium tiers. That means the big databases for PointBuzz and CoasterBuzz will be a flat $20/month each, for up to 250 gigs in size. The price doubles in a year, but I suspect the pricing will come down anyway.
So for the first full month, I'm trending toward a total cost of about $190. If the new database pricing goes as expected, I think next month will be around $130. The dedicated server I had was costing $167. My big fear about bandwidth turned out to be largely unfounded: I had never considered that my nightly backups to S3 were the reason I was pushing out around a terabyte every month. Now I'm backing up to storage within Azure, so that bandwidth cost goes away, and I'm only pushing about 150 gigs outbound.
There's also a lot of built-in redundancy, and I think this is where some of the greatest value of a cloud platform comes from. First of all, the storage is locally redundant by default, meaning the data is copied to other disks elsewhere in the data center. I also have geo-redundancy enabled, meaning it's copied to an entirely different data center in another part of the country as well. Right now I'm mostly using storage for nightly database backups, and I'm paying about $2 for almost 30 gigs. I could turn on geo-redundant storage with read-only access to the other region for another dollar, if I wanted total overkill.
For all of my complaining about SQL Azure, it too is redundant without any intervention on my part. There are always at least two copies of the data, so hardware failures aren't really something to be concerned about. Add in the nightly backups to storage, and it's pretty solid. At some point they're supposed to add the ability to restore to a point in time (via the transaction logs, I assume), but they haven't enabled it yet.
I'm running the Web Sites themselves on a small instance (1 core, 1.75 GB of RAM) in the standard tier. Not surprisingly, this is plenty of room, because I write efficient code. :) Seriously though, the sites collectively serve a few million requests daily, CPU never goes over 10%, and RAM usage hovers around 80%. I use the standard tier because for less than $20 more, you get a few SSL certificate slots, unlimited web sockets (the forums use these), automated backups, the scheduler, and probably my favorite feature, staging deployments. You can deploy to staging, then click a button in the portal to swap the staging and production sites. If the deployment hopelessly fails, you can swap back and be up again in two seconds.
There are other nice things too. Getting email alerts was helpful when I had the PointBuzz problems (I have an alert that emails two people when requests per five minutes drops below one). Endpoint monitoring gives me a good idea of response times from all over the world. WebJobs and queues are very cool new features, and I'll likely use them in a future project. The free credits from SendGrid take care of email connectivity. The various charts and graphs are cool. The new portal, in preview, shows promise, but a lot of stuff doesn't work yet.
Maybe the most important question is: how is the performance? Generally speaking, it's awesome. Now that the PointBuzz issue is worked out, it's also surprisingly consistent. CPU and RAM usage follow expected curves. The endpoint monitoring shows the PointBuzz home page with consistent response times below 30 ms from Virginia! CoasterBuzz varies a lot more, and I'm not sure exactly why. It still tends to clock in under 200 ms, but I need to look deeper.
Despite the problems, I now feel like everything is in a pretty good spot, and I'm pretty happy with it. The server in Dallas was solid, but having to maintain disk space and SQL logs and HTTP logs and all of that stuff got kind of old at times. I like the Azure platform because it takes all of that maintenance out of your life and instantly gives you tools and "hardware" if you need them. My sites aren't built to scale out (lots of local caching), but they could scale up at a moment's notice if I had to. As the TV commercial once said, "Yay cloud."