The great Azure outage of 2014

We had some downtime on Tuesday night for our sites, about two hours or so. On one hand, November is the slowest month for the sites anyway, but on the other hand, we pushed a new version of PointBuzz that I wanted to monitor, and I did post a few photos from IAAPA that were worthy of discussion. It doesn't matter either way, because the sites were down and there was nothing I could do about it, thanks to a serious failure in protocol on Microsoft's Azure platform.

I'm going to try and be constructive here. I'll start by talking about the old days of dedicated hardware. Back in the day, if you wanted to have software running on the Internet, you rented servers. Maybe you had virtual networks between machines, but you still had physical and specific hardware you were running stuff on. If you wanted redundancy, you paid a lot more for it.

I switched to the cloud last summer, after about 16 years in different hosting situations. At one point I had a T-1 and servers at my house (at a grand per month, and believe it or not that was the cheapest solution). Big data centers and cheap bandwidth eventually became normal, and most of that time I was spending $200 or less per month. Still, as a developer, it required me to spend a lot of time on things that I didn't care about, like patching software, maintaining backups, configuration tasks, etc. It also meant that I would encounter some very vanilla failures, like hard disks going bad or some routing problem.

Indeed, for many years I was at SoftLayer, which is now owned by IBM and was formerly called The Planet. There was usually one instance of downtime every other year. I had a hard drive failure once, a router's configuration broke in a big way, and one time there was even a fire in the data center. Oh, and one time I was down about five hours as they physically moved my aging server between locations (I didn't feel like upgrading... I was getting a good deal). In every case, either support tickets were automatically generated by their monitoring system, or I initiated them (in the case of the drive failure). There was a human I could contact and I knew someone was looking into it.

I don't like downtime, but I accept that it will happen sometimes. I'm cool with that. In the case of SoftLayer, I was always in the loop and understood what was going on. With this week's Azure outage, that was so far from the case that it was inexcusable. They eventually wrote up an explanation about what happened. Basically they did a widespread rollout of an "improvement" that had a bug, even though they insist that their own protocol prohibits this.
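
Since their write-up described a rollout that skipped the incremental process, here's a minimal sketch of what a staged ("flighted") deployment gate looks like in general. The slice names and check functions are hypothetical placeholders, not Azure's actual tooling, just the shape of the protocol they say they have.

```python
# Minimal sketch of a staged rollout: deploy one slice at a time and
# stop if health checks fail, so a bad build never reaches everyone.
# Slice names and check bodies are hypothetical placeholders.

ROLLOUT_SLICES = ["test-cluster", "region-1", "region-2", "region-3"]

def deploy_to(slice_name: str) -> None:
    """Push the new build to a single slice (placeholder)."""
    print(f"deploying to {slice_name}")

def healthy(slice_name: str) -> bool:
    """Run post-deployment health checks for the slice (placeholder)."""
    return True

def staged_rollout() -> None:
    for slice_name in ROLLOUT_SLICES:
        deploy_to(slice_name)
        if not healthy(slice_name):
            # Halt the rollout before the bug reaches every region.
            print(f"health check failed in {slice_name}; stopping rollout")
            return
    print("rollout complete")

if __name__ == "__main__":
    staged_rollout()
```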

But it was really the communication failure that frustrated most people. Like I said, I think most people can get over a technical failure; they don't like it, but they deal with it. What we got was vague Twitter posts about what "may" affect customers, and a dashboard that was completely useless. It said "it's all good" when it clearly wasn't. Worse, if you acknowledge a problem with blob storage but declare websites and VMs all green, even though they depend on storage, you're doing it wrong. Not all customers would know that. If a dependency is down, then that service is down too.
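
To make the dependency point concrete, here's a small sketch of dependency-aware status reporting, with hypothetical service names. The idea is simply that a dashboard shouldn't show a service green when something it depends on is down.

```python
# Sketch of dependency-aware status rollup for a status dashboard.
# Service names and reported statuses are hypothetical examples.

DEPENDENCIES = {
    "blob-storage": [],
    "websites": ["blob-storage"],
    "virtual-machines": ["blob-storage"],
}

REPORTED_STATUS = {
    "blob-storage": "down",
    "websites": "up",
    "virtual-machines": "up",
}

def effective_status(service: str) -> str:
    """A service is only 'up' if it and everything it depends on is up."""
    if REPORTED_STATUS[service] != "up":
        return REPORTED_STATUS[service]
    for dep in DEPENDENCIES[service]:
        if effective_status(dep) != "up":
            return "degraded"  # a dependency is down, so don't show green
    return "up"

if __name__ == "__main__":
    for name in DEPENDENCIES:
        print(f"{name}: {effective_status(name)}")
```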

The support situation is also frustrating. Basically, there is no support unless you have a billing issue or you pay for it. Think about that for a minute. If something bad happens beyond your control, you have no recourse unless you pay for it. Even cable companies have better support than that (though not by much).

Microsoft has to do better. I think what people really wanted to hear was, "Yeah, we messed up really bad, not just in service delivery, but in the way we communicated." The support situation has to change too. I have two friends now who had VMs more or less disappear, and they couldn't get them back. They had to buy support, which then failed to "find" them. Talk about insult to injury.

Hopefully this is just a growing pain, but a significant problem can't be handled like this again, at least from a communication standpoint.

2 Comments

  • I disagree with your conclusion that "But it was really the communication failure that frustrated most people". Azure was down for 5 hours. People were mad about that. Proper communication would not have brought their sites up faster.

  • Of course not, but sounding the "all clear" when things aren't fine compounds the frustration many times over.
