Website performance – know what’s coming
If you live in Australia (and perhaps even outside of Oz),
there has been a lot of attention on the
Australian Bureau of Statistics (ABS) regarding the
Census 2016 website
and its
subsequent failure to adequately handle the peak usage
times it was supposedly designed for. It was also reported
that a Denial of Service attack was launched against the
site in order to cause an outage, which obviously worked.
One can argue where the fault lay, and no doubt there are
blame games being played out right now.
IBM was tasked with the delivery of the website and
used a
company called "RevolutionIT" to perform load testing. It is my opinion that while RevolutionIT performed the
load testing, IBM is the one who should wear the
blame. Load testing and performance testing simply provide
metrics for a system under given conditions. IBM would need
to analyse those metrics to ensure that aspects of the
system were performing as expected. This is not a one-off
task either; ideally it should be run repeatedly to
ensure changes being applied are having the desired
effect.
Typically, a load test run would be executed against a
single instance of the website to ascertain the baseline
performance for a single unit of infrastructure, with a
"close to production version" of the data store it was
using.
Once this single unit of measure is found, it is a
relatively easy exercise to extrapolate how much
infrastructure is required to handle a given load. More
testing can then be performed with more machines to validate
this claim. This is a very simplistic view of the situation and
there are far more variables to consider, but in essence:
baseline your performance, then iterate upon that.
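As a rough sketch of that extrapolation step, the arithmetic looks something like the following. All numbers here are hypothetical, purely for illustration; real capacity planning has to account for data-store contention, network limits and uneven load distribution.

```python
import math

def instances_needed(peak_rps: float, baseline_rps: float,
                     headroom: float = 0.7) -> int:
    """Instances required if a single instance sustains `baseline_rps`,
    run at `headroom` (here 70%) utilisation to absorb spikes.
    `headroom` and all inputs are assumed values for illustration."""
    return math.ceil(peak_rps / (baseline_rps * headroom))

# e.g. a measured baseline of 400 req/s per instance and an
# expected peak of 15,000 req/s:
print(instances_needed(15_000, 400))  # -> 54
```

The headroom factor is the important part: sizing to 100% of the measured baseline leaves no room for the spikes described below.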
Infrastructure is just one part of that picture though. A very important one of course, but not the only one. Care must be taken to ensure the application is designed and architected in such a way to make it resilient to failure, in addition to performing well. Take a look at the graph below.
Note that this is not actual traffic on census night but
rather my interpretation of what may have been a factor. The
orange bars represent what traffic was expected during that
day, with the blue representing what traffic actually
occurred on the day. Again, purely fictional in terms of
actual values but not too far from what probably occurred.
At times of the day convenient for the
Australian public, people tried to access the site and fill
in their census forms.
A naïve expectation is to think that people will be nice net
citizens and plod along, happy to distribute
themselves evenly across the day, with mild peaks. A more realistic
expectation is akin to driving in heavy traffic. People
don’t want to go slower and play nice; they want to go
faster. See a gap in the traffic? Zoom in, cut others off
and speed down the road to your advantage. This is the same
as hitting F5 in your browser attempting to get the site to
load. Going too slowly? Hit F5 again and again. Your problem
is now even worse than your estimates, as each impatient
user can easily triple the number of requests being made.
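The amplification effect is simple to model. The figures below are assumed, not census data, but they show how quickly refresh behaviour inflates a load estimate:

```python
def effective_rps(users_per_second: float, avg_refreshes: float) -> float:
    """Requests actually hitting the site: one initial request per user
    plus `avg_refreshes` extra F5 presses while they wait.
    Both inputs are hypothetical figures for illustration."""
    return users_per_second * (1 + avg_refreshes)

# A planned-for 10,000 users/s who each refresh twice becomes
# 30,000 req/s — triple the estimate:
print(effective_rps(10_000, 2))  # -> 30000.0
```

Worse, this is a feedback loop: the slower the site responds, the more users refresh, which slows the site further.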
To avoid these situations, you need to have a good
understanding of your customers’ usage habits. Know or learn
the typical habits of your customers to ensure you get a
clear view of how your system will be used. Will it be used
first thing in the morning while people are drinking their
first coffee? Maybe there will be multiple peak times across
the morning, lunch and the evening? Perhaps the system will
remain relatively unused most days except Friday to
Sunday?
In a performance testing scenario, especially in a situation
similar to Census where you know in advance you are going to
get a lot of sustained traffic at particular times, you need
to plan for the worst. If you have some degree of metrics
around what the load might be, ensure your systems can
handle far more than expected. At the very least, ensure
that should you encounter unexpected or extremely heavy
traffic, your systems can scale and can fail with grace.
This means that if your system cannot cope, it can at least
display some form of information to the user, and resume
service once the system can handle the load again.
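"Failing with grace" can be as simple as shedding load deliberately: when an instance is over capacity, answer immediately with a friendly message and a retry hint rather than letting every connection time out. A minimal sketch, with an assumed capacity threshold and simplified request handling:

```python
MAX_IN_FLIGHT = 1_000  # assumed capacity of one instance

def handle(in_flight: int) -> tuple[int, dict, str]:
    """Return (status, headers, body) for an incoming request.
    Over capacity, shed load with a 503 and a Retry-After hint
    instead of queuing the request until it times out."""
    if in_flight >= MAX_IN_FLIGHT:
        return (503,
                {"Retry-After": "120"},
                "We are experiencing very heavy traffic. "
                "Please try again in a couple of minutes.")
    return (200, {}, "census form page")

status, headers, body = handle(in_flight=1_500)
print(status, headers["Retry-After"])  # -> 503 120
```

The user sees an honest message instead of a spinning browser, is less likely to hammer F5, and the system recovers on its own once in-flight requests drain below the threshold.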
Again, infrastructure plays an important part here, but it
can all be for naught if you do not design and architect for
scale. One positive out of the census issues is
that this kind of design will hopefully be thought about the
next time such a system is made available. With the
aftermath still quite raw, now is a great time to raise
the performance needs of the systems that you
manage or design, and bring them to the attention of those
who can do something about them.
Sources:
http://www.crn.com.au/news/revolution-it-census-fail-was-not-our-fault-433913
http://risky.biz/censusfailupdate
http://www.abs.gov.au/AUSSTATS/abs@.nsf/mediareleasesbyReleaseDate/5239447C98B47FD0CA25800B00191B1A?OpenDocument
http://www.lifehacker.com.au/2016/08/ibm-and-the-abs-census-let-the-blame-games-begin/