Performance in .NET – Part 3

Friday, December 7, 2018

Introduction

This post is part of a series on performance in .NET. See the first one on object instantiation here and the second on property copying here. This time I’m going to talk about collections, but focusing on the performance side.

Collections

Back in 2009 (!) I wrote a blog post, which I updated a couple of times, about .NET collection types. Essentially, my point was – is – that you should pick the right collection for the job that you have at hands. This post is still up to date.

How many of us don’t by default just choose List<T> when there is need for a general-purpose collection? I certainly do, at times… Well, it turns out that this may or may not be what we need. Let me give a few examples.

List<T> is an array-based collection, which means that it is probably the best if you are going to iterate through items one at a time, sequentially, but it is not so good if you want to remove items from a random position other than the list’s end, because this causes a whole new array to be instantiated in memory, and all items (except the one that you wish to remove, of course) from the original list to be copied into the new one. It’s the same problem with random inserts at any given position, other than the end, and only if the initial capacity isn’t exceeded.

For operations where random inserts and deletes are required, LinkedList<T> is a much better choice as it does not require the instantiation of new arrays and the memory copy. It does that, however, at the expense of a slightly poorer performance in list traversal.

What about duplicates? With the previous list implementations, if we don’t want to allow duplicates, we need to check one by one, which is a PITA. Luckily, the .NET BCL has an implementation of a mathematical set which automatically excludes duplicates. One implementation is HashSet<T>, which is an indexed collection that uses each object’s GetHashCode method to figure out if the object already exists in the collection; needless to say, this method must be properly implemented.

Talking about indexed collections, if we want to be able to rapidly get to one element – or a number of elements – by some key, .NET has that as well: Dictionary<TKey, TValue> for storing a single value per unique key. The key can be of whatever type we want. This one also offers good performance when it comes to retrieving, adding, removing or modifying a single item.

Then we have LIFO and FIFO implementations in the form of the Stack<T> and Queue<T> types, which are optimized for adding items at the top or the bottom of the list, respectively, and don’t even allow other kinds of operations other than traversal. Internally they also use a linked-list approach.

Bits have their own specialized collections, BitArray and BitVector32. The latter only works with up to 32 bits and it probably provides the fastest performance of the two, according to Microsoft:

BitVector32 is more efficient than BitArray for Boolean values and small integers that are used internally. A BitArray can grow indefinitely as needed, but it has the memory and performance overhead that a class instance requires. In contrast, a BitVector32 uses only 32 bits.

Finally, Microsoft makes available thread-safe collections that are thread-safe in nature and therefore do not need any thread synchronization mechanisms, which makes them faster than if we had to roll out our own thread synchronization. They include thread-safe dictionaries (ConcurrentDictionary<TKey, TValue>), queues (ConcurrentQueue<T>), stacks (ConcurrentStack<T>) and general-purpose lists (BlockingCollection<T>).

Conclusion

I didn’t go through all the collection classes available, for that you can refer to my previous post.

The point I want to make with this post is:

Pick the right collection for the job; use the 80-20 rule and try to understand what the most common usage for your collection will be; do not just go blindly with List<T> or similar
Always expose collections publicly though interfaces or base classes that only show the bare minimum required, so that you can swap out the implementation should you need to
And, of course, measure your usage so that you can make opinionated decisions.

If performance is not an issue, by all means, forget about this and keep on doing what you are already doing and works for you.