Archives

Top k algorithm revisited
3 years ago, I implemented top K operator in LINQ. I was asked recently why I chose Min Heap since there are faster algorithms. To recap, we try to select top k element from a sequence of n elements. A minheap has the following property:
 findmin takes O(1) time.
 extractmin takes O(ln k) time where k is the size of the heap.
 insert takes O(ln k) time.
For each number in the sequence, I first compare the number to findmin. If the number is smaller, the number is tossed away. If the number is bigger, we do a extractmin followed by an insert. So in the worst scenario, the algorithm runs with time complexity of O(n ln k) and the space complexity of O(k).
If we use maxheap instead, we can heapify n elements in O(n) time. Then we do k extractmax so we have the total time complexity of O(n + k ln n) and a space complexity of O(n).
We could also use Quick Select. It is very similar to Quick Sort that we randomly select a pivot and move it to the right position. Unlike Quick Sort, we can discard the left side of the pivot whenever we have greater then k elements on the right side. This algorithm converge fairly quickly and we have the average time complexity of O(n) and space complexity of O(n). In average case, the space requirement by Quick Select is less than the max heap approach.
So both maxheap and quick select are likely faster than the minheap approach. Why do I used minheap then? The reason is that the minheap approach uses minimum amount of memory and I assume that I will work with large dataset so . Also, if we work with a stream, the minheap provides a running top k.