Top k algorithm revisited
3 years ago, I implemented top K operator in LINQ. I was asked recently why I chose Min Heap since there are faster algorithms. To recap, we try to select top k element from a sequence of n elements. A min-heap has the following property:
- find-min takes O(1) time.
- extract-min takes O(ln k) time where k is the size of the heap.
- insert takes O(ln k) time.
For each number in the sequence, I first compare the number to find-min. If the number is smaller, the number is tossed away. If the number is bigger, we do a extract-min followed by an insert. So in the worst scenario, the algorithm runs with time complexity of O(n ln k) and the space complexity of O(k).
If we use max-heap instead, we can heapify n elements in O(n) time. Then we do k extract-max so we have the total time complexity of O(n + k ln n) and a space complexity of O(n).
We could also use Quick Select. It is very similar to Quick Sort that we randomly select a pivot and move it to the right position. Unlike Quick Sort, we can discard the left side of the pivot whenever we have greater then k elements on the right side. This algorithm converge fairly quickly and we have the average time complexity of O(n) and space complexity of O(n). In average case, the space requirement by Quick Select is less than the max heap approach.
So both max-heap and quick select are likely faster than the min-heap approach. Why do I used min-heap then? The reason is that the min-heap approach uses minimum amount of memory and I assume that I will work with large dataset so . Also, if we work with a stream, the min-heap provides a running top k.