An efficient Top K algorithm with implementation in C# for LINQ
The LINQ library currently does not have a dedicated top K implementation. Programs usually use the OrderBy function followed by the Take function. For N items, an efficient sort algorithm would scale O(n) in space and O(n log n) in time. Since we are counting the top K, I believe that we could devise an algorithm that scales O(K) in space.
Among all the sorting algorithms that I am familiar, the heapsort algorithm came to my mind first. A Heap has a tree-like data structure that can often be implemented using an array. A heap can either be a max-heap or a min-heap. In the case of min-heap, the minimum element of at the root of the tree and all the child elements are greater than their parents.
Binary heap is a special case of heap. A binary min-heap has the following characteristics:
- find-min takes O(1) time.
- delete-min takes O(log n) time.
- insert takes O(log n) time.
So here is how I implement my top K algorithm:
- Create a heap with an array of size K.
- Insert items into the heap until the heap reaches its capacity. This takes K O(log K) time.
- For each the remaining elements, if an element is greater than find-min of the heap, do a delete-min and then insert the element into the heap.
- Then we repeated delete-min until the heap is empty and arrange the deleted element in reverse order and we get our top 10 list.
The time it takes to find top K items from an N item list is:
O(N) * t1 + O((K + log N - log K) * log K) * t2
Here t1 is the time to compare an element in step 3 to find-min and t2 is the combined time of delete-min and the subsequent insert. So this algorithm is much more efficient that a call to OrderBy followed by a call to Take.
I have checked-in my implementation to my Sky LINQ project. The LINQ Top and Bottom is at in LinqExt.cs. The internal binary heap implementation is in BinaryHeap.cs. The LINQ example can be found in HeapSort.cs. The Sky Log example has also been updated to use the efficient Top K algorithm.
Note that this algorithm combined OrderBy and Take by GroupBy itself still uses O(N) splace where N is number of groups. There are probabilistic approximation algorithms that can be used to further alleviate memory foot print. That is something I could try in the future.