The LINQ library currently does not have a dedicated top-K implementation. Programs usually call OrderBy followed by Take. For N items, an efficient sort scales O(N) in space and O(N log N) in time. Since we only keep the top K items, we should be able to devise an algorithm that scales O(K) in space.
Among all the sorting algorithms that I am familiar with, heapsort came to mind first. A heap is a tree-like data structure that is often implemented over an array. A heap can be either a max-heap or a min-heap. In a min-heap, the minimum element is at the root of the tree, and all child elements are greater than their parents.
A binary heap is a special case of a heap. A binary min-heap has the following characteristics:
- find-min takes O(1) time.
- delete-min takes O(log n) time.
- insert takes O(log n) time.
So here is how I implemented my top-K algorithm:
- Create a heap backed by an array of size K.
- Insert items into the heap until the heap reaches its capacity. This takes O(K log K) time.
- For each of the remaining elements, if the element is greater than the heap's find-min, do a delete-min and then insert the element into the heap.
- Repeatedly delete-min until the heap is empty, and arrange the deleted elements in reverse order to get the top-K list.
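The steps above can be sketched in C# as follows. This is a minimal illustration only, not the actual Sky LINQ implementation (which lives in BinaryHeap.cs and LinqExt.cs); the names here are made up for the sketch.

```csharp
using System;
using System.Collections.Generic;

static class TopKSketch
{
    public static List<T> Top<T>(IEnumerable<T> source, int k) where T : IComparable<T>
    {
        var heap = new List<T>(k);               // array-backed binary min-heap of size K

        foreach (T item in source)
        {
            if (heap.Count < k)
            {
                heap.Add(item);
                SiftUp(heap, heap.Count - 1);    // insert: O(log K)
            }
            else if (item.CompareTo(heap[0]) > 0) // compare against find-min: O(1)
            {
                heap[0] = item;                   // delete-min + insert in one step
                SiftDown(heap, 0);               // O(log K)
            }
        }

        // Repeatedly delete-min, then reverse to get the top K in descending order.
        var result = new List<T>(heap.Count);
        while (heap.Count > 0)
        {
            result.Add(heap[0]);
            heap[0] = heap[heap.Count - 1];
            heap.RemoveAt(heap.Count - 1);
            if (heap.Count > 0) SiftDown(heap, 0);
        }
        result.Reverse();
        return result;
    }

    static void SiftUp<T>(List<T> h, int i) where T : IComparable<T>
    {
        while (i > 0)
        {
            int parent = (i - 1) / 2;
            if (h[i].CompareTo(h[parent]) >= 0) break;
            T tmp = h[i]; h[i] = h[parent]; h[parent] = tmp;
            i = parent;
        }
    }

    static void SiftDown<T>(List<T> h, int i) where T : IComparable<T>
    {
        while (true)
        {
            int left = 2 * i + 1, smallest = i;
            if (left < h.Count && h[left].CompareTo(h[smallest]) < 0) smallest = left;
            int right = left + 1;
            if (right < h.Count && h[right].CompareTo(h[smallest]) < 0) smallest = right;
            if (smallest == i) break;
            T tmp = h[i]; h[i] = h[smallest]; h[smallest] = tmp;
            i = smallest;
        }
    }
}
```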
The time it takes to find the top K items in an N-item list is:
O(N) * t1 + O(K * (1 + log N - log K)) * t2
Here t1 is the time to compare an element against the heap's find-min during the scan, and t2 is the combined time of a delete-min and the subsequent insert. The second factor counts the K initial inserts plus the expected number of replacements, roughly K * ln(N/K). So this algorithm is much more efficient than a call to OrderBy followed by a call to Take.
I have checked my implementation into my Sky LINQ project. The LINQ Top and Bottom extensions are in LinqExt.cs. The internal binary heap implementation is in BinaryHeap.cs. The LINQ example can be found in HeapSort.cs. The Sky Log example has also been updated to use the efficient top-K algorithm.
Note that when this top-K algorithm is combined with GroupBy, the GroupBy itself still uses O(N) space, where N is the number of groups. There are probabilistic approximation algorithms that can further reduce the memory footprint. That is something I could try in the future.
Yesterday, I learnt about Tx (LINQ to Logs and Traces) from This Week On Channel 9. The name suggests that it overlaps with my end-to-end example on LINQ to W3SVC log files. It turns out that the overlap is minimal.
Tx is a Microsoft Open Technologies project. The source-code commit history suggests that it was originally part of Reactive Extensions (Rx), developed by the Rx team, and that it became an independent project a year ago. Tx uses Rx heavily.
The Tx project has a page on “When to use Tx?”, and I quote a portion of it below.
When NOT to use Tx
Use only LINQ-to-Objects when:
- There are no real-time feeds involved
- The data is already in memory (e.g. List<T>), or you are dealing with a single file that is easy to parse - e.g. a text file, each line representing an event of the same type.
Use only Reactive Extensions (Rx) when:
- There are real-time feeds (e.g. mouse moves)
- You need the same query to work on past history (e.g. file(s)) and real-time
- Each feed/file contains a single type of events, like T for IObservable<T>
When to use Tx
Tx.Core adds the following new features to Rx:
- Support for multiplexed sequences (a single sequence containing events of different types in order of occurrence). The simplest example of turning a multiplexed sequence into type-specific Observable-s is the Demultiplexor
- Merging multiple input files in order of occurrence - e.g. two log files
- Heterogeneous inputs - e.g. queries across some log files (.evtx) and some traces (.etl)
- Single-pass-read to answer multiple queries on file(s)
- Scale in # of queries. This is a side effect of the above, which applies to both real-time and single-read of past history
- Providing virtual time as per event timestamps. See TimeSource
- Support for both "Structured" (like database, or Rx) and "Timeline" mode (like most event-viewing tools)
The Tx W3CEnumerable.cs does overlap with my log-parser implementation. I wish I had noticed the project earlier. In the future, I will try to align my project with Tx to make it easier to use them together. I just finished an implementation of duck typing, so interoperating between different types built from the same idea should be easy.
Eric Lippert’s blogs (and his old blogs) have been my source for the inner workings of VBScript and C# over the past decade. Recently, his post “What is ‘duck typing’?” drew a lot of discussion and prompted follow-up posts from Phil Haack and Glenn Block. The discussion injects new ideas into my thoughts on programming by composition. I previously argued that interfaces are too big a unit for composition because they often live in different namespaces and only a few well-known interfaces are widely supported. I was in favor of lambdas (especially those that consume or return IEnumerable<T>) as the composition unit. However, as I worked through my first end-to-end example, I found that lambdas are often too small (I will blog about this in detail later).
Duck typing may solve the interface-incompatibility problem in many situations. As described in Wikipedia, the name came from the following phrase:
“When I see a bird that walks like a duck and swims like a duck and quacks like a duck, I call that bird a duck.”
Translated into software-engineering terms, I would put it as:
The client expects certain behavior. If the server class has that behavior, then the client can call the server object even though the client does not explicitly know the type of the server object.
That leads to the following implementation in a strongly-typed language like C#, consisting of:
- An interface defined by me to express my expected behaviors of duck, IMyDuck.
- An object from others that I want to consume. Let’s call its type OtherDuck.
- A function that checks whether OtherDuck has all the members of IMyDuck.
- A function (a proxy factory) that generates a proxy that implements IMyDuck and wraps OtherDuck.
- In my software, I only bind my code to IMyDuck. The proxy factory is responsible for bridging IMyDuck and OtherDuck.
This implementation takes the expected behaviors as a whole. Except for the one-time code generation for the proxy, everything else is strongly typed at compile time. It is different from “dynamic” in C#, which implements late binding at the call site.
I have checked my implementation into my Sky LINQ project under the C# project SkyLinq.Composition. The name of the class is DuckTypingProxyFactory. The methods look like:
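The actual signatures are in the checked-in DuckTypingProxyFactory. As a hypothetical sketch of just the member-check step from the list above (the names here are mine, not the project's), a reflection-based check could look like:

```csharp
using System;
using System.Linq;

static class DuckTypingSketch
{
    // Returns true if the candidate type has a public method matching
    // every method of the duck interface (name and parameter types).
    // Property accessors appear as get_X/set_X methods, so they are covered too.
    public static bool HasAllMembers(Type duckInterface, Type candidate)
    {
        return duckInterface.GetMethods().All(m =>
            candidate.GetMethod(
                m.Name,
                m.GetParameters().Select(p => p.ParameterType).ToArray()) != null);
    }
}
```

The proxy factory itself would emit a type that implements IMyDuck and forwards each call to the wrapped OtherDuck instance.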
I have also provided an example called DuckTypingSample.cs in the SkyLinq.Sample project.
So when would duck typing be useful? Here are a few scenarios:
- When consuming libraries from companies that implement the same standard.
- When different teams in a large enterprise each implement the company’s entities in their own namespace.
- When a company refactors software (see the evolution of asp.net for example).
- When calling multiple web services.
I noticed the impromptu-interface project, which does the same thing, after I completed my project. Also, someone mentioned TypeMock and Castle in the comments of Eric’s blog. So I am not claiming to be the first person with the idea. I am still glad to have my own code generator because I am going to use it to extend the idea of strongly-typed wrappers as compared to data-transfer objects.
Also, nearly all Aspect-Oriented Programming (AOP) and many Object-Relational Mapping (ORM) frameworks have a similar proxy generator. In general, those in AOP frameworks are more complete because ORM frameworks are primarily concerned with properties. I did look at the Unity implementation when I implemented mine (thanks to the Patterns & Practices team!). My implementation is very bare-metal and thus easy to understand.
Compared to LogParser, my LINQ-based log analysis example has several benefits:
- It is built on LINQ and thus can use the entire .NET Framework, not just the built-in functions of LogParser.
- It can use Parallel LINQ and thus uses multiple cores effectively.
- Any part of a query can be refactored into a function and thus reused.
- We just need to write the queries once, and the queries can be used against one log file, a whole directory of log files, several servers, and any storage, such as Azure Blob Storage.
The queries in the example are from http://blogs.msdn.com/b/carlosag/archive/2010/03/25/analyze-your-iis-log-files-favorite-log-parser-queries.aspx.
The queries look even simpler than what you normally write against LogParser or SQL Server:
That is because parts of the queries were refactored into two reusable functions below. We cannot reuse part of a query with LogParser. With SQL Server, we can create reusable units with views. With LINQ, we can reuse any part of a query.
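As a rough illustration of that kind of refactoring (the actual functions are in the Sky LINQ samples; the W3SVCLogRecord property name used here is assumed for the sketch), a reusable grouping unit can be composed into a report:

```csharp
using System.Collections.Generic;
using System.Linq;

static class ReportSketch
{
    // Reusable unit: group log records by requested URL.
    // W3SVCLogRecord is the strongly-typed wrapper from the project;
    // the UriStem property name is an assumption for this sketch.
    public static IEnumerable<IGrouping<string, W3SVCLogRecord>> ByUriStem(
        this IEnumerable<W3SVCLogRecord> records)
    {
        return records.GroupBy(r => r.UriStem);
    }

    // A report composed from the reusable unit.
    public static IEnumerable<KeyValuePair<string, int>> TopPagesByHits(
        this IEnumerable<W3SVCLogRecord> records, int k)
    {
        return records.ByUriStem()
                      .Select(g => new KeyValuePair<string, int>(g.Key, g.Count()))
                      .OrderByDescending(p => p.Value) // or the O(K)-space Top operator
                      .Take(k);
    }
}
```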
Note that I used the memory-efficient version of GroupBy that I wrote previously. The equivalent code using LINQ’s built-in GroupBy is commented out.
To run these reports, we just need the records from the log files as IEnumerable<string>. To get them from a directory, the code would look like:
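One way to stream the lines lazily (a sketch; the project's actual helper may differ in names and filtering):

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class LogReaderSketch
{
    // Lazily yields every data line of every .log file in a directory.
    public static IEnumerable<string> ReadLogLines(string directory)
    {
        return Directory.EnumerateFiles(directory, "*.log")
                        .SelectMany(path => File.ReadLines(path))
                        .Where(line => !line.StartsWith("#")); // skip W3C header lines
    }
}
```

Because Directory.EnumerateFiles and File.ReadLines are both lazy, no file is read until the query enumerates it, and only one line is in memory at a time.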
To get records from Windows Azure Blob Storage, the code would look like:
Lastly, we need to convert the IEnumerable<string> to the strongly-typed wrapper, IEnumerable<W3SVCLogRecord>, to be consumed by the reports:
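A sketch of what such a wrapper conversion could look like (the field split and positions here are assumptions for illustration; the real W3SVCLogRecord is in the Sky LINQ project):

```csharp
using System.Collections.Generic;
using System.Linq;

// Wraps one raw log line and exposes fields on demand,
// instead of copying the data into a data-transfer object.
public class W3SVCLogRecordSketch
{
    private readonly string[] _fields;

    public W3SVCLogRecordSketch(string line)
    {
        _fields = line.Split(' ');
    }

    // Field position is an assumption; a real wrapper would map
    // positions from the #Fields header of the W3C log.
    public string UriStem { get { return _fields[4]; } }
}

static class WrapperSketch
{
    public static IEnumerable<W3SVCLogRecordSketch> ToRecords(IEnumerable<string> lines)
    {
        return lines.Select(line => new W3SVCLogRecordSketch(line));
    }
}
```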
I touched upon the idea of strongly-typed wrappers previously. The purpose of a strongly-typed wrapper is to allow type intelligence (e.g. IntelliSense) while minimizing garbage collection. The wrapper provides access to the underlying data but does not convert the data to a data-transfer object.
Even with this simple example, there are enough benefits that I probably will never need to use LogParser again. The example is fully built on LINQ to IEnumerable. I have not used any features from IQueryable or Rx yet. As usual, the latest code is at http://skylinq.codeplex.com. Stay tuned for more fun stuff.