Understanding LINQ to Objects (7) Query Methods Internals

[LINQ via C# series

This post explains how are the LINQ to Objects standard query methods implemented. Once again, it will be exhibited that iterator pattern is the core pattern of LINQ to Objects query.

The first thing need to emphasize is, not all query methods are deferred. The deferred methods are underlined in the following list:

  • Restriction: Where, OfType
  • Projection: Select, SelectMany
  • Ordering: OrderBy, ThenBy, OrderByDescending, ThenByDescending, Reverse
  • Join: Join, GroupJoin
  • Grouping: GroupBy
  • Set: Zip, Distinct, Union, Intersect, Except
  • Aggregation: Aggregate, [Count], LongCount, Sum, Min, Max, Average
  • Partitioning: Take, Skip, TakeWhile, SkipWhile
  • Cancatening: Concat
  • Conversion: ToSequence, ToArray, ToList, ToDictionary, ToLookup, [Cast]
  • Equality: SequenceEqual
  • Elements: [First], [FirstOrDefault], [Last], [LastOrDefault], [Single], [SingleOrDefault], [ElementAt], [ElementAtOrDefault], DefaultIfEmpty
  • Generation: Range, Repeat, Empty
  • Qualifiers: Any, All, [Contains]

Eager-executing methods

The first thing need to emphasize is, not all methods are deferred. Many methods are logically required to be eager. For example, Aggregate() immediately folds a collection to a single value:

public static TAccumulate Aggregate<TSource, TAccumulate>(
    this IEnumerable<TSource> source, TAccumulate seed, 
    Func<TAccumulate, TSource, TAccumulate> func)
{
    TAccumulate result = seed;
    foreach (TSource item in source)
    {
        result = func(result, item);
    }

    return result;
}

The ToList() is also eager because it transforms a collection into a hot IEnumerable<T>:

public static List<TSource> ToList<TSource>(this IEnumerable<TSource> source)
{
    // Arguments checking is ommitted. 

    return new List<TSource>(source);
}

Inside the constructor of List<T>, the source is iterated, and all items are calculated and added to the List<TSource> collection before it is returned.

So is the ToLookup(), a little complex but clear:

public static ILookup<TKey, TElement> ToLookup<TSource, TKey, TElement>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector, Func<TSource, TElement> elementSelector)
{
    Lookup<TKey, TElement> results = new Lookup<TKey, TElement>();
    foreach (TSource item in source)
    {
        results.Add(keySelector(item), elementSelector(item));
    }

    return results;
}

Here a Lookup<TKey, TElement> with a Add() method need to be created:

internal class Lookup<TKey, TElement> : ILookup<TKey, TElement>
{
    private Dictionary<TKey, Grouping<TKey, TElement>> _data = new Dictionary<TKey, Grouping<TKey, TElement>>();

    internal void Add(TKey key, TElement element)
    {
        if (this._data.ContainsKey(key))
        {
            this._data[key].Add(element);
        }
        else
        {
            this._data.Add(key, new Grouping<TKey, TElement>(key).Add(element));
        }
    }

    #region ILookup<TKey,TElement> Members

    public bool Contains(TKey key)
    {
        return this._data.ContainsKey(key);
    }

    public int Count
    {
        get
        {
            return this._data.Count;
        }
    }

    public IEnumerable<TElement> this[TKey key]
    {
        get
        {
            return this._data[key];
        }
    }

    #endregion

    #region IEnumerable<IGrouping<TKey,TElement>> Members

    public IEnumerator<IGrouping<TKey, TElement>> GetEnumerator()
    {
        return this._data.Values.GetEnumerator();
    }

    #endregion

    #region IEnumerable Members

    IEnumerator IEnumerable.GetEnumerator()
    {
        return this.GetEnumerator();
    }

    #endregion
}

Inside Lookup<TKey, TElement>, the actual data is stored with a Dictionary<TKey, Grouping<TKey, TElement>> object. The dictionary’s value is a Grouping<TKey, TElement>, because when Lookup<TKey, TElement> is iterated, IGrouping<TKey, TElement> items are returned.

So the Grouping<TKey, TElement> type with a Add() method also needs to be created:

internal class Grouping<TKey, TElement> : IGrouping<TKey, TElement>
{
    private TElement[] _elements = new TElement[0];

    internal Grouping(TKey key)
    {
        this.Key = key;
    }

    internal Grouping<TKey, TElement> Add(TElement element)
    {
        int index = this._elements.Length;
        Array.Resize(ref this._elements, index + 1);
        this._elements[index] = element;
        return this;
    }

    #region IGrouping<TKey,TElement> Members

    public TKey Key
    {
        get;
        private set;
    }

    #endregion

    #region IEnumerable<TElement> Members

    public IEnumerator<TElement> GetEnumerator()
    {
        for (int i = 0; i < this._elements.Length; i++)
        {
            yield return this._elements[i];
        }
    }

    #endregion

    #region IEnumerable Members

    IEnumerator IEnumerable.GetEnumerator()
    {
        return this.GetEnumerator();
    }

    #endregion
}

Please notice the above ToLookup() method, Lookup type and Grouping type are simplified to make the code clear. The actual implementations are much more complex.

This ToLookup() is very useful. It leverages the implementation of other methods, like Join(), GroupBy(), etc.

Deferred-executing query methods

The previous post has figured out the Select() method:

public static IEnumerable<TResult> Select<TSource, TResult>(
    this IEnumerable<TSource> source, Func<TSource, TResult> selector)
{
    foreach (TSource item in source)
    {
        yield return selector(item);
    }
}

So the other projection method SelectMany() can be implemented immediately:

public static IEnumerable<TResult> SelectMany<TSource, TResult>(
    this IEnumerable<TSource> source, Func<TSource, IEnumerable<TResult>> selector)
{
    foreach (TSource item in source)
    {
        foreach (TResult result in selector(item))
        {
            yield return result;
        }
    }
}

And so is the filter method Where():

public static IEnumerable<TSource> Where<TSource>(
    this IEnumerable<TSource> source, Func<TSource, bool> predicate)
{
    foreach (TSource item in source)
    {
        if (predicate(item))
        {
            yield return item;
        }
    }
}

And so is the Zip() in NET. 4.0:

public static IEnumerable<TResult> Zip<TFirst, TSecond, TResult>(
    this IEnumerable<TFirst> first, IEnumerable<TSecond> second, 
    Func<TFirst, TSecond, TResult> resultSelector)
{
    using (IEnumerator<TFirst> firstIterator = first.GetEnumerator())
    using (IEnumerator<TSecond> secondIterator = second.GetEnumerator())
    {
        while (firstIterator.MoveNext() && secondIterator.MoveNext())
        {
            yield return resultSelector(firstIterator.Current, secondIterator.Current);
        }
    }
}

Join() and GroupBy() can be implemented with the help of ToLookup():

public static IEnumerable<TResult> Join<TOuter, TInner, TKey, TResult>(
    this IEnumerable<TOuter> outer, IEnumerable<TInner> inner,
    Func<TOuter, TKey> outerKeySelector, Func<TInner, TKey> innerKeySelector,
    Func<TOuter, TInner, TResult> resultSelector)
{
    ILookup<TKey, TInner> innerLookup = inner.ToLookup(innerKeySelector, element => element);
    foreach (TOuter outerItem in outer)
    {
        TKey key = outerKeySelector(outerItem);
        if (innerLookup.Contains(key))
        {
            foreach (TInner innerItem in innerLookup[key])
            {
                yield return resultSelector(outerItem, innerItem);
            }
        }
    }
}

public static IEnumerable<TResult> GroupJoin<TOuter, TInner, TKey, TResult>(
    this IEnumerable<TOuter> outer, IEnumerable<TInner> inner,
    Func<TOuter, TKey> outerKeySelector, Func<TInner, TKey> innerKeySelector,
    Func<TOuter, IEnumerable<TInner>, TResult> resultSelector)
{
    ILookup<TKey, TInner> innerLookup = inner.ToLookup(innerKeySelector, element => element);

    foreach (TOuter outerItem in outer)
    {
        TKey key = outerKeySelector(outerItem);
        if (innerLookup.Contains(key))
        {
            yield return resultSelector(outerItem, innerLookup[key]);
        }
        else
        {
            yield return resultSelector(outerItem, new List<TInner>());
        }
    }
}

public static IEnumerable<TResult> GroupBy<TSource, TKey, TElement, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TSource, TElement> elementSelector,
    Func<TKey, IEnumerable<TElement>, TResult> resultSelector)
{
    ILookup<TKey, TElement> lookup = source.ToLookup(keySelector, elementSelector);
    foreach (IGrouping<TKey, TElement> group in lookup)
    {
        yield return resultSelector(group.Key, group);
    }
}

Do not be confused by ToLookup(). Although ToLookup() is eager, the above 3 methods are still deferred. The inner ToLookup() will be called during the first iteration on the returned IEnumerable<T>.

Maybe the Cast() is the easiest:

public static IEnumerable<TResult> Cast<TResult>(this IEnumerable source)
{
    foreach (object item in source)
    {
        yield return (TResult)item;
    }
}

Please notice that these non-extension methods are also deferred:

public static IEnumerable<int> Range(int start, int count)
{
    for (int i = start; i < start + count; i++)
    {
        yield return i;
    }
}

public static IEnumerable<TResult> Repeat<TResult>(TResult element, int count)
{
    for (int i = 0; i < count; i++)
    {
        yield return element;
    }
}

because they return cold IEnumerable<T>.

Optimization of query methods

Actually, many query methods are optimized for performance.

For example, when some methods need to iterate the collection, they check whether the collection is an array first:

public static IEnumerable<TResult> Select<TSource, TResult>(
    this IEnumerable<TSource> source, Func<TSource, TResult> selector)
{
    TSource[] array = source as TSource[];
    if (array != null)
    {
        for (int i = 0; i < array.Length; i++)
        {
            yield return selector(array[i]);
        }
    }
    else
    {
        foreach (TSource item in source)
        {
            yield return selector(item);
        }
    }
}

The Count() method checks whether the collection is an ICollection<T> or ICollection first, because these interfaces have Count property which can be directly used:

public static int Count<TSource>(this IEnumerable<TSource> source)
{
    ICollection<TSource> genericCollection = source as ICollection<TSource>;
    if (genericCollection != null)
    {
        // Time complexity becomes O(1).
        return genericCollection.Count;
    }

    ICollection collection = source as ICollection;
    if (collection != null)
    {
        // Time complexity becomes O(1).
        return collection.Count;
    }

    int count = 0;
    using (IEnumerator<TSource> iterator = source.GetEnumerator())
    {
        // Time complexity is O(N).
        while (iterator.MoveNext())
        {
            // This looks like a foreach loop, 
            // but iterator.Current is never invoked to improve performance.
            count++;
        }
    }

    return count;
}

As long as source implements ICollection<T> or ICollection (source is an array, or List<T>, etc.), this query’s time complexity is optimized from O(N) to O(1).

In the list at the beginning of this post, methods in [] are specially optimized. Below is the detail from the C# FAQ blog of Microsoft

Method Optimization
Cast If the data source already implements IEnumerable<T> for the given T, when the sequence of data is returned without a cast.
Contains If the data source implements the ICollection or ICollection<T> interface, the corresponding method of the interface is used.
Count
ElementAt If the data source implements the IList or IList<T> interface, the interface’s Count method and indexing operations are used.
ElementAtOrDefualt
First
FirstOrDefualt
Last
LastOrDefault
Single
SingleOrDefault
Published Tuesday, March 16, 2010 3:24 AM by Dixin

Comments

# re: Understanding LINQ to Objects (7) Query Methods Internals

Friday, June 18, 2010 4:12 PM by Lars Holm Jensen

You have a minor error in the Aggregate source; func should be run on result, not seed..

Great articles, very informative.

# re: Understanding LINQ to Objects (7) Query Methods Internals

Saturday, June 19, 2010 12:52 AM by Dixin

I just fixed that bug. Thank you very much!

# Twitter Trackbacks for Understanding LINQ to Objects (7) Query Methods Internals - Dixin's Blog [asp.net] on Topsy.com

Pingback from  Twitter Trackbacks for                 Understanding LINQ to Objects (7) Query Methods Internals - Dixin's Blog         [asp.net]        on Topsy.com

Leave a Comment

(required) 
(required) 
(optional)
(required)