Be aware of the memory implication of GroupBy in LINQ

LINQ functions generally have very small memory footprint. LINQ functions usually use lazy evaluation whenever possible. An item is not evaluated until MoveNext() and Current of the Enumerator are called. It only needs memory to hold one element at a time.

GroupBy is an exception. An IEnumerable<T> are often grouped into one IEnumerable<T> for each group before it is further processed. LINQ often has no choice but to accumulate each group using a List<T> like structure, resulting in the entire IEnumerable<T> loaded into memory at once.

If we are only interested in the aggregated results of each group, we can optimize GroupBy with the following implementation:

 image

We use a dictionary to hold the accumulated value for each group. Each element from IEnumerable<T> is immediately consumed by the accumulated as it is read in. This way, we only need as much memory to hold the number of groups, far less than the entire IEnumerable<T> if each group has a large number of elements.

If the IEnumerable<T> is ordered by the key order, we can further optimize the code. We would need memory to hold only one group at a time.

As usual, all LINQ samples are available in the codeplex SkyLinq project. 

No Comments