Choose .Concat() over .Union() if possible

Update: I've reworded a sentence as it was too vague. Sorry for that.

Here's a simple performance tip that costs you next to no effort. Linq to Objects has two methods for combining two sequences, each with different characteristics: Union() and Concat(). That difference makes it possible to gain performance without doing anything difficult. Let's look at a simple example first:

Say we have two lists of integers: A: {1, 2, 3, 4} and B: {1, 2, 5, 6}. When using A.Union(B), a set union is executed, which results in {1, 2, 3, 4, 5, 6}. When A.Concat(B) is used, the sequences are simply concatenated and {1, 2, 3, 4, 1, 2, 5, 6} is the result. Pretty straightforward stuff. If you do not want duplicates in the second sequence to appear in the resulting sequence, Union() is necessary. However, if it's impossible to have duplicates in the second sequence, or you don't care whether duplicates in the second sequence appear in the resulting sequence, Concat() is a better choice.
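To make the difference concrete, here is a minimal sketch of the example above as a small console program. The class name UnionVersusConcatDemo is only an illustration; the lists are the A and B from the text.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class UnionVersusConcatDemo
{
    public static void Main()
    {
        var a = new List<int> { 1, 2, 3, 4 };
        var b = new List<int> { 1, 2, 5, 6 };

        // Set union: every distinct value appears once, in order of first appearance.
        Console.WriteLine(string.Join(", ", a.Union(b)));   // 1, 2, 3, 4, 5, 6

        // Concatenation: all of a followed by all of b, duplicates included.
        Console.WriteLine(string.Join(", ", a.Concat(b)));  // 1, 2, 3, 4, 1, 2, 5, 6
    }
}
```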

It seems obvious that Union() is more expensive than Concat(): Concat() simply returns an enumerator that walks the first sequence and then the second, whereas Union() has to keep track of every element it has already returned so it can filter out duplicates (strictly speaking, duplicates within the first sequence are dropped as well). If your sequences have a lot of elements, using Union() can make the operation noticeably slower.
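The extra work can be sketched with two hand-written iterators. This is not the framework's actual source, just an illustration of the idea under the assumption that duplicates are tracked in a HashSet; ConcatSketch and UnionSketch are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public static class CombineSketch
{
    // Hypothetical stand-in for Concat(): streams both sequences and keeps no state.
    public static IEnumerable<T> ConcatSketch<T>(IEnumerable<T> first, IEnumerable<T> second)
    {
        foreach (var item in first) yield return item;
        foreach (var item in second) yield return item;
    }

    // Hypothetical stand-in for Union(): remembers every element yielded so far,
    // which costs a hash lookup per element plus memory for the set.
    public static IEnumerable<T> UnionSketch<T>(IEnumerable<T> first, IEnumerable<T> second)
    {
        var seen = new HashSet<T>();
        foreach (var item in first)
            if (seen.Add(item)) yield return item;   // Add returns false for a duplicate
        foreach (var item in second)
            if (seen.Add(item)) yield return item;
    }

    public static void Main()
    {
        var a = new List<int> { 1, 2, 3, 4 };
        var b = new List<int> { 1, 2, 5, 6 };
        Console.WriteLine(string.Join(", ", ConcatSketch(a, b))); // 1, 2, 3, 4, 1, 2, 5, 6
        Console.WriteLine(string.Join(", ", UnionSketch(a, b)));  // 1, 2, 3, 4, 5, 6
    }
}
```

The per-element set lookup and the memory held by the set are what Concat() avoids.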

In the past 8 months I've written a lot of Linq to Objects queries and today I saw:

/// <summary>
/// Gets the entity mapping targets in this meta-data store
/// </summary>
/// <returns>all tables/views, ordered by catalogname/schemaname/tablename unioned with 
/// all views ordered by catalogname/schemaname/viewname</returns>
internal IEnumerable<IEntityMapTargetElement> GetEntityMappingTargets()
{
    return from c in this.PopulatedCatalogs
           from s in c.Schemas
           from e in s.Tables.Cast<IEntityMapTargetElement>()
                     .Union(s.Views.Cast<IEntityMapTargetElement>())
           orderby c.CatalogName ascending, s.SchemaOwner ascending, e.Name ascending
           select e;
}

It turned out I had used Union() in many places in the code where two sequences had to be merged into one, even though it was impossible for the second sequence in these queries to contain duplicates. Must be an old strain of SQL-itis, I think: "Oh, I have two sets to combine into one set: UNION". However, in the query above, duplicates in the second sequence are not possible: there are no views in the set of Tables and vice versa. So this same query could be written with Concat(), which saves performance because the second sequence doesn't have to be checked for duplicates.
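A self-contained sketch of the Concat() variant could look like the following. The types here are simplified, hypothetical stand-ins for the real meta-data classes (no catalogs, only a schema with tables and views), so it only illustrates the shape of the rewritten query.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified stand-ins for the types used in the query above.
public interface IEntityMapTargetElement { string Name { get; } }
public class Table : IEntityMapTargetElement { public string Name { get; set; } }
public class View : IEntityMapTargetElement { public string Name { get; set; } }
public class Schema
{
    public string SchemaOwner { get; set; }
    public List<Table> Tables { get; set; }
    public List<View> Views { get; set; }
}

public static class MappingTargets
{
    // Same shape as the original query, with Concat() instead of Union():
    // a table can never be equal to a view, so no de-duplication is needed.
    public static IEnumerable<IEntityMapTargetElement> GetEntityMappingTargets(IEnumerable<Schema> schemas)
    {
        return from s in schemas
               from e in s.Tables.Cast<IEntityMapTargetElement>()
                         .Concat(s.Views.Cast<IEntityMapTargetElement>())
               orderby s.SchemaOwner ascending, e.Name ascending
               select e;
    }

    public static void Main()
    {
        var schemas = new[]
        {
            new Schema
            {
                SchemaOwner = "dbo",
                Tables = new List<Table> { new Table { Name = "Orders" }, new Table { Name = "Customers" } },
                Views = new List<View> { new View { Name = "Invoices" } }
            }
        };

        foreach (var e in GetEntityMappingTargets(schemas))
            Console.WriteLine(e.Name);
    }
}
```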

If you too are in the habit of using .Union() to combine sequences, pay attention to that second sequence: if it can't contain duplicates (and you're sure it won't in the future either!), it's better to use Concat() instead of Union().

Published Wednesday, March 4, 2009 11:25 AM by FransBouma

Comments

# re: Choose .Concat() over .Union() if possible

Wednesday, March 4, 2009 10:08 AM by Yaakov Ellis

Is there any big difference between data.Union() and data.Concat().Distinct() in terms of performance?

# re: Choose .Concat() over .Union() if possible

Wednesday, March 4, 2009 2:17 PM by Robert Kozak

"Pretty straight forward stuff. If you care about the duplicates, Union() is necessary. However, in the case where you can't have duplicates in the second sequence or you don't care, Concat() is a better choice."

You got Union() and Concat() reversed in that statement.

# re: Choose .Concat() over .Union() if possible

Wednesday, March 4, 2009 3:05 PM by FransBouma

@Robert: my intent is correct, however I see my Dutch-English is not very clear, I'll rewrite the sentence. Thanks! :)

@Yaakov: haven't checked it yet. I think it should be the same, but it depends on how Union and Distinct are implemented internally (e.g. if Union is a Concat().Distinct()).

# re: Choose .Concat() over .Union() if possible

Wednesday, March 4, 2009 3:24 PM by Petar Repac

Isn't this equivalent to "UNION" vs. "UNION ALL" in SQL?

If we know in advance that the result sets cannot contain identical rows, we should always use "UNION ALL" because it is faster, i.e. it doesn't perform de-duplication.

# re: Choose .Concat() over .Union() if possible

Wednesday, March 4, 2009 4:22 PM by FransBouma

@Petar: exactly. One could see UNION ALL as Concat.

# re: Choose .Concat() over .Union() if possible

Wednesday, March 4, 2009 6:46 PM by Robert Kozak

Frans,

That is much clearer and now I understand what you were saying. I'm usually not a grammar nazi but in this case I wanted to make sure I understood your point.

# re: Choose .Concat() over .Union() if possible

Thursday, March 5, 2009 2:07 PM by Guy Ellis

I would imagine that list.Union() internally calls list.Concat().Distinct()

# re: Choose .Concat() over .Union() if possible

Friday, March 13, 2009 5:06 AM by khuzema

When will we start hearing about Three, I mean v3.0.

regards

# re: Choose .Concat() over .Union() if possible

Friday, March 13, 2009 5:21 AM by FransBouma

In due time, likely in a couple of months.