Erik Porter's Blog

Life and Development at Microsoft and Other Technology Discussions

News

    Select DISTINCT on DataTable

    In a project I'm doing I needed to basically do a DISTINCT on a couple fields in a DataTable that I had used earlier in code.  Did some Google searching and came up with this MS KB on the subject.  I was hoping to find it in the Framework, but hey, writing a little extra code never hurt anyway.  Looking over the code though I found that I just didn't really like it.  Biggest reason being it only took one field.  I wanted to be able to pass in n number of fields, so I rewrote it and did a little tweaking to it.  Here's what I came up with in both VB and C#

    VB

    Public Shared Function SelectDistinct(ByVal SourceTable As DataTable, ByVal ParamArray FieldNames() As String) As DataTable
        
    Dim lastValues() As Object
        
    Dim newTable As DataTable

        
    If FieldNames Is Nothing OrElse FieldNames.Length = 0 Then
             
    Throw New ArgumentNullException("FieldNames")
        
    End If

        
    lastValues = New Object(FieldNames.Length - 1) {}
        
    newTable =
    New DataTable

        
    For Each field As String In FieldNames
             
    newTable.Columns.Add(field, SourceTable.Columns(field).DataType)
        
    Next

        
    For Each Row As DataRow In SourceTable.Select("", String.Join(", ", FieldNames))
             
    If Not fieldValuesAreEqual(lastValues, Row, FieldNames) Then
                  
    newTable.Rows.Add(createRowClone(Row, newTable.NewRow(), FieldNames))

                  
    setLastValues(lastValues, Row, FieldNames)
             
    End If
        
    Next

        
    Return newTable
    End Function

    Private Shared Function fieldValuesAreEqual(ByVal lastValues() As Object, ByVal currentRow As DataRow, ByVal fieldNames() As String) As Boolean
        
    Dim areEqual As Boolean = True

        
    For i As Integer = 0 To fieldNames.Length - 1
             
    If lastValues(i) Is Nothing OrElse Not lastValues(i).Equals(currentRow(fieldNames(i))) Then
                  
    areEqual = False
                  
    Exit For
             
    End If
        
    Next

        
    Return areEqual
    End Function

    Private Shared Function createRowClone(ByVal sourceRow As DataRow, ByVal newRow As DataRow, ByVal fieldNames() As String) As DataRow
        
    For Each field As String In fieldNames
             
    newRow(field) = sourceRow(field)
        
    Next

        
    Return newRow
    End Function

    Private Shared Sub setLastValues(ByVal lastValues() As Object, ByVal sourceRow As DataRow, ByVal fieldNames() As String)
        
    For i As Integer = 0 To fieldNames.Length - 1
             
    lastValues(i) = sourceRow(fieldNames(i))
        
    Next
    End Sub

    C#

    private static DataTable SelectDistinct(DataTable SourceTable, params string[] FieldNames)
    {
        
    object[] lastValues;
        
    DataTable newTable;
        
    DataRow[] orderedRows;

        
    if (FieldNames == null || FieldNames.Length == 0)
             
    throw new ArgumentNullException("FieldNames");

        
    lastValues =
    new object[FieldNames.Length];
        
    newTable =
    new DataTable();

        
    foreach (string fieldName in FieldNames)
             
    newTable.Columns.Add(fieldName, SourceTable.Columns[fieldName].DataType);

        
    orderedRows = SourceTable.Select("",
    string.Join(", ", FieldNames));

        
    foreach (DataRow row in orderedRows)
        
    {
             
    if (!fieldValuesAreEqual(lastValues, row, FieldNames))
             
    {
                  
    newTable.Rows.Add(createRowClone(row, newTable.NewRow(), FieldNames));

                  
    setLastValues(lastValues, row, FieldNames);
             
    }
        
    }

        
    return newTable;
    }

    private static bool fieldValuesAreEqual(object[] lastValues, DataRow currentRow, string[] fieldNames)
    {
        
    bool areEqual = true;

        
    for (int i = 0; i < fieldNames.Length; i++)
        
    {
             
    if (lastValues[i] == null || !lastValues[i].Equals(currentRow[fieldNames[i]]))
             
    {
                  
    areEqual =
    false;
                  
    break;
             
    }
        
    }

        
    return areEqual;
    }

    private static DataRow createRowClone(DataRow sourceRow, DataRow newRow, string[] fieldNames)
    {
        
    foreach (string field in fieldNames)
             
    newRow[field] = sourceRow[field];

        
    return newRow;
    }

    private static void setLastValues(object[] lastValues, DataRow sourceRow, string[] fieldNames)
    {
        
    for (int i = 0; i < fieldNames.Length; i++)
             
    lastValues[i] = sourceRow[fieldNames[i]];
    }

    I thought I'd also point out a downfall from the KB that I corrected in my example.  The beginning of the For Each calls SourceTable.Select("", FieldName)) in it.  This can hinder performance a bit in certain situations, because in C# that statement will be evaulated for every loop iteration.  You can see in the C# code that I have I actually call the Select Method and store the results in a variable and use it in the loop instead.  One other interesting thing to note about the different between the VB and the C# code is that I DID NOT put the results of the Select Method in a variable in the VB code even though I did in the C# code.  This is because VB works differently in that the actual For Each loop only evaluates the first iteration of the loop, which is nice and saves and extra line or two of code.  Another thing that's different between the two (sorry, I think the differences between the two are interesting) is that to instantiate a new Object Array in the lastValues variable in VB you have to put {} after it, and in C# you don't have to.  This is because since VB uses the same characters for Methods vs. Indexers, the compiler wouldn't be able to tell the difference between calling a constructor with one parameter vs. creating a new array, so it's necessary.  In C# it knows since you're using the indexer characters, you mean a new array, not a new object in this case.  That said, if you like using the VB Functions, you could also just call ReDim lastValues(FieldNames.Length), but I tend to stay away from those since I do a lot of code in both VB and C#.

    UPDATE: Adam pointed out that I made the wrong assumption in how the foreach is evaluated in C#.  It is not evaluated every time.  Thanks, Adam!

    UPDATE 2: Jackie Goldstein sent me an e-mail pointing out that .NET 2.0's DataView has a ToTable Method that will build a DataTable based off of the current DataView and also allow you to specify which columns you want in the DataTable as all as being able to say Distinct rows only.  I had totally forgotten about this feature and didn't even know it did distinct rows.  Thanks, Jackie!  :)

    Comments

    Adam said:

    "This can hinder performance a bit in certain situations, because in C# that statement will be evaulated for every loop iteration."

    This most definitely isn't true. Write some code and see for yourself.
    # February 10, 2005 5:17 PM

    Erik Porter said:

    Crap, you're right, Adam! I was thinking about regular for loops and how whatever you have there will be evaluated each time (and again, won't be in VB) and made the silly assumption that the foreach would work the same way. My bad, thanks for the correction!
    # February 10, 2005 6:00 PM

    Cameron said:

    This is great solution, cleaner than the MS KB one. Question. How would one modifiy further to add a "COUNT" column?

    One common use for DISTINCT is to rollup/group results. So it would be nice if a column could be added to include COUNT(*) results of each distinct result. I tried modifying your modified SelectDistinct method but can't figure it out how to add this.

    cheers,

    Cameron
    # February 24, 2005 3:12 PM

    Erik Porter said:

    Thanks, Cameron, glad you like it! :)

    As for the COUNT(*) I haven't had a need for it yet, but I can see it as being handy. Performance-wise I'd say stick with doing it in the DB instead of doing it outside of the DB, but unlike a lot of people that linger around here I realize that performance isn't always the biggest issue.

    I'm sure it's definitely possible, but offhand I'm not sure of an easy way. It seems complicated to me. Maybe there's something you could do with it by getting child rows through a relationship or something. *shrug*
    # February 24, 2005 10:41 PM

    Pranoy said:

    This is a great solution , MS KB had only one row and is not always useful & your solution is complete , a perfect one.
    Thanks a lot for sharing it
    # March 1, 2005 3:24 PM

    Erik Porter said:

    My pleasure, glad you found it useful! :)
    # March 1, 2005 3:40 PM

    Vikram S said:

    ToTable method works great!!! Thanks!
    # November 3, 2006 8:26 PM

    Phil said:

    GOD BLESS YOU THANK YOU!!

    # November 16, 2006 5:24 PM

    Sverker, Sweden said:

    Thanks Erik! In a specific case with a datatable containing 42000 rows and 56 columns the command dataview.ToTable() took 2.5 minutes and by using your code it took 2 seconds!! Quite an improvment! The distinct was only on two columns.

    # November 27, 2006 3:58 AM

    Avinash B said:

    Thanks Erik!

    Your code worked flawlessly. The soultion is clean and flexible. Thanks a lot.

    # December 19, 2006 11:31 AM

    /dev/null said:

    Thanks a lot, saved me so much time.  I was using another routine i found on the net and it took 20 minutes.  Now it takes 1 second.  And is easier to read as well.  The need for a select distinct on a datatable is that not all datatables are fed from a database.  Sometimes it's from reading a CSV file or from another data source like an LDAP directory.

    # February 1, 2007 10:53 AM

    kramlen said:

    here is something I wrote that does the same thing - yet

    just seems easier to understand:

    public static DataTable SelectDistinct(string[] pColumnNames, DataTable pOriginalTable)

           {

               DataTable distinctTable = new DataTable();

               int numColumns = pColumnNames.Length;

               for (int i = 0; i < numColumns; i++)

               {

                   distinctTable.Columns.Add(pColumnNames[i], pOriginalTable.Columns[pColumnNames[i]].DataType);

               }

               Hashtable trackData = new Hashtable();

               foreach (DataRow currentOriginalRow in pOriginalTable.Rows)

               {

                   StringBuilder hashData = new StringBuilder();

                   DataRow newRow = distinctTable.NewRow();

                   for (int i = 0; i < numColumns; i++)

                   {

                       hashData.Append(currentOriginalRow[pColumnNames[i]].ToString());

                       newRow[pColumnNames[i]] = currentOriginalRow[pColumnNames[i]];

                   }

                   if (!trackData.ContainsKey(hashData.ToString()))

                   {

                       trackData.Add(hashData.ToString(), null);

                       distinctTable.Rows.Add(newRow);

                   }

               }

               return distinctTable;

           }

    # February 28, 2007 3:47 AM

    HumanCompiler said:

    kramlen, thanks for posting some code!  You realize that code that I posted is over 2 years old now, right?  I don't claim any intelligence in anything I wrote more than 6 months ago.  ;)  Still seems to be helping out some people though.  Thanks for posting your code and thoughts on it.

    # February 28, 2007 5:46 AM

    Matt said:

    Helped me a lot!  Thanks!!!!

    # April 3, 2007 3:13 PM

    Patrick said:

    Taking these algorithms one step further.  The issue of tracking what values have already been added to the returned DataTable can be eliminated by temporarily setting CaseSensitive to true on SourceTable.

    bool origCaseSensitive = SourceTable.CaseSensitive;

    SourceTable.CaseSensitive = true;

    orderedRows = SourceTable.Select("", string.Join(", ", FieldNames));

    SourceTable.CaseSensitive = true;

    This tells the sort to perform a case sensitive comparison of strings and ensures that all identical values will be correctly grouped contiguously (which the above algorithms and the Microsoft's KB article do not).  Once you have this grouping the for loop in fieldValuesAreEqual and the Hashtable trackData in kramlen's post are no longer needed.

    # April 10, 2007 3:49 PM

    Jawad said:

    Jackie Goldstein u rule

    # April 18, 2007 4:25 PM

    samuel Charles said:

    Tonnes of thanks....

    # May 7, 2007 2:07 AM

    Sanjeev said:

    Thanks a Lot! helped me at a very critical time!! :)
    # May 19, 2007 9:36 AM

    Raymond said:

    Compared to the MS KB this was a joy to behold.

    Proving once again that concise code is a developer's (and maintainer's) best friend.

    Much obliged.

    - RK

    # August 1, 2007 5:58 PM

    HumanCompiler said:

    You're welcome, Raymond.  The fun part is that I posted this over 2 years ago.  ;)

    # August 2, 2007 1:21 PM

    Scott said:

    The comment by Jackie Goldstein about using the DataView class along with it's "ToTable()" method does the trick for performing DISTINCT on rows of a DataTable.

    Thanks!

    # August 3, 2007 4:27 PM

    Tomas said:

    Thanks, great article!

    The sad thing is that SelectedColumns doesnt seem to work on a DataGridView when doing .net 2 winapps. I tried use that so that the user could choose columns to view as distinct.

    # August 24, 2007 6:12 AM

    Don Quixote said:

    The hashtable method works great!

    # September 11, 2007 12:20 PM

    Aftab Ahmed said:

    VB Code:

    ------------------------------------------------------

       Public Function SelectDistinct(ByVal SourceTable As DataTable, ByVal ParamArray Columns() As String) As DataTable

           Dim Result As DataTable = New DataTable()

           If SourceTable IsNot Nothing Then

               Dim DView As DataView = SourceTable.DefaultView

               Try

                   Result = DView.ToTable(True, Columns)

               Catch ex As Exception

               End Try

           End If

           Return Result

       End Function

    ------------------------------------------------------

    C# Code:

    ------------------------------------------------------

       public DataTable SelectDistinct(DataTable SourceTable, params string[] Columns) {

           DataTable Result = new DataTable();

           if (SourceTable != null) {

               DataView DView = SourceTable.DefaultView;

               try {

                   Result = DView.ToTable(true, Columns);

               }

               catch (Exception ex) {

               }

           }

           return Result;

       }

    ------------------------------------------------------

    If you want to through the exception (when column name is the distinct columns list does not exist in the table or any other exception), just remove the Try Catch block.

    Hope this will help you.

    # November 12, 2007 1:23 AM

    Edward M said:

    To loop through distinct rows you can try:

    foreach (DataRow drDetail in dsInvoice.Tables["tblRMA_Detail"].DefaultView.ToTable("tblDetail", true, new string[] { "Style", "Color", "Ident" }).Rows)

               {}

    # January 30, 2008 4:10 PM

    Sergio said:

    Thanks a lot. Great code Aftab Ahmed!

    # February 5, 2008 2:43 PM

    chinnu said:

    This is a very helpful post for people not on framework 2.0 and up. Just thought I would pitch in.

    There was one post earlier on using stringbuilder but the one thing I did not quite take was the using hashtable part and also the fact that the code needs to handle nullable values as well. Coding with hashtables can really take a hit when you are dealing with real large datasets.

    I think the following logic could be a lil' faster. Hope this works :)

    dtSource ' source table

    dtOut 'distinct table

    columnNames ' array of the column names

    dtOut= dtSource.Clone

    dim x as string= string.empty

    dim y as string=string.empty

    for each dr as datarow in dtSource.Select("",String.Join(columnNames))'sort the table

    for count as integer=0 to columnNames.count-1

       if not isnull(dr(count(ColumnNames))

              x+=cstr(dr(count(ColumnNames))

       end if

       x +=","    'seperator

    next

    (if the columnNames length is the same as dtSource.Columns.Count you can use String.Join to generate the content of x.)

     if not y.equals(x) then

      dtOut.ImportRow(dr)

      y=x

     end if

    next dr

    I did not use vs editor to code the above. So the syntax may be off.

    Good luck-

    Vin

    # February 15, 2008 12:14 AM

    Ronald said:

    It is silly that a normal sql statement cannot be used.

    I need a count and a average. Has any body done this?

    # March 27, 2008 6:52 AM

    Poh said:

    Thanks for the wonderful code, Aftab.

    # April 17, 2008 9:14 PM

    Ben said:

    I see a lot of people saying that the DataView's ToTable method provides an "easier" way to do this.

    They're right, it is easy.  But if you read the comments carefully you'll see quite a few people stating how SLOW this can be.  I have to agree...

    I was able to get a result, about five times quicker than the DataView.ToTable, by simply iterating over every row, checking if the field I'm interested isn't already in my generics list, and if not adding it to the list.  

    I would have though the way I'm doing it would've been extremely slow... but apparently the list.Contains() method is actually quite quick.

    # May 10, 2008 11:50 PM

    APPDEV said:

    the items are reordered. anyone notice this ?

    # June 12, 2008 6:27 AM

    Rafael said:

    You can use

    Dim dt As New DataTable

    dvGrupo = New DataView(dtDePara)

    dt = dvGrupo.ToTable(True, "[field1]", "[field2]")

    # July 30, 2008 2:51 PM

    helloise said:

    private static DataTable SelectDistinct(DataTable SourceTable, params string[] FieldNames)

    i will obviously have to get the field names into params string[] and then call this method.

    how will i do these two things??

    thanks

    helloise

    # September 16, 2008 3:59 AM

    StopGo said:

    I have been wrestling with a problem that your code solved.  The built-in .NET DataTable.DefaultView.ToTable() method pegs the CPU at 100% and hangs when the dataset gets large.  I swapped out that line with a call to SelectDistinct and it ran in a matter of seconds.  From my limited experience it appears that your code is vastly superior to the built-in method and I would recommend people use your code and save themselves some major debugging headaches.  Thanks for sharing...

    # September 16, 2008 7:29 PM

    Gabriel said:

    really thanks guys.. the function rowk really cool

    Saludos desde chile

    # September 29, 2008 4:24 PM

    moschino5 said:

    Beware of the hash method. If I understand well the code a situation like this will lead to error:

    Col1   Col2

    ------------

    "AA,"  "BB"

    "AA"   ",BB"

    Both this records will produce the same hash and be considered as "NOT DISTINCT".

    Ciao.

    # October 23, 2008 4:40 AM

    Kevin Deenanauth said:

    Although this was posted years ago - you are the MAN

    # October 24, 2008 4:04 PM

    Jeff Adamez said:

    SUPER SUPER JOB!!!  My thanks to Aftab Ahmed!

    # October 29, 2008 3:09 PM

    RandomProgrammer said:

    If I understand this correctly, since this algorithm is sorting ( n log n, I hope ) before removing rows, it ends up being a n log n algorithm.

    The MS .ToTable() with distinct option selected runs like it's not doing that and is just comparing every row to every other one... n^n.

    I'm working with 200,000+ rows on tables, and using n log n insteand of n^n is a difference of 10 seconds to around 30 minutes or more.

    Microsoft needs to implement this algorithm for their ToTable function in their next framework, IMO. I don't know if you can pass that along.

    # November 20, 2008 3:00 PM

    Gabriel said:

    public DataTable SelectDistinct(DataTable sourceTable, params string[] fields)

    {

    DataView dv = sourceTable.DefaultView;

    DataTable dt = dv.ToTable(true, fields);

    return dt;

    }

    # November 26, 2008 1:25 PM

    Mak said:

    dv.ToTable is perfect one...

    Thanks a lot Gabriel...

    # December 19, 2008 1:58 PM

    rajchanchal said:

    dv.ToTable(true, fields);  have performance problem with something like 9000 rows (I tested with 9000) it takes some seconds which is drawback here

    # January 15, 2009 3:58 AM