Erik Porter's Blog

Life and Development at Microsoft and Other Technology Discussions

News

    Select DISTINCT on DataTable

    In a project I'm doing I needed to basically do a DISTINCT on a couple fields in a DataTable that I had used earlier in code.  Did some Google searching and came up with this MS KB on the subject.  I was hoping to find it in the Framework, but hey, writing a little extra code never hurt anyway.  Looking over the code though I found that I just didn't really like it.  Biggest reason being it only took one field.  I wanted to be able to pass in n number of fields, so I rewrote it and did a little tweaking to it.  Here's what I came up with in both VB and C#

    VB

    Public Shared Function SelectDistinct(ByVal SourceTable As DataTable, ByVal ParamArray FieldNames() As String) As DataTable
        
    Dim lastValues() As Object
        
    Dim newTable As DataTable

        
    If FieldNames Is Nothing OrElse FieldNames.Length = 0 Then
             
    Throw New ArgumentNullException("FieldNames")
        
    End If

        
    lastValues = New Object(FieldNames.Length - 1) {}
        
    newTable =
    New DataTable

        
    For Each field As String In FieldNames
             
    newTable.Columns.Add(field, SourceTable.Columns(field).DataType)
        
    Next

        
    For Each Row As DataRow In SourceTable.Select("", String.Join(", ", FieldNames))
             
    If Not fieldValuesAreEqual(lastValues, Row, FieldNames) Then
                  
    newTable.Rows.Add(createRowClone(Row, newTable.NewRow(), FieldNames))

                  
    setLastValues(lastValues, Row, FieldNames)
             
    End If
        
    Next

        
    Return newTable
    End Function

    Private Shared Function fieldValuesAreEqual(ByVal lastValues() As Object, ByVal currentRow As DataRow, ByVal fieldNames() As String) As Boolean
        
    Dim areEqual As Boolean = True

        
    For i As Integer = 0 To fieldNames.Length - 1
             
    If lastValues(i) Is Nothing OrElse Not lastValues(i).Equals(currentRow(fieldNames(i))) Then
                  
    areEqual = False
                  
    Exit For
             
    End If
        
    Next

        
    Return areEqual
    End Function

    Private Shared Function createRowClone(ByVal sourceRow As DataRow, ByVal newRow As DataRow, ByVal fieldNames() As String) As DataRow
        
    For Each field As String In fieldNames
             
    newRow(field) = sourceRow(field)
        
    Next

        
    Return newRow
    End Function

    Private Shared Sub setLastValues(ByVal lastValues() As Object, ByVal sourceRow As DataRow, ByVal fieldNames() As String)
        
    For i As Integer = 0 To fieldNames.Length - 1
             
    lastValues(i) = sourceRow(fieldNames(i))
        
    Next
    End Sub

    C#

    private static DataTable SelectDistinct(DataTable SourceTable, params string[] FieldNames)
    {
        
    object[] lastValues;
        
    DataTable newTable;
        
    DataRow[] orderedRows;

        
    if (FieldNames == null || FieldNames.Length == 0)
             
    throw new ArgumentNullException("FieldNames");

        
    lastValues =
    new object[FieldNames.Length];
        
    newTable =
    new DataTable();

        
    foreach (string fieldName in FieldNames)
             
    newTable.Columns.Add(fieldName, SourceTable.Columns[fieldName].DataType);

        
    orderedRows = SourceTable.Select("",
    string.Join(", ", FieldNames));

        
    foreach (DataRow row in orderedRows)
        
    {
             
    if (!fieldValuesAreEqual(lastValues, row, FieldNames))
             
    {
                  
    newTable.Rows.Add(createRowClone(row, newTable.NewRow(), FieldNames));

                  
    setLastValues(lastValues, row, FieldNames);
             
    }
        
    }

        
    return newTable;
    }

    private static bool fieldValuesAreEqual(object[] lastValues, DataRow currentRow, string[] fieldNames)
    {
        
    bool areEqual = true;

        
    for (int i = 0; i < fieldNames.Length; i++)
        
    {
             
    if (lastValues[i] == null || !lastValues[i].Equals(currentRow[fieldNames[i]]))
             
    {
                  
    areEqual =
    false;
                  
    break;
             
    }
        
    }

        
    return areEqual;
    }

    private static DataRow createRowClone(DataRow sourceRow, DataRow newRow, string[] fieldNames)
    {
        
    foreach (string field in fieldNames)
             
    newRow[field] = sourceRow[field];

        
    return newRow;
    }

    private static void setLastValues(object[] lastValues, DataRow sourceRow, string[] fieldNames)
    {
        
    for (int i = 0; i < fieldNames.Length; i++)
             
    lastValues[i] = sourceRow[fieldNames[i]];
    }

    I thought I'd also point out a downfall from the KB that I corrected in my example.  The beginning of the For Each calls SourceTable.Select("", FieldName)) in it.  This can hinder performance a bit in certain situations, because in C# that statement will be evaulated for every loop iteration.  You can see in the C# code that I have I actually call the Select Method and store the results in a variable and use it in the loop instead.  One other interesting thing to note about the different between the VB and the C# code is that I DID NOT put the results of the Select Method in a variable in the VB code even though I did in the C# code.  This is because VB works differently in that the actual For Each loop only evaluates the first iteration of the loop, which is nice and saves and extra line or two of code.  Another thing that's different between the two (sorry, I think the differences between the two are interesting) is that to instantiate a new Object Array in the lastValues variable in VB you have to put {} after it, and in C# you don't have to.  This is because since VB uses the same characters for Methods vs. Indexers, the compiler wouldn't be able to tell the difference between calling a constructor with one parameter vs. creating a new array, so it's necessary.  In C# it knows since you're using the indexer characters, you mean a new array, not a new object in this case.  That said, if you like using the VB Functions, you could also just call ReDim lastValues(FieldNames.Length), but I tend to stay away from those since I do a lot of code in both VB and C#.

    UPDATE: Adam pointed out that I made the wrong assumption in how the foreach is evaluated in C#.  It is not evaluated every time.  Thanks, Adam!

    UPDATE 2: Jackie Goldstein sent me an e-mail pointing out that .NET 2.0's DataView has a ToTable Method that will build a DataTable based off of the current DataView and also allow you to specify which columns you want in the DataTable as all as being able to say Distinct rows only.  I had totally forgotten about this feature and didn't even know it did distinct rows.  Thanks, Jackie!  :)

    Comments

    Adam said:

    "This can hinder performance a bit in certain situations, because in C# that statement will be evaulated for every loop iteration."

    This most definitely isn't true. Write some code and see for yourself.
    # February 10, 2005 5:17 PM

    Erik Porter said:

    Crap, you're right, Adam! I was thinking about regular for loops and how whatever you have there will be evaluated each time (and again, won't be in VB) and made the silly assumption that the foreach would work the same way. My bad, thanks for the correction!
    # February 10, 2005 6:00 PM

    Cameron said:

    This is great solution, cleaner than the MS KB one. Question. How would one modifiy further to add a "COUNT" column?

    One common use for DISTINCT is to rollup/group results. So it would be nice if a column could be added to include COUNT(*) results of each distinct result. I tried modifying your modified SelectDistinct method but can't figure it out how to add this.

    cheers,

    Cameron
    # February 24, 2005 3:12 PM

    Erik Porter said:

    Thanks, Cameron, glad you like it! :)

    As for the COUNT(*) I haven't had a need for it yet, but I can see it as being handy. Performance-wise I'd say stick with doing it in the DB instead of doing it outside of the DB, but unlike a lot of people that linger around here I realize that performance isn't always the biggest issue.

    I'm sure it's definitely possible, but offhand I'm not sure of an easy way. It seems complicated to me. Maybe there's something you could do with it by getting child rows through a relationship or something. *shrug*
    # February 24, 2005 10:41 PM

    Pranoy said:

    This is a great solution , MS KB had only one row and is not always useful & your solution is complete , a perfect one.
    Thanks a lot for sharing it
    # March 1, 2005 3:24 PM

    Erik Porter said:

    My pleasure, glad you found it useful! :)
    # March 1, 2005 3:40 PM

    Vikram S said:

    ToTable method works great!!! Thanks!
    # November 3, 2006 8:26 PM

    Phil said:

    GOD BLESS YOU THANK YOU!!

    # November 16, 2006 5:24 PM

    Sverker, Sweden said:

    Thanks Erik! In a specific case with a datatable containing 42000 rows and 56 columns the command dataview.ToTable() took 2.5 minutes and by using your code it took 2 seconds!! Quite an improvment! The distinct was only on two columns.

    # November 27, 2006 3:58 AM

    Avinash B said:

    Thanks Erik!

    Your code worked flawlessly. The soultion is clean and flexible. Thanks a lot.

    # December 19, 2006 11:31 AM

    /dev/null said:

    Thanks a lot, saved me so much time.  I was using another routine i found on the net and it took 20 minutes.  Now it takes 1 second.  And is easier to read as well.  The need for a select distinct on a datatable is that not all datatables are fed from a database.  Sometimes it's from reading a CSV file or from another data source like an LDAP directory.

    # February 1, 2007 10:53 AM

    kramlen said:

    here is something I wrote that does the same thing - yet

    just seems easier to understand:

    public static DataTable SelectDistinct(string[] pColumnNames, DataTable pOriginalTable)

           {

               DataTable distinctTable = new DataTable();

               int numColumns = pColumnNames.Length;

               for (int i = 0; i < numColumns; i++)

               {

                   distinctTable.Columns.Add(pColumnNames[i], pOriginalTable.Columns[pColumnNames[i]].DataType);

               }

               Hashtable trackData = new Hashtable();

               foreach (DataRow currentOriginalRow in pOriginalTable.Rows)

               {

                   StringBuilder hashData = new StringBuilder();

                   DataRow newRow = distinctTable.NewRow();

                   for (int i = 0; i < numColumns; i++)

                   {

                       hashData.Append(currentOriginalRow[pColumnNames[i]].ToString());

                       newRow[pColumnNames[i]] = currentOriginalRow[pColumnNames[i]];

                   }

                   if (!trackData.ContainsKey(hashData.ToString()))

                   {

                       trackData.Add(hashData.ToString(), null);

                       distinctTable.Rows.Add(newRow);

                   }

               }

               return distinctTable;

           }

    # February 28, 2007 3:47 AM

    HumanCompiler said:

    kramlen, thanks for posting some code!  You realize that code that I posted is over 2 years old now, right?  I don't claim any intelligence in anything I wrote more than 6 months ago.  ;)  Still seems to be helping out some people though.  Thanks for posting your code and thoughts on it.

    # February 28, 2007 5:46 AM

    Matt said:

    Helped me a lot!  Thanks!!!!

    # April 3, 2007 3:13 PM

    Patrick said:

    Taking these algorithms one step further.  The issue of tracking what values have already been added to the returned DataTable can be eliminated by temporarily setting CaseSensitive to true on SourceTable.

    bool origCaseSensitive = SourceTable.CaseSensitive;

    SourceTable.CaseSensitive = true;

    orderedRows = SourceTable.Select("", string.Join(", ", FieldNames));

    SourceTable.CaseSensitive = true;

    This tells the sort to perform a case sensitive comparison of strings and ensures that all identical values will be correctly grouped contiguously (which the above algorithms and the Microsoft's KB article do not).  Once you have this grouping the for loop in fieldValuesAreEqual and the Hashtable trackData in kramlen's post are no longer needed.

    # April 10, 2007 3:49 PM

    Jawad said:

    Jackie Goldstein u rule

    # April 18, 2007 4:25 PM

    samuel Charles said:

    Tonnes of thanks....

    # May 7, 2007 2:07 AM

    Sanjeev said:

    Thanks a Lot! helped me at a very critical time!! :)
    # May 19, 2007 9:36 AM

    ureyes84 said:

    Great solution, may I suggest something? You can replace the line (and get rid of the method "createRowClone"): newTable.Rows.Add(createRowClone(row, newTable.NewRow(), FieldNames)); by the DataTable.ImportRow method: newTable.ImportRow(row); and they will work seemingly.
    # May 22, 2007 6:24 PM

    晓风残月 said:

    上篇介绍了交叉表的简单实现1:使用存储过程,这里采取在前端程序实现。实现要点:1。读取所有目标成绩(flatScroreTable)2。从目标成绩中提取考试时间(不重复),作为交叉表的列表头3。从目...

    # May 23, 2007 3:27 PM

    晓风残月 said:

    在ADO.NET1.x

    # July 16, 2007 8:58 AM

    uday said:

    Hey All,

    Plz do not feel bad, but u people are suggesting the option of distinct for 1000 - 2000 records. Can anyone give me the solution to view distinct over 20000 to 50000 records with 100 user login ?

    # August 1, 2007 8:09 AM

    Raymond said:

    Compared to the MS KB this was a joy to behold.

    Proving once again that concise code is a developer's (and maintainer's) best friend.

    Much obliged.

    - RK

    # August 1, 2007 5:58 PM

    HumanCompiler said:

    You're welcome, Raymond.  The fun part is that I posted this over 2 years ago.  ;)

    # August 2, 2007 1:21 PM

    Scott said:

    The comment by Jackie Goldstein about using the DataView class along with it's "ToTable()" method does the trick for performing DISTINCT on rows of a DataTable.

    Thanks!

    # August 3, 2007 4:27 PM

    Tomas said:

    Thanks, great article!

    The sad thing is that SelectedColumns doesnt seem to work on a DataGridView when doing .net 2 winapps. I tried use that so that the user could choose columns to view as distinct.

    # August 24, 2007 6:12 AM

    Don Quixote said:

    The hashtable method works great!

    # September 11, 2007 12:20 PM

    Sanjot said:

    The code doesn't select distinct values if the column contains values that are different in case only.

    Setting the Case-sensitivity of the Sourcetable to True (as suggested in one of the comments) works, but makes it much slower.

    Any suggestions?

    # September 17, 2007 8:35 PM

    快乐老鼠Jerry said:

    有些时候需要从DataTable中选择出在某个列上相同的值。但是DataTable.Select没有提供这样的语法。于是,微软和一些其他大虾提供了解决方案。据说以下这个是最快的:privatest...

    # September 27, 2007 4:47 AM

    缘易姿姿 said:

    我们有时候需要对DataTable中数据进行Distinct处理,过滤掉重复的数据,本文给出了解决方法:

    # October 25, 2007 12:20 PM

    Tim Laughlin's Everything VB.NET Blog said:

    I was trying to select some distinct rows from a DataSet datatable. After consulting Google and finding

    # October 31, 2007 12:38 PM

    Jack said:

    Top work..!!

    # November 5, 2007 8:12 PM

    Aftab Ahmed said:

    VB Code:

    ------------------------------------------------------

       Public Function SelectDistinct(ByVal SourceTable As DataTable, ByVal ParamArray Columns() As String) As DataTable

           Dim Result As DataTable = New DataTable()

           If SourceTable IsNot Nothing Then

               Dim DView As DataView = SourceTable.DefaultView

               Try

                   Result = DView.ToTable(True, Columns)

               Catch ex As Exception

               End Try

           End If

           Return Result

       End Function

    ------------------------------------------------------

    C# Code:

    ------------------------------------------------------

       public DataTable SelectDistinct(DataTable SourceTable, params string[] Columns) {

           DataTable Result = new DataTable();

           if (SourceTable != null) {

               DataView DView = SourceTable.DefaultView;

               try {

                   Result = DView.ToTable(true, Columns);

               }

               catch (Exception ex) {

               }

           }

           return Result;

       }

    ------------------------------------------------------

    If you want to through the exception (when column name is the distinct columns list does not exist in the table or any other exception), just remove the Try Catch block.

    Hope this will help you.

    # November 12, 2007 1:23 AM

    ljsql said:

    在DataTable中实现DataTable.Select(

    # January 11, 2008 7:51 PM

    ewitkows said:

    'Begin gratitude

    Dim i as integer = 0

    While 1 <= 10000000

    Response.Write("Thank You!")

    End While

    'End Gratitude

    # January 28, 2008 10:55 PM

    Edward M said:

    To loop through distinct rows you can try:

    foreach (DataRow drDetail in dsInvoice.Tables["tblRMA_Detail"].DefaultView.ToTable("tblDetail", true, new string[] { "Style", "Color", "Ident" }).Rows)

               {}

    # January 30, 2008 4:10 PM

    Sergio said:

    Thanks a lot. Great code Aftab Ahmed!

    # February 5, 2008 2:43 PM

    chinnu said:

    This is a very helpful post for people not on framework 2.0 and up. Just thought I would pitch in.

    There was one post earlier on using stringbuilder but the one thing I did not quite take was the using hashtable part and also the fact that the code needs to handle nullable values as well. Coding with hashtables can really take a hit when you are dealing with real large datasets.

    I think the following logic could be a lil' faster. Hope this works :)

    dtSource ' source table

    dtOut 'distinct table

    columnNames ' array of the column names

    dtOut= dtSource.Clone

    dim x as string= string.empty

    dim y as string=string.empty

    for each dr as datarow in dtSource.Select("",String.Join(columnNames))'sort the table

    for count as integer=0 to columnNames.count-1

       if not isnull(dr(count(ColumnNames))

              x+=cstr(dr(count(ColumnNames))

       end if

       x +=","    'seperator

    next

    (if the columnNames length is the same as dtSource.Columns.Count you can use String.Join to generate the content of x.)

     if not y.equals(x) then

      dtOut.ImportRow(dr)

      y=x

     end if

    next dr

    I did not use vs editor to code the above. So the syntax may be off.

    Good luck-

    Vin

    # February 15, 2008 12:14 AM

    Ronald said:

    It is silly that a normal sql statement cannot be used.

    I need a count and a average. Has any body done this?

    # March 27, 2008 6:52 AM

    Poh said:

    Thanks for the wonderful code, Aftab.

    # April 17, 2008 9:14 PM

    Poh said:

    An easier way to do a select distinct for a datatable is shown here: www.c-sharpcorner.com/.../BlogDetail.aspx

    If our original datatable is dt and we want to save the distinct into a new datatable newdt, then

    string[] distCol = {"col1","col2"};

    newdt = dt.DefaultView.ToTable(true,distCol);

    # April 30, 2008 12:00 AM

    Ben said:

    I see a lot of people saying that the DataView's ToTable method provides an "easier" way to do this.

    They're right, it is easy.  But if you read the comments carefully you'll see quite a few people stating how SLOW this can be.  I have to agree...

    I was able to get a result, about five times quicker than the DataView.ToTable, by simply iterating over every row, checking if the field I'm interested isn't already in my generics list, and if not adding it to the list.  

    I would have though the way I'm doing it would've been extremely slow... but apparently the list.Contains() method is actually quite quick.

    # May 10, 2008 11:50 PM

    APPDEV said:

    the items are reordered. anyone notice this ?

    # June 12, 2008 6:27 AM

    Rafael said:

    You can use

    Dim dt As New DataTable

    dvGrupo = New DataView(dtDePara)

    dt = dvGrupo.ToTable(True, "[field1]", "[field2]")

    # July 30, 2008 2:51 PM

    helloise said:

    private static DataTable SelectDistinct(DataTable SourceTable, params string[] FieldNames)

    i will obviously have to get the field names into params string[] and then call this method.

    how will i do these two things??

    thanks

    helloise

    # September 16, 2008 3:59 AM

    StopGo said:

    I have been wrestling with a problem that your code solved.  The built-in .NET DataTable.DefaultView.ToTable() method pegs the CPU at 100% and hangs when the dataset gets large.  I swapped out that line with a call to SelectDistinct and it ran in a matter of seconds.  From my limited experience it appears that your code is vastly superior to the built-in method and I would recommend people use your code and save themselves some major debugging headaches.  Thanks for sharing...

    # September 16, 2008 7:29 PM

    Three Headed Monkeys » Select Distinct en un DataTable/DataSet said:

    Pingback from  Three Headed Monkeys  &raquo; Select Distinct en un DataTable/DataSet

    # September 22, 2008 5:07 AM

    Gabriel said:

    really thanks guys.. the function rowk really cool

    Saludos desde chile

    # September 29, 2008 4:24 PM
    Leave a Comment

    (required) 

    (required) 

    (optional)

    (required)