Select DISTINCT on DataTable
In a project I'm doing I needed to basically do a DISTINCT on a couple fields in a DataTable that I had used earlier in code. Did some Google searching and came up with this MS KB on the subject. I was hoping to find it in the Framework, but hey, writing a little extra code never hurt anyway. Looking over the code though I found that I just didn't really like it. Biggest reason being it only took one field. I wanted to be able to pass in n number of fields, so I rewrote it and did a little tweaking to it. Here's what I came up with in both VB and C#
VB
Public Shared Function SelectDistinct(ByVal SourceTable As DataTable, ByVal ParamArray FieldNames() As String) As DataTable
Dim lastValues() As Object
Dim newTable As DataTable
If FieldNames Is Nothing OrElse FieldNames.Length = 0 Then
Throw New ArgumentNullException("FieldNames")
End If
lastValues = New Object(FieldNames.Length - 1) {}
newTable = New DataTable
For Each field As String In FieldNames
newTable.Columns.Add(field, SourceTable.Columns(field).DataType)
Next
For Each Row As DataRow In SourceTable.Select("", String.Join(", ", FieldNames))
If Not fieldValuesAreEqual(lastValues, Row, FieldNames) Then
newTable.Rows.Add(createRowClone(Row, newTable.NewRow(), FieldNames))
setLastValues(lastValues, Row, FieldNames)
End If
Next
Return newTable
End Function Private Shared Function fieldValuesAreEqual(ByVal lastValues() As Object, ByVal currentRow As DataRow, ByVal fieldNames() As String) As Boolean
Dim areEqual As Boolean = True
For i As Integer = 0 To fieldNames.Length - 1
If lastValues(i) Is Nothing OrElse Not lastValues(i).Equals(currentRow(fieldNames(i))) Then
areEqual = False
Exit For
End If
Next
Return areEqual
End Function Private Shared Function createRowClone(ByVal sourceRow As DataRow, ByVal newRow As DataRow, ByVal fieldNames() As String) As DataRow
For Each field As String In fieldNames
newRow(field) = sourceRow(field)
Next
Return newRow
End Function Private Shared Sub setLastValues(ByVal lastValues() As Object, ByVal sourceRow As DataRow, ByVal fieldNames() As String)
For i As Integer = 0 To fieldNames.Length - 1
lastValues(i) = sourceRow(fieldNames(i))
Next
End Sub C#
private static DataTable SelectDistinct(DataTable SourceTable, params string[] FieldNames)
{
object[] lastValues;
DataTable newTable;
DataRow[] orderedRows;
if (FieldNames == null || FieldNames.Length == 0)
throw new ArgumentNullException("FieldNames");
lastValues = new object[FieldNames.Length];
newTable = new DataTable();
foreach (string fieldName in FieldNames)
newTable.Columns.Add(fieldName, SourceTable.Columns[fieldName].DataType);
orderedRows = SourceTable.Select("", string.Join(", ", FieldNames));
foreach (DataRow row in orderedRows)
{
if (!fieldValuesAreEqual(lastValues, row, FieldNames))
{
newTable.Rows.Add(createRowClone(row, newTable.NewRow(), FieldNames));
setLastValues(lastValues, row, FieldNames);
}
}
return newTable;
} private static bool fieldValuesAreEqual(object[] lastValues, DataRow currentRow, string[] fieldNames)
{
bool areEqual = true;
for (int i = 0; i < fieldNames.Length; i++)
{
if (lastValues[i] == null || !lastValues[i].Equals(currentRow[fieldNames[i]]))
{
areEqual = false;
break;
}
}
return areEqual;
} private static DataRow createRowClone(DataRow sourceRow, DataRow newRow, string[] fieldNames)
{
foreach (string field in fieldNames)
newRow[field] = sourceRow[field];
return newRow;
} private static void setLastValues(object[] lastValues, DataRow sourceRow, string[] fieldNames)
{
for (int i = 0; i < fieldNames.Length; i++)
lastValues[i] = sourceRow[fieldNames[i]];
} I thought I'd also point out a downfall from the KB that I corrected in my example. The beginning of the For Each calls SourceTable.Select("", FieldName)) in it. This can hinder performance a bit in certain situations, because in C# that statement will be evaulated for every loop iteration. You can see in the C# code that I have I actually call the Select Method and store the results in a variable and use it in the loop instead. One other interesting thing to note about the different between the VB and the C# code is that I DID NOT put the results of the Select Method in a variable in the VB code even though I did in the C# code. This is because VB works differently in that the actual For Each loop only evaluates the first iteration of the loop, which is nice and saves and extra line or two of code. Another thing that's different between the two (sorry, I think the differences between the two are interesting) is that to instantiate a new Object Array in the lastValues variable in VB you have to put {} after it, and in C# you don't have to. This is because since VB uses the same characters for Methods vs. Indexers, the compiler wouldn't be able to tell the difference between calling a constructor with one parameter vs. creating a new array, so it's necessary. In C# it knows since you're using the indexer characters, you mean a new array, not a new object in this case. That said, if you like using the VB Functions, you could also just call ReDim lastValues(FieldNames.Length), but I tend to stay away from those since I do a lot of code in both VB and C#.
UPDATE: Adam pointed out that I made the wrong assumption in how the foreach is evaluated in C#. It is not evaluated every time. Thanks, Adam!
UPDATE 2: Jackie Goldstein sent me an e-mail pointing out that .NET 2.0's DataView has a ToTable Method that will build a DataTable based off of the current DataView and also allow you to specify which columns you want in the DataTable as all as being able to say Distinct rows only. I had totally forgotten about this feature and didn't even know it did distinct rows. Thanks, Jackie! :)