DataSets and Serialization
We all know DataSets are not good when you want to serialize them, because they
always are serialized as XML + the XMLSchema even if you are using a binary
formatter. This implies you get a big serialization payload and bad
serialization performance each time you send a DataSet using Remoting or when
you store a DataSet in a ASP.NET session variable, just to mention two common
scenarios.
I decided to do some research about this. A google search pointed me to an
MSDN Magazine article by Dino Eposito talking about ADO.NET Binary
Serialization. His solution was to have a 'ghost' class that implemented
ISerializable and make that class responsible for serializing a dataset.
His solution worked but it's not very friendly, as you need to wrap all your
Typed DataSets with a 'ghost' class.
As with DeKlarit we generate the Typed
DataSets ourselves, it seemed that we could modify the way the serialization is
done inside the DataSet, avoiding the need of an extra class.
The problem is that System.Data.DataSet implements ISerializable and the
GetObjectData is not virtual, so you cannot override it in the Typed DataSets,
but, if the Typed DataSets itself implements ISerializable, then you can define
your own GetObjectData and do whatever you want.
In addition, the standard DataSet serialization needs to be very generic and
support some scenarios that are not very common. For example, it supports
adding a DataTable to an existing typed DataSet, so when it deserializes it,
the DataTable is created in the deserialized object. With Typed DataSets you
usually define your DataTables in the XSD, and it is not usual to add more
columns in runtime. So, I decided to not to support that case, as it reduces
the amount of data that needs to be serialized.
If the DataSet structure is not serialized, we only need to serialize the
DataRows, while preserving the DataRowState (i.e., if the row is flagged as
Deleted, we should keep that flag when deserializing the object). We also need
to send the row's .Original values if the row was changed.
When deserializing, we just need to add the rows back to the tables, change the
DataRowState, and set the .Original values for modified rows.
The numbers
The results were surprising. The DataSet had one DataTable and two columns, a
System.Int16 (CustomerId) and a System.String (CustomerName). The following
table shows the numbers for 1, 10, 100, 1000 and 10000 rows:
|
Rows
|
% of serialization time
|
% of deserialization time
|
% of size
|
|
1
|
9.0%
|
11.1%
|
13.7%
|
|
10
|
10.8%
|
14.1%
|
14.5%
|
|
100
|
13.7%
|
21.5%
|
15.8%
|
|
1000
|
12.8%
|
24.4%
|
16.3%
|
|
10000
|
18.0%
|
30.0%
|
16.6%
|
All the numbers are expressed as percentages of the time compared to the
serialization of the original ADO.NET DataSet. Serializing 1000 rows was almost
10 times faster, deserializing them around 4 times faster and the size about 6
times smaller.
Then I decided to compare it to serializing a simple object array... and the
results were also surprising. The DataSet was faster to serialize, slower to
deserialize, and about the same size.
The following table shows the numbers, where the 100% is the time it takes to
serialize a Customer[], where the Customer class was composed by a System.Int16
(Id) and System.String (name).
|
Rows
|
% serialization time
|
% deserialization time
|
% size
|
|
1
|
149%
|
350%
|
146%
|
|
10
|
81%
|
252%
|
120%
|
|
100
|
46%
|
208%
|
101%
|
|
1000
|
42%
|
228%
|
99%
|
|
10000
|
48%
|
237%
|
99%
|
With 1000 rows is roughly twice as fast to serialize with the DataSet, twice as
slow to deserialize, and about the same size.
To benchmark the DataSet serialization I filled the DataSet in such a way that
the rows are evenly distributed between Added/Modified/Unchanged/Deleted
RowState, as the way the data is serialized/deserialized changes depending on
that.
When comparing the new DataSet serialization with the object array
serialization, I thought it was unfair to have the RowState distributed in that
way, because as the object array does not carry that information. So, I also
compared the object array serialization with a DataSet with all the rows marked
as 'Added', which is the best scenario for serialization/deserialization. In
this case the numbers were:
|
Rows
|
% serialization time
|
% deserialization time
|
% size
|
|
1
|
150%
|
342%
|
146%
|
|
10
|
73%
|
223%
|
108%
|
|
100
|
45%
|
180%
|
86%
|
|
1000
|
39%
|
210%
|
83%
|
|
10000
|
37%
|
214%
|
83%
|
As expected, the numbers were a little better for the DataSet than the previous
case.
The code
To do the serialization, the DataSet needs to implement ISerializable:
public class CustomerDataSet : System.Data.DataSet, System.Runtime.Serialization.ISerializable
The following is the DataSet’s GetObjectData method:
void System.Runtime.Serialization.ISerializable.GetObjectData(SerializationInfo si, StreamingContext context)
{
si.AddValue("Customer", Customer.GetObjectData());
}
And the following method is added to the DataTable, which copies the Data in a
DataTable to an ArrayList:
public ArrayList GetObjectData()
{
ArrayList dataRows = new ArrayList();
// Insert rows information into a worker array
foreach(DataRow row in this.Rows)
{
dataRows.Add((int) row.RowState);
switch (row.RowState)
{
case DataRowState.Modified:
dataRows.Add(row[columnCustomerId, DataRowVersion.Original]);
dataRows.Add(row[columnCustomerName, DataRowVersion.Original]);
dataRows.Add(row[columnCustomerId, DataRowVersion.Current]);
dataRows.Add(row[columnCustomerName, DataRowVersion.Current]);
break;
case DataRowState.Deleted:
dataRows.Add(row[columnCustomerId, DataRowVersion.Original]);
dataRows.Add(row[columnCustomerName, DataRowVersion.Original]);
break;
default:
dataRows.Add(row[columnCustomerId, DataRowVersion.Current]);
dataRows.Add(row[columnCustomerName, DataRowVersion.Current]);
break;
}
}
return dataRows;
}
To do the deserialization, we add the following constructor to the DataSet:
protected CustomerDataSet(SerializationInfo info, StreamingContext context)
{
InitClass();
this.EnforceConstraints = false;
Customer.SetObjectData((ArrayList) info.GetValue("Customer", typeof(ArrayList)));
this.EnforceConstraints = true;
}
And add the following method to the DataTable, which adds the ArrayList
contents to the DataTable:
public void SetObjectData(ArrayList dataRows)
{
int dataRowIdx = 0;
int rowCount = dataRows.Count - 1;
while (dataRowIdx < rowCount)
{
DataRowState state = (DataRowState) ((int) dataRows[dataRowIdx++]);
DataRow row = NewRow();
row[columnCustomerId] = dataRows[dataRowIdx++];
row[columnCustomerName] = dataRows[dataRowIdx++];
Rows.Add(row);
switch (state)
{
case DataRowState.Unchanged:
row.AcceptChanges();
break;
case DataRowState.Deleted:
row.AcceptChanges();
row.Delete();
break;
case DataRowState.Modified:
row.AcceptChanges();
row[columnCustomerId] = dataRows[dataRowIdx++];
row[columnCustomerName] = dataRows[dataRowIdx++];
break;
}
}
}
You can get the VS.NET 2003 project for the benchmarks
here. It takes 1 hour and 20mins in my box to run, you can lower the
number of iterations to make it run faster.
Moving forward
How can you take advantage of this? I can think of three ways:
-
Change the generated code for the DataSet, but it's not a good option, because
you should never modify the code generated by a tool.
-
Write a surrogate class that handles the serialization. This could work, but
it's not 'transparent', as you need to trigger the serialization yourself to
use it.
-
Adapt
ADO Guy's typed DataSet generator so it performs the serialization using
the approach I followed.
And of course you can also buy a tool that generates Typed DataSets ;).