DataSets and Serialization

We all know DataSets are not good when you want to serialize them, because they are always serialized as XML plus the XML Schema, even when you use a binary formatter. This means you get a big serialization payload and poor serialization performance every time you send a DataSet over Remoting or store one in an ASP.NET session variable, to mention just two common scenarios.
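To see the problem in action, you can serialize a plain DataSet with a BinaryFormatter and look at the payload size (a minimal sketch; the table layout mirrors the one used later in this post, but the code is illustrative and not from the benchmark project):

using System;
using System.Data;
using System.IO;
using System.Runtime.Serialization.Formatters.Binary;

class PayloadSizeDemo
{
   static void Main()
   {
      // Build a small DataSet: a pre-2.0 DataSet stores XML + schema in the
      // serialization stream even when a binary formatter is used.
      DataSet ds = new DataSet("Customers");
      DataTable table = ds.Tables.Add("Customer");
      table.Columns.Add("CustomerId", typeof(short));
      table.Columns.Add("CustomerName", typeof(string));
      for (int i = 0; i < 1000; i++)
         table.Rows.Add(new object[] { (short) i, "Customer " + i });
      ds.AcceptChanges();

      BinaryFormatter formatter = new BinaryFormatter();
      using (MemoryStream stream = new MemoryStream())
      {
         formatter.Serialize(stream, ds);
         Console.WriteLine("Serialized size: {0} bytes", stream.Length);
      }
   }
}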

I decided to do some research about this. A Google search pointed me to an MSDN Magazine article by Dino Esposito about ADO.NET binary serialization. His solution was to have a 'ghost' class that implemented ISerializable and make that class responsible for serializing the DataSet.

His solution worked, but it's not very friendly, as you need to wrap every Typed DataSet with a 'ghost' class.

Since DeKlarit generates the Typed DataSets itself, it seemed we could change the way serialization is done inside the DataSet and avoid the need for an extra class.

The problem is that System.Data.DataSet implements ISerializable and its GetObjectData is not virtual, so you cannot override it in a Typed DataSet. However, if the Typed DataSet itself re-implements ISerializable, you can define your own GetObjectData and do whatever you want.

In addition, the standard DataSet serialization needs to be very generic and support some scenarios that are not very common. For example, it supports adding a DataTable to an existing Typed DataSet at runtime, so that when the DataSet is deserialized the extra DataTable is recreated in the deserialized object. With Typed DataSets you usually define your DataTables in the XSD, and adding tables or columns at runtime is unusual. So I decided not to support that case, as it reduces the amount of data that needs to be serialized.

If the DataSet structure is not serialized, we only need to serialize the DataRows, while preserving the DataRowState (i.e., if the row is flagged as Deleted, we should keep that flag when deserializing the object). We also need to send the row's .Original values if the row was changed.

When deserializing, we just need to add the rows back to the tables, change the DataRowState, and set the .Original values for modified rows.

The numbers

The results were surprising. The DataSet had one DataTable with two columns, a System.Int16 (CustomerId) and a System.String (CustomerName). The following table shows the numbers for 1, 10, 100, 1000 and 10000 rows:

Rows    % of serialization time    % of deserialization time    % of size
1       9.0%                       11.1%                        13.7%
10      10.8%                      14.1%                        14.5%
100     13.7%                      21.5%                        15.8%
1000    12.8%                      24.4%                        16.3%
10000   18.0%                      30.0%                        16.6%

All the numbers are expressed as percentages of the corresponding value (time or size) for the standard ADO.NET DataSet serialization. Serializing 1000 rows was roughly 8 times faster, deserializing them was around 4 times faster, and the payload was about 6 times smaller.

Then I decided to compare it to serializing a simple object array... and the results were also surprising. The DataSet was faster to serialize, slower to deserialize, and about the same size.
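For reference, the Customer class used in the comparison would look something like this (a sketch; as described in the next paragraph it only has a System.Int16 id and a System.String name, so the exact member names are assumptions):

[Serializable]
public class Customer
{
   public short Id;      // System.Int16
   public string Name;   // System.String
}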

The following table shows the numbers, where 100% is the time it takes to serialize a Customer[], with the Customer class composed of a System.Int16 (Id) and a System.String (Name).

Rows    % serialization time    % deserialization time    % size
1       149%                    350%                      146%
10      81%                     252%                      120%
100     46%                     208%                      101%
1000    42%                     228%                      99%
10000   48%                     237%                      99%

With 1000 rows, the DataSet is roughly twice as fast to serialize, twice as slow to deserialize, and about the same size.

To benchmark the DataSet serialization I filled the DataSet so that the rows are evenly distributed among the Added/Modified/Unchanged/Deleted RowStates, since the way the data is serialized/deserialized depends on the RowState.
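As an illustration, a fill routine along these lines would produce that distribution (a sketch only; it assumes the CustomerDataSet/Customer table shown later in this post, and the helper name is made up):

static void FillWithMixedRowStates(CustomerDataSet ds, int rowCount)
{
   for (int i = 0; i < rowCount; i++)
   {
      DataRow row = ds.Customer.NewRow();
      row["CustomerId"] = (short) i;
      row["CustomerName"] = "Customer " + i;
      ds.Customer.Rows.Add(row);

      switch (i % 4)
      {
         case 0:                      // Added: leave the row as inserted
            break;
         case 1:                      // Unchanged
            row.AcceptChanges();
            break;
         case 2:                      // Modified: accept, then change a value
            row.AcceptChanges();
            row["CustomerName"] = "Changed " + i;
            break;
         case 3:                      // Deleted
            row.AcceptChanges();
            row.Delete();
            break;
      }
   }
}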

When comparing the new DataSet serialization with the object array serialization, I thought it was unfair to have the RowStates distributed that way, since the object array does not carry that information. So I also compared the object array serialization against a DataSet with all rows marked as 'Added', which is the best scenario for serialization/deserialization. In this case the numbers were:

Rows    % serialization time    % deserialization time    % size
1       150%                    342%                      146%
10      73%                     223%                      108%
100     45%                     180%                      86%
1000    39%                     210%                      83%
10000   37%                     214%                      83%

As expected, the numbers were a little better for the DataSet than in the previous case.

The code

To do the serialization, the DataSet needs to implement ISerializable:

public class CustomerDataSet : System.Data.DataSet, System.Runtime.Serialization.ISerializable

The following is the DataSet’s GetObjectData method:

void System.Runtime.Serialization.ISerializable.GetObjectData(SerializationInfo si, StreamingContext context)
{ 
   si.AddValue("Customer", Customer.GetObjectData());
}

And the following method is added to the DataTable, which copies the data in the DataTable to an ArrayList:

public ArrayList GetObjectData()
{
   ArrayList dataRows = new ArrayList();
   // Insert rows information into a worker array
   foreach(DataRow row in this.Rows)
   {
      dataRows.Add((int) row.RowState);
      switch (row.RowState)
      {
         case DataRowState.Modified:     
            dataRows.Add(row[columnCustomerId, DataRowVersion.Original]);
            dataRows.Add(row[columnCustomerName, DataRowVersion.Original]);
            dataRows.Add(row[columnCustomerId, DataRowVersion.Current]);
            dataRows.Add(row[columnCustomerName, DataRowVersion.Current]);
            break;
         case DataRowState.Deleted:
            dataRows.Add(row[columnCustomerId, DataRowVersion.Original]);
            dataRows.Add(row[columnCustomerName, DataRowVersion.Original]);
            break;
         default:
            dataRows.Add(row[columnCustomerId, DataRowVersion.Current]);
            dataRows.Add(row[columnCustomerName, DataRowVersion.Current]);
            break;
      }
   }
   return dataRows;
}

To do the deserialization, we add the following constructor to the DataSet:

protected CustomerDataSet(SerializationInfo info, StreamingContext context)
{
   // Rebuild the schema exactly as defined in the typed DataSet
   InitClass();
   // Disable constraint checking while the rows are loaded back in
   this.EnforceConstraints = false;
   Customer.SetObjectData((ArrayList) info.GetValue("Customer", typeof(ArrayList)));
   this.EnforceConstraints = true;
}

And we add the following method to the DataTable, which loads the ArrayList contents back into the DataTable:

public void SetObjectData(ArrayList dataRows)
{
   int dataRowIdx = 0;
   int itemCount = dataRows.Count;
   while (dataRowIdx < itemCount)
   {
      // Each row starts with its original RowState, followed by its column values
      DataRowState state = (DataRowState) ((int) dataRows[dataRowIdx++]);
      DataRow row = NewRow();
      row[columnCustomerId] = dataRows[dataRowIdx++];
      row[columnCustomerName] = dataRows[dataRowIdx++];
      Rows.Add(row);

      switch (state)
      {
         case DataRowState.Unchanged:
            row.AcceptChanges();
            break;
         case DataRowState.Deleted:
            // Only the Original values were serialized; accept them, then re-delete
            row.AcceptChanges();
            row.Delete();
            break;
         case DataRowState.Modified:
            // The first pair holds the Original values, the second pair the Current ones
            row.AcceptChanges();
            row[columnCustomerId] = dataRows[dataRowIdx++];
            row[columnCustomerName] = dataRows[dataRowIdx++];
            break;
         // Added rows need nothing else: a newly added row is already in that state
      }
   }
}
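With both pieces in place, a round trip is just the usual BinaryFormatter call; the formatter picks up the re-implemented ISerializable automatically (a minimal usage sketch, assuming dataSet is an existing CustomerDataSet instance; it is not part of the benchmark code):

BinaryFormatter formatter = new BinaryFormatter();
using (MemoryStream stream = new MemoryStream())
{
   // Calls the typed DataSet's GetObjectData
   formatter.Serialize(stream, dataSet);
   stream.Position = 0;
   // Calls the serialization constructor, which rebuilds the schema and the rows
   CustomerDataSet copy = (CustomerDataSet) formatter.Deserialize(stream);
}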

You can get the VS.NET 2003 project for the benchmarks here. It takes about 1 hour and 20 minutes to run on my box; you can lower the number of iterations to make it run faster.

Moving forward

How can you take advantage of this? I can think of three ways:

  • Change the generated code for the DataSet. This is not a good option, because you should never modify the code generated by a tool.
  • Write a surrogate class that handles the serialization. This could work, but it's not 'transparent', as you need to trigger the serialization yourself to use it.
  • Adapt ADO Guy's typed DataSet generator so it performs the serialization using the approach I followed.

And of course you can also buy a tool that generates Typed DataSets ;).

8 Comments

  • what about the schema?

    sometimes you have more than just table schema..

  • The constructor that is called during deserialization calls InitClass(), which adds all the tables, columns, relations, constraints, etc., that were originally defined in the typed DataSet. So, all the schema information is kept.

    It won't work if you changed the DataSet definition at runtime in the caller, as those changes won't be serialized.

  • why not just do dataSet.WriteXml, then serialize the resulting string? You can add the necessary arguments to ensure you have the complete XML, as well as optionally write out the schema, I believe. Then re-hydrate on the other end....

  • That's what the original serialization does. It works but it's slow and big.

  • Great!!!!!!

    Now, I have a problem with an untyped dataset.

    The schema has to travel with the data every time; what do you think about creating a channel sink that caches the schema, so the second time I bring the dataset it doesn't have to bring the schema?

  • Very interesting.

    In VB.NET I can't seem to override the DataSet's own implementation of ISerializable; I get the compiler error "is already implemented by base class".

    Any suggestions (other than use C#!)?

  • I used the surrogate wrapper solution presented in Microsoft KB article #829740, with a small twist. I added a new member to store the DataSet's type and then used reflection on the typed dataset to recreate it during deserialization. To get the data in without much code change, I used DataSet.Merge(). Seems to work fine!

  • Just to let you know: if you replace the DBNull values with Nothing in your worker array, your data takes up dramatically less space when serialized (depending on how many DBNulls you have!).

    I am talking about storing datasets in the viewstate of an ASP.NET page; I have not tried other serialization formatters.
