'You want a Dataset with that DAL, sir?' 'No, thank you.'

Yesterday, I saw several blogs about the data-layer phenomenon and how to construct one. Now, there are several roads that lead to a glorious solution: some muddy, some paved with the finest asphalt. Today, I'd like to talk about one of the muddier ones: ye olde 'DataSet Route'.

When I started with .NET and C# in January 2002, I was eager to try out the new data-oriented objects and tools built into Visual Studio.NET. Like many of you, I too have experienced that vast, aching feeling of deep disappointment when I discovered the Data Component in Visual Interdev was totally unusable in an n-tier solution (and thus resulted in long, boring weeks of typing Data Access Layer (DAL) code calling stored procedures for Your Basic Database Operations™: the Create, Retrieve, Update, and Delete actions). Boy, was I excited when I read that Microsoft had included new tools in Visual Studio.NET which would create all that boring code automatically, and would even let you bind the results to any control you wanted!

I started a little test application, nothing big, using ASP.NET, C# and SQL Server. Following one of the many walkthrough tutorials, I started creating classes, dragging and dropping SqlConnection objects, DataAdapters and what have you. After a couple of minutes of dragging and dropping objects onto canvases, I got this wobbly feeling deep inside me that I'd been here before, but I thought "Naah... this is new stuff, it can't be".

However, after completing the tutorial and seeing the goopy code constructing a nice datagrid on my shiny ASP.NET page, filling it with data leeched from the Northwind database, that annoying feeling inside me was still there. To get rid of it, to convince myself my mind was playing hide and seek with me, I took a closer look at the code I had constructed using the tutorial. "See? It looks totally different!"

It didn't help. It was Visual Interdev and its winner of the Useless Object of the Year award, the Data Component Design Time Control (DC DTC), all over again. Well, not totally. There was a difference: the VS.NET / .NET approach runs at the server using disconnected sets of data. For the rest, it was all the same: connection string definitions hardcoded (generated!) into the code, no middle tier. No middle tier? Not even a DAL tier! And where was the stateless paradigm Microsoft once preached to us developers?

To use the tools inside Visual Studio.NET to work with the new data-related objects, you have to sell your soul to the IDE so it can demand you do as you're told. By 'using' I mean: letting the provided tools generate code for you that you would otherwise have to type in, saving precious time. For starters: to use the drag/drop features in Visual Studio.NET, you have to derive your class, which will utilize these dropped objects, from System.ComponentModel.Component. This can have serious implications for your object model, especially when you consider that .NET sports single inheritance, not multiple inheritance. If you want to use a different object model, you can't use the nice features: sell your soul or burn your fingers typing it all out yourself.
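
To make that concrete, here's roughly what the designer hands you when you drop those objects on the canvas. This is a minimal sketch, not actual generated output, and the class and member names are mine:

    using System.ComponentModel;
    using System.Data.SqlClient;

    // The designer requires that the hosting class derives from Component,
    // and with that, your one and only base-class slot is gone.
    public class CustomerData : Component
    {
        private SqlConnection _connection;
        private SqlDataAdapter _adapter;

        public CustomerData()
        {
            // Designer-style initialization: the connection string ends up
            // hardcoded in the class (these values are placeholders).
            _connection = new SqlConnection(
                "server=(local);database=Northwind;Integrated Security=SSPI");
            _adapter = new SqlDataAdapter(
                "SELECT CustomerID, CompanyName FROM Customers", _connection);
        }
    }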

Like Visual Interdev, Visual Studio.NET doesn't help you a hell of a lot when you want to develop an n-tier application using the new data-oriented objects. This is a disappointment to some degree, but after a while you get over it, eventually, and decide to start typing the code using these data-oriented objects by hand. This is tedious work, defining all those data adapters with the queries they have to run, the parameter declarations the command objects need, et cetera, et cetera, but better that than selling your soul, right? So I chose the typing route. I started a new class which would represent a DAL class, and which would retrieve the same data as I used in the tutorial, but now using hand-typed code, because I wanted a slim, lean class to work with.
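
For the curious: the hand-typed variant of such a DAL class looks roughly like the sketch below. The table and method names are just illustrations, not the tutorial's exact code:

    using System.Data;
    using System.Data.SqlClient;

    // A slim, lean, hand-written DAL class: no Component base class,
    // no designer, just the data-oriented objects themselves.
    public class CustomerDAL
    {
        private string _connectionString;

        public CustomerDAL(string connectionString)
        {
            _connectionString = connectionString;
        }

        public DataSet GetCustomersByCountry(string country)
        {
            using(SqlConnection connection = new SqlConnection(_connectionString))
            {
                SqlCommand command = new SqlCommand(
                    "SELECT CustomerID, CompanyName, Country FROM Customers " +
                    "WHERE Country = @country", connection);
                command.Parameters.Add("@country", SqlDbType.NVarChar, 15).Value = country;

                SqlDataAdapter adapter = new SqlDataAdapter(command);
                DataSet result = new DataSet("CustomerData");
                adapter.Fill(result, "Customers"); // Fill() opens and closes for us
                return result;
            }
        }
    }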

The tutorial I used worked with the DataSet. Reading through all the material Microsoft has released about using data in .NET, this object must be the best thing since sliced bread. Looking at its features, it is an impressive piece of work: versioning, XML support, multi-table support, relations, you name it. It is disconnected, so in fact it's a cached version of the data held by the original data source. This sounds like an advantage over the ADO recordset object. Being an in-memory cache, however, is also its biggest disadvantage. Let's investigate when this is a disadvantage and why.
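
A quick tour of those features, as a sketch on an already-filled DataSet (assume adapters filled an "Order" and an "OrderRow" table; the names are illustrative):

    using System;
    using System.Data;

    public class DataSetTour
    {
        public static void Tour(DataSet orders)
        {
            // Multi-table support, with relations between the tables.
            orders.Relations.Add("OrderDetails",
                orders.Tables["Order"].Columns["OrderID"],
                orders.Tables["OrderRow"].Columns["OrderID"]);

            // Versioning: the original value survives next to the change.
            DataRow row = orders.Tables["Order"].Rows[0];
            row["ShipCity"] = "Amsterdam";
            Console.WriteLine(row["ShipCity", DataRowVersion.Original]);

            // XML support, including a diffgram of just the changes.
            orders.WriteXml("orders.xml", XmlWriteMode.DiffGram);
        }
    }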

In single-user applications, often one-tier or two-tier, the DataSet is not a problem. The problem starts when multiple users are using your application and it utilizes DataSets. In multi-user applications built with the n-tier paradigm, like a basic web application, the different users, each running their own application version in their own thread, share only a few things with each other: the application code and the application state. The application state is the complete state the application is in, thus the state shared by all users. Normally this is the current state stored in the shared persistent storage used by the application; in many situations this is a database. Because this application state is shared among all user threads in the application, when it's altered in thread T1, all other threads should immediately work with the changed application state. Otherwise you'll have multiple versions of that application state: each thread will have its own, and the data used by a given thread can be out of sync with the application state.

To achieve fast propagation of application-state changes to other threads, developers learned that you should make the changes directly in the physically shared repository where the application state is stored: the database. If you then also make sure that whenever you require data from the application state which can be changed by other threads, you read it from that same physically shared repository, you can be sure you have the latest data, read from a correct application state. This is and was the idea behind Windows DNA and stateless development using ASP and MTS/COM+.

As said, using DataSets to cache data from that application state can make the threads using that cached data go out of sync with the application state seen by other threads. The main reason for this is that DataSets created in different threads use their own connection to the database, via a different adapter. This means that when two users, U1 and U2 (no, not that rock band), working with the same web application, run their own threads on the server, and they independently request the same rows from the database, they use two different connections to do so, holding two different DataSets. Semantically, however, it's the same data: each row in a DataSet represents an entity instance, and now two users each have a copy of the same entity in memory.

When U1 changes something in a given row in his DataSet and saves it back to the database, U2 will not see these changes. If U2 changes something else in her DataSet, U1 will not see those changes either. When U2, always the smartest, fastest kid on the block, is already done with her work and saves her DataSet's changes, U1 will be left in the cold later on when he wants to save his changes back to the database. And this can be confusing, because the changes made by U1 did succeed, that is: when they were made in the DataSet. They were not yet propagated back to the application state in the database, but that was only a matter of time. Code run by U1 now has to deal with a failed update of the DataSet, which it did not expect, since the updates to the rows in the DataSet went fine.
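
Here's a sketch of that unpleasant surprise, assuming the adapter uses the usual optimistic-concurrency commands (a SqlCommandBuilder, like the designer, generates an UPDATE whose WHERE clause compares the original values read into the DataSet):

    using System.Data;
    using System.Data.SqlClient;

    public class CustomerPersister
    {
        // U1's save, some time after U2 already persisted her changes.
        public static void SaveChanges(DataSet customers, string connectionString)
        {
            using(SqlConnection connection = new SqlConnection(connectionString))
            {
                SqlDataAdapter adapter = new SqlDataAdapter(
                    "SELECT CustomerID, CompanyName, Phone FROM Customers", connection);
                // Generates the UPDATE command, WHERE clause comparing original values.
                SqlCommandBuilder builder = new SqlCommandBuilder(adapter);
                try
                {
                    adapter.Update(customers, "Customers");
                }
                catch(DBConcurrencyException)
                {
                    // U2 changed the row in the meantime: the UPDATE's WHERE
                    // clause matches 0 rows, and the edits that 'succeeded'
                    // in the DataSet bounce right here.
                }
            }
        }
    }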

The DataSet as a concept promotes this kind of application development, getting developers into trouble if they don't understand the possible statefulness-related disadvantages of the DataSet. Too many times, however, 'helped' by the unhelpful utilities in Visual Studio.NET, beginning developers, and also developers who are new to n-tier, stateless, multi-user applications, fall into this trap. And it's unnecessary, but Microsoft doesn't help here.

When you investigate the MSDN library which comes with Visual Studio.NET, you'll notice that Microsoft uses two main approaches for dealing with data in .NET applications: (a) the DataSet way and (b) the DataReader way. The DataSet way is well documented and supported by a big gang of utilities inside the Visual Studio.NET IDE. The DataReader way is not; it is meant for the die-hard developers who want to type in everything themselves. As illustrated above, that means that as a developer you have two choices: use the Visual Studio.NET tools and the DataSet approach, or type it in yourself and pick whichever of the two approaches you like.

Because of the vast amount of documentation, most developers go for the DataSet approach, while the DataReader approach is more suitable in most cases. The DataReader approach uses read-only ways to read data for viewing/consumption by Business Logic processes and uses command objects to execute changes directly in the application state. This means there is no in-memory caching, code will not make the same change twice (first in the in-memory object, the DataSet, then in the database itself), and when data is required for some process, it is read directly from the database instead of from an in-memory cached version of a piece of the application state. However, the DataReader approach requires much more typing, and therefore is not the preferred choice for a lot of developers.
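
In code, the DataReader approach boils down to something like the following sketch (the queries and names are illustrative): reads go through a forward-only, read-only reader, writes go through a command executed directly against the database:

    using System.Data;
    using System.Data.SqlClient;

    public class CustomerReaderDAL
    {
        private string _connectionString;

        public CustomerReaderDAL(string connectionString)
        {
            _connectionString = connectionString;
        }

        // Read: straight from the application state, nothing cached.
        public void ConsumeCustomers(string country)
        {
            using(SqlConnection connection = new SqlConnection(_connectionString))
            {
                SqlCommand command = new SqlCommand(
                    "SELECT CustomerID, CompanyName FROM Customers WHERE Country = @country",
                    connection);
                command.Parameters.Add("@country", SqlDbType.NVarChar, 15).Value = country;
                connection.Open();
                using(SqlDataReader reader = command.ExecuteReader())
                {
                    while(reader.Read())
                    {
                        // hand each row to the consuming business logic or UI
                    }
                }
            }
        }

        // Write: one change, made once, directly in the application state.
        public int UpdatePhone(string customerID, string phone)
        {
            using(SqlConnection connection = new SqlConnection(_connectionString))
            {
                SqlCommand command = new SqlCommand(
                    "UPDATE Customers SET Phone = @phone WHERE CustomerID = @id",
                    connection);
                command.Parameters.Add("@phone", SqlDbType.NVarChar, 24).Value = phone;
                command.Parameters.Add("@id", SqlDbType.NChar, 5).Value = customerID;
                connection.Open();
                return command.ExecuteNonQuery();
            }
        }
    }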

How can this be solved? Frankly, with the current DataSet this is hard to do. Because it caches data in memory, to avoid two or more copies of the same data in DataSets used in more than one thread, it would have to hold, per row, an object containing the entity data, instead of plain data in rows. These objects could then be shared among the DataSets created in the different threads. This means that when thread T1 reads row R from table T, and T2 does the same, both threads will hold a DataSet which shares a single object containing the data of row R. If T1 changes the data in that row, T2 will automatically see those changes. This approach is common among O/R mapper frameworks in the Java world, which are now slowly but steadily entering the .NET world. Another solution would be to not cache the data extensively, but to make changes directly on the application state in the database, and to read data from that application state when it is required, thus not from cached in-memory DataSets, which can be out of sync with the application state.
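
A minimal sketch of that first idea: one shared entity object per database row, keyed on its primary key. This is entirely hypothetical, not an existing .NET API:

    using System.Collections;

    // Hands every thread the same entity instance for the same row,
    // so a change made by T1 is immediately visible to T2.
    public class EntityCache
    {
        private static Hashtable _entities = new Hashtable();

        public static object GetEntity(string table, object primaryKey, IEntityLoader loader)
        {
            string key = table + "/" + primaryKey.ToString();
            lock(_entities.SyncRoot)
            {
                object entity = _entities[key];
                if(entity == null)
                {
                    // first request: read the row from the database once
                    entity = loader.Load(table, primaryKey);
                    _entities[key] = entity;
                }
                return entity; // same instance for every thread that asks
            }
        }
    }

    // Hypothetical hook for the actual database read.
    public interface IEntityLoader
    {
        object Load(string table, object primaryKey);
    }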

Microsoft pulled their ObjectSpaces framework off the net and, from what I heard from several people, is now reworking it into an O/R-mapping-like framework, together with the next generation of DataSets. Let's hope MS does it right this time, so novice developers, who do not know the dark ages of Windows DNA, will be driven towards functionality which helps them the way they expect it to, this time without caveats.

5 Comments

  • Even with a data reader, you will end up caching data in memory at some point along the line (whether it is in your own objects, the UI, etc.). Not using the DataSet does little to solve this problem. What it all boils down to is that there isn't any really easy way to get around the concurrency issue in a disconnected environment.

    DataSets are very cool. ObjectSpaces, which is built on top of the DataSet (at least at the time of the last preview), will let you do the same types of things (and much more) with objects, and shows just how cool the DataSet is. Still, I do agree that the DataSet is not suitable for a complex domain model. Custom objects always win out there (but again, they must cache state in the same way that the DataSet does... it is just a pain to write all the extra code).

  • Caching in ANY form is an issue, and you always have to take into account that when you read data at a given time T, the microsecond after that it can be changed by another thread. Indeed, if you do not want that, you should lock the row, which doesn't help performance.

    DataSets however give you the feeling they take care of this, which they don't (in full). The problem is that they make it easier for the developer to work in a stateful way with data cached in memory than, for example, the DataReader approach does. If you need to update a user row, for example, you can simply execute an UPDATE statement on the database, or update a cached user row in a DataSet and then persist it. The latter will probably not work well in an environment with heavily changing data.

    Andres: your example of the web farm and the multiple AppDomains is a good one.

    I'm not against the usage of a DataSet as a data container between tiers, however if you use it as an equivalent of Fowler's Unit of Work, it gets a little tricky.

  • I found working with a timestamp column very helpful. See "Distributed Concepts and Design" (Addison-Wesley), a must-have book for enterprise developers.

    Good post Frans, too bad lots of people (even on dotnetweblogs) seem to be indoctrinated by MS's marketing strategy and refuse to dig in on this issue.

  • If you are trying to use DataSets as a global caching mechanism, then you are right: they are not good for it.

    If you use them as a container for Business Entity instances (for example an Order) that you retrieve, change offline, and then persist, they work great.

    I don't want to just execute an UPDATE statement to persist the data, because then I won't have optimistic concurrency; I'd have to build it myself.

    They are also good when you want to cache mostly read-only data. If you have an ASP.NET app that uses DataReaders for filling grids, you'll have better performance than with a DataSet. But if you then want to cache that data in a middle tier, or in the ASP.NET tier itself, you'll have to change your app.

    If you bind to a DataSet, you'll have worse performance in the web server (and the same in the database server), but you can add caching for read-only data very easily, and that will mean a great improvement in the overall performance/scalability of the application.

  • I found DataSets and DataTables and DataRows and DataWhatever-the-hell-else-you-want to be so counterintuitive that it was actually simpler to build a full-blown O/R mapper just so I didn't have to deal with it. Now M$ comes out with ObjectSpaces. I think they finally get it. We don't work in 'tables'; we work in objects with graph-like relationships.
