ObjectSpaces: Projection or sparse Objects?

My previous blog entry raised some comments I'd like to respond to.

Am I proposing just an ordinary projection feature for ObjectSpaces (OS)? I don't think so, because with a projection in SQL (or relational databases in general), you get differently shaped results when you issue select colA, colB from myTable vs. select colA from myTable. Hence you need a generic data structure like a cursor (e.g. an ADODB.RecordSet) or an ADO.NET DataTable/DataRow to accommodate the varying number of columns.

With OS (or any O/R mapping (ORM) tool), though, generic data structures are less important. That's the purpose of ORM. With ORM you define a persistent class like

class Customer
{
  public string id;
  public string name;
  public string city;
  public string zip;
}

and hopefully there is also a corresponding typed collection for each persistent class, e.g.

class CustomerCollection : IList
{
  ...
  public Customer this[int index] {
    get {
      ...
    }
  }
  ...
}

More than that you don't want: for a given entity (e.g. the customers in some database table) you don't want to define more than one persistent class.

Then, when you query for customers, you define a selection of entities by formulating a condition on their fields/columns, e.g.

select ... from tbCustomers where name like 'a%'

The collection class represents the result of this operation.
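For illustration only (the `WhereNameStartsWith` helper and its in-memory "table" are my own stand-ins, not the ObjectSpaces API), such a selection could hand back a typed collection of Customer objects rather than a generic cursor:

```csharp
using System;
using System.Collections.Generic;

// The persistent class from the article (reduced to two fields).
class Customer
{
    public string id;
    public string name;
}

static class CustomerQueries
{
    // Stand-in for: select ... from tbCustomers where name like 'a%'
    // A condition on the entity's fields selects entities; the result
    // is a typed collection of Customer objects, not a generic cursor.
    public static List<Customer> WhereNameStartsWith(
        IEnumerable<Customer> table, string prefix)
    {
        var result = new List<Customer>();
        foreach (var c in table)
            if (c.name.StartsWith(prefix))
                result.Add(c);
        return result;
    }
}
```

The point is only the shape of the result: client code gets Customer objects, never rows and columns.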

But you also define how much data to load for each entity. O/R mapping tools usually support at least two modes. The default ("full mode") is to load all columns of a table and create a full-blown persistent object for each entity returned:

select * from tbCustomers ...

Or you specify "hollow mode", loading only "hollow objects", i.e. only the IDs of the entities matching the query criteria:

select id from tbCustomers ...

Object creation is then delayed until you actually access an object (through the collection class). Of course, a specific query then has to be issued to load the data:

select * from tbCustomers where id='...'

"Hollow objects" reduce the memory footprint and the amount of data transferred initially by the query, but cause additional roundtrips later on.

From the outside, though, a query always returns fully populated persistent objects. The delayed loading of "hollow objects" is transparent to client code of the ORM API.
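A rough sketch of such a hollow object (all names here are my own illustration, not the ObjectSpaces implementation): initially only the ID is known, and the first access to any other member triggers the roundtrip for the full row.

```csharp
using System;

// Sketch of a "hollow" object: only the ID is loaded initially; the
// first access to another member issues the equivalent of
// "select * from tbCustomers where id='...'".
class HollowCustomer
{
    private readonly string id;
    private string name;                           // not loaded yet
    private bool loaded;                           // has the full row been fetched?
    private readonly Func<string, string> loadRow; // stands in for the DB roundtrip

    public HollowCustomer(string id, Func<string, string> loadRow)
    {
        this.id = id;
        this.loadRow = loadRow;
    }

    public string Id { get { return id; } }        // known without a roundtrip

    public string Name
    {
        get
        {
            if (!loaded)           // delayed load, transparent to the caller
            {
                name = loadRow(id);
                loaded = true;
            }
            return name;
        }
    }
}
```

Client code cannot tell whether the roundtrip has already happened; it just reads the property.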

Now, what I'm proposing is a third mode of retrieving persistent objects ("sparse mode"). I propose being able to load sparsely populated objects by issuing a query that specifies a subset of columns/fields to be returned for each entity, e.g.

select id, name from tbCustomers ...

From this data, persistent objects are created, though of course not all of their fields can be populated. From the outside, however, when getting a persistent object from a collection, it still is an object of the persistent class with all its fields/properties - just like with "hollow objects". There is no perceivable difference between objects loaded in the first, second, or third mode.

And that's the reason why I would rather not call what I'm proposing a projection - although a SQL projection query underlies this third mode.

I'd call it "sparsely populating persistent objects", because there still is just one persistent class per persistent entity, which is sometimes populated with more column data from the database, sometimes with less.

Such "sparse objects" have a smaller memory footprint than objects loaded in "full mode" or "hollow mode", because in both of those modes all columns' data is loaded before you can access any field/property of a persistent object; the two modes differ only in when the data is loaded. In "sparse mode", though, the columns beyond those initially loaded may never be needed at all. "Sparse mode" thus combines the small number of roundtrips of "full mode" with a much smaller memory consumption.

Like "hollow objects", though, "sparse objects" always look fully populated: when you access a property whose data has not been loaded by the initial query, the missing data is fetched (preferably all missing columns). The query could look like this when accessing the city field/property for the first time

select city from tbCustomers where id='...'

and only load the missing column or columns. Or it could look like this:

select city, zip from tbCustomers where id='...'

and load all columns missing so far. Or it could look like this:

select * from tbCustomers where id='...'

to refresh the object's data.
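The second strategy above (load all columns missing so far in one go) might be sketched like this; the dictionary-based storage and all names are my own illustration, not how ObjectSpaces would implement it:

```csharp
using System;
using System.Collections.Generic;

// A sparsely populated customer: the initial query delivered only some
// columns; accessing an unpopulated property fetches all missing
// columns in one additional roundtrip.
class SparseCustomer
{
    private readonly Dictionary<string, string> values;
    private readonly Func<string, IDictionary<string, string>> loadMissing;

    public SparseCustomer(IDictionary<string, string> initial,
                          Func<string, IDictionary<string, string>> loadMissing)
    {
        values = new Dictionary<string, string>(initial);
        this.loadMissing = loadMissing;  // stands in for the DB roundtrip
    }

    private string Get(string column)
    {
        if (!values.ContainsKey(column))
        {
            // e.g. select city, zip from tbCustomers where id='...'
            foreach (var kv in loadMissing(values["id"]))
                values[kv.Key] = kv.Value;
        }
        return values[column];
    }

    public string Id   { get { return Get("id"); } }
    public string Name { get { return Get("name"); } }
    public string City { get { return Get("city"); } }
    public string Zip  { get { return Get("zip"); } }
}
```

Accessing a column that came with the initial query costs nothing; the first access to a missing one pays a single roundtrip for all the rest.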

Of course this is an additional roundtrip to the database, but then it's hopefully only rarely necessary, because the fields/properties most often needed in a particular context are those specified with the query.

The "sparse mode" thus combines a small memory footprint (through populating only the fields/properties needed) with high performance (through rare roundtrips) and full read/write access to all fields/properties when needed (through transparent delayed loading of missing columns).

On the outside it's still just one persistent class. But on the inside there is more flexibility than with just "full mode" or "hollow mode". With "sparse mode", "full mode" as well as "hollow mode" become mere special cases: "full mode" loads all columns and never needs an additional roundtrip to the database; "hollow mode" always causes a roundtrip for each object.

When to use which mode? Use "full mode" in editing scenarios, where a persistent object may be displayed for modification. Use "sparse mode" in read-only or read-mostly scenarios (but also when just a couple of fields/properties need to be edited).

Never use "hollow mode"! There is no use for it. Or maybe there is? It's not important, since "hollow mode" comes for free once "sparse mode" is implemented.

The implementation of "sparse mode", however, requires that access to missing fields/properties can be detected. Thus it requires all persistent fields to be encapsulated by property methods, so access can be intercepted. This is obviously not necessary for "full mode", nor for "hollow mode": in "hollow mode", only proxies are loaded initially, which trigger the (transparent) delayed load of the full object.
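As a rough illustration of why properties are needed (assumed names; a per-field "loaded" flag set is one possible bookkeeping scheme), a property getter can check whether its backing field was populated before returning it - something a public field, as in the Customer class above, cannot do:

```csharp
using System;

// Access interception via property methods: the getter for an
// unpopulated column detects the access and triggers a delayed load.
class InterceptedCustomer
{
    [Flags]
    enum Loaded { None = 0, Name = 1, City = 2 }

    private Loaded loaded;                  // which columns are populated
    private string name;
    private string city;
    private readonly Func<string> loadCity; // stands in for the DB roundtrip

    public InterceptedCustomer(string name, Func<string> loadCity)
    {
        this.name = name;
        loaded = Loaded.Name;               // only "name" came with the initial query
        this.loadCity = loadCity;
    }

    public string Name { get { return name; } }

    public string City
    {
        get
        {
            if ((loaded & Loaded.City) == 0) // access to a missing field detected
            {
                city = loadCity();           // e.g. select city ... where id='...'
                loaded |= Loaded.City;
            }
            return city;
        }
    }
}
```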

Now, since property methods are needed for "sparse mode", ObjectSpaces cannot currently support it. Requiring properties is against its vision of making an arbitrary class persistent.

But then, as you might know by now or have guessed: being able to provide object persistence for any class is a lofty goal - and in my view not an important one for many, many scenarios. Many developers could live without it and would be happy to define their persistent classes in some special way (e.g. by deriving from a persistent base class, annotating them with attributes, or modeling them with a tool or language). Developers are mostly not concerned with the purity or generality of a solution. That is not to say that the very general approach of OS isn't sometimes just what a project needs. But at least the companies I'm talking to don't need such generality.

5 Comments

  • I've implemented what you're talking about in an application framework (O/R mapping included) developed together with Jimmy Nilsson (www.jnsk.se/weblog).

    When fetching, a LazyLoadPattern can be supplied. A LazyLoadPattern defines exactly which fields should be loaded... and, for reference fields, which LazyLoadPatterns to use for loading the referenced entities.

    When a non-loaded field is accessed on an entity, an implicit expand operation is issued and the remaining fields are loaded. This works for both value fields and reference fields.

    Apart from the obvious positive effects that you've already described, I think a complexity warning is in order here... it greatly increases the complexity of the overall solution! Mainly internally in the framework... but still... the complexity cost is quite high, I think.

    The framework itself (Valhalla is its name) will be made available as open source within the next few months. Watch Jimmy's blog if you're interested.



    /Chris

  • Sparse mode means that you anticipate an initial usage model; somehow you decide which fields should be prepopulated. From a design perspective I would consider this problematic, at least for components that might be reused (where the usage pattern may change). I would prefer a more dynamic approach that is driven by profiles, and the profiles could even be optimized dynamically.



    My 2c.

  • @Jürgen: The decision about which fields to populate during the initial load of a persistent object in "sparse mode" is made by the calling code. In my previous posting I suggested a grouping for fields. The use case drives which fields to load immediately and which to delay-load.



    However, on the outside calling code always works with the same persistent type - and does not know when which field gets populated.

  • Well, I think I got the idea of field groups and their use in the client from your previous posting. My point is that in your previous posting the pseudocode for the persistence model suggests that the field groups are part of the static model, formulated as a kind of annotation on the persistent object model. The client then uses the name of a field group to govern the initial population of the instance. That is exactly where I see you building a priori usage information into your model. What if a new application would like to preload a different set of fields? Do you then need to go back to the model and introduce a new field group? What if your class is a candidate for massive use in various apps, and many of them are interested in different fields? Of course your model will always work, because all the preloading is only a speedup trick, but in highly dynamic scenarios I have my doubts that your approach will keep its promise.



    I still believe it's worth thinking a bit more about ways to give a client app a more dynamic way to tell the persistence layer which field group it is interested in.



    I also don't like this syntactical tie from the client code to the artificial construct of field group names. I would consider it much more elegant if a client used annotations (e.g. attributes) to formulate its intent to preload only a subset of fields (that's a natural construct) from a persistable object.

  • @Jürgen: Ah, now I get it. I understand your concern. And I'd say you're right. So please excuse my ad hoc notation.



    I agree that field groups should not be tied to the definition of persistent classes. Nonetheless I think field groups are easier to use when querying than specifying individual fields. But of course field groups are application-dependent and use-case-driven.



    The notation I used was not meant to suggest a certain syntax; I just wanted to get my point across. But I don't agree that attributes should be used. My thinking is more along the lines of Domain Specific Languages (DSLs) for defining persistent classes. A DSL would abstract from any implementation.


