August 2006 - Posts - Brenton House

August 2006 - Posts

Frans Bouma article about why a cache in an O/R mapper doesn't make it fetch data faster.

 
Via Frans Bouma's blog -

Preface
One of the biggest myths in O/R mapper land is about 'caching'. It's often believed that using a cache inside an O/R mapper makes queries much faster and thus makes the O/R mapper more efficient. With that conclusion in hand, every O/R mapper that doesn't use a cache is therefore less efficient than the ones that do, right?

Well... not exactly. In this article I hope to explain that caching in O/R mappers is not there for making queries more efficient, but is there for uniquing. But more on that later on. I hope that by the end of the article I have convinced the reader that the myth Caching == more efficiency is indeed a myth. Beware, it's perhaps a bit complicated here and there; I'll try to explain it in layman's terms as much as possible.

What's a cache?
Before I can explain what a cache is, it's important to understand what an entity is, what an entity instance is etc. Please consult this article first to learn about what's meant with these terms.

A cache is an object store which manages objects so you don't have to re-instantiate objects over and over again; you can just re-use the instance you need from the cache. A cache of an O/R mapper caches entity objects. Pretty simple actually. When an entity is fetched from the persistent storage (i.e. the database), the entity object (i.e. the entity class instance which contains the entity instance (== data)) holding the fetched data is stored in the cache, if it's not there already. What exactly does "if it's not there already" mean? It means that the cache doesn't yet hold an entity object for that same entity.

Caches in O/R mappers are above all used for a concept which is called uniquing. Uniquing is about having a single entity object for every entity (== data) loaded. This means that if you load the entity of type Customer and with PK "CHOPS" from the Northwind database, it gets stored in an entity object, namely an instance of the Customer entity class. What happens if you load the same entity with PK "CHOPS" again in another instance of the Customer entity class? You would end up with two instances of the same class, but with the same data. So effectively the objects represent the same entity.

This doesn't have to be a problem. Most actions on entities don't require a unique entity object. After all, they're all mirrors of the real entities in the database, and with a multi-appdomain application (like desktop applications accessing the same database, or a web application running on multiple web servers) you have the chance of having multiple entity objects containing the same entity data anyway.

However, sometimes it can be a problem or an inconvenience. When that happens, it's good that there's a way to have unique objects per entity loaded. Most O/R mappers use a cache for this: when an entity is loaded from the database, the cache is consulted to see whether there's already an entity object holding the data of the entity just fetched. If that's the case, that instance is updated with the data read from the database, and that instance is returned as the object holding the data. If there's no object already containing the same entity, a new instance is created, the entity data fetched is stored in that instance, that instance is stored in the cache and the instance is returned. This leads to unique objects per entity.
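The lookup-then-update-or-create flow described above can be sketched in a few lines of C#. This is only an illustration under stated assumptions: the Customer class, the CustomerCache class and the Materialize method are hypothetical names, not the API of any particular O/R mapper, and a single string PK is assumed.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical entity class; the names are illustrative only.
public sealed class Customer
{
    public string Id { get; }
    public string CompanyName { get; set; }

    public Customer(string id, string companyName)
    {
        Id = id;
        CompanyName = companyName;
    }
}

// A minimal uniquing cache keyed on the PK value (assumption: single string PK).
public sealed class CustomerCache
{
    private readonly Dictionary<string, Customer> _byPk = new Dictionary<string, Customer>();

    // Called for every row that comes back from the database.
    public Customer Materialize(string pk, string companyName)
    {
        Customer existing;
        if (_byPk.TryGetValue(pk, out existing))
        {
            // Already loaded: refresh the existing object and return that same instance.
            existing.CompanyName = companyName;
            return existing;
        }

        // Not loaded yet: create the object, remember it, return it.
        var created = new Customer(pk, companyName);
        _byPk.Add(pk, created);
        return created;
    }
}
```

Calling Materialize twice with the PK "CHOPS" hands back the same object both times, updated with the latest values. Note that the database row still had to be fetched in both cases; the cache only decides which object receives the data.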

Not all O/R mappers use a cache for uniquing however, or they don't call it a 'cache'. You see, a central cache is really a very general mechanism. What if you need a unique entity object within a given semantic context, and outside that context you don't need a unique instance, or need a different unique instance? Some O/R mappers, like LLBLGen Pro, use Context objects which provide uniquing for a semantic context, e.g. inside a wizard or an edit form. All entities loaded inside that context are stored in unique objects.

Caches and queries: more overhead than efficiency
So, when does this efficiency the myth talks about occur exactly? Well, almost never. In fact, using a cache is often less efficient. I said almost, as there are situations where a cache can help, though these are minor or require a lot of concessions. However, I'll discuss them as well so you have a complete picture.

Let's say I have a cache in my O/R mapper and I want to see, by using theory, how efficient it might be. So I have my application running for a given period of time, which means that the cache contains a number of entity objects. My application is a CRM application, so at a given time T the user wants to view all customers who have placed at least 5 orders in the last month. This thus leads to a query for all customer entities which have at least 5 related order entities which were placed in the last month.

What to do? What would be logical and correct? Obviously: fetching the data from the persistent storage, as the entities live there, and only then we'll get all known and valid customer entities matching the query. We can consult our cache first, but we'll never know if the entities in the cache are all the entities matching my query: what if there are many more in the database, matching the query? So we can't rely on the cache alone, we always have to consult with the persistent storage as well.

This thus causes a roundtrip and a query execution on the database. As roundtrips and query executions are a big bottleneck of the complete entity fetch pipeline, the efficiency the myth talks about is nowhere in sight. But it gets worse. With a cache, there's actually more overhead. This is caused by the uniquing feature of a cache. So every entity fetched from the database matching the query for the customers has to be checked with the cache: is there already an instance available? If so, update the field values and return that instance, if not, create a new instance (but that's to be done anyway) and store it in the cache.

So effectively, it has more overhead, as it has to consult the cache for each entity fetched, as well as store all new entities into the cache. Storing entities inside a cache is mostly done with hash values calculated from the PK values and stored per type. As hash values can result in duplicates (it's just an int32 in most cases) and compound PKs can complicate the calculation process, it's not that straightforward to make the lookup process of entities very efficient.

Some tricks can help... a bit and for a price
Before I get burned down to the ground, let me say that there are some tricks to speed things up a bit. I have to say "a bit" because it comes at a high price: you have to store a version field in every entity, and the O/R mapper of choice must support that version field. This means that you have no freedom over what the entities look like or what your data model looks like. This is a high price to pay, but perhaps it's something you don't care about.

The trick is that instead of returning the full resultset in the first query execution, only the PK values and the version values are returned for every entity matching the query. By checking the cache, you use the version value to see if an entity has been changed in the db since it was fetched and stored in the cache. If it hasn't changed, you don't have to fetch the full entity, as the data is already in the cache. If it has changed, or it isn't in the cache at all, you have to fetch the full entity. So you then fetch all entities matching the PKs collected during the cache investigation.

The advantage of this is that the second query might be very quick. However, it can also bomb: what if you're using Oracle and you have 3000 customers matching your query? You then can't use a WHERE customerid IN (:a, :b, ...) query, as you'll exceed the limit on the number of parameters you can send in a single query. It also causes a second query run, which might actually add more time than simply doing a single fetch: first the PK-version fetch query has to be run, then the second full fetch query (which might result in fewer rows, but still...).
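The cache investigation step of this trick can be sketched as follows. It is a minimal sketch under stated assumptions: the VersionCheck class and PksToFetch method are hypothetical names, PKs are strings, versions are plain ints, and the first query is assumed to have already returned the (PK, version) pairs.

```csharp
using System;
using System.Collections.Generic;

public static class VersionCheck
{
    // cachedVersions: PK -> version of the entity object currently held in the cache.
    // fetchedVersions: (PK, current version) pairs returned by the first, lightweight query.
    // Returns the PKs whose full entity still has to be fetched by the second query.
    public static List<string> PksToFetch(
        IDictionary<string, int> cachedVersions,
        IEnumerable<KeyValuePair<string, int>> fetchedVersions)
    {
        var toFetch = new List<string>();
        foreach (KeyValuePair<string, int> row in fetchedVersions)
        {
            int cachedVersion;
            bool inCache = cachedVersions.TryGetValue(row.Key, out cachedVersion);

            // Missing from the cache, or changed in the database since it was cached.
            if (!inCache || cachedVersion != row.Value)
            {
                toFetch.Add(row.Key);
            }
        }
        return toFetch;
    }
}
```

The returned list is what would end up in the IN (...) clause of the second query, which is exactly where the parameter limit mentioned above starts to hurt when many entities are stale or uncached.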

You might wonder: what if I control all access to the database? Then I know when an entity is saved, and thus can keep track of which entities are changed and when, and can use that info to decide whether an entity has been updated or not! Well, that's true, but that doesn't scale very well. Unlike Java, .NET doesn't have a proper cross-appdomain object awareness system. This means that if you have even two systems targeting the same database (webfarm, multiple desktops), you can't use this anymore. And even if you're in the situation where it could help (single appdomain targeting a single database), it still takes time to get things up to steam: until all entities of a given type are inside the cache, you still have to fetch from the database.

Cache and single entity fetches
There is one situation where a cache could help and be more efficient. That is: if you know or assume the data you might find in a cache is 'up to date' enough for you to use. That situation is with single entity fetches using a PK value. Say the user of our previously mentioned CRM application wants to see the details of the customer with PK value "CHOPS". So all that should happen is an entity fetch by using the PK value "CHOPS". Consulting the cache, it appears that the entity with the PK value "CHOPS" is already loaded and available in the cache! Aren't we lucky today!

Again, what to do? What's logical in this situation? Pick the one from the cache, or consult the database and run the risk of fetching the exact same entity data as is already contained in the entity object in the cache? I'd say: go to the database. You only know for sure you're working with the right data by consulting the persistent storage. If you pick the one from the cache, you run the risk that another user has updated the customer data and you're working with outdated, actually wrong data, which could lead to wrong conclusions where the right data would have led to the right ones. If you pick the one from the database, you run the risk of burning a few cycles which turn out to have been unnecessary. A trick here could be that if the entity in the cache was fetched no more than X seconds ago, it's still considered 'fresh enough', as all data outside the database is stale the moment it's read anyway. But if correctness matters, you can't be more sure than by reading from the database and bypassing the cache.
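The 'fresh enough' trick for single entity fetches might look like the sketch below. Everything in it is an assumption made for illustration: the TimedEntityCache class, its method names and the choice to pass the current time in explicitly are not taken from any O/R mapper.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical single-entity cache with a freshness window of maxAge.
public sealed class TimedEntityCache<TEntity>
{
    private readonly Dictionary<string, KeyValuePair<DateTime, TEntity>> _entries =
        new Dictionary<string, KeyValuePair<DateTime, TEntity>>();
    private readonly TimeSpan _maxAge;

    public TimedEntityCache(TimeSpan maxAge)
    {
        _maxAge = maxAge;
    }

    // Remember the entity together with the moment it was read from the database.
    public void Store(string pk, TEntity entity, DateTime fetchedAtUtc)
    {
        _entries[pk] = new KeyValuePair<DateTime, TEntity>(fetchedAtUtc, entity);
    }

    // True only when the entity is cached and was fetched at most maxAge ago.
    // On false, the caller is expected to go to the database instead.
    public bool TryGetFresh(string pk, DateTime nowUtc, out TEntity entity)
    {
        KeyValuePair<DateTime, TEntity> entry;
        if (_entries.TryGetValue(pk, out entry) && nowUtc - entry.Key <= _maxAge)
        {
            entity = entry.Value;
            return true;
        }

        entity = default(TEntity);
        return false;
    }
}
```

The window only bounds how stale the data can be; it doesn't make the cached copy correct. Anything that must be right still has to be read from the database.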

So what's left of this efficiency?
Well, not much. We've seen that it actually adds more overhead than efficiency, so it's even less efficient than not using a cache. We've seen that it could be solved in a way which could lead to more efficiency, but only in a small group of situations, and that this requires concessions you're likely not willing to make. We've also seen that a cache could be more efficient in single-entity fetches, but only if you're willing to sacrifice correctness, or if you're sure the data in your cache is valid enough.

Are caches then completely bogus? Why do O/R mappers often have a cache?
They're not bogus, they're just used for a different feature than efficiency: uniquing. Some O/R mappers even use the cache for tracking changes on entity objects (and thus the entities inside these objects). Claiming that the cache is making the O/R mapper more efficient is simply giving the wrong message: it could be a bit more efficient in a small group of situations, and is often giving more overhead than efficiency.

I hope with this article that people stop spreading this myth and realize why caches (or contexts or whatever they're called in the O/R mapper at hand) in O/R mappers are used for uniquing, and not for object fetch efficiency.

 
Posted by dotnetboy2003 | 9 comment(s)

How to detect old versions when deploying the .NET Framework 3.0 (formerly WinFX)

 
Via Aaron Stebner's WebLog -

I received an interesting question from a customer this weekend.  They are working on a setup package that will include the .NET Framework 3.0 (formerly the WinFX runtime components) as a prerequisite, and they wanted to automatically run the vs_uninst_winfx.exe cleanup tool to make sure that there were not any previous beta versions on the system that would cause setup to fail.  This cleanup tool does not have any silent switches, and it is only designed as an end user tool and not a redistributable setup component, so I advised the customer against including this.

However, it is possible to implement logic in a setup wrapper to accomplish the underlying goal of ensuring that no previous beta versions are present on the system.  I previously outlined an algorithm to accomplish this for the .NET Framework 2.0, and a similar algorithm will also work for the .NET Framework 3.0.

Here is an overview of the algorithm that .NET Framework 2.0 and 3.0 setup use to determine whether any previous beta products are on the system:

For each (beta product code)
{
    Call MsiQueryProductState to check if the install state for the product code equals INSTALLSTATE_DEFAULT

    if (install state == INSTALLSTATE_DEFAULT)
    {
        Call MsiGetProductInfo to retrieve the INSTALLPROPERTY_INSTALLEDPRODUCTNAME property for the product code
        Add the value of the INSTALLPROPERTY_INSTALLEDPRODUCTNAME property to the list of beta products that need to be uninstalled
    }
}

If (list of beta products is not empty)
{
    If (setup is running in full UI mode)
    {
        Display UI with a list of product names that need to be uninstalled via Add/Remove Programs
    }

    Exit setup with return code 4113
}
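For a managed setup wrapper, the detection loop above could be sketched in C# roughly as follows. This is not the code the .NET Framework setup uses; the BetaProductDetector class and its method names are hypothetical. The two Windows Installer functions are called through P/Invoke declarations written for this sketch, with INSTALLSTATE_DEFAULT (5) and the "InstalledProductName" property name taken from the Windows Installer headers. The lookups are passed in as delegates so the loop can be exercised without msi.dll.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;
using System.Text;

public static class BetaProductDetector
{
    private const int INSTALLSTATE_DEFAULT = 5;
    private const uint ERROR_SUCCESS = 0;
    private const string INSTALLPROPERTY_INSTALLEDPRODUCTNAME = "InstalledProductName";

    [DllImport("msi.dll", CharSet = CharSet.Unicode)]
    private static extern int MsiQueryProductState(string szProduct);

    [DllImport("msi.dll", CharSet = CharSet.Unicode)]
    private static extern uint MsiGetProductInfo(
        string szProduct, string szProperty, StringBuilder lpValueBuf, ref uint pcchValueBuf);

    // The loop from the algorithm, with both lookups injected as delegates.
    public static List<string> FindInstalledBetaProducts(
        IEnumerable<string> betaProductCodes,
        Func<string, int> queryProductState,
        Func<string, string> getInstalledProductName)
    {
        var productNames = new List<string>();
        foreach (string productCode in betaProductCodes)
        {
            if (queryProductState(productCode) == INSTALLSTATE_DEFAULT)
            {
                productNames.Add(getInstalledProductName(productCode));
            }
        }
        return productNames;
    }

    // Windows-only convenience overload that asks Windows Installer directly.
    public static List<string> FindInstalledBetaProducts(IEnumerable<string> betaProductCodes)
    {
        return FindInstalledBetaProducts(betaProductCodes, MsiQueryProductState, GetProductName);
    }

    private static string GetProductName(string productCode)
    {
        uint length = 512; // buffer size in characters; an arbitrary choice for this sketch
        var buffer = new StringBuilder((int)length);
        uint result = MsiGetProductInfo(
            productCode, INSTALLPROPERTY_INSTALLEDPRODUCTNAME, buffer, ref length);

        // Fall back to the product code if the name could not be read.
        return result == ERROR_SUCCESS ? buffer.ToString() : productCode;
    }
}
```

If the returned list is not empty, the wrapper would show the names when it is running with full UI and then exit with return code 4113, as in the algorithm above.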

The difference between the .NET Framework 2.0 and 3.0 is the location of the list of beta product codes.  You can find the beta product codes for the .NET Framework 3.0 by using these steps:

  1. Download the .NET Framework 3.0 web download bootstrapper and save it to your hard drive
  2. Extract the contents by running dotnetfx3setup.exe /x:c:\dotnetfx3
  3. Open the file c:\dotnetfx3\setup.sdb in a text editor such as notepad
  4. Look for the list of product codes in the [PrevProductIds] section of setup.sdb
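If you'd rather read the product codes programmatically than copy them by hand, something like the sketch below might do. It assumes that setup.sdb is an INI-style text file and that the [PrevProductIds] section lists one product code per line; check this against the extracted file before relying on it. The SetupSdbReader class is a hypothetical helper, not part of any SDK.

```csharp
using System;
using System.Collections.Generic;

public static class SetupSdbReader
{
    // Returns the non-empty lines that follow the [PrevProductIds] header,
    // up to the next [Section] header or the end of the input.
    public static List<string> ReadPrevProductIds(IEnumerable<string> lines)
    {
        var productCodes = new List<string>();
        bool inSection = false;

        foreach (string rawLine in lines)
        {
            string line = rawLine.Trim();

            if (line.StartsWith("[", StringComparison.Ordinal) &&
                line.EndsWith("]", StringComparison.Ordinal))
            {
                // A section header: remember whether it's the one we want.
                inSection = string.Equals(
                    line, "[PrevProductIds]", StringComparison.OrdinalIgnoreCase);
                continue;
            }

            if (inSection && line.Length > 0)
            {
                productCodes.Add(line);
            }
        }
        return productCodes;
    }
}
```

It can be fed with System.IO.File.ReadAllLines(@"c:\dotnetfx3\setup.sdb"), using the extraction path from the steps above.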

 

 
Posted by dotnetboy2003 | 2 comment(s)

DSL Tools V1 to ship with Visual Studio SDK V3 in first part of September

 
Via Stuart Kent's blog -

Having just returned from vacation, I thought I'd update folks about the V1 release of DSL Tools. This will be shipped as part of the Visual Studio 2005 SDK Version 3 in the first part of September.

We have signed, sealed and delivered our code to the VS SDK team who are now just wrapping up. There will be a significant new documentation drop at the same time, although we will continue to update the documentation until the end of the year.

Apart from bug fixes, the main feature in this release is a completed Dsl Designer for editing .dsl definitions.

I'll post again as soon as the toolkit has been released.

 
Posted by dotnetboy2003 | with no comments

WCF and WCF Security Guidance Packages Released.

You can go to the download page here.   On the download page you will also find an Excel spreadsheet listing the features that are planned for the December CTP release of Service Factory.


Via patterns & practices: Service Factory : News -
 
If you have the July CTP of .NET 3.0 installed and have been wanting to use Service Factory to create WCF services, I am now happy to encourage you to do so :) This release only includes 2 guidance packages (GP): the main WCF GP and the WCF security GP. The Data Access GP that is part of the Service Factory July release will work with these guidance packages.

We have not included the reference implementation (RI) because we expect it will change considerably from the RI that is part of the July Service Factory release. We are also not including documentation in this drop. The plan we are shooting for (from now until we release in December) is to drop GPs and the RI once a month and to also drop the written guidance once a month but 2 weeks apart. As always, we are interested in your feedback on this approach.

We have also changed how we are exposing our known issues from now on. In an effort to make our project even more transparent, I am making all of the known issues available to you. If you run into an issue, just open the ServiceFactoryKnownIssues-Aug.xls file and use the find feature (Ctrl-F) to find the issue you are experiencing. If you don't have Microsoft Excel installed, see the installation instructions for a download link to the Excel 2003 Viewer. Naturally, we are interested in your feedback about this also.
 
 
Posted by dotnetboy2003 | with no comments

What's the difference between working on an open source project and working on a paid job

Steve King (author of TortoiseSVN) has an interesting article on the differences between working on his open source project and his paid job.

 

Interesting interview question

Hamilton Verissimo posts details about an interesting interview question.
 
Via Zen and the art of Castle maintenance -

Ayende has posted an interesting code snippet that is useful for measuring how much a job candidate knows about the compiler he/she claims to work with.

selected = selected++;

I’ve seen this one years ago (2002 I think) in a Java prep exam. My first guess was that the ‘selected’ variable would hold the result of selected++. Wrong! Then I kind of memorized that this one was tricky, but forgot why. But today I was curious enough to check the IL code again to see where the trick is.

The C# code

selected = 1;
selected = selected++;

The IL

L_0000: ldc.i4.1 // loads the literal 1
L_0001: stloc.0  // stores it in the local
L_0002: ldloc.0  // loads the local value (1)
L_0003: dup      // duplicates the top of the stack, so now we have two ints with value 1
L_0004: ldc.i4.1 // loads the literal 1 (for the ++)
L_0005: add      // adds 1 + 1 and pushes the result on the stack (2)
L_0006: stloc.0  // pops the 2 off the top of the stack and stores it in the local
L_0007: stloc.0  // whoops, the int left on the stack is the original 1; store it (overwriting the result of the increment)
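The effect is easy to reproduce. The small sketch below wraps the snippet in a method next to the plain increment that was presumably intended; the PostIncrementDemo class and its method names are made up for the illustration.

```csharp
using System;

public static class PostIncrementDemo
{
    // Mirrors the snippet: the old value is saved, the variable is incremented,
    // and then the saved old value is assigned back over the incremented one.
    public static int AssignPostIncrement(int selected)
    {
        selected = selected++;
        return selected;
    }

    // The plain increment, for comparison.
    public static int Increment(int selected)
    {
        selected++;
        return selected;
    }
}
```

AssignPostIncrement(1) returns 1, while Increment(1) returns 2.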

Knowing this kind of behavior might be useful. On the project I was working on I coded something like the following

int val = 1;
string something = "some value " + (val + ',' + "something else");

Can you imagine the headache this gave me?
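What happens there is that val + ',' adds an int and a char, so the char is converted to its numeric code (44 for the comma) and the two are summed before any string concatenation takes place. A small sketch, with a hypothetical ConcatDemo class, shows the difference between the char literal and a string literal:

```csharp
using System;

public static class ConcatDemo
{
    // val + ',' is int + char: the comma becomes 44 and is added to val
    // before the result is concatenated with the string.
    public static string WithCharLiteral(int val)
    {
        return "some value " + (val + ',' + "something else");
    }

    // With a string literal the + is string concatenation from the start.
    public static string WithStringLiteral(int val)
    {
        return "some value " + (val + "," + "something else");
    }
}
```

With val = 1, the first method yields "some value 45something else" and the second yields "some value 1,something else".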

If you have time and like to read, the book Programming Language Pragmatics is a gem.

 
Posted by dotnetboy2003 | 4 comment(s)

Pilot invites go out for Microsoft’s AdSense competitor

 
Via TechCrunch -

Microsoft’s long awaited contextual advertising platform, named ContentAds, sent out the first invitations today to prospective participants in its pilot program. Starting on “primarily” (their word) MSN owned sites, Microsoft says that ContentAds will place advertisements using not just keywords but also demographic targeting, geo-targeting and incremental bidding tools. Sounds like AdSense plus some consideration of the demographics of various MSN sites’ readership - we’ll see what happens when ContentAds are released into the wild. We’ll probably see soon.

More big time competition for Google’s AdSense, Yahoo! Publisher Network and the other players in the field should mean higher revenue cuts for publishers and more innovation in the way ads are served. That’s the theory anyway, though Microsoft’s late and safe entry into the game leaves open the question of whether there will be much innovation here. Come on Microsoft - surprise us!

Online ad expert Jennifer Slegg, who got an invitation, broke the news (I found via) and predicts Microsoft’s entry will be especially good news for small publishers without millions of impressions per month.

Microsoft announced what was probably a huge advertising deal with Facebook just last week.

TechCrunch Tuesday, August 29, 2006
 

Clearing Enum Flags

 
Via Krzysztof Cwalina -
 

Somebody just pointed out to me that the enum guidelines don’t provide any information on how to clear a flag in a flags enum variable.

This is quite easy to do if the enum has a value (member) that has all the flags set. Such a value is usually called All.

using System;

[Flags]

public enum Foos {

    A = 1,

    B = 2,

    C = 4,

    D = 8,

    AB = A | B,

    CD = C | D,

    All = AB | CD

}

 

static class Program {

    static void Main() {

        Foos value = Foos.AB;

        Console.WriteLine(ClearFlag(value,Foos.A));

    }

 

    public static Foos ClearFlag(Foos value, Foos flag) {

        value = value & (Foos.All ^ flag);

        return value;

    }

}

If the enum does not contain the All value, you can manually create it by combining existing flags.

    public static Foos ClearFlag(Foos value, Foos flag) {

        value = value & ((Foos.AB|Foos.CD) ^ flag);

        return value;

    }

Or you can manufacture the All value from UInt32.

    public static Foos ClearFlag(Foos value, Foos flag) {

        Foos all;

        unchecked {

            all = (Foos)UInt32.MaxValue;

        }

        value = value & (all ^ flag);

        return value;

    }

Keep in mind that not all enums are backed with 32-bit integers. For those that are larger, for example enums with ulong as the underlying type, you will need to use UInt64.MaxValue instead.

 

    [Flags]

    public enum Foos : ulong {

    }

 

    public static Foos ClearFlag(Foos value, Foos flag) {

        Foos all;

        unchecked {

            all = (Foos)UInt64.MaxValue;

        }

        value = value & (all ^ flag);

        return value;

    }

 

What’s interesting is that you don’t actually need to change anything if the underlying type is smaller, i.e. you can cast UInt64 to an enum backed with a byte. The cast operator takes care of the size mismatch. This means you could always use UInt64.  The only drawback would be that it makes the variable all larger than it has to be, which is a slight inefficiency, probably not detectable in all but the most targeted benchmarks.
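For comparison, here is a different technique from the ones shown above: C# also allows the bitwise complement operator (~) on enum values, so a flag can be cleared without building an All value at all. The sketch reuses the Foos enum from the earlier example; the FlagHelpers class name is made up for the illustration.

```csharp
using System;

[Flags]
public enum Foos
{
    A = 1,
    B = 2,
    C = 4,
    D = 8,
    AB = A | B,
    CD = C | D,
    All = AB | CD
}

public static class FlagHelpers
{
    // ~flag has every bit set except the ones in flag,
    // so the AND keeps everything in value except those bits.
    public static Foos ClearFlag(Foos value, Foos flag)
    {
        return value & ~flag;
    }
}
```

Because ~ operates on the enum's own underlying type, the same method body works regardless of the size of the backing integer.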

 
Posted by dotnetboy2003 | 5 comment(s)

Heralding PLINQ into the LINQ party

 Sounds interesting.  What’s weird is that eWEEK is the one with early coverage on it…
 

Via Jonathan Bruce's WebLog -

eWEEK reports today on the announcement of PLinq. I'll quote directly from the article, as Anders Hejlsberg captures the concept succinctly:

"With PLinq, effectively you write the code the same way, but we arrange for it to run on multiple CPUs,"

...and...

"So the queries get split up and run on multiple CPUs, and then you just wait for all the results to arrive. And lo and behold, without any changes your program just ran six times faster. It's instant gratification."

Sounds very interesting, I agree, but my immediate impression is that we are in danger of over-rotating on LINQ, as a concept or perhaps even on what it can deliver. I don't wish to draw any comparisons with similar or for that matter dissimilar technologies, but I am ill at ease that we might soon see a serious outbreak of mass developer confusion.
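To give an idea of what "write the code the same way" could mean in practice, here is a small sketch using the AsParallel() extension method from System.Linq, which is the form in which PLINQ later shipped with the .NET Framework 4. It is not taken from the eWEEK article or from the 2006 preview; the PlinqDemo class and the sum-of-squares query are invented for the illustration.

```csharp
using System;
using System.Linq;

public static class PlinqDemo
{
    // Ordinary LINQ to Objects: runs on the calling thread.
    public static long SumOfSquaresSequential(int count)
    {
        return Enumerable.Range(1, count).Select(n => (long)n * n).Sum();
    }

    // The same query with AsParallel() added: the runtime may split the work
    // across the available cores and combine the partial results.
    public static long SumOfSquaresParallel(int count)
    {
        return Enumerable.Range(1, count).AsParallel().Select(n => (long)n * n).Sum();
    }
}
```

Both methods return the same value. Whether the parallel one is actually faster depends on the amount of work per element and the number of cores, so the "six times faster" in the quote shouldn't be read as a given.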

 

Related articles about PLINQ

Access to Data No Longer the Weakest LINQ
Microsoft’s PLINQ to Speed Program Execution
Heralding PLINQ into the LINQ party
Chatting about LINQ and ADO.NET Entities
