MsCorEE

JeffGonzalez : IScalable

Overriding .Equals() and .GetHashCode()

Ok, I have been reading some information regarding these 2 methods, and I would like someone to set me straight. 

I realize there are a couple of concepts that seem key to understanding this.  If your class does not override the Equals method, you are doing an identity check (do 2 objects point to the same heap reference?)  It is possible to override the Equals method so that your objects are tested for equivelence in addition to identity. 

When I override Equals, I get a compiler warning stating that I should be overriding GetHashCode as well.  In the documentation for GetHashCode there are a couple of properties that a hash function should have.

  • If two objects of the same type represent the same value, the hash function must return the same constant value for either object.
  • For the best performance, a hash function must generate a random distribution for all input.
  • The hash function must return exactly the same value regardless of any changes that are made to the object.

    I understand the first property, you can use GetHashCode in your overriden Equals method to make it faster (You are short circuiting the logic before going into your deep field by field comparison).  I understand the second property because it reduces the possibility of collisions between hashes.  I am unclear regarding the 3rd property, but I am assuming that it has to do with HashTables and how they sort based on key hashes.  If an object returned a different hashcode after being added to a specific bucket, the hashtable could lose it, leading to inefficiency.  However it seems to me that property1 and property3 could collide with each other.  Let's assume the following scenario...  Class1 has 2 properties called x and y. 

    Class1 o1 = new Class1();
    o1.x = 100;
    o1.y = 200;

    Class1 o2 = new Class1();
    o2.x = 200;
    o2.y = 100;

    Assuming we have overriden Equals and GetHashCode, these 2 objects should return different hashcodes, based on the rules we defined above.  What happens when I change o2.x = 100; o2.y = 200; and call GetHashCode again?  Assuming we are using the immutable hashcode rule, both of these objects should still return different hashcodes, even though they are now equivilent.  This seems to go against the first property.  This is where my disconnect seems to be.

    I understand overriding Equals could be useful for some of the stuff we are doing at work, but I am unsure how to proceed with the implementation of the GetHashCode method.  Below I have included some code that I have been toying around with, to demonstrate my findings.  Thanks to Peter Drayton for his Effective Types presentation|pdf that opened my eyes to this.  

    The questions I currently am still asking and investigating are...

    When should I worry about overriding Equals and GetHashCode? 
    Should I be doing this on most of my objects?
    What is a good GetHashCode algorithm? 
    (This question is of particular interest to me.  I decompiled System.Uri, as an example, and noticed that there is a private readonly static byte[] field called CaseInsensitiveHashCodeTable.  The internal static method, CalculateCaseInsensitiveHashCode, seems to be doing some “interesting“ stuff to calculate the hashcode. I don't completely understand what is going on in there, so maybe it is just a lack of knowledge helping me understand this whole thing)

    My Findings

  • Comments

    Larry Osterman said:

    Ooh, you're playing with fire here if you ignore #3. IMHO, Reason #3 is actually the most important one of all!.

    What happens if you take an object of Class1 and insert into a System.HashTable? The SYstem.HashTable calls Object.GetHashCode() to determine what bucket the object goes into. And if the hash code for an object changes during it's lifetime, the bucket into which the object goes will change. And that means that you'll do a hashtable lookup for your object and not find it even though it's already been added to the hashtable.
    # April 22, 2004 3:26 PM

    Jeff Gonzalez said:

    Larry, excellent point. That is what I was getting at in the statement...

    "I am unclear regarding the 3rd property, but I am assuming that it has to do with HashTables and how they sort based on key hashes. If an object returned a different hashcode after being added to a specific bucket, the hashtable could lose it, leading to inefficiency."

    To me, having an immutable hashcode for the lifetime of the object makes perfect sense, given the definition you supplied. However it does seem to contradict the documentation which states that 2 objects that are equivalent (equal by value) should always return the same hashcode.

    Some of the documentation/books/articles I have read (specifically the System.Object msdn docs, Jeffrey Richter's Applied .NET Framework Programming book and Peter Drayton's Effective Types presentation also state that you should use at least 1 instance field and that field should be immutable if possible. I could definitely follow the 3 rules for GetHashCode if I had a field that was indellible. However I cannot guarantee that all of my objects will indeed contain a field that can be used at object construction and that will be immutable.


    Now it reads like an immutable field is a desire not a requirement.
    # April 22, 2004 4:09 PM

    AT said:


    IMHO, There is simply wrong doc.

    HashCode must return the same value for any number of invocation as long as object not changed.

    Storing non-immutable in HashTable has no any common sence.
    Take a look on IHashCodeProvider and IComparer.
    They know nothing about previous values of object and return new HashCode to be used for buckets.

    As for a magic in Uri - they do this to acomplish #2
    # April 22, 2004 5:01 PM

    TrackBack said:

    # April 29, 2004 4:52 PM

    ChaCha said:

    For the best performance, a hash function must generate a random distribution for all input.

    I was under the assumption that this was to ensure a relatively even distribution of the objects across the hash buckets, so that if you were to hash 100 objects that all had very similar values they would not all be placed in the same 1 or two buckets, but rather would be distributed, quasi-randomly across all. Then the internal binary sort will be most efficient.
    # May 28, 2004 7:06 PM

    Keith Hill said:

    Larry,

    The problem is that if you have mutable data that is part of your "Equals" definition then you still have a problem even if the hashcode doesn't change when the mutable data changes. That is, say you have several objects in the same hashtable bucket. So you get the same hashcode on your object where the data has changed but Hashtable still doesn't find it because the "Equals" implementation returns false now because the mutable data has changed.
    # July 18, 2004 2:45 PM

    fogelman said:

    It seems to me that the whole immutability is regarding the key, the object you pass the hashtable as key - not the object value(s) itself. The point is that given an object with it's properties, private data etc, you want a handle to it, and given that object you would want to *always* derive the same key from it. Otherwise, you place a reference to it in the hashtable, but can never get back to it again.. bad for GC, bad for you! It's not really too far from the concept of a unique primary key in a row. You can store a GUID with your object if you need to change every darn data piece in it all the time and can't keep even one thing constant.
    # July 29, 2004 12:11 AM

    Kurt Otto Monnig said:

    Sorry for the java reference, but this is the best explanation of overriding Equals() and GetHashCode() that I have seen.  This text is used in graduate level CompSci courses. C# and java are very similar in many ways.

    java.sun.com/.../Chapter3.pdf

    # October 6, 2007 2:12 PM

    kushal said:

    I like to think of HashCode as more like a "primary key" for an object. And so, rule 3 is the most important, and rule 1 mustn't be taken too literally.

    In your example for instance, o1 and o2 might have the same x and y values, but are still 2 separate objects (presumably in the real world, they'd be representing 2 different underlying "things"), and so ought to have different hashcodes.

    # November 2, 2007 11:25 AM
    Leave a Comment

    (required) 

    (required) 

    (optional)

    (required)