Haskell and Collective Intelligence

Tonight for the Real World Haskell Book Club, Paul Barry and I decided to dig through the Programming Collective Intelligence book as some good hack projects as we dive deep into the language features.  As you may have noticed both here on my blog and on Twitter, I have been posting a lot of these samples in F# and refactoring along the way.  By taking the Python code and rewriting it in Haskell gave me better insight into the Haskell way, but also refactoring opportunities for my existing F# code.

The Haskell Top Matches

First, I decided that some of the type definitions were far too long, so I decided to type alias these so that it makes a bit more sense when I reference them:

import Data.Map((!), intersection, fromList, keys, Map(..))
import Data.List(sort)

type Preferences = Map String (Map String Double)
type Similarity = Preferences -> String -> String -> Double

Now that I have these type definitions, I can now create the critics map with a much cleaner type signature such as the following:

-- Old
critics :: Map String (Map String Double)

-- Better
critics :: Preferences

Next, let’s move onto our two similarity functions, the Euclidean Distance and Pearson Coefficient implementations.  First, let’s look at the Euclidean Distance formula.  As before, in order to calculate between two people on the chart, take the difference in each axis, square them and add them together, and finally take the square root of the sum.  This function will always return values between 0 and 1, with 1 being identical and 0 having nothing in common. The code to do this might look like the following:

sim_distance :: Preferences -> String -> String -> Double
sim_distance prefs p1 p2 =
     let si = keys $ intersection (prefs ! p1) (prefs ! p2)
     in case si of
         [] -> 0
         otherwise -> 
           1 / (1 + sqrt sum_of_squares)
             where sum_of_squares = 
                     sum [((prefs ! p1 ! item) - (prefs ! p2 ! item)) ^ 2 |
                            item <- si]

Of course, I could have also used a fold in order to calculate the sum of squares in order to have one operation instead of two, the list comprehension and the sum of that list.  But, I find that the list comprehension is much more readable in many scenarios.  Now, we can run through this to calculate the differences between Lisa Rose and Gene Seymour:

*Main> sim_distance critics "Lisa Rose" "Gene Seymour"
0.29429805508554946

Moving onto the more sophisticated Pearson Coefficient, we find that implementing this is pretty straight forward as with before.  This finds the items rated by both individuals, then calculates the sum of squares of the ratings, then uses the Pearson score (a not so intuitive formula) to calculate our final score.  This might look like the following:

sim_pearson :: Preferences -> String -> String -> Double
sim_pearson prefs p1 p2 =
  let si = keys $ intersection (prefs ! p1) (prefs ! p2) 
      n  = fromIntegral $ length si
  in case n of
      0 -> 0
      otherwise -> 
        let sum1 = sum [prefs ! p1 ! item | item <- si]
            sum2 = sum [prefs ! p2 ! item | item <- si]
            sum1Sq = sum [prefs ! p1 ! item ^ 2 | item <- si]
            sum2Sq = sum [prefs ! p2 ! item ^ 2 | item <- si]
            pSum = sum [(prefs ! p1 ! item) * (prefs ! p2 ! item) | item <- si]
            num = pSum - (sum1 * sum2 / n)
            den = sqrt ((sum1Sq - sum1 ^ 2 / n) *
                        (sum2Sq - sum2 ^ 2 / n))
        in case den of
          0 -> 0
          otherwise -> num / den

Now we can calculate much as above the similarity between the two noted critics:

*Main>sim_pearson critics "Lisa Rose" "Gene Seymour
0.39605901719066977

Lastly, we can get the top matches for the person most similar to me.  Once again as with above, I will be using a list comprehension to compare one key to another.

topMatches :: Preferences -> String -> Int -> Similarity -> [(Double, String)]
topMatches prefs person n similarity =
    take n $ 
      reverse $ 
        sort [(similarity prefs person other, other) | 
               other <- keys prefs, 
               other /= person]         

Once again, we can check for the matches most similar to Lisa Rose:

*Main> topMatches critics "Lisa Rose" 3 sim_pearson
[(0.9912407071619299,"Matt"),(0.7470178808339965,"Jack Matthews"),(0.5940885257860044,"Mick LaSalle")]

This has been a great exercise in getting to know a language.  What better than to take code from another language and learn to express it in another paradigm, from imperative to functional?  This has been a great opportunity to rid ourselves of such things as imperative loops, mutable data structures and so on.  With the extra enforcement that Haskell puts upon you, you find even more interesting and potentially better ways of solving the problem in a more functional style.  You can find the entire code here.

Refactoring the F# Code Base

Now that we had that brief Haskell lesson, let’s get back to my F# codebase.  How can I learn from what I just did in Haskell?  Well, I can clean up the code base like I did above by using type aliasing such as the following:

type Preferences = Map<string, Map<string, float>>
type Similarity = Preferences -> string -> string -> float

Along with this, my helper functions could use some more oomph.  For example, it would be helpful to have an intersection function which would return only the items with keys from both maps.  Secondly, we could add others such as keys which gets us a key collection from a given map and so on:

[<AutoOpen>]
module Operators =
  let (||>) (x,y) f = f x y

module Seq =
  open System.Linq
  let rev : 'a seq -> 'a seq = Enumerable.Reverse

module Map =
  let keys (m:Map<'k,'a>) =
    m |> Seq.map (fun kvp -> kvp.Key)

  /// Left leaning union
  let union m1 m2 =
    (m1, m2) ||> Map.fold_right Map.add
  
  /// Left leaning intersection
  let intersect m1 m2 =
    (m2, Map.empty) 
      ||> Map.fold_right (fun k _ acc ->
                            match Map.tryfind k m1 with
                            | None   -> acc
                            | Some x -> Map.add k x acc)  

These are any number of these little functions that I use in my F# that I hope one day may be in the base libraries, but until then, these suit me well.  With the intersect now in place, I can rewrite my Euclidean Distance function to look something like this:

// Returns a distance-based similarity score for p1 and p2
let sim_distance (prefs:Preferences) p1 p2  =
  // Get the list of shared items
  let si = Map.intersect prefs.[p1] prefs.[p2] |> Map.keys
  
  // No ratings in common, return 0
  match Seq.length si with
  | 0 -> 0.
  | _ -> 
    // Add up the squares of all the differences
    let sum_of_squares = 
      seq {for item in si -> pown (prefs.[p1].[item] - prefs.[p2].[item]) 2}
      |> Seq.sum
    1. / (1. + (sqrt sum_of_squares)) 

Each of the formulas that we have can easily be rewritten to use this style as I have above.  By refactoring this behavior into common functions such as the Map.intersect, we’re able to clean up our code significantly.  Turning our attention to our topMatches, we can clean this up as well, but changing the type signatures.

// Returns the best matches for person from the prefs dictionary
let topMatches (prefs:Preferences) person n (similarity:Similarity) =  
  seq { for other in Map.keys prefs do
          if person <> other then
            yield (similarity prefs person other, other)
      }
    |> Seq.sort_by fst
    |> Seq.rev
    |> Seq.take n

Much cleaner!   Many cases I find that code I write six months ago that I could easily make a few changes here and there to clean it up in one fashion or another.  In this case, it’s no more than just a couple of weeks!

Conclusion

As the Real World Haskell book club continues, we will be digging further into not only the book, but also other algorithms as well to see how we can accomplish some of these more interesting things such as item recommendation in a purely functional style.  I’ve found that one of the best ways to learning the language is to take some imperative code samples and rewrite them in a pure functional style.  This not only teaches you the syntax of the language, but also what makes functional programming quite different than imperative.

In the next posts, I will be getting back to more Colllective Intelligence including item similarity and revisiting Slope One as well.

No Comments