Functional Programming and Collective Intelligence - II

In the previous post covering Functional Programming and Collective Intelligence, I left off with a discussion of two different algorithms for measuring similarity: the Euclidean distance and the Pearson product-moment correlation coefficient.  The Programming Collective Intelligence book covers these two formulas in depth, yet they are not the only ones capable of performing the task.  Let’s briefly cover a few more before moving on to ranking the critics and recommending items for a particular user.

The Formulas

This time, we’ll uncover three more formulas that we could use to complement the Euclidean and Pearson formulas we covered last time.  Each has its own merits and particular uses as well.  The three we will cover are the Manhattan distance, the Chebyshev distance and the Jaccard coefficient.

Manhattan Distance or Taxicab Metric

The Manhattan distance, or taxicab metric, was devised in the 19th century as an alternative to the Euclidean distance, the formula from last time: instead of a straight-line measure, the distance between two points is the sum of the absolute differences of their coordinates.  The name Manhattan or taxicab comes from the grid layout of Manhattan and the shortest route a taxicab could take between two points in the city.  To calculate it, simply sum the absolute differences between the two given points.  Below is an example of using this in code.

// Returns the Manhattan distance for p1 and p2    
let sim_manhattan (prefs:Map<string, Map<string, float>>) p1 p2 =
  // Get the list of shared items
  let si = Map.fold_right 
            (fun k _ acc -> 
               if prefs.[p2].ContainsKey k then
                 Set.add k acc
               else acc) prefs.[p1] Set.empty
  // No ratings in common, return 0
  if Set.is_empty si then 0.
  else
    // Add up the absolute differences
    let sum =
      List.sum [for item in si ->
                  abs (prefs.[p1].[item] - prefs.[p2].[item])]  
    1. / (1. + sum)   

This function will give us a value between 0 and 1, where 1 is an identical match, just as with the Euclidean version.  As with our other similarity functions, it accepts our data map and the two keys, so all of them are readily interchangeable.  We can use this to determine how similar our critics are. 

> sim_manhattan critics "Lisa Rose" "Claudia Puig";;
val it : float = 0.2857142857

This tells us that the ratings of Lisa Rose and Claudia Puig are not very similar.  But are we starting to see a pattern here?  Of course this leaves us with refactoring opportunities….  Let’s try yet another example.
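On that refactoring note, one possible sketch (the names `sharedItems` and `sim_by` here are my own, not from the book) is to extract the duplicated shared-items fold into a helper and build each distance-based similarity from a single aggregation function over the absolute differences:

```fsharp
// Hypothetical helper: the set of items rated by both p1 and p2
let sharedItems (prefs:Map<string, Map<string, float>>) p1 p2 =
  Map.fold_right
    (fun k _ acc ->
       if prefs.[p2].ContainsKey k then Set.add k acc
       else acc) prefs.[p1] Set.empty

// Builds a similarity from any aggregation over the absolute differences
let sim_by aggregate (prefs:Map<string, Map<string, float>>) p1 p2 =
  let si = sharedItems prefs p1 p2
  // No ratings in common, return 0
  if Set.is_empty si then 0.
  else
    let diffs =
      [for item in si -> abs (prefs.[p1].[item] - prefs.[p2].[item])]
    1. / (1. + aggregate diffs)
```

With that in place, the Manhattan similarity is simply `sim_by List.sum`, and the Chebyshev version below would be `sim_by List.max`.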

Chebyshev Distance or Chessboard Distance

Another interesting similarity metric we can use is the Chebyshev distance, defined on a vector space where the distance between two vectors is the greatest of their differences along any coordinate dimension.  This is also called the chessboard distance because the minimum number of moves a king requires to go from one square to another equals the Chebyshev distance between the squares' centers.  Often, this metric is used in warehouse logistics calculations.  Brings me back to my days in college...

To calculate this metric, simply take the max value from the list of absolute differences.  Below is an example of this in F#:

// Returns the Chebyshev distance for p1 and p2  
let sim_chebyshev (prefs:Map<string, Map<string, float>>) p1 p2 =
  // Get the list of shared items
  let si = 
    Map.fold_right 
      (fun k _ acc -> 
         if prefs.[p2].ContainsKey k then
           Set.add k acc
         else acc) prefs.[p1] Set.empty
  // No ratings in common, return 0
  if Set.is_empty si then 0.
  else
    let maxDiff = 
      List.max [for item in si ->
                  abs (prefs.[p1].[item] - prefs.[p2].[item])]
    1. / (1. + maxDiff)      

Once again, as above, this value can only range between 0 and 1, where 1 is an exact match.  Let’s take the above calculation of the Manhattan distance and apply the Chebyshev instead.

> sim_chebyshev critics "Lisa Rose" "Claudia Puig";;
val it : float = 0.5

Interesting metric, but instead, let’s focus on another one that was mentioned in the book, the Jaccard similarity coefficient.

Jaccard Similarity Coefficient

The Jaccard similarity coefficient was another similarity calculation mentioned in the Programming Collective Intelligence book.  It measures the similarity between sample sets as the size of their intersection divided by the size of their union.  Here we adapt the idea to ratings: we count the shared items that both critics rated identically and divide by the total number of shared items.  Below is the sample code.

// Returns the Jaccard coefficient for p1 and p2  
let sim_jaccard (prefs:Map<string, Map<string, float>>) p1 p2 =
  // Get the list of shared items
  let si = 
    Map.fold_right 
      (fun k _ acc -> 
         if prefs.[p2].ContainsKey k then
           Set.add k acc
         else acc) prefs.[p1] Set.empty
  // No ratings in common, return 0
  if Set.is_empty si then 0.
  else
    let dividend =       
      Set.fold_left 
        (fun acc item  ->
           // If the same add one to dividend
           if prefs.[p1].[item] = prefs.[p2].[item] then
             acc + 1.
           else acc) 0. si
    dividend / float (Set.size si)

To calculate this, we simply check whether the two critics rated an item the same, and if so, increment the dividend by 1.  We then divide that number by the number of items in our intersection set; for example, if two critics shared five items and rated two of them identically, the coefficient would be 2/5 = 0.4.  Simple, yet elegant.  As you may notice, there are quite a few folds here.  I’ve talked extensively about using them, and once you understand what they are and how to use them, you’ll find they tend to be very heavily used in functional programming.

Now we can use this much as above with our similarity checks in our previous examples such as the following:

> sim_jaccard critics "Lisa Rose" "Claudia Puig";;
val it : float = 0.4

Ok, now that we have a basic understanding, let’s move on to getting the top matches using these algorithms.

Getting the Top Matches

Now that we have five algorithms for comparing two people, we can create a function which scores everyone against a given person and finds the closest matches.  In this example taken from the book, we can determine which critics have tastes most similar to a given critic in the list.  Let’s define a function to do that:

// Returns the best matches for person from the prefs dictionary
let topMatches 
  (prefs:Map<string, Map<string, float>>) 
  person 
  n
  (similarity:Map<string, Map<string, float>> -> string -> string -> float) =
  
  prefs 
    |> Seq.choose (fun kvp -> 
         if kvp.Key <> person then 
           Some(similarity prefs person kvp.Key, kvp.Key) 
         else None) 
    |> Seq.sort_by fst
    |> Seq.rev
    |> Seq.take n

First, we make sure we’re not comparing a person against themselves; for everyone else, we calculate the score with the incoming algorithm, sort by the score, reverse it and take the number requested.  We can now run the gamut of our algorithms to determine where Lisa Rose fits.
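As a small aside, the sort-and-reverse pair could also be collapsed into a single descending sort by negating the score; a sketch of that variant (the name `topMatches'` is my own, using the same Seq functions as above):

```fsharp
// Same shape as topMatches, but sorts descending by negating the score
let topMatches' (prefs:Map<string, Map<string, float>>) person n similarity =
  prefs 
    |> Seq.choose (fun kvp -> 
         if kvp.Key <> person then 
           Some(similarity prefs person kvp.Key, kvp.Key) 
         else None) 
    |> Seq.sort_by (fun (score, _) -> -score)
    |> Seq.take n
```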

> topMatches critics "Lisa Rose" 3 sim_distance;;
val it : seq<float * string>
= seq
    [(0.472135955, "Michael Phillips"); (0.4142135624, "Mick LaSalle");
     (0.4, "Claudia Puig")]
> topMatches critics "Lisa Rose" 3 sim_pearson;;
val it : seq<float * string>
= seq
    [(0.9912407072, "Matt"); (0.7470178808, "Jack Matthews");
     (0.5940885258, "Mick LaSalle")]
> topMatches critics "Lisa Rose" 3 sim_manhattan;;
val it : seq<float * string>
= seq
    [(0.4, "Michael Phillips"); (0.2857142857, "Claudia Puig");
     (0.25, "Mick LaSalle")]
> topMatches critics "Lisa Rose" 3 sim_chebyshev;;
val it : seq<float * string>
= seq
    [(0.5, "Mick LaSalle"); (0.5, "Michael Phillips"); 
     (0.5, "Claudia Puig")]     
> topMatches critics "Lisa Rose" 3 sim_jaccard;;
val it : seq<float * string>
= seq
    [(0.5, "Michael Phillips"); (0.4, "Claudia Puig");
     (0.3333333333, "Gene Seymour")]     

Conclusion

This field of machine learning and collective intelligence is quite fascinating.  You’ll find that functional programming fits nicely in this field with its conciseness, immutable data structures and type inference.  Once again, no for loops, while loops or mutable data structures were harmed during this post.  In the next post, we’ll get to actually recommending items.  But, as you can see, translating this book to a more functional style gives us a great example of how well functional programming, and indeed F#, fits.
