Functional Programming and Collective Intelligence

Monday, March 30, 2009

During some of my talks about functional programming, and in turn, F#, many do ask “Why Functional Programming?” and “Why use F#?” and both are very good questions. Many of the people using F# today are in fields such as finance, insurance, science and academia. Today, I hope to dive into a subject for which it is well suited. If you’ve been following me on Twitter, I’ve been talking a bit and showing off a bit of code on today’s subject.

One of my recent interests besides the concurrency issues is around “Collective Intelligence”, which is a shared group of intelligence that emerges from collaboration and competition of many individuals. With the increasing number of sites and people contributing data either intentionally or incidentally, it has given us a vast amount of data which can give us insight into any number of things such as user experience, marketing effectiveness, personal tastes and so on. Examples of large data sets that we can use for machine learning are the Netflix Prize Data and the Guardian Data Store among others.

The recent release of the Guardian Data Store has given us a great opportunity to not only analyze such data as population statistics, world health, education, migration and even such things as 1000 songs to hear before you die, but to visualize them in pretty interesting ways as well. Another noted functional programming blogger, Robert Pickering, has made the foray into analyzing and visualizing world health data using F#.

Why functional languages are a good fit here is around the conciseness of the language and the ability to explore data quickly. The exploratory nature of functional languages such as Haskell and F# allow us to prototype rapidly through the use of a console, without ever having to launch an IDE such as Visual Studio. Instead, we can quickly explore and transform the data in very interesting ways without a lot of overhead. Not only that, but the conciseness of the language and data structures well suited for exploration help us with such things as lists, tuples, maps, sets and so on.

What are the techniques needed for this? There are two good books that have been released recently on the subject that are well worth owning:

Both of these books form an excellent foundation for getting started and give many great examples on how you can start leveraging your data to find many interesting things. With that in mind, let’s take a quick example from the Programming Collective Intelligence book, but this time using F# instead of Python, in which the book examples are written.

Getting Started

A challenge I give myself often is to take many of these examples, written in not only other languages, but other paradigms (imperative/object oriented) and redo the logic in a functional programming style using either Haskell or F#. We can apply such things as higher-order functions, immutable data structures and other functional aspects as we solve the problems. In this case, I chose F# to get started. In this challenge, we’re going to get user preferences and calculate similar users. First, let’s define the data set we’ll be using:

let critics = 
  Map.of_list [("Lisa Rose", 
                Map.of_list [("Lady in the Water", 2.5);("Snakes on a Plane", 3.5);
                             ("Just My Luck", 3.0);("Superman Returns", 3.5);
                             ("You, Me and Dupree", 2.5);("The Night Listener", 3.0)];);
               ("Gene Seymour", 
                Map.of_list [("Lady in the Water", 3.0);("Snakes on a Plane", 3.5);
                             ("Just My Luck", 1.5);("Superman Returns", 5.0);
                             ("You, Me and Dupree", 3.5);("The Night Listener", 3.0)];);
               ("Michael Phillips", 
                Map.of_list [("Lady in the Water", 2.5);("Snakes on a Plane", 3.0);
                             ("Superman Returns", 3.5);("The Night Listener", 4.0)];);
               ("Claudia Puig", 
                Map.of_list [("Snakes on a Plane", 3.5);("Just My Luck", 2.0);
                             ("Superman Returns", 4.0);("You, Me and Dupree", 2.5);
                             ("The Night Listener", 4.0)];);
               ("Mick LaSalle", 
                Map.of_list [("Lady in the Water", 3.0);("Snakes on a Plane", 4.0);
                             ("Just My Luck", 2.0);("Superman Returns", 3.0);
                             ("You, Me and Dupree", 2.0);("The Night Listener", 3.0)];);
               ("Jack Matthews", 
                Map.of_list [("Lady in the Water", 3.0);("Snakes on a Plane", 4.0);
                             ("Superman Returns", 5.0);("You, Me and Dupree", 3.5);
                             ("The Night Listener", 3.0)];);        
               ("Matt", 
                Map.of_list [("Snakes on a Plane", 4.5);("Superman Returns", 4.0);
                             ("You, Me and Dupree", 1.0)];);]

This data set gives us a map of critics and their ratings of given movies. At the bottom, I have my preferences listed as well which is to aid in our selection. You could imagine using this style of data set for your own data as well with such things as user name and items bought, browsed, and rating. Now that the data set is in place, we can then calculate how similar people are in their tastes through the calculation of a similarity score. In this example, we have two scoring algorithms, the Euclidean distance and Pearson product-moment correlation coefficient.

The first algorithm we’ll use is the Euclidean distance algorithm, which takes the items that people have ranked in common and uses them as axes on a chart. To calculate between two people on the chart, take the difference in each axis, square them and add them together, and finally take the square root of the sum. This function will always return values between 0 and 1, with 1 being identical and 0 having nothing in common. To accomplish this in F# might look something like the following:

// Returns a distance-based similarity score for p1 and p2
let sim_distance (prefs:Map<string, Map<string, float>>) p1 p2  =
  // Get the list of shared items
  let si = Map.fold_right (fun k _ acc -> 
                             if prefs.[p2].ContainsKey k then
                               Set.add k acc
                             else acc) prefs.[p1] Set.empty  
  // No ratings in common, return 0
  if Set.is_empty si then 0.
  else
    // Add up the squares of all the differences
    let sum_of_squares = 
      List.sum [for item in si -> 
                  pown (prefs.[p1].[item] - prefs.[p2].[item]) 2]
    1. / (1. + sqrt sum_of_squares)

We first capture the similarity by performing a right fold over the first person’s preferences and if the second person has the key, we add to our collection, else we return the accumulator in a Set. After determining that there are ratings in common, we then calculate the sum of the squares of the differences between the first and second person. And then finally return 1 over the 1 + the square root of our sum of squares.

Note that we could have instead of using List.sum with a list comprehension, we could have also used a left fold which would accomplish the same goal. The left fold approach is more efficient as it doesn’t create a second list, but obviously the list comprehension is much easier to read. We could have implemented it as the following:

let sum_of_squares' =
  Map.fold_left (fun acc k _ -> acc + 
                                (pown (prefs.[p1].[k] - prefs.[p2].[k]) 2))

Let’s now calculate the similarity between Mick LaSalle and Jack Matthews ratings by the following in F# interactive:

> sim_distance critics "Mick LaSalle" "Jack Matthews";;
val it : float = 0.2857142857

We can observe from the results that Mick and Jack’s preferences aren’t that similar at all. But, let’s try another, more sophisticated approach with the Pearson product-moment correlation coefficient. This correlation coefficient measures how well our two sets of data fit into a straight line. This is more sophisticated than the previous because it can give better results when the data isn’t normalized.

In order to calculate this, we first find the movies rated by both critics, then calculates the sum, then the sum of the squares, and then the sum of the products of their ratings. Finally, we apply the Pearson formula which is better explained in code as shown below in F#:

// Returns the Pearson correlation coefficient for p1 and p2
let sim_pearson (prefs:Map<string, Map<string, float>>) p1 p2 =
  // Get the list of mutually rated items
  let si = Map.fold_right (fun k _ acc -> 
                             if prefs.[p2].ContainsKey k then
                               Set.add k acc
                             else acc) prefs.[p1] Set.empty
  
  // if they are no ratings in common, return 0
  let n = float si.Count
  if n = 0. then 0.
  else
    // Sums of all the preferences
    let sum1 =  List.sum [for it in si -> prefs.[p1].[it]]
    let sum2 =  List.sum [for it in si -> prefs.[p2].[it]]    
    
    // Sums of the squares
    let sum1Sq = List.sum [for it in si -> pown prefs.[p1].[it] 2]
    let sum2Sq = List.sum [for it in si -> pown prefs.[p2].[it] 2]  
    
    // Sum of the products
    let pSum = List.sum [for it in si -> prefs.[p1].[it] * prefs.[p2].[it]]
    
    // Calculate r (Pearson score)
    let num = pSum - (sum1 * sum2 / n)
    let den = sqrt ((sum1Sq - (pown sum1 2) / n) * 
                    (sum2Sq - (pown sum2 2) / n ))
    if den = 0. then 0.
    else
      num / den

This function, unlike the Euclidean distance, will return between –1 and 1 with 1 meaning they have identical matches. We can now calculate much as we did above for the sim_distance function as below:

> sim_pearson critics "Mick LaSalle" "Jack Matthews";;
val it : float = 0.2112885637

As you notice, there’s a bit of a difference between the two, but yet there are more algorithms out there that we can use for this. But, I think this is a good stopping point for now, and we can pick up later on how to rank the critics as well as make recommendations all using our above formulas.

Conclusion

The field of machine learning and Collective Intelligence is a pretty interesting field. By applying functional programming constructs to these, we are able to concisely apply formulas using higher order functions and immutable data structures. You’ll note that no for loops, explicit recursion or mutable data structures were harmed during this blog post. This example as translated from the Programming Collective Intelligence book gives us a great example of applied functional programming and F# in turn.

In the sim_distance function, should the alternative (more efficient) implementation of sum_of_squares be

let sum_of_squares' =
Set.fold_left (fun acc k -> acc + (pown (prefs.[p1].[k] - prefs.[p2].[k]) 2)) 0.0 si

Alan - Saturday, April 4, 2009 9:32:00 AM

@Alan

Yes, that's a typographical error and it shall be corrected.

Matt

Matthew Podwysocki - Wednesday, April 8, 2009 1:37:58 AM

Getting Started

Conclusion

2 Comments