Building Recommendation System Using F# - Part 1
In this blog post, using F# language, we will try to build a simple recommendation system. The true intent of this post is to learn F# language by implementing a recommendation system. If you are truly interested in learning what it takes to build a true recommendation system, then I would encourage you to read the following text : A Programmers Guide to data mining. I too have used this text as a source of reference for the content of this post. In this post, I am going to port the original python based code to F# and while doing that I will try to highlight various functional programming concepts.
Note : The source code(reference) and examples listed in this post is taken from the book : A Programmers Guide to data mining.
This series will be divided in multiple posts. Following is the break-up of what each post will contain :
- Part 1 : Initial Setup. Test data creation and implementation of Manhattan distance function
- Part 2 : Implementation of getting closest neighbors in terms of similarity in ratings
- Part 3 : Final Recommend function which will bring all pieces together
Pre-requisites for building a recommendation system:
- Data : before we can recommend something, we will need some data against which we can build our recommendation system. Data could be anything like users giving rating to movies on IMDB or rating products on Amazon, although these sets of data would qualify as very large data set. For this blog post, I will use the same sample data that is used in the above mentioned book i.e. small set of users and their ratings to various bands. We will see later in the post as how we can represent the below mentioned matrix in F#.
Angelica Bill Chan Dan Hailey Jordyn Sam Veronica Blues Traveller 3.5 2 5 3 - - 5 3 Broken Bells 2 3.5 1 4 4 4.5 2 - Deadmau5 - 4 1 4.5 1 4 - - Norah Jones 4.5 - 3 - 4 5 3 5 Phoenix 5 2 5 3 - 5 5 4 Slightly Stoopid 1.5 3.5 1 4.5 - 4.5 4 2.5 The Strokes 2.5 - - 4 4 4 5 3 Vampire Weekend 2 3 - 2 1 4 - - - Mathematical Model : For recommending a given user, based on the inputs provided by other users, we will have to somehow compute the relative closeness(distance) between the targeted user against all other users. For this, we will make use of mathematical models like Manhattan distance or Euclidian distance or more general Minkowski distance. You can read about all these models in the book, I have mentioned above. In this blog post, I am going to implement the Manhattan distance model.
- Programming Language of your choice : Since we have decided to make use of F#, you will require Visual Studio(2010/2012/2013). Ensure that you have download and installed the latest F# Power Tools.
Lets Get Started..
We will start by creating a new F# console project and calling it as RecommendationSystem. In Visual Studio, you can create an F# console project by selecting New Project –> Templates –> Other Languages –> Visual F#. Once created, the project will have a default App.config and Program.fs file. Program.fs file contains the entry point main method of the project.
Now we will add another F# source file called Recommend.fs to the project, in which we will add the core logic of recommendation system. New files can be added by right clicking the project and selecting Add –> New Item.
Important : Although F# is a statically typed language, the F# compiler is a linear top-down compiler i.e. before any F# function or type is used, it has to be first declared or defined in advance. Thus, the files in F# project have to be manually moved up or down according to their usage. In our case, since Program.fs will call functions defined in Recommend.fs, the Recommend.fs file has to be moved ahead of Program.fs. This can be done by right clicking on the Recommend.fs file and then selecting “Move Up” or “Move Down” depending upon the context.
By default, any newly added F# source file will include just one line which is – module <name_of_the_file>. In our case, it will be : module Recommend
Module: Module just like namespace in C#, is a way to encapsulate related code base in one group. The primary difference between namespace and module is that, namespace can span across multiple files whereas a module is restricted to just one file. Another difference is that in a given file we can define multiple modules. By the way, F# also supports namespaces. So within a file, we can define a root level namespace, followed by multiple modules. In our example, we are not going to use namespaces. We can apply, accessibility identifiers to restrict the visibility of items defined inside a module to the outside world. I will cover accessibility identifiers in later part of this series.
Added the following code to the Recommend.fs source file.
open System type Ratings = {Name: string; Rating: float} let Manhattan(ratings1:Ratings list, ratings2:Ratings list) = let mutable distance = 0.0 for item in ratings1 do let otherRating = ratings2 |> List.tryFind (fun t -> t.Name = item.Name) match otherRating with | Some(v) -> distance <- distance + Math.Abs(item.Rating - v.Rating) | None -> () distance
In the above code, we have declared two sets of functionality. The very first line : type Ratings = {Name: string; Rating: float}, is called a Record Type in F#. Reason behind the creation of a record type is to represent a bands rating. For e.g. from the above matrix we can see that Bill has rated 4 for Deadmau5, 2 for Phoenix etc. If we represent these ratings as sets, then ratings of Bill it can be represented as : {“Deadmau5”,4} , {“Phoenix”, 2} etc. As we can see, the set based rating maps so closely with the record type definition in F#. If we have to do the same in C#, we would either create a class or struct, its just that C# declaration would be too verbose. The record type definition, can be used for representing ratings of any kind of entity.
Before I explain the next piece of code, I will present the original python code from the book:
def manhattan(rating1, rating2): distance = 0 for key inrating1: if key inrating2: distance += abs(rating1[key] -rating2[key]) return distanceIn the code, we have pass two sets of ratings. We iterate over one list, finds the matching record from the other list and if found, then compute the absolute difference between the two ratings and return the accumulated value. Quiet simple to understand.
Back to the F# code, lets go through it in a line by line manner.
- Function signature : We declare functions in F# using let keyword, followed by function name and then the function arguments. In our F# code, input arguments are called ratings1 and ratings2. Most of the time, F# can infer the type of parameter based on its usage but in some cases where the type is quiet complex, its required from our side to explicitly highlight the type of the parameter. This is true not just for function arguments but for all type of variable declarations in F#. Also note that, the type is written as “Ratings list”. This is same as writing List<Ratings> which would have been the case, had it been a C# code. Since F# and C# target .Net platform, all the constructs that we can use in C#, can be used in F# as well. So in the above case, I could have very well used C#’s generic list collection but I decided to instead use F#’s way of representing collection. For more on F# list : http://msdn.microsoft.com/en-in/library/dd233224.aspx.
- Mutable distance variable : The very first line in the function is : let mutable distance = 0.0. Immutability is one of the core concepts in F# programming. By default, any declared variable is immutable unless until explicitly declared as mutable. In F#, a variable is marked as mutable by adding the “mutable” keyword. For assigning values to mutable keyword, “<-“ operator is used. Most often, mutable accumulator variable and their usage can be replaced by recursively implementing the looping logic. I haven’t tried recursive approach yet as I wanted to keep the implementation simple and easy to understand. For more on mutability in F# : http://msdn.microsoft.com/en-us/library/dd233185.aspx
- The core for loop : For clarity lets look at the for loop itself :
for item in ratings1 do let otherRating = ratings2 |> List.tryFind (fun t -> t.Name = item.Name) match otherRating with | Some(v) -> distance <- distance + Math.Abs(item.Rating - v.Rating) | None -> ()
In the above code, we are iterating over the first list of ratings(ratings1). Inside for loop, for every item in first collection, we are looking for matching record(on the basis of name) from the second collection. In the line : ratings2 |> List.tryFind (fun t –> t.Name = item.Name), List.tryFind accepts a lambda expression and returns the first matching record. If not found, then it returns an OPTION type output.
Options : In F#, nullability or absence of value is handled via options type. In C#, if a matching record is not found then null value is returned. In that case, we will have to protect our code with null check. In case of F#, we rely on options type. For more on options type : http://msdn.microsoft.com/en-us/library/dd233245.aspx
Another interested construct used in the above statement is the pipeline operator (|>). In simple terms, pipeline operator feeds the value on left as input parameter to the function on the right side. We could have written: ratings2 |> List.tryFind (fun t –> t.Name = item.Name) without using the pipeline operator in the following manner :
List.tryFind (fun t –> t.Name = item.Name) ratings2
Finally, once we have the matching value, we make use of F# pattern matching functionality to extract value out of the found item(if present) and calculate the Manhattan distance value. This value is later assigned to the mutable accumulator. Once the for loop is completed, we return the “distance” value. By default, last statement or expression inside the function body is considered as the return value of the function. If we see the signature of the Manhattan function(by hovering over the function), it would appear as :
val Manhattan : ratings1:Ratings list * ratings2:Ratings list –> float
This is read as, the function “Manhattan” takes a tuple of (ratings1 and ratings2) as input parameter and returns a float value as output parameter.
Even though we have only written just one function, we have touched some really important and fundamental concepts of functional programming. If you are an experienced F# developer then you wouldn’t have faced any difficult in understanding the above code but for new-comers, I will list down the important topics that we have covered so far.
- F# Functions : http://msdn.microsoft.com/en-us/library/dd233229.aspx
- Modules : http://msdn.microsoft.com/en-us/library/dd233221.aspx
- Options : http://msdn.microsoft.com/en-us/library/dd233245.aspx
- Pattern Matching : http://msdn.microsoft.com/en-us/library/dd547125.aspx
In the next part, we will see how to represent the ratings test data and how to leverage the “Manhattan” function to compute distance between two users.