In the terminology of relational databases, a “join” is, semantically, like a nested loop over a pair of lists (or tables) of records, saving only those where some certain fields match. Unless we do something smart, this could be very expensive. Imagine searching a database of a million DNA profiles for the closest match to a sample that has 10,000 DNA features (I have no idea whether those are real numbers: I just made them up, but they sound ballpark to me). A dumb join would search all 1 million profiles for each of the 10,000 features, resulting in 10 billion match tests, almost all of which will fail – by design, of course. That’s going to hurt.
The “something smart” is to build an index and search through that. Your database doesn’t have to be large at all for this to pay off. In fact, even with just a few records, it’s cheaper to build an index, use it, and throw it away than it is to do a nested loop. Let’s prove that with a small sample. First, we’ll do everything with objects in memory just to keep it easy. In a later blog entry, I’ll show how to do exactly the same thing with DLINQ and SQL Express. So, in this entry, when I say “database,” I mean collection of objects in memory. I still have a bunch of interesting new VB9 things to show off, though, including object initializers, XML literals, and query comprehensions.
My partner, Erik Meijer, showed me a fantastic little database called Mondial, free for educational and research use. Huzzah! That’s us. It’s full of fun factoids about cities and countries and languages and politics. It’s downloadable in SQL, text, or XML. It illustrates most of the technical problems you’d encounter with a more typical business sample, you know, the old boring “Customers and Orders,” but it’s more likely to keep you awake.
Let’s start off with just two of the tables in Mondial: Countries and Cities. Turns out there are 195 countries and 3,053 cities in the database. Each Country has a unique “Code,” which is a very short unique string. The database can use this as a primary key: it only needs to check the “Code” field for uniqueness of a Country record. Here’s the VB for a class representing Mondial countries:
Public Class Country
Public Name As String
Public Code As String 'Primary key
Public Capital As String
Public Province As String
Public Area As Integer
Public Population As Integer
End Class
And here are a few instances, using VB9’s “object-initializer” syntax (you will need the VB9 CTP [Customer Technology Preview] to work with the code in this post):
{Name := "Albania", Code := "AL", Capital := "Tirane", Province := "Albania", Area := 28750, Population := 3249136}
{Name := "Andorra", Code := "AND", Capital := "Andorra la Vella", Province := "Andorra", Area := 450, Population := 72766}
{Name := "Spain", Code := "E", Capital := "Madrid", Province := "Madrid", Area := 504750, Population := 39181114}
Fun facts and figures already! How about cities? Here’s a class for Mondial cities:
Public Class City
Public Name As String
Public Country As String 'Foreign-key to Country.Code
Public Province As String
Public Population As Integer
Public Longitude As Double
Public Latitude As Double
End Class
And some samples:
{Name := "Ankara", Country := "TR", Province := "Ankara", Population := 2782200, Longitude := 32.8833, Latitude := 39.95}
{Name := "Helsinki", Country := "SF", Province := "Uusimaa", Population := 487428, Longitude := 24.95, Latitude := 60.1667}
{Name := "Rangoon", Country := "MYA", Province := "Yangon", Population := 2513000, Longitude := 96.15, Latitude := 16.7833}
Ok, so now we can see how Mondial links cities to their countries: City.Country == Country.Code, that is, the “Country” field of each city must match the “Code” field of the corresponding country. A city record doesn’t contain the name of its country, just the country’s code. So, unless we’ve memorized all 195 country codes, we can’t look at any old city and tell what country it’s in.
Database join to the rescue! We’ll use City.Country == Country.Code as our “join condition” and print out a list of cities with the names of their countries retrieved from the list of countries.
The initializers for all 195 countries and 3,053 cities fill up about, well, about 3,248 lines of code, which is 50 pages or so and too large to post. So, I think the best way to get them to you is to tell you how I made them. From the Mondial site, I copied one of the SQL formats, which contains 3,248 INSERT VALUES statements. Using emacs keyboard macros, I manually wrestled these statements into the appropriate VB format, shown in the pattern below.
Public Module DataInitializer
Public Function CountryData() As List(Of Country)
Dim cs As New List(Of Country)
cs.Add(New Country {Name := "Albania", Code := "AL", Capital := "Tirane", Province := "Albania", Area := 28750, Population := 3249136})
cs.Add(New Country {Name := "Greece", Code := "GR", Capital := "Athens", Province := "Greece", Area := 131940, Population := 10538594})
cs.Add(New Country {Name := "Macedonia", Code := "MK", Capital := "Skopje", Province := "Macedonia", Area := 25333, Population := 2104035})
...
CountryData = cs
End Function
Public Function CityData() As List(Of City)
Dim cs As New List(Of City)
cs.Add(New City {Name := "Luxembourg", Country := "L", Province := "Luxembourg", Population := 76600, Latitude := 6.08, Longitude := 49.4})
cs.Add(New City {Name := "Monaco", Country := "MC", Province := "Monaco", Population := 1234, Longitude := 7.2, Latitude := 43.7})
cs.Add(New City {Name := "San Marino", Country := "RSM", Province := "San Marino", Population := 4416, Longitude := 12.2, Latitude := 43.5})
...
CityData = cs
End Function
End Module
The whole process took me 10 minutes, including making and correcting the inevitable ham-handed mistakes with emacs regular expressions. Someone with good Perl skills could probably have done the job even more quickly. I do NOT recommend trying this with search-and-replace in the VB IDE, because the VB pretty lister will try to format each of the 3,248 lines one at a time, and it will hurt.
Let’s start up a VB module, declare a couple of variables to hold all our data, and a subroutine to fill them up.:
Module Module1
Dim Countries As List(Of Country)
Dim Cities As List(Of City)
...
Sub Initialize()
Countries = DataInitializer.CountryData
Cities = DataInitializer.CityData
End Sub
Now, what’s the most straightforward way to do the join? How about the following:
Dim xs As IEnumerable
' O(Cities) * O(Countries)
Sub SlowQuery()
xs = Select New {Cty := it.t.Name, Cny := it.n.Name} _
From t In CitySubset, n In CountrySubset _
Where it.t.Country = it.n.Code
For Each x In xs
'Print or whatever... must do something to force
'query execution, however.
Next
End Sub
This sub uses anonymous types, data initializers, and a query comprehension – all new VB9 features – to build an IEnumerable containing city names and their country names. Its execution time is of order 3,053*195 =~ 600,000; painful. Also note carefully that the LINQ query comprehension just sets up the query without executing it. Not until we do something with the result, xs, does any searching and matching take place. LINQ query execution is lazy.
Most of these 600,000 comparisons will fail, of course, because they’re checking whether Madrid is in Zimbabwe or Capetown is in Iceland, and so on. In fact, we know that exactly 3,053 of them will succeed. We can be a lot smarter.
Suppose we had an index that would let us look up the country for any city, or, equivalently, the list of cities for any country? Then, we could write something like the following:
Dim ys As IEnumerable
' O(Number of Cities)
Sub FastQuery0()
ys = Select New {Cty := it.t.Name, Cny := it.n.Name} _
From n In Countries, t In CitiesFromCountry(n)
For Each y In ys
'Force query execution lest lazy
Next
End Sub
Where
Dim CountryFromCode As Dictionary(Of String, Country)
Dim CitiesFromCountry As _
Dictionary(Of Country, List(Of City))
And where we build the indices as follows:
Sub BuildIndices()
' "CountryFromCode" looks up country object from a "code" string.
' Timing to build this index is O(Countries).
CountryFromCode = New Dictionary(Of String, Country)
For Each cn In Countries
CountryFromCode.Add(cn.Code, cn)
Next
' "CitiesFromCountry" looks up a list of cities
' from a country.
' Time to build index is O(Cities)
CitiesFromCountry = _
New Dictionary(Of Country, List(Of City))
For Each ct In Cities
' Missing code is worthy of an uncaught exception
Dim d = CountryFromCode(ct.Country)
Dim cts As List(Of City) = Nothing
If Not CitiesFromCountry.TryGetValue(d, cts) Then
cts = New List(Of City)
CitiesFromCountry.Add(d, cts)
End If
cts.Add(ct)
Next
End Sub
We’re not being very clever about building these indices. For instance, we check for existence of the city list for every city. It doesn’t matter, however, because it’s so much slower to loop over all 600,000 combinations that 2,858 redundant checks, by comparison, are negligible. Also, if we we’re really clever, we would only build one of the two indices, not both, choosing between them by estimating the time to build each and picking the fastest. Real databases do this kind of estimation as part of Query Planning, and a great deal of sophistication of SQL Server 2005, for instance, is devoted to such superoptimization.
For our illustrative purposes, though, we’re more interested in clarity, flexibility, and simplicity. Even though we’re overkilling in the index-building department, it’s still far cheaper than doing the slow, nested loop. It’s still just O(Number of Cities). Now, before I show you how I prove these claims, let me just jump to some results and explain what they mean. Then I’ll show you how to reproduce these timings:
================================================================
In-memory timings (hecta mu):
+-----+--------+--------+--------+--------+
| NRws| SlowQry| FastQry|IdxBuild| FastTot|
+-----+--------+--------+--------+--------+
| 3053| 1405.87| 10.77| 10.88| 21.64|
| 1413| 330.15| 5.19| 5.25| 10.44|
| 891| 105.39| 3.25| 3.23| 6.48|
| 329| 19.20| 1.25| 1.27| 2.52|
| 89| 2.91| 0.44| 0.43| 0.87|
| 28| 0.62| 0.20| 0.20| 0.40|
| 4| 0.18| 0.11| 0.11| 0.22|
| 2| 0.13| 0.09| 0.10| 0.19|
+-----+--------+--------+--------+--------+
================================================================
Wrote data in Excel format to file ..\..\PerfData31A5.xml
Press "Enter" to end the program ...
First, my dimensions. I present timings in hundreds of microseconds, or hecta-mus, just because they come out as nice, digestible magnitudes, where we can get a feel for them without doing a lot of conversions in our heads. The first column shows the number of city-rows over which we do a join. We vary this number by picking random subsets of the entire city set so we can find out how small a database we need before it’s cheaper to do the slow, nested-loop query than to build the index, use it in the fast query, and then throw it away. The second column is the timing of the slow query in hecta-mus. The third column is the timing of the fast query; the fourth column is the timing of the building of a fresh index over the randomly chosen subset; and the last column is the sum of those two. So, the numbers to compare are the second column versus the last column. We write out these timings to an Excel file (using XML literals, which we show off later), so we can chart them, as follows:
But we don’t even need the chart to see that it’s not until we have well fewer than 28 rows does it pay to even think of doing the nested-loop instead of the build-index, use-index, throw-index-away pattern. In other words, for all practical purposes, never.
So, how do we create these results? For high-resolution timing, I’ll just use the underlying Win32 performance counters. I think there is a .NET way to do this, but I just happen to know the Win32 interop programming pattern in my fingers, so here’s my cut at it. The following is a general facility for timing the execution of ANY function-of-no-arguments, which I’m calling a thunk here. This usage of the word “thunk” is borrowed from the Scheme literature: in the Windows zeitgeist it usually means something else, like a shim or a trampoline, so don’t be confused. I’m not using the term in that sense, here.
Public Delegate Sub Thunk()
Public Module HiRezTimer
Function TimeAThunk(ByVal Something As Thunk) As Double
Dim startCount As Int64 = QPC()
Something()
Dim stopCount As Int64 = QPC()
TimeAThunk = (stopCount - startCount) / QPF()
End Function
<System.Runtime.InteropServices.DllImport("Kernel32.dll")> _
Public Function QueryPerformanceCounter(ByRef perfcount As Int64) As Boolean
End Function
<System.Runtime.InteropServices.DllImport("Kernel32.dll")> _
Public Function QueryPerformanceFrequency(ByRef freq As Int64) As Boolean
End Function
Public Function QPC() As Int64
Dim perfcount As Int64
QueryPerformanceCounter(perfcount)
Return perfcount
End Function
Public Function QPF() As Int64
Dim freq As Int64
QueryPerformanceFrequency(freq)
Return freq
End Function
End Module
And here is how I use it to produce the results above:
Sub TimeOne(ByVal n As Integer)
Dim t As New TimingRow
t.ln = n
t.q1 = Hecta(Micro(TimeAThunk(AddressOf SlowQuery)))
t.ix = Hecta(Micro(TimeAThunk(AddressOf BuildSubIndices)))
t.q2 = Hecta(Micro(TimeAThunk(AddressOf FastQuery)))
RawRows.Add(t)
Console.WriteLine("|{0,5}|{1,8:0.00}|{2,8:0.00}|{3,8:0.00}|{4,8:0.00}|", t.ln, t.q1, t.ix, t.q2, t.ix + t.q2)
End Sub
Sub DoATiming()
DrawLine()
RawRows = New List(Of TimingRow)
Console.WriteLine("In-memory timings (hecta mu):")
BuildIndices()
CountrySubset = Countries
CitySubset = Cities
Console.WriteLine("+-----+--------+--------+--------+--------+")
Console.WriteLine("| NRws| SlowQry| FastQry|IdxBuild| FastTot|")
Console.WriteLine("+-----+--------+--------+--------+--------+")
TimeOne(CitySubset.Count)