Fun Hash Joins with VB9 LINQ

In the terminology of relational databases, a “join” is, semantically, like a nested loop over a pair of lists (or tables) of records, saving only those where some certain fields match. Unless we do something smart, this could be very expensive. Imagine searching a database of a million DNA profiles for the closest match to a sample that has 10,000 DNA features (I have no idea whether those are real numbers: I just made them up, but they sound ballpark to me). A dumb join would search all 1 million profiles for each of the 10,000 features, resulting in 10 billion match tests, almost all of which will fail – by design, of course. That’s going to hurt.

 

The “something smart” is to build an index and search through that. Your database doesn’t have to be large at all for this to pay off. In fact, even with just a few records, it’s cheaper to build an index, use it, and throw it away than it is to do a nested loop. Let’s prove that with a small sample. First, we’ll do everything with objects in memory just to keep it easy. In a later blog entry, I’ll show how to do exactly the same thing with DLINQ and SQL Express. So, in this entry, when I say “database,” I mean collection of objects in memory. I still have a bunch of interesting new VB9 things to show off, though, including object initializers, XML literals, and query comprehensions.

 

My partner, Erik Meijer, showed me a fantastic little database called Mondial, free for educational and research use. Huzzah! That’s us. It’s full of fun factoids about cities and countries and languages and politics. It’s downloadable in SQL, text, or XML. It illustrates most of the technical problems you’d encounter with a more typical business sample, you know, the old boring “Customers and Orders,” but it’s more likely to keep you awake.

 

Let’s start off with just two of the tables in Mondial: Countries and Cities. Turns out there are 195 countries and 3,053 cities in the database. Each Country has a unique “Code,” which is a very short unique string. The database can use this as a primary key: it only needs to check the “Code” field for uniqueness of a Country record. Here’s the VB for a class representing Mondial countries:

 

Public Class Country

    Public Name As String

    Public Code As String 'Primary key

    Public Capital As String

    Public Province As String

    Public Area As Integer

    Public Population As Integer

End Class

 

And here are a few instances, using VB9’s “object-initializer” syntax (you will need the VB9 CTP [Customer Technology Preview] to work with the code in this post):

 

{Name := "Albania", Code := "AL", Capital := "Tirane", Province := "Albania", Area := 28750, Population := 3249136}

 

{Name := "Andorra", Code := "AND", Capital := "Andorra la Vella", Province := "Andorra", Area := 450, Population := 72766}

 

{Name := "Spain", Code := "E", Capital := "Madrid", Province := "Madrid", Area := 504750, Population := 39181114}

 

Fun facts and figures already! How about cities? Here’s a class for Mondial cities:

 

Public Class City

    Public Name As String

    Public Country As String 'Foreign-key to Country.Code

    Public Province As String

    Public Population As Integer

    Public Longitude As Double

    Public Latitude As Double

End Class

 

And some samples:

 

{Name := "Ankara", Country := "TR", Province := "Ankara", Population := 2782200, Longitude := 32.8833, Latitude := 39.95}

 

{Name := "Helsinki", Country := "SF", Province := "Uusimaa", Population := 487428, Longitude := 24.95, Latitude := 60.1667}

 

{Name := "Rangoon", Country := "MYA", Province := "Yangon", Population := 2513000, Longitude := 96.15, Latitude := 16.7833}

 

Ok, so now we can see how Mondial links cities to their countries: City.Country == Country.Code, that is, the “Country” field of each city must match the “Code” field of the corresponding country. A city record doesn’t contain the name of its country, just the country’s code. So, unless we’ve memorized all 195 country codes, we can’t look at any old city and tell what country it’s in.

 

Database join to the rescue! We’ll use City.Country == Country.Code as our “join condition” and print out a list of cities with the names of their countries retrieved from the list of countries.

 

The initializers for all 195 countries and 3,053 cities fill up about, well, about 3,248 lines of code, which is 50 pages or so and too large to post. So, I think the best way to get them to you is to tell you how I made them. From the Mondial site, I copied one of the SQL formats, which contains 3,248 INSERT VALUES statements. Using emacs keyboard macros, I manually wrestled these statements into the appropriate VB format, shown in the pattern below.

 

Public Module DataInitializer

 

    Public Function CountryData() As List(Of Country)

        Dim cs As New List(Of Country)

        cs.Add(New Country {Name := "Albania", Code := "AL", Capital := "Tirane", Province := "Albania", Area := 28750, Population := 3249136})

        cs.Add(New Country {Name := "Greece", Code := "GR", Capital := "Athens", Province := "Greece", Area := 131940, Population := 10538594})

        cs.Add(New Country {Name := "Macedonia", Code := "MK", Capital := "Skopje", Province := "Macedonia", Area := 25333, Population := 2104035})

...

        CountryData = cs

    End Function

 

    Public Function CityData() As List(Of City)

        Dim cs As New List(Of City)

        cs.Add(New City {Name := "Luxembourg", Country := "L", Province := "Luxembourg", Population := 76600, Latitude := 6.08, Longitude := 49.4})

        cs.Add(New City {Name := "Monaco", Country := "MC", Province := "Monaco", Population := 1234, Longitude := 7.2, Latitude := 43.7})

        cs.Add(New City {Name := "San Marino", Country := "RSM", Province := "San Marino", Population := 4416, Longitude := 12.2, Latitude := 43.5})

...

        CityData = cs

    End Function

 

End Module

 

The whole process took me 10 minutes, including making and correcting the inevitable ham-handed mistakes with emacs regular expressions. Someone with good Perl skills could probably have done the job even more quickly. I do NOT recommend trying this with search-and-replace in the VB IDE, because the VB pretty lister will try to format each of the 3,248 lines one at a time, and it will hurt.

 

Let’s start up a VB module, declare a couple of variables to hold all our data, and a subroutine to fill them up.:

 

Module Module1

    Dim Countries As List(Of Country)

    Dim Cities As List(Of City)

...

 

    Sub Initialize()

        Countries = DataInitializer.CountryData

        Cities = DataInitializer.CityData

    End Sub

 

Now, what’s the most straightforward way to do the join? How about the following:

    Dim xs As IEnumerable

    ' O(Cities) * O(Countries)

    Sub SlowQuery()

        xs = Select New {Cty := it.t.Name, Cny := it.n.Name} _

            From t In CitySubset, n In CountrySubset _

            Where it.t.Country = it.n.Code

 

        For Each x In xs

            'Print or whatever... must do something to force

            'query execution, however.

        Next

    End Sub

 

This sub uses anonymous types, data initializers, and a query comprehension – all new VB9 features – to build an IEnumerable containing city names and their country names. Its execution time is of order 3,053*195 =~ 600,000; painful. Also note carefully that the LINQ query comprehension just sets up the query without executing it. Not until we do something with the result, xs, does any searching and matching take place. LINQ query execution is lazy.

 

Most of these 600,000 comparisons will fail, of course, because they’re checking whether Madrid is in Zimbabwe or Capetown is in Iceland, and so on. In fact, we know that exactly 3,053 of them will succeed. We can be a lot smarter.

 

Suppose we had an index that would let us look up the country for any city, or, equivalently, the list of cities for any country? Then, we could write something like the following:

 

    Dim ys As IEnumerable

    ' O(Number of Cities)

    Sub FastQuery0()

        ys = Select New {Cty := it.t.Name, Cny := it.n.Name} _

            From n In Countries, t In CitiesFromCountry(n)

 

        For Each y In ys

            'Force query execution lest lazy

        Next

    End Sub

 

Where

 

    Dim CountryFromCode As Dictionary(Of String, Country)

    Dim CitiesFromCountry As _

        Dictionary(Of Country, List(Of City))

 

And where we build the indices as follows:

 

    Sub BuildIndices()

        ' "CountryFromCode" looks up country object from a "code" string.

        ' Timing to build this index is O(Countries).

        CountryFromCode = New Dictionary(Of String, Country)

        For Each cn In Countries

            CountryFromCode.Add(cn.Code, cn)

        Next

        ' "CitiesFromCountry" looks up a list of cities

        ' from a country.

        ' Time to build index is O(Cities)

        CitiesFromCountry = _

            New Dictionary(Of Country, List(Of City))

        For Each ct In Cities

            ' Missing code is worthy of an uncaught exception

            Dim d = CountryFromCode(ct.Country)

            Dim cts As List(Of City) = Nothing

            If Not CitiesFromCountry.TryGetValue(d, cts) Then

                cts = New List(Of City)

                CitiesFromCountry.Add(d, cts)

            End If

            cts.Add(ct)

        Next

    End Sub

 

We’re not being very clever about building these indices. For instance, we check for existence of the city list for every city. It doesn’t matter, however, because it’s so much slower to loop over all 600,000 combinations that 2,858 redundant checks, by comparison, are negligible. Also, if we we’re really clever, we would only build one of the two indices, not both, choosing between them by estimating the time to build each and picking the fastest. Real databases do this kind of estimation as part of Query Planning, and a great deal of sophistication of SQL Server 2005, for instance, is devoted to such superoptimization.

 

For our illustrative purposes, though, we’re more interested in clarity, flexibility, and simplicity. Even though we’re overkilling in the index-building department, it’s still far cheaper than doing the slow, nested loop. It’s still just O(Number of Cities). Now, before I show you how I prove these claims, let me just jump to some results and explain what they mean. Then I’ll show you how to reproduce these timings:

 

================================================================

In-memory timings (hecta mu):

+-----+--------+--------+--------+--------+

| NRws| SlowQry| FastQry|IdxBuild| FastTot|

+-----+--------+--------+--------+--------+

| 3053| 1405.87|   10.77|   10.88|   21.64|

| 1413|  330.15|    5.19|    5.25|   10.44|

|  891|  105.39|    3.25|    3.23|    6.48|

|  329|   19.20|    1.25|    1.27|    2.52|

|   89|    2.91|    0.44|    0.43|    0.87|

|   28|    0.62|    0.20|    0.20|    0.40|

|    4|    0.18|    0.11|    0.11|    0.22|

|    2|    0.13|    0.09|    0.10|    0.19|

+-----+--------+--------+--------+--------+

================================================================

Wrote data in Excel format to file ..\..\PerfData31A5.xml

Press "Enter" to end the program ...

 

First, my dimensions. I present timings in hundreds of microseconds, or hecta-mus, just because they come out as nice, digestible magnitudes, where we can get a feel for them without doing a lot of conversions in our heads. The first column shows the number of city-rows over which we do a join. We vary this number by picking random subsets of the entire city set so we can find out how small a database we need before it’s cheaper to do the slow, nested-loop query than to build the index, use it in the fast query, and then throw it away. The second column is the timing of the slow query in hecta-mus. The third column is the timing of the fast query; the fourth column is the timing of the building of a fresh index over the randomly chosen subset; and the last column is the sum of those two. So, the numbers to compare are the second column versus the last column. We write out these timings to an Excel file (using XML literals, which we show off later), so we can chart them, as follows:

 

 

But we don’t even need the chart to see that it’s not until we have well fewer than 28 rows does it pay to even think of doing the nested-loop instead of the build-index, use-index, throw-index-away pattern. In other words, for all practical purposes, never.

 

So, how do we create these results? For high-resolution timing, I’ll just use the underlying Win32 performance counters. I think there is a .NET way to do this, but I just happen to know the Win32 interop programming pattern in my fingers, so here’s my cut at it. The following is a general facility for timing the execution of ANY function-of-no-arguments, which I’m calling a thunk here. This usage of the word “thunk” is borrowed from the Scheme literature: in the Windows zeitgeist it usually means something else, like a shim or a trampoline, so don’t be confused. I’m not using the term in that sense, here.

 

Public Delegate Sub Thunk()

 

Public Module HiRezTimer

 

    Function TimeAThunk(ByVal Something As Thunk) As Double

        Dim startCount As Int64 = QPC()

        Something()

        Dim stopCount As Int64 = QPC()

        TimeAThunk = (stopCount - startCount) / QPF()

    End Function

 

    <System.Runtime.InteropServices.DllImport("Kernel32.dll")> _

    Public Function QueryPerformanceCounter(ByRef perfcount As Int64) As Boolean

 

    End Function

 

    <System.Runtime.InteropServices.DllImport("Kernel32.dll")> _

    Public Function QueryPerformanceFrequency(ByRef freq As Int64) As Boolean

 

    End Function

 

    Public Function QPC() As Int64

        Dim perfcount As Int64

        QueryPerformanceCounter(perfcount)

        Return perfcount

    End Function

 

    Public Function QPF() As Int64

        Dim freq As Int64

        QueryPerformanceFrequency(freq)

        Return freq

    End Function

 

End Module

 

And here is how I use it to produce the results above:

 

    Sub TimeOne(ByVal n As Integer)

        Dim t As New TimingRow

 

        t.ln = n

        t.q1 = Hecta(Micro(TimeAThunk(AddressOf SlowQuery)))

        t.ix = Hecta(Micro(TimeAThunk(AddressOf BuildSubIndices)))

        t.q2 = Hecta(Micro(TimeAThunk(AddressOf FastQuery)))

 

        RawRows.Add(t)

        Console.WriteLine("|{0,5}|{1,8:0.00}|{2,8:0.00}|{3,8:0.00}|{4,8:0.00}|", t.ln, t.q1, t.ix, t.q2, t.ix + t.q2)

    End Sub

 

    Sub DoATiming()

        DrawLine()

 

        RawRows = New List(Of TimingRow)

        Console.WriteLine("In-memory timings (hecta mu):")

 

        BuildIndices()

 

        CountrySubset = Countries

        CitySubset = Cities

 

        Console.WriteLine("+-----+--------+--------+--------+--------+")

        Console.WriteLine("| NRws| SlowQry| FastQry|IdxBuild| FastTot|")

        Console.WriteLine("+-----+--------+--------+--------+--------+")

        TimeOne(CitySubset.Count)

        TimeOne(BuildSubsets(1, 2).m)

        TimeOne(BuildSubsets(1, 4).m)

        TimeOne(BuildSubsets(1, 8).m)

        TimeOne(BuildSubsets(1, 16).m)

        TimeOne(BuildSubsets(1, 32).m)

        TimeOne(BuildSubsets(1, 64).m)

        TimeOne(BuildSubsets(1, 128).m)

        Console.WriteLine("+-----+--------+--------+--------+--------+")

    End Sub

 

Over in the article section, I’ll post the code for “BuildSubsets,” along with all this code in complete VB modules. It’s pretty obvious, dumb, actually. But remember, it doesn’t need to be fast or smart, just clear and easy.

 

The last thing to show off here is how easy it is to write out the Excel version of the timing data using VB9 XML literals. It literally took me less than half an hour to do this, knowing next to nothing about Excel’s XML support. I still know next to nothing about it: I don’t need to know. The timing data goes in a “TimingRow,” which looks like this:

 

Public Class TimingRow

    Public ln As Integer

    Public q1 As Double

    Public ix As Double

    Public q2 As Double

End Class

 

Length, query1, index-build, and query2, yup yup. In the timing routines above, we not only Console.WriteLine the results, but fill new timing rows with them, so we can add a timing row to a list of “RawRows,” declared as follows:

 

Module ExcelWriter

    Dim xl As XDocument

    Public RawRows As List(Of TimingRow)

...

 

In our main routine, after we’re done with all the timings, we call

        ExcelWriter.SaveDocument()

 

Which looks like this:

 

    Public Sub SaveDocument()

        PrepareOutput()

        Dim filename = NonCollidingFileName("..\..\PerfData", ".xml")

        Dim str As New StringWriter

        xl.Save(str)

        Using sw = File.CreateText(filename)

            sw.WriteLine(str)

            sw.Close()

        End Using

        Console.WriteLine("Wrote data in Excel format to file {0}", filename)

    End Sub

 

The only interesting line, here, is “PrepareOutput;” the code for “NonCollidingFileName” is over in the article section with the rest of the entire sample. So, here is “PrepareOutput:”

 

    Private Sub PrepareOutput()

        xl = <?xml version="1.0"?>

             <?mso-application progid="Excel.Sheet"?>

             <Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"

                 xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">

                 <Worksheet ss:Name="Sheet1">

                     <Table>

                         <%= MakeRow("Length", "Qry1", "Indx", "Qry2", "Tot2") %>

                         <%= Select MakeRow(c.ln, c.q1, c.ix, c.q2) _

                             From c In RawRows %>

                     </Table>

                 </Worksheet>

             </Workbook>

    End Sub

 

This creates the literal XML document. I got this document template by creating a bogey document, with constant data instead of variable data, in Excel by hand, choosing “Save As XML,” opening up the result in Notepad, and cutting and pasting the result over to my VB code. I didn’t even have to search for documentation on the XML format for Excel files, let alone read, understand, or learn it. Just blind cut and paste. Neat.

 

I replaced the constant rows with VB XML holes, bracketed by the <%= ... %> syntax. I created two overloads of “MakeRow,” one for column labels, another for the numerical data, calling one straight up and the other through a “Select” comprehension. Neat. Here are those two overloads:

 

    Private Function MakeRow(ByVal a As Integer, ByVal b As Double, _

    ByVal c As Double, ByVal d As Double) As XElement

        MakeRow = <Row xmlns="urn:schemas-microsoft-com:office:spreadsheet"

                      xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"

                      ss:AutoFitHeight="0">

                      <Cell><Data ss:Type="Number"><%= a %></Data></Cell>

                      <Cell><Data ss:Type="Number"><%= b %></Data></Cell>

                      <Cell><Data ss:Type="Number"><%= c %></Data></Cell>

                      <Cell><Data ss:Type="Number"><%= d %></Data></Cell>

                      <Cell ss:Formula="=RC[-2]+RC[-1]"></Cell>

                  </Row>

    End Function

 

    Private Function MakeRow(ByVal a As String, ByVal b As String, _

    ByVal c As String, ByVal d As String, ByVal e As String) As XElement

        MakeRow = <Row xmlns="urn:schemas-microsoft-com:office:spreadsheet"

                      xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"

                      ss:AutoFitHeight="0">

                      <Cell><Data ss:Type="String"><%= a %></Data></Cell>

                      <Cell><Data ss:Type="String"><%= b %></Data></Cell>

                      <Cell><Data ss:Type="String"><%= c %></Data></Cell>

                      <Cell><Data ss:Type="String"><%= d %></Data></Cell>

                      <Cell><Data ss:Type="String"><%= e %></Data></Cell>

                  </Row>

    End Function

 

More XML literals via blind cut-and-paste, with holes inserted via educated guesswork. Notice that the last line in the numerical overload writes out a constant cell, it’s just a constant formula in Excel, that uses relative cell references to calculate the totals. Neat.

 

So there you have it. How to do dumb queries, hash joins, time your results, and write them out in an Excel file, in one really short VB9 program. One last little detail, you really do need to do the timings TWICE and ignore the first set of results, because the CLR Jitter will need to warm up on the first go-round. This could be prevented by NGenning the program before running it, but it’s easier this way, and this post is all about how to do it the easy way. So here is my main routine, which reports that we should ignore the first timing:

 

    Sub Main()

        Initialize()

        Console.WriteLine("Ignore first timing (JIT warmup)")

        DoATiming() 'First one just warms up the Jit

        DoATiming()

        DrawLine()

        ExcelWriter.SaveDocument()

        Console.Write("Press ""Enter"" to end the program ...")

        Console.ReadLine()

        Exit Sub

    End Sub

 

Published Wednesday, March 15, 2006 10:04 AM by brianbec

Comments

# re: Fun Hash Joins with VB9 LINQ@ Thursday, March 16, 2006 6:44 AM

Re DNA: Depending on what you're doing, you're probably not too far off. Of course, the total number of features from which 10 thousand could be selected is on the order of >1000x greater, if you're thinking of SNPs.

Anon

# re: Fun Hash Joins with VB9 LINQ@ Wednesday, March 22, 2006 6:38 AM

You might be interested to know about the System.Diagnostics.Stopwatch class.

damien morton