March 2006 - Posts
Take three PhD specialists in programming languages, three PhD specialists in relational database, and three PhD specialists in network security, and seat them around a table. Now take three coconut-cream pies, three chocolate-cream pies, and three banana-cream pies and randomly place one pie in front of each philosopher. Finally, give them three hours to come up with a position paper on the concept of "Identity." Calculate how many philosophers at the end of three hours have three kinds of cream-pie on their faces, how many have two kinds, and how many have just one kind.
Ok, I thought it was funny, but then, I've been involved in a lot of design and standards committees. The point is, no single concept of "Identity" works for all three. In fact, no concept of Identity works for any pair of the three, and it's only barely possible to get a concept that works for any one of the three. Let's illustrate this just for programming languages and relational databases.
Consider the following data, stated in English:
Venice is in Italy.
Rome is in Italy.
Nogales is in Mexico.
Nogales is in the USA.
Hey, wait a minute. One city in two countries? Sure. Happens all the time. Trieste straddles Slovenia and Italy. How about Bratislava? It's in three countries: Slovakia, Hungary, and Austria. Sault Ste. Marie is in both Canada and the USA. And those are just the physically connected examples. The USA has at least one Rome – in New York – and at least one Venice – in California.
Ok, the point is that our Mondial database is going to have to deal with the fact that the Country-City relationship is really many-to-many, not one-to-many, and that means it's going to have to deal with Identity in a deeper way than we sketched out prior to this.
How would our three database specialists represent the data above? Experienced practitioners will normalize the data, that is, remove redundancies, resulting in three tables: one for the countries, one for the cities, and one, called a join table, middle table, or intersection table, for the relationship. The primary key for the Country table will be just the name of the country. The primary key for the City table will be just the name of the city. The middle table will contain pairs of foreign keys. One member of each pair will be a city name, and the other member of each pair will be a country name. The primary key of the middle table will be the pair of foreign keys. In pictures:
| Table Labels: | | CountryTable | | CityTable | | CountryCityTable |
| Column Labels: | | Country | | City | | Country | City |
| Data Rows: | | Italy | | Venice | | Italy | Venice |
| | | Mexico | | Rome | | Italy | Rome |
| | | USA | | Nogales | | Mexico | Nogales |
| | | | | | | USA | Nogales |
So we see that the only way to refer to a country or a city in the middle table is by a copy of its primary key. Therefore, in this normalized database,
Two rows are identical if and only if their primary keys have the same value.
Many database practitioners use underlining on the labels of columns containing primary keys; we'll use bold. Since every column contains a component of a primary key, all column labels are bold.
There is also an obvious non-normalized view form of the same data. Non-normalized basically means that this form contains mathematically redundant information, and this can cause ambiguity problems in the identity department. Typically, such a form would be temporary, not the permanent form, for constructing reports, building queries, analyzing by-products, and what not. It would look something like this:
| Table Labels: | | CountryFromCityTable | | CityFromCountryTable |
| Column Labels: | | City | Country | | Country | City |
| Data Rows: | | Venice | Italy | | Italy | Venice |
| | | Rome | Italy | | Italy | Rome |
| | | Nogales | Mexico | | Mexico | Nogales |
| | | Nogales | USA | | USA | Nogales |
What's the natural way to express this scenario in Visual Basic? Consider our running "Mondial" example again, but stripped way, way down:
Public Class Country
Public Name As String
Public MyCities As List(Of City)
End Class
Public Class City
Public Name As String
Public MyCountries As List(Of Country)
End Class
...
Sub InitData()
Dim Italy = New Country {Name := "Italy"}
Dim USA = New Country {Name := "USA"}
Dim Mexico = New Country {Name := "Mexico"}
Dim Venice = New City {Name := "Venice"}
Dim Rome = New City {Name := "Rome"}
Dim Nogales = New City {Name := "Nogales"}
Italy.MyCities = New List(Of City)
Mexico.MyCities = New List(Of City)
USA.MyCities = New List(Of City)
Mexico.MyCities.Add(Nogales)
Italy.MyCities.Add(Venice)
Italy.MyCities.Add(Rome)
USA.MyCities.Add(Nogales)
Venice.MyCountries = New List(Of Country)
Rome.MyCountries = New List(Of Country)
Nogales.MyCountries = New List(Of Country)
Venice.MyCountries.Add(Italy)
Rome.MyCountries.Add(Italy)
Nogales.MyCountries.Add(Mexico)
Nogales.MyCountries.Add(USA)
Countries.Add(Italy)
Countries.Add(USA)
Countries.Add(Mexico)
Cities.Add(Venice)
Cities.Add(Rome)
Cities.Add(Nogales)
End Sub
That all just represents a graph of objects in memory, which we can sketch as follows:
_.------------------------.
,-'' `v
+-------+-/-+---+ +---+--------+
| Italy | o | o |<-------------------= o | Venice |
+-------+---+--\+ +---+--------+
^ '-.
`--. `--.
`-------. `-----------.
`-------. `---.
`------. `.
`-. `v
+-------+---+ +\--+------+
| USA | o-=--------. | o | Rome |
+-------+---+ `--. +---+------+
^. `-----.
`----------. `---.
`-----. `.
`--. `--.
,------------. `--. `-.
v `-. `--. v
+--------+---+ `-. +---+-\-+---------+
| Mexico | o | `-= o | o | Nogales |
+--------+-\-+ +---+---+---------+
`-. ^
`----. /
`----------------------'
The point is that with objects in memory, the natural representation of the data is through references, and we may conclude that
Two objects are identical if and only if their references are identical.
So, indeed, since the USA object has a reference to Nogales that is identical to the reference to Nogales contained in the Mexico object, these two Nogales's are (drum roll) identical. Likewise, Venice has a reference to Italy, and Rome has a reference to the same, identical, Italy object.
Now, we might duplicate exactly this situation in a relational database, but we would have to introduce additional data for references, along the following lines:
| CountryTable | | CityTable | | CoutryCityTable |
| Country | CnyPK | | City | CtyPK | | CnyFK | CtyFK |
| Italy | 37 | | Venice | 26 | | 37 | 26 |
| Mexico | 162 | | Rome | 8 | | 37 | 8 |
| USA | 49 | | Nogales | 33 | | 49 | 33 |
| | | | | | | 162 | 33 |
In fact, this is the kind of automatic solution that object-oriented databases would typically compute on your behalf to persist an object graph for you. It just substitutes unique numbers for object references, a process called swizzling. But it still MUST USE a middle table, because there is no good way to represent a variable-length list of references in a column of a table. Most solutions for nested tables, in fact, just un-nest the nested information into additional tables, just as above. So, even though this appears to be a simulation of the object-oriented graph, it doesn't take much to see that it's a totally uninteresting transformation of the normalized relational database form: it just added numerical keys, and it still uses value-equality as the definition for identity.
The point I'm trying to make, here, is that object-oriented programs and relational databases have totally incompatible notions of identity, and that the incompatibilty opens up all kinds of downstream complexities. The various schemes for resolving the incompatibilities are mapping solutions. The world of mapping is very, very complicated, because mapping solutions typically try to resolve not just the mis-match in the notions of identity, but to provide all kinds of operational services, incorporating, for instance, updatable non-normalized views, and support for badly designed legacy databases, which are far too valuable to throw away but far too expensive to normalize.
Sorry, I don't have a bolt of lightning that, with blinding obviousness and clarity, puts this entire topic into order. I think it's best just to stock up on towels and view it as a never-ending pie fight we just have to deal with. That way, we'll interpret it as funny instead of troubling.
Buckle up, boys and girls, we’re going to be little DBA’s today ! (Data Base Administrators)
Earlier, we showed how to use LINQ over objects in memory. Today, we show how to write EXACTLY the same program, this time using DLINQ, over a real SQL-Express relational database. The whole DLINQ project, including the full data set of 3,053 rows, can be downloaded here. The earlier LINQ project – same program only over objects-in-memory, can be downloaded here. You will need the VB9 CTP (Customer Technology Preview), plus some patience. That is a pre-alpha version of VB, and you may have to create new projects and copy the source files (I have had some “inside” reports that this is so).
The point of both programs is to compare the time for a fast query to the time for a slow query. To deliver the punch line, even with 14 cities in a tiny DLINQ database, the slow query is 20 times slower than the fast one .
================================================================
SlowQuery1 found 14 cities, Time = 54.04 milliseconds
FastQuery1 found 14 cities, Time = 2.12 milliseconds
================================================================
With the full database of 3,053 cities, the results are crushing:
================================================================
SlowQuery1 found 3053 cities, Time = 46109.31 milliseconds
FastQuery1 found 3053 cities, Time = 40.53 milliseconds
================================================================
The data “entities” are from the Mondial database, a fun and free collection of facts and figures about the world. We start off small with just one “relationship” between cities and countries: each country has many cities, expressed through the “join condition” Country.Code == City.Country. In the in-memory case, we firmly establish that, practically speaking, it’s ALWAYS better to build an index, search the index, and throw the index away than to scan the relationship via a nested double loop. In the DLINQ case, we’ll use DLINQ to set up foreign keys in the Database and scan them in the fast query. Here is the query code for the DLINQ case (and we explain all the details below):
Sub SlowQuery1()
xs = Select New {Cty := it.t.Name, Cny := it.n.Name} _
From t In db.Cities, n In db.Countries _
Where it.t.Country = it.n.Code
xc = CountIEnum(xs)
End Sub
Sub FastQuery1()
xs = Select New {Cty := t.Name, Cny := t.MyCountry.Name} _
From t In db.Cities
xc = CountIEnum(xs)
End Sub
The neat thing is that DLINQ lets us express our relationship as a PROPERTY, “MyCountry,” on the City class. Neat. We don’t need an index in memory to get fast query. In my sample, I do build such an index, but I use it ONLY to set up the database quickly. I don’t use it at all in the queries and it’s absolutely not necessary.
Did I say that we’re making our SQL database FROM SCRATCH? Surprisingly, this is unusual. Almost all the samples I’ve seen magically install some big messy database and then show you how to manipulate it. They never show you how to create it. What’s up with that? I want to do EVERYTHING from my VB program. I don’t want to learn a bunch of SQL-Express GUI DBA tools too. Given the choice between “Hello World!” and “Hello Northwind!” I’ll take “Hello World!” every time. Wouldn’t you? Well, “Hello World!” translates into “Hello Mondial!”
Let’s start with class definitions. Remember, the Country class definition for our objects-in-memory LINQ version was:
Public Class Country
Public Name As String
Public Code As String 'Primary key
Public Capital As String
Public Province As String
Public Area As Integer
Public Population As Integer
End Class
The DLINQ version is, mostly, very similar. Visually, it’s more junky looking, but, fear not, I walk you through all of it:
<Table()> _
Public Class Country
<Column(DbType:="VARCHAR(32)")> _
Public Name As String
<Column(DbType:="VARCHAR(4)", Id:=True)> _
Public Code As String
<Column(DbType:="VARCHAR(35)")> _
Public Capital As String
<Column(DbType:="VARCHAR(32)")> _
Public Province As String
<Column(DbType:="INT")> _
Public Area As Integer
<Column(DbType:="INT")> _
Public Population As Integer
Private _myCities As New EntitySet(Of City)
<Association(Storage:="_myCities", OtherKey:="Country")> _
Public Property MyCities() As EntitySet(Of City)
Get
Return _myCities
End Get
Set(ByVal value As EntitySet(Of City))
_myCities.Assign(value)
End Set
End Property
End Class
The custom attribute <Table()> tells DLINQ that we want the class stored as a database table. The VB compiler ignores CLR custom attributes – it just copies them into assembly metadata so that libraries can read them through Reflection if they need to. That’s just what the DLINQ library does – it checks our class definitions at run time to see if they’re Tables.
Next, list out all the same fields as before, just decorated with <Column(…)> attributes to tell DLINQ what SQL types we want stored. I think there are some intelligent defaults I could have used, saving explicit specifying of these details, but I wanted to play it really safe, not testing those waters, so I slavishly set up column attributes for every field. The Mondial original was written in SQL, not VB, so it wasn’t hard to figure out the size of VARCHARS I needed and what not.
The new DLINQ EntitySet type and Association attribute are clearly the interesting things. They represent a recurring DLINQ pattern for one-to-many relationships. Over here, in the Country class, we’re on the ONE side of the one-to-many relationship. We must have a corresponding Association attribute on the MANY side, in the City class; we get to it in a minute. Let’s go through this side line-by-line:
Private _myCities As New EntitySet(Of City)
Ok, there is a private field that DLINQ will use behind our backs to store a “list” of cities, in effect, for each Country.
<Association(Storage:="_myCities", OtherKey:="Country")> _
Ok, DLINQ now knows the name of the storage field it can party on; it’s Country._myCities. It also knows the name of the field in the OTHER side of the association that points back to Country instances, here. Remember, our “join” condition is that City.Country == Country.Code. The OtherKey portion tells DLINQ that the field named “Country” in the OTHER side of the association hooks up to the primary key on THIS side of the relationship, which, oh-by-the-way, we already declared thus:
<Column(DbType:="VARCHAR(4)", Id:=True)> _
Public Code As String
The word “Country” in the fragment OtherKey:="Country" means the field named “Country” in the other side of the association, not the class named “Country” on this side of the association. Sorry about this, but that’s how Mondial was designed and we’re going to live with it.
Public Property MyCities() As EntitySet(Of City)
Get
Return _myCities
End Get
Set(ByVal value As EntitySet(Of City))
_myCities.Assign(value)
End Set
End Property
Ok, that is more-or-less standard stuff: it defines the public property allowing other parts of my program to interact with the EntitySet. DLINQ parties on _myCities behind the scenes, but we get to party on the public Property MyCities. The only thing to remember is to use the Assign method of EntitySet so DLINQ can keep track of the publicly made changes. Now, we just look at the City class and we’re done with almost everything difficult:
<Table()> _
Public Class City
<Column(DbType:="VARCHAR(35)", Id:=True)> _
Public Name As String
<Column(DbType:="VARCHAR(4)", Id:=True)> _
Public Country As String
<Column(DbType:="VARCHAR(35)", Id:=True)> _
Public Province As String
<Column(DbType:="INT")> _
Public Population As Integer
<Column(DbType:="FLOAT")> _
Public Longitude As Double
<Column(DbType:="FLOAT")> _
Public Latitude As Double
Private _myCountry As EntityRef(Of Country)
<Association(Storage:="_myCountry", ThisKey:="Country")> _
Public Property MyCountry() As Country
Get
Return _myCountry.Entity
End Get
Set(ByVal value As Country)
_myCountry.Entity = value
End Set
End Property
Ok, line-by-line. First, the primary key in this table is triplicate, consisting of the Name of the city, its Country code, and its province. Fair enough (how many “Springfields” ARE there, actually?) Let’s go straight to the Association. EntityRef(Of Country) on this side; a ThisKey whose name matches up with the OtherKey on Country. Once again, the word “Country” on both sides of the association means the field named “Country” in the City class, not to the Class named Country. City.Country holds a value that must be equal to the Code field value in class Country for every pair of City and Country participating in the relationship. Think about it for a minute and make sure you totally “get it” before moving on.
Last detail is that the Entity property on DLINQ’s EntityRef type is the one we must use to get and set EntityRefs on the background private party storage field, _myCountry.
There’s no freedom in the pattern, here. This is the exact way to express one-to-many relationships in DLINQ. Everything has to be wired up exactly this way. If we were NOT creating our database from scratch, we would just let the SQLmetal utility generate this goo for us from a pre-existing SQL schema, which we would have created from some other SQL toolsets I didn’t want to learn about. Furthermore, I don’t like stuff that generates VB code where I have no idea what’s going on, so I insisted on writing this sample by hand. They tried to tell me not to do it, but I march to a different drummer.
Ok, we’re done with almost all the hard stuff. Let’s get through some easy stuff:
Public Class Mondial
Inherits DataContext
Public Sub New(ByVal connection As String)
MyBase.New(connection)
End Sub
Public Countries As Table(Of Country)
Public Cities As Table(Of City)
End Class
DataContext is a DLINQ base class; we must inherit from it for our own database, and we must pass in a connection string when we create one of these. The connection string is the LAST bit of hard stuff. There is a cottage industry of helpage for these nasties: see http://www.connectionstrings.com for instance. They are amazingly brittle, so I just wish you luck. Here is the one I used (and I had to get help to construct it):
Private ReadOnly conn1 = "Data Source=.\SQLEXPRESS;AttachDbFilename=" & _
"'C:\Documents and Settings\brianbec\My Documents\Mondial.mdf';" & _
"Integrated Security=True;Connect Timeout=30;User Instance=True"
My suggestion is that you change only the highlighted thing to your user name. Even though I thought I understood this connection string, I found by bitter experience that I had no flexibility whatever with it. I couldn’t change the directory, the Timeout, the order of terms, NOTHING. If I looked at it sideways, I would get an exception from deep in the bowels.
Ok, that’s it, the rest is easy:
Module Module1
Public UseTinyData As Boolean = True
REM set to False to use big, slow data set
Public CountryFromCode As Dictionary(Of String, Country)
Private ReadOnly conn1 = ... (as before
Public db As Mondial
Public Sub Init()
CountryFromCode = New Dictionary(Of String, Country)
If db.DatabaseExists Then
db.DeleteDatabase()
End If
db.CreateDatabase()
If UseTinyData Then
InMemoryTinyDataInitializer.CountryData(db)
InMemoryTinyDataInitializer.CityData(db)
Else
InMemoryDataInitializer.CountryData(db)
InMemoryDataInitializer.CityData(db)
End If db.SubmitChanges()
End Sub
That just sets up our data tables, and yes, we do use an in-memory index to quickly set up the foreign-key attributes of type EntityRef, but we don’t have to at all. We also have two different data sets, one tiny, for debugging and development, and the full one with 3,053 city rows.
Module InMemoryTinyDataInitializer
Private Sub AddCountry(ByVal db As Mondial, ByVal c As Country)
CountryFromCode.Add(c.Code, c)
db.Countries.Add(c)
End Sub
Public Sub CountryData(ByVal db As Mondial)
AddCountry(db, New Country {Name := "Albania", Code := "AL", Capital := "Tirane", Province := "Albania", Area := 28750, Population := 3249136})
AddCountry(db, New Country {Name := "Greece", Code := "GR", Capital := "Athens", Province := "Greece", Area := 131940, Population := 10538594})
End Sub
Private Sub AddCity(ByVal db As Mondial, ByVal c As City)
c.MyCountry = CountryFromCode(c.Country)
db.Cities.Add(c)
End Sub
Public Sub CityData(ByVal db As Mondial)
AddCity(db, New City {Name := "Tirane", Country := "AL", Province := "Albania", Population := 192000, Longitude := 10.7, Latitude := 46.2})
AddCity(db, New City {Name := "Shkoder", Country := "AL", Province := "Albania", Population := 62000, Longitude := 19.2, Latitude := 42.2})
AddCity(db, New City {Name := "Durres", Country := "AL", Province := "Albania", Population := 60000, Longitude := 19.3, Latitude := 41.2})
AddCity(db, New City {Name := "Vlore", Country := "AL", Province := "Albania", Population := 56000, Longitude := 19.3, Latitude := 40.3})
AddCity(db, New City {Name := "Elbasan", Country := "AL", Province := "Albania", Population := 53000, Longitude := 20.1, Latitude := 41.1})
AddCity(db, New City {Name := "Korce", Country := "AL", Province := "Albania", Population := 52000, Longitude := 20.5, Latitude := 40.4})
AddCity(db, New City {Name := "Athens", Country := "GR", Province := "Greece", Population := 885737, Longitude := 23.7167, Latitude := 37.9667})
AddCity(db, New City {Name := "Thessaloniki", Country := "GR", Province := "Greece", Population := 406413, Longitude := 22.95, Latitude := 40.6167})
AddCity(db, New City {Name := "Piraeus", Country := "GR", Province := "Greece", Population := 196389, Longitude := Nothing, Latitude := Nothing})
AddCity(db, New City {Name := "Patrai", Country := "GR", Province := "Greece", Population := 142163, Longitude := Nothing, Latitude := Nothing})
AddCity(db, New City {Name := "Larisa", Country := "GR", Province := "Greece", Population := 102426, Longitude := Nothing, Latitude := Nothing})
AddCity(db, New City {Name := "Iraklion", Country := "GR", Province := "Greece", Population := 102398, Longitude := Nothing, Latitude := Nothing})
AddCity(db, New City {Name := "Volos", Country := "GR", Province := "Greece", Population := 71378, Longitude := Nothing, Latitude := Nothing})
AddCity(db, New City {Name := "Kavalla", Country := "GR", Province := "Greece", Population := 56705, Longitude := Nothing, Latitude := Nothing})
End Sub
End Module
I’ll leave the main routine and the timing details to the project I uploaded. Have fun!
One recurring but subliminal theme at the recent Compiler Dev Lab (via Devhawk) was how nice it is to do incremental, interactive development. I argue, here, that not only is it nice, it’s actually critically important. The argument is statistical.
If you make ten changes to a program all at once, the number of ways of getting all ten of them right is just one, but the number of ways of getting at least one of them wrong is 1,023. And, if you get even one of them wrong, then you’ve just inserted subtle bugs into your program that you – or even worse, someone else – may not find till much later.
Digest that for a minute. If you make ten lousy changes to a program before trying them out, your chances of getting the whole thing right is just one in a thousand. It never ceases to amaze me, watching programmers at work. They’ll type in ten lines of code and then spend all day debugging them. Whereas, if they had found a way to type them in one at a time, testing as they go, then they would only have had ten ways to mess up, instead of a thousand.
Multiply that kind of inefficiency by 1,000 developers, and suddenly we can see why software engineering in practice is anything but engineering.
Look at it this way: would you solve a jigsaw puzzle by fitting in one piece at a time, or by trying to get ten pieces all in at one go? The latter strategy is a non-starter. You might as well try to solve the puzzle by putting the pieces back in the box and shaking them.
So, what… How are we going to make our changes one at a time instead of in batches? For that, let’s go back to the Compiler Dev Lab.
The most popular talks were those wherein the speaker typed a line of code and then something interesting happened on the screen. F#, Dyalog APL, Iron Python, Monad, all went along this way. As more lines were typed, more and more sophisticated things happened, building exponentially on the things that happened just a line or two prior.
The host for such a development style is a REPL (pronounced “Repple”), which stands for Read-Eval-Print Loop. Such things are commonplace staples of programming languages since the 1950’s, pioneered with Lisp, Forth, Basic, and so on. The alternative is an ECDL (pronounced “Eckdull”), for Edit-Compile-Debug Loop. An ECDL is, in reality, just a semi-interactive simulation of a card punch, card reader, and line printer (those of you under the age of 45 will have no idea what I’m talking about). People who grew up with PL/1, Fortran, C, Java, are comfortable with ECDLs and may never even have seen a REPL.
But, almost any language can be hosted in a REPL or an ECDL, and some development environments, like those for PLT-scheme and Hugs, support both.
But here’s the main point: a REPL helps you make just one change at a time. Make a change, hit return, bango it’s executed. If you made a mistake (which we all do about half the time), you see it right away, NOW, not later, hidden in the weeds with 1,022 other potential mistakes.
The speakers knew that REPLs create an overwhelmingly compelling development experience, and thus, an overwhelmingly compelling presentation experience. After all, what good is a talk about your technology if your audience dozes off? Their knowledge of this is so deep that a couple of them even created specially doctored REPLs just for presentation purposes.
In fact, this style of development is *so overwhelmingly compelling* that these professional compiler developers sometimes mistaked dynamic development for dynamic languages! IronPython is a dynamic language, F# is a static language, but both are hosted in a dynamic development environment. An informal poll for the most popular dynamic language, however, resulted in F#! This goes to show that the dynamism of the development experience is far more important than language dynamism.
So, if you’re creating a language or an environment, please PLEASE design a REPL to go along with it, and help your users solve their jigsaw puzzles one piece at a time.
In the terminology of relational databases, a “join” is, semantically, like a nested loop over a pair of lists (or tables) of records, saving only those where some certain fields match. Unless we do something smart, this could be very expensive. Imagine searching a database of a million DNA profiles for the closest match to a sample that has 10,000 DNA features (I have no idea whether those are real numbers: I just made them up, but they sound ballpark to me). A dumb join would search all 1 million profiles for each of the 10,000 features, resulting in 10 billion match tests, almost all of which will fail – by design, of course. That’s going to hurt.
The “something smart” is to build an index and search through that. Your database doesn’t have to be large at all for this to pay off. In fact, even with just a few records, it’s cheaper to build an index, use it, and throw it away than it is to do a nested loop. Let’s prove that with a small sample. First, we’ll do everything with objects in memory just to keep it easy. In a later blog entry, I’ll show how to do exactly the same thing with DLINQ and SQL Express. So, in this entry, when I say “database,” I mean collection of objects in memory. I still have a bunch of interesting new VB9 things to show off, though, including object initializers, XML literals, and query comprehensions.
My partner, Erik Meijer, showed me a fantastic little database called Mondial, free for educational and research use. Huzzah! That’s us. It’s full of fun factoids about cities and countries and languages and politics. It’s downloadable in SQL, text, or XML. It illustrates most of the technical problems you’d encounter with a more typical business sample, you know, the old boring “Customers and Orders,” but it’s more likely to keep you awake.
Let’s start off with just two of the tables in Mondial: Countries and Cities. Turns out there are 195 countries and 3,053 cities in the database. Each Country has a unique “Code,” which is a very short unique string. The database can use this as a primary key: it only needs to check the “Code” field for uniqueness of a Country record. Here’s the VB for a class representing Mondial countries:
Public Class Country
Public Name As String
Public Code As String 'Primary key
Public Capital As String
Public Province As String
Public Area As Integer
Public Population As Integer
End Class
And here are a few instances, using VB9’s “object-initializer” syntax (you will need the VB9 CTP [Customer Technology Preview] to work with the code in this post):
{Name := "Albania", Code := "AL", Capital := "Tirane", Province := "Albania", Area := 28750, Population := 3249136}
{Name := "Andorra", Code := "AND", Capital := "Andorra la Vella", Province := "Andorra", Area := 450, Population := 72766}
{Name := "Spain", Code := "E", Capital := "Madrid", Province := "Madrid", Area := 504750, Population := 39181114}
Fun facts and figures already! How about cities? Here’s a class for Mondial cities:
Public Class City
Public Name As String
Public Country As String 'Foreign-key to Country.Code
Public Province As String
Public Population As Integer
Public Longitude As Double
Public Latitude As Double
End Class
And some samples:
{Name := "Ankara", Country := "TR", Province := "Ankara", Population := 2782200, Longitude := 32.8833, Latitude := 39.95}
{Name := "Helsinki", Country := "SF", Province := "Uusimaa", Population := 487428, Longitude := 24.95, Latitude := 60.1667}
{Name := "Rangoon", Country := "MYA", Province := "Yangon", Population := 2513000, Longitude := 96.15, Latitude := 16.7833}
Ok, so now we can see how Mondial links cities to their countries: City.Country == Country.Code, that is, the “Country” field of each city must match the “Code” field of the corresponding country. A city record doesn’t contain the name of its country, just the country’s code. So, unless we’ve memorized all 195 country codes, we can’t look at any old city and tell what country it’s in.
Database join to the rescue! We’ll use City.Country == Country.Code as our “join condition” and print out a list of cities with the names of their countries retrieved from the list of countries.
The initializers for all 195 countries and 3,053 cities fill up about, well, about 3,248 lines of code, which is 50 pages or so and too large to post. So, I think the best way to get them to you is to tell you how I made them. From the Mondial site, I copied one of the SQL formats, which contains 3,248 INSERT VALUES statements. Using emacs keyboard macros, I manually wrestled these statements into the appropriate VB format, shown in the pattern below.
Public Module DataInitializer
Public Function CountryData() As List(Of Country)
Dim cs As New List(Of Country)
cs.Add(New Country {Name := "Albania", Code := "AL", Capital := "Tirane", Province := "Albania", Area := 28750, Population := 3249136})
cs.Add(New Country {Name := "Greece", Code := "GR", Capital := "Athens", Province := "Greece", Area := 131940, Population := 10538594})
cs.Add(New Country {Name := "Macedonia", Code := "MK", Capital := "Skopje", Province := "Macedonia", Area := 25333, Population := 2104035})
...
CountryData = cs
End Function
Public Function CityData() As List(Of City)
Dim cs As New List(Of City)
cs.Add(New City {Name := "Luxembourg", Country := "L", Province := "Luxembourg", Population := 76600, Latitude := 6.08, Longitude := 49.4})
cs.Add(New City {Name := "Monaco", Country := "MC", Province := "Monaco", Population := 1234, Longitude := 7.2, Latitude := 43.7})
cs.Add(New City {Name := "San Marino", Country := "RSM", Province := "San Marino", Population := 4416, Longitude := 12.2, Latitude := 43.5})
...
CityData = cs
End Function
End Module
The whole process took me 10 minutes, including making and correcting the inevitable ham-handed mistakes with emacs regular expressions. Someone with good Perl skills could probably have done the job even more quickly. I do NOT recommend trying this with search-and-replace in the VB IDE, because the VB pretty lister will try to format each of the 3,248 lines one at a time, and it will hurt.
Let’s start up a VB module, declare a couple of variables to hold all our data, and a subroutine to fill them up.:
Module Module1
Dim Countries As List(Of Country)
Dim Cities As List(Of City)
...
Sub Initialize()
Countries = DataInitializer.CountryData
Cities = DataInitializer.CityData
End Sub
Now, what’s the most straightforward way to do the join? How about the following:
Dim xs As IEnumerable
' O(Cities) * O(Countries)
Sub SlowQuery()
xs = Select New {Cty := it.t.Name, Cny := it.n.Name} _
From t In CitySubset, n In CountrySubset _
Where it.t.Country = it.n.Code
For Each x In xs
'Print or whatever... must do something to force
'query execution, however.
Next
End Sub
This sub uses anonymous types, data initializers, and a query comprehension – all new VB9 features – to build an IEnumerable containing city names and their country names. Its execution time is of order 3,053*195 =~ 600,000; painful. Also note carefully that the LINQ query comprehension just sets up the query without executing it. Not until we do something with the result, xs, does any searching and matching take place. LINQ query execution is lazy.
Most of these 600,000 comparisons will fail, of course, because they’re checking whether Madrid is in Zimbabwe or Capetown is in Iceland, and so on. In fact, we know that exactly 3,053 of them will succeed. We can be a lot smarter.
Suppose we had an index that would let us look up the country for any city, or, equivalently, the list of cities for any country? Then, we could write something like the following:
Dim ys As IEnumerable
' O(Number of Cities)
Sub FastQuery0()
ys = Select New {Cty := it.t.Name, Cny := it.n.Name} _
From n In Countries, t In CitiesFromCountry(n)
For Each y In ys
'Force query execution lest lazy
Next
End Sub
Where
Dim CountryFromCode As Dictionary(Of String, Country)
Dim CitiesFromCountry As _
Dictionary(Of Country, List(Of City))
And where we build the indices as follows:
Sub BuildIndices()
' "CountryFromCode" looks up country object from a "code" string.
' Timing to build this index is O(Countries).
CountryFromCode = New Dictionary(Of String, Country)
For Each cn In Countries
CountryFromCode.Add(cn.Code, cn)
Next
' "CitiesFromCountry" looks up a list of cities
' from a country.
' Time to build index is O(Cities)
CitiesFromCountry = _
New Dictionary(Of Country, List(Of City))
For Each ct In Cities
' Missing code is worthy of an uncaught exception
Dim d = CountryFromCode(ct.Country)
Dim cts As List(Of City) = Nothing
If Not CitiesFromCountry.TryGetValue(d, cts) Then
cts = New List(Of City)
CitiesFromCountry.Add(d, cts)
End If
cts.Add(ct)
Next
End Sub
We’re not being very clever about building these indices. For instance, we check for existence of the city list for every city. It doesn’t matter, however, because it’s so much slower to loop over all 600,000 combinations that 2,858 redundant checks, by comparison, are negligible. Also, if we we’re really clever, we would only build one of the two indices, not both, choosing between them by estimating the time to build each and picking the fastest. Real databases do this kind of estimation as part of Query Planning, and a great deal of sophistication of SQL Server 2005, for instance, is devoted to such superoptimization.
For our illustrative purposes, though, we’re more interested in clarity, flexibility, and simplicity. Even though we’re overkilling in the index-building department, it’s still far cheaper than doing the slow, nested loop. It’s still just O(Number of Cities). Now, before I show you how I prove these claims, let me just jump to some results and explain what they mean. Then I’ll show you how to reproduce these timings:
================================================================
In-memory timings (hecta mu):
+-----+--------+--------+--------+--------+
| NRws| SlowQry| FastQry|IdxBuild| FastTot|
+-----+--------+--------+--------+--------+
| 3053| 1405.87| 10.77| 10.88| 21.64|
| 1413| 330.15| 5.19| 5.25| 10.44|
| 891| 105.39| 3.25| 3.23| 6.48|
| 329| 19.20| 1.25| 1.27| 2.52|
| 89| 2.91| 0.44| 0.43| 0.87|
| 28| 0.62| 0.20| 0.20| 0.40|
| 4| 0.18| 0.11| 0.11| 0.22|
| 2| 0.13| 0.09| 0.10| 0.19|
+-----+--------+--------+--------+--------+
================================================================
Wrote data in Excel format to file ..\..\PerfData31A5.xml
Press "Enter" to end the program ...
First, my dimensions. I present timings in hundreds of microseconds, or hecta-mus, just because they come out as nice, digestible magnitudes, where we can get a feel for them without doing a lot of conversions in our heads. The first column shows the number of city-rows over which we do a join. We vary this number by picking random subsets of the entire city set so we can find out how small a database we need before it’s cheaper to do the slow, nested-loop query than to build the index, use it in the fast query, and then throw it away. The second column is the timing of the slow query in hecta-mus. The third column is the timing of the fast query; the fourth column is the timing of the building of a fresh index over the randomly chosen subset; and the last column is the sum of those two. So, the numbers to compare are the second column versus the last column. We write out these timings to an Excel file (using XML literals, which we show off later), so we can chart them, as follows:
But we don’t even need the chart to see that it’s not until we have well fewer than 28 rows does it pay to even think of doing the nested-loop instead of the build-index, use-index, throw-index-away pattern. In other words, for all practical purposes, never.
So, how do we create these results? For high-resolution timing, I’ll just use the underlying Win32 performance counters. I think there is a .NET way to do this, but I just happen to know the Win32 interop programming pattern in my fingers, so here’s my cut at it. The following is a general facility for timing the execution of ANY function-of-no-arguments, which I’m calling a thunk here. This usage of the word “thunk” is borrowed from the Scheme literature: in the Windows zeitgeist it usually means something else, like a shim or a trampoline, so don’t be confused. I’m not using the term in that sense, here.
Public Delegate Sub Thunk()
Public Module HiRezTimer
Function TimeAThunk(ByVal Something As Thunk) As Double
Dim startCount As Int64 = QPC()
Something()
Dim stopCount As Int64 = QPC()
TimeAThunk = (stopCount - startCount) / QPF()
End Function
<System.Runtime.InteropServices.DllImport("Kernel32.dll")> _
Public Function QueryPerformanceCounter(ByRef perfcount As Int64) As Boolean
End Function
<System.Runtime.InteropServices.DllImport("Kernel32.dll")> _
Public Function QueryPerformanceFrequency(ByRef freq As Int64) As Boolean
End Function
Public Function QPC() As Int64
Dim perfcount As Int64
QueryPerformanceCounter(perfcount)
Return perfcount
End Function
Public Function QPF() As Int64
Dim freq As Int64
QueryPerformanceFrequency(freq)
Return freq
End Function
End Module
And here is how I use it to produce the results above:
Sub TimeOne(ByVal n As Integer)
Dim t As New TimingRow
t.ln = n
t.q1 = Hecta(Micro(TimeAThunk(AddressOf SlowQuery)))
t.ix = Hecta(Micro(TimeAThunk(AddressOf BuildSubIndices)))
t.q2 = Hecta(Micro(TimeAThunk(AddressOf FastQuery)))
RawRows.Add(t)
Console.WriteLine("|{0,5}|{1,8:0.00}|{2,8:0.00}|{3,8:0.00}|{4,8:0.00}|", t.ln, t.q1, t.ix, t.q2, t.ix + t.q2)
End Sub
Sub DoATiming()
DrawLine()
RawRows = New List(Of TimingRow)
Console.WriteLine("In-memory timings (hecta mu):")
BuildIndices()
CountrySubset = Countries
CitySubset = Cities
Console.WriteLine("+-----+--------+--------+--------+--------+")
Console.WriteLine("| NRws| SlowQry| FastQry|IdxBuild| FastTot|")
Console.WriteLine("+-----+--------+--------+--------+--------+")
TimeOne(CitySubset.Count)
TimeOne(BuildSubsets(1, 2).m)
TimeOne(BuildSubsets(1, 4).m)
TimeOne(BuildSubsets(1, 8).m)
TimeOne(BuildSubsets(1, 16).m)
TimeOne(BuildSubsets(1, 32).m)
TimeOne(BuildSubsets(1, 64).m)
TimeOne(BuildSubsets(1, 128).m)
Console.WriteLine("+-----+--------+--------+--------+--------+")
End Sub
Over in the article section, I’ll post the code for “BuildSubsets,” along with all this code in complete VB modules. It’s pretty obvious, dumb, actually. But remember, it doesn’t need to be fast or smart, just clear and easy.
The last thing to show off here is how easy it is to write out the Excel version of the timing data using VB9 XML literals. It literally took me less than half an hour to do this, knowing next to nothing about Excel’s XML support. I still know next to nothing about it: I don’t need to know. The timing data goes in a “TimingRow,” which looks like this:
Public Class TimingRow
Public ln As Integer
Public q1 As Double
Public ix As Double
Public q2 As Double
End Class
Length, query1, index-build, and query2, yup yup. In the timing routines above, we not only Console.WriteLine the results, but fill new timing rows with them, so we can add a timing row to a list of “RawRows,” declared as follows:
Module ExcelWriter
Dim xl As XDocument
Public RawRows As List(Of TimingRow)
...
In our main routine, after we’re done with all the timings, we call
ExcelWriter.SaveDocument()
Which looks like this:
Public Sub SaveDocument()
PrepareOutput()
Dim filename = NonCollidingFileName("..\..\PerfData", ".xml")
Dim str As New StringWriter
xl.Save(str)
Using sw = File.CreateText(filename)
sw.WriteLine(str)
sw.Close()
End Using
Console.WriteLine("Wrote data in Excel format to file {0}", filename)
End Sub
The only interesting line, here, is “PrepareOutput;” the code for “NonCollidingFileName” is over in the article section with the rest of the entire sample. So, here is “PrepareOutput:”
Private Sub PrepareOutput()
xl = <?xml version="1.0"?>
<?mso-application progid="Excel.Sheet"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
<Worksheet ss:Name="Sheet1">
<Table>
<%= MakeRow("Length", "Qry1", "Indx", "Qry2", "Tot2") %>
<%= Select MakeRow(c.ln, c.q1, c.ix, c.q2) _
From c In RawRows %>
</Table>
</Worksheet>
</Workbook>
End Sub
This creates the literal XML document. I got this document template by creating a bogey document, with constant data instead of variable data, in Excel by hand, choosing “Save As XML,” opening up the result in Notepad, and cutting and pasting the result over to my VB code. I didn’t even have to search for documentation on the XML format for Excel files, let alone read, understand, or learn it. Just blind cut and paste. Neat.
I replaced the constant rows with VB XML holes, bracketed by the <%= ... %> syntax. I created two overloads of “MakeRow,” one for column labels, another for the numerical data, calling one straight up and the other through a “Select” comprehension. Neat. Here are those two overloads:
Private Function MakeRow(ByVal a As Integer, ByVal b As Double, _
ByVal c As Double, ByVal d As Double) As XElement
MakeRow = <Row xmlns="urn:schemas-microsoft-com:office:spreadsheet"
xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"
ss:AutoFitHeight="0">
<Cell><Data ss:Type="Number"><%= a %></Data></Cell>
<Cell><Data ss:Type="Number"><%= b %></Data></Cell>
<Cell><Data ss:Type="Number"><%= c %></Data></Cell>
<Cell><Data ss:Type="Number"><%= d %></Data></Cell>
<Cell ss:Formula="=RC[-2]+RC[-1]"></Cell>
</Row>
End Function
Private Function MakeRow(ByVal a As String, ByVal b As String, _
ByVal c As String, ByVal d As String, ByVal e As String) As XElement
MakeRow = <Row xmlns="urn:schemas-microsoft-com:office:spreadsheet"
xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"
ss:AutoFitHeight="0">
<Cell><Data ss:Type="String"><%= a %></Data></Cell>
<Cell><Data ss:Type="String"><%= b %></Data></Cell>
<Cell><Data ss:Type="String"><%= c %></Data></Cell>
<Cell><Data ss:Type="String"><%= d %></Data></Cell>
<Cell><Data ss:Type="String"><%= e %></Data></Cell>
</Row>
End Function
More XML literals via blind cut-and-paste, with holes inserted via educated guesswork. Notice that the last line in the numerical overload writes out a constant cell, it’s just a constant formula in Excel, that uses relative cell references to calculate the totals. Neat.
So there you have it. How to do dumb queries, hash joins, time your results, and write them out in an Excel file, in one really short VB9 program. One last little detail, you really do need to do the timings TWICE and ignore the first set of results, because the CLR Jitter will need to warm up on the first go-round. This could be prevented by NGenning the program before running it, but it’s easier this way, and this post is all about how to do it the easy way. So here is my main routine, which reports that we should ignore the first timing:
Sub Main()
Initialize()
Console.WriteLine("Ignore first timing (JIT warmup)")
DoATiming() 'First one just warms up the Jit
DoATiming()
DrawLine()
ExcelWriter.SaveDocument()
Console.Write("Press ""Enter"" to end the program ...")
Console.ReadLine()
Exit Sub
End Sub
In case you ever find yourself having to implement (or simulate, or emulate) a stack-based Virtual Machine, a-la Forth or PostScript or even the CLR itself, here's a handy programming pattern for your consideration. The rough idea is to give every instruction some state variables for remembering its arguments, with virtual methods for "execute" and "ToString" (for “explain” mode) and what not. An instance of the VM will have an instruction stream, an instruction pointer, a stack, and methods for adding instructions to the stream, executing the stream, and explaining itself.
The sample below is intentionally very bare-bones. Think of it as the “hello, world” for a Forth-oid VM. It’s a pattern rather than a solution. I just have a handful of instructions here, just to offer the idea. To make this a real, complete, VM, you would add a lot more instructions, following the pattern. This sample doesn’t use the underlying CLR VM; it could, through Reflection. It doesn’t have a console; it could, through Regex or pick-your-favorite parsing solution. It doesn’t have a debugger; it could, through integration with Visual Studio. It doesn’t have a type system; it could -- it just uses VB late binding through object. In fact, this is so bare-bones that I don’t actually have a method for adding instructions, it just exposes the instruction stream as a public member (oh, the HORROR!) … you’re getting the idea.
So, first, here’s the outline for the VM itself, showing detail, for now, just for the self “Test” routine.
Public Class VM
Public Class Exception REM gotta have our own, don’tya know
Inherits ApplicationException REM ... impl omitted
Public stk As New Stack REM this is a .NET collection class
Private ix As New List(Of Instruction) REM Instr. is ours
Private ip As New Integer REM instruction pointer
Public Sub Execute() REM ... impl omitted
Public Sub Explain() REM ... impl omitted
Public Sub Test()
ix.Clear()
ix.Add(New Push("A"))
ix.Add(New Push("B"))
ix.Add(New Push(42))
ix.Add(New Push("C"))
ix.Add(New DisplayStack)
ix.Add(New MakeList(3))
ix.Add(New DisplayStack)
ix.Add(New Reverse)
ix.Add(New DisplayStack)
ix.Add(New UnList)
ix.Add(New DisplayStack)
ix.Add(New Pop(2))
ix.Add(New DisplayStack)
Explain()
Execute()
End Sub
End Class
Obviously, the interesting thing above is the instruction stream in the “Test” routine. The phasing of operations is “create the instructions, add them to the stream, make the stream show itself, then execute the entire stream.” “Push” is an instruction that, when executed, pushes its argument on the stack. “DisplayStack” does some more explaining at run time. “MakeList,” when executed, pops n arguments off the stack and makes a list out of them, pushing the list back on the stack; n is its argument, 3 in this case. “Reverse,” when executed, pops a list off the stack, reverses the elements, and pushes the resulting list back on the stack. “UnList,” when executed, pops a list off the stack, then serially pushes the elements of the list back on the stack. “Pop,” when executed, pops n elements off the stack, where n is its argument, 2 in this case.
What is the expected output of the above? Something like the following:
13 Instructions
0: Push A
1: Push B
2: Push 42
3: Push C
4: DisplayStack
5: MakeList 3
6: DisplayStack
7: Reverse
8: DisplayStack
9: UnList
10: DisplayStack
11: Pop 2
12: DisplayStack
4 elements in the stack
0: C
1: 42
2: B
3: A
2 elements in the stack
0: ["C", 42, "B"]
1: A
2 elements in the stack
0: ["B", 42, "C"]
1: A
4 elements in the stack
0: C
1: 42
2: B
3: A
2 elements in the stack
0: B
1: A
Neat. Ok, now the implementations of VM’s methods. Here’s Execute:
Public Sub Execute()
ip = 0
While ip < ix.Count
ip += ix(ip).execute(Me)
End While
End Sub
Yup, just calls the overridable “execute” method on each instruction, passing in the VM itself so that each instruction can access the stack, as needed. Notice that the ip (instruction pointer) is advanced by an amount returned by “execute.” That’s so we can write branch instructions that move the instruction pointer somewhere far away. Most instructions just return 1 from “execute.”
“Explain” is equally easy. The only thing remotely noteworthy is that Console.WriteLine automatically calls the “ToString” method on class “Instruction,” which will, by override (virtual dispatch) call the various formatting routines in the particular instruction subclasses:
Public Sub Explain()
Console.WriteLine("{0} Instructions", ix.Count)
ip = 0
While ip < ix.Count
Console.WriteLine("{0,4}: {1}", ip, ix(ip))
ip += 1
End While
Console.WriteLine()
End Sub
Now it’s time to see the base class for instruction, again with some detail omitted for now:
Public MustInherit Class Instruction
Public MustOverride Function execute(ByVal d As VM) _
As Integer
Public Overrides Function ToString() As String
Return String.Format("{0,-16} {1}", formatOpcode, _
formatArgs)
End Function
Public Overridable Function formatArgs() As String
formatArgs = ""
End Function
Public Function formatOpcode() As String
Const opcodePat = "\w+\.(?<opcode>\w+)$"
Dim s = Me.GetType.ToString
formatOpcode = Regex.Match _
(s, opcodePat).Result("${opcode}")
End Function
End Class
“execute” is a pure virtual method – each instruction must supply an implementation. That’s pretty obvious. “ToString,” called by the VM’s “Explain” procedure for each instruction, calls an overridable “formatArgs,” with the empty string being the default return value for those instructions that don’t override it because they don’t have arguments they want to print out. Interestingly, the “formatOpcode” isn’t overridable, because, via Reflection, we can get the name of each instruction subclass without having that subclass participate. Neat.
Nothing much left, now. Just show a few instructions, and you’ll get the idea:
Public Class Push
Inherits Instruction