March 2006 - Posts

The Pie-Fighting Philosophers Problem

Take three PhD specialists in programming languages, three PhD specialists in relational database, and three PhD specialists in network security, and seat them around a table. Now take three coconut-cream pies, three chocolate-cream pies, and three banana-cream pies and randomly place one pie in front of each philosopher. Finally, give them three hours to come up with a position paper on the concept of "Identity." Calculate how many philosophers at the end of three hours have three kinds of cream-pie on their faces, how many have two kinds, and how many have just one kind.

 

Ok, I thought it was funny, but then, I've been involved in a lot of design and standards committees. The point is, no single concept of "Identity" works for all three. In fact, no concept of Identity works for any pair of the three, and it's only barely possible to get a concept that works for any one of the three. Let's illustrate this just for programming languages and relational databases.

 

Consider the following data, stated in English:

 

Venice is in Italy.

Rome is in Italy.

Nogales is in Mexico.

Nogales is in the USA.

 

Hey, wait a minute. One city in two countries? Sure. Happens all the time. Trieste straddles Slovenia and Italy. How about Bratislava? It's in three countries: Slovakia, Hungary, and Austria. Sault Ste. Marie is in both Canada and the USA. And those are just the physically connected examples. The USA has at least one Rome – in New York – and at least one Venice – in California.

 

Ok, the point is that our Mondial database is going to have to deal with the fact that the Country-City relationship is really many-to-many, not one-to-many, and that means it's going to have to deal with Identity in a deeper way than we sketched out prior to this.

 

How would our three database specialists represent the data above? Experienced practitioners will normalize the data, that is, remove redundancies, resulting in three tables: one for the countries, one for the cities, and one, called a join table, middle table, or intersection table, for the relationship. The primary key for the Country table will be just the name of the country. The primary key for the City table will be just the name of the city. The middle table will contain pairs of foreign keys. One member of each pair will be a city name, and the other member of each pair will be a country name. The primary key of the middle table will be the pair of foreign keys. In pictures:

 

Table Labels:

 

CountryTable

 

CityTable

 

CountryCityTable

Column Labels:

 

Country

 

City

 

Country

City

Data Rows:

 

Italy

 

Venice

 

Italy

Venice

 

 

Mexico

 

Rome

 

Italy

Rome

 

 

USA

 

Nogales

 

Mexico

Nogales

 

 

 

 

 

 

USA

Nogales

                               

So we see that the only way to refer to a country or a city in the middle table is by a copy of its primary key. Therefore, in this normalized database,

 

Two rows are identical if and only if their primary keys have the same value.

 

Many database practitioners use underlining on the labels of columns containing primary keys; we'll use bold. Since every column contains a component of a primary key, all column labels are bold.

 

There is also an obvious non-normalized view form of the same data. Non-normalized basically means that this form contains mathematically redundant information, and this can cause ambiguity problems in the identity department. Typically, such a form would be temporary, not the permanent form, for constructing reports, building queries, analyzing by-products, and what not. It would look something like this:

 

Table Labels:

 

CountryFromCityTable

 

CityFromCountryTable

Column Labels:

 

City

Country

 

Country

City

Data Rows:

 

Venice

Italy

 

Italy

Venice

 

 

Rome

Italy

 

Italy

Rome

 

 

Nogales

Mexico

 

Mexico

Nogales

 

 

Nogales

USA

 

USA

Nogales

 

What's the natural way to express this scenario in Visual Basic? Consider our running "Mondial" example again, but stripped way, way down:

 

Public Class Country

    Public Name As String

    Public MyCities As List(Of City)

End Class

 

Public Class City

    Public Name As String

    Public MyCountries As List(Of Country)

End Class

...

    Sub InitData()

        Dim Italy = New Country {Name := "Italy"}

        Dim USA = New Country {Name := "USA"}

        Dim Mexico = New Country {Name := "Mexico"}

 

        Dim Venice = New City {Name := "Venice"}

        Dim Rome = New City {Name := "Rome"}

        Dim Nogales = New City {Name := "Nogales"}

 

        Italy.MyCities = New List(Of City)

        Mexico.MyCities = New List(Of City)

        USA.MyCities = New List(Of City)

 

        Mexico.MyCities.Add(Nogales)

        Italy.MyCities.Add(Venice)

        Italy.MyCities.Add(Rome)

        USA.MyCities.Add(Nogales)

 

        Venice.MyCountries = New List(Of Country)

        Rome.MyCountries = New List(Of Country)

        Nogales.MyCountries = New List(Of Country)

 

        Venice.MyCountries.Add(Italy)

        Rome.MyCountries.Add(Italy)

        Nogales.MyCountries.Add(Mexico)

        Nogales.MyCountries.Add(USA)

 

        Countries.Add(Italy)

        Countries.Add(USA)

        Countries.Add(Mexico)

 

        Cities.Add(Venice)

        Cities.Add(Rome)

        Cities.Add(Nogales)

    End Sub

 

That all just represents a graph of objects in memory, which we can sketch as follows:

 

                   _.------------------------.

               ,-''                           `v

    +-------+-/-+---+                    +---+--------+

    | Italy | o | o |<-------------------= o | Venice |

    +-------+---+--\+                    +---+--------+

         ^           '-.

          `--.          `--.

              `-------.     `-----------.

                       `-------.         `---.

                                `------.      `.

                                        `-.     `v

    +-------+---+                         +\--+------+

    |  USA  | o-=--------.                | o | Rome |

    +-------+---+         `--.            +---+------+

          ^.                  `-----.

            `----------.             `---.

                        `-----.           `.

                               `--.         `--.

              ,------------.      `--.         `-.

             v               `-.       `--.        v

    +--------+---+              `-.  +---+-\-+---------+

    | Mexico | o |                 `-= o | o | Nogales |

    +--------+-\-+                   +---+---+---------+

                `-.                               ^

                   `----.                        /

                         `----------------------'

 

The point is that with objects in memory, the natural representation of the data is through references, and we may conclude that

 

Two objects are identical if and only if their references are identical.

 

So, indeed, since the USA object has a reference to Nogales that is identical to the reference to Nogales contained in the Mexico object, these two Nogales's are (drum roll) identical. Likewise, Venice has a reference to Italy, and Rome has a reference to the same, identical, Italy object.

 

Now, we might duplicate exactly this situation in a relational database, but we would have to introduce additional data for references, along the following lines:

 

CountryTable

 

CityTable

 

CoutryCityTable

Country

CnyPK

 

City

CtyPK

 

CnyFK

CtyFK

Italy

37

 

Venice

26

 

37

26

Mexico

162

 

Rome

8

 

37

8

USA

49

 

Nogales

33

 

49

33

 

 

 

 

 

 

162

33

 

In fact, this is the kind of automatic solution that object-oriented databases would typically compute on your behalf to persist an object graph for you. It just substitutes unique numbers for object references, a process called swizzling. But it still MUST USE a middle table, because there is no good way to represent a variable-length list of references in a column of a table. Most solutions for nested tables, in fact, just un-nest the nested information into additional tables, just as above. So, even though this appears to be a simulation of the object-oriented graph, it doesn't take much to see that it's a totally uninteresting transformation of the normalized relational database form: it just added numerical keys, and it still uses value-equality as the definition for identity.

 

The point I'm trying to make, here, is that object-oriented programs and relational databases have totally incompatible notions of identity, and that the incompatibilty opens up all kinds of downstream complexities. The various schemes for resolving the incompatibilities are mapping solutions. The world of mapping is very, very complicated, because mapping solutions typically try to resolve not just the mis-match in the notions of identity, but to provide all kinds of operational services, incorporating, for instance, updatable non-normalized views, and support for badly designed legacy databases, which are far too valuable to throw away but far too expensive to normalize.

 

Sorry, I don't have a bolt of lightning that, with blinding obviousness and clarity, puts this entire topic into order. I think it's best just to stock up on towels and view it as a never-ending pie fight we just have to deal with. That way, we'll interpret it as funny instead of troubling. 

 

 

Posted Sunday, March 19, 2006 9:58 AM by brianbec | 2 comment(s)

More Mondial Fun with VB and DLINQ

Buckle up, boys and girls, we’re going to be little DBA’s today ! (Data Base Administrators)

 

Earlier, we showed how to use LINQ over objects in memory. Today, we show how to write EXACTLY the same program, this time using DLINQ, over a real SQL-Express relational database. The whole DLINQ project, including the full data set of 3,053 rows, can be downloaded here. The earlier LINQ project – same program only over objects-in-memory, can be downloaded here. You will need the VB9 CTP (Customer Technology Preview), plus some patience. That is a pre-alpha version of VB, and you may have to create new projects and copy the source files (I have had some “inside” reports that this is so).

 

The point of both programs is to compare the time for a fast query to the time for a slow query. To deliver the punch line, even with 14 cities in a tiny DLINQ database, the slow query is 20 times slower than the fast one .

 

================================================================

SlowQuery1 found 14 cities, Time =    54.04 milliseconds

FastQuery1 found 14 cities, Time =     2.12 milliseconds

================================================================

 

With the full database of 3,053 cities, the results are crushing:

 

================================================================

SlowQuery1 found 3053 cities, Time = 46109.31 milliseconds

FastQuery1 found 3053 cities, Time =    40.53 milliseconds

================================================================

 

The data “entities” are from the Mondial database, a fun and free collection of facts and figures about the world. We start off small with just one “relationship” between cities and countries: each country has many cities, expressed through the “join condition” Country.Code == City.Country. In the in-memory case, we firmly establish that, practically speaking, it’s ALWAYS better to build an index, search the index, and throw the index away than to scan the relationship via a nested double loop. In the DLINQ case, we’ll use DLINQ to set up foreign keys in the Database and scan them in the fast query. Here is the query code for the DLINQ case (and we explain all the details below):

 

    Sub SlowQuery1()

        xs = Select New {Cty := it.t.Name, Cny := it.n.Name} _

            From t In db.Cities, n In db.Countries _

            Where it.t.Country = it.n.Code

        xc = CountIEnum(xs)

    End Sub

 

    Sub FastQuery1()

        xs = Select New {Cty := t.Name, Cny := t.MyCountry.Name} _

            From t In db.Cities

        xc = CountIEnum(xs)

    End Sub

 

The neat thing is that DLINQ lets us express our relationship as a PROPERTY, “MyCountry,” on the City class. Neat. We don’t need an index in memory to get fast query. In my sample, I do build such an index, but I use it ONLY to set up the database quickly. I don’t use it at all in the queries and it’s absolutely not necessary.

 

Did I say that we’re making our SQL database FROM SCRATCH? Surprisingly, this is unusual. Almost all the samples I’ve seen magically install some big messy database and then show you how to manipulate it. They never show you how to create it. What’s up with that? I want to do EVERYTHING from my VB program. I don’t want to learn a bunch of SQL-Express GUI DBA tools too. Given the choice between “Hello World!” and “Hello Northwind!” I’ll take “Hello World!” every time. Wouldn’t you? Well, “Hello World!” translates into “Hello Mondial!”

 

Let’s start with class definitions. Remember, the Country class definition for our objects-in-memory LINQ version was:

 

Public Class Country

    Public Name As String

    Public Code As String 'Primary key

    Public Capital As String

    Public Province As String

    Public Area As Integer

    Public Population As Integer

End Class

 

The DLINQ version is, mostly, very similar. Visually, it’s more junky looking, but, fear not, I walk you through all of it:

 

<Table()> _

Public Class Country

    <Column(DbType:="VARCHAR(32)")> _

    Public Name As String

    <Column(DbType:="VARCHAR(4)", Id:=True)> _

    Public Code As String

    <Column(DbType:="VARCHAR(35)")> _

    Public Capital As String

    <Column(DbType:="VARCHAR(32)")> _

    Public Province As String

    <Column(DbType:="INT")> _

    Public Area As Integer

    <Column(DbType:="INT")> _

    Public Population As Integer

 

    Private _myCities As New EntitySet(Of City)

    <Association(Storage:="_myCities", OtherKey:="Country")> _

    Public Property MyCities() As EntitySet(Of City)

        Get

            Return _myCities

        End Get

        Set(ByVal value As EntitySet(Of City))

            _myCities.Assign(value)

        End Set

    End Property

 

End Class

 

The custom attribute <Table()> tells DLINQ that we want the class stored as a database table. The VB compiler ignores CLR custom attributes – it just copies them into assembly metadata so that libraries can read them through Reflection if they need to. That’s just what the DLINQ library does – it checks our class definitions at run time to see if they’re Tables.

 

Next, list out all the same fields as before, just decorated with <Column(…)> attributes to tell DLINQ what SQL types we want stored. I think there are some intelligent defaults I could have used, saving explicit specifying of these details, but I wanted to play it really safe, not testing those waters, so I slavishly set up column attributes for every field. The Mondial original was written in SQL, not VB, so it wasn’t hard to figure out the size of VARCHARS I needed and what not.

 

The new DLINQ EntitySet type and Association attribute are clearly the interesting things. They represent a recurring DLINQ pattern for one-to-many relationships. Over here, in the Country class, we’re on the ONE side of the one-to-many relationship. We must have a corresponding Association attribute on the MANY side, in the City class; we get to it in a minute. Let’s go through this side line-by-line:

 

    Private _myCities As New EntitySet(Of City)

 

Ok, there is a private field that DLINQ will use behind our backs to store a “list” of cities, in effect, for each Country.

 

    <Association(Storage:="_myCities", OtherKey:="Country")> _

 

Ok, DLINQ now knows the name of the storage field it can party on; it’s Country._myCities. It also knows the name of the field in the OTHER side of the association that points back to Country instances, here. Remember, our “join” condition is that City.Country == Country.Code. The OtherKey portion tells DLINQ that the field named “Country” in the OTHER side of the association hooks up to the primary key on THIS side of the relationship, which, oh-by-the-way, we already declared thus:

 

    <Column(DbType:="VARCHAR(4)", Id:=True)> _

    Public Code As String

 

The word “Country” in the fragment OtherKey:="Country" means the field named “Country” in the other side of the association, not the class named “Country” on this side of the association. Sorry about this, but that’s how Mondial was designed and we’re going to live with it.

 

    Public Property MyCities() As EntitySet(Of City)

        Get

            Return _myCities

        End Get

        Set(ByVal value As EntitySet(Of City))

            _myCities.Assign(value)

        End Set

    End Property

 

Ok, that is more-or-less standard stuff: it defines the public property allowing other parts of my program to interact with the EntitySet. DLINQ parties on _myCities behind the scenes, but we get to party on the public Property MyCities. The only thing to remember is to use the Assign method of EntitySet so DLINQ can keep track of the publicly made changes. Now, we just look at the City class and we’re done with almost everything difficult:

 

<Table()> _

Public Class City

    <Column(DbType:="VARCHAR(35)", Id:=True)> _

    Public Name As String

    <Column(DbType:="VARCHAR(4)", Id:=True)> _

    Public Country As String

    <Column(DbType:="VARCHAR(35)", Id:=True)> _

    Public Province As String

    <Column(DbType:="INT")> _

    Public Population As Integer

    <Column(DbType:="FLOAT")> _

    Public Longitude As Double

    <Column(DbType:="FLOAT")> _

    Public Latitude As Double

 

    Private _myCountry As EntityRef(Of Country)

    <Association(Storage:="_myCountry", ThisKey:="Country")> _

    Public Property MyCountry() As Country

        Get

            Return _myCountry.Entity

        End Get

        Set(ByVal value As Country)

            _myCountry.Entity = value

        End Set

    End Property

 

Ok, line-by-line. First, the primary key in this table is triplicate, consisting of the Name of the city, its Country code, and its province. Fair enough (how many “Springfields” ARE there, actually?) Let’s go straight to the Association. EntityRef(Of Country) on this side; a ThisKey whose name matches up with the OtherKey on Country. Once again, the word “Country” on both sides of the association means the field named “Country” in the City class, not to the Class named Country. City.Country holds a value that must be equal to the Code field value in class Country for every pair of City and Country participating in the relationship. Think about it for a minute and make sure you totally “get it” before moving on.

 

Last detail is that the Entity property on DLINQ’s EntityRef type is the one we must use to get and set EntityRefs on the background private party storage field, _myCountry.

 

There’s no freedom in the pattern, here. This is the exact way to express one-to-many relationships in DLINQ. Everything has to be wired up exactly this way. If we were NOT creating our database from scratch, we would just let the SQLmetal utility generate this goo for us from a pre-existing SQL schema, which we would have created from some other SQL toolsets I didn’t want to learn about. Furthermore, I don’t like stuff that generates VB code where I have no idea what’s going on, so I insisted on writing this sample by hand. They tried to tell me not to do it, but I march to a different drummer.

 

Ok, we’re done with almost all the hard stuff. Let’s get through some easy stuff:

 

Public Class Mondial

    Inherits DataContext

 

    Public Sub New(ByVal connection As String)

        MyBase.New(connection)

    End Sub

 

    Public Countries As Table(Of Country)

    Public Cities As Table(Of City)

End Class

 

DataContext is a DLINQ base class; we must inherit from it for our own database, and we must pass in a connection string when we create one of these. The connection string is the LAST bit of hard stuff. There is a cottage industry of helpage for these nasties: see http://www.connectionstrings.com for instance. They are amazingly brittle, so I just wish you luck. Here is the one I used (and I had to get help to construct it):

 

    Private ReadOnly conn1 = "Data Source=.\SQLEXPRESS;AttachDbFilename=" & _

    "'C:\Documents and Settings\brianbec\My Documents\Mondial.mdf';" & _

    "Integrated Security=True;Connect Timeout=30;User Instance=True"

 

My suggestion is that you change only the highlighted thing to your user name. Even though I thought I understood this connection string, I found by bitter experience that I had no flexibility whatever with it. I couldn’t change the directory, the Timeout, the order of terms, NOTHING. If I looked at it sideways, I would get an exception from deep in the bowels.

 

Ok, that’s it, the rest is easy:

 

Module Module1

    Public UseTinyData As Boolean = True

    REM set to False to use big, slow data set

    Public CountryFromCode As Dictionary(Of String, Country)

    Private ReadOnly conn1 = ... (as before

    Public db As Mondial

    Public Sub Init()

        CountryFromCode = New Dictionary(Of String, Country)

        If db.DatabaseExists Then

            db.DeleteDatabase()

        End If

        db.CreateDatabase()

        If UseTinyData Then

            InMemoryTinyDataInitializer.CountryData(db)

            InMemoryTinyDataInitializer.CityData(db)

        Else

            InMemoryDataInitializer.CountryData(db)

            InMemoryDataInitializer.CityData(db)

        End If

        db.SubmitChanges()

    End Sub

 

That just sets up our data tables, and yes, we do use an in-memory index to quickly set up the foreign-key attributes of type EntityRef, but we don’t have to at all. We also have two different data sets, one tiny, for debugging and development, and the full one with 3,053 city rows.

 

Module InMemoryTinyDataInitializer

 

    Private Sub AddCountry(ByVal db As Mondial, ByVal c As Country)

        CountryFromCode.Add(c.Code, c)

        db.Countries.Add(c)

    End Sub

 

    Public Sub CountryData(ByVal db As Mondial)

        AddCountry(db, New Country {Name := "Albania", Code := "AL", Capital := "Tirane", Province := "Albania", Area := 28750, Population := 3249136})

        AddCountry(db, New Country {Name := "Greece", Code := "GR", Capital := "Athens", Province := "Greece", Area := 131940, Population := 10538594})

    End Sub

 

    Private Sub AddCity(ByVal db As Mondial, ByVal c As City)

        c.MyCountry = CountryFromCode(c.Country)

        db.Cities.Add(c)

    End Sub

 

    Public Sub CityData(ByVal db As Mondial)

        AddCity(db, New City {Name := "Tirane", Country := "AL", Province := "Albania", Population := 192000, Longitude := 10.7, Latitude := 46.2})

        AddCity(db, New City {Name := "Shkoder", Country := "AL", Province := "Albania", Population := 62000, Longitude := 19.2, Latitude := 42.2})

        AddCity(db, New City {Name := "Durres", Country := "AL", Province := "Albania", Population := 60000, Longitude := 19.3, Latitude := 41.2})

        AddCity(db, New City {Name := "Vlore", Country := "AL", Province := "Albania", Population := 56000, Longitude := 19.3, Latitude := 40.3})

        AddCity(db, New City {Name := "Elbasan", Country := "AL", Province := "Albania", Population := 53000, Longitude := 20.1, Latitude := 41.1})

        AddCity(db, New City {Name := "Korce", Country := "AL", Province := "Albania", Population := 52000, Longitude := 20.5, Latitude := 40.4})

 

        AddCity(db, New City {Name := "Athens", Country := "GR", Province := "Greece", Population := 885737, Longitude := 23.7167, Latitude := 37.9667})

        AddCity(db, New City {Name := "Thessaloniki", Country := "GR", Province := "Greece", Population := 406413, Longitude := 22.95, Latitude := 40.6167})

        AddCity(db, New City {Name := "Piraeus", Country := "GR", Province := "Greece", Population := 196389, Longitude := Nothing, Latitude := Nothing})

        AddCity(db, New City {Name := "Patrai", Country := "GR", Province := "Greece", Population := 142163, Longitude := Nothing, Latitude := Nothing})

        AddCity(db, New City {Name := "Larisa", Country := "GR", Province := "Greece", Population := 102426, Longitude := Nothing, Latitude := Nothing})

        AddCity(db, New City {Name := "Iraklion", Country := "GR", Province := "Greece", Population := 102398, Longitude := Nothing, Latitude := Nothing})

        AddCity(db, New City {Name := "Volos", Country := "GR", Province := "Greece", Population := 71378, Longitude := Nothing, Latitude := Nothing})

        AddCity(db, New City {Name := "Kavalla", Country := "GR", Province := "Greece", Population := 56705, Longitude := Nothing, Latitude := Nothing})

    End Sub

End Module

 

I’ll leave the main routine and the timing details to the project I uploaded. Have fun!

 

Posted Friday, March 17, 2006 10:16 AM by brianbec | with no comments

Got REPL? Or, "Please Don't Shake My Jigsaw Puzzle!"

One recurring but subliminal theme at the recent Compiler Dev Lab (via Devhawk) was how nice it is to do incremental, interactive development. I argue, here, that not only is it nice, it’s actually critically important. The argument is statistical.

 

If you make ten changes to a program all at once, the number of ways of getting all ten of them right is just one, but the number of ways of getting at least one of them wrong is 1,023. And, if you get even one of them wrong, then you’ve just inserted subtle bugs into your program that you – or even worse, someone else – may not find till much later.

 

Digest that for a minute. If you make ten lousy changes to a program before trying them out, your chances of getting the whole thing right is just one in a thousand. It never ceases to amaze me, watching programmers at work. They’ll type in ten lines of code and then spend all day debugging them. Whereas, if they had found a way to type them in one at a time, testing as they go, then they would only have had ten ways to mess up, instead of a thousand.

 

Multiply that kind of inefficiency by 1,000 developers, and suddenly we can see why software engineering in practice is anything but engineering.

 

Look at it this way: would you solve a jigsaw puzzle by fitting in one piece at a time, or by trying to get ten pieces all in at one go? The latter strategy is a non-starter. You might as well try to solve the puzzle by putting the pieces back in the box and shaking them.

 

So, what… How are we going to make our changes one at a time instead of in batches? For that, let’s go back to the Compiler Dev Lab.

 

The most popular talks were those wherein the speaker typed a line of code and then something interesting happened on the screen. F#, Dyalog APL, Iron Python, Monad, all went along this way. As more lines were typed, more and more sophisticated things happened, building exponentially on the things that happened just a line or two prior.

 

The host for such a development style is a REPL (pronounced “Repple”), which stands for Read-Eval-Print Loop. Such things are commonplace staples of programming languages since the 1950’s, pioneered with Lisp, Forth, Basic, and so on. The alternative is an ECDL (pronounced “Eckdull”), for Edit-Compile-Debug Loop. An ECDL is, in reality, just a semi-interactive simulation of a card punch, card reader, and line printer (those of you under the age of 45 will have no idea what I’m talking about). People who grew up with PL/1, Fortran, C, Java, are comfortable with ECDLs and may never even have seen a REPL.

 

But, almost any language can be hosted in a REPL or an ECDL, and some development environments, like those for PLT-scheme and Hugs, support both.

 

But here’s the main point: a REPL helps you make just one change at a time. Make a change, hit return, bango it’s executed. If you made a mistake (which we all do about half the time), you see it right away, NOW, not later, hidden in the weeds with 1,022 other potential mistakes.

 

The speakers knew that REPLs create an overwhelmingly compelling development experience, and thus, an overwhelmingly compelling presentation experience. After all, what good is a talk about your technology if your audience dozes off? Their knowledge of this is so deep that a couple of them even created specially doctored REPLs just for presentation purposes.

 

In fact, this style of development is *so overwhelmingly compelling* that these professional compiler developers sometimes mistaked dynamic development for dynamic languages! IronPython is a dynamic language, F# is a static language, but both are hosted in a dynamic development environment. An informal poll for the most popular dynamic language, however, resulted in F#! This goes to show that the dynamism of the development experience is far more important than language dynamism.

 

So, if you’re creating a language or an environment, please PLEASE design a REPL to go along with it, and help your users solve their jigsaw puzzles one piece at a time.

Posted Thursday, March 16, 2006 10:46 AM by brianbec | 2 comment(s)

Fun Hash Joins with VB9 LINQ

In the terminology of relational databases, a “join” is, semantically, like a nested loop over a pair of lists (or tables) of records, saving only those where some certain fields match. Unless we do something smart, this could be very expensive. Imagine searching a database of a million DNA profiles for the closest match to a sample that has 10,000 DNA features (I have no idea whether those are real numbers: I just made them up, but they sound ballpark to me). A dumb join would search all 1 million profiles for each of the 10,000 features, resulting in 10 billion match tests, almost all of which will fail – by design, of course. That’s going to hurt.

 

The “something smart” is to build an index and search through that. Your database doesn’t have to be large at all for this to pay off. In fact, even with just a few records, it’s cheaper to build an index, use it, and throw it away than it is to do a nested loop. Let’s prove that with a small sample. First, we’ll do everything with objects in memory just to keep it easy. In a later blog entry, I’ll show how to do exactly the same thing with DLINQ and SQL Express. So, in this entry, when I say “database,” I mean collection of objects in memory. I still have a bunch of interesting new VB9 things to show off, though, including object initializers, XML literals, and query comprehensions.

 

My partner, Erik Meijer, showed me a fantastic little database called Mondial, free for educational and research use. Huzzah! That’s us. It’s full of fun factoids about cities and countries and languages and politics. It’s downloadable in SQL, text, or XML. It illustrates most of the technical problems you’d encounter with a more typical business sample, you know, the old boring “Customers and Orders,” but it’s more likely to keep you awake.

 

Let’s start off with just two of the tables in Mondial: Countries and Cities. Turns out there are 195 countries and 3,053 cities in the database. Each Country has a unique “Code,” which is a very short unique string. The database can use this as a primary key: it only needs to check the “Code” field for uniqueness of a Country record. Here’s the VB for a class representing Mondial countries:

 

Public Class Country

    Public Name As String

    Public Code As String 'Primary key

    Public Capital As String

    Public Province As String

    Public Area As Integer

    Public Population As Integer

End Class

 

And here are a few instances, using VB9’s “object-initializer” syntax (you will need the VB9 CTP [Customer Technology Preview] to work with the code in this post):

 

{Name := "Albania", Code := "AL", Capital := "Tirane", Province := "Albania", Area := 28750, Population := 3249136}

 

{Name := "Andorra", Code := "AND", Capital := "Andorra la Vella", Province := "Andorra", Area := 450, Population := 72766}

 

{Name := "Spain", Code := "E", Capital := "Madrid", Province := "Madrid", Area := 504750, Population := 39181114}

 

Fun facts and figures already! How about cities? Here’s a class for Mondial cities:

 

Public Class City

    Public Name As String

    Public Country As String 'Foreign-key to Country.Code

    Public Province As String

    Public Population As Integer

    Public Longitude As Double

    Public Latitude As Double

End Class

 

And some samples:

 

{Name := "Ankara", Country := "TR", Province := "Ankara", Population := 2782200, Longitude := 32.8833, Latitude := 39.95}

 

{Name := "Helsinki", Country := "SF", Province := "Uusimaa", Population := 487428, Longitude := 24.95, Latitude := 60.1667}

 

{Name := "Rangoon", Country := "MYA", Province := "Yangon", Population := 2513000, Longitude := 96.15, Latitude := 16.7833}

 

Ok, so now we can see how Mondial links cities to their countries: City.Country == Country.Code, that is, the “Country” field of each city must match the “Code” field of the corresponding country. A city record doesn’t contain the name of its country, just the country’s code. So, unless we’ve memorized all 195 country codes, we can’t look at any old city and tell what country it’s in.

 

Database join to the rescue! We’ll use City.Country == Country.Code as our “join condition” and print out a list of cities with the names of their countries retrieved from the list of countries.

 

The initializers for all 195 countries and 3,053 cities fill up about, well, about 3,248 lines of code, which is 50 pages or so and too large to post. So, I think the best way to get them to you is to tell you how I made them. From the Mondial site, I copied one of the SQL formats, which contains 3,248 INSERT VALUES statements. Using emacs keyboard macros, I manually wrestled these statements into the appropriate VB format, shown in the pattern below.

 

Public Module DataInitializer

 

    Public Function CountryData() As List(Of Country)

        Dim cs As New List(Of Country)

        cs.Add(New Country {Name := "Albania", Code := "AL", Capital := "Tirane", Province := "Albania", Area := 28750, Population := 3249136})

        cs.Add(New Country {Name := "Greece", Code := "GR", Capital := "Athens", Province := "Greece", Area := 131940, Population := 10538594})

        cs.Add(New Country {Name := "Macedonia", Code := "MK", Capital := "Skopje", Province := "Macedonia", Area := 25333, Population := 2104035})

...

        CountryData = cs

    End Function

 

    Public Function CityData() As List(Of City)

        Dim cs As New List(Of City)

        cs.Add(New City {Name := "Luxembourg", Country := "L", Province := "Luxembourg", Population := 76600, Latitude := 6.08, Longitude := 49.4})

        cs.Add(New City {Name := "Monaco", Country := "MC", Province := "Monaco", Population := 1234, Longitude := 7.2, Latitude := 43.7})

        cs.Add(New City {Name := "San Marino", Country := "RSM", Province := "San Marino", Population := 4416, Longitude := 12.2, Latitude := 43.5})

...

        CityData = cs

    End Function

 

End Module

 

The whole process took me 10 minutes, including making and correcting the inevitable ham-handed mistakes with emacs regular expressions. Someone with good Perl skills could probably have done the job even more quickly. I do NOT recommend trying this with search-and-replace in the VB IDE, because the VB pretty lister will try to format each of the 3,248 lines one at a time, and it will hurt.

 

Let’s start up a VB module, declare a couple of variables to hold all our data, and a subroutine to fill them up.:

 

Module Module1

    Dim Countries As List(Of Country)

    Dim Cities As List(Of City)

...

 

    Sub Initialize()

        Countries = DataInitializer.CountryData

        Cities = DataInitializer.CityData

    End Sub

 

Now, what’s the most straightforward way to do the join? How about the following:

    Dim xs As IEnumerable

    ' O(Cities) * O(Countries)

    Sub SlowQuery()

        xs = Select New {Cty := it.t.Name, Cny := it.n.Name} _

            From t In CitySubset, n In CountrySubset _

            Where it.t.Country = it.n.Code

 

        For Each x In xs

            'Print or whatever... must do something to force

            'query execution, however.

        Next

    End Sub

 

This sub uses anonymous types, data initializers, and a query comprehension – all new VB9 features – to build an IEnumerable containing city names and their country names. Its execution time is of order 3,053*195 =~ 600,000; painful. Also note carefully that the LINQ query comprehension just sets up the query without executing it. Not until we do something with the result, xs, does any searching and matching take place. LINQ query execution is lazy.

 

Most of these 600,000 comparisons will fail, of course, because they’re checking whether Madrid is in Zimbabwe or Capetown is in Iceland, and so on. In fact, we know that exactly 3,053 of them will succeed. We can be a lot smarter.

 

Suppose we had an index that would let us look up the country for any city, or, equivalently, the list of cities for any country? Then, we could write something like the following:

 

    Dim ys As IEnumerable

    ' O(Number of Cities)

    Sub FastQuery0()

        ys = Select New {Cty := it.t.Name, Cny := it.n.Name} _

            From n In Countries, t In CitiesFromCountry(n)

 

        For Each y In ys

            'Force query execution lest lazy

        Next

    End Sub

 

Where

 

    Dim CountryFromCode As Dictionary(Of String, Country)

    Dim CitiesFromCountry As _

        Dictionary(Of Country, List(Of City))

 

And where we build the indices as follows:

 

    Sub BuildIndices()

        ' "CountryFromCode" looks up country object from a "code" string.

        ' Timing to build this index is O(Countries).

        CountryFromCode = New Dictionary(Of String, Country)

        For Each cn In Countries

            CountryFromCode.Add(cn.Code, cn)

        Next

        ' "CitiesFromCountry" looks up a list of cities

        ' from a country.

        ' Time to build index is O(Cities)

        CitiesFromCountry = _

            New Dictionary(Of Country, List(Of City))

        For Each ct In Cities

            ' Missing code is worthy of an uncaught exception

            Dim d = CountryFromCode(ct.Country)

            Dim cts As List(Of City) = Nothing

            If Not CitiesFromCountry.TryGetValue(d, cts) Then

                cts = New List(Of City)

                CitiesFromCountry.Add(d, cts)

            End If

            cts.Add(ct)

        Next

    End Sub

 

We’re not being very clever about building these indices. For instance, we check for existence of the city list for every city. It doesn’t matter, however, because it’s so much slower to loop over all 600,000 combinations that 2,858 redundant checks, by comparison, are negligible. Also, if we we’re really clever, we would only build one of the two indices, not both, choosing between them by estimating the time to build each and picking the fastest. Real databases do this kind of estimation as part of Query Planning, and a great deal of sophistication of SQL Server 2005, for instance, is devoted to such superoptimization.

 

For our illustrative purposes, though, we’re more interested in clarity, flexibility, and simplicity. Even though we’re overkilling in the index-building department, it’s still far cheaper than doing the slow, nested loop. It’s still just O(Number of Cities). Now, before I show you how I prove these claims, let me just jump to some results and explain what they mean. Then I’ll show you how to reproduce these timings:

 

================================================================

In-memory timings (hecta mu):

+-----+--------+--------+--------+--------+

| NRws| SlowQry| FastQry|IdxBuild| FastTot|

+-----+--------+--------+--------+--------+

| 3053| 1405.87|   10.77|   10.88|   21.64|

| 1413|  330.15|    5.19|    5.25|   10.44|

|  891|  105.39|    3.25|    3.23|    6.48|

|  329|   19.20|    1.25|    1.27|    2.52|

|   89|    2.91|    0.44|    0.43|    0.87|

|   28|    0.62|    0.20|    0.20|    0.40|

|    4|    0.18|    0.11|    0.11|    0.22|

|    2|    0.13|    0.09|    0.10|    0.19|

+-----+--------+--------+--------+--------+

================================================================

Wrote data in Excel format to file ..\..\PerfData31A5.xml

Press "Enter" to end the program ...

 

First, my dimensions. I present timings in hundreds of microseconds, or hecta-mus, just because they come out as nice, digestible magnitudes, where we can get a feel for them without doing a lot of conversions in our heads. The first column shows the number of city-rows over which we do a join. We vary this number by picking random subsets of the entire city set so we can find out how small a database we need before it’s cheaper to do the slow, nested-loop query than to build the index, use it in the fast query, and then throw it away. The second column is the timing of the slow query in hecta-mus. The third column is the timing of the fast query; the fourth column is the timing of the building of a fresh index over the randomly chosen subset; and the last column is the sum of those two. So, the numbers to compare are the second column versus the last column. We write out these timings to an Excel file (using XML literals, which we show off later), so we can chart them, as follows:

 

 

But we don’t even need the chart to see that it’s not until we have well fewer than 28 rows does it pay to even think of doing the nested-loop instead of the build-index, use-index, throw-index-away pattern. In other words, for all practical purposes, never.

 

So, how do we create these results? For high-resolution timing, I’ll just use the underlying Win32 performance counters. I think there is a .NET way to do this, but I just happen to know the Win32 interop programming pattern in my fingers, so here’s my cut at it. The following is a general facility for timing the execution of ANY function-of-no-arguments, which I’m calling a thunk here. This usage of the word “thunk” is borrowed from the Scheme literature: in the Windows zeitgeist it usually means something else, like a shim or a trampoline, so don’t be confused. I’m not using the term in that sense, here.

 

Public Delegate Sub Thunk()

 

Public Module HiRezTimer

 

    Function TimeAThunk(ByVal Something As Thunk) As Double

        Dim startCount As Int64 = QPC()

        Something()

        Dim stopCount As Int64 = QPC()

        TimeAThunk = (stopCount - startCount) / QPF()

    End Function

 

    <System.Runtime.InteropServices.DllImport("Kernel32.dll")> _

    Public Function QueryPerformanceCounter(ByRef perfcount As Int64) As Boolean

 

    End Function

 

    <System.Runtime.InteropServices.DllImport("Kernel32.dll")> _

    Public Function QueryPerformanceFrequency(ByRef freq As Int64) As Boolean

 

    End Function

 

    Public Function QPC() As Int64

        Dim perfcount As Int64

        QueryPerformanceCounter(perfcount)

        Return perfcount

    End Function

 

    Public Function QPF() As Int64

        Dim freq As Int64

        QueryPerformanceFrequency(freq)

        Return freq

    End Function

 

End Module

 

And here is how I use it to produce the results above:

 

    Sub TimeOne(ByVal n As Integer)

        Dim t As New TimingRow

 

        t.ln = n

        t.q1 = Hecta(Micro(TimeAThunk(AddressOf SlowQuery)))

        t.ix = Hecta(Micro(TimeAThunk(AddressOf BuildSubIndices)))

        t.q2 = Hecta(Micro(TimeAThunk(AddressOf FastQuery)))

 

        RawRows.Add(t)

        Console.WriteLine("|{0,5}|{1,8:0.00}|{2,8:0.00}|{3,8:0.00}|{4,8:0.00}|", t.ln, t.q1, t.ix, t.q2, t.ix + t.q2)

    End Sub

 

    Sub DoATiming()

        DrawLine()

 

        RawRows = New List(Of TimingRow)

        Console.WriteLine("In-memory timings (hecta mu):")

 

        BuildIndices()

 

        CountrySubset = Countries

        CitySubset = Cities

 

        Console.WriteLine("+-----+--------+--------+--------+--------+")

        Console.WriteLine("| NRws| SlowQry| FastQry|IdxBuild| FastTot|")

        Console.WriteLine("+-----+--------+--------+--------+--------+")

        TimeOne(CitySubset.Count)

        TimeOne(BuildSubsets(1, 2).m)

        TimeOne(BuildSubsets(1, 4).m)

        TimeOne(BuildSubsets(1, 8).m)

        TimeOne(BuildSubsets(1, 16).m)

        TimeOne(BuildSubsets(1, 32).m)

        TimeOne(BuildSubsets(1, 64).m)

        TimeOne(BuildSubsets(1, 128).m)

        Console.WriteLine("+-----+--------+--------+--------+--------+")

    End Sub

 

Over in the article section, I’ll post the code for “BuildSubsets,” along with all this code in complete VB modules. It’s pretty obvious, dumb, actually. But remember, it doesn’t need to be fast or smart, just clear and easy.

 

The last thing to show off here is how easy it is to write out the Excel version of the timing data using VB9 XML literals. It literally took me less than half an hour to do this, knowing next to nothing about Excel’s XML support. I still know next to nothing about it: I don’t need to know. The timing data goes in a “TimingRow,” which looks like this:

 

Public Class TimingRow

    Public ln As Integer

    Public q1 As Double

    Public ix As Double

    Public q2 As Double

End Class

 

Length, query1, index-build, and query2, yup yup. In the timing routines above, we not only Console.WriteLine the results, but fill new timing rows with them, so we can add a timing row to a list of “RawRows,” declared as follows:

 

Module ExcelWriter

    Dim xl As XDocument

    Public RawRows As List(Of TimingRow)

...

 

In our main routine, after we’re done with all the timings, we call

        ExcelWriter.SaveDocument()

 

Which looks like this:

 

    Public Sub SaveDocument()

        PrepareOutput()

        Dim filename = NonCollidingFileName("..\..\PerfData", ".xml")

        Dim str As New StringWriter

        xl.Save(str)

        Using sw = File.CreateText(filename)

            sw.WriteLine(str)

            sw.Close()

        End Using

        Console.WriteLine("Wrote data in Excel format to file {0}", filename)

    End Sub

 

The only interesting line, here, is “PrepareOutput;” the code for “NonCollidingFileName” is over in the article section with the rest of the entire sample. So, here is “PrepareOutput:”

 

    Private Sub PrepareOutput()

        xl = <?xml version="1.0"?>

             <?mso-application progid="Excel.Sheet"?>

             <Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"

                 xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">

                 <Worksheet ss:Name="Sheet1">

                     <Table>

                         <%= MakeRow("Length", "Qry1", "Indx", "Qry2", "Tot2") %>

                         <%= Select MakeRow(c.ln, c.q1, c.ix, c.q2) _

                             From c In RawRows %>

                     </Table>

                 </Worksheet>

             </Workbook>

    End Sub

 

This creates the literal XML document. I got this document template by creating a bogey document, with constant data instead of variable data, in Excel by hand, choosing “Save As XML,” opening up the result in Notepad, and cutting and pasting the result over to my VB code. I didn’t even have to search for documentation on the XML format for Excel files, let alone read, understand, or learn it. Just blind cut and paste. Neat.

 

I replaced the constant rows with VB XML holes, bracketed by the <%= ... %> syntax. I created two overloads of “MakeRow,” one for column labels, another for the numerical data, calling one straight up and the other through a “Select” comprehension. Neat. Here are those two overloads:

 

    Private Function MakeRow(ByVal a As Integer, ByVal b As Double, _

    ByVal c As Double, ByVal d As Double) As XElement

        MakeRow = <Row xmlns="urn:schemas-microsoft-com:office:spreadsheet"

                      xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"

                      ss:AutoFitHeight="0">

                      <Cell><Data ss:Type="Number"><%= a %></Data></Cell>

                      <Cell><Data ss:Type="Number"><%= b %></Data></Cell>

                      <Cell><Data ss:Type="Number"><%= c %></Data></Cell>

                      <Cell><Data ss:Type="Number"><%= d %></Data></Cell>

                      <Cell ss:Formula="=RC[-2]+RC[-1]"></Cell>

                  </Row>

    End Function

 

    Private Function MakeRow(ByVal a As String, ByVal b As String, _

    ByVal c As String, ByVal d As String, ByVal e As String) As XElement

        MakeRow = <Row xmlns="urn:schemas-microsoft-com:office:spreadsheet"

                      xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"

                      ss:AutoFitHeight="0">

                      <Cell><Data ss:Type="String"><%= a %></Data></Cell>

                      <Cell><Data ss:Type="String"><%= b %></Data></Cell>

                      <Cell><Data ss:Type="String"><%= c %></Data></Cell>

                      <Cell><Data ss:Type="String"><%= d %></Data></Cell>

                      <Cell><Data ss:Type="String"><%= e %></Data></Cell>

                  </Row>

    End Function

 

More XML literals via blind cut-and-paste, with holes inserted via educated guesswork. Notice that the last line in the numerical overload writes out a constant cell, it’s just a constant formula in Excel, that uses relative cell references to calculate the totals. Neat.

 

So there you have it. How to do dumb queries, hash joins, time your results, and write them out in an Excel file, in one really short VB9 program. One last little detail, you really do need to do the timings TWICE and ignore the first set of results, because the CLR Jitter will need to warm up on the first go-round. This could be prevented by NGenning the program before running it, but it’s easier this way, and this post is all about how to do it the easy way. So here is my main routine, which reports that we should ignore the first timing:

 

    Sub Main()

        Initialize()

        Console.WriteLine("Ignore first timing (JIT warmup)")

        DoATiming() 'First one just warms up the Jit

        DoATiming()

        DrawLine()

        ExcelWriter.SaveDocument()

        Console.Write("Press ""Enter"" to end the program ...")

        Console.ReadLine()

        Exit Sub

    End Sub

 

Posted Wednesday, March 15, 2006 10:04 AM by brianbec | 5 comment(s)

How to write a Standalone VM

In case you ever find yourself having to implement (or simulate, or emulate) a stack-based Virtual Machine, a-la Forth or PostScript or even the CLR itself, here's a handy programming pattern for your consideration. The rough idea is to give every instruction some state variables for remembering its arguments, with virtual methods for "execute" and "ToString" (for “explain” mode) and what not. An instance of the VM will have an instruction stream, an instruction pointer, a stack, and methods for adding instructions to the stream, executing the stream, and explaining itself.

The sample below is intentionally very bare-bones. Think of it as the “hello, world” for a Forth-oid VM. It’s a pattern rather than a solution. I just have a handful of instructions here, just to offer the idea. To make this a real, complete, VM, you would add a lot more instructions, following the pattern. This sample doesn’t use the underlying CLR VM; it could, through Reflection. It doesn’t have a console; it could, through Regex or pick-your-favorite parsing solution. It doesn’t have a debugger; it could, through integration with Visual Studio. It doesn’t have a type system; it could -- it just uses VB late binding through object. In fact, this is so bare-bones that I don’t actually have a method for adding instructions, it just exposes the instruction stream as a public member (oh, the HORROR!) … you’re getting the idea.

So, first, here’s the outline for the VM itself, showing detail, for now, just for the self “Test” routine.

Public Class VM

    Public Class Exception REM gotta have our own, don’tya know

        Inherits ApplicationException REM ... impl omitted

    Public stk As New Stack REM this is a .NET collection class

    Private ix As New List(Of Instruction) REM Instr. is ours

    Private ip As New Integer REM instruction pointer

    Public Sub Execute() REM ... impl omitted

    Public Sub Explain() REM ... impl omitted

    Public Sub Test()

        ix.Clear()

        ix.Add(New Push("A"))

        ix.Add(New Push("B"))

        ix.Add(New Push(42))

        ix.Add(New Push("C"))

        ix.Add(New DisplayStack)

        ix.Add(New MakeList(3))

        ix.Add(New DisplayStack)

        ix.Add(New Reverse)

        ix.Add(New DisplayStack)

        ix.Add(New UnList)

        ix.Add(New DisplayStack)

        ix.Add(New Pop(2))

        ix.Add(New DisplayStack)

        Explain()

        Execute()

    End Sub

End Class

 

Obviously, the interesting thing above is the instruction stream in the “Test” routine. The phasing of operations is “create the instructions, add them to the stream, make the stream show itself, then execute the entire stream.” “Push” is an instruction that, when executed, pushes its argument on the stack. “DisplayStack” does some more explaining at run time. “MakeList,” when executed, pops n arguments off the stack and makes a list out of them, pushing the list back on the stack; n is its argument, 3 in this case.  “Reverse,” when executed, pops a list off the stack, reverses the elements, and pushes the resulting list back on the stack. “UnList,” when executed, pops a list off the stack, then serially pushes the elements of the list back on the stack. “Pop,” when executed, pops n elements off the stack, where n is its argument, 2 in this case.

 

What is the expected output of the above? Something like the following:

 

13 Instructions

   0: Push             A

   1: Push             B

   2: Push             42

   3: Push             C

   4: DisplayStack

   5: MakeList         3

   6: DisplayStack

   7: Reverse

   8: DisplayStack

   9: UnList

  10: DisplayStack

  11: Pop              2

  12: DisplayStack

 

4 elements in the stack

   0: C

   1: 42

   2: B

   3: A

 

2 elements in the stack

   0: ["C", 42, "B"]

   1: A

 

2 elements in the stack

   0: ["B", 42, "C"]

   1: A

 

4 elements in the stack

   0: C

   1: 42

   2: B

   3: A

 

2 elements in the stack

   0: B

   1: A

 

Neat. Ok, now the implementations of VM’s methods. Here’s Execute:

 

    Public Sub Execute()

        ip = 0

        While ip < ix.Count

            ip += ix(ip).execute(Me)

        End While

    End Sub

 

Yup, just calls the overridable “execute” method on each instruction, passing in the VM itself so that each instruction can access the stack, as needed. Notice that the ip (instruction pointer) is advanced by an amount returned by “execute.” That’s so we can write branch instructions that move the instruction pointer somewhere far away. Most instructions just return 1 from “execute.”

 

“Explain” is equally easy. The only thing remotely noteworthy is that Console.WriteLine automatically calls the “ToString” method on class “Instruction,” which will, by override (virtual dispatch) call the various formatting routines in the particular instruction subclasses:

 

    Public Sub Explain()

        Console.WriteLine("{0} Instructions", ix.Count)

        ip = 0

        While ip < ix.Count

            Console.WriteLine("{0,4}: {1}", ip, ix(ip))

            ip += 1

        End While

        Console.WriteLine()

    End Sub

 

Now it’s time to see the base class for instruction, again with some detail omitted for now:

 

Public MustInherit Class Instruction

    Public MustOverride Function execute(ByVal d As VM) _

      As Integer

 

    Public Overrides Function ToString() As String

        Return String.Format("{0,-16} {1}", formatOpcode, _

        formatArgs)

    End Function

 

    Public Overridable Function formatArgs() As String

        formatArgs = ""

    End Function

 

    Public Function formatOpcode() As String

        Const opcodePat = "\w+\.(?<opcode>\w+)$"

        Dim s = Me.GetType.ToString

        formatOpcode = Regex.Match _

        (s, opcodePat).Result("${opcode}")

    End Function

End Class

 

“execute” is a pure virtual method – each instruction must supply an implementation. That’s pretty obvious. “ToString,” called by the VM’s “Explain” procedure for each instruction, calls an overridable “formatArgs,” with the empty string being the default return value for those instructions that don’t override it because they don’t have arguments they want to print out. Interestingly, the “formatOpcode” isn’t overridable, because, via Reflection, we can get the name of each instruction subclass without having that subclass participate. Neat.

 

Nothing much left, now. Just show a few instructions, and you’ll get the idea:

 

Public Class Push

    Inherits Instruction