Designing a language is hard, and M won't change that
I was a bit surprised by the scale of the positive
hype around 'M' in Oslo. Fortunately,
Roger Alsing
wrote a
'Back to reality' post about M
which stood out as a single critical remark against the sea
of positivity. I like it when someone chooses to be the more
sceptical voice in the crowd: that way an outsider gets a
better chance to see the whole picture instead of just one
side. This post is written in that light, to give my two
cents in line with what Roger said and to provide some more
background on what I think he meant.
What's the idea of 'M', the codename for the textual
language in Oslo? It's supposed to make it easy to write
DSLs. A DSL is a 'Domain Specific Language', which in short is a name for a language (in any shape or
form!) to express behavior, data structures, data and what
have you, for a specific, narrow, well defined
domain. So if you want your application to have a custom set
of rules, definable by the user, a DSL might be a good
mechanism to formulate these rules in. Make no mistake about
what a DSL is though: a grid, a form, or text in a text file
can all be representations of a DSL. It's a concept, not a
physical toolkit.
Everyone who has ever written a parser for a grammar, defined a grammar for a language or designed a language, has probably read or heard about the 'Dragon book', or more officially: Compilers: Principles, Techniques, and Tools, by Aho, Sethi and Ullman. In that book, it's explained in all detail what's needed to define a language, to recognize elements in texts containing elements defined in the grammar of the language, parse it and do something with that parse result. It's not a book for the hobby programmer, and for a reason: computer languages, and thus also DSLs, are complex beasts to handle. The domain of computer languages has been studied a lot in the past decades and it's one of the few fields in Computer Science which can be considered a bit more mature than the rest: today, the aspects of computer languages, grammars, lexers, parsers and friends are well understood and documented in many papers and books.
One of the nice things about a relatively mature field is that the techniques to work with the material are fairly mature as well. This means that if you want to define a grammar and look for a way to understand texts written in that grammar, the solutions are available to you in mature formats, with tooling which has proven its value for many years. The question that often arises is of course: with all this tooling and knowledge, is it possible for a less educated / hobby programmer (or a non-programmer) to define a language for a given domain so the domain's complexity is easier to handle? (Because that's the main purpose of DSLs.) The answer is sadly: no, it's not. As the M toolset is also a toolset for producing a parser and doing something with the input, it doesn't make things easier either. I'll explain why below.
The L(AL)R(k) vs. LL(k) difference
If you already wonder what
LALR,
LR
and
LL
stand for, I've proven my point. I've linked them to the
Wikipedia articles, so if you're not familiar with the terms
you can read up and follow the rest of the argument. LALR
and LR are so-called 'Rightmost derivation' parsers, which
can be seen as 'bottom up': they use a stack and a state
machine to recognize
Non-Terminals
in the input so they can, in the end, understand the
complete input.
When an LR parser understands it has seen a Non-Terminal N,
it will replace the elements it has collected (Terminals
and / or Non-Terminals) with the Non-Terminal it just
recognized. This is called a 'Reduce', and it's typically
the event used to trigger handler code outside the
parser, e.g. to build up the
Abstract Syntax Tree (AST). This process continues until the whole
input stream has been reduced to a single Non-Terminal, the start symbol. As
the Wikipedia article describes, writing a parser for LR(k)
grammars by hand is hard, as you have to produce tables with
data for the parser engine, which is a state machine.
Generators can help there. Good examples of such generators
are the
Gold Parser
and YACC. The Gold Parser comes with a visual toolkit which
allows you to test and debug your grammars.
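To make the shift and reduce steps concrete, here's a minimal hand-rolled sketch of the loop such a generated parser runs, for the invented toy grammar E -> E '+' 'n' | 'n'. A real LR parser drives this loop with generated state tables; everything here is illustrative only:

```python
# Minimal shift-reduce loop for the toy grammar:
#   E -> E '+' 'n' | 'n'
# In a generated parser the two reduce checks below would be
# table lookups keyed on the current state; here they are hard-coded.

def parse(tokens):
    stack = []                # holds Terminals and Non-Terminals
    stream = list(tokens)
    while stream or stack != ['E']:
        # Reduce: E '+' 'n'  =>  E
        if stack[-3:] == ['E', '+', 'n']:
            del stack[-3:]
            stack.append('E')
        # Reduce: 'n'  =>  E
        elif stack[-1:] == ['n']:
            stack.pop()
            stack.append('E')
        elif stream:
            stack.append(stream.pop(0))   # Shift the next input element
        else:
            raise SyntaxError(f"cannot reduce {stack}")
    return stack[0]           # the start symbol

print(parse(['n', '+', 'n', '+', 'n']))   # prints: E
```

Each `del`/`append` pair is a Reduce (and the spot where handler code would build an AST node); the `stack.append(stream.pop(0))` is a Shift.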
The main problem with LALR/LR languages is that grammars can
lead to conflicts, in the form of reduce/reduce conflicts or
shift/reduce conflicts or even shift/shift conflicts. The
conflicts all arise inside the parser's state machine where
it is in state S and the path to get there with the current
input decides which state will follow. Reduce/reduce
conflicts occur when the parser sees an input element E and
on the stack are a group of elements which together with E
form a match for two (or even more!) Non-Terminals: it can't
decide, based on E and the stack contents, which
Non-Terminal should replace the set of elements on the stack
+ E. A shift/reduce conflict arises when the parser sees an
element E and has to decide whether to replace the set of
elements on the stack + E with Non-Terminal N1 (reduce), or
to push E onto the stack (shift) to try to form the required
elements for Non-Terminal N2. Shift/shift conflicts occur
when the parser can't decide for which Non-Terminal to push
the input element E onto the stack (shift).
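The classic illustration of a shift/reduce conflict is the 'dangling else'. In yacc-like notation (a minimal sketch, not taken from any real grammar):

```yacc
stmt : IF expr THEN stmt              /* reduce here when ELSE follows...    */
     | IF expr THEN stmt ELSE stmt    /* ...or shift the ELSE for this rule? */
     | OTHER
     ;
```

After the parser has seen IF expr THEN stmt and the next input element is ELSE, it can either reduce the first production or shift the ELSE to try to complete the second: a shift/reduce conflict. Tools like YACC resolve it by shifting by default, which happens to bind the ELSE to the nearest IF.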
These conflicts should be resolved to have a proper language
which doesn't contain a bug in itself. It's sometimes
sufficient to pick one over the other (e.g. shift instead of
reduce in a shift/reduce conflict) but in general it's not
advised to do so. Add to that that as soon as a language is
ambiguous (like C# and VB.NET are, for example), it's
harder to solve these conflicts: the parser has to 'look
ahead' to see what's coming to decide what to do. LALR
parsers therefore suffer less from these conflicts,
yet it's something one has to deal with in a lot of cases.
To be able to solve these conflicts, though, one has to
understand how the parsing process works.
There's another type of parser, the LL(k) parser. An LL(k)
parser is a Leftmost derivation parser, which means that
it's a top-down parser. It doesn't use shift/reduce tables
and a stack; instead, it simply looks at the input and calls
a handler for the Non-Terminal that the input symbol starts,
leaving the handling to that handler. So
instead of waiting for the parser to see the complete 'If'
expression for the 'If' Non-Terminal to reduce it and call
the handler for the 'If' Non-Terminal, it calls the 'If'
Non-Terminal handler after it sees a Terminal 'if' and
expects the If Non-Terminal handler to take care of it.
LL(k) parsers can be seen as state machines similar to
LR/LALR parsers, yet the state machine is less visible: it's
inside the handler code.
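That 'If' handler idea can be sketched as a minimal recursive-descent parser. The token shapes and AST tuples here are invented for illustration; a real generator like ANTLR would emit this kind of code from the grammar:

```python
# Minimal recursive-descent (LL) sketch: the 'if' handler is called
# as soon as the 'if' Terminal is seen, and it drives the rest of the
# rule itself, building its AST node along the way.

def parse_statement(tokens):
    if tokens and tokens[0] == 'if':
        return parse_if(tokens)            # dispatch on the start symbol
    return ('expr', tokens.pop(0))         # toy fallback: a bare expression

def parse_if(tokens):
    tokens.pop(0)                          # consume the 'if' Terminal
    condition = tokens.pop(0)              # toy: condition is one token
    assert tokens.pop(0) == 'then'
    body = parse_statement(tokens)         # recurse for the nested rule
    return ('if', condition, body)         # AST node for this Non-Terminal

print(parse_statement(['if', 'x', 'then', 'y']))
# prints: ('if', 'x', ('expr', 'y'))
```

The 'state machine' the text mentions is implicit: it lives in the call structure of the handlers.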
As the handling of a Non-Terminal is done in a process you
have full control over, as you're there along the way, it's
much easier to handle errors: if you expect a ')' and it's
not there, you can report the error, act as if you've
seen that bracket and move on to handle more errors (or
even ignore the error, if the language has to be very
flexible). An LR/LALR parser can't do that easily because it
simply has to give up: it sees an element E which, combined
with the stack contents, can't be used to either shift or
reduce. This makes it easier for LL(k) parsers to handle
ambiguous languages and to keep parsing when errors occur.
Of course generators for LL(k) are available as well; a good
example is the widely known
ANTLR,
which, despite what its name might suggest, is an LL(k) toolkit.
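The ')' recovery trick is easy to sketch in that style; all names and token shapes below are invented for illustration:

```python
# Lenient expect(): report a missing Terminal, pretend it was seen,
# and keep parsing so further errors can still be found.

errors = []

def expect(tokens, terminal):
    if tokens and tokens[0] == terminal:
        return tokens.pop(0)
    errors.append(f"expected '{terminal}', got {tokens[:1] or 'end of input'}")
    return terminal                      # act as if the Terminal was seen

def parse_call(tokens):                  # toy rule: name '(' arg ')'
    name = tokens.pop(0)
    expect(tokens, '(')
    arg = tokens.pop(0)
    expect(tokens, ')')                  # ')' missing? report and move on
    return ('call', name, arg)

print(parse_call(['f', '(', 'x']))       # prints: ('call', 'f', 'x')
print(errors)
```

An LR/LALR parser would have bailed out at the missing ')'; here the handler simply records the error and continues.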
One core difference between LL(k) and LALR/LR is that the code to build the AST is inside the grammar for LL(k) and outside the grammar for LALR/LR(k). This is logical, because LL(k) is handled with custom handlers for the grammar's Non-Terminals, so the code which handles Non-Terminal N is custom made for N, containing the code to build the AST node for N along the way (of course using re-usable classes/methods, but the structure of the handler is unique for that Non-Terminal). To be able to do that, the custom code is specified inside the grammar which is fed to the generator, which produces a proper parser for the grammar, including the custom code. For LALR/LR grammars, the code can be outside the grammar because the generator doesn't generate any code; it generates data, which is the input for the state machine to make sure the parser knows what to do when a given input is seen.
My experience with both LALR/LR and LL is that with LL, it's easier to write a grammar but harder to produce custom code for the final parser: it's very hard to debug code which is inside a grammar which is used to generate a parser whose task it is to build an AST from the input which in turn is used to do something with the input. For LALR/LR it's hard(er) to get the grammar right, and harder to get good error recovery, but writing the handling code to build the AST is relatively easy: you just fill in the blanks for every Non-Terminal in your grammar. I know that in one of the Oslo presentations, it was shown that the language could be debugged. The ANTLR tools can do that too (for Java), but does what the M tools offer make things less complex than the complexity level of the ANTLR tools? One still has to know what's on the screen, what everything means and what's handled by what at what moment.
So when I look at M's elements, I can only conclude that it
too has the same aspects as the well known tooling and
building blocks for parsers, grammars and the like available
to everyone today: it's complex. With 'complex' I mean: you
need proper knowledge about computer languages, parsers and
grammars to understand what's going on and how to use the
tools available properly. There's nothing wrong with
that: if something is hard, well... that's life; deal with
it by understanding the concepts which make a hard problem
manageable. Though to me it seems impossible that all crucial
aspects of writing computer languages are suddenly not
important anymore with M's elements on the table. Sure,
people will write DSLs with M's elements, but today you can
do that too with ANTLR or Gold Parser, use a visual toolkit
to design, test and debug the language, and you're likely up
and running in a day or two. That doesn't mean a person who
has no clue what LL(k) means or what an AST is can create a
language for a given domain with ANTLR: some knowledge about
parsers, grammars etc. is required. And with M's elements,
this isn't any different at all.
That's not bad, not at all. However, to me it's a different
reality from the picture Microsoft tried to paint with the
Oslo material provided through the PDC, and therefore I
fully agree with what Roger said in his blog post.
The core issue is that to be able to write a successful DSL, you have to be able to create a grammar, within the restrictions of the system in which you have to specify the grammar. With the restrictions, rules and aspects of the various parsing techniques as the guideline of what you can do inside the grammar, it's still very hard to define a grammar for a DSL, because not only does the DSL have to be expressive enough and compact enough, it also has to work on all valid input and be able to produce a proper result in the form expected. If you want to experience how complex this is, try to write a very simple business rule grammar, in EBNF, which for example allows you to specify validation rules for objects. The difficulty is for example in 'all valid input'. What's valid input for your language? How do you determine that? You do that by defining rules, which will eventually become your grammar. An initial start is probably easy, though it gets out of hand pretty quickly if you want to add another couple of Non-Terminals, or want to re-use a Terminal inside multiple Non-Terminals, which might all of a sudden lead to an ambiguous language and thus either to conflicts or to more complex code in your handlers.
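To make that exercise concrete, here's a hedged first stab at such a validation-rule grammar in EBNF. All names are invented for illustration:

```ebnf
(* toy validation-rule language, e.g.:  validate Order.Total : Total > 0 ; *)
rule       = "validate" , property , ":" , condition , ";" ;
condition  = comparison , { ( "and" | "or" ) , comparison } ;
comparison = property , operator , value ;
operator   = "==" | "!=" | "<" | ">" | "<=" | ">=" ;
property   = identifier , { "." , identifier } ;
value      = number | string | property ;
```

Even this toy version dodges the hard parts: identifier, number and string still need lexical definitions, there is no 'not' and no parentheses yet, and 'and' vs. 'or' precedence is undefined. The moment you add those, or re-use property in more places, you run into exactly the ambiguity and conflict questions discussed earlier.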
I appreciate Microsoft making the effort in this direction, and the future will tell what the M toolbox will become. I truly hope it can meet the expectations of many people, though I seriously doubt it will be a set of tools used by what looks like one of the big parts of the target audience: the group of people who lack knowledge of grammars, parsers, languages and the like, as to me it doesn't lower the bar of complexity one bit. While that's not something to worry about in general, it is an aspect we have to be careful about: there's already enough complexity in our field. We need elements that make things more understandable, that make complexity more manageable, not elements that muddy the waters as if it's all of a sudden as easy as tic-tac-toe to write a grammar, language, parser and AST interpreter code and every non-programmer can create DSLs.