Designing a language is hard, and M won't change that
I was a bit surprised by the scale of the positive
hype around 'M' in Oslo. Fortunately,
Roger Alsing
wrote a
'Back to reality' post about M
which stood out as a single critical remark against the sea
of positivity. I like it when someone chooses to be the more
sceptical voice in the crowd: that way an outsider gets a
better chance to see the whole picture instead of just one
side. This post is written in that light, to give my two
cents in line with what Roger said and to provide some more
background on what I think he meant.
What's the idea of 'M', the codename for the textual
language in Oslo? It's supposed to make it easy to write
DSLs. A DSL is a 'Domain Specific Language', which in short is a name for a language (in any shape or
form!) to express behavior, data structures, data and what
have you, for a specific, narrow, well defined
domain. So if you want your application to have a custom set
of rules, definable by the user, a DSL might be a good
mechanism to formulate these rules in. Make no mistake about
what a DSL is though: a grid, a form, or text in a text file
can all be representations of a DSL. It's a concept, not a
physical toolkit.
Everyone who has ever written a parser for a grammar, defined a grammar for a language or designed a language, has probably read or heard about the 'Dragon book', or more officially: Compilers: Principles, Techniques, and Tools, by Aho, Sethi and Ullman. In that book, it's explained in all detail what's needed to define a language, to recognize elements in texts containing elements defined in the grammar of the language, parse it and do something with that parse result. It's not a book for the hobby programmer, and for a reason: computer languages, and thus also DSLs, are complex beasts to handle. The domain of computer languages has been studied a lot in the past decades and it's one of the few fields in Computer Science which can be considered a bit more mature than the rest: today, the aspects of computer languages, grammars, lexers, parsers and friends are well understood and documented in many papers and books.
One of the nice things about a relatively mature field is that the techniques to work with the material are fairly mature as well. This means that if you want to define a grammar and look for a way to understand texts written in that grammar, the solutions are available to you in mature formats, with tooling which has proven its value for many years. The question that often arises is of course: with all this tooling and knowledge, is it possible for a less educated / hobby programmer (or a non-programmer) to define a language for a given domain so the domain's complexity is easier to handle? (Because that's the main purpose of DSLs.) The answer is sadly: no, it's not. As the M toolset is also a toolset for producing a parser and doing something with the input, it doesn't make things easier either. I'll explain why below.
The L(AL)R(k) vs. LL(k) difference
If you already wonder what
LALR,
LR
and
LL
stand for, I've proven my point. I've linked them to the
Wikipedia articles, so if you're not familiar with the terms
you can read up and follow the rest of the argument. LALR
and LR are so-called 'Rightmost derivation' parsers, which
can be seen as 'bottom up': they use a stack and a state
machine to recognize
Non-Terminals
in the input so they can, in the end, understand the
complete input.
When an LR parser understands it has seen a Non-Terminal N,
it will replace the elements it has collected (Terminals
and / or Non-Terminals) with the Non-Terminal it just
recognized. This is called a 'Reduce', and it's typically
the event used to trigger handler code outside the
parser, e.g. to build up the
Abstract Syntax Tree (AST). This process continues until the whole
input stream has been reduced to a single Non-Terminal, the start symbol. As
the Wikipedia article describes, writing a parser for LR(k)
grammars by hand is hard, as you have to produce tables with
data for the parser engine, which is a state machine.
Generators can help there. Good examples of such generators
are the
Gold Parser
and YACC. The Gold Parser comes with a visual toolkit which
allows you to test and debug your grammars.
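To make the shift and reduce steps concrete, here's a minimal hand-rolled sketch of the loop such a generated parser runs, for the invented toy grammar E -> E '+' 'n' | 'n'. A real LR parser drives this loop with generated state tables; everything here is illustrative only:

```python
# Minimal shift-reduce loop for the toy grammar:
#   E -> E '+' 'n' | 'n'
# In a generated parser the two reduce checks below would be
# table lookups keyed on the current state; here they are hard-coded.

def parse(tokens):
    stack = []                # holds Terminals and Non-Terminals
    stream = list(tokens)
    while stream or stack != ['E']:
        # Reduce: E '+' 'n'  =>  E
        if stack[-3:] == ['E', '+', 'n']:
            del stack[-3:]
            stack.append('E')
        # Reduce: 'n'  =>  E
        elif stack[-1:] == ['n']:
            stack.pop()
            stack.append('E')
        elif stream:
            stack.append(stream.pop(0))   # Shift the next input element
        else:
            raise SyntaxError(f"cannot reduce {stack}")
    return stack[0]           # the start symbol

print(parse(['n', '+', 'n', '+', 'n']))   # prints: E
```

Each `del`/`append` pair is a Reduce (and the spot where handler code would build an AST node); the `stack.append(stream.pop(0))` is a Shift.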
The main problem with LALR/LR languages is that grammars can
lead to conflicts, in the form of reduce/reduce conflicts or
shift/reduce conflicts or even shift/shift conflicts. The
conflicts all arise inside the parser's state machine where
it is in state S and the path to get there with the current
input decides which state will follow. Reduce/reduce
conflicts occur when the parser sees an input element E and
on the stack are a group of elements which together with E
form a match for two (or even more!) Non-Terminals: it can't
decide, based on E and the stack contents, which
Non-Terminal should replace the set of elements on the stack
+ E. A shift/reduce conflict arises when the parser sees an
element E and has to decide whether to replace the set of
elements on the stack + E with Non-Terminal N1 (reduce), or
to push E onto the stack (shift) to try to form the required
elements for Non-Terminal N2. Shift/shift conflicts occur
when the parser can't decide for which Non-Terminal to push
the input element E onto the stack (shift).
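The classic illustration of a shift/reduce conflict is the 'dangling else'. In yacc-like notation (a minimal sketch, not taken from any real grammar):

```yacc
stmt : IF expr THEN stmt              /* reduce here when ELSE follows...    */
     | IF expr THEN stmt ELSE stmt    /* ...or shift the ELSE for this rule? */
     | OTHER
     ;
```

After the parser has seen IF expr THEN stmt and the next input element is ELSE, it can either reduce the first production or shift the ELSE to try to complete the second: a shift/reduce conflict. Tools like YACC resolve it by shifting by default, which happens to bind the ELSE to the nearest IF.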
These conflicts should be resolved to have a proper language
which doesn't contain a bug in itself. It's sometimes
sufficient to pick one over the other (e.g. shift instead of
reduce in a shift/reduce conflict) but in general it's not
advised to do so. Add to that that as soon as a language is
ambiguous (like C# and VB.NET are, for example), it's
harder to solve these conflicts: the parser has to 'look
ahead' to see what's coming to decide what to do. LALR
parsers therefore suffer less from these conflicts,
yet it's something one has to deal with in a lot of cases.
To be able to solve these conflicts, though, one has to
understand how the parsing process works.
There's another type of parser, the LL(k) parser. An LL(k)
parser is a Leftmost derivation parser, which means that
it's a top-down parser. It doesn't use shift/reduce tables
and a stack; instead, it simply looks at the input and calls
a handler for the Non-Terminal that the input symbol starts,
leaving the handling to that handler. So
instead of waiting for the parser to see the complete 'If'
expression for the 'If' Non-Terminal to reduce it and call
the handler for the 'If' Non-Terminal, it calls the 'If'
Non-Terminal handler after it sees a Terminal 'if' and
expects the If Non-Terminal handler to take care of it.
LL(k) parsers can be seen as state machines similar to
LR/LALR parsers, yet the state machine is less visible: it's
inside the handler code.
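That 'If' handler idea can be sketched as a minimal recursive-descent parser. The token shapes and AST tuples here are invented for illustration; a real generator like ANTLR would emit this kind of code from the grammar:

```python
# Minimal recursive-descent (LL) sketch: the 'if' handler is called
# as soon as the 'if' Terminal is seen, and it drives the rest of the
# rule itself, building its AST node along the way.

def parse_statement(tokens):
    if tokens and tokens[0] == 'if':
        return parse_if(tokens)            # dispatch on the start symbol
    return ('expr', tokens.pop(0))         # toy fallback: a bare expression

def parse_if(tokens):
    tokens.pop(0)                          # consume the 'if' Terminal
    condition = tokens.pop(0)              # toy: condition is one token
    assert tokens.pop(0) == 'then'
    body = parse_statement(tokens)         # recurse for the nested rule
    return ('if', condition, body)         # AST node for this Non-Terminal

print(parse_statement(['if', 'x', 'then', 'y']))
# prints: ('if', 'x', ('expr', 'y'))
```

The 'state machine' the text mentions is implicit: it lives in the call structure of the handlers.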
As the handling of a Non-Terminal is done in a process you
have full control over, as you're there along the way, it's
much easier to handle errors: if you expect a ')' and it's
not there, you can report the error, act as if you've
seen that bracket and move on to handle more errors (or
even ignore the error, if the language has to be very
flexible). An LR/LALR parser can't do that easily because it
simply has to give up: it sees an element E which, combined
with the stack contents, can't be used to either shift or
reduce. This makes it easier for LL(k) parsers to handle
ambiguous languages and to keep parsing when errors occur.
Of course generators for LL(k) are available as well; a good
example is the widely known
ANTLR,
which, despite what its name might suggest, is an LL(k) toolkit.
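The ')' recovery trick is easy to sketch in that style; all names and token shapes below are invented for illustration:

```python
# Lenient expect(): report a missing Terminal, pretend it was seen,
# and keep parsing so further errors can still be found.

errors = []

def expect(tokens, terminal):
    if tokens and tokens[0] == terminal:
        return tokens.pop(0)
    errors.append(f"expected '{terminal}', got {tokens[:1] or 'end of input'}")
    return terminal                      # act as if the Terminal was seen

def parse_call(tokens):                  # toy rule: name '(' arg ')'
    name = tokens.pop(0)
    expect(tokens, '(')
    arg = tokens.pop(0)
    expect(tokens, ')')                  # ')' missing? report and move on
    return ('call', name, arg)

print(parse_call(['f', '(', 'x']))       # prints: ('call', 'f', 'x')
print(errors)
```

An LR/LALR parser would have bailed out at the missing ')'; here the handler simply records the error and continues.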
One core difference between LL(k) and LALR/LR is that the code to build the AST is inside the grammar for LL(k) and outside the grammar for LALR/LR(k). This is logical, because LL(k) is handled with custom handlers for the grammar's Non-Terminals, so the code which handles Non-Terminal N is custom made for N, containing the code to build the AST node for N along the way (of course using re-usable classes/methods, but the structure of the handler is unique for that Non-Terminal). To be able to do that, the custom code is specified inside the grammar which is fed to the generator, which produces a proper parser for the grammar, including the custom code. For LALR/LR grammars, the code can be outside the grammar because the generator doesn't generate any code; it generates data, which is the input for the state machine to make sure the parser knows what to do when a given input is seen.
My experience with both LALR/LR and LL is that with LL, it's easier to write a grammar but harder to produce custom code for the final parser: it's very hard to debug code which is inside a grammar which is used to generate a parser whose task it is to build an AST from the input which in turn is used to do something with the input. For LALR/LR it's hard(er) to get the grammar right, and harder to get good error recovery, but writing the handling code to build the AST is relatively easy: you just fill in the blanks for every Non-Terminal in your grammar. I know that in one of the Oslo presentations, it was shown that the language could be debugged. The ANTLR tools can do that too (for Java), but does what the M tools offer make things less complex than the complexity level of the ANTLR tools? One still has to know what's on the screen, what everything means and what's handled by what at what moment.
So when I look at M's elements, I can only conclude that it
too has the same aspects as the well known tooling and
building blocks for parsers, grammars and the like available
to everyone today: it's complex. With 'complex' I mean: you
need proper knowledge about computer languages, parsers and
grammars to understand what's going on and how to use the
tools available properly. There's nothing wrong with
that: if something is hard, well... that's life; deal with
it by understanding the concepts which make a hard problem
manageable. Though to me it seems impossible that all crucial
aspects of writing computer languages are suddenly not
important anymore with M's elements on the table. Sure,
people will write DSLs with M's elements, but today you can
do that too with ANTLR or Gold Parser, use a visual toolkit
to design, test and debug the language, and you're likely up
and running in a day or two. That doesn't mean a person who
has no clue what LL(k) means or what an AST is can create a
language for a given domain with ANTLR: some knowledge about
parsers, grammars etc. is required. And with M's elements,
this isn't any different at all.
That's not bad, not at all. However, to me it's a different
reality from the picture Microsoft tried to paint with the
Oslo material provided through the PDC, and therefore I
fully agree with what Roger said in his blog post.
The core issue is that to be able to write a successful DSL, you have to be able to create a grammar, within the restrictions of the system in which you have to specify the grammar. With the restrictions, rules and aspects of the various parsing techniques as the guideline of what you can do inside the grammar, it's still very hard to define a grammar for a DSL, because not only does the DSL have to be expressive enough and compact enough, it also has to work on all valid input and be able to produce a proper result in the form expected. If you want to experience how complex this is, try to write a very simple business rule grammar, in EBNF, which for example allows you to specify validation rules for objects. The difficulty is for example in 'all valid input'. What's valid input for your language? How do you determine that? You do that by defining rules, which will eventually become your grammar. An initial start is probably easy, though it gets out of hand pretty quickly if you want to add another couple of Non-Terminals, or want to re-use a Terminal inside multiple Non-Terminals, which might all of a sudden lead to an ambiguous language and thus either to conflicts or to more complex code in your handlers.
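To make that exercise concrete, here's a hedged first stab at such a validation-rule grammar in EBNF. All names are invented for illustration:

```ebnf
(* toy validation-rule language, e.g.:  validate Order.Total : Total > 0 ; *)
rule       = "validate" , property , ":" , condition , ";" ;
condition  = comparison , { ( "and" | "or" ) , comparison } ;
comparison = property , operator , value ;
operator   = "==" | "!=" | "<" | ">" | "<=" | ">=" ;
property   = identifier , { "." , identifier } ;
value      = number | string | property ;
```

Even this toy version dodges the hard parts: identifier, number and string still need lexical definitions, there is no 'not' and no parentheses yet, and 'and' vs. 'or' precedence is undefined. The moment you add those, or re-use property in more places, you run into exactly the ambiguity and conflict questions discussed earlier.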
I appreciate Microsoft making the effort in this direction, and the future will tell what the M toolbox will become. I truly hope it can meet the expectations of many people, though I seriously doubt it will be a set of tools used by what looks like one of the big parts of the target audience: the group of people who lack knowledge of grammars, parsers, languages and the like, as to me it doesn't lower the bar of complexity one bit. While that's not something to worry about in general, it is an aspect we have to be careful about: there's already enough complexity in our field. We need elements that make things more understandable, that make complexity more manageable, not elements that muddy the waters as if it's all of a sudden as easy as tic-tac-toe to write a grammar, language, parser and AST interpreter code and every non-programmer can create DSLs.