Do syntax parsing and code highlighting require language parsers or something more?
I've done work on this before, both creating language parsers and then using them to examine syntax and perform code highlighting (among many other things). As a matter of fact, Chris Anderson worked on a small C# language parser when he was still on the Windows Forms team and I was still working on the QuickStarts. We got together a couple of times to look over the code and talk about the possibilities for the tool. The idea then was to provide an extremely good set of tools to use with the QuickStarts and make maintenance easier by transforming code between languages, parsing old C# syntax and converting it to the latest syntax, and doing perfect syntax highlighting.
Now, what isn't widely known is that both of us pretty much deep-sixed the project right when it was getting good (unless Chris did something with it that I don't know about). Before I quit working on the source though, I had it doing some amazing things. Code highlighting was just the start and was actually a pretty easy and logical expansion of the intermediate format the parser emitted. In fact, it just spit out nested XML with a bunch of pseudo-information about each of the tokens, which made it simple for an XSLT transform to work with the code. Other tools included a language translator from C# to JScript .NET, which never did work as well as I had wanted; the ability to use multiple versions of the parser to deal with C# language changes (you don't see this in a production world though); and a few small one-off tools for transforming certain constructs of C# into better constructs. I also wrote a search component, but I don't really think it had much merit, except that it could search only through comments, or only for fields, and so on. Basically contextual searching. Enough background though. What does this have to do with syntax parsing and code highlighting versus a language parser?
You can answer this question by examining what a syntax parser has to do. It has to tell you, while you are typing, that code is wrong or incomplete, and point out possible ways to fix it. I'm not talking full-blown IntelliSense here, I'm just talking about basic brace matching and normal stuff. Resilience is the keyword I think, and the C# parser Chris and I worked on was not resilient (I did make it a bit more so while using it). It would blow up at the slightest issue in the underlying code. It was a parser for ideal input and had no rollback logic built in to give the user information about failures in the code. After all, this wasn't a compiler, which is where all of that logic would have gone. This means any syntax parser needs to be a little more than just a blind parser, in that it needs to be able to insert missing abstract parse nodes and mark them as not yet present in the underlying medium. Pretending that various constructs exist is one of the ways that some parsers continue their job without breaking.
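To make the "insert a missing node and mark it" idea concrete, here is a minimal sketch in Python. It is not the C# parser described above. It only handles nested `{ }` blocks, and the names `Node`, `parse_blocks`, and `missing_nodes` are hypothetical, made up for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str                 # "block", "text", or "close"
    text: str = ""
    present: bool = True      # False means the parser invented this node
    children: list = field(default_factory=list)


def parse_blocks(source: str) -> Node:
    """Parse nested { } blocks, inventing any closing braces that are missing."""
    root = Node("block")
    stack = [root]
    buf = []

    def flush():
        # Move any buffered characters into a text node of the current block.
        if buf:
            stack[-1].children.append(Node("text", "".join(buf)))
            buf.clear()

    for ch in source:
        if ch == "{":
            flush()
            block = Node("block")
            stack[-1].children.append(block)
            stack.append(block)
        elif ch == "}" and len(stack) > 1:
            flush()
            stack.pop().children.append(Node("close", "}"))
        else:
            buf.append(ch)    # a stray "}" at top level is kept as plain text
    flush()

    # Recovery: instead of failing, close every block that is still open
    # and mark the invented brace as not present in the source.
    while len(stack) > 1:
        stack.pop().children.append(Node("close", "}", present=False))
    return root


def missing_nodes(node: Node) -> int:
    """Count the nodes the parser had to pretend exist."""
    return (0 if node.present else 1) + sum(missing_nodes(c) for c in node.children)
```

Parsing `void F() { if (x) { y();` with this sketch yields a complete tree with two `close` nodes flagged `present=False`, which is exactly the information an editor would need to underline the spot or offer to insert the braces.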
Did you know that JScript .NET will automatically insert statement terminators (semicolons) into your source and try again to see if that would fix your error? Yep, you don't even need to use semicolons to terminate your statements, and JScript .NET will actually pretend that the semicolon existed at the end of the line. That is pretty darn cool.
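The "insert the terminator and try again" trick can be sketched in a few lines of Python. This is a toy under loud assumptions: it treats every line as one statement, the functions `parse_statement` and `parse_lines` are hypothetical, and it does not reproduce the actual JScript .NET implementation or the full semicolon insertion rules of the language.

```python
def parse_statement(text: str) -> str:
    """Strict toy parser: a statement must end with ';'."""
    stripped = text.strip()
    if not stripped.endswith(";"):
        raise SyntaxError(f"expected ';' after {stripped!r}")
    return stripped[:-1].strip()


def parse_lines(source: str):
    """Parse one statement per line, retrying with an inserted ';' on failure.

    Returns a list of (statement, terminator_was_inserted) pairs.
    """
    results = []
    for line in source.splitlines():
        if not line.strip():
            continue          # skip blank lines
        try:
            results.append((parse_statement(line), False))
        except SyntaxError:
            # Retry as if the terminator had been typed at the end of the line.
            results.append((parse_statement(line + ";"), True))
    return results
```

Keeping the "was inserted" flag alongside the statement matters: the parse succeeds either way, but a tool can still tell the user which terminators it had to pretend were there.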
Well, what about code highlighting? The same resilience needs to be present, since you are most likely coloring in real time. Coloring in real time means that you can't reparse the whole tree constantly (here are some hints for my friend Darren, who has been working on a parser lately), and that you only really need to parse what the user can see in the code window (taking into account block elements that begin or end outside the current code window). You have to parse keywords and tokens immediately as they are completed, but again, you don't want to have to parse the entire code set.
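One common way to get this behavior, sketched here in Python as an assumption rather than as anything the parser above did, is to keep a cheap per-line record of the state each line starts in (here just "inside a `/* */` comment or not") and run the real highlighter only over the visible lines. All names (`end_state`, `line_start_states`, `highlight_line`, `highlight_window`) and the tiny keyword set are hypothetical.

```python
import re

KEYWORDS = {"class", "int", "return", "void"}
WORD = re.compile(r"\w+|\W+")


def end_state(line: str, in_comment: bool) -> bool:
    """Cheap scan: is a /* ... */ comment still open at the end of this line?"""
    i = 0
    while i < len(line):
        j = line.find("*/" if in_comment else "/*", i)
        if j < 0:
            return in_comment
        in_comment = not in_comment
        i = j + 2
    return in_comment


def line_start_states(lines):
    """State each line starts in; an editor would cache and update this."""
    states, in_comment = [], False
    for line in lines:
        states.append(in_comment)
        in_comment = end_state(line, in_comment)
    return states


def highlight_line(line: str, in_comment: bool):
    """Return (text, style) spans for one line, given its starting state."""
    spans, i = [], 0
    while i < len(line):
        if in_comment:
            j = line.find("*/", i)
            end = len(line) if j < 0 else j + 2
            spans.append((line[i:end], "comment"))
            in_comment, i = j < 0, end
            continue
        j = line.find("/*", i)
        end = len(line) if j < 0 else j
        for piece in WORD.findall(line[i:end]):
            spans.append((piece, "keyword" if piece in KEYWORDS else "plain"))
        if j < 0:
            break
        k = line.find("*/", j + 2)      # the closer must come after the opener
        cend = len(line) if k < 0 else k + 2
        spans.append((line[j:cend], "comment"))
        in_comment, i = k < 0, cend
    return spans


def highlight_window(lines, first, last, states):
    """Highlight only lines[first:last], the part the user can see."""
    return [highlight_line(lines[n], states[n]) for n in range(first, last)]
```

The design choice is the split between the two scans: `end_state` only looks for comment delimiters, so it is cheap enough to run over off-screen text, while the more expensive tokenizing in `highlight_line` is limited to the window.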
That, my friend, is what trees are for. You see, a code-highlighting parser needs to be able to take a given set of contextual information and process only a subset of the code in the window based on that context. What is a context? Well, in many cases, you can use a parse node as context. Take the example of highlighting a method body. All of the code you want to parse gets added to the method body's parse node. You have your context now (where the code starts and ends, since it starts at the beginning of the method body and finishes at the end), so you can parse at will based on this information. Another form of context might be the text body of a string. You know that when you are in a string you don't color keywords, and you know there are rules for when and how the string terminates. Remember that you have to insert a *missing* end-of-string abstract node, and mark it *present* once the user actually fills it in.
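The string case is small enough to sketch in Python. This is only an illustration under assumptions: the `Token` type, the `scan` function, the three keywords, and the double-quote-with-backslash-escape rule are all hypothetical stand-ins for whatever the real language defines.

```python
from dataclasses import dataclass

KEYWORDS = {"if", "return", "while"}


@dataclass
class Token:
    kind: str              # "keyword", "plain", "string", or "string_end"
    text: str
    present: bool = True   # False means the node was invented by the scanner


def scan(code: str) -> list:
    """Tokenize one line; keywords are not colored inside a string context."""
    tokens, i = [], 0
    while i < len(code):
        ch = code[i]
        if ch == '"':
            # String context: look for the closing quote, skipping escapes.
            j = i + 1
            while j < len(code) and code[j] != '"':
                j += 2 if code[j] == "\\" else 1
            closed = j < len(code)
            tokens.append(Token("string", code[i:j] if closed else code[i:]))
            # Always emit an end-of-string node; flag it when it is missing.
            tokens.append(Token("string_end", '"', present=closed))
            i = j + 1 if closed else len(code)
        elif ch.isalpha() or ch == "_":
            j = i
            while j < len(code) and (code[j].isalnum() or code[j] == "_"):
                j += 1
            word = code[i:j]
            tokens.append(Token("keyword" if word in KEYWORDS else "plain", word))
            i = j
        else:
            tokens.append(Token("plain", ch))
            i += 1
    return tokens
```

Scanning `return "if x` produces a `keyword` token for `return`, no keyword token for the `if` inside the string, and a `string_end` node with `present=False`. Once the user types the closing quote and the line is scanned again, the same node comes back with `present=True`.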
That truly answers my title question. Syntax parsers and code highlighters are very specialized language parsers. They require extra instrumentation and node information, beyond what a compiler would need, to make their underlying parse tree more usable for the task at hand. At the same time, a compiler would probably add different information of its own to the tree. Remember that the parsers used by a compiler aren't real-time. They don't have to respond to user input, dynamically parse newly entered text, or make recommendations to the user. They simply have to parse the code from front to back (or however you choose to parse) and spit out the parse tree. Real-time use requires specialized language parsers that emphasize performance and usability. You can't simply grab any old parser and have it work. Very little documentation is available on real-time parsers, lazy parsing, or code window parsing, though a lot of cool applications make use of them. It is one of those black-box programming areas where those who enter the fray reinvent the wheel multiple times with varying ideas of what a wheel is supposed to look like (this generally leads to plenty of hacky code that integrates the code window with the parser).