How hard/easy is an HTML parser using the BasicLex/BasicParser design?
HTML always seems to be a popular format for a processor or compiler of one type or another. HTML really is only useful either in its abstract form, as the HTML itself, or in a more well-formed syntax. XML is always nice because you can easily process it with the DOM and get any information you'd want. A while back, Chris Lovett wrote an SGML parser that is really good. For a more complete sample I recommend going there. He parses using various HTML DTD definitions, so you can enforce any HTML rules you'd like. He has both a loose and a strict version of the DTD, so you can parse anything from well-formed to very poorly formed HTML.
My sample revolves around the concepts of recursive descent parsing. Darren was telling me that determining the code path to take when you reach a LeftAngleBracket wasn't at all easy. I've taken the steps to demonstrate this process using my framework, and I use the SymbolTable to define custom tag entries in TokenType. These extra entries let you easily provide conditional processing for different tag types. For instance, the following HTML is valid and is parsed in a specific way. Each <li> self-terminates when the parser reaches the next <li>. This is different from what would happen had we used, say, <span> tags, which would have nested recursively.
<ul><li>foo<li>bar<li>baz</ul> <!-- Non Nesting -->
<td><span>foo <span>bar <span> baz</td> <!-- Nesting -->
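To make the difference between the two lines above concrete, here is a minimal sketch in Python rather than in the BasicLex/BasicParser framework itself. Everything in it is an assumption made for illustration: the SYMBOL_TABLE dictionary, the two TokenType members, the Node class, and the tiny regex tokenizer are hypothetical stand-ins, not the framework's real types. It ignores attributes and only skips comments. The idea it shows is just the one described here: look the tag name up in a symbol table, and let that entry decide whether a new open tag nests or implicitly closes the current element.

```python
import re
from dataclasses import dataclass, field
from enum import Enum, auto


class TokenType(Enum):
    NESTING_TAG = auto()           # e.g. <span>, <ul>, <td>
    SELF_TERMINATING_TAG = auto()  # e.g. <li>: ends when a sibling <li> opens


# Hypothetical symbol table: tag name -> token type. Unknown names nest.
SYMBOL_TABLE = {"li": TokenType.SELF_TERMINATING_TAG}


@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)  # Node or str entries


# Comments are matched first so they can be skipped; attributes are not handled.
TOKEN_RE = re.compile(r"<!--.*?-->|<(/?)([A-Za-z][A-Za-z0-9]*)\s*>|([^<]+)", re.S)


def tokenize(html):
    """Yield ('open', name), ('close', name) or ('text', s) tuples."""
    for m in TOKEN_RE.finditer(html):
        slash, name, text = m.groups()
        if name:
            yield ("close" if slash else "open", name.lower())
        elif text and text.strip():
            yield ("text", text.strip())


class Parser:
    def __init__(self, html):
        self.tokens = list(tokenize(html))
        self.pos = 0

    def parse(self):
        root = Node("#root")
        self.parse_children(root, ancestors=())
        return root

    def parse_children(self, parent, ancestors):
        """Recursive descent: one call per open element."""
        while self.pos < len(self.tokens):
            kind, value = self.tokens[self.pos]
            if kind == "text":
                self.pos += 1
                parent.children.append(value)
            elif kind == "open":
                self_terminating = (
                    SYMBOL_TABLE.get(value) is TokenType.SELF_TERMINATING_TAG
                )
                if self_terminating and value == parent.name:
                    return  # sibling tag ends this element; leave it for the caller
                self.pos += 1
                node = Node(value)
                parent.children.append(node)
                self.parse_children(node, ancestors + (parent.name,))
            elif value == parent.name:
                self.pos += 1  # our own end tag
                return
            elif value in ancestors:
                return  # end tag of an ancestor: close implicitly, do not consume
            else:
                self.pos += 1  # stray end tag: ignore it


def to_html(node):
    """Serialize the tree back out with every element explicitly closed."""
    inner = "".join(c if isinstance(c, str) else to_html(c) for c in node.children)
    return inner if node.name == "#root" else f"<{node.name}>{inner}</{node.name}>"


print(to_html(Parser("<ul><li>foo<li>bar<li>baz</ul>").parse()))
# <ul><li>foo</li><li>bar</li><li>baz</li></ul>
print(to_html(Parser("<td><span>foo <span>bar <span> baz</td>").parse()))
# <td><span>foo<span>bar<span>baz</span></span></span></td>
```

The only place the two tag kinds differ is the check on an open tag: a self-terminating entry returns to the caller without consuming the token, so the next <li> becomes a sibling, while any other tag recurses and becomes a child. Note that this sketch trims whitespace around text runs, which a real parser would likely want to preserve.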
To make the HTML pretty and well-formed, and to do tag completion, you have to apply various rules based on the element type, which is derived from the name of the element. IE does a lot of this on the back-end, and I can imagine the parsing code needed to make documents render correctly is quite complex, and rightfully so. If you need something a bit simpler, then a basic recursive descent parser might be just what you need.
Code-Only: A super basic HTML style parser using BasicLex and a SymbolTable