How hard/easy is an HTML parser using the BasicLex/BasicParser design?

HTML always seems to be a popular format for a processor/compiler of some type or another.  HTML really is only useful either in its abstract form, as the HTML, or in a more well-formed syntax.  XML is always nice because you can easily process it with the DOM and get any information you'd want.  A while back, Chris Lovett wrote an SGML parser that is really good.  For a more complete sample I recommend going there.  He parses using various HTML DTD definitions, so you can enforce any HTML rules you'd like.  He has both a loose and a strict version of the DTD, so you can parse well-formed or very poorly formed HTML.

My sample revolves around the concepts of recursive descent parsing.  Darren was telling me that determining the code path to take when you reach a LeftAngleBracket wasn't at all easy.  I've taken the steps to demonstrate this process using my framework, and I use the SymbolTable to define custom tag entries in TokenType.  These extra entries allow you to easily provide conditional processing for different tag types.  For instance, the following HTML is valid and is parsed in a specific way.  Each <li> self-terminates when it reaches the next.  This is different than if we had used, say, <span> tags, which would have nested recursively.

<ul><li>foo<li>bar<li>baz</ul> <!-- Non Nesting -->
<td><span>foo <span>bar <span> baz</td> <!-- Nesting -->
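
To make the difference concrete, here is a minimal Python sketch of the idea.  It is not the BasicLex/BasicParser code itself; the names TokenType.OPEN_TAG_LI, SYMBOL_TABLE, lex, and Parser are my own stand-ins, and the lexer assumes bare tags with no attributes or comments.  The symbol table maps the tag name "li" to a custom token type, and the recursive descent routine uses that type to decide whether a new tag nests or ends the current element.  A close tag that matches nothing simply stops the parse.

```python
import re
from enum import Enum, auto


class TokenType(Enum):
    TEXT = auto()
    OPEN_TAG = auto()      # generic tag: nests recursively
    OPEN_TAG_LI = auto()   # custom entry: ends at the next <li>
    CLOSE_TAG = auto()


# Hypothetical symbol table: tag name -> custom token type.
SYMBOL_TABLE = {"li": TokenType.OPEN_TAG_LI}

# Bare tags only (no attributes, no comments); anything else is text.
TOKEN_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)\s*>|([^<]+)")


def lex(html):
    tokens = []
    for slash, name, text in (m.groups() for m in TOKEN_RE.finditer(html)):
        if text is not None:
            if text.strip():
                tokens.append((TokenType.TEXT, text.strip()))
        elif slash:
            tokens.append((TokenType.CLOSE_TAG, name.lower()))
        else:
            name = name.lower()
            tokens.append((SYMBOL_TABLE.get(name, TokenType.OPEN_TAG), name))
    return tokens


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def parse_children(self, parent):
        """Collect children until something ends the parent element."""
        children = []
        while (tok := self.peek()) is not None:
            kind, value = tok
            if kind is TokenType.CLOSE_TAG:
                break  # leave it for whichever ancestor owns it
            if kind is TokenType.OPEN_TAG_LI and parent == "li":
                break  # the next <li> implicitly ends the current one
            self.pos += 1
            if kind is TokenType.TEXT:
                children.append(value)
            else:
                children.append(self.parse_element(value))
        return children

    def parse_element(self, name):
        children = self.parse_children(name)
        if self.peek() == (TokenType.CLOSE_TAG, name):
            self.pos += 1  # consume our own close tag if it is there
        return (name, children)


def parse(html):
    return Parser(lex(html)).parse_children(None)


print(parse("<ul><li>foo<li>bar<li>baz</ul>"))
print(parse("<td><span>foo <span>bar <span> baz</td>"))
```

The first call yields three sibling li nodes under ul, while the second yields three span nodes nested one inside the other under td.  The only thing that differs between the two code paths is the token type the symbol table hands back for the tag name.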

To make the HTML pretty and well-formed, and to do tag completion, you have to apply various rules based on the element type, which is derived from the name of the element.  IE does a lot of this on the back-end, and I can imagine the parsing code that makes documents render correctly is quite complex, and rightfully so.  If you need something a bit simpler, then a basic recursive descent parser might be just what you need.

Code-Only: A super basic HTML style parser using BasicLex and a SymbolTable
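
As a rough illustration of rule-driven tag completion, the Python sketch below rewrites loose markup into well-formed markup in a single pass over the tags.  It is an assumption-laden toy, not what IE or the linked sample does: SELF_TERMINATING and complete_tags are hypothetical names, the rule table contains only the li rule discussed above, and attributes and comments are not handled.

```python
import re

# Assumed rule table: tags that end implicitly when the same tag opens again.
SELF_TERMINATING = {"li"}

# Bare tags only (no attributes, no comments); anything else is text.
TAG_OR_TEXT = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)\s*>|([^<]+)")


def complete_tags(html):
    out = []
    stack = []  # names of the elements that are currently open

    def close_top():
        out.append(f"</{stack.pop()}>")

    for slash, name, text in (m.groups() for m in TAG_OR_TEXT.finditer(html)):
        if text is not None:
            out.append(text)
            continue
        name = name.lower()
        if slash:
            if name in stack:
                # Close everything left open inside the element being closed.
                while stack[-1] != name:
                    close_top()
                close_top()
            # A close tag with no matching open tag is silently dropped.
        else:
            if name in SELF_TERMINATING and stack and stack[-1] == name:
                close_top()  # e.g. a new <li> ends the previous <li>
            stack.append(name)
            out.append(f"<{name}>")

    while stack:  # close whatever is still open at end of input
        close_top()
    return "".join(out)


print(complete_tags("<ul><li>foo<li>bar<li>baz</ul>"))
print(complete_tags("<td><span>foo <span>bar <span> baz</td>"))
```

Each additional element-specific rule would mean another entry in the rule table or another branch in the open-tag case, which hints at how the real thing gets complicated quickly.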

Published Thursday, May 20, 2004 12:09 AM by Justin Rogers

Comments

Thursday, May 20, 2004 6:37 AM by Stephane Rodriguez

# re: How hard/easy is an HTML parser using the BasicLex/BasicParser design?


Me think that you have no real choice if you intend to write a parser as part of the renderer. You need to apply the same rules as IE. That's why dominant positions are such a shame, especially when the software is so broken and full of internal choices.

Regarding the parsing service itself, although Lovett's SGML parser might be regarded as a reference, I guess it becomes completely useless in the real world, where most of the time you get invalid HTML. The strength of an HTML parser should be to have an efficient diagnostics mechanism (with a meaningful report system, see Safari) and to be able to switch between several parsing modes, including loose, strict, ... All in all, you are not going to build that kind of reliable parser with a regexp (contrary to what some people try to show with short-sighted teasers).

My 0.5 cent



Thursday, May 20, 2004 7:13 AM by Justin Rogers

# re: How hard/easy is an HTML parser using the BasicLex/BasicParser design?

I think Darren's parser is well on the way to being fairly efficient and resilient to real world HTML. I agree that the SGML parser, even with the loose DTD, can be rather rigid.

I think the end result of Darren's work will be some relatively powerful HTML tools. I already have some ideas for what he is working on, as well as some small tools I'm planning myself.

Creating order out of chaos is how I refer to this process. If I create just enough order to make my life easier, then it's a job well done.

Thursday, May 20, 2004 11:12 AM by TrackBack

# Short on time but here's some important parsing stuff...
