Should HTML be considered as a data format?

As HTML is becoming more and more semantic, at least in intent, and all styling is moving into CSS, one has to wonder what it is now representing. It seems like it is now a format for unstructured data (a.k.a. rich text), in the same sense that XML and JSON are formats for semi-structured and structured data and CSV is a format for tabular data.

If that is the case, it should become commonplace that this data gets rendered by a variety of clients, not just browsers. This has already begun of course: an RSS feed reader for example consumes HTML, word processors can read and write HTML, e-mail clients use HTML for rich text. Naturally, most of the time these applications work by embedding a web browser, but it doesn’t need to be the case.

If HTML becomes truly semantic (and if we can ignore the huge majority of existing content that is less than ideally written), you could imagine it being rendered in many different ways. For example, you could collapse it to outlines, you could consume it as a repository, or you could even display it in a completely different, non-CSS-driven rendering engine. The point here is that there is an opportunity to take the decoupling of data and its graphical representation that semantic HTML and CSS promise, and use HTML to its full potential as a data format.
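To make the "collapse it to outlines" idea a little more concrete, here is a minimal sketch in Python that treats HTML purely as data and never involves a browser or CSS. It assumes the document uses h1 to h6 elements semantically, which is precisely the assumption that much existing markup breaks. The names OutlineParser, html_outline and format_outline are made up for this illustration; only html.parser.HTMLParser comes from the standard library.

```python
from html.parser import HTMLParser


class OutlineParser(HTMLParser):
    """Collects (level, text) pairs for every h1..h6 element it sees."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.outline = []    # list of (level, heading text) in document order
        self._level = None   # level of the heading currently open, if any
        self._buffer = []    # text fragments of the heading currently open

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._level = int(tag[1])
            self._buffer = []

    def handle_data(self, data):
        # Only keep text that sits inside a heading, including text
        # nested in inline elements such as <em>.
        if self._level is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADINGS and self._level is not None:
            # Join the fragments and normalize whitespace.
            text = " ".join("".join(self._buffer).split())
            self.outline.append((self._level, text))
            self._level = None


def html_outline(markup):
    """Return the heading outline of an HTML string as (level, text) pairs."""
    parser = OutlineParser()
    parser.feed(markup)
    parser.close()
    return parser.outline


def format_outline(outline):
    """Render the outline as indented plain text, two spaces per level."""
    return "\n".join("  " * (level - 1) + text for level, text in outline)
```

Calling format_outline(html_outline(markup)) on a page gives an indented table of contents. The same parser could just as well feed a non-visual client; how useful the result is depends entirely on how semantic the markup was in the first place.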

I realize these thoughts might seem a little vague. This post really is a call for comments and ideas. Does this make sense or are we in the middle of Obviousland?


  • I totally agree. I've always considered HTML to be content and have been pushing the semantic message for a while, but I don't think it's obvious to most people. I've done some CSS 101 talks at conferences and the semantic message is the most radical to a lot of people.

  • Excel has been able to parse all the tables in HTML at a given URL for many versions. It's one of the best reasons to semantically lay out your HTML.

  • Sure makes sense ... I'm of the opinion that it's still a bit early to talk of pure content and alternatives to CSS until there's more widespread support for the full range of CSS3 selectors (which would allow full styling without actually touching the markup), but it's definitely a worthy ideal to work towards.

  • I think this is a decent observation. It has also been discussed in the "web standards" community for quite a while now, though.

    There's the potential, for example, for assistive technologies such as screen readers to be more useful. Instead, they have had to spend a lot of time and resources trying to understand horrible markup, nested layout tables and the like: resources that could otherwise go into providing extra functionality for their users.

    Microformats are quite interesting too (though the ones I have seen seem limited mostly to the "social web" kind of examples).

    HTML 5 of course offers a lot more native HTML elements that can give more meaning to the content, so hopefully if there is any way HTML 5 progress can be given a boost by all browser vendors then we might get somewhere in achieving what you have observed... :)

  • When stripped of stylistic intent, HTML boils down to a text markup language more akin to something like Wikitext than to an "enriched text" format like the aptly named Rich Text Format, which can include stylistic elements such as font and formatting information.

    The semantic elements of HTML and wikitext are useful outside of formatting, as you alluded to with the use for outlines, etc. The difference is the precision with which you can discern intent: wikitext has a more direct and inherent meaning to its elements, necessarily imposed by common convention, whereas HTML has been (mis)used in so many different ways that it's difficult to generally decipher intent from its semantics.

    Neither HTML, XML, nor JSON is better or worse suited to being consumed semantically; it is common convention and ubiquity that give meaning to the data. In that respect HTML has a big advantage in being used much more widely, but that also means we should not ignore the majority of existing (less than ideal) content.

  • I agree, I think HTML would be a great data format. I recently asked myself a similar question: why couldn't we just use XHTML as the format for our word processors? Surely it encapsulates everything that we can do already, in a much cleaner format than doc or docx?

  • Bertrand,

    I hear what you are saying, but there are still a lot of sites out there that don't use CSS (or use inline CSS... something I personally am guilty of because I need to add something quickly to a site).

    XHTML or HTML that uses a more modern/standards-based approach is bordering on a data format. But even there we are looking at a mess of divs with content that may not make sense. For instance, I've seen people use "float:right" to push the real content closer to the top of a file. I guess you could have a bunch of content, but laying out that content in some other view engine might not be that easy.

    I know you are talking theory, but when I look at XML I know that there are tags that will make sense as to what they represent. With HTML that may or may not be the case (it all depends on the programmer or designer, and if you don't want someone re-using your content then you don't want it making sense)

    Just my two cents...
    Jay Kimble
    -- The Dev Theologian
