Snippable: a human-writable multipart document format

More and more projects manage their documentation as a set of Markdown files in a repository. Sites such as GitHub make that easy and convenient by providing a web interface for viewing and editing Markdown files, creating and integrating pull requests, viewing changesets, etc. This approach seems to provide all the advantages of a wiki, and more, while using a standard and easy set of tools.

Documents are not just Markdown, however: there should almost always be some associated metadata such as a title (often extracted from the file name or the first heading of the document, but that’s suboptimal and unreliable), tags, authors, etc. That additional data could be stored in a database, but that’s hard to manage, doesn’t version along with the documents, and creates too much potential for orphaned files or records. Another solution is companion files, but if you’re going to use the file system, why have two files when you could have only one?

I think the best solution to that problem is a multipart document format that allows for a structured metadata section, followed by a rich text body. This is similar to image formats such as JPEG, which allow for embedding EXIF metadata.

Markdown itself has been hugely successful by providing a rich text format that can not only be expressed as plain text, but can also be read and written by non-technical human beings. There are other examples of successful plain text human-readable formats, such as YAML and modern diff and patch formats. The multipart format that I need should also be plain text, and easy to author for non-technical users.

The second requirement for this multipart format is that it should be minimalist: it should only deal with assembling multiple documents into one, but should get out of the way as far as the actual document parts are concerned.

I’ve opted for a very simple and fun separator format to delimit the different parts of a document, taking a cue from Markdown to use something that immediately makes sense to an untrained user:

-8<--------------

The separator is simply the emoticon for a pair of scissors cutting through a dotted line. The number of dash characters on the right of the scissors can be anything above 2. The separator has to be on its own line.
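
To make the rule concrete, here is a minimal sketch of how a bare separator line could be recognized. It is written in Python for illustration and is not taken from any existing implementation. It assumes that "anything above 2" means at least three dashes after the scissors, and that trailing whitespace on the line is acceptable; the name is_plain_separator is hypothetical.

```python
import re

# Assumption: one leading dash, the "8<" scissors, then at least three dashes.
# Trailing whitespace is tolerated; anything else on the line disqualifies it.
PLAIN_SEPARATOR = re.compile(r"^-8<-{3,}\s*$")


def is_plain_separator(line: str) -> bool:
    """Return True if the line is a bare scissors separator on its own line."""
    return PLAIN_SEPARATOR.match(line) is not None
```

With this sketch, the separator shown above is accepted, while a line that has other text around the scissors is not.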

With this we’re halfway through. The second part of the puzzle is a way to specify the format of the parts.

First, there have to be good defaults. I’ve chosen a YAML header followed by a Markdown body, because that corresponds to the main scenario, and because those are the best and most successful structured and rich text formats that are also human-writable. That’s what you get if part formats are not otherwise explicitly specified.

The first and preferred way that you can explicitly specify part formats is through file extensions. File extensions are already widely used and understood: foo.md is a Markdown file, bar.json is a JSON file, baz.yaml is a YAML file, so it would only make sense that foo.yaml.md would be a multipart file with a YAML part followed by a Markdown part.
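
As an illustration of the extension convention, the following Python sketch derives the list of part formats from a file name. The function name part_formats_from_name is hypothetical, and the sketch assumes that every dot-separated suffix names a part format, so a file name with extra dots in its base name would be misread.

```python
from pathlib import PurePath
from typing import List


def part_formats_from_name(filename: str) -> List[str]:
    """Return the part formats implied by a file name, in document order.

    Assumption: each suffix names one part, so "foo.yaml.md" means a YAML
    part followed by a Markdown part.
    """
    return [suffix.lstrip(".").lower() for suffix in PurePath(filename).suffixes]
```

For foo.yaml.md this gives the two formats yaml and md, in that order. For foo.md it gives md alone.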

If you can’t use the file extension for some reason, you can instead embed the file format into the separator. You can specify the format of the part before the break, after the break, or both:

-8<--^-yaml--------
-8<--v-md----------
-8<--^-yaml--v-md--
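
The three separator forms above can be read with a single regular expression. The Python sketch below is one possible reading of the syntax, not a reference implementation. It assumes that format names are made of letters and digits only, and it does not enforce a minimum number of dashes. The name parse_separator is hypothetical.

```python
import re
from typing import Optional, Tuple

# "^-name" annotates the part before the break, "v-name" the part after it.
# Assumption: format names contain only ASCII letters and digits.
ANNOTATED_SEPARATOR = re.compile(
    r"^-8<-+"
    r"(?:\^-(?P<before>[A-Za-z0-9]+)-+)?"
    r"(?:v-(?P<after>[A-Za-z0-9]+)-+)?"
    r"\s*$"
)


def parse_separator(line: str) -> Optional[Tuple[Optional[str], Optional[str]]]:
    """Return (format_before, format_after), or None if the line is not a separator.

    Either format is None when the separator does not specify it.
    """
    match = ANNOTATED_SEPARATOR.match(line)
    if match is None:
        return None
    return match.group("before"), match.group("after")
```

A separator without annotations yields a pair of None values, which a caller could take as the signal to fall back to the file extension or to the defaults.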

Here’s an example of a typical snippable document.yaml.md multipart document:

Title: A simple snippable document
Author: Bertrand Le Roy
Tags: snippable, yaml, markdown, multipart

-8<-----------------------------------------------------

A Snippable Document
====================

This is what a snippable document looks like.
This document has two parts:

* a YAML header
* this Markdown document

It should not look too terrible to a regular Markdown parser
and can be parsed to extract the header.
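
To show how little work the extraction takes, here is a minimal Python sketch that splits a document like the example above into its parts and labels them with the default formats. It is an illustration under stated assumptions rather than the actual implementation: it handles bare separators only, requires at least three dashes after the scissors, and leaves the parsing of each part to a YAML or Markdown library. The names split_parts and DEFAULT_FORMATS are hypothetical.

```python
import re
from typing import List, Optional, Tuple

# Assumption: bare separators only, with at least three dashes after "8<".
SEPARATOR = re.compile(r"^-8<-{3,}\s*$")
DEFAULT_FORMATS = ["yaml", "md"]  # YAML header followed by a Markdown body


def split_parts(text: str, formats: Optional[List[str]] = None) -> List[Tuple[str, str]]:
    """Split a document on separator lines and pair each part with a format."""
    formats = formats or DEFAULT_FORMATS
    parts: List[str] = []
    current: List[str] = []
    for line in text.splitlines():
        if SEPARATOR.match(line):
            # A separator closes the current part and starts a new one.
            parts.append("\n".join(current).strip("\n"))
            current = []
        else:
            current.append(line)
    parts.append("\n".join(current).strip("\n"))
    if len(parts) != len(formats):
        raise ValueError(f"expected {len(formats)} parts, found {len(parts)}")
    return list(zip(formats, parts))
```

Note that the Markdown heading underline made of equals signs is not mistaken for a separator, because the pattern requires the scissors.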

Currently, there’s a JavaScript implementation that can be found on GitHub:
https://github.com/bleroy/snippable.js

If you find this useful, and want to use the format yourself, please do: it’s under the MIT license. If you create an implementation in another language, please let me know, and I’ll point to it.

6 Comments

  • Cool idea! But I prefer the Front Matter (http://jekyllrb.com/docs/frontmatter/) syntax used by Jekyll, which separates a header part (with metadata) written in YAML followed by the body. The body can be any content, which is determined by the file extension.
    Also I think it's more efficient to parse, because you don't have to scan the document for the scissor pattern.

  • Well, you have to scan the document for the triple dash pattern, so that's pretty much the same, no?

  • Other problems I see with Front Matter are:

    * their separator can be confused with a Markdown header: a standard Markdown parser will interpret the last line of YAML as a header, resulting in a funny-looking document, when viewing the file on GitHub for example.
    * It pretends to be a format but is different from what it pretends to be: putting YAML in front of an HTML document does not give a valid HTML document.

    But thanks for the pointer.

  • Not entirely, because you have to scan only for the first three characters in a Stream and decide if it's a Front Matter file or a "normal" Markdown/HTML/CSS/... file.
    I second your other concerns, although GitHub supports Front Matter (which is no wonder, because Jekyll is used for GitHub pages): https://github.com/jekyll/jekyll/blob/gh-pages/_posts/2015-01-20-jekyll-meet-and-greet.markdown

  • And I don't even have to open the file in order to determine that it's multipart: I can just look at the double extension. I was talking about the second set of dashes: you still have to parse for that. Parsing such simple separators is negligible anyway when compared to the parsing of the YAML and the Markdown.

  • Oh, my bad! I've overlooked the part with the double extensions completely.
