Notebook XML Format

General Considerations

Notebook files should always use a text encoding that can handle the full Unicode standard. Specifically, UTF-8 is the preferred encoding because of its broad support.

Entities other than those specified in the XML standard are not permitted. More generally, a DTD or DOCTYPE declaration should not be needed to correctly parse the file. The XML parser that we use does not handle DTD-specific entity references.
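As a rough illustration of the entity rule, the sketch below uses Python's standard xml.etree.ElementTree parser. That choice is an assumption made for the example and is not necessarily the parser the notebook uses. The entities predefined by XML parse without any DTD, while an HTML-style entity such as &nbsp; is rejected.

```python
import xml.etree.ElementTree as ET

# Predefined XML entities need no DTD or DOCTYPE declaration.
ok = ET.fromstring("<input>a &lt; b &amp;&amp; c &gt; d</input>")
print(ok.text)

# An entity that only a DTD could define (here the HTML entity &nbsp;)
# fails to parse.
try:
    ET.fromstring("<input>a&nbsp;b</input>")
    undefined_entity_rejected = False
except ET.ParseError:
    undefined_entity_rejected = True
print(undefined_entity_rejected)
```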

Try not to be a bozo.

Tags

In this section, indentation corresponds, more or less, with the expected nesting of elements. Most of these tags have been implemented in the document generation component (unless they are marked "raise NotImplementedError"), but many have not yet been implemented directly in the GUI.

<notebook version="1">

This is the root node of the document. It contains one (and only one) <head> element, zero or more <ipython-log> elements, and zero or more <sheet> elements. It has one required attribute:

version - an integer which is the version number of the file format in use. We still need to develop a VersioningPolicy.

<head>

This element will contain <meta> elements that provide metadata about the document. In the future, this might be extended to include RDF data to satisfy users with more complicated metadata needs (not to mention buzzword compliance for people writing grant proposals. Hint, hint.).

<meta name="foo" content="bar"/>

The <meta> element contains no content, but has two attributes:

name - The kind of metadata (e.g. "title", "author", "version", etc.)

content - The value
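To make the root, <head>, and <meta> descriptions concrete, here is a minimal sketch that builds such a document with Python's xml.etree.ElementTree and reads the metadata back. The metadata values are made up for the example, and using ElementTree at all is an assumption, not a statement about how the notebook code does it.

```python
import xml.etree.ElementTree as ET

# Build the root element with its required version attribute.
notebook = ET.Element("notebook", version="1")

# One (and only one) <head>, holding empty <meta> elements.
head = ET.SubElement(notebook, "head")
ET.SubElement(head, "meta", name="title", content="Example notebook")
ET.SubElement(head, "meta", name="author", content="A. Author")

# Serialize as UTF-8 bytes, the preferred encoding.
data = ET.tostring(notebook, encoding="utf-8")

# Read the metadata back into a dictionary keyed by the name attribute.
parsed = ET.fromstring(data)
metadata = {m.get("name"): m.get("content") for m in parsed.find("head")}
print(metadata)
```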

<ipython-log id="default-log">

This contains all of the input and output from one run of the interpreter. The value of the id attribute should be unique across the entire document (for all elements, not just <ipython-log> elements).

The <ipython-log> is essentially just a database. It has no inherent presentation.
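The uniqueness requirement on id can be checked mechanically. The following is a small sketch of one way to do that. The helper name duplicate_ids is hypothetical, and the sample document deliberately reuses an id to show a violation.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def duplicate_ids(root):
    """Return the sorted id values that appear on more than one element."""
    counts = Counter(
        el.get("id") for el in root.iter() if el.get("id") is not None
    )
    return sorted(value for value, n in counts.items() if n > 1)

# The <sheet> wrongly reuses the id of the <ipython-log>.
doc = ET.fromstring(
    '<notebook version="1"><head/>'
    '<ipython-log id="default-log"/>'
    '<sheet id="default-log"/>'
    "</notebook>"
)
print(duplicate_ids(doc))
```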

<cell number="1">

This element groups an input with all of the responses given by the interpreter. The number attribute corresponds with the number in the prompt, e.g. "In [1]:". The only required child element is <input>.

<input>

The text of this element is what the user typed in.

<transformed> (raise NotImplementedError)

When an input contains IPython special features like %magics, the interpreter transforms it into pure Python code. This element records the result of this transformation. When exporting an <ipython-log> to a regular Python file, the content of this element will be used in place of the cell's <input>.
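Since <transformed> is not implemented yet, the following is only a sketch of how the export rule might look: each cell contributes its <transformed> text when present and its <input> text otherwise. The function name export_source, the magic name, and the transformed code in the sample log are all hypothetical.

```python
import xml.etree.ElementTree as ET

LOG = """\
<ipython-log id="default-log">
  <cell number="1">
    <input>x = 2</input>
  </cell>
  <cell number="2">
    <input>%hypothetical_magic x</input>
    <transformed>run_hypothetical_magic("x")</transformed>
  </cell>
</ipython-log>
"""

def export_source(log):
    """Join the exportable source of every cell, in document order."""
    lines = []
    for cell in log.findall("cell"):
        transformed = cell.find("transformed")
        # Prefer <transformed> when present; otherwise use the raw <input>.
        source = transformed if transformed is not None else cell.find("input")
        lines.append(source.text)
    return "\n".join(lines)

print(export_source(ET.fromstring(LOG)))
```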

<output>

The text of this element corresponds to what is displayed after the "Out[1]:" prompt in regular IPython.

<stdout>

The text of this element is whatever got printed to stdout when the input was executed.

<stderr>

The text of this element is whatever got printed to stderr when the input was executed.

<traceback>

This contains traceback information if an exception was raised. Currently, this is just text, but eventually we might want to consider providing more structured information.
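Putting the <cell> children together, a reader of the log might collect whichever components are present into a dictionary. This is a sketch only. The helper cell_components and the sample cell are invented for illustration.

```python
import xml.etree.ElementTree as ET

CELL = """\
<cell number="1">
  <input>print("hi"); 1 + 1</input>
  <output>2</output>
  <stdout>hi
</stdout>
</cell>
"""

# Child elements a <cell> may carry; only <input> is required.
COMPONENTS = ("input", "transformed", "output", "stdout", "stderr", "traceback")

def cell_components(cell):
    """Map each component present in the cell to its text."""
    return {
        name: cell.findtext(name)
        for name in COMPONENTS
        if cell.find(name) is not None
    }

cell = ET.fromstring(CELL)
print(cell.get("number"), cell_components(cell))
```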

<sheet id="default-sheet">

Sheets store the visual representation of the notebook. They are most directly related to what the user sees and edits on screen in the GUI. As with <ipython-log>, the id attribute should be unique across the entire document (raise NotImplementedError, currently).

There are two types of child elements of <sheet>: ones that we stole from DocBook, and ones that are specific to the IPython notebook. The latter elements have tag names prefixed by ipython- and are converted to DocBook elements when exporting the sheet. WhyNotNamespaces?

<ipython-block logid="default-log">

This contains a list of elements which reference specific cells in the log specified by logid.

<ipython-cell type="input" number="1"/>

In a horrible abuse of terminology, this references a particular component of a cell with the given number.

RTK: This needs to be redesigned a bit, I think. To display a single cell with, say <input>, <output>, and <stderr>, we currently use three <ipython-cell> elements explicitly. I think we should be able to do that; I for one want to be able to hide over-lengthy outputs, or simply display a particular output somewhere (without any of the other parts of the cell). However, the default should be to display all of the components of the cell with a single element. I think we can do that by making the type attribute optional. If it's not present, the whole cell gets displayed with the components in a predefined order. If it is present, then only the specified component gets displayed. Comments?
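To show how the references fit together under the current design (with type always given explicitly), here is a sketch that resolves each <ipython-cell> in a block against the log named by logid. The function resolve_block is hypothetical and does no error handling for missing logs or cells.

```python
import xml.etree.ElementTree as ET

DOC = """\
<notebook version="1">
  <head/>
  <ipython-log id="default-log">
    <cell number="1">
      <input>1 + 1</input>
      <output>2</output>
    </cell>
  </ipython-log>
  <sheet id="default-sheet">
    <ipython-block logid="default-log">
      <ipython-cell type="input" number="1"/>
      <ipython-cell type="output" number="1"/>
    </ipython-block>
  </sheet>
</notebook>
"""

def resolve_block(root, block):
    """Return (type, number, text) for each <ipython-cell> in the block."""
    # Find the log named by logid, then index its cells by number.
    log = next(
        l for l in root.iter("ipython-log") if l.get("id") == block.get("logid")
    )
    cells = {c.get("number"): c for c in log.findall("cell")}
    resolved = []
    for ref in block.findall("ipython-cell"):
        kind, number = ref.get("type"), ref.get("number")
        resolved.append((kind, number, cells[number].findtext(kind)))
    return resolved

root = ET.fromstring(DOC)
block = root.find("sheet/ipython-block")
print(resolve_block(root, block))
```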

<ipython-figure number="1" type="png" filename="foo.png" caption="This is my fantastic plot."/>

This places a possibly captioned figure into the sheet. The semantics and content of this element are not settled, yet. We may split this into two different elements, one which just refers to a static image, and one which grabs an image somehow (think matplotlib) at a certain point in the executed code. Or we'll use plain DocBook for the former and reserve <ipython-figure> for the latter. Or perhaps the latter will become another component of the <cell>s in <ipython-log>. I don't know. Comments?

The following elements are taken from DocBook. Although they are more or less fully supported by the document generation component, they are less supported by the GUI. We hope to extend the GUI to support all of the following elements, though not all of DocBook itself.

<para>

This is the basic paragraph element. It contains text. It may also contain other elements like <emphasis> in a mixed-content fashion (e.g. <para>This is <emphasis>mixed</emphasis> content</para>).

<emphasis role="strong">

The text should be emphasized. If the role attribute is not provided, then the text will usually be italicized. To make the text bold, set role="strong".

<ipython-equation verb="0">

A display equation on its own line. The content of this tag is the formula in LaTeX format. If verb is not "1" or not given at all, then the formula will be surrounded by \begin{equation}\end{equation} commands. If verb is "1", however, the formula will pass straight through. This is useful when the formula must use some LaTeX environment other than equation, like eqnarray.

As of this writing, converting LaTeX equations to PNG graphics when exporting the sheet as HTML is handled when matplotlib and dvipng are installed.

<ipython-inlineequation verb="0">

An inline equation. The content of this tag is the formula in LaTeX format. If verb is not "1" or not given at all, then the formula will be surrounded by the math-mode $ delimiters in the LaTeX file. If verb is "1", however, the formula will pass straight through. This option is more useful in <ipython-equation>, but is included here for completeness. I'm sure someone will find a use for it.
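The verb rule for both equation elements can be sketched as follows. This is an illustration of the rule as described, not the actual export code (which goes through DocBook). The function name equation_to_latex is hypothetical, and the exact whitespace the real exporter emits around the formula is not specified here.

```python
import xml.etree.ElementTree as ET

def equation_to_latex(element):
    """Render an <ipython-equation> or <ipython-inlineequation> as LaTeX."""
    formula = element.text or ""
    # verb="1" passes the formula straight through.
    if element.get("verb") == "1":
        return formula
    if element.tag == "ipython-equation":
        return "\\begin{equation}" + formula + "\\end{equation}"
    if element.tag == "ipython-inlineequation":
        return "$" + formula + "$"
    raise ValueError("not an equation element: " + element.tag)

display = ET.fromstring('<ipython-equation verb="0">E = mc^2</ipython-equation>')
inline = ET.fromstring("<ipython-inlineequation>x^2</ipython-inlineequation>")
raw = ET.fromstring(
    '<ipython-equation verb="1">'
    "\\begin{eqnarray}a &amp;=&amp; b\\end{eqnarray}"
    "</ipython-equation>"
)
for el in (display, inline, raw):
    print(equation_to_latex(el))
```

Note that the & characters used by eqnarray must be written as &amp; in the XML file, in keeping with the entity rule under General Considerations.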

<code>

A code fragment. Usually this would be used to change the font on a single word to show that it is a Python variable or name of a module or similar. E.g. "the scipy.stats module...."

<section>

This is a so-called structural element. It has no text content itself, but it can contain so-called "block" elements like <para> and <title> or nested <section> elements. Exactly what kind of section this element represents is determined by its context. If the <sheet> were to correspond to a journal article, then in LaTeX terminology, the top-level <section> elements would correspond to "\section", and the <section> elements below them would correspond to "\subsection", etc.
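The way context determines the kind of section can be sketched by walking the nesting and picking a LaTeX command by depth. The helper outline and the sample sheet are hypothetical, and the behavior for nesting deeper than \subsubsection is an assumption made for the example.

```python
import xml.etree.ElementTree as ET

SHEET = """\
<sheet id="default-sheet">
  <title>Article</title>
  <section>
    <title>Methods</title>
    <para>Text.</para>
    <section>
      <title>Sampling</title>
    </section>
  </section>
  <section>
    <title>Results</title>
  </section>
</sheet>
"""

LEVELS = ("\\section", "\\subsection", "\\subsubsection")

def outline(parent, depth=0):
    """Yield one LaTeX sectioning command per nested <section>."""
    for section in parent.findall("section"):
        # Deeper nesting falls back to the last command (an assumption).
        command = LEVELS[min(depth, len(LEVELS) - 1)]
        yield command + "{" + section.findtext("title", default="") + "}"
        yield from outline(section, depth + 1)

print(list(outline(ET.fromstring(SHEET))))
```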

<title>

This element displays a title for its parent structural element (either a <section> or the <sheet> itself).

<bibliography> (raise NotImplementedError)

A bibliography. It would have all of the child elements as in DocBook, but I won't list them here.

<latex> (raise NotImplementedError)

This element passes its content directly through when we are generating LaTeX. It is not an official DocBook element, but an extension provided by the DB2LaTeX XSLT files that we use to implement the LaTeX backend. Using this element might yield an invalid LaTeX file, of course. We'll also have to figure out how to handle it in the generated HTML, but that shouldn't be too hard. Probably we'll pass the content through as plain text.

<xref>, <link>, etc. (raise NotImplementedError)

The following is complete speculation. There are several kinds of linking that notebooks really need:

Intra-sheet hyperlinks: such as to a specific "In [NN]:" prompt. The generated DocBook includes anchor tags for these, so <link> elements that point to them should already work.

Inter-sheet links: we'll have to figure out some kind of URI scheme to work this out.

External links to a URI: in the GUI, these would probably be passed to a web browser.

Cross-references: being able to talk about "Figure NN" and have the label NN update everywhere in the document seems pretty important to me. We also need to be able to make references in the text to "In [NN]" and "Out[NN]" and have the reference track the number if the cell gets reexecuted.

Substitution links: we might want to split up our work into different sheets but also include them wholly into a super-sheet that just contains structure and links to the various sub-sheets which get replaced by the sheets themselves. For example, to do the Scipy Tutorial as a notebook, I would probably make each chapter a separate <sheet>. To export the document as a whole PDF, I would just make a super-sheet that links by substitution to each of the sheets that have actual content.