(see also: http://scipy.org/StatisticalDataStructures )
Proposal for NumPy ndarray with named axes
Background
Fernando Perez devised a working prototype of a proposed NumPy enhancement, an ndarray with named axes, tentatively named DataArray. The prototype code and documentation are at: http://github.com/fperez/datarray
The prototype was implemented by Fernando, Mike Trumpis, Jonathan Taylor (working on Fernando's laptop so not in git logs), Matthew Brett, Kilian Koepsell and Stefan van der Walt.
At SciPy 2010 on July 1, Fernando convened a BOF (Birds of a Feather) discussion of DataArray. The several dozen (?) attendees generally agreed on the need for named axes; similar features have already been independently implemented in
- pandas: http://code.google.com/p/pandas
- larry: http://github.com/kwgoodman/la
- MetaArray: http://www.scipy.org/Cookbook/MetaArray
- exppsy (reference at http://mail.scipy.org/pipermail/nipy-devel/2009-July/001738.html)
- and others.
Some discussion on this in the context of the very sophisticated pytables is found here: http://www.mail-archive.com/pytables-users@lists.sourceforge.net/msg01384.html.
A discussion of needs has also been started here with a comparison of pandas and larry: http://scipy.org/StatisticalDataStructures.
Resolved or clarified
- The original prototype indexes directly on "tick values" when they exist: instance["Chicago"]. This could cause confusion and add overhead. So instead, such indexing will only be supported on a particular attribute of a DataArray (perhaps named "name" or "byname" or "tick" or "bytick" or "key" or "bykey"): instance.bykey["Chicago"].
- Axis labels (the name of a dimension) must be valid Python identifiers.
- Tick values can be any hashable object. (So integers and integer strings are fine.)
- Ranges of tick values are supported in slices of DataArrays.
- Pandas would probably be able to build on DataArrays as specified (and Wes wants it to).
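The by-tick lookup described above can be pictured with a small sketch. This is not the prototype's code: the class name TickIndexer and its constructor arguments are hypothetical, and the attribute name (bykey, bytick, ...) was still undecided. It only illustrates how hashable, unique tick values might be translated into integer positions before ordinary ndarray indexing.

```python
import numpy as np


class TickIndexer:
    """Hypothetical helper: index one axis of an array by tick value."""

    def __init__(self, data, ticks, axis=0):
        self._data = np.asarray(data)
        self._axis = axis
        # Ticks only need to be hashable, so a dict can map tick -> position.
        self._positions = {tick: i for i, tick in enumerate(ticks)}
        if len(self._positions) != len(ticks):
            raise ValueError("tick values must be unique")

    def __getitem__(self, tick):
        # Translate the tick into an integer position, then index normally.
        return np.take(self._data, self._positions[tick], axis=self._axis)


temps = np.array([[10.0, 12.0], [20.0, 22.0]])
bykey = TickIndexer(temps, ["Chicago", "Boston"], axis=0)
chicago = bykey["Chicago"]  # selects the same row as temps[0]
```

Keeping the lookup on a separate object leaves plain instance[...] indexing with exactly the ndarray semantics, which is the point of the resolution above.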
Alignment
- When two DataArrays are combined in an arithmetic operation, their tick values *should* match along every named axis, or the result will generally be undefined. pandas automagically ensures this alignment. DataArrays will not, at least not now.
- There was some uncertainty about whether DataArrays should always check that such alignment is correct. The advantage is greater robustness of client code. The disadvantage is greater forced overhead and implementation complexity. A possible middle ground would be to provide methods to explicitly check the alignment of all elements of a set of DataArrays, only when deemed necessary by the client programmer.
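The "middle ground" of an explicit, opt-in alignment check might look roughly like the following. The function name ticks_aligned and the (label, ticks) metadata layout are assumptions made for illustration; nothing here is part of the prototype.

```python
def ticks_aligned(*metadata):
    """Return True if all arrays share the same labels and ticks per axis.

    Each argument is a hypothetical per-array metadata sequence of
    (label, ticks) pairs, one pair per axis, in axis order.
    """
    if not metadata:
        return True
    reference = [(label, list(ticks)) for label, ticks in metadata[0]]
    return all(
        [(label, list(ticks)) for label, ticks in other] == reference
        for other in metadata[1:]
    )


a_meta = (("city", ["Chicago", "Boston"]), ("year", [2009, 2010]))
b_meta = (("city", ["Chicago", "Boston"]), ("year", [2009, 2010]))
c_meta = (("city", ["Boston", "Chicago"]), ("year", [2009, 2010]))

same = ticks_aligned(a_meta, b_meta)       # ticks match on every axis
different = ticks_aligned(a_meta, c_meta)  # city ticks are in another order
```

Client code would call such a check only before operations where misalignment matters, so arithmetic itself pays no extra cost.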
Question
Is "ticks" the best/clearest name for the values associated with an axis? (i.e. the values which can be used in lieu of indices 0..shape_element). What about "key values"? Or indices, even.
Concerns regarding the mental model
Are people going to have problems grokking the axes/ticks paradigm? Most people end up working with tabular data, which at best should actually be represented in two dimensions, and at worst shoehorns n dimensions into 2. Many tabular implementations either depend on this view (pandas), or alternately merely work best with it (larry).
To be addressed by DataArrays
- Labeling of dimensions (rows, columns, etc)
- Labeling of indices (time, temperature, interest rate, etc)
- Use of labels in regular indexing
- Use of labels in fancy indexing?
- (probably) a basic ASCII representation so people can understand what's going on
- The ability to "pivot" data from a tick ("column") with associated values into an axis with associated ticks (and back?). Note comment on not implementing by-tick grouping.
Not to be addressed by DataArrays
- Support for non-homogeneous data (where different tick/key values index to different data types) (Eric)
- Automagical data alignment is the crucial feature needed. While not addressed by DataArrays, hopefully auto alignment can be built on top of them. (Wes)
- Non-unique (repeated) tick/key values to permit grouping. Intrinsically impossible with DataArrays unless another dimension (sequence or equivalent) is added. (Shepherd)
- Concern of Perry Greenfield? <--????
- Fancy IO (csv, excel, html, LaTeX)
Help needed
Fernando does not have the resources to drive the project beyond this prototype, which already does what he needs. If this is to go anywhere, it needs people to do the work. Please step forward.
API discussion
This is meant to record all discussed possibilities on the syntax and semantics, along with the rationale behind them.
Some other discussions exist at http://github.com/kwgoodman/datarrayQ and http://thread.gmane.org/gmane.comp.python.numeric.general/38908/focus=38929.
Concept naming
- Axis: a container for the metadata of a single axis/dimension of a DataArray
- label: a unique name for this Axis in Axes
- name: returns the label of Axis if not None, otherwise returns "_0" for the 1st axis, "_1" for the 2nd, etc.
- tick: an arbitrary python object that Axis is able to translate into an index inside an ndarray
- index: an arbitrary python object that Axis is able to translate into an index inside an ndarray
- Axes: an ordered set of the Axis objects for a given DataArray
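As a rough illustration of the naming rules above, the sketch below models an Axis whose name falls back to "_0", "_1", etc. when no label is given. The constructor signature, the position attribute and the index_of method are hypothetical; the identifier check reflects the requirement that axis labels be valid Python identifiers.

```python
import keyword


class Axis:
    """Hypothetical sketch of the metadata container for one dimension."""

    def __init__(self, label, position, ticks=None):
        if label is not None:
            # Labels must be usable as Python attribute names.
            if not label.isidentifier() or keyword.iskeyword(label):
                raise ValueError("invalid axis label: %r" % (label,))
        self.label = label
        self.position = position  # index of this axis within its Axes
        self.ticks = None if ticks is None else list(ticks)

    @property
    def name(self):
        # Fall back to "_<position>" for unlabeled axes.
        if self.label is not None:
            return self.label
        return "_%d" % self.position

    def index_of(self, tick):
        """Translate a tick into an integer index along this axis."""
        if self.ticks is None:
            raise KeyError(tick)
        try:
            return self.ticks.index(tick)
        except ValueError:
            raise KeyError(tick) from None


rows = Axis("row", 0, ticks=["A", "B"])
unnamed = Axis(None, 1)
```

A linear search in index_of keeps the sketch short; a real implementation would more likely cache a tick-to-index mapping.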
Construction
Empty
Create an object without supplying any metadata:
DataArray([...])
This object will still carry (default) metadata, but that metadata is of no real use unless it is filled in later.
Metadata as arguments
There are various proposed syntaxes for creating a DataArray.
All-in-one sequence
Fernando
DataArray(contents, metadata)
metadata -> sequence(axis)
axis -> tuple(label, ticks)
label -> str | None
ticks -> sequence(tick)
tick -> object
Examples:
DataArray([[1, 2], [3, 4]], (('row', ['A','B']), ('col', ['C', 'D'])))
DataArray([[1, 2], [3, 4]], ((None, ['A','B']), (None, ['C', 'D'])))
Separate sequences
Keith Goodman: I think this would make it easier for new users to construct a DataArray with ticks just from looking at the function signature. It would match the function signature of Axis. My use case is to use ticks only and not named axes (at first).
DataArray(contents, labels=labels, ticks=all_ticks)
labels -> sequence(label) | None
label -> str | None
all_ticks -> sequence(ticks)
ticks -> sequence(tick)
tick -> object
Examples:
DataArray([[1, 2], [3, 4]], labels=('row', 'col'), ticks=[['A', 'B'], ['C', 'D']])
DataArray([[1, 2], [3, 4]], ticks=[['A', 'B'], ['C', 'D']])
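The two proposed signatures carry the same information, which the following sketch tries to make concrete by normalizing the "separate sequences" form into the "all-in-one" sequence of (label, ticks) pairs. The helper name to_axis_pairs and its ndim argument are hypothetical, and allowing ticks to be omitted entirely is an assumption that goes beyond the grammars above.

```python
def to_axis_pairs(ndim, labels=None, ticks=None):
    """Convert separate labels/ticks sequences into (label, ticks) pairs."""
    # A missing sequence means "nothing specified for any axis".
    labels = (None,) * ndim if labels is None else tuple(labels)
    ticks = (None,) * ndim if ticks is None else tuple(ticks)
    if len(labels) != ndim or len(ticks) != ndim:
        raise ValueError("need exactly one label and one tick list per axis")
    return tuple(zip(labels, ticks))


full = to_axis_pairs(2, labels=("row", "col"), ticks=[["A", "B"], ["C", "D"]])
ticks_only = to_axis_pairs(2, ticks=[["A", "B"], ["C", "D"]])
```

With a helper of this kind a constructor could accept either form and store a single internal representation.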
DataArray methods and attributes
- __getitem__ / __setitem__: access the underlying ndarray with whatever index support it has
- axes: returns a tuple with all the Axis objects
- axis: proxy to access the Axis objects
- axis.whatever: returns the Axis whose name is whatever; otherwise raises an exception
- named: proxy to indexing all available Axis (i.e., comma-separated elements for each Axis)
Lluís: I'd rather simplify it to a single axis attribute that can retrieve Axis objects by name, but can also be iterated over, sliced, etc. That is, use __getitem__. Note that this would not allow integers as Axis labels; otherwise they would have to be stringified to be accessed as labels rather than as indices.
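One way to read the single-attribute suggestion is sketched below: a proxy that resolves axis names both as attributes and through __getitem__, while integer keys and slices keep their positional meaning. The names AxisInfo and AxisProxy are hypothetical and this is not the prototype's implementation.

```python
from collections import namedtuple

# Hypothetical stand-in for an Axis object: just a name and its ticks.
AxisInfo = namedtuple("AxisInfo", "name ticks")


class AxisProxy:
    """Hypothetical proxy: look up axes by name, position or slice."""

    def __init__(self, axes):
        self._axes = tuple(axes)

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails.
        if name == "_axes":
            raise AttributeError(name)  # avoid recursion before __init__ runs
        for axis in self._axes:
            if axis.name == name:
                return axis
        raise AttributeError("no axis named %r" % (name,))

    def __getitem__(self, key):
        # Strings are treated as names; integers and slices as positions.
        if isinstance(key, str):
            return getattr(self, key)
        return self._axes[key]

    def __len__(self):
        return len(self._axes)


proxy = AxisProxy([
    AxisInfo("country", ["Netherlands", "Uruguay"]),
    AxisInfo("year", [1994, 1998, 2002]),
])
by_attr = proxy.country
by_name = proxy["year"]
by_position = proxy[1]
```

Because integer keys are always interpreted as positions here, an integer could never act as an axis label, which is the limitation noted in the comment above.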
Axis methods and attributes
Feed me.
Indexing through metadata
There is ongoing discussion about what the overall indexing syntax should be.
The default option -- works in any case:
arr.axis.country.named['Netherlands'].axis.year[-1]
The "stuple" option:
arr[ arr.axis.country.named['Netherlands'].year[-1] ]
The "magical" option:
arr.country.named['Netherlands'].year[-1]
The "semi-magical" option:
arr.country_named['Netherlands'].year_index[-1]
