Version 28 (modified by pierregm, 6 years ago)

Removed the reference to the old package

For a high-level overview of NaN and masked array support, see also the SciPy FAQ.

MaskedArray is a numpy ndarray look-alike that allows one to keep track of missing values.

MaskedArray is implemented in the numpy.core.ma module (source: trunk/numpy/core/ma.py). The numpy ma module was originally written for Numeric by Paul Dubois and adapted for numpy by Travis Oliphant and (mainly) Paul Dubois.


Migration Guide

  1. a.mask and getmask(a) now return the ma.nomask constant, which is defined as the array boolean scalar False_.

If your code relies on an explicit "m is None" check, it should be changed to "m is nomask". In many cases this check is now redundant because nomask provides the full array interface. For example, "m is None or not sometrue(m)" can now be written as "not m.any()".
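
The migrated check can be sketched as follows. This assumes a current NumPy, where the module is importable as numpy.ma (this page refers to it as numpy.core.ma); the helper name has_masked_values is hypothetical and only serves the illustration.

```python
import numpy.ma as ma

def has_masked_values(a):
    """Return True if at least one element of `a` is masked."""
    m = ma.getmask(a)
    # No `m is None` test is needed: nomask is itself a boolean scalar
    # exposing the array interface, so .any() works in both cases.
    return bool(m.any())

unmasked = ma.array([1, 2, 3])
partly_masked = ma.array([1, 2, 3], mask=[0, 1, 0])

# An array built without a mask reports the nomask constant.
no_mask_is_nomask = ma.getmask(unmasked) is ma.nomask
```

The same function handles both arrays, which is the point of giving nomask the array interface.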

Missing features (work in progress)

Some current features of numpy are not yet implemented for ma, either because they were introduced to numpy only recently (e.g. ndim?), or because they were never adapted to ma in the first place (e.g. the mlab package). As Paul Dubois noted, it does not make sense to extend the handling of missing values to all numpy features (a typical example would be the FFT package). However, ma is still invaluable in many cases, and it's unfortunate that its use is currently somewhat limited.

A non-exhaustive list of missing features is presented below. The features are organized by potential problems and naive suggestions to solve them. More features will be added as I run into them.

Case 1

The function would work with masked arrays if it called ma.asarray instead of numeric.asarray, which it currently uses. A fix could be to add a mask=False_ property by default to any ndarray, and get rid of the MaskedArray class? A second possibility would be to check in numeric.asarray whether the argument is already a (masked) array.

  • diff
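
The difference between the two conversions can be seen in a minimal sketch, assuming a current NumPy with the module available as numpy.ma (np.asarray stands in here for the numeric.asarray mentioned above):

```python
import numpy as np
import numpy.ma as ma

a = ma.array([1, 2, 3], mask=[0, 1, 0])

plain = np.asarray(a)   # converts to a base ndarray: the mask is lost
kept = ma.asarray(a)    # stays a MaskedArray: the mask survives
```

Any function that starts with the first conversion silently treats the masked entry as valid data.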

Case 2

The function can be applied to the data part alone once missing values are adequately filled. If needed, the masked version is easily obtained by applying the initial mask to the result. A use_missing option could be introduced to either propagate missing values (the output would be masked) or discard them (default option?).

  • ndim: The masked array could inherit ndim from its data part. Implemented in changeset:2185.
  • std, var: An example of implementation of the function (not the method) std is given there. A suggestion for the method implementation is presented in this patch.
  • trace: Fill with 0 if use_missing is False. Please check the attached patch. Implemented in r2267.
  • cumprod, cumsum:
    • use_missing=True: The output is masked for indices [i...N], where i is the index of the first missing value, and N the number of data points (including missing ones).
    • use_missing=False: Fill the initial missing values with 1 for cumprod or 0 for cumsum. A simple implementation of cumprod and cumsum, without the questionable use_missing flag, is suggested in this patch.
  • clip: Please check the attached patch. Implemented in r2267.
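
The fill-then-remask idea of Case 2 can be sketched as below. This is not the code of the patches mentioned above; the helper name apply_on_filled is hypothetical, the module is assumed to be importable as numpy.ma, and the sketch only suits functions whose output has the same shape as their input.

```python
import numpy as np
import numpy.ma as ma

def apply_on_filled(func, a, fill):
    """Apply `func` to the data of `a` with masked entries replaced by
    `fill`, then put the original mask back on the result."""
    result = func(ma.filled(a, fill))
    return ma.array(result, mask=ma.getmaskarray(a))

a = ma.array([1, 2, 3, 4], mask=[0, 1, 0, 0])
sums = apply_on_filled(np.cumsum, a, 0)    # 0 is neutral for addition
prods = apply_on_filled(np.cumprod, a, 1)  # 1 is neutral for multiplication
```

Choosing the neutral element of the operation as fill value is what makes the masked entries invisible to the computation.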

Case 3

The function must be applied to both the data part and the mask. I assume this is the case for most of the functions in shape_base and index_tricks. As an illustration, a quick and dirty adaptation of the concatenator r_ could be:

mar_ = lambda seq:ma.array(data=[s.data for s in seq],mask=[s.mask for s in seq])
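
A slightly more explicit variant, as a sketch for 1-D inputs only: it concatenates data and masks separately and also accepts arrays that carry no mask. It assumes a current NumPy with the module available as numpy.ma; the function body is an illustration, not the r_ implementation.

```python
import numpy as np
import numpy.ma as ma

def mar_(seq):
    """Concatenate 1-D (masked) arrays, carrying the masks along."""
    data = np.concatenate([ma.getdata(s) for s in seq])
    # getmaskarray returns a full boolean array even when s has no mask.
    mask = np.concatenate([ma.getmaskarray(s) for s in seq])
    return ma.array(data, mask=mask)

joined = mar_([ma.array([1, 2], mask=[0, 1]), ma.array([3, 4])])
```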

Case 4

The trickiest case, where missing values must remain masked during the process.

  • median: The two functions in this attachment (seem to) work well for 1- and 2D arrays. The problem gets more complex for higher dimensions.
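
For the 1-D case, one possible approach is to drop the masked values before computing the median. The sketch below is not the attached code; median_1d is a hypothetical name and numpy.ma is assumed to be importable. In higher dimensions each row or column may keep a different number of valid values, which is why the problem gets harder there.

```python
import numpy as np
import numpy.ma as ma

def median_1d(a):
    """Median of the unmasked values of a 1-D masked array.
    Returns ma.masked when every value is masked."""
    valid = ma.compressed(a)   # 1-D ndarray holding only unmasked values
    if valid.size == 0:
        return ma.masked
    return np.median(valid)

m = median_1d(ma.array([5.0, 1.0, 100.0, 3.0], mask=[0, 0, 1, 0]))
```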

Ufuncs and Masked Arrays

In changeset:1835, Travis added an __array_wrap__ hook to the MaskedArray class. This was done in an attempt to fix mixed arithmetic. Unfortunately, there is not enough information within the __array_wrap__ hook to correctly generate the mask:

>>> from numpy import *
>>> print ma.array([1])/ma.array([0])
[--]
>>> print array([1])/ma.array([0])
[0]

In order to fix this situation, more information has to be passed to the __array_wrap__ hook by the ufunc. Sasha proposes to change the __array_wrap__ signature from

def __array_wrap__(self, arr)

to

def __array_wrap__(self, arr, context)

and make ufuncs pass a tuple context=(func, args, i), where func is the ufunc itself, args is the tuple of ufunc arguments and i is the index of self in the args tuple. Strictly speaking, i is not necessary, but it is available in ufunc and may prove to be helpful in the future. Extended __array_wrap__ is implemented in changeset:1898.
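
To illustrate why the context helps, here is a minimal sketch of how a hook could derive a mask from context=(func, args, i). The function mask_from_context is hypothetical, and the context tuple is built by hand rather than supplied by NumPy, since the exact hook signature differs between NumPy versions.

```python
import numpy as np

def mask_from_context(context):
    """Derive the mask of a ufunc result from context=(func, args, i)."""
    func, args, i = context
    if func is np.divide:
        # Division: the result is invalid wherever the divisor is zero.
        return np.asarray(args[1]) == 0
    # Other ufuncs: nothing to mask in this simplified sketch.
    return np.zeros(np.shape(args[0]), dtype=bool)

# Hand-built context describing array([1., 2.]) / array([0., 4.]).
context = (np.divide, (np.array([1.0, 2.0]), np.array([0.0, 4.0])), 0)
mask = mask_from_context(context)
```

Without func and args, the hook only sees the result array and cannot tell which entries came from a division by zero.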

Once ufuncs can handle the case of mixed arguments to binary operations, it is tempting to get rid of the ma wrappers to ufuncs altogether and implement the ma logic entirely in the __array_wrap__ and __array__ hooks. Unfortunately, the __array__ hook suffers from the same problem: before passing data to a ufunc, a masked array needs to replace masked values with something safe for the given operation. Doing this requires more information than is passed to the __array__ hook. Sasha proposes to make a similar change to the __array__ hook as for the __array_wrap__ hook above.

The new signature will be

def __array__(self, dtype=None, context=None)

The ufuncs will pass to __array__ a tuple context=(func, args, i), where func is the ufunc itself, args is the tuple of ufunc arguments and i is the index of self in the args tuple. For backward compatibility, ufuncs will still accept a two-argument __array__; classes that take advantage of context will define __array__ with a default value for context, so that it can be called with one or two arguments as well.

Implemented in changeset:1929.


Remaining Issues

Some of the same issues that were resolved in numpy need to be revisited for ma. (See Numeric3.0 Design Document)

  • What does single element indexing return? Scalars or rank-0 arrays?
    An additional complication is that that single element may be masked.

    • What should a single element indexing return for an unmasked element?

    • What should a single element indexing return for a masked element?

      As of changeset:1882, the answer is ma.masked. The singleton ma.masked is defined in ma.py (source: tags/0.9.2/numpy/core/ma.py) as follows:
      masked = MaskedArray([0], int, mask=[1])[0:0]
      masked = masked[0:0]
      
      This is a change from MA, where masked was defined as a rank-0 array. This definition leads to some surprising properties:
      >>> from numpy.core.ma import *
      >>> x = array([1,2,3.0], mask=[0,1,0])
      >>> x[1].shape
      (0,)
      
      At the same time
      >>> x[0].shape
      ()
      
      This can easily be fixed by changing the definition of "masked" back to a rank-0 array (done in changeset:1888). A second surprise concerns the dtype:
      >>> x[1].dtype
      <type 'int64_arrtype'>
      
      At the same time
      >>> x[0].dtype
      <type 'float64_arrtype'>
      
      Unlike the first problem, this one cannot be easily fixed without giving up the ability to check for missing values using
      >>> x[1] is masked
      True
      >>> x[0] is masked
      False
      
      It is tempting to eliminate the special case and just use x[i].mask.all() and x[i].mask.any(), constructs that have a clear meaning for any number of elements. The downside of changing the return value of x[i] for masked elements is that "x[i] is masked" will silently break in a dangerous way: it will always be false.

It may be safer to also change the name "masked" to, say, "missing" and educate users that "x[i] is masked" should be changed to x[i].mask.any(), x[i].mask.all() or even just x[i].mask as appropriate, and that "x[i] = masked" should be changed to "x[i] = missing".
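
The two styles of test can be compared in a short sketch. It assumes a current NumPy with the module available as numpy.ma, where indexing a masked element still returns the ma.masked singleton; the mask-based test goes through ma.getmaskarray rather than the x[i].mask spelling discussed above, because an unmasked element comes back as a plain scalar without a mask attribute.

```python
import numpy.ma as ma

x = ma.array([1, 2, 3.0], mask=[0, 1, 0])

# Identity test against the singleton: only meaningful for one element.
second_is_masked = x[1] is ma.masked
first_is_masked = x[0] is ma.masked

# Mask-based test: the same expression works for one element or a slice.
mask = ma.getmaskarray(x)
any_masked_in_slice = bool(mask[0:2].any())
all_masked_in_slice = bool(mask[0:2].all())
```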

  • Can arrays be used as truth values directly?

An alternative implementation of MaskedArray

As a regular user of MaskedArray, I became increasingly frustrated with the subclassing of masked arrays (even if I can only blame my inexperience). I needed to develop a class of arrays that could store some additional information along with numerical values, while keeping the possibility for missing data (picture storing a series of dates along with measurements). I started to implement such a class, but then quickly realized that any additional information disappeared when processing these subarrays (for example, adding a constant value to a subarray would erase its dates). I ended up writing the equivalent of numpy.core.ma for my particular class, ufuncs included. Everything went fine until I needed to subclass my new class, when more problems showed up: some attributes of the new subclass were lost during processing. I identified the culprit as MaskedArray, which returns masked ndarrays when I expected masked arrays of my class. I was preparing myself to rewrite numpy.core.ma when I forced myself to learn how to subclass ndarrays. As I became more familiar with the __new__ and __array_finalize__ methods, I started to wonder why masked arrays were objects, and not ndarrays, and whether it wouldn't be more convenient for subclassing if they did behave like regular ndarrays.

The new maskedarray is what I eventually came up with. The main differences from the initial numpy.core.ma package are that MaskedArray is now a subclass of ndarray and that the _data section can now be any subclass of ndarray (well, it should work in most cases; some tweaking might be required here and there). Apart from a couple of issues listed below, the behavior of the new MaskedArray class reproduces the old one. It is quite likely to be significantly slower, though: I was more interested in a clear organization than in performance, so I tended to use wrappers liberally. I'm sure we can improve that rather easily. Note that I didn't try to time any methods. I also attach a unittest suite, modeled after the standard numpy one, along with some utilities for testing. The old test_ma can also be run with the new package, but it does fail in some places, see below.

Note that if the subclass has some special methods and attributes, they are not propagated to the masked version: this would require a modification of the __getattribute__ method (first trying ndarray.__getattribute__, then trying self._data.__getattribute__ if an exception is raised in the first place), which really slows things down.

Main differences

  • The _data part of the masked array can be any subclass of ndarray (but not recarray, cf below).
  • fill_value is now a property, not a function.
  • in the majority of cases, the mask is forced to nomask when no value is actually masked. A notable exception is when a masked array (with no masked values) has just been unpickled.
  • I got rid of the share_mask flag, as I never understood its purpose.
  • put, putmask and take now mimic the ndarray methods, to avoid unpleasant surprises. Moreover, put and putmask both update the mask when needed.
  • if a is a masked array, bool(a) raises a ValueError, as it does with ndarrays.
  • in the same way, the comparison of two masked arrays is a masked array, not a boolean
  • filled(a) returns an array of the same subclass as a._data, and no test is performed on whether it is contiguous or not.
  • the mask is always printed, even if it's nomask, which makes things easier (for me at least) to remember that a masked array is used.
  • cumsum works as if the _data array was filled with 0. The mask is preserved, but not updated.
  • cumprod works as if the _data array was filled with 1. The mask is preserved, but not updated.
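
Several of these behaviors can be checked with a short sketch. It assumes a current NumPy, whose numpy.ma module descends from this ndarray-based implementation; whether every detail still matches the package described here is an assumption.

```python
import numpy.ma as ma

a = ma.array([1, 2, 3], mask=[0, 1, 0])

running_sum = a.cumsum()     # computed as if the masked entry were 0
running_prod = a.cumprod()   # computed as if the masked entry were 1
default_fill = a.fill_value  # attribute access, not a function call

try:
    bool(a)                  # ambiguous for more than one element
    truth_value_error = False
except ValueError:
    truth_value_error = True

comparison = (a == a)        # a masked array, not a single boolean
```

In both cumulative results the second entry stays masked, while the entries after it are computed as if it had been the neutral element.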

New features

  • the mr_ function mimics r_ for masked arrays.
  • the anom method returns the anomalies (deviations from the average)
  • the stdu and varu return unbiased estimates of the standard deviation and variance, respectively.
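
A brief sketch of these features, assuming a current NumPy with the module available as numpy.ma. There, mr_ and anom are available under these names; stdu and varu are not used below, and std(ddof=1) and var(ddof=1) are shown as assumed stand-ins for the unbiased estimates.

```python
import numpy.ma as ma

a = ma.array([1.0, 2.0, 3.0], mask=[0, 0, 1])
b = ma.array([4.0, 5.0])

joined = ma.mr_[a, b]          # r_-style concatenation that keeps masks
anomalies = a.anom()           # deviations from the mean of unmasked values
unbiased_var = a.var(ddof=1)   # assumed stand-in for varu (N-1 denominator)
unbiased_std = a.std(ddof=1)   # assumed stand-in for stdu
```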

Using the new package with numpy.core.ma

I tried to make sure that the new package can understand old masked arrays. Unfortunately, the reverse does not hold: the old package does not recognize new masked arrays.

For example:

>>> import numpy.core.ma as old_ma
>>> import maskedarray as new_ma
>>> x = old_ma.array([1,2,3,4,5], mask=[0,0,1,0,0])
>>> x
array(data =
 [     1      2 999999      4      5],
      mask =
 [False False True False False],
      fill_value=999999)
>>> y = new_ma.array([1,2,3,4,5], mask=[0,0,1,0,0])
>>> y
array(data = [1 2 -- 4 5],
      mask = [False False True False False],
      fill_value=999999)
>>> x==y
array(data =
 [True True True True True],
      mask =
 [False False True False False],
      fill_value=?)
>>> old_ma.getmask(x) == new_ma.getmask(x)
array([True, True, True, True, True], dtype=bool)
>>> old_ma.getmask(y) == new_ma.getmask(y)
array([True, True, False, True, True], dtype=bool)
>>> old_ma.getmask(y)
False

A basic consequence is that matplotlib will not recognize new masked arrays as such. The file matplotlib/numerix/ma/__init__.py must be modified to call the new package instead of numpy.core.ma.

Revision notes

01/23/2007: The package has been moved to the SciPy sandbox and is regularly updated: please check your SVN version!

10/28/2006: Updated put, deleted putmask to match numpy 1.0.

Masked records

Like numpy.core.ma, the ndarray-based implementation of MaskedArray is limited when working with records: you can mask any record of the array, but not a field in a record. If you need this feature, you may want to give mrecords a try (available in the sandbox/maskedarray package of SciPy), which defines a new class, MaskedRecord. An instance of this class accepts a recarray as data, and uses two masks: the recordmask has as many entries as there are records in the array, each entry with the same fields as a record, but of boolean type, indicating whether a field is masked or not; an entry is flagged as masked in the mask array if at least one field is masked. A few examples in the file should give you an idea of what can be done. Note that maskedrecordarray is still quite experimental...
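
Field-level masking can be illustrated with a sketch that does not use the sandbox mrecords module. Instead it relies on the structured-dtype support of numpy.ma in a current NumPy, which is an assumption relative to the package described here; the field names a and b are arbitrary.

```python
import numpy.ma as ma

records = ma.array(
    [(1, 2.0), (3, 4.0)],
    mask=[(False, True), (False, False)],   # one boolean per field
    dtype=[("a", int), ("b", float)],
)

# Mask of a single field across all records.
field_a_mask = ma.getmaskarray(records["a"])
field_b_mask = ma.getmaskarray(records["b"])
```

Here only field b of the first record is masked; field a of the same record remains valid.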

Please note that it's still a work in progress (even if it seems to work OK when I use it). Suggestions, comments, improvements and general feedback are more than welcome! Finally, I'd like to thank Paul, Travis and Sasha for the original masked array package: without you, I would never have started this (it might be argued that I shouldn't have anyway, but that's another story...).
