Changes between Version 1 and Version 2 of MaskedArrayAlternative

Timestamp:
08/25/07 19:00:52 (6 years ago)
Author:
pierregm
Comment:

--

  • MaskedArrayAlternative

    v1 v2  
    33'''Note: the new implementation of MaskedArray is now available in the scipy sandbox. ''' 
    44 
    5 === History === 
     5== History == 
    66 
    77As a regular user of MaskedArray, I (Pierre G.F. Gerard-Marchant) became increasingly frustrated with the subclassing of masked arrays (even if I can only blame my inexperience). I needed to develop a class of arrays that could store some additional information along with numerical values, while keeping the possibility for missing data (picture storing a series of dates along with measurements, what would later become the {{{TimeSeries}}} package).  
     
    1414Note that if the subclass has some special methods and attributes, they are not propagated to the masked version: this would require a modification of the {{{__getattribute__}}} method (first trying {{{ndarray.__getattribute__}}}, then trying {{{self._data.__getattribute__}}} if an exception is raised in the first place), which really slows things down.  
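The fallback lookup described above can be sketched in plain Python. This is only an illustration under stated assumptions: `Tagged` and `MaskedWrapper` are hypothetical stand-ins (not classes from the package), and `object.__getattribute__` plays the role that `ndarray.__getattribute__` would play in the real implementation.

```python
class Tagged:
    """Hypothetical stand-in for an ndarray subclass carrying extra information."""

    def __init__(self, values, units):
        self.values = list(values)
        self.units = units

    def describe(self):
        return f"{len(self.values)} values in {self.units}"


class MaskedWrapper:
    """Hypothetical masked container holding its data in `_data`."""

    def __init__(self, data, mask):
        self._data = data
        self._mask = list(mask)

    def __getattribute__(self, name):
        # First try the normal lookup on the wrapper itself.
        try:
            return object.__getattribute__(self, name)
        except AttributeError:
            # Only if that fails, delegate to the wrapped data object.
            data = object.__getattribute__(self, "_data")
            return getattr(data, name)


w = MaskedWrapper(Tagged([1, 2, 3], "mm"), [False, True, False])
print(w.units)       # attribute found on the wrapped data
print(w.describe())  # method found on the wrapped data
```

Because every attribute access, including ordinary ones such as `_mask`, now goes through a Python-level method, the cost mentioned above is paid on each lookup and not only on the delegated ones.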
    1515 
    16 === Main differences === 
     16== Main differences == 
    1717  * The {{{_data}}} part of the masked array can be any subclass of ndarray (but not recarray, cf below). 
    1818  * {{{fill_value}}} is now a property, not a function. 
     
    2727  * {{{cumprod}}} works as if the {{{_data}}} array was filled with 1. The mask is preserved, but not updated. 
    2828 
    29 === New features === 
     29== New features == 
     30  This list is non-exhaustive... 
    3031  * the {{{mr_}}} function mimics {{{r_}}} for masked arrays. 
    3132  * the {{{anom}}} method returns the anomalies (deviations from the average) 
    3233  * the {{{stdu}}} and {{{varu}}} methods return unbiased estimates of the standard deviation and variance, respectively. 
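The features listed above can be tried with `numpy.ma`, used here as an assumed stand-in for the sandbox package. To my knowledge `numpy.ma` has no `stdu` or `varu` methods, so the sketch uses `ddof=1` as the assumed equivalent for the unbiased estimates.

```python
import numpy.ma as ma  # assumption: stands in for the sandbox maskedarray package

a = ma.array([1.0, 2.0, 3.0, 4.0], mask=[0, 0, 0, 1])
b = ma.array([5.0, 6.0], mask=[1, 0])

# mr_ concatenates like r_, but keeps the masks of its inputs.
joined = ma.mr_[a, b]

# anom returns the deviations from the mean of the unmasked values.
dev = a.anom()

# Unbiased estimates: divide by (number of unmasked values - 1).
var_unbiased = a.var(ddof=1)
std_unbiased = a.std(ddof=1)

print(joined)
print(dev)
print(var_unbiased, std_unbiased)
```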
    3334 
    34 === Using the new package with numpy.core.ma === 
     35== Using the new package with numpy.core.ma == 
    3536  I tried to make sure that the new package can understand old masked arrays. Unfortunately, there's no upward compatibility. 
    3637For example: 
     
    6465}}} 
    6566   
    66 === Using maskedarray with matplotlib === 
     67== Using maskedarray with matplotlib == 
    6768By default matplotlib still uses numpy.ma, but there is an rcParams setting that you can use to select maskedarray instead.  In the matplotlibrc file you will find: 
    6869 
     
    7980}}} 
    8081 
    81 === Masked records === 
     82== Masked records == 
    8283  Like {{{numpy.core.ma}}}, the {{{ndarray}}}-based implementation of {{{MaskedArray}}} is limited when working with records: you can mask any record of the array, but not an individual field in a record. If you need this feature, you may want to give the {{{mrecords}}} package a try (available in the {{{maskedarray}}} directory in the scipy sandbox). This module defines a new class, {{{MaskedRecord}}}. An instance of this class accepts a {{{recarray}}} as data and uses two masks. The {{{fieldmask}}} has as many entries as there are records in the array; each entry has the same fields as a record, but with boolean types that indicate whether each field is masked. The {{{mask}}} array flags a record as masked only if all of its fields are masked. A few examples in the file should give you an idea of what can be done. Note that {{{mrecords}}} is still experimental... 
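The two-mask idea can be illustrated with the structured masks of a current `numpy.ma`. This is an analogy under stated assumptions, not the `MaskedRecord` API: the field names are invented for the example, and the record-level mask is derived by hand from the field-level one.

```python
import numpy as np
import numpy.ma as ma  # assumption: illustrates the idea, not the mrecords API

dt = [("day", int), ("value", float)]  # hypothetical field names
data = np.array([(1, 0.5), (2, 1.5), (3, 2.5)], dtype=dt)

# One boolean per field and per record, playing the role of the field mask.
field_mask = [(False, True), (False, False), (True, True)]
recs = ma.array(data, mask=field_mask)

# Mask of a single field across all records.
value_mask = recs.mask["value"]

# Record-level mask: a record counts as masked only if all its fields are masked.
record_mask = np.array([all(row) for row in recs.mask.tolist()])

print(value_mask)
print(record_mask)
```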
    8384 
     85== Optimizing maskedarray == 
    8486 
    85 === Thanks === 
     87=== Should masked arrays be filled before processing or not? === 
     88  In the current implementation, most operations on masked arrays involve the following steps: 
     89* the input arrays are filled 
     90* the operation is performed on the filled arrays 
     91* the mask is set for the results, from the combination of the input masks and the mask corresponding to the domain of the operation. 
     92 
     93For example, consider the division of two masked arrays: 
{{{
#!python
import numpy
import maskedarray as ma
# Two float arrays: the first entry of x and the last entry of y are masked.
x = ma.array([1, 2, 3, 4], mask=[1, 0, 0, 0], dtype=float)
y = ma.array([-1, 0, 1, 2], mask=[0, 0, 0, 1], dtype=float)
}}}
     101 
     102The division of x by y is then computed as 
{{{
#!python
d1 = x.filled(0)                              # d1 = array([0., 2., 3., 4.])
d2 = y.filled(1)                              # d2 = array([-1., 0., 1., 1.])
m = ma.mask_or(ma.getmask(x), ma.getmask(y))  # m = array([True, False, False, True])
dm = ma.divide.domain(d1, d2)                 # dm = array([False, True, False, False])
result = (d1/d2).view(ma.MaskedArray)         # data: [-0., inf, 3., 4.]
result._mask = numpy.logical_or(m, dm)        # mask: [True, True, False, True]
}}}
     112 
     113Note that a division by zero takes place. To avoid it, we can fill the input arrays while taking the domain mask into account, so that the divisor is replaced by 1 wherever the operation is undefined: 
{{{
#!python
d1 = x._data.copy()                           # d1 = array([1., 2., 3., 4.])
d2 = y._data.copy()                           # d2 = array([-1., 0., 1., 2.])
dm = ma.divide.domain(d1, d2)                 # dm = array([False, True, False, False])
numpy.putmask(d2, dm, 1)                      # d2 = array([-1., 1., 1., 2.])
m = ma.mask_or(ma.getmask(x), ma.getmask(y))  # m = array([True, False, False, True])
result = (d1/d2).view(ma.MaskedArray)         # data: [-1., 2., 3., 2.] (no division by zero)
result._mask = numpy.logical_or(m, dm)        # mask: [True, True, False, True]
}}}
     124Note that the {{{.copy()}}} is required to avoid updating the inputs with {{{putmask}}}. 
     125In the previous version, the {{{.filled}}} methods involved a {{{.copy()}}}. 
     126 
     127A third possibility is to avoid filling the arrays altogether: 
{{{
#!python
d1 = x._data                                  # d1 = array([1., 2., 3., 4.])
d2 = y._data                                  # d2 = array([-1., 0., 1., 2.])
dm = ma.divide.domain(d1, d2)                 # dm = array([False, True, False, False])
m = ma.mask_or(ma.getmask(x), ma.getmask(y))  # m = array([True, False, False, True])
result = (d1/d2).view(ma.MaskedArray)         # data: [-1., inf, 3., 2.]
result._mask = numpy.logical_or(m, dm)        # mask: [True, True, False, True]
}}}
     137Note that here again the division by zero takes place. 
     138 
     139A quick benchmark gives the following results: 
     140  * {{{numpy.ma.divide}}}  : 2.84 ms per loop 
     141  * classical division     : 2.99 ms per loop 
     142  * division w/ prefilling : 2.20 ms per loop 
     143  * division w/o filling   : 1.54 ms per loop 
     144 
     145So, is it worth filling the arrays beforehand? Yes, if we want to avoid floating-point exceptions that may fill the result with infs and nans. No, if we are only interested in speed... 
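A comparison of this kind can be reproduced with a small `timeit` harness. The sketch below makes several assumptions: it uses `numpy.ma` in place of the sandbox package, the helper names (`domain`, `div_prefilled`, `div_unfilled`) are hypothetical, and `domain` is a simplified exact-zero test, not the package's own domain object. Timings depend on the machine and NumPy version, so none are quoted here.

```python
import timeit

import numpy as np
import numpy.ma as ma  # assumption: stands in for the sandbox maskedarray package

rng = np.random.default_rng(0)
n = 10_000
x = ma.array(rng.uniform(-1, 1, n), mask=rng.random(n) < 0.1)
# Integer-valued divisor so that exact zeros occur in the data.
y = ma.array(rng.integers(-2, 3, n).astype(float), mask=rng.random(n) < 0.1)


def domain(a, b):
    """Hypothetical simplified domain test: True where the divisor is zero."""
    return b == 0


def div_prefilled(x, y):
    """Replace invalid divisors by 1 before dividing (needs a copy)."""
    d1 = x.data.copy()
    d2 = y.data.copy()
    dm = domain(d1, d2)
    np.putmask(d2, dm, 1)
    m = ma.getmaskarray(x) | ma.getmaskarray(y)
    result = (d1 / d2).view(ma.MaskedArray)
    result.mask = m | dm
    return result


def div_unfilled(x, y):
    """Divide the raw data and only mask the invalid results afterwards."""
    d1, d2 = x.data, y.data
    dm = domain(d1, d2)
    m = ma.getmaskarray(x) | ma.getmaskarray(y)
    with np.errstate(divide="ignore", invalid="ignore"):
        result = (d1 / d2).view(ma.MaskedArray)
    result.mask = m | dm
    return result


candidates = {
    "ma.divide": lambda: ma.divide(x, y),
    "division w/ prefilling": lambda: div_prefilled(x, y),
    "division w/o filling": lambda: div_unfilled(x, y),
}
for label, func in candidates.items():
    seconds = timeit.timeit(func, number=20) / 20
    print(f"{label:24s}: {seconds * 1e3:.3f} ms per loop")
```

All three variants produce the same mask; they differ only in what is stored under the masked entries and in whether a division by zero is actually carried out.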
     146 
     147 
     148 
     149 
     150== Thanks == 
    86151  I'd like to thank Paul Dubois, Travis Oliphant and Sasha for the original masked array package: without you, I would never have started that (it might be argued that I shouldn't have anyway, but that's another story...). 
    87152  I also wish to extend my thanks to Reggie Dugard and Eric Firing for their suggestions and numerous improvements.