Changes between Version 6 and Version 7 of MaskedArrayAlternative

Show
Ignore:
Timestamp:
02/10/08 09:23:51 (5 years ago)
Author:
jarrod.millman
Comment:

--

Legend:

Unmodified
Added
Removed
Modified
  • MaskedArrayAlternative

    v6 v7  
    1 = An alternative implementation of MaskedArray = 
    2  
    3 '''Note: the new implementation of MaskedArray is now available in the scipy sandbox. ''' 
    4  
    5 == History == 
    6  
    7 As a regular user of MaskedArray, I (Pierre G.F. Gerard-Marchant) became increasingly frustrated with the subclassing of masked arrays (even if I can only blame my inexperience). I needed to develop a class of arrays that could store some additional information along with numerical values, while keeping the possibility for missing data (picture storing a series of dates along with measurements, what would later become the {{{TimeSeries}}} package).  
    8  
    9 I started to implement such a class, but then quickly realized that any additional information disappeared when processing these subarrays (for example, adding a constant value to a subarray would erase its dates). I ended up writing the equivalent of {{{numpy.core.ma}}} for my particular class, ufuncs included. Everything went fine until I needed to subclass my new class, when more problems showed up: some attributes of the new subclass were lost during processing. I identified the culprit as {{{MaskedArray}}}, which returns masked ndarrays when I expected masked arrays of my class. I was preparing myself to rewrite {{{numpy.core.ma}}} when I forced myself to learn how to subclass ndarrays. As I became more familiar with the {{{__new__}}} and {{{__array_finalize__}}} methods, I started to wonder why masked arrays were objects, and not ndarrays, and whether it wouldn't be more convenient for subclassing if they did behave like regular ndarrays. 
    10  
    11 The new {{{maskedarray}}} is what I eventually come up with. The main differences with the initial {{{numpy.core.ma}}} package are that {{{MaskedArray}}} is now a subclass of {{{ndarray}}} and that the {{{_data}}} section can now be any subclass of {{{ndarray}}}. Apart from a couple of issues listed below, the behavior of the new {{{MaskedArray}}} class reproduces the old one. Initially the {{{maskedarray}}} implementation was marginally slower than {{{numpy.ma}}} in some areas, but work is underway to speed it up; the expectation is that it can be made substantially faster than the present {{{numpy.ma}}}. 
    12  
    13  
    14 Note that if the subclass has some special methods and attributes, they are not propagated to the masked version: this would require a modification of the {{{__getattribute__}}} method (first trying {{{ndarray.__getattribute__}}}, then trying {{{self._data.__getattribute__}}} if an exception is raised in the first place), which really slows things down.  
    15  
    16 == Main differences == 
    17   * The {{{_data}}} part of the masked array can be any subclass of ndarray (but not recarray, cf below). 
    18   * {{{fill_value}}} is now a property, not a function. 
    19   * in the majority of cases, the mask is forced to {{{nomask}}} when no value is actually masked. A notable exception is when a masked array (with no masked values) has just been unpickled. 
    20   * I got rid of the {{{share_mask}}} flag, I never understood its purpose.  
    21   * {{{put}}}, {{{putmask}}} and {{{take}}} now mimic the ndarray methods, to avoid unpleasant surprises. Moreover, {{{put}}} and {{{putmask}}} both update the mask when needed. 
    22   * if {{{a}}} is a masked array, {{{bool(a)}}} raises a {{{ValueError}}}, as it does with ndarrays. 
    23   * in the same way, the comparison of two masked arrays is a masked array, not a boolean 
    24   * {{{filled(a)}}} returns an array of the same subclass as {{{a._data}}}, and no test is performed on whether it is contiguous or not. 
    25   * the mask is always printed, even if it's {{{nomask}}}, which makes things easier (for me at least) to remember that a masked array is used. 
    26   * {{{cumsum}}} works as if the {{{_data}}} array was filled with 0. The mask is preserved, but not updated. 
    27   * {{{cumprod}}} works as if the {{{_data}}} array was filled with 1. The mask is preserved, but not updated. 
    28  
    29 == New features == 
    30   This list is non-exhaustive... 
    31   * the {{{mr_}}} function mimics {{{r_}}} for masked arrays. 
    32   * the {{{anom}}} method returns the anomalies (deviations from the average) 
    33   * the {{{stdu}}} and {{{varu}}} return unbiased estimates of the standard deviation and variance, respectively. 
    34  
    35 == Using the new package with numpy.core.ma == 
    36   I tried to make sure that the new package can understand old masked arrays. Unfortunately, there's no upward compatibility. 
    37 For example: 
    38 {{{ 
    39 >>> import numpy.core.ma as old_ma 
    40 >>> import maskedarray as new_ma 
    41 >>> x = old_ma.array([1,2,3,4,5], mask=[0,0,1,0,0]) 
    42 >>> x 
    43 array(data = 
    44  [     1      2 999999      4      5], 
    45       mask = 
    46  [False False True False False], 
    47       fill_value=999999) 
    48 >>> y = new_ma.array([1,2,3,4,5], mask=[0,0,1,0,0]) 
    49 >>> y 
    50 array(data = [1 2 -- 4 5], 
    51       mask = [False False True False False], 
    52       fill_value=999999) 
    53 >>> x==y 
    54 array(data = 
    55  [True True True True True], 
    56       mask = 
    57  [False False True False False], 
    58       fill_value=?) 
    59 >>> old_ma.getmask(x) == new_ma.getmask(x) 
    60 array([True, True, True, True, True], dtype=bool) 
    61 >>> old_ma.getmask(y) == new_ma.getmask(y) 
    62 array([True, True, False, True, True], dtype=bool) 
    63 >>> old_ma.getmask(y) 
    64 False 
    65 }}} 
    66    
    67 == Using maskedarray with matplotlib == 
    68 By default matplotlib still uses numpy.ma, but there is an rcParams setting that you can use to select maskedarray instead.  In the matplotlibrc file you will find: 
    69  
    70 {{{ 
    71 #maskedarray : False       # True to use external maskedarray module 
    72                            # instead of numpy.ma; this is a temporary 
    73                            # setting for testing maskedarray. 
    74 }}} 
    75  
    76 Uncomment and set to True to select maskedarray everywhere.  Alternatively, you can test a script with maskedarray by using a command-line option, e.g.: 
    77  
    78 {{{ 
    79 python simple_plot.py --maskedarray 
    80 }}} 
    81  
    82 == Masked records == 
    83   Like {{{numpy.core.ma}}}, the {{{ndarray}}}-based implementation of {{{MaskedArray}}} is limited when working with records: you can mask any record of the array, but not a field in a record. If you need this feature, you may want to give the {{{mrecords}}} package a try (available in the {{{maskedarray}}} directory in the scipy sandbox). This module defines a new class, {{{MaskedRecord}}}. An instance of this class accepts a {{{recarray}}} as data, and uses two masks: the {{{fieldmask}}} has as many entries as records in the array, each entry with the same fields as a record, but of boolean types: they indicate whether the field is masked or not; a record entry is flagged as masked in the {{{mask}}} array if all the fields are masked. A few examples in the file should give you an idea of what can be done. Note that {{{mrecords}}} is still experimental... 
    84  
    85 == Optimizing maskedarray == 
    86  
    87 === Should masked arrays be filled before processing or not ? === 
    88   In the current implementation, most operations on masked arrays involve the following steps: 
    89  * the input arrays are filled 
    90  * the operation is performed on the filled arrays 
    91  * the mask is set for the results, from the combination of the input masks and the mask corresponding to the domain of the operation. 
    92  
    93 For example, consider the division of two masked arrays: 
    94 {{{ 
    95 #!python 
    96 import numpy 
    97 import maskedarray as ma 
    98 x = ma.array([1,2,3,4],mask=[1,0,0,0], dtype=numpy.float_) 
    99 y = ma.array([-1,0,1,2], mask=[0,0,0,1], dtype=numpy.float_) 
    100 }}} 
    101  
    102 The division of x by y is then computed as 
    103 {{{ 
    104 #!python 
    105 d1 = x.filled(0) # d1 = array([0., 2., 3., 4.]) 
    106 d2 = y.filled(1) # array([-1.,  0.,  1.,  1.]) 
    107 m = ma.mask_or(ma.getmask(x), ma.getmask(y)) # m = array([True,False,False,True]) 
    108 dm = ma.divide.domain(d1,d2) # array([False,  True, False, False]) 
    109 result = (d1/d2).view(MaskedArray) # masked_array([-0. inf, 3., 4.]) 
    110 result._mask = logical_or(m, dm) 
    111 }}} 
    112  
    113 Note that a division by zero takes place. To avoid it, we can consider to fill the input arrays, taking the domain mask into account, so that: 
    114 {{{ 
    115 #!python 
    116 d1 = x._data.copy() # d1 = array([1., 2., 3., 4.]) 
    117 d2 = y._data.copy() # array([-1.,  0.,  1.,  2.]) 
    118 dm = ma.divide.domain(d1,d2) # array([False,  True, False, False]) 
    119 numpy.putmask(d2, dm, 1) # d2 = array([-1.,  1.,  1.,  2.]) 
    120 m = ma.mask_or(ma.getmask(x), ma.getmask(y)) # m = array([True,False,False,True]) 
    121 result = (d1/d2).view(MaskedArray) # masked_array([-1. 0., 3., 2.]) 
    122 result._mask = logical_or(m, dm) 
    123 }}} 
    124 Note that the {{{.copy()}}} is required to avoid updating the inputs with {{{putmask}}}. 
    125 The {{{.filled()}}} method also involves a {{{.copy()}}}. 
    126  
    127 A third possibility consists in avoid filling the arrays: 
    128 {{{ 
    129 #!python 
    130 d1 = x._data # d1 = array([1., 2., 3., 4.]) 
    131 d2 = y._data # array([-1.,  0.,  1.,  2.]) 
    132 dm = ma.divide.domain(d1,d2) # array([False,  True, False, False]) 
    133 m = ma.mask_or(ma.getmask(x), ma.getmask(y)) # m = array([True,False,False,True]) 
    134 result = (d1/d2).view(MaskedArray) # masked_array([-1. inf, 3., 2.]) 
    135 result._mask = logical_or(m, dm) 
    136 }}} 
    137 Note that here again the division by zero takes place. 
    138  
    139 A quick benchmark gives the following results: 
    140   * {{{numpy.ma.divide}}}  : 2.69 ms per loop 
    141   * classical division     : 2.21 ms per loop 
    142   * division w/ prefilling : 2.34 ms per loop 
    143   * division w/o filling   : 1.55 ms per loop 
    144  
    145 So, is it worth filling the arrays beforehand ? Yes, if we are interested in avoiding floating-point exceptions that may fill the result with infs and nans. No, if we are only interested into speed... 
    146  
    147  
    148  
    149  
    150 == Thanks == 
    151   I'd like to thank Paul Dubois, Travis Oliphant and Sasha for the original masked array package: without you, I would never have started that (it might be argued that I shouldn't have anyway, but that's another story...). 
    152   I also wish to extend these thanks to Reggie Dugard and Eric Firing for their suggestions and numerous improvements. 
    153  
    154  
    155 === Revision notes === 
    156   * 08/25/2007 : Creation of this page 
    157   * 01/23/2007 : The package has been moved to the SciPy sandbox, and is regularly updated: please check out your SVN version! 
    158  
    159  
    160  
     1[[ReST(/trunk/numpy/ma/README.txt)]]