| 1 | | = An alternative implementation of MaskedArray = |
| 2 | | |
| 3 | | '''Note: the new implementation of MaskedArray is now available in the scipy sandbox. ''' |
| 4 | | |
| 5 | | == History == |
| 6 | | |
| 7 | | As a regular user of MaskedArray, I (Pierre G.F. Gerard-Marchant) became increasingly frustrated with the subclassing of masked arrays (even if I can only blame my inexperience). I needed to develop a class of arrays that could store some additional information along with numerical values, while keeping the possibility for missing data (picture storing a series of dates along with measurements, what would later become the {{{TimeSeries}}} package). |
| 8 | | |
| 9 | | I started to implement such a class, but then quickly realized that any additional information disappeared when processing these subarrays (for example, adding a constant value to a subarray would erase its dates). I ended up writing the equivalent of {{{numpy.core.ma}}} for my particular class, ufuncs included. Everything went fine until I needed to subclass my new class, when more problems showed up: some attributes of the new subclass were lost during processing. I identified the culprit as {{{MaskedArray}}}, which returns masked ndarrays when I expected masked arrays of my class. I was preparing myself to rewrite {{{numpy.core.ma}}} when I forced myself to learn how to subclass ndarrays. As I became more familiar with the {{{__new__}}} and {{{__array_finalize__}}} methods, I started to wonder why masked arrays were objects, and not ndarrays, and whether it wouldn't be more convenient for subclassing if they did behave like regular ndarrays. |
| 10 | | |
| 11 | | The new {{{maskedarray}}} is what I eventually come up with. The main differences with the initial {{{numpy.core.ma}}} package are that {{{MaskedArray}}} is now a subclass of {{{ndarray}}} and that the {{{_data}}} section can now be any subclass of {{{ndarray}}}. Apart from a couple of issues listed below, the behavior of the new {{{MaskedArray}}} class reproduces the old one. Initially the {{{maskedarray}}} implementation was marginally slower than {{{numpy.ma}}} in some areas, but work is underway to speed it up; the expectation is that it can be made substantially faster than the present {{{numpy.ma}}}. |
| 12 | | |
| 13 | | |
| 14 | | Note that if the subclass has some special methods and attributes, they are not propagated to the masked version: this would require a modification of the {{{__getattribute__}}} method (first trying {{{ndarray.__getattribute__}}}, then trying {{{self._data.__getattribute__}}} if an exception is raised in the first place), which really slows things down. |
| 15 | | |
| 16 | | == Main differences == |
| 17 | | * The {{{_data}}} part of the masked array can be any subclass of ndarray (but not recarray, cf below). |
| 18 | | * {{{fill_value}}} is now a property, not a function. |
| 19 | | * in the majority of cases, the mask is forced to {{{nomask}}} when no value is actually masked. A notable exception is when a masked array (with no masked values) has just been unpickled. |
| 20 | | * I got rid of the {{{share_mask}}} flag, I never understood its purpose. |
| 21 | | * {{{put}}}, {{{putmask}}} and {{{take}}} now mimic the ndarray methods, to avoid unpleasant surprises. Moreover, {{{put}}} and {{{putmask}}} both update the mask when needed. |
| 22 | | * if {{{a}}} is a masked array, {{{bool(a)}}} raises a {{{ValueError}}}, as it does with ndarrays. |
| 23 | | * in the same way, the comparison of two masked arrays is a masked array, not a boolean |
| 24 | | * {{{filled(a)}}} returns an array of the same subclass as {{{a._data}}}, and no test is performed on whether it is contiguous or not. |
| 25 | | * the mask is always printed, even if it's {{{nomask}}}, which makes things easier (for me at least) to remember that a masked array is used. |
| 26 | | * {{{cumsum}}} works as if the {{{_data}}} array was filled with 0. The mask is preserved, but not updated. |
| 27 | | * {{{cumprod}}} works as if the {{{_data}}} array was filled with 1. The mask is preserved, but not updated. |
| 28 | | |
| 29 | | == New features == |
| 30 | | This list is non-exhaustive... |
| 31 | | * the {{{mr_}}} function mimics {{{r_}}} for masked arrays. |
| 32 | | * the {{{anom}}} method returns the anomalies (deviations from the average) |
| 33 | | * the {{{stdu}}} and {{{varu}}} return unbiased estimates of the standard deviation and variance, respectively. |
| 34 | | |
| 35 | | == Using the new package with numpy.core.ma == |
| 36 | | I tried to make sure that the new package can understand old masked arrays. Unfortunately, there's no upward compatibility. |
| 37 | | For example: |
| 38 | | {{{ |
| 39 | | >>> import numpy.core.ma as old_ma |
| 40 | | >>> import maskedarray as new_ma |
| 41 | | >>> x = old_ma.array([1,2,3,4,5], mask=[0,0,1,0,0]) |
| 42 | | >>> x |
| 43 | | array(data = |
| 44 | | [ 1 2 999999 4 5], |
| 45 | | mask = |
| 46 | | [False False True False False], |
| 47 | | fill_value=999999) |
| 48 | | >>> y = new_ma.array([1,2,3,4,5], mask=[0,0,1,0,0]) |
| 49 | | >>> y |
| 50 | | array(data = [1 2 -- 4 5], |
| 51 | | mask = [False False True False False], |
| 52 | | fill_value=999999) |
| 53 | | >>> x==y |
| 54 | | array(data = |
| 55 | | [True True True True True], |
| 56 | | mask = |
| 57 | | [False False True False False], |
| 58 | | fill_value=?) |
| 59 | | >>> old_ma.getmask(x) == new_ma.getmask(x) |
| 60 | | array([True, True, True, True, True], dtype=bool) |
| 61 | | >>> old_ma.getmask(y) == new_ma.getmask(y) |
| 62 | | array([True, True, False, True, True], dtype=bool) |
| 63 | | >>> old_ma.getmask(y) |
| 64 | | False |
| 65 | | }}} |
| 66 | | |
| 67 | | == Using maskedarray with matplotlib == |
| 68 | | By default matplotlib still uses numpy.ma, but there is an rcParams setting that you can use to select maskedarray instead. In the matplotlibrc file you will find: |
| 69 | | |
| 70 | | {{{ |
| 71 | | #maskedarray : False # True to use external maskedarray module |
| 72 | | # instead of numpy.ma; this is a temporary |
| 73 | | # setting for testing maskedarray. |
| 74 | | }}} |
| 75 | | |
| 76 | | Uncomment and set to True to select maskedarray everywhere. Alternatively, you can test a script with maskedarray by using a command-line option, e.g.: |
| 77 | | |
| 78 | | {{{ |
| 79 | | python simple_plot.py --maskedarray |
| 80 | | }}} |
| 81 | | |
| 82 | | == Masked records == |
| 83 | | Like {{{numpy.core.ma}}}, the {{{ndarray}}}-based implementation of {{{MaskedArray}}} is limited when working with records: you can mask any record of the array, but not a field in a record. If you need this feature, you may want to give the {{{mrecords}}} package a try (available in the {{{maskedarray}}} directory in the scipy sandbox). This module defines a new class, {{{MaskedRecord}}}. An instance of this class accepts a {{{recarray}}} as data, and uses two masks: the {{{fieldmask}}} has as many entries as records in the array, each entry with the same fields as a record, but of boolean types: they indicate whether the field is masked or not; a record entry is flagged as masked in the {{{mask}}} array if all the fields are masked. A few examples in the file should give you an idea of what can be done. Note that {{{mrecords}}} is still experimental... |
| 84 | | |
| 85 | | == Optimizing maskedarray == |
| 86 | | |
| 87 | | === Should masked arrays be filled before processing or not ? === |
| 88 | | In the current implementation, most operations on masked arrays involve the following steps: |
| 89 | | * the input arrays are filled |
| 90 | | * the operation is performed on the filled arrays |
| 91 | | * the mask is set for the results, from the combination of the input masks and the mask corresponding to the domain of the operation. |
| 92 | | |
| 93 | | For example, consider the division of two masked arrays: |
| 94 | | {{{ |
| 95 | | #!python |
| 96 | | import numpy |
| 97 | | import maskedarray as ma |
| 98 | | x = ma.array([1,2,3,4],mask=[1,0,0,0], dtype=numpy.float_) |
| 99 | | y = ma.array([-1,0,1,2], mask=[0,0,0,1], dtype=numpy.float_) |
| 100 | | }}} |
| 101 | | |
| 102 | | The division of x by y is then computed as |
| 103 | | {{{ |
| 104 | | #!python |
| 105 | | d1 = x.filled(0) # d1 = array([0., 2., 3., 4.]) |
| 106 | | d2 = y.filled(1) # array([-1., 0., 1., 1.]) |
| 107 | | m = ma.mask_or(ma.getmask(x), ma.getmask(y)) # m = array([True,False,False,True]) |
| 108 | | dm = ma.divide.domain(d1,d2) # array([False, True, False, False]) |
| 109 | | result = (d1/d2).view(MaskedArray) # masked_array([-0. inf, 3., 4.]) |
| 110 | | result._mask = logical_or(m, dm) |
| 111 | | }}} |
| 112 | | |
| 113 | | Note that a division by zero takes place. To avoid it, we can consider to fill the input arrays, taking the domain mask into account, so that: |
| 114 | | {{{ |
| 115 | | #!python |
| 116 | | d1 = x._data.copy() # d1 = array([1., 2., 3., 4.]) |
| 117 | | d2 = y._data.copy() # array([-1., 0., 1., 2.]) |
| 118 | | dm = ma.divide.domain(d1,d2) # array([False, True, False, False]) |
| 119 | | numpy.putmask(d2, dm, 1) # d2 = array([-1., 1., 1., 2.]) |
| 120 | | m = ma.mask_or(ma.getmask(x), ma.getmask(y)) # m = array([True,False,False,True]) |
| 121 | | result = (d1/d2).view(MaskedArray) # masked_array([-1. 0., 3., 2.]) |
| 122 | | result._mask = logical_or(m, dm) |
| 123 | | }}} |
| 124 | | Note that the {{{.copy()}}} is required to avoid updating the inputs with {{{putmask}}}. |
| 125 | | The {{{.filled()}}} method also involves a {{{.copy()}}}. |
| 126 | | |
| 127 | | A third possibility consists in avoid filling the arrays: |
| 128 | | {{{ |
| 129 | | #!python |
| 130 | | d1 = x._data # d1 = array([1., 2., 3., 4.]) |
| 131 | | d2 = y._data # array([-1., 0., 1., 2.]) |
| 132 | | dm = ma.divide.domain(d1,d2) # array([False, True, False, False]) |
| 133 | | m = ma.mask_or(ma.getmask(x), ma.getmask(y)) # m = array([True,False,False,True]) |
| 134 | | result = (d1/d2).view(MaskedArray) # masked_array([-1. inf, 3., 2.]) |
| 135 | | result._mask = logical_or(m, dm) |
| 136 | | }}} |
| 137 | | Note that here again the division by zero takes place. |
| 138 | | |
| 139 | | A quick benchmark gives the following results: |
| 140 | | * {{{numpy.ma.divide}}} : 2.69 ms per loop |
| 141 | | * classical division : 2.21 ms per loop |
| 142 | | * division w/ prefilling : 2.34 ms per loop |
| 143 | | * division w/o filling : 1.55 ms per loop |
| 144 | | |
| 145 | | So, is it worth filling the arrays beforehand ? Yes, if we are interested in avoiding floating-point exceptions that may fill the result with infs and nans. No, if we are only interested into speed... |
| 146 | | |
| 147 | | |
| 148 | | |
| 149 | | |
| 150 | | == Thanks == |
| 151 | | I'd like to thank Paul Dubois, Travis Oliphant and Sasha for the original masked array package: without you, I would never have started that (it might be argued that I shouldn't have anyway, but that's another story...). |
| 152 | | I also wish to extend these thanks to Reggie Dugard and Eric Firing for their suggestions and numerous improvements. |
| 153 | | |
| 154 | | |
| 155 | | === Revision notes === |
| 156 | | * 08/25/2007 : Creation of this page |
| 157 | | * 01/23/2007 : The package has been moved to the SciPy sandbox, and is regularly updated: please check out your SVN version! |
| 158 | | |
| 159 | | |
| 160 | | |
| | 1 | [[ReST(/trunk/numpy/ma/README.txt)]] |