----
= An alternative implementation of MaskedArray =

As a regular user of MaskedArray, I became increasingly frustrated with the subclassing of masked arrays (even if I can only blame my inexperience). I needed to develop a class of arrays that could store some additional information along with numerical values, while keeping the possibility of missing data (picture storing a series of dates along with measurements). I started to implement such a class, but then quickly realized that any additional information disappeared when processing these subarrays (for example, adding a constant value to a subarray would erase its dates). I ended up writing the equivalent of {{{numpy.core.ma}}} for my particular class, ufuncs included. Everything went fine until I needed to subclass my new class, when more problems showed up: some attributes of the new subclass were lost during processing. I identified the culprit as {{{MaskedArray}}}, which returns masked ndarrays when I expected masked arrays of my class. I was preparing myself to rewrite {{{numpy.core.ma}}} when I forced myself to learn how to subclass ndarrays. As I became more familiar with the {{{__new__}}} and {{{__array_finalize__}}} methods, I started to wonder why masked arrays were objects, and not ndarrays, and whether it wouldn't be more convenient for subclassing if they behaved like regular ndarrays.
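
As a rough illustration of the {{{__new__}}} / {{{__array_finalize__}}} mechanism mentioned above, here is a minimal sketch of an ndarray subclass that keeps an extra attribute through arithmetic. The class name {{{DatedArray}}} and its {{{dates}}} attribute are hypothetical and are not taken from the attached package; the sketch ignores masking entirely.

```python
import numpy as np


class DatedArray(np.ndarray):
    """Hypothetical ndarray subclass carrying a `dates` attribute."""

    def __new__(cls, values, dates=None):
        # View the input data as an instance of this subclass.
        obj = np.asarray(values).view(cls)
        obj.dates = dates
        return obj

    def __array_finalize__(self, obj):
        # Called whenever a new instance is created (explicit construction,
        # view casting, or from a template such as a ufunc result).
        if obj is None:
            return
        # Copy the extra attribute from the array we were derived from.
        self.dates = getattr(obj, "dates", None)


series = DatedArray([1.0, 2.0, 3.0],
                    dates=["2007-01-01", "2007-01-02", "2007-01-03"])

# Adding a constant keeps both the subclass and the dates.
shifted = series + 10
```

Because the subclass is itself an ndarray, the result of {{{series + 10}}} goes back through {{{__array_finalize__}}}, which is what lets the extra attribute survive.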

The attachment is what I eventually came up with. The main differences with the initial {{{numpy.core.ma}}} package are that {{{MaskedArray}}} is now a subclass of {{{ndarray}}} and that the {{{_data}}} section can now be any subclass of {{{ndarray}}} (well, it should work in most cases; some tweaking might be required here and there). Apart from a couple of issues listed below, the behavior of the new {{{MaskedArray}}} class reproduces the old one. It is quite likely to be significantly slower, though: I was more interested in a clear organization than in performance, so I tended to use wrappers liberally. I'm sure we can improve that rather easily. Note that I didn't try to time any methods.
I also attach a unittest suite (here), modeled after the standard numpy one, along with some utilities for testing (here). The old {{{test_ma}}} can also be run with the new package, but it does fail in some places; see below.

=== Main differences ===
* {{{fill_value}}} is now a property, not a function.
* in the majority of cases, the mask is forced to {{{nomask}}} when no value is actually masked. A notable exception is when a masked array (with no masked values) has just been unpickled.
* I got rid of the {{{share_mask}}} flag, as I never understood its purpose.
* {{{put}}}, {{{putmask}}} and {{{take}}} now mimic the ndarray methods, to avoid unpleasant surprises. Moreover, {{{put}}} and {{{putmask}}} both update the mask when needed.
* if {{{a}}} is a masked array, {{{bool(a)}}} raises a {{{ValueError}}}, as it does with ndarrays.
* in the same way, the comparison of two masked arrays is a masked array, not a boolean.
* {{{filled(a)}}} returns an array of the same subclass as {{{a._data}}}, and no test is performed on whether it is contiguous or not.
* the mask is always printed, even if it's {{{nomask}}}, which makes it easier (for me at least) to remember that a masked array is being used.
* {{{cumsum}}} works as if the {{{_data}}} array were filled with 0. The mask is preserved, but not updated.
* {{{cumprod}}} works as if the {{{_data}}} array were filled with 1. The mask is preserved, but not updated.
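
Several of the points above can be checked with a short sketch. It uses {{{numpy.ma}}} from a current NumPy as a stand-in, on the assumption that it follows the same semantics as the attached package; the variable names are only illustrative.

```python
import numpy.ma as ma  # assumption: stand-in for the attached package

a = ma.array([1, 2, 3, 4, 5], mask=[0, 0, 1, 0, 0])

# fill_value is read as an attribute, not called as a function.
default_fill = a.fill_value

# bool() on a multi-element masked array raises ValueError, as for ndarrays.
try:
    bool(a)
    raised = False
except ValueError:
    raised = True

# Comparing two masked arrays gives a masked array, not a single boolean.
eq = (a == a)

# Masked entries count as 0 for cumsum and as 1 for cumprod;
# the result stays masked at the same positions.
running_sum = a.cumsum()
running_prod = a.cumprod()
```

Filling the masked slot of the results with a sentinel such as {{{running_sum.filled(-1)}}} makes it easy to see that the values after the masked position are computed as if that entry had been skipped.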

=== New features ===
* the {{{mr_}}} function mimics {{{r_}}} for masked arrays.
* the {{{anom}}} method returns the anomalies (deviations from the average).
* the {{{stdu}}} and {{{varu}}} methods return unbiased estimates of the standard deviation and variance, respectively.
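
The following sketch illustrates these features, again using {{{numpy.ma}}} from a current NumPy as an assumed stand-in for the attached package. Since {{{stdu}}} and {{{varu}}} may not be available under those names outside the attachment, the unbiased estimates are computed here with {{{ddof=1}}}, which is a substitution rather than the package's own API.

```python
import numpy.ma as ma  # assumption: stand-in for the attached package

a = ma.array([1.0, 2.0, 3.0, 4.0, 5.0], mask=[0, 0, 1, 0, 0])

# mr_ concatenates like r_, but keeps the masks of its arguments.
joined = ma.mr_[a, 6.0, 7.0]

# anom() subtracts the mean of the unmasked values (here 3.0).
deviations = a.anom()

# Unbiased estimates: divide by (number of unmasked values - 1).
unbiased_var = a.var(ddof=1)
unbiased_std = a.std(ddof=1)
```

With four unmasked values (1, 2, 4, 5), the squared deviations sum to 10, so the unbiased variance is 10/3.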

=== Using the new package with numpy.core.ma ===
I tried to make sure that the new package can understand old masked arrays. Unfortunately, the reverse is not true: {{{numpy.core.ma}}} does not recognize the new masked arrays.
For example:
{{{
>>> import numpy.core.ma as old_ma
>>> import maskedarray as new_ma
>>> x = old_ma.array([1,2,3,4,5], mask=[0,0,1,0,0])
>>> x
array(data =
[ 1 2 999999 4 5],
mask =
[False False True False False],
fill_value=999999)
>>> y = new_ma.array([1,2,3,4,5], mask=[0,0,1,0,0])
>>> y
array(data = [1 2 -- 4 5],
mask = [False False True False False],
fill_value=999999)
>>> x==y
array(data =
[True True True True True],
mask =
[False False True False False],
fill_value=?)
>>> old_ma.getmask(x) == new_ma.getmask(x)
array([True, True, True, True, True], dtype=bool)
>>> old_ma.getmask(y) == new_ma.getmask(y)
array([True, True, False, True, True], dtype=bool)
>>> old_ma.getmask(y)
False
}}}
A basic consequence is that {{{matplotlib}}} will not recognize new masked arrays as such. The file {{{matplotlib/numerix/ma/__init__.py}}} must be modified to call the new package instead of {{{numpy.core.ma}}}.

Please note that it's still a work in progress (even if it seems to work quite OK when I use it). Suggestions, comments, improvements and general feedback are more than welcome!