{{{ #!rst ============================================================================= A proposal to handle nan in core numpy functions relying on proper ordering ============================================================================= :Author: Numpy developers :Contact: numpy-discussion@scipy.org :Date: 2008-09-24 Executive summary ================= NaN (Not a Number) are an important feature of IEEE-754 floating point standard. NaN have the special property to lack any ordering (any order operation involving a NaN is defined as false by the standard). As such, without any additional conventions, the behavior of any numpy functionality relying on ordering like max or sort is ambiguous. The goal of this proposal is to suggest several approaches for dealing with NaN in core numpy operations in a consistant and expected manner. The following function in numpy namespaces (as well as their C counterpart in the array API) are considered for changes: - max and min, amax and amin, the maximum and minimum ufuncs - argmax and argmin - sort and argsort A consistent approach is not easy to define, because NaN are used for at least two distinct purposes: as a missing value, or as a invalid operation. The 'right' behavior in one case may cause trouble for the other case; people using NaN as a missing value will want to ignore NaN most of the time. People using it as an invalid value want to be able to propagate the error to detect an error at the end of the operation. The Problem =========== Currently, there is no convention on basic functions relying on ordering, and their behavior is undefined. For example:: >>>import numpy as np >>>a = np.array([0, np.nan, -1]) >>>np.max(a) -1 >>>a = np.array([0, np.nan, 1]) >>>np.min(a) 1 This has caused problems to numerous people [1, 2]. In particular, when NaN are caused by some invalid floating point operation (0/0, sqrt(-1.)), this means NaN are silenced *and* give a not meaningful answer. How other softwares deal with NaN ? =================================== matlab: - max and min: NaN are explicitely ignored - sort: NaN are put at the end (not specified in the doc); the sort itself ignore NaN - argmax, argmin: do not exist, index are an optional output of max, min, and are such NaN are ignored in argmin/argmax as well. R: - max and min: return NaN by default, but ignoring NaN is possible by using an argument to the functions. - sort: NaN are ignored and removed. Can be put at the end/beginning as well. Behavior with matrices if they are removed ? - argmax/argmin: which.max/which.min in R. Missing values are ignored (note that this is not consistent with max/min behavior). AFAICS, there is no other choice possible. Alternatives to using NaN ========================= NaNs that appear during floating-point calculations can be avoided by using seterr(invalid='raise'), which has the effect of raising an exception whenever one occurs. However, they were put into the IEEE standard precisely because this is often not convenient behaviour. Using NaNs as invalid values is standard idiom in MATLAB and R, but numpy has an additional tool that neither of those packages has: masked arrays. Masked arrays allow users to explicitly flag invalid values, and all functions that act on masked arrays have sensible handling of invalid values. Masked arrays also allow users to mark invalid elements in arrays with types (for example integers) that do not have a NaN. Cost of handling NaN ==================== It is outside the scope of this document to deal with NaN in a general manner in numpy. Only the core function mentioned will be dealt with. Several people, in particular Travis O. [1], have some concern with respect to the speed impact of dealing with NaN for arrays without NaN values at all. One systematic cost at the C level is to detect whether the number is a NaN. The C99 standard define isnan as a macro; most runtimes define this macro, or a similar function (named _isnan on MS). Arguably, the function should be optimized, but it was found that at least on glibc, the C99 macro is twice slower than a trivial isnan defined by using the x != x test (glibc isnan avoids branching, but this looks more costly than the simple test, at least on tested CPU: a Pentium 4 and a Core 2 Duo). Once the tested value is in cache, the function itself can be quite fast: I (cdavid) measured a speed of 8 cycles/test on Core 2 duo on dataset (without NaN) which can be kept in cache. Since the value has to be taken from memory anyway, I think this cost of 8 cycles/test is the relevant cost. This means that even for arrays which are aligned, and C continuous, the cost should not be significant when no Nan is in the arrays. There can also be a substantial cost to calculating with NaNs: even operations like addition will, on some common CPUs, have to fall back to software floating point. This means, in particular, that calculating with arrays containing many NaNs can be quite slow. Solutions ========= Several solutions are possible. Each of them, as well as their consequences are discussed below. Raising an error ---------------- Since by definition, NaN are not ordered, we can consider that it is an invalid input, raise an error whenever a NaN is encountered as an input. Implementation ~~~~~~~~~~~~~~ NaN input would be checked for relevant dtypes in relevant PyArray functions. Another solution would be to deal with NaN comparison in the core C loops, and signal an error if a NaN is detected. API breakage ~~~~~~~~~~~~ This would break any python call to one of the function with a NaN inside. It may not be significant, since the use of the function with a NaN was undefined. Cost ~~~~ To be investigated. Returning NaN ------------- Since NaN are often used for invalid operations, returning NaN would be consistent with the "propagate NaN" approach. Although this approach is easy to implement for min, max and sort, an expected behavior for arg versions is not obvious. They could fail as in the "raise an error" approach. Implementation ~~~~~~~~~~~~~~ NaN input would be dealt with in the core ufunc loops: if nan is detect, a NaN is put in the output. This works for max, min and sort. For the arg versions, an error code would be returned in the core ufunc loops (in arraytaypes.inc.src), and the error dealt with in the array API methods. API breakage ~~~~~~~~~~~~ This would break any python call to one of the function with a NaN inside. It may not be significant, since the use of the function with a NaN was undefined. Cost ~~~~ Dealing with NaN in the core ufunc loops for max/min was found to be almost as fast as not dealing with them on recent CPU, as long as the isnan function itself is efficient. It should be the same for sort and argmin/argmax/argsort, although those cases were not tested. Supporting general invalid data operations ------------------------------------------ Implementation ~~~~~~~~~~~~~~ The masked array implementation of all the operations in question could be duplicated. This would involve both providing additional keyword arguments to control handling of invalid values (for example, should sorts leave NaNs in place, sort them to the end, or sort them to the beginning) and changing the defaults to one of the above schemes. API Breakage ~~~~~~~~~~~~ Redefining the default behaviour breaks the API as above. Adding keyword arguments should not produce additional API breakage. Cost ~~~~ In addition to the cost required by the two preceding approaches, this requires numpy developers to keep numpy's NaN handling consistent with its handling of masked arrays. Since masked arrays are a less stable part of numpy, maintaining this code duplication may require substantial effort. References ========== * [1] http://scipy.org/scipy/numpy/ticket/241 * [2] http://projects.scipy.org/pipermail/numpy-discussion/2008-September/037413.html }}}