The Numpy and Scipy GSoC projects are run under the umbrella of the Python Software Foundation (PSF). The PSF info for GSoC 2013 can be found at http://wiki.python.org/moin/SummerOfCode/2013. In particular, look at what is expected of students: http://wiki.python.org/moin/SummerOfCode/Expectations. Note that it's important to start discussing your project idea with the community and your potential mentor early; don't wait till the last week! Also, don't wait until close to the application deadline to submit your first pull request to Numpy/Scipy on Github; getting used to the workflow and reworking your patch in response to review comments may take more time than you expect. Having your first pull request merged before the GSoC application deadline (May 3) is required for your application to be accepted.

If you're new to NumPy and SciPy, how to contribute and what is expected of code contributions are described in https://github.com/scipy/scipy/blob/master/HACKING.rst.txt

Summer of Code 2013 Ideas for Numpy & Scipy

Full support for building Scipy with Bento

Scipy can currently be built in two ways: with numpy.distutils and with Bento. Building Scipy with Bento is the way of the future: it's much faster and has better support for Scipy's complex build requirements. However, Bento support is relatively new and not yet complete. Possible goals of this project are:

  1. Robust builds on at least Windows, Linux and OS X.
  2. Support straightforward building against multiple BLAS/LAPACK implementations (Netlib BLAS/LAPACK, ATLAS, Intel MKL, OpenBLAS).
  3. Unify templating tools in build scripts. This could mean adding a small templating library to Bento itself.
  4. Set up a continuous integration server that tests Bento builds for Python 2.x & 3.x, on various platforms.
  5. Improved reporting for user build configurations. This would help a lot in diagnosing build-related problems that users encounter.

Performance parity between numpy arrays and Python scalars

Small numpy arrays are very similar to Python scalars -- but numpy incurs a fair amount of extra overhead for simple operations. For large arrays this doesn't matter, but for code that manipulates a lot of small pieces of data, it can be a serious bottleneck. For example:

In [1]: import numpy as np

In [2]: x = 1.0

In [3]: numpy_x = np.asarray(x)

In [4]: timeit x + x
10000000 loops, best of 3: 61 ns per loop

In [5]: timeit numpy_x + numpy_x
1000000 loops, best of 3: 1.66 us per loop

This project would involve profiling simple operations like the above, determining where the bottlenecks are, and devising improved algorithms to remove them, with the goal of getting the numpy time as close as possible to the Python time. Not only would this make all numpy-using code faster, but it would also pave the way for future simplifications in numpy's core, which currently has a lot of duplicate code that attempts to work around these slow paths instead of fixing them properly.

Some possible concrete changes:

  1. numpy's "ufunc loop lookup code" (which is used to determine, e.g., whether to use the integer or floating-point versions of "+") is slow and inefficient.
  2. Checking for floating point errors is very slow; we can and should do it less often.
  3. When allocating the return value, the "+" for Python floats calls malloc() only once; numpy calls it twice (once for the array object itself, and a second time for the array data). Stashing both objects within a single allocation would be more efficient.
  4. ...see what profiling says! We know 61 ns is possible.
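As a starting point for the profiling, the comparison above can be reproduced outside IPython with the standard timeit module. This is only a minimal sketch: the helper name time_add and the loop counts are made up for illustration, and the absolute numbers will differ by machine and numpy version.

```python
import timeit

import numpy as np


def time_add(value, number=200_000, repeat=5):
    """Return the best per-call time in nanoseconds for `value + value`.

    Hypothetical helper, not part of numpy.
    """
    timer = timeit.Timer("v + v", globals={"v": value})
    best = min(timer.repeat(repeat=repeat, number=number))
    return best / number * 1e9


x = 1.0                  # plain Python float
numpy_x = np.asarray(x)  # zero-dimensional numpy array holding the same value

scalar_ns = time_add(x)
array_ns = time_add(numpy_x)
print(f"Python float: {scalar_ns:.0f} ns per add")
print(f"0-d array:    {array_ns:.0f} ns per add")
print(f"ratio:        {array_ns / scalar_ns:.1f}x")
```

A timing like this only shows that a gap exists; finding where the time goes inside numpy would still need a C-level profiler.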

Pythonic dtypes

A numpy "dtype" is an object that knows how to work with different sorts of values, represented as fixed-length packed binary values. For example, the int32 dtype knows how to convert the Python object '-1' to the four-byte buffer 0xff 0xff 0xff 0xff.
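The int32 example can be checked directly from Python. A small sketch using only ndarray.tobytes and np.frombuffer; the variable names are arbitrary:

```python
import numpy as np

# Store the Python integer -1 using the int32 dtype.
value = np.array(-1, dtype=np.int32)

# The packed representation is four bytes, all 0xff (two's complement),
# so the result is the same on little- and big-endian machines.
raw = value.tobytes()
print(len(raw), raw.hex())

# The same dtype also converts the buffer back into a number.
restored = np.frombuffer(raw, dtype=np.int32)[0]
print(restored)
```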

Conceptually, dtype objects are arranged into a nice type hierarchy: http://docs.scipy.org/doc/numpy/_images/dtype-hierarchy.png

But implementation-wise, dtypes don't use the Python class system at all. There's just a single Python class (numpy.dtype), and all dtypes are instances of it. (This is because when numpy was first designed, its designers expected there to be only about 20 dtype objects in total.) This turns out to cause a number of problems -- you can't define new dtypes from Python, only from C; you can't use isinstance to compare dtypes (you have to use a hacky numpy-specific API instead); different dtypes can't easily contain state (instead, the single dtype class has gradually sprouted new fields as new dtypes turned out to need them); etc. Basically we've been reinventing the Python class system, poorly.
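The isinstance limitation is easy to see from the interpreter. The sketch below assumes that the numpy-specific API meant here is np.issubdtype, which is an interpretation rather than something the text states. Later numpy releases may organize dtype classes differently internally, but the calls shown should behave the same.

```python
import numpy as np

int_dtype = np.dtype("int32")
float_dtype = np.dtype("float64")

# Every dtype is an instance of numpy.dtype, so isinstance against that
# class cannot distinguish an integer dtype from a floating-point one.
print(isinstance(int_dtype, np.dtype), isinstance(float_dtype, np.dtype))

# The conceptual hierarchy is queried through a numpy-specific function.
print(np.issubdtype(int_dtype, np.integer))
print(np.issubdtype(float_dtype, np.integer))
print(np.issubdtype(float_dtype, np.floating))
```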

The goal for this project is to turn dtypes into regular Python classes with a proper type hierarchy, using the standard Python mechanisms.

Longer term goals (at least the first of which is probably achievable within the GSoC timeline):

  1. Allow for defining new dtypes using pure Python.
  2. There are a bunch of special cases in the ufunc code for handling strings and record arrays; we should make the appropriate extensions to the dtype API so that they can become regular dtypes.
  3. A proper categorical data dtype. (This is trivial once the above is done.)
  4. NA dtypes

Consistent empty array handling in Numpy and Scipy

Empty arrays are obtained in various ways: explicit construction by a user, indexing with a boolean array containing only False values, loading data from a file that's empty or is missing data, etc. If it's possible to pass these empty arrays to numpy/scipy functions and have them propagate consistently, then users don't have to check for empty arrays all over their own code. Currently functions don't handle empty arrays consistently: some return empty arrays, some raise warnings or errors, some just crash. Possible goals of this project are:

  1. Formulate a consistent description of how empty array input to functions and methods should work.
  2. Write a set of functions that can be used throughout numpy/scipy for empty array handling. These should be callable from C and Python.
  3. Provide a way to test (sets of) functions for empty array handling without adding separate test cases for each function (so a test generator or test class that can be used as a mixin for example).
  4. Use (2) and (3) to improve the way numpy/scipy functions handle empty arrays now.
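To make the inconsistency concrete, here is a rough sketch that builds an empty array with an all-False boolean mask and records how a few numpy functions react to it. The helper classify_empty_handling is hypothetical, not an existing numpy function. The exact outcome of each call may depend on the numpy version, although on recent releases the four calls are expected to fall into three different categories.

```python
import warnings

import numpy as np

data = np.array([1.0, 2.0, 3.0])
mask = np.zeros(data.shape, dtype=bool)  # boolean mask containing only False
empty = data[mask]                       # empty array, shape (0,)


def classify_empty_handling(func, arr):
    """Hypothetical helper: describe how `func` reacts to the input `arr`."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # record every warning, even repeats
        try:
            func(arr)
        except Exception as exc:
            return "error: " + type(exc).__name__
    if caught:
        return "warning: " + caught[0].category.__name__
    return "ok"


results = {
    name: classify_empty_handling(getattr(np, name), empty)
    for name in ("sum", "sort", "mean", "max")
}
for name, outcome in results.items():
    print(f"np.{name:<5} {outcome}")
```

A loop like this over many functions is one possible basis for the test generator mentioned in goal 3, since it avoids writing a separate test case per function.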

Improvements to the sparse package of Scipy: support for bool dtype and better interaction with Numpy

Scipy ships with a package dedicated to the creation of sparse matrices in a variety of formats and their manipulation (including linear algebra). Sparse matrices are ubiquitous in many fields of science and engineering, most notably for solving partial differential equations (e.g. using a finite element method). This sparse package is also used extensively by other projects such as scikit-learn (machine learning in Python). Yet many things remain to be done in this package. The purpose of this project is twofold:

1. Better support for bool dtype. The goals of this action would be to:

  • Formulate a specification of Boolean data type handling in spmatrix objects and operations with other matrix/ndarray objects from Numpy.
  • Produce a test suite according to the specification and implement it.
  • Work on the following tickets:
    • #1533 - toarray() method does not work if dtype==bool;
    • #639 - support for logical operations for any combination of spmatrix, matrix and ndarray objects (see also #991);
    • any other issues reported on the mailing lists.

2. Improve interaction between spmatrix objects and other types, Numpy's types in particular. Operations between a spmatrix and other kinds of objects are often not very consistent and sometimes produce unexpected results. See in particular #1598, which reports bad behavior of sparse matrices in a binary ufunc.
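A test in the bool-dtype suite could start from a simple round trip like the one ticket #1533 is about. This is only a sketch of what such a check might look like; the choice of the CSR format and the 2x2 input are arbitrary, and whether the round trip succeeds depends on the installed Scipy version.

```python
import numpy as np
from scipy import sparse

# A small dense Boolean matrix used as the reference value.
dense = np.array([[True, False],
                  [False, True]])

# Convert to a sparse matrix and back again.
matrix = sparse.csr_matrix(dense)
round_trip = matrix.toarray()

# A specification would presumably require both the values and the dtype
# to survive the round trip.
print(matrix.dtype, round_trip.dtype)
print(np.array_equal(round_trip, dense))
```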

Ideas from previous years that may still be relevant

  • Adding Automatic Differentiation functionality to SciPy. See [SummerofCodeIdeas/AlgorithmicDifferentiation] and #1510 for more details.
  • Improve DataSource and integrate it into all the numpy/scipy I/O.
  • scipy.ndimage: Rewrite in Python where possible, port to Cython elsewhere. Decide on a consistent coordinate framework. As a bonus, fix boundary issues.
    • Leverage patches from CellProfiler developers for this; see e.g. the patches on ticket #945.