[SciPy-dev] Sparse data structures.
Viral Shah
vshah@interactivesupercomputing....
Mon Apr 7 14:15:19 CDT 2008
Sparse folks,
I had a couple of other general questions/comments about Python's
sparse matrix infrastructure. Coming from the Matlab land, I am used
to the system providing me just one data storage for sparse matrices.
Python provides 5. The rationale for Matlab's sparse matrix design is
described in great detail in:
http://www.hpl.hp.com/personal/Robert_Schreiber/papers/Sparse%20Matrices%20in%20Matlab/simax.pdf
For linear algebra operations such as direct solvers and iterative
methods, it doesn't really matter what format you start out in. The
cost of conversion is small compared to the cost of the operation.
Most of these are provided in libraries, and hence not a huge source
of worry. The real questions are array operations that are performed
in code that a Python programmer may write. These include operations
such as assembly, indexing, assignment, and concatenation. This is
where the choice of data structure becomes important.
Offering 5 choices to the naive programmer who doesn't understand the
details about sparse matrix implementations may make Scipy look
daunting. I am wondering whats the best way to address this ? I
summarize my understanding of the different formats below. I'd love to
have a conversation with someone on how to make this useful in a
coherent way (one default data structure, for example), or at the very
least, update the documentation with some information to guide someone.
For row/col indexing, CSR and CSC formats can be decent - especially
if you are extracting columns from CSC and rows from CSR. Random
access is decent, but not the greatest. These formats are also good
for matrix-vector multiplication, the workhorse of many iterative
solvers - but one could use a library like OSKI for that. CSR/CSC are
terrible for assembly and assignment. They can be decent for
concatenation along the preferred dimension.
A dictionary representation is great for random access, assembly, and
concatenation, but not very efficient for submatrix indexing and
assignment. Co-ordinate representation can be good for operations,
given that the right rule for combining duplicates is used. The good
thing about both these representations is that they can be extended to
N-d sparse arrays.
The list of lists approach lies somewhere in the middle. I am assuming
these lists are kept sorted. The main drawback is the excessive memory
usage to store all the pointers. I haven't yet looked at the code, and
I could be wrong.
Thanks,
-viral
More information about the Scipy-dev
mailing list