[SciPy-dev] Binary i/o package
Erin Sheldon
erin.sheldon@gmail....
Sun Jun 3 15:42:04 CDT 2007
On 6/3/07, Anne Archibald <peridot.faceted@gmail.com> wrote:
> On 01/06/07, Erin Sheldon <erin.sheldon@gmail.com> wrote:
> > The overwhelming silence tells me that either no one here thinks this
> > is relevant or no one bothered reading the email. I feel like the
> > functionality I have written into this package is so basic it belongs
> > in scipy io if not in numpy itself. Please give me some feedback one
> > way or another.
> >
> > If it just seems irrelevant then I may just look into making it a
> > scikits package.
>
> I'm not trying to knock your work, but it's not clear to me that
> there's enough room between readarray/writearray/tofile/fromfile and
> pytables to accommodate another package. Maybe I don't see what your
> package does, but why wouldn't I just install pytables instead? What
> are its advantages and disadvantages compared to pytables?
Anne -
fromfile works on the whole file or nothing (or contiguous chunks of
rows). read_array can read selected fields and rows from ASCII files. It
is pure Python, which means it is rather slow, but that's OK because
ASCII files are rarely large. PyTables or a database like postgres are
at a different level, but they are built on complex libraries and have
complex interfaces.
The need for random access into a binary file with fixed-length records
is basic to most data storage and retrieval. For example, most
standardized file formats are self-describing binary tables which
require no previous knowledge of the data other than the format (e.g.
FITS in astronomy). But in scripting languages one is usually limited
to a read-all-or-nothing approach because all you have is the
equivalent of fromfile. I included a working example of such a
self-describing format in the simple_format sub-module of readfields.
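To make the idea concrete, here is a minimal sketch of random access into a self-describing file of fixed-length records, using only the Python standard library. It is not the readfields or simple_format code: the header layout, the "TBL0" magic string, the record fields, and the function names are all assumptions made up for this illustration.

```python
import os
import struct
import tempfile

# Hypothetical fixed-size header: 4-byte magic string, then the row count (uint32).
HEADER = struct.Struct("<4sI")
# Hypothetical record layout: int32 id, float64 x, float64 y (little-endian, unpadded).
RECORD = struct.Struct("<idd")


def write_table(path, rows):
    """Write the header followed by one fixed-length record per row."""
    with open(path, "wb") as f:
        f.write(HEADER.pack(b"TBL0", len(rows)))
        for row in rows:
            f.write(RECORD.pack(*row))


def read_rows(path, indices):
    """Read only the requested rows by seeking straight to each record."""
    out = []
    with open(path, "rb") as f:
        magic, nrows = HEADER.unpack(f.read(HEADER.size))
        if magic != b"TBL0":
            raise ValueError("not a TBL0 file")
        for i in indices:
            if not 0 <= i < nrows:
                raise IndexError(i)
            # Fixed-length records make the byte offset a simple multiplication.
            f.seek(HEADER.size + i * RECORD.size)
            out.append(RECORD.unpack(f.read(RECORD.size)))
    return out


path = os.path.join(tempfile.mkdtemp(), "table.bin")
write_table(path, [(i, i * 0.5, i * 2.0) for i in range(1000)])
picked = read_rows(path, [3, 998])  # touches two records, not the whole file
```

Because every record has the same size, the reader never has to scan the file; the cost of fetching a row is one seek and one small read, regardless of how large the file is.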
Another example is a simple relational database, which is a group of
tables, with each table in a flat file or spread across flat files
(again, no variable-length fields). For efficiency, one needs low-level
random access to the files.
This package fills that niche and can serve as the backbone of such
systems. It is also a small chunk of code. You can extract what you want
from the file and store it directly into a numpy array in the most
efficient manner possible.
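As a rough illustration of pulling selected rows and fields straight into a numpy array, the sketch below uses plain NumPy (np.memmap with a structured dtype) as a stand-in. It does not show this package's interface; the field names "id", "ra", and "dec", the file name, and the row selection are assumptions for the example only.

```python
import os
import tempfile

import numpy as np

# Hypothetical fixed-length record layout with three fields.
dtype = np.dtype([("id", "<i4"), ("ra", "<f8"), ("dec", "<f8")])

# Write a small headerless test file of 1000 records.
data = np.zeros(1000, dtype=dtype)
data["id"] = np.arange(1000)
data["ra"] = np.linspace(0.0, 360.0, 1000, endpoint=False)
data["dec"] = -1.0
path = os.path.join(tempfile.mkdtemp(), "catalog.bin")
data.tofile(path)

# Map the file instead of reading it; only the pages actually touched are loaded.
mm = np.memmap(path, dtype=dtype, mode="r")

# Copy just two fields for three rows into an ordinary in-memory array.
rows = [10, 500, 999]
subset = np.empty(len(rows), dtype=[("id", "<i4"), ("ra", "<f8")])
subset["id"] = mm["id"][rows]
subset["ra"] = mm["ra"][rows]

del mm  # release the mapping
```

The result is a small structured array holding only the requested fields and rows, while the rest of the file stays on disk.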
Speaking for myself, with the larger astronomical data sets that have
come online, it has become useful to write big files in a standardized
format and treat them as a simple database. One does not have to
install and administer a database system like postgres or pytables
(HDF5), and one does not have to learn a new system beyond numpy. But
one gets most of the performance benefits of low-level random access to
the data.
Erin