Changeset 450

Timestamp: 07/22/07 01:44:44
Author: cdavid
Message: Split pendigits into training and testing datasets, and convert return value of load to the package conventions.
Location: trunk/learn/scikits/learn/datasets/pendigits
Files: 14 added, 1 removed, 3 modified

  • trunk/learn/scikits/learn/datasets/pendigits/__init__.py

    r181 r450  
    @@ -1,4 +1,4 @@
     #! /usr/bin/env python
    -# Last Change: Thu Jun 21 06:00 PM 2007 J
    +# Last Change: Sun Jul 22 03:00 PM 2007 J
     import data as _pendigit
     __doc__     = _pendigit.DESCRSHORT
    @@ -6,3 +6,3 @@
     source      = _pendigit.SOURCE
     
    -load        = _pendigit.load
    +import testing, training
  • trunk/learn/scikits/learn/datasets/pendigits/data.py

    r181 r450  
    @@ -1,5 +1,5 @@
     #! /usr/bin/env python
     # -*- coding: utf-8 -*-
    -# Last Change: Fri Jun 22 05:00 PM 2007 J
    +# Last Change: Sun Jul 22 03:00 PM 2007 J
     
     # The code and descriptive text is copyrighted and offered under the terms of
     
    @@ -102,8 +102,23 @@
     (tablet input box resolution).
     
    -pendigits-orig contain original, unnormalized data. pendigits is the
    -normalized and resampled version where all inputs are of the same
    -length. Here because of speed or the digit, feature vectors may be of
    -different lengths, e.g., '1' is shorter than '8'.
    +In order to train and test our classifiers, we need to represent
    +digits as constant length feature vectors. A commonly used technique
    +leading to good results is resampling the ( x_t, y_t) points.
    +Temporal resampling (points regularly spaced in time) or spatial
    +resampling (points regularly spaced in arc length) can be used here.
    +Raw point data are already regularly spaced in time but the distance
    +between them is variable. Previous research showed that spatial
    +resampling to obtain a constant number of regularly spaced points
    +on the trajectory yields much better performance, because it provides
    +a better alignment between points. Our resampling algorithm uses
    +simple linear interpolation between pairs of points. The resampled
    +digits are represented as a sequence of T points ( x_t, y_t )_{t=1}^T,
    +regularly spaced in arc length, as opposed to the input sequence,
    +which is regularly spaced in time.
    +
    +So, the input vector size is 2*T, two times the number of points
    +resampled. We considered spatial resampling to T=8,12,16 points in our
    +experiments and found that T=8 gave the best trade-off between
    +accuracy and complexity.
     """
     
     
    @@ -128,64 +143,14 @@
     --------------------
     
    -Input size depends on writing speed and time and is not fixed +1 class
    -attribute
    +16 (8 (x, y) coordinates)
     
     For Each Attribute:
     -------------------
     
    -The data is in the UNIPEN format. See
    -I. Guyon UNIPEN 1.0 Format Definition,
    -ftp://ftp.cis.upenn.edu/pub/UNIPEN-pub/definition/unipen.def
    -1994
    +All input attributes are integers in the range 0..100.
     
     Missing Attribute Values
     ------------------------
     
    -Class Distribution
    -------------------
    -
    -          classes
    -          0    1    2    3    4    5    6    7    8    9
    -tra     384  390  392  370  391  375  351  375  363  357 Tot 3748
    -cv      209  201  201  163  185  163  191  196  178  186 Tot 1873
    -wdep    187  188  187  186  204  182  178  207  178  176 Tot 1873
    -windep  363  364  364  336  364  335  336  364  336  336 Tot 3498
    +None
     """
    -
    -def load():
    -    """load the actual data and returns them.
    -
    -    :returns:
    -        data: dictionary.
    -            data['training'] and data['testing'] both are dictionaries which
    -            contains the following keys: x, y and class. x and y are the
    -            coordinates of the eight resampled points, and class is an integer
    -            between 0 and 9, indicating the number label.
    -
    -    example
    -    -------
    -
    -    Let's say you want to plot the first sample of the training set with
    -    matplotlib. You would do something like plot(data['training']['x'][0],
    -    data['training']['y'][0], '-')
    -    """
    -    import numpy
    -    from pendigits import training, testing
    -    assert len(training) == 7494
    -    assert len(testing) == 3498
    -
    -    def raw_to_num(dt):
    -        coordinates = numpy.empty((len(dt), 16), numpy.int)
    -        digclass = numpy.empty(len(dt), dtype = numpy.int)
    -        for i in range(len(coordinates)):
    -            coordinates[i] = dt[i][:-1]
    -            digclass[i] = dt[i][-1]
    -        xcor = coordinates[:, ::2]
    -        ycor = coordinates[:, 1::2]
    -        return xcor, ycor, digclass
    -
    -    xcor, ycor, digclass = raw_to_num(training)
    -    training = {'x' : xcor, 'y' : ycor, 'class' : digclass}
    -    xcor, ycor, digclass = raw_to_num(testing)
    -    testing = {'x' : xcor, 'y' : ycor, 'class' : digclass}
    -    return {'testing' : testing, 'training' : training}
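
Note on the new docstring in data.py: the spatial resampling it describes (T points regularly spaced in arc length, obtained by linear interpolation between pairs of points) can be sketched roughly as follows. This is an illustration only, not the code that produced the dataset. The function names resample_arc_length and to_feature_vector are hypothetical, the input is assumed to be a non-empty sequence of pen coordinates, and the scaling to integers in the range 0..100 mentioned in the docstring is left out.

```python
import numpy as np


def resample_arc_length(x, y, n_points=8):
    """Resample a pen trajectory to n_points regularly spaced in arc length.

    Hypothetical helper: assumes x and y are non-empty and of equal length.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # Length of each segment, then cumulative arc length starting at 0.
    seg = np.hypot(np.diff(x), np.diff(y))
    s = np.concatenate(([0.0], np.cumsum(seg)))

    # Drop repeated points so that the arc length is strictly increasing.
    keep = np.concatenate(([True], seg > 0))
    s, x, y = s[keep], x[keep], y[keep]

    if s[-1] == 0.0:
        # Degenerate trajectory (a single distinct point): repeat it.
        return np.full(n_points, x[0]), np.full(n_points, y[0])

    # Target positions regularly spaced along the trajectory, then
    # linear interpolation between pairs of original points.
    targets = np.linspace(0.0, s[-1], n_points)
    return np.interp(targets, s, x), np.interp(targets, s, y)


def to_feature_vector(xs, ys):
    """Interleave the points as [x_1, y_1, ..., x_T, y_T] (size 2*T)."""
    return np.column_stack((xs, ys)).ravel()


# An L-shaped stroke sampled unevenly: total arc length is 8.
xs, ys = resample_arc_length([0, 1, 4, 4], [0, 0, 0, 4], n_points=5)
features = to_feature_vector(xs, ys)
```

With T=8 this gives the 16-value input vector the docstring mentions. The interleaved layout matches how the removed raw_to_num helper split each row (even columns for x, odd columns for y).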
  • trunk/learn/scikits/learn/datasets/pendigits/src/convert.py

    r211 r450  
    @@ -1,4 +1,4 @@
     #! /usr/bin/env python
    -# Last Change: Tue Jul 17 05:00 PM 2007 J
    +# Last Change: Sun Jul 22 01:00 PM 2007 J
     
     # This script generates a python file from the txt data
     
    @@ -17,7 +17,9 @@
     
     # Write the data in pendigits.py
    -a = open("../pendigits.py", "w")
    +ftra = open("../pendigits_tra.py", "w")
    +ftes = open("../pendigits_tes.py", "w")
     
    -a.writelines(dumpvar(tra, 'training'))
    -a.writelines(dumpvar(tes, 'testing'))
    -a.close()
    +ftra.writelines(dumpvar(tra, 'training'))
    +ftra.close()
    +ftes.writelines(dumpvar(tes, 'testing'))
    +ftes.close()
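
Note on convert.py: the change writes each split to its own module. A possible way to express the same step with context managers, so that each file is closed even if writing fails, is sketched below. The body of dumpvar is not part of this changeset, so dump_variable here is a hypothetical stand-in that only assumes the same contract (it returns a list of source lines defining the named variable). write_split and the out_dir argument are likewise illustrative.

```python
import os
import tempfile


def dump_variable(rows, name):
    """Hypothetical stand-in for dumpvar: source lines that define `name`."""
    lines = ["%s = [\n" % name]
    lines.extend("    %r,\n" % (tuple(row),) for row in rows)
    lines.append("]\n")
    return lines


def write_split(tra, tes, out_dir):
    """Write the training and testing rows to two separate modules."""
    targets = (
        ("pendigits_tra.py", "training", tra),
        ("pendigits_tes.py", "testing", tes),
    )
    paths = []
    for filename, varname, rows in targets:
        path = os.path.join(out_dir, filename)
        # The with-block closes the file even if writelines raises.
        with open(path, "w") as handle:
            handle.writelines(dump_variable(rows, varname))
        paths.append(path)
    return paths


# Tiny made-up rows (two coordinates plus a class label), written to a
# temporary directory rather than the package tree.
demo_dir = tempfile.mkdtemp()
demo_paths = write_split([(1, 2, 0)], [(3, 4, 1)], demo_dir)
```

Keeping the two splits in separate generated files matches the new `import testing, training` line in __init__.py, where each split is exposed through its own submodule.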