Changeset 450
- Timestamp: 07/22/07 01:44:44 (5 years ago)
- Location: trunk/learn/scikits/learn/datasets/pendigits
- Files: 14 added, 1 removed, 3 modified
  - __init__.py (modified) (2 diffs)
  - data.py (modified) (3 diffs)
  - pendigits.py (deleted)
  - src/convert.py (modified) (2 diffs)
  - testing (added)
  - testing/__init__.py (added)
  - testing/__init__.pyc (added)
  - testing/data.py (added)
  - testing/data.pyc (added)
  - testing/pendigits_tes.py (added)
  - testing/pendigits_tes.pyc (added)
  - training (added)
  - training/__init__.py (added)
  - training/__init__.pyc (added)
  - training/data.py (added)
  - training/data.pyc (added)
  - training/pendigits_tra.py (added)
  - training/pendigits_tra.pyc (added)

trunk/learn/scikits/learn/datasets/pendigits/__init__.py
--- r181
+++ r450
@@ -1,4 +1,4 @@
 #! /usr/bin/env python
-# Last Change: Thu Jun 21 06:00 PM 2007 J
+# Last Change: Sun Jul 22 03:00 PM 2007 J
 import data as _pendigit
 __doc__ = _pendigit.DESCRSHORT
@@ -6,3 +6,3 @@
 source = _pendigit.SOURCE
 
-load = _pendigit.load
+import testing, training

trunk/learn/scikits/learn/datasets/pendigits/data.py
--- r181
+++ r450
@@ -1,5 +1,5 @@
 #! /usr/bin/env python
 # -*- coding: utf-8 -*-
-# Last Change: Fri Jun 22 05:00 PM 2007 J
+# Last Change: Sun Jul 22 03:00 PM 2007 J
 
 # The code and descriptive text is copyrighted and offered under the terms of
@@ -102,8 +102,23 @@
 (tablet input box resolution).
 
-pendigits-orig contain original, unnormalized data. pendigits is the
-normalized and resampled version where all inputs are of the same
-length. Here because of speed or the digit, feature vectors may be of
-different lengths, e.g., '1' is shorter than '8'.
+In order to train and test our classifiers, we need to represent
+digits as constant length feature vectors. A commonly used technique
+leading to good results is resampling the ( x_t, y_t) points.
+Temporal resampling (points regularly spaced in time) or spatial
+resampling (points regularly spaced in arc length) can be used here.
+Raw point data are already regularly spaced in time but the distance
+between them is variable. Previous research showed that spatial
+resampling to obtain a constant number of regularly spaced points
+on the trajectory yields much better performance, because it provides
+a better alignment between points. Our resampling algorithm uses
+simple linear interpolation between pairs of points. The resampled
+digits are represented as a sequence of T points ( x_t, y_t )_{t=1}^T,
+regularly spaced in arc length, as opposed to the input sequence,
+which is regularly spaced in time.
+
+So, the input vector size is 2*T, two times the number of points
+resampled. We considered spatial resampling to T=8,12,16 points in our
+experiments and found that T=8 gave the best trade-off between
+accuracy and complexity.
 """
 
@@ -128,64 +143,14 @@
 --------------------
 
-Input size depends on writing speed and time and is not fixed +1 class
-attribute
+16 (8 (x, y) coordinates)
 
 For Each Attribute:
 -------------------
 
-The data is in the UNIPEN format. See
-I. Guyon UNIPEN 1.0 Format Definition,
-ftp://ftp.cis.upenn.edu/pub/UNIPEN-pub/definition/unipen.def
-1994
+All input attributes are integers in the range 0..100.
 
 Missing Attribute Values
 ------------------------
 
-Class Distribution
-------------------
-
-              classes
-         0    1    2    3    4    5    6    7    8    9
-tra     384  390  392  370  391  375  351  375  363  357  Tot 3748
-cv      209  201  201  163  185  163  191  196  178  186  Tot 1873
-wdep    187  188  187  186  204  182  178  207  178  176  Tot 1873
-windep  363  364  364  336  364  335  336  364  336  336  Tot 3498
+None
 """
-
-def load():
-    """load the actual data and returns them.
-
-    :returns:
-        data: dictionary.
-        data['training'] and data['testing'] both are dictionaries which
-        contains the following keys: x, y and class. x and y are the
-        coordinates of the eight resampled points, and class is an integer
-        between 0 and 9, indicating the number label.
-
-    example
-    -------
-
-    Let's say you want to plot the first sample of the training set with
-    matplotlib. You would do something like plot(data['training']['x'][0],
-    data['training']['y'][0], '-')
-    """
-    import numpy
-    from pendigits import training, testing
-    assert len(training) == 7494
-    assert len(testing) == 3498
-
-    def raw_to_num(dt):
-        coordinates = numpy.empty((len(dt), 16), numpy.int)
-        digclass = numpy.empty(len(dt), dtype = numpy.int)
-        for i in range(len(coordinates)):
-            coordinates[i] = dt[i][:-1]
-            digclass[i] = dt[i][-1]
-        xcor = coordinates[:, ::2]
-        ycor = coordinates[:, 1::2]
-        return xcor, ycor, digclass
-
-    xcor, ycor, digclass = raw_to_num(training)
-    training = {'x' : xcor, 'y' : ycor, 'class' : digclass}
-    xcor, ycor, digclass = raw_to_num(testing)
-    testing = {'x' : xcor, 'y' : ycor, 'class' : digclass}
-    return {'testing' : testing, 'training' : training}

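The data.py docstring added in this changeset describes spatial resampling by linear interpolation, but the changeset does not contain the resampling code itself. The following is a minimal sketch of that idea, assuming NumPy. The function names `resample_by_arc_length` and `to_feature_vector` are hypothetical and do not appear in the repository. The interleaved layout (x at even positions, y at odd positions) follows the slicing used by the removed load() function. The sketch does not rescale the coordinates to the 0..100 integer range mentioned in the docstring.

```python
import numpy as np


def resample_by_arc_length(points, n_points=8):
    """Return n_points (x, y) pairs equally spaced in arc length.

    Hypothetical helper: `points` is a sequence of raw (x, y) pen
    positions, assumed to contain at least two distinct points.
    """
    pts = np.asarray(points, dtype=float)
    # Length of each segment between consecutive raw points.
    seg = np.hypot(np.diff(pts[:, 0]), np.diff(pts[:, 1]))
    # Cumulative arc length at every raw point, starting at 0.
    s = np.concatenate(([0.0], np.cumsum(seg)))
    # Target positions, evenly spaced along the total trajectory length.
    targets = np.linspace(0.0, s[-1], n_points)
    # Linear interpolation of x and y as functions of arc length.
    x = np.interp(targets, s, pts[:, 0])
    y = np.interp(targets, s, pts[:, 1])
    return np.column_stack((x, y))


def to_feature_vector(resampled):
    """Interleave T points as (x_1, y_1, ..., x_T, y_T), length 2*T."""
    return np.asarray(resampled).reshape(-1)


# Example trajectory: an L-shaped stroke of total length 8.
raw = [(0, 0), (1, 0), (4, 0), (4, 4)]
resampled = resample_by_arc_length(raw, n_points=5)
vector = to_feature_vector(resampled)
xcor, ycor = vector[::2], vector[1::2]  # same split as the removed load()
```

With T=8 the resulting vector has 16 entries, which matches the "16 (8 (x, y) coordinates)" attribute count stated in the new docstring.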
trunk/learn/scikits/learn/datasets/pendigits/src/convert.py
--- r211
+++ r450
@@ -1,4 +1,4 @@
 #! /usr/bin/env python
-# Last Change: Tue Jul 17 05:00 PM 2007 J
+# Last Change: Sun Jul 22 01:00 PM 2007 J
 
 # This script generates a python file from the txt data
@@ -17,7 +17,9 @@
 
 # Write the data in pendigits.py
-a = open("../pendigits.py", "w")
+ftra = open("../pendigits_tra.py", "w")
+ftes = open("../pendigits_tes.py", "w")
 
-a.writelines(dumpvar(tra, 'training'))
-a.writelines(dumpvar(tes, 'testing'))
-a.close()
+ftra.writelines(dumpvar(tra, 'training'))
+ftra.close()
+ftes.writelines(dumpvar(tes, 'testing'))
+ftes.close()
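
The convert.py change writes the training and testing sets to two separate generated modules. The definition of `dumpvar` is not shown in this changeset, so the version below is a hypothetical stand-in that only illustrates the pattern, and `write_module` is likewise an assumed helper name. The sketch uses a `with` block so each file is closed even if writing fails.

```python
def dumpvar(rows, name):
    """Hypothetical stand-in for convert.py's dumpvar.

    Yields the lines of a Python assignment `name = [...]`, one row
    per line. The real function's output format is not shown here.
    """
    yield "%s = [\n" % name
    for row in rows:
        yield "    %r,\n" % (list(row),)
    yield "]\n"


def write_module(path, rows, name):
    """Write one generated data module to `path`."""
    # The with-block closes the file even if writelines raises.
    with open(path, "w") as handle:
        handle.writelines(dumpvar(rows, name))
```

Under these assumptions, the two writes in convert.py would become `write_module("../pendigits_tra.py", tra, "training")` and `write_module("../pendigits_tes.py", tes, "testing")`.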
