| 104 | | pendigits-orig contains the original, unnormalized data. pendigits is the |
| 105 | | normalized and resampled version, in which all inputs have the same |
| 106 | | length. In the original data, feature vectors may differ in length because of |
| 107 | | writing speed or the digit itself, e.g., '1' is shorter than '8'. |
| | 104 | In order to train and test our classifiers, we need to represent |
| | 105 | digits as constant length feature vectors. A commonly used technique |
| | 106 | leading to good results is resampling the (x_t, y_t) points. |
| | 107 | Temporal resampling (points regularly spaced in time) or spatial |
| | 108 | resampling (points regularly spaced in arc length) can be used here. |
| | 109 | Raw point data are already regularly spaced in time but the distance |
| | 110 | between them is variable. Previous research showed that spatial |
| | 111 | resampling to obtain a constant number of regularly spaced points |
| | 112 | on the trajectory yields much better performance, because it provides |
| | 113 | a better alignment between points. Our resampling algorithm uses |
| | 114 | simple linear interpolation between pairs of points. The resampled |
| | 115 | digits are represented as a sequence of T points (x_t, y_t)_{t=1}^T, |
| | 116 | regularly spaced in arc length, as opposed to the input sequence, |
| | 117 | which is regularly spaced in time. |
| | 118 | |
| | 119 | So, the input vector size is 2*T, two times the number of points |
| | 120 | resampled. We considered spatial resampling to T=8,12,16 points in our |
| | 121 | experiments and found that T=8 gave the best trade-off between |
| | 122 | accuracy and complexity. |
| 154 | | |
| 155 | | def load(): |
| 156 | | """Load the actual data and return it. |
| 157 | | |
| 158 | | :returns: |
| 159 | | data: dictionary. |
| 160 | | data['training'] and data['testing'] are both dictionaries that |
| 161 | | contain the following keys: x, y and class. x and y are the |
| 162 | | coordinates of the eight resampled points, and class is an integer |
| 163 | | between 0 and 9, indicating the digit label. |
| 164 | | |
| 165 | | example |
| 166 | | ------- |
| 167 | | |
| 168 | | Let's say you want to plot the first sample of the training set with |
| 169 | | matplotlib. You would do something like plot(data['training']['x'][0], |
| 170 | | data['training']['y'][0], '-') |
| 171 | | """ |
| 172 | | import numpy |
| 173 | | from pendigits import training, testing |
| 174 | | assert len(training) == 7494 |
| 175 | | assert len(testing) == 3498 |
| 176 | | |
| 177 | | def raw_to_num(dt): |
| 178 | | coordinates = numpy.empty((len(dt), 16), dtype=int) |
| 179 | | digclass = numpy.empty(len(dt), dtype=int) |
| 180 | | for i in range(len(coordinates)): |
| 181 | | coordinates[i] = dt[i][:-1] |
| 182 | | digclass[i] = dt[i][-1] |
| 183 | | xcor = coordinates[:, ::2] |
| 184 | | ycor = coordinates[:, 1::2] |
| 185 | | return xcor, ycor, digclass |
| 186 | | |
| 187 | | xcor, ycor, digclass = raw_to_num(training) |
| 188 | | training = {'x' : xcor, 'y' : ycor, 'class' : digclass} |
| 189 | | xcor, ycor, digclass = raw_to_num(testing) |
| 190 | | testing = {'x' : xcor, 'y' : ycor, 'class' : digclass} |
| 191 | | return {'testing' : testing, 'training' : training} |
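
The docstring describes spatial resampling (a fixed number of points regularly spaced in arc length, obtained by linear interpolation) but the module only loads data that has already been resampled. The following is a minimal sketch of that idea, not the dataset authors' actual code: the function names `resample_by_arc_length` and `to_feature_vector` are hypothetical, and the coordinate normalization applied to the published data is not reproduced here. The interleaved x, y layout is an assumption chosen to match the `[:, ::2]` and `[:, 1::2]` slicing in `load()`.

```python
import numpy as np


def resample_by_arc_length(x, y, num_points=8):
    """Return num_points points evenly spaced in arc length along (x, y)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Euclidean length of each segment between consecutive raw points.
    segment_lengths = np.hypot(np.diff(x), np.diff(y))
    # Cumulative arc length at every raw point, starting from 0.
    arc = np.concatenate(([0.0], np.cumsum(segment_lengths)))
    # Target positions, evenly spaced from the first to the last point.
    targets = np.linspace(0.0, arc[-1], num_points)
    # Linearly interpolate each coordinate as a function of arc length.
    # Repeated raw points give a non-decreasing (not strictly increasing)
    # arc array; np.interp does not check for this.
    return np.interp(targets, arc, x), np.interp(targets, arc, y)


def to_feature_vector(xr, yr):
    """Interleave coordinates as x_1, y_1, ..., x_T, y_T (length 2*T)."""
    features = np.empty(2 * len(xr))
    features[0::2] = xr
    features[1::2] = yr
    return features


if __name__ == "__main__":
    # An L-shaped stroke sampled unevenly: (0,0) -> (2,0) -> (2,2).
    xr, yr = resample_by_arc_length([0, 2, 2], [0, 0, 2], num_points=5)
    print(to_feature_vector(xr, yr))
```

With T=8 this produces the 16-element vectors that `raw_to_num` expects, before any integer conversion or normalization.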