A Heuristic for Free Parameter Optimization
with Support Vector Machines (IJCNN 2006)

Matt Boardman and Thomas Trappenberg, Faculty of Computer Science



Are you looking at Support Vector Machines (SVM) to solve classification or regression problems, but are a bit intimidated by the literature on how they work? Perhaps you would like to compare them with your own well-tuned artificial neural network (ANN), decision tree algorithm or other classifier, but are not sure how to tune an SVM for practical tasks?

We hope this page may help ... some of our group's current research is geared towards creating heuristics for blind application of SVM to practical classification and regression problems, requiring little to no knowledge on the part of the user about how the actual SVM works. A self-tuning SVM! Some MATLAB source code is provided at the bottom of this page for your enjoyment.


High-resolution grid search from the Iris Plant Database (UCI Machine Learning Repository). The left grid considers only 10-fold cross-validation. The centre grid considers 10-fold cross-validation and extrinsic regularization based on the number of support vectors, giving equal weight to each (lambda=1). In both diagrams, the black square is the best point found; the black circles are the group of surrounding points within a radius of 1 whose scores are within 2% of that of the best point; and the white triangle is the suggested point, calculated from an intensity-weighted centre of mass of this group of points. The right plot shows the same surface, with extrinsic regularization, explored by the simulated annealing heuristic: each dot corresponds to a single evaluation. All three plots find different solutions with about the same cross-validation accuracy: which is best?

Our Goal

Our goal is not to achieve 100% cross-validation accuracy: on many finite data sets this is not achievable, due to overlapping prior distributions or misclassified samples. Rather, we will maximize the cross-validation accuracy within the bounds that are possible for a given finite data set, but importantly, we will allow some tradeoff in accuracy for the sake of a "safer," more general solution. We especially target those cases with a low number of training samples but high input dimensionality, common in medical waveform analyses such as electroretinography.

Essentially, we have three suggestions:

1. Extrinsic regularization: when scoring a candidate set of parameters, add a measure of the model's complexity (the fraction of training samples retained as support vectors) to the cross-validation error.

2. Simulated annealing: explore the parameter space stochastically, rather than with a fixed grid, gradient descent or a geometric pattern search.

3. Intensity-weighted centre of mass: rather than taking the single best point found, combine the group of near-optimal points into one "safer" suggested point.

We have found excellent results by combining these elements into a working heuristic for SVM parameter optimization. A list of publications appears at the end of this page to detail some of these results.

A Brief, Non-Technical Introduction

SVM have many advantages over ANN, such as a fast and efficient training algorithm, a sparse model representation, and (most importantly here!) fewer free parameters, or hyperparameters. Another important advantage is their insensitivity to training set errors and outliers: SVM have excellent regularization capabilities, which give them the ability to find a general solution to a problem with a minimum of available training samples.


How do you identify a tree?


This regularization capability is important in practical situations, to avoid overfitting the model to a particular set of training data. For example, consider the illustration from [Burges, 1998] of a biologist teaching his students to identify trees: he shows them a maple tree, a birch tree and a hemlock tree as examples. An overfitting student may see an oak tree and conclude, "This is not a tree, as it is not one of these three example trees." In an extreme case, the overfitting student may see another, different maple tree, but conclude, "This is not a tree, as it has the wrong number of leaves." An underfitting student, however, may overgeneralize the solution to simply, "Everything green is a tree," concluding that hedges, grass and small frogs are all trees because they are green.

Although SVM do generalize well, they are still dependent on appropriate selection of several free parameters, or hyperparameters. Inappropriate selection of these parameters can fool the SVM into giving solutions which overfit or underfit, sometimes severely, or have poor accuracy even on the training examples.


Four SVM solutions to a regression problem. The upper left case uses our heuristic, modified for noise-insensitive support vector regression.


For example, these four solutions to a simple regression problem were all given by an SVM: the only difference between these four solutions is the selection of parameters. The dark crosses (x) represent the 20 training samples. The solid line represents the value predicted by the model for each timepoint. The dashed lines show the noise-insensitivity of the model, determined by the free parameter epsilon. In the upper left, we see a smooth, clean solution which exhibits the periodic behaviour we would expect from this data [Rustici et al, 2004]. In the upper right, we see an example of underfitting that results in a loss of detail: in this solution, the epsilon noise-insensitivity parameter is too large. In the lower left, we see a possible example of mild overfitting resulting from setting the epsilon parameter to too small a value: but how do we quantitatively determine that this is not the ideal solution? In the lower right, the overfitting is clear: this model predicts only the training samples themselves, and everything else is given a value of zero. But this model achieves exactly zero mean squared error (MSE): shouldn't that mean that it is the best model of these four?

Extrinsic Regularization

Cross-validation is one solution to the problem of overfitting, and works well. This technique randomly partitions the data into several sections, trains on all sections but one, and tests the accuracy on the remaining section of data. This process is repeated so that each partition becomes the testing partition exactly once. However, cross-validation alone is not enough! Even with cross-validation, there may be a bias of the model towards the training data.
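
To make the procedure concrete, here is a minimal MATLAB sketch of k-fold cross-validation. It assumes the LIBSVM MATLAB interface (svmtrain and svmpredict) is on your path; the helper name cv_error is ours, and this is an illustration rather than one of the released scripts below.

    function err = cv_error(X, y, opts, k)
        % Minimal sketch of k-fold cross-validation (save as cv_error.m).
        % X: n-by-d training samples, y: n-by-1 class labels,
        % opts: LIBSVM options such as '-t 2 -c 1 -g 0.5', k: number of folds.
        n    = size(X, 1);
        fold = mod(randperm(n), k) + 1;            % random assignment of samples to folds
        miss = 0;                                  % running count of misclassified samples
        for i = 1:k
            test  = (fold == i);                   % hold out one partition for testing
            train = ~test;                         % train on the remaining k-1 partitions
            model = svmtrain(y(train), X(train, :), opts);
            pred  = svmpredict(y(test), X(test, :), model);
            miss  = miss + sum(pred ~= y(test));
        end
        err = miss / n;                            % cross-validation error, from 0 to 1
    end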

We therefore propose what we have termed extrinsic regularization, to clearly show the relationship with the SVM's intrinsic regularization, which is part of the SVM training algorithm. That is, when comparing different sets of parameters to see which gives the best-performing model, don't consider only the cross-validated accuracy or cross-validated mean squared error: also consider an external factor, such as the model's complexity. An easy measure of the SVM model's complexity is the number of support vectors it uses: a complex, overfitting model will have many support vectors, whereas a simple, general model will have comparatively few. This technique is commonly used to regularize neural networks by considering the norm of the weight matrix as a measure of complexity [Haykin, 1999]. Other extrinsic regularization measures could be the model's performance on a separate data set drawn from the same source, or an estimate of the capacity of the model such as the VC-dimension, but we have found that the number of support vectors is a good measure of complexity and gives reasonable results in a variety of situations. Another advantage is that the number of support vectors can be read directly from the resulting model, without any additional computation.

So, for each set of parameters we wish to compare, a score is computed by adding the cross-validation error on the training data (from zero to one) to the model complexity (the ratio of the number of support vectors to the number of available training samples), in the simplest case giving equal weight to each. Whichever set of parameters has the lowest score wins. Simple, right?
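
In MATLAB, the score might be computed as in the sketch below, again assuming the LIBSVM interface, where the '-v 10' option returns the 10-fold cross-validation accuracy and model.totalSV holds the number of support vectors. The helper name reg_score and the lambda weighting are ours; lambda = 1 gives the equal-weight case described above.

    function s = reg_score(X, y, opts, lambda)
        % Minimal sketch of the extrinsically regularized score (save as reg_score.m).
        % X: n-by-d training samples, y: n-by-1 labels, opts: LIBSVM options such as
        % '-t 2 -c 8 -g 0.125', lambda: weight on the complexity term (1 = equal weight).
        acc   = svmtrain(y, X, [opts ' -v 10']);   % 10-fold cross-validation accuracy (%)
        cverr = 1 - acc / 100;                     % cross-validation error, from 0 to 1
        model = svmtrain(y, X, opts);              % train once more on all samples
        ratio = model.totalSV / size(X, 1);        % complexity: fraction of samples kept as SVs
        s     = cverr + lambda * ratio;            % the lowest score wins
    end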

Simulated Annealing


A typical grid search for parameters. Brighter areas yield better cross-validation performance.


Well, it may not be quite so simple: the problem is, how do you choose the sets of parameters to compare in the first place, assuming you have no prior knowledge of the problem itself? A common method is to use a grid search, that is, to try a range of values for each parameter. So for example, if we use the RBF kernel for a classification task, we would have two free parameters to tune: the cost parameter (C) and the RBF width parameter (gamma). A grid search would compare the resulting performance across a range of values for each parameter. To quickly explore a wide range of values, a logarithmic scale may be used for each parameter, as in the figure to the right.
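
A minimal MATLAB grid search over C and gamma might look like the sketch below (again assuming the LIBSVM interface). The log2 ranges shown are common defaults rather than a recommendation, and in the full heuristic the cross-validation accuracy would be replaced by the regularized score from the previous section.

    % Minimal sketch of a log-scale grid search for the RBF kernel.
    best = -Inf;
    for logc = -5:2:15                             % C from 2^-5 to 2^15
        for logg = -15:2:3                         % gamma from 2^-15 to 2^3
            opts = sprintf('-t 2 -c %g -g %g -v 10', 2^logc, 2^logg);
            acc  = svmtrain(y, X, opts);           % 10-fold cross-validation accuracy (%)
            if acc > best
                best = acc;  bestc = 2^logc;  bestg = 2^logg;
            end
        end
    end
    fprintf('best C = %g, best gamma = %g (%.1f%% accuracy)\n', bestc, bestg, best);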

This grid search works well, but the precision of the search is limited to the chosen granularity of the grid. In addition, if a third parameter is added as in the case of noise-insensitive support vector regression (epsilon-SVR), this search space becomes a cube rather than a grid, requiring many more calculations: this is the dreaded curse of dimensionality [Bellman, 1961]. Other proposed methods include a gradient descent approach [Chapelle, 1999], an iterative grid search which increases resolution to narrow in on the area of interest [Staelin, 2003], and a geometric pattern search method [Momma and Bennett, 2002]. The noise induced by the stochastic (random) partitioning in cross-validation, however, may fool such methods. In addition, the surface is fraught with local extrema which may fool gradient descent methods or limit the efficiency of geometric searches.


A uniformly random approach (left) compared to simulated annealing (right), both with the same number of evaluations.


We therefore suggest a stochastic search method such as simulated annealing, which is based on the physical analogy of slowly cooling metals [Kirkpatrick et al, 1983]. Simulated annealing is a well-known method for combinatorial optimization, and we have found that even a simple continuous variation of the original Metropolis algorithm yields excellent results. The surface we wish to traverse is stochastic, so it makes sense that our method for traversing it should also be stochastic. We have found that this makes the heuristic less sensitive to the noise induced by cross-validation, and it does not get stuck in local extrema as may happen with other methods. In the figure on the right, we compare the points in parameter space evaluated by the simulated annealing algorithm to a uniform random search: both searches contain exactly 660 evaluations, but the simulated annealing approach concentrates those evaluations in the area of interest to a much greater extent.
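
A minimal MATLAB sketch of the idea follows, searching over (log2 C, log2 gamma) and scoring each point with the reg_score helper sketched earlier. The step sizes, cooling schedule and iteration count are illustrative only; the released scripts use their own settings.

    % Minimal sketch of simulated annealing over the SVM free parameters.
    p = [0, 0];                                    % current point: [log2(C), log2(gamma)]
    s = reg_score(X, y, sprintf('-t 2 -c %g -g %g', 2^p(1), 2^p(2)), 1);
    T = 1.0;                                       % initial temperature
    evals = [p, s];                                % keep every evaluated point and its score
    for iter = 1:500
        q  = p + randn(1, 2) * 4 * max(T, 0.05);   % random step, shrinking as we cool
        sq = reg_score(X, y, sprintf('-t 2 -c %g -g %g', 2^q(1), 2^q(2)), 1);
        evals = [evals; q, sq];
        if sq < s || rand < exp((s - sq) / T)      % always accept better points; accept worse
            p = q;  s = sq;                        % ones with a probability that falls with T
        end
        T = 0.99 * T;                              % slow geometric cooling
    end
    fprintf('final C = %g, gamma = %g (score %.3f)\n', 2^p(1), 2^p(2), s);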

Intensity-Weighted Centre of Mass


An intensity-weighted centre-of-mass operation, added to a grid search, moves the optimum point away from a dropoff in accuracy.


One of the problems with gradient descent, geometric pattern searches or a stochastic approach such as we use here is that if the search is executed several times, even on the same training data, the final results may be somewhat different: the "optimum" points shift around due to the stochastic nature of cross-validation. A simple solution proposed in [Momma and Bennett, 2002] is to run the algorithm several times, and use the mean of the results: however, having to run the algorithm several times means your algorithm is several times less efficient.

Rather, we propose to keep the best points found by the algorithm (we keep all of them for display purposes, but you could keep only the best results in a moving window to save space). When the algorithm completes, the overall optimum point is selected as the one with the best score. Any surrounding points (within an arbitrary log radius) whose scores are close to that of the best point (say, within 1 or 2 percent) are grouped together. Borrowing from the field of astronomy, we calculate the intensity-weighted centre of mass of this group of "best" points, where the intensity is simply the performance: points yielding higher performance contribute more to the solution. We find that such an approach greatly reduces the volatility of the end solution point, and enhances the "safety" of the result by tending to move points away from areas of sharp dropoffs in accuracy (see the figure to the right).
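
Continuing the MATLAB sketch from the simulated annealing section, the suggested point might be computed from the stored evaluations as below. The radius, the 2% tolerance and the particular weighting (2 minus the score, so that lower-scoring points contribute more) are illustrative choices, not the exact values used in the released scripts.

    % Minimal sketch of the intensity-weighted centre of mass.  Each row of evals
    % is [log2(C), log2(gamma), score], with lower scores being better.
    [sbest, ibest] = min(evals(:, 3));             % overall best point found
    d = sqrt(sum((evals(:, 1:2) - ...
                  repmat(evals(ibest, 1:2), size(evals, 1), 1)).^2, 2));
    near = (d <= 1) & (evals(:, 3) <= sbest + 0.02);   % nearby points scoring within 2%
    w    = 2 - evals(near, 3);                     % intensity: lower score, larger weight
    com  = [sum(evals(near, 1) .* w), sum(evals(near, 2) .* w)] / sum(w);
    fprintf('suggested C = %g, gamma = %g\n', 2^com(1), 2^com(2));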

Wait, enhancing the safety of the solution? What does that mean? Well, this optimization takes some time to perform. If you need a new model on a regular basis, say as part of a daily workflow, you may wish to spend your processor cycles on more important tasks such as rebuilding database indices. So rather than optimizing the parameters each time, you can optimize your parameters once, or at least less often, and train the final SVM using the previously optimized set of parameters. Since training samples may be added or removed in the meantime, the error surface may shift somewhat: if your solution is near a sharp dropoff, the new finite training set may "fall" off the edge into a region of lower accuracy. Another example is where a subset of the training data is used to optimize parameters, say 100 training samples, but the final SVM is trained using all training samples: this saves computation, but again the new training samples might cause the classifier to fall off an edge. The centre of mass operation largely prevents these situations, which makes this heuristic more useful in practical settings.

Noise-Insensitive Regression

We have expanded this heuristic to a three-dimensional parameter space, to simultaneously optimize the cost parameter (C), the RBF width parameter (gamma) and the noise-insensitivity tube width (epsilon) in noise-insensitive support vector regression (epsilon-SVR). Since the dimensionality of the search space has increased, it may be advantageous to allow slower cooling in the simulated annealing algorithm, so that more of the space is evaluated. We have used this approach in two regression analyses: the Predictive Uncertainty in Environmental Modelling Competition hosted by Gavin Cawley, and a semi-multivariate approach to noise reduction and missing data imputation for periodic gene expression from microarray data, based on fission yeast cell-cycle data gathered at the Wellcome Trust Sanger Institute [Rustici et al, 2004]. Results from these analyses are found in Matthew Boardman's thesis at the bottom of this page (Chapters 5 and 6, respectively).
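
For reference, a single (C, gamma, epsilon) triple might be evaluated as in the sketch below, again assuming the LIBSVM MATLAB interface ('-s 3' selects epsilon-SVR, '-p' sets the tube width, and '-v 10' returns the 10-fold cross-validation mean squared error for regression). The parameter values are illustrative, and in practice the MSE term may need rescaling before it is combined with the support vector ratio.

    % Minimal sketch of scoring one parameter triple for epsilon-SVR.
    C = 8;  gamma = 0.5;  epsilon = 0.1;           % illustrative values only
    opts  = sprintf('-s 3 -t 2 -c %g -g %g -p %g', C, gamma, epsilon);
    mse   = svmtrain(y, X, [opts ' -v 10']);       % cross-validated mean squared error
    model = svmtrain(y, X, opts);                  % full model, for the support vector count
    score = mse + model.totalSV / size(X, 1);      % extrinsic regularization, as before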

A Few More Suggestions

SVM classifiers work best with balanced, i.i.d. (independently and identically distributed) training data sets. So, try to ensure that you have the same number of positive samples as negative samples, and scale your inputs such that each dimension in the training set has more or less the same distribution: say, with values ranging from -1 to +1 and a zero mean, or with zero mean and unit variance.
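
A minimal MATLAB sketch of the second suggestion, scaling each input dimension to zero mean and unit variance:

    % Minimal sketch of scaling the training inputs (X is n-by-d).
    mu    = mean(X, 1);                            % per-dimension mean
    sigma = std(X, 0, 1);                          % per-dimension standard deviation
    sigma(sigma == 0) = 1;                         % guard against constant dimensions
    Xs = (X - repmat(mu, size(X, 1), 1)) ./ repmat(sigma, size(X, 1), 1);
    % Apply the same mu and sigma to any test data, rather than rescaling it separately.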

What if your data set is not quite balanced? We faced this issue with the retinal electrophysiology data set in the papers below: we had six positive samples and eight negative samples. To create a balanced training set, we selected all six positive samples and six of the eight negative samples. Rather than selecting them at random, we selected each possible combination of six negative samples in turn and ran the experiment 28 times (8 choose 6 = 28). The final accuracy is then the mean over all 28 experiments. When we desired a single set of parameters for subsequent experiments, we simply chose the medians of the cost and gamma parameters selected by tuning each experiment independently, then re-ran the experiments with these median free parameters.
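
In MATLAB, this balanced-subset procedure might look like the sketch below; pos and neg stand for the 6-by-d and 8-by-d matrices of positive and negative samples, and the fixed LIBSVM parameters are purely illustrative (in our experiments, each run was tuned independently with the heuristic).

    % Minimal sketch of averaging over every balanced subset of the negatives.
    combos = nchoosek(1:8, 6);                     % all 28 ways of choosing 6 of 8 negatives
    acc    = zeros(size(combos, 1), 1);
    for i = 1:size(combos, 1)
        Xi = [pos; neg(combos(i, :), :)];          % balanced 12-sample training set
        yi = [ones(6, 1); -ones(6, 1)];            % class labels
        acc(i) = svmtrain(yi, Xi, '-t 2 -c 8 -g 0.5 -v 6');   % 6-fold CV accuracy (%)
    end
    fprintf('mean accuracy over %d runs: %.1f%%\n', numel(acc), mean(acc));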

What if you have many more input dimensions than you have training samples? For most other classifiers or regression analyses, this would be a problem. But SVM handle this common situation beautifully: it is, in fact, part of their design. Since the training algorithm for a non-linear SVM works in a feature space defined by the kernel, which may even be infinite-dimensional in the case of the commonly-used RBF kernel, SVM are (nearly?) immune to excessive dimensionality. They will be no less efficient with an excessive number of input dimensions that include irrelevant or redundant features. Depending on the nature of your problem, feature construction methods such as PCA, FFT or wavelets may well be unnecessary! So, at least to start with, give your SVM all possible input dimensions and let it sort out what's important during training. Later, you may find that removing some dimensions (for example, using a model-dependent sensitivity assessment method such as [Rueda et al, 2004]) yields higher accuracy, but this is not necessarily true (in fact, we haven't yet found a situation where feature construction or feature selection has helped to any statistically significant extent, but it's much too early to say it never does!).

For More Information

Currently, we have published a single paper on this work, based on Matt Boardman's MCSc thesis. More are in progress, and will be listed here as they appear.

If you find an SVM useful for your work (and we hope that you do!), here are some pointers that may help you learn more about them.

The articles and books referenced above are:

Acknowledgements

This work was supported in part by NSERC grant RGPIN 249885-03. We would also like to thank Dr. François Tremblay and Dr. Denis Falvey, Dalhousie University, for the retinal electrophysiology data analyzed in this work; Dr. Christian Blouin, Dalhousie University, for access to the sequence alignment data analyzed in this work; the genomic researchers at the Wellcome Trust Sanger Institute, Hinxton, Cambridge, for making their gene expression data publicly available; Dr. Gavin Cawley, University of East Anglia, for suggesting and holding the Predictive Uncertainty in Environmental Modelling Competition for the 2006 IEEE World Congress on Computational Intelligence; and the anonymous IEEE reviewers, whose feedback on the published work above provided many invaluable suggestions.

Source Code

The following MATLAB script files are available for download. They are provided as open source programs, with no express or implied warranties of any kind (see the GNU General Public License for more information). You are free to use or distribute the files as you like, so long as you include a reference to this page or to the IJCNN 2006 paper as an acknowledgement of our contribution. For subsequent published works, please reference the following conference paper:

Matthew Boardman and Thomas Trappenberg (2006). A Heuristic for Free Parameter Optimization with Support Vector Machines. Proceedings of the 2006 IEEE International Joint Conference on Neural Networks (IJCNN 2006), Vancouver, BC, July 16-21, 2006, pp. 1337-1344.

A recent version of MATLAB® is required; we performed our tests mostly on MATLAB 7.1 (R14) SP3 for Windows® and MATLAB 7.0 (R14) SP1 for Linux. Other versions of MATLAB may also work.

You will also need an installed and working copy of a MATLAB-compatible Support Vector Machine implementation, such as LIBSVM (the default in these scripts) or SVMlight. Others should also work, but you will need to modify the scripts to use the correct calling conventions.

Download now (25k)

The following files are included in this archive:

Please note: a version of this heuristic adapted for noise-insensitive support vector regression is currently in progress. Please contact the authors above for more information. Preliminary results are available in Chapters 5 and 6 of Matthew Boardman's thesis, above.

Good luck and have fun!

