Kernel Regression and Classification in a Nutshell
In this note we demonstrate the application of kernels to linear regression. We start with the solution of a standard adjustment model in primal form. We then modify this approach by transforming it into its dual form. Finally we introduce the kernel notation and illustrate the advantages of this method for higher order regression.
Introduction to the regression model
We are given N training points in D-dimensional space. Here the points are randomly generated and uniformly distributed within a certain region. We use a homogeneous representation for all coordinates.
N = 100; D = 2; X = [1 + 9 * rand(D, N); ones(1, N)];
For each point there is a target value stored within the target vector y of length N. In our example the targets are a linear function of the point locations plus some random error e:
y = [0.2; -0.3; 0]' * X + 0.1 * randn(1, N);
We can plot this synthetic data in three dimensions by assigning each point X_n its height y_n:
plot3(X(1, :), X(2, :), y, 'r.'); hold on;

We want to find a model that approximates the surface defined by the training points and predicts the heights of some test points T.
[u, v] = meshgrid(1 : 10); T = [u(:)'; v(:)'; ones(1, 100)];
Least squares solution in primal form
In order to determine the unknown weights w of the linear regression function, we need the derivatives of the regression function with respect to the weights w. In the simple case of a linear function these are the point coordinates X themselves.
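For a single training point the model from above reads y_n = w' * x_n + e_n, so the derivative with respect to w is simply x_n.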
We can compute the least squares solution by inverting the normal matrix of size (D+1)x(D+1). (In adjustment computations we would call the transposed point coordinates X' the design matrix of our functional model and the transposed target vector y' our observations.)
w = (X * X')^-1 * X * y' %#ok<NOPTS>
w = 0.2027 -0.3061 0.0115
We apply these weights to our test points T and plot the results.
y_t = w' * T; surf(reshape(y_t, 10, 10));

This problem is regular as long as there are at least as many linearly independent training points as unknown parameters. Later we will deal with problems with many more parameters than training points. Therefore we add a regularization term that stabilizes the possibly singular matrix X X' and obtain nearly identical results.
w = (0.01 * eye(D + 1) + X * X')^-1 * X * y' %#ok<NASGU,NOPTS>
w = 0.2027 -0.3060 0.0115
Least squares solution in dual form
By applying the following matrix identity we can rewrite the previous solution:
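(0.01 * eye(D+1) + X * X')^-1 * X = X * (0.01 * eye(N) + X' * X)^-1

This push-through identity holds for any regularization constant; applied to y' from the right it turns the primal solution above into the dual form below.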
This rewritten solution yields identical results, although it requires inverting a matrix of size NxN instead of (D+1)x(D+1).
w = X * (0.01 * eye(N) + X' * X)^-1 * y' %#ok<NOPTS>
w = 0.2027 -0.3060 0.0115
Kernel regression
Now we are no longer interested in the weights w themselves but only in the heights of the test points T.
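Substituting the dual form of w into the prediction gives y_t = w' * T = y * (0.01 * eye(N) + X' * X)^-1 * X' * T.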
In the field of kernel methods the Gram matrix X' X is denoted by K; it consists only of dot products between training points (or between transformed training points). The same holds for the matrix X' T, denoted by k, which contains the dot products between training and test points.
K = X' * X; k = X' * T; y_t = y * (0.01 * eye(N) + K)^-1 * k; surf(reshape(y_t, 10, 10));

Higher order regression
Assume N training points X representing a quadratic surface:
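Each target is a quadratic form of the homogeneous point coordinates, y_n = x_n' * Q * x_n + e_n, with the coefficient matrix Q = diag([0.2, 0.1, -1]):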
y = diag(X' * diag([0.2, 0.1, -1]) * X)' + 0.1 * randn(1, N); clf; plot3(X(1, :), X(2, :), y, 'r.'); hold on;

This model is quadratic in the coordinates X but linear in the elements of Q. We can rewrite it using the Kronecker product of each coordinate vector with itself, for X and T respectively:
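y_n = x_n' * Q * x_n = vec(Q)' * kron(x_n, x_n), so the transformed coordinates kron(x_n, x_n) contain all pairwise products of the original coordinates.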
[i, j] = ndgrid(1 : 3); XX = X(i(:), :) .* X(j(:), :); TT = T(i(:), :) .* T(j(:), :);
Now we can apply kernel regression to the transformed coordinates.
K = XX' * XX; k = XX' * TT; y_t = y * (0.01 * eye(N) + K)^-1 * k; surf(reshape(y_t, 10, 10));

Note that we did not explicitly compute the weights vec(Q) of the quadratic surface. Furthermore, the kernel matrix K did not grow in any dimension.
Imagine a problem of order L in D-dimensional space. Then there are O(D^L) coefficients to be estimated. The normal matrix for the standard least squares estimation will be huge. The kernel matrix, however, will be of constant size NxN.
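As a quick size comparison (the order L = 5 here is just an assumed example), the L-fold Kronecker power of our D + 1 homogeneous coordinates already has (D + 1)^L = 243 entries, while the kernel matrix keeps its N x N size:

L = 5;       % assumed polynomial order for this illustration
(D + 1)^L    % dimension of the normal matrix in the primal form: 243
N            % dimension of the kernel matrix in the dual form: 100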
A very useful trick is to skip the computation of the transformed coordinates and to compute the kernel matrix directly from the scalar products, since kron(x, x)' * kron(z, z) = (x' * z)^2, e.g. for the quadratic kernel:
K = (X' * X).^2; k = (X' * T).^2; y_t = y * (0.01 * eye(N) + K)^-1 * k; surf(reshape(y_t, 10, 10));

Linear kernels for classification
Assume N training points X annotated with labels y (-1 or 1).
N = 100; theta = pi / 2 * rand(1, N); rho = [4 * ones(1, N / 2), 10 * ones(1, N / 2)] + 0.5^2 * randn(1, N); X = [rho .* cos(theta); 10 - rho .* sin(theta); ones(1, N)]; y = [-ones(1, N / 2), ones(1, N / 2)]; clf; plot3(X(1, :), X(2, :), y, 'r.'); hold on;

Using these training data we now want to classify some test points T. This classification task can be written in terms of a kernel regression. First we restrict ourselves to a linear kernel.
K = X' * X; k = X' * T; y_t = y * (0.01 * eye(N) + K)^-1 * k;
The intersection line of the regression surface with the horizontal coordinate plane y_t = 0 defines the decision boundary.
surf(reshape(y_t, 10, 10), 'FaceAlpha', 0.5); contour(reshape(y_t, 10, 10), [0 0], 'LineWidth', 3, 'Color', 'r');

Higher order kernels for classification
The classification with kernels can easily be extended to quadratic or higher order decision boundaries. To do so, we only need to replace the linear kernel with a quadratic one:
K = (X' * X).^2; k = (X' * T).^2; clf; plot3(X(1, :), X(2, :), y, 'r.'); hold on; y_t = y * (0.01 * eye(N) + K)^-1 * k; surf(reshape(y_t, 10, 10), 'FaceAlpha', 0.5); contour(reshape(y_t, 10, 10), [0 0], 'LineWidth', 3, 'Color', 'r');
