\(\renewcommand{\tr}[1]{{#1}^{\mkern-1.5mu\mathsf{T}}}\) \(\renewcommand{\m}[1]{\mathbf{#1}}\) \(\renewcommand{\ve}[1]{\mathbf{#1}}\)
Recall the Facebook data on like and Impressions where we fit a cubic to capture the average of the \(y\)s as a function of \(x\): \[y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + r. \] Here \(y\) was log(like + 1) and \(x\) was log(Impressions).
There are a few things worth noting about this model:

- it is a polynomial
- it is defined for any \(x \in \Re\), that is, for any \(x\) within the range of the \(x_i\)s and for any \(x\) outside that range
- the corresponding generative model has \[Y = \mu(x) + R\] with \(E(R) = 0\); that is, the mean or expected value of \(Y\) is a function of \(x\)
- that function is \[ \mu(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3. \]
We might reasonably ask which, if any, of these implicit assumptions generalize.
It is not clear, for example, whether a cubic is appropriate here. Perhaps a higher degree polynomial would be better? Or a lower degree one? Or perhaps some non-polynomial curve might make more sense? We might want to be guided by whatever prior information we have about the functional form of the model. Barring the availability of such prior information, we might prefer a less prescriptive model for the functional form of \(\mu(x)\).
In most cases, we would rather like to “let the data speak for themselves” in suggesting what shape the dependence of \(y\) on \(x\) might take. We would rather not have to specify that shape in advance.
A less prescriptive approach might be to take the generative model at face value, namely that we are modelling the mean of \(Y\) at any given point \(x\). Having data values \((x_1, y_1), \ldots, (x_N, y_N)\), it might make sense to simply estimate \(\mu(x)\) by the arithmetic average of the \(y_i\) values for those points whose corresponding \(x_i\) is either equal to the \(x\) of interest or nearly so.
For example, a plot of log(Impressions) versus Post.Month might simply “connect the dots” of the monthly averages:
plot(fb$Post.Month, fb$x,
main = "Facebook",
xlab = "Post.Month)",
ylab = "log(Impressions)",
pch=19,
col=adjustcolor("firebrick", 0.7)
)
averages <- numeric(length=12)
medians <- numeric(length=12)
for (i in 1:12) {
averages[i] <- mean(fb$x[fb$Post.Month==i])
medians[i] <- median(fb$x[fb$Post.Month==i])
}
abline(lm(x ~ Post.Month, data=fb), col="blue", lwd=5,lty=3)
lines(1:12, averages,col="red", lwd=3)
lines(1:12, medians,col="darkgrey", lwd=3)
legend("topright",
legend=c("LS line", "Averages", "Medians"),
col= c("blue","red","darkgrey"),
lty=c(3,1,1), lwd=3)
The averages might better reflect the month-to-month differences than does the least-squares fitted line, though the latter is simpler. The medians, on the other hand, produce a smoother curve that is not so influenced by outlying \(y\) values.
With the monthly posts, we have many observations available for each month, that is, many \(y\) values for every unique \(x\). More generally, we might choose to have \[\widehat{\mu}(x) = \frac{1}{k_x} \sum_{x_i~ \in ~ Nbhd(x)} y_i\] where \(Nbhd(x)\) denotes a neighbourhood of \(x\) and \(k_x\) is the number of points in that neighbourhood. It might, for example, be a fixed number of nearest neighbours or a distance-based neighbourhood. Either way, \(\mu(x)\) is being determined locally, in this case by the average of all of the \(y_i\)s in the local neighbourhood.
For example, consider the following artificially generated set of point pairs:
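The code that generated these data is not shown here. The following stand-in is entirely hypothetical (every choice in it is ours) but produces data with broadly similar features, namely unevenly spaced \(x\) values, a smooth trend with a sharp bump, and noise, so that the later code chunks can be run:
# Hypothetical stand-in for the artificial data;
# the original generating code is not shown in these notes.
set.seed(314159)
N <- 300
x <- rbeta(N, 2, 2) * 1.5 - 0.3 # uneven density on roughly (-0.3, 1.2)
mu_true <- -1.5 + 3 * x + 1.5 * exp(-((x - 0.6)/0.05)^2) # trend plus a bump
y <- mu_true + rnorm(N, sd = 0.3)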
We could, for example, cut the range of the \(x\) values up into several fixed neighbourhoods and use the average \(y\) in each neighbourhood as \(\widehat{\mu}(x)\) for any value of \(x\) in the neighbourhood. This might be accomplished as follows:
# varying numbers, constant width
breaks_v <- seq(min(x), max(x), length.out=11)
nbhd_v <- cut(x,
breaks= breaks_v,
include.lowest=TRUE
)
# varying size, constant proportions
breaks_p <- c(quantile(x, seq(0,1,0.1) ))
nbhd_p <- cut(x,
breaks=breaks_p,
include.lowest=TRUE
)
local_v <- levels(nbhd_v)
local_p <- levels(nbhd_p)
# Compute the local averages
get_ave <- function(locals, nbhds) {
mu <- vector(mode="numeric", length=length(x))
for (i in 1:length(locals)) {
nbhd_i <- nbhds == locals[i]
mu[nbhd_i] <- mean(y[nbhd_i])
}
mu
}
mu_v <- get_ave(local_v, nbhd_v)
mu_p <- get_ave(local_p, nbhd_p)
# A quick and dirty way to draw the mu by neighbourhood
plot_ave <- function(locals,
nbhds,
x,
mu,
...)
{for (i in 1:length(locals)) {
nbhd_i <- nbhds == locals[i]
newx <- x[nbhd_i]
newmu <- mu[nbhd_i]
Xorder <- order(newx)
if(length(newx)==1){
lines(rep(newx[Xorder],2),
rep(newmu[Xorder],2),
...)
} else {
lines(newx[Xorder], newmu[Xorder], ...)
}
}
}
# plot
plot(x,y,
col="grey80", pch=19, cex=0.5,
main = "Constant width nbhd")
plot_ave(local_v, nbhd_v, x, mu_v,
col="blue", lwd=5)
plot(x,y,
col="grey80", pch=19, cex=0.5,
main = "Constant proportion nbhd")
plot_ave(local_p, nbhd_p, x, mu_p,
col="red", lwd=5)
# Note that the first neighbourhoods had constant widths,
# and hence varying numbers of points
local_v
## [1] "[-0.306,-0.154]" "(-0.154,-0.00201]" "(-0.00201,0.15]"
## [4] "(0.15,0.302]" "(0.302,0.454]" "(0.454,0.606]"
## [7] "(0.606,0.758]" "(0.758,0.91]" "(0.91,1.06]"
## [10] "(1.06,1.21]"
# The second neighbourhoods had approximately constant number
# of points, this being about 10% of the x values in each.
# Hence they had varying widths.
#
local_p
## [1] "[-0.306,0.0714]" "(0.0714,0.19]" "(0.19,0.291]"
## [4] "(0.291,0.38]" "(0.38,0.45]" "(0.45,0.539]"
## [7] "(0.539,0.636]" "(0.636,0.773]" "(0.773,0.869]"
## [10] "(0.869,1.21]"
And now both in the one plot.
Each plot has a separate mean value covering each entire neighbourhood. Note however:

- they visibly have different neighbourhood sizes depending on the density of the points (in \(x\))
- they are in some agreement, particularly about the coarse shape of \(\mu(x)\)
- they disagree in places; some regions have quite different values
- they suggest that \(\mu(x)\) is
  - locally flat (zero slope)
  - discontinuous
We might for example look at larger, or smaller neighbourhoods. First, let’s make the neighbourhoods larger.
And now smaller neighbourhoods:
(Note that the intervals were not drawn here to cover the entire neighbourhood but only the points in each neighbourhood.)
Not surprisingly, the smaller neighbourhoods capture finer structure and the larger neighbourhoods coarser structure.
But we still have these “flats” everywhere. This, together with the obvious discontinuities for \(\mu(x)\), also suggests something about the derivative \(\mu^\prime(x)\), namely that \(\mu^\prime(x) = 0\) almost everywhere except at the discontinuities where it is not defined (or essentially infinite).
We could replace the flats by local lines. This allows \(\mu(x)\) to change somewhat more smoothly and lets the derivative take non-zero values within the neighbourhoods.
We need only change the way we calculate the estimates of \(\mu(x)\).
#
# Now we need to adapt the local average function
# to fit local lines.
#
get_ave <- function(locals, nbhds) {
mu <- vector(mode="numeric", length=length(x))
for (i in 1:length(locals)) {
nbhd_i <- nbhds == locals[i]
local_x <- x[nbhd_i]
local_y <- y[nbhd_i]
local_data <- data.frame(x=local_x, y=local_y)
local_fit <- lm(y~x, data=local_data)
mu[nbhd_i] <- predict(local_fit)
}
mu
}
mu_v <- get_ave(local_v, nbhd_v)
mu_p <- get_ave(local_p, nbhd_p)
And the plot.
We could use a polynomial of any degree on the neighbourhoods, say a cubic.
# Cubic
degree <- 3
#
# Now we need to adapt the local average function
# to fit local polynomials
#
get_ave <- function(locals, nbhds, degree) {
mu <- vector(mode="numeric", length=length(x))
for (i in 1:length(locals)) {
nbhd_i <- nbhds == locals[i]
local_x <- x[nbhd_i]
local_y <- y[nbhd_i]
local_data <- data.frame(x=local_x, y=local_y)
# change the model here
local_fit <- lm(y~poly(x,degree=degree), data=local_data)
mu[nbhd_i] <- predict(local_fit)
}
mu
}
To fit a polynomial of degree \(p\), there need to be at least \(p+1\) distinct points. This means that each neighbourhood has to have at least this many points. This didn’t happen for our constant-width neighbourhoods.
# Get the neighbourhoods as before
# varying numbers, constant width
breaks_v <- seq(min(x), max(x), length.out=11)
nbhd_v <- cut(x,
breaks= breaks_v,
include.lowest=TRUE
)
local_v <- levels(nbhd_v)
#
# Let's see if we have enough points in each
count <- vector("numeric", length = length(breaks_v) -1)
for (i in 1:length(count)) {
count[i] <- sum(nbhd_v==local_v[i])
}
# Looking at the counts, we will be in trouble.
count
## [1] 5 6 40 41 60 42 41 45 18 2
So we will instead use the constant-proportion neighbourhoods, each having 10% of the data:
## [1] 30 30 30 30 30 30 30 30 30 30
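A sketch of the elided fitting and plotting step, reusing get_ave() and plot_ave() from above (the plotting choices here are ours):
# Local cubics on the constant-proportion neighbourhoods
mu_p <- get_ave(local_p, nbhd_p, degree = 3)
plot(x, y,
col="grey80", pch=19, cex=0.5,
main = "Local cubics, constant proportion nbhd")
plot_ave(local_p, nbhd_p, x, mu_p,
col="red", lwd=5)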
Clearly, this follows the data more closely and more smoothly in each neighbourhood. Unfortunately, discontinuities can still occur at the neighbourhood boundaries.
How do we rid ourselves of the discontinuities?
First, we note that within each neighbourhood, we fit a linear model, say a polynomial. For example, in the \(j\)th neighbourhood a \(p\) degree polynomial would look like \[\mu^{(j)}(x) = \beta_0^{(j)} + \beta_1^{(j)} x + \ldots + \beta_p^{(j)} x^p ~~~~ \forall ~~ x ~ \in ~ Nbhd_j.\]
Since no \(x\) is in more than one neighbourhood, we could fit all parameters at once in a single model involving all of the data. If there were \(J\) neighbourhoods, this means a linear model having a total of \(J(p+1)\) parameters. The entire model could be written as \[\mu(x) = \sum_{j=1}^J I_{Nbhd_j}(x) \mu^{(j)}(x)\] where \(I_A(x)\) is the indicator function evaluating to 1 whenever \(x \in A\) and to zero otherwise.
To rid ourselves of discontinuities, we need only restrict the parameters so that the discontinuities disappear. For example, suppose the discontinuities occur at \(x=k_j\) for \(j=1, \ldots, K\) with \(K=J-1\). (The boundary points \(k_j\) are sometimes called knots.) Then we would restrict the parameters by forcing the curves to meet at every boundary. That is \[\beta_0^{(j)} + \beta_1^{(j)} k_j + \ldots + \beta_p^{(j)} k_j^p = \beta_0^{(j+1)} + \beta_1^{(j+1)} k_{j} + \ldots + \beta_p^{(j+1)} k_{j}^p \] for \(j=1,\ldots, K\). This introduces \(K=J-1\) restrictions so that there are really only \(J(p+1) -(J-1) = pJ+ 1\) parameters.
The curve \(\mu(x)\) would be continuous but could now have “kinks” at the joins. We could make these go away by forcing the first derivatives to match at the boundaries as well. This would introduce another set of \(K=J-1\) restrictions giving only \(pJ+1 - (J-1) = (p-1)J +2\) free parameters. We could then match the second derivative, giving a still smoother function that is constrained by a further \(K\) restrictions. If we match the function and also match on \(d\) derivatives, the total number of parameters which remain is \[\begin{array}{rcl} J(p+1) - (d+1)K &= &pJ + J - (d+1)J + (d+1) \\ &&\\ &=& (p-d)J + d + 1 \\ &&\\ &=& (p-d)K + p+1. \end{array} \] If we choose \(d=p-1\) this becomes simply \(K + p + 1\). A piecewise polynomial of degree \(p\) that matches on the function and on \(p-1\) of its derivatives is called a \(p\)-degree spline or a \(p\)th order spline. (The name comes from the flexible tool called the spline that was used in drafting to draw smooth curves by hand.)
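For example, a cubic spline (\(p = 3\)) with \(K\) knots has \(K + p + 1 = K + 4\) free parameters; with \(K = 9\) knots that is \(13\), a count we will meet again when fitting below.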
A popular choice is the cubic spline, having \(p=3\). These provide enough flexibility for most purposes and, at least according to statistical folklore, are smooth enough that the human visual system cannot detect the locations of the knots!
The linear model that would result, say for a cubic spline having \(p=3\), would be written as a linear combination like \[\mu(x) = \beta_0 + \beta_1 b_1(x) + \ldots + \beta_{K + 3} b_{K + 3}(x)\] for some particular choices of \(b_j(x)\). These functions can be thought of as a set of basis functions in the same way that the \(N\)-dimensional vectors \[{\bf b}_j = \left(b_j(x_1), \ldots, b_j(x_N) \right)^T\] that result from evaluating the functions at the points \(x_1, \ldots, x_N\) form a set of basis vectors (together with the \({\bf 1}\) vector multiplying \(\beta_0\)) for the vector space spanned by the column vectors of an \(N \times (K+4)\) regression matrix \({\bf X} = \left[{\bf 1},{\bf b}_1, \ldots, {\bf b}_{K+3} \right]\).
As a basis the set of functions generate a space (having \(K+4\) dimensions) and, conversely, for that space there are any number of bases which would generate it. For example, let \[h(x,k_j) = \left(x-k_j\right)^3_+ = \left\{\begin{array}{lcl} \left(x-k_j\right)^3 &~~~~~~&\mbox{when } ~ x > k_j \\ &&\\ 0&&\mbox{otherwise}\\ \end{array} \right. \] where \(k_j\) is the location of the \(j\)th knot. Using these functions, it can be shown that the cubic spline can be written as \[\mu(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4 h(x,k_1) + \ldots + \beta_{K + 3} h(x,k_K).\] In this way, our basis has been formed by starting with the basis functions for a simple cubic and then adding a truncated power basis function for each knot.
This holds more generally for a \(p\)-degree spline where we can write \[\mu(x) = \beta_0 + \beta_1 x + \ldots + \beta_p x^p + \beta_{p+1} h_1(x,k_1) + \ldots + \beta_{K + p} h_K(x,k_K) \] in terms of the basis functions being those of a \(p\) degree polynomial plus the truncated power basis functions: \[h_j(x,k_j) = \left(x-k_j\right)^p_+ ~~~~\mbox{for}~~ j = 1, \ldots, K.\]
Note again that this choice is just one possible choice of basis functions. This particular choice is fairly easy to understand conceptually but it is unfortunately a poor choice for computation. The problem is that powers of large numbers can lead to numerical instability and round-off error. Instead, an equivalent set of basis functions but one which is better computationally is the so-called B-spline basis, which also allows for efficient computation even when \(K\), the number of knots, is large.
The \(p\)-degree splines with fixed knots are a very flexible set of functions for \(\mu(x)\). Because we are free to choose the knots, we can choose how many there are and place them wherever we please. More knots means more flexibility, and so we could even choose to add more in \(x\) regions where we think that \(\mu(x)\) varies more. These fixed-knot splines are also sometimes called regression splines and, just as in any polynomial regression problem where we might add more terms to better fit the data, with regression splines we can add more knots for the same reason and in a more targeted fashion.
This can be used for any \(x\); all we need are the coefficient estimates and the knot values to make a prediction at any \(x\).
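As a minimal sketch of such a prediction, using the truncated power basis of the previous section (the coefficient vector beta here is hypothetical):
# Evaluate a cubic spline from its coefficients and knots using the
# truncated power basis; beta is a hypothetical (K + 4)-vector
# (beta_0, ..., beta_3, beta_4, ..., beta_{K+3}).
mu_hat <- function(xnew, beta, knots) {
h <- outer(xnew, knots, function(x, k) pmax(x - k, 0)^3) # (x - k_j)_+^3
drop(cbind(1, xnew, xnew^2, xnew^3, h) %*% beta)
}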
There is a splines package in R that contains the function bs() which will calculate an \({\bf X}\) matrix (excluding the intercept term unless it is specifically requested) corresponding to a B-spline basis for a \(p\)-degree spline at fixed knots. Unless the value of the argument degree is provided, the default spline will be cubic.
For example:
library(splines)
p <- 3
# Note that the knots here are the interior knots
# To match our previous fits, we might choose
knots_p <- quantile(x, seq(0.1, 0.9, 0.1))
Xmat <- bs(x, degree= p, knots=knots_p)
#
# This will be an N by (p + length(knots_p)) matrix
# the first few rows of which are
head(Xmat)
## 1 2 3 4 5 6
## [1,] 0 0.000000e+00 0.0008621395 0.2181317 0.671703715 0.1093024
## [2,] 0 0.000000e+00 0.0000000000 0.0000000 0.000000000 0.0000000
## [3,] 0 0.000000e+00 0.0000000000 0.0000000 0.000000000 0.0000000
## [4,] 0 0.000000e+00 0.0000000000 0.0000000 0.000000000 0.0000000
## [5,] 0 6.548316e-10 0.1361537090 0.6580432 0.205803081 0.0000000
## [6,] 0 3.487194e-02 0.6124390694 0.3499475 0.002741471 0.0000000
## 7 8 9 10 11 12
## [1,] 0.00000000 0.00000000 0.0000000 0.0000000000 0.000000000 0
## [2,] 0.00000000 0.02492778 0.6901975 0.2816912550 0.003183454 0
## [3,] 0.13117816 0.62430012 0.2435688 0.0009528792 0.000000000 0
## [4,] 0.05366229 0.53224850 0.4052864 0.0088027686 0.000000000 0
## [5,] 0.00000000 0.00000000 0.0000000 0.0000000000 0.000000000 0
## [6,] 0.00000000 0.00000000 0.0000000 0.0000000000 0.000000000 0
To see what these basis functions look like for our fitted model, we can plot them as a function of \(x\).
Xorder <- order(x)
blim <- extendrange(Xmat)
parOptions <- par(mfrow = c(2,2))
for (j in 1:ncol(Xmat)) {
plot(x[Xorder], Xmat[Xorder,j],
type="l",
ylim=blim,
xlim = extendrange(x),
xlab="x", ylab="Basis",
main=paste("Basis vector", j),
col="steelblue")
}
par(parOptions)
Clearly, the basis functions are not polynomials. The estimated smooth \(\widehat{\mu}(x)\) will be a linear combination of these functions.
With this \({\bf X}\) matrix in hand, we can now fit the cubic spline to this data:
fit <- lm(y ~ bs(x, degree= p, knots=knots_p))
# Let's get predictions at a lot of x-values equi-spaced
# across the range of x so that we can see how smooth
# the fit is
xrange <- extendrange(x)
xnew <- seq(min(xrange), max(xrange), length.out=500)
ypred <- predict(fit,newdata=data.frame(x=xnew))
which can be plotted as
The linear combination of the basis functions (excluding an intercept) is given by the fitted coefficient estimates, namely:
coef(fit)
## (Intercept) bs(x, degree = p, knots = knots_p)1
## -1.4686086 0.6263294
## bs(x, degree = p, knots = knots_p)2 bs(x, degree = p, knots = knots_p)3
## 0.3132570 0.4536932
## bs(x, degree = p, knots = knots_p)4 bs(x, degree = p, knots = knots_p)5
## 0.6135543 0.5744901
## bs(x, degree = p, knots = knots_p)6 bs(x, degree = p, knots = knots_p)7
## 0.6375489 2.6419658
## bs(x, degree = p, knots = knots_p)8 bs(x, degree = p, knots = knots_p)9
## 1.7455348 2.0888355
## bs(x, degree = p, knots = knots_p)10 bs(x, degree = p, knots = knots_p)11
## 2.3340356 2.8483187
## bs(x, degree = p, knots = knots_p)12
## 2.6743683
Note also that we need not have fit this via least-squares:
library(MASS) # rlm() and the psi functions come from MASS
fit <- rlm(y ~ bs(x, degree= p, knots=knots_p),
psi=psi.bisquare)
# Let's get predictions at a lot of x-values equi-spaced
# across the range of x so that we can see how smooth
# the fit is
ypred <- predict(fit,newdata=data.frame(x=xnew))
Note that in specifying so many interior knots, the cubic spline model has used up \(1 + 3 + 9 = 13\) degrees of freedom for the model. Another way to specify the B-spline basis is to provide the number of degrees of freedom that the basis should have. When no intercept is included, this will simply be the degree \(p\) plus the number of interior knots. So, as long as the degrees of freedom for the model are greater than the degree of the piecewise polynomial, the number of knots can be determined. By default, these will be placed at equispaced quantiles.
For example, suppose we choose the degrees of freedom to be 4 for the model (excluding the intercept) there will be exactly \(4 - 3 = 1\) interior knot and it will be located at the median of the \(x\) values.
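A quick check of this bookkeeping (bs() records its interior knots in the "knots" attribute of the matrix it returns):
B <- bs(x, df = 4) # cubic by default, so 4 - 3 = 1 interior knot
ncol(B)            # 4
attr(B, "knots")   # a single knot, at the median of the x values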
Here’s what that would look like for a few values of the degrees of freedom:
fit1 <- lm(y ~ bs(x, degree= 3, df=4))
fit2 <- lm(y ~ bs(x, degree= 3, df=5))
fit3 <- lm(y ~ bs(x, degree= 3, df=8))
ypred1 <- predict(fit1,newdata=data.frame(x=xnew))
ypred2 <- predict(fit2,newdata=data.frame(x=xnew))
ypred3 <- predict(fit3,newdata=data.frame(x=xnew))
plot(x,y,
col="grey80", pch=19, cex=0.5,
main = "Varying degrees of freedom")
lines(xnew, ypred1, col="steelblue", lwd=2, lty=1)
lines(xnew, ypred2, col="firebrick", lwd=2, lty=2)
lines(xnew, ypred3, col="darkgreen", lwd=2, lty=2)
legend(min(xrange), max(extendrange(y)),
legend=c("df=4","df=5", "df=8"),
lty=c(1,2,2), lwd=2,
col=c("steelblue", "firebrick", "darkgreen")
)
As we have seen in several examples already, polynomial functions can be very wild at the end of the range of the \(x\) data and beyond. With \(p\)-degree splines, the polynomials fit to the outside edge of the range of the \(x\) data are likely to be even wilder since they are based on many fewer points.
To help address this problem, the spline is sometimes constrained to be only a linear function beyond the largest and smallest knots. For a \(p\)-degree spline (with odd \(p\)), we define a natural spline with knots at \(k_1 < k_2 < \cdots < k_K\) to be a function \(\mu(x)\) such that
\(\mu(x)\) is a polynomial of degree \(p\) in each interior neighbourhood \([k_1, k_2], \ldots, [k_{K-1}, k_K]\),
\(\mu(x)\) is a polynomial of degree \((p-1)/2\) on \((- \infty, k_1]\) and on \([k_K, \infty)\), and
\(\mu(x)\) is continuous and has continuous derivatives of order \(1, \ldots, (p-1)\) at its knots \(k_1, \ldots, k_K\).
This forces the polynomials at either end to fluctuate less by severely reducing their degree. Perhaps the most common natural spline used is the natural cubic spline, which forces the fit on either end to be a straight line.
Recall that the model degrees of freedom associated with a \(p\)-degree spline was \(K + p +1\), where \(K\) was the number of knots. In the case of a natural \(p\) degree spline the model degrees of freedom are \[\underbrace{(p + 1) (K -1)}_{\begin{array}{c}interior \\ neighbourhoods \end{array}} + \underbrace{2 \left(1 + \frac{p-1}{2} \right)}_{\begin{array}{c}exterior \\ neighbourhoods \end{array}} - \underbrace{Kp}_{\begin{array}{c}continuity \\ constraints \end{array}} = K.\] This means that the model degrees of freedom for natural splines depends only on the number of knots! Which is, you have to admit, pretty amazing.
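For example, the natural cubic spline fit below with \(9\) interior knots plus the \(2\) boundary knots added by ns has \(K = 11\) knots in all, and hence \(11\) model degrees of freedom.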
In the splines package a basis matrix is provided by the function ns(...), and only for natural cubic splines. Degree 3 polynomials are by far the most common and, as it turns out, are all we really need in most circumstances (see “smoothing splines” below).
In R, in addition to any interior knots provided by the user (or determined automatically from user-supplied arguments), two additional boundary knots, \(k_0\) and \(k_{K+1}\), may be supplied (\(k_0 < k_1\) and \(k_{K+1} > k_K\)). These determine the points beyond which the lower degree polynomials are fit. By default, ns chooses the boundary knots at the minimum and maximum \(x\) values in the data.
Previously we fit a cubic spline to this data with knots at
knots_p
## 10% 20% 30% 40% 50% 60%
## 0.07138955 0.18966397 0.29106870 0.38030525 0.45015999 0.53857863
## 70% 80% 90%
## 0.63629916 0.77296504 0.86884995
And considered what the basis functions looked like. Analogously, we can plot the natural spline basis functions for this data as well.
Xmat.ns <- ns(x, knots=knots_p)
blim <- extendrange(Xmat.ns)
parOptions <- par(mfrow = c(2,2))
for (j in 1:ncol(Xmat.ns)) {
plot(x[Xorder], Xmat.ns[Xorder,j],
type="l",
ylim=blim,
xlim = extendrange(x),
xlab="x", ylab="Basis",
main=paste("ns basis vector", j),
col="firebrick")
}
par(parOptions)
We can also compare the two fits that would result for our fake data.
fit.bs <- lm(y ~ bs(x, degree= p, knots=knots_p))
fit.ns <- lm(y ~ ns(x, knots=knots_p)) # no degree argument since
# these must be cubic
# And now the predicted values for plotting
ypred.bs <- predict(fit.bs, newdata= data.frame(x=xnew))
ypred.ns <- predict(fit.ns, newdata= data.frame(x=xnew))
The straight line fits at the end of the natural cubic splines can be seen in the plot. Note the essential agreement between the two nearly everywhere else.
Now, fit.bs used model degrees of freedom fit.bs$rank = 13 in building its fit whereas the natural spline fit.ns used only fit.ns$rank = 11 degrees of freedom. Note that ns added two boundary knots to the 9 we provided it, hence the 11 degrees of freedom for this model. The natural spline has two fewer degrees of freedom. We might “spend” these two degrees of freedom on the placement of two more interior knots in the natural spline.
Suppose we use these extra model degrees of freedom by having two more interior knots, say between the 60% and 70% quantiles. This should allow us to fit the abrupt change in the middle a little better.
knots_p2 <- quantile(x,
c(seq(0.1,0.6,0.1),
0.63, 0.65,
seq(0.7,0.9,0.1)
)
)
fit.ns2 <- lm(y ~ ns(x, knots=knots_p2))
# And now the predicted values for plotting
ypred.ns2 <- predict(fit.ns2,
newdata= data.frame(x=xnew))
Note how we were able to match degrees of freedom with the cubic spline and get a little higher peak (and consequent dip) where we placed our extra knots.
Knot selection can be avoided by recasting the problem. We still would like our estimated mean function \(\widehat{\mu}(x)\) to produce a small residual sum of squares \[\sum_{i=1}^N \left(y_i - \mu(x_i) \right)^2 \] but rather than specify a parametric form for the allowable shapes of \(\mu(x)\) (e.g. polynomial in \(x\), a cubic spline on \(K\) specified knots, etc.), we might prefer to just somehow constrain \(\mu(x)\) to be relatively “smooth”. To do so we need some measure of “smoothness”.
Rather than smoothness, which is what we want, it might be easier to define what we don’t want, namely a non-smooth or “rough” function. But how to define “rough”?
Suppose that \(\mu(x)\) is at least twice differentiable. Then the first derivative \(\mu^\prime(x)\) measures the slope of the function at any point \(x\). A function of any slope can be smooth. However, if that slope changes frequently or abruptly, then the function would not be smooth but rather rough.
The second derivative \(\mu^{\prime\prime}(x)\) measures how quickly the slope changes at any given point \(x\). If this is large (positive or negative) then the slope is changing quickly. Large values of \(\left(\mu^{\prime\prime}(x) \right)^2\) indicate that there is an abrupt change in slope at the point \(x\). One possible measure of roughness then might be \[\int \left(\mu^{\prime\prime}(t) \right)^2 dt\] as the average of \(\left(\mu^{\prime\prime}(x) \right)^2\) over all \(x\). The smaller is the average, the smoother is \(\mu(x)\).
One way to proceed would be to find \(\widehat{\mu}\) that minimizes \[RSS({\boldsymbol \mu}, \lambda) = \sum_{i=1}^N \left(y_i - \mu(x_i) \right)^2 + \lambda \int \left(\mu^{\prime\prime}(t) \right)^2 dt\] with \(\lambda \ge 0\). This is a penalized residual sum of squares: the first term is the residual sum of squares; the second term is a penalty function which is larger the rougher \(\mu(x)\) is.
Alternatively, had we been using a Gaussian generative model for the residuals, then we might cast the penalty function in probabilistic terms. For example, we have \[ Y_i \sim N(\mu(x_i), \sigma^2) \] as the generative model for the response \(Y\) conditional on the explanatory variate \(x_i\). Seeing that \(x_i\) appears only through the function \(\mu(x_i)\), this is in some sense the conditional distribution of \(Y\) given both \(x_i\) and \(\mu(\cdot)\). We might still want to condition on \(x_i\) but would like to constrain the functional form of \(\mu(\cdot)\) in some way. A useful fiction that helps us think about this might be to imagine that the \(\mu(\cdot)\) that has generated the observed responses was itself randomly drawn from some collection of possible functions \(\mu(\cdot)\). Because we think that we are more likely to have been served up a smooth \(\mu(\cdot)\) than a rough one, we might imagine that the probability of any particular function \(\mu(\cdot)\) is proportional to \[ e^{-\lambda \int \left(\mu^{\prime\prime}(t) \right)^2 dt}. \] Clearly functions with larger (average over the whole line) changes in the slope have lower probability; smoother \(\mu(\cdot)\) have higher probability. If this were indeed a probability, then we might apply Bayes’s theorem and find the conditional distribution of \(\mu(x_i)\) given \(Y_i=y_i\) and \(x_i\). We could then choose \(\widehat{\mu}(x)\) to maximize this probability. This turns out to be equivalent to minimizing the penalized residual sum of squares.
Given the Gaussian generative model, the penalized residual sum of squares may be thought of as a so-called “Bayesian” method in that it turns out to be the objective function which results from maximizing the posterior probability of \(\mu(\cdot)\) (in this language the marginal probability distribution of \(\mu(\cdot)\) is called the prior distribution because it is available prior to any data). Alternatively, the objective function is simply a penalized log-likelihood function whose penalty might have been constructed (or not) by imagining a prior distribution for \(\mu(\cdot)\).
We might also recognize the penalized residual sum of squares in this case as the objective function that would result from using \(\lambda\) as a Lagrange multiplier when minimizing the residual sum of squares subject to the constraint that \(\int \left(\mu^{\prime\prime}(t) \right)^2 dt = 0\). The constraint would enforce complete smoothness (i.e. a straight line).
Note, for example, that however one might think of the penalized residual sum of squares, it is clear that the value taken by \(\lambda\) (which is still our choice) determines the smoothness of our estimated \(\mu(x)\). If, for example, \(\lambda = \infty\) then no change in the slope is allowed and we have the smoothest possible function, a line. Indeed, we would have the least-squares fitted line obtained by minimizing the first term alone. At the other extreme, if we had \(\lambda=0\), then only the residual sum of squares would need to be minimized. With no further restrictions on \(\mu\), we would have \(\widehat{\mu}(x_i) = y_i\), which amounts to connecting the dots in the scatterplot from left to right (assuming no ties among the \(x_i\); otherwise we average \(y\) at those points having the same value of \(x\)).
It turns out that for any fixed \(\lambda\), the solution to this penalized residual sum of squares problem is a natural cubic spline with knots at every unique value of the \(x_i\)! That is, the solution requires \(\mu(x)\) to be of the form \[\mu(x) = \sum_{j=1}^N N_j(x) \beta_j\] where the \(N_j(x)\) are a set of \(N\) basis functions for this family of natural cubic splines.
Knowing that the solution must be a natural cubic spline of the above form, we can rewrite the penalized residual sum of squares as \[RSS({\boldsymbol \mu}, \lambda) = \left( {\bf y} - {\bf N} {\boldsymbol \beta} \right)^T\left( {\bf y} - {\bf N} {\boldsymbol \beta} \right) + \lambda ~{\boldsymbol \beta}^T {\boldsymbol \Omega}_N {\boldsymbol \beta} \] where \({\bf N} = \left[N_{ij} \right]\) is an \(N \times N\) matrix whose \((i,j)\) element is \(N_{ij} = N_j(x_i)\), the \(j\)th natural cubic spline basis function evaluated at \(x_i\), and \({\boldsymbol \Omega}_N = \left[\omega_{ij} \right]\) is an \(N \times N\) matrix whose \((i,j)\) element is \(\omega_{ij} = \int N^{\prime \prime}_i(t) N^{\prime \prime}_j(t) dt\). The solution is now easily found to be \[\widehat{\boldsymbol \beta} = \left({\bf N}^T{\bf N} + \lambda {\boldsymbol \Omega}_N \right)^{-1} {\bf N}^T {\bf y}\] and the fitted smoothing spline is \[\widehat{\mu}(x) = \sum_{j=1}^N N_j(x) \widehat{\beta}_j.\]
Looking at this solution, it would appear to be overparameterized – there are as many \(\beta_j\)s as there are observations \(y_i\). This solution however has been constrained, to a greater or lesser extent according to the value of \(\lambda\). No longer are the number and location of knots chosen to make the function smoother or rougher (for the smoothing spline the knots must be at the unique \(x_i\)); instead we now choose a value for the smoothing parameter \(\lambda\). The larger \(\lambda\) is, the smoother is the resulting estimated function \(\widehat{\mu}(x)\).
But how should we choose \(\lambda\)? One way would be to somehow connect the value of \(\lambda\) to a measure of the complexity (or roughness) of the fitted model.
A traditional measure of the complexity of a linear model is the number of linear parameters in that model. This in turn was equivalent to the rank of the \(\bf X\) matrix, or equivalently of the “hat-matrix” \({\bf H} = {\bf X} \left({\bf X}^T{\bf X}\right)^{-1}{\bf X}^T\).
Recall that the role of the hat matrix is to determine the \(N\) dimensional fitted mean vector \(\widehat{\boldsymbol \mu} ={\bf H}{\bf y}\), which it does by orthogonally projecting the vector \({\bf y}\) onto the space spanned by the columns of \({\bf X}\), namely \(colsp({\bf X})\).
Note that \(colsp({\bf H}) = colsp({\bf X})\), that is, it is the same space. The difference is that the columns of \({\bf X}\) form a basis for that space (assuming that \({\bf X}\) has full column rank) whereas the columns of \({\bf H}\) are generators of the space but there are too many of them to be a basis. There are \(N\) generators in \({\bf H}\) when we need only \(r = rank({\bf X})\), which is typically smaller than \(N\).
One way to get a set of \(r\) basis vectors from \({\bf H}\) is to find its eigen-decomposition. Suppose we do that and that the ordered eigen-values are \(\rho_1 \ge \rho_2 \ge \cdots \ge \rho_N \ge 0\). Let the corresponding eigen-vectors be \({\bf u}_1, \ldots, {\bf u}_N\). Then the decomposition gives \[{\bf H} = \sum_{i=1}^N \rho_i {\bf u}_i {\bf u}_i^T \] and so \[ \begin{array}{rcl} \widehat{\boldsymbol \mu} & = & \sum_{i=1}^N \rho_i {\bf u}_i {\bf u}_i^T {\bf y}\\ && \\ &=& \sum_{i=1}^N {\bf u}_i \rho_i <{\bf u}_i ~,~ {\bf y}> \\ \end{array} \] where \(<\cdot, \cdot>\) indicates the inner product of its vector arguments. An interpretation of the last piece is that the vector \({\bf y}\) is first decomposed with respect to the orthonormal basis \(\{ {\bf u}_1, \ldots, {\bf u}_N \}\) for \(\Re^N\). Each \(\rho_i\) moderates the contribution of each piece to the corresponding basis vector \({\bf u}_i\). Now, the nature of \({\bf H}\) is that \[\rho_i = \left\{ \begin{array}{lcl} 1 &~~~~~~& i=1, \ldots, r \\ && \\ 0&& i=(r+1), \ldots, N. \end{array} \right. \] Being a projection matrix (idempotent), the eigen-values \(\rho_i\) of \({\bf H}\) either select a basis vector (\(\rho_i =1\)) or do not (\(\rho_i=0\)). Note that the dimension of this space is the sum of the eigenvalues, or the trace, \(tr({\bf H})\), of \({\bf H}\).
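A quick numerical check of this on our data, using a quadratic regression matrix so that \(r = 3\):
# The eigen-values of a hat matrix are 1 (r of them) and 0 (the rest),
# and their sum, tr(H), is the rank of X.
X <- cbind(1, x, x^2)
H <- X %*% solve(t(X) %*% X, t(X))
zapsmall(eigen(H, symmetric = TRUE)$values[1:6]) # three 1s, then 0s
sum(diag(H)) # tr(H) = 3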
When fitting cubic splines, instead of \({\bf X}\) we had a basis matrix \({\bf B}\), and corresponding projection matrix \({\bf H}_B = {\bf B} \left({\bf B}^T{\bf B}\right)^{-1}{\bf B}^T\). When using bs(...) or ns(...) in R we could specify the degrees of freedom df and have the function choose the appropriate number (and location) of the knots with which to build \({\bf B}\). The question is whether we can parameterize a smoothing spline in a similar way.
A smoothing spline estimates the mean value at the points \(x_1, \ldots, x_N\) by \[\begin{array}{rcl} \widehat{\boldsymbol \mu} & = & \left(\widehat{\mu}(x_1), \widehat{\mu}(x_2), \ldots \widehat{\mu}(x_N) \right)^T \\ &&\\ & = & {\bf N} \widehat{\boldsymbol \beta} \\ &&\\ &= & {\bf N}\left({\bf N}^T{\bf N} + \lambda {\boldsymbol \Omega}_N \right)^{-1} {\bf N}^T {\bf y} \\ &&\\ &=& {\bf S}_{\lambda} {\bf y}~, ~~~\mbox{say.}\\ \end{array} \] So \({\bf S}_{\lambda}\) acts here like the hat-matrix \({\bf H}_B\) for a cubic spline. Unfortunately, it is not idempotent and hence not a projection matrix.
We can however see what it’s made of. Note that \({\bf N}\) is \(N \times N\) and of full rank \(N\).
\[\begin{array}{rcl}
{\bf S}_{\lambda}
& = &
{\bf N}\left({\bf N}^T{\bf N} + \lambda {\boldsymbol \Omega}_N \right)^{-1} {\bf N}^T
\\
&&\\
& = &
\left( {\bf N}^{-T} \left( {\bf N}^T{\bf N} + \lambda {\boldsymbol \Omega}_N \right) {\bf N}^{-1} \right)^{-1}
\\
&&\\
& = &
\left( I_N
+ \lambda {\bf N}^{-T}{\boldsymbol \Omega}_N {\bf N}^{-1} \right)^{-1}
\\
&&\\
& = &
\left( I_N + \lambda {\bf K} \right)^{-1}
\\
\end{array}
\] where \({\bf K} = {\bf N}^{-T}{\boldsymbol \Omega}_N {\bf N}^{-1}\). Note that \({\bf K}\) does not involve the smoothing parameter \(\lambda\). Note also that the objective function being minimized can now be rewritten as \[ RSS({\boldsymbol \mu}, \lambda) = \left( {\bf y} - {\boldsymbol \mu} \right)^T\left( {\bf y} - {\boldsymbol \mu} \right) + \lambda ~{\boldsymbol \mu}^T {\bf K} {\boldsymbol \mu}.
\] Consequently, \({\bf K}\) is sometimes called the penalty matrix.
If \({\bf K} = {\bf V}{\bf D}{\bf V}^T\) is the eigen decomposition of the (real symmetric) matrix \({\bf K}\) with \({\bf D} = diag(d_1, \ldots, d_N)\) and \(d_1 \ge \cdots \ge d_N \ge 0\), then it would appear that the components of \({\boldsymbol \mu}\) in directions of eigen-vectors of \({\bf K}\) are to be penalized more when they correspond to large eigen-values \(d_i\) than when they correspond to small eigen-values \(d_i\).
Note that both any constant function \(\widehat{\mu}_1(x) = a\) and any straight line function \(\widehat{\mu}_2(x) = a + bx\) are in the space spanned by the natural spline basis functions \(N_i(x)\). Hence there is one non-zero linear combination of these basis functions that leads to the constant and another that leads to a straight line. This implies that the two smallest eigen-values of \({\bf K}\) are \(d_{N-1}=d_N=0\).
This can be made more apparent by examining the solution vector \(\widehat{\boldsymbol \mu} = {\bf S}_{\lambda} {\bf y}\). First, note that \[{\bf S}_{\lambda} = {\bf V} \left( I_N + \lambda {\bf D}\right)^{-1} {\bf V}^T\] is the eigen decomposition of \({\bf S}_{\lambda}\) having eigen-values \[\rho_i(\lambda) = \frac{1}{1 + \lambda d_{N-i+1}} \] for \(i=1, \ldots, N\). Large values of \(d_i\) will produce small values of \(\rho_i(\lambda)\). Similarly, large values of the smoothing parameter \(\lambda\) will produce small values of \(\rho_i(\lambda)\).
A closer look at \(\widehat{\boldsymbol \mu}\) reveals \[ \begin{array}{rcl} \widehat{\boldsymbol \mu} & = & \sum_{i=1}^N \rho_i(\lambda) {\bf v}_i {\bf v}_i^T {\bf y}\\ && \\ &=& \sum_{i=1}^N {\bf v}_i \rho_i(\lambda) <{\bf v}_i ~,~ {\bf y}> \\ \end{array} \] where the \(i\)th eigen-vector \({\bf v}_i\) here is of \({\bf S}_{\lambda}\) and corresponds to its \(i\)th largest eigen-value \(\rho_i(\lambda)\). This is the reverse order of the eigen-vectors of \({\bf K}\) in that the largest \(\rho_i(\lambda)\) corresponds to the smallest \(d_i\).
This is a lot like the relationship we saw for \(p\)-order splines. The difference here is that the eigen-values \(\rho_i(\lambda)\) are not just zero or one. The two largest are 1 (corresponding to \(d_1=d_2=0\)) and the rest are less than one.
With \({\bf H}_B\) of the ordinary \(p\)th order splines, the eigen-values of \({\bf H}_B\) selected the components of \({\bf y}\) in the directions of the eigen-vectors corresponding to the largest eigen-values (namely 1) and dropped the components in the directions corresponding to the smallest eigen-values (namely 0). That is the nature of a projection operator and so this kind of spline is sometimes called a projection smoother.
In contrast, the effect of \({\bf S}_{\lambda}\) on \({\bf y}\) is to shrink the components of \({\bf y}\) in the directions of its eigen-vectors. It shrinks more in those directions with small eigen-values (\(\rho_i(\lambda)\)) and less in directions with large ones. Unlike \({\bf H}_{B}\), \({\bf S}_{\lambda} \times {\bf S}_{\lambda} \ne {\bf S}_{\lambda}\) and hence \({\bf S}_{\lambda}\) is not a projection matrix. Instead, \({\bf S}_{\lambda} \times {\bf S}_{\lambda} \preceq {\bf S}_{\lambda}\) – the product has smaller eigen-values than the original. Because of this shrinkage, the smoothing spline is sometimes called a shrinking smoother.
Analogous to the model degrees of freedom of a projection smoother being equivalent to the trace of its projection matrix, \(tr({\bf H}_B)\), we will take the effective degrees of freedom to be the trace of the smoother matrix, \(tr({\bf S}_{\lambda})\). Both are the sums of their respective eigen-values.
For the smoothing spline, then, the effective degrees of freedom can be expressed as \[df_{\lambda} = tr({\bf S}_{\lambda}) = \sum_{i=1}^N \frac{1}{1 + \lambda d_i}.\] This in turn means that rather than specify the smoothing parameter \(\lambda\), we could specify the effective degrees of freedom \(df_{\lambda}\) and solve the above equation for \(\lambda\).
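As a minimal sketch of that inversion (assuming d holds the eigen-values \(d_i\) of the penalty matrix; the helper name is ours, not R's):
# Solve tr(S_lambda) = sum(1/(1 + lambda * d_i)) = df for lambda;
# the left side decreases in lambda from N down to the number of zero d_i.
lambda_for_df <- function(df, d) {
uniroot(function(lambda) sum(1 / (1 + lambda * d)) - df,
interval = c(0, 1e12))$root
}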
Now as \(\lambda \rightarrow 0\), \(df_{\lambda} \rightarrow N\) and \({\bf S}_{\lambda} \rightarrow I_N\). The result is a perfect (and likely very non-smooth) fit. Similarly, as \(\lambda \rightarrow \infty\), \(df_{\lambda} \rightarrow 2\) (corresponding to \(d_{N-1}=d_N=0\)) and \({\bf S}_{\lambda} \rightarrow {\bf H}\), the hat matrix for a straight line regression of \(y\) on \(x\).
For \(\lambda\) values in between these two extremes, as \(\lambda\) increases we have greater shrinkage of the eigen-values \(\rho_i(\lambda)\) and a lower value of the effective degrees of freedom \(df_{\lambda}\). Again, the shrinkage is greatest in the direction of the eigen-vectors \({\bf v}_i\) corresponding to the smallest eigen-values of \({\bf S}_{\lambda}\).
To get some sense of what basis functions these correspond to, we could get the eigen decomposition of \({\bf S}_{\lambda}\) for our fake data example.
In R we will first fit the smoothing spline on the data set using \(df_{\lambda} = 11\) to try to match our earlier natural spline fit. This is accomplished using the function smooth.spline(...) as follows.
df <- 11
sm <- smooth.spline(x, y, df = df)
# To plot this we can look at the predicted values for
# any vector x, say our earlier xnew.
ypred.sm <- predict(sm, x=xnew)$y
plot(x,y,
col="grey80", pch=19, cex=0.5,
main = paste("Smoothing spline, df =", df)
)
lines(xnew, ypred.sm, col="steelblue", lwd=2)
Unfortunately the smoother matrix is not directly available from the fitted smooth in R. We can however access it a little less directly by recognizing that we can pick off its \(i\)th column by choosing the \({\bf y}\) vector to be fitted to be the vector \({\bf e}_i = (0, \ldots, 0, 1, 0, \ldots, 0)^T\) which is zero everywhere but in its \(i\)th coordinate where it is 1. A little R function can do this for us.
smootherMatrix <- function(x,df) {
n <- length(x)
S <- matrix(0, n, n)
for (i in 1:n) {
ei = rep(0, n)
ei[i] <- 1
# insert the fit into the i'th column of S
S[,i] <- predict(smooth.spline(x, ei, df=df), x)$y
}
# To make sure the result is (numerically) symmetric
S <- (S + t(S))/2
# and return the symmetric matrix
S
}
We are now in a position to get \({\bf S}_{\lambda}\) for our example and investigate its eigen properties.
# First get S
S <- smootherMatrix(x, 11)
# and then its eigen decomposition
eigS <- eigen(S)
# We can look at its eigen values in a plot
plot(eigS$values, type="b", col="steelblue",
main="Plot of the eigen values",
pch=19,
ylab="rho", xlab="i")
We see that the \(\rho_i(\lambda)\) values drop off very quickly, indicating that the components of \({\bf y}\) in the direction of the smallest eigen-values are shrunk a great deal (effectively made zero). Note also that the eigen-values do seem to sum to our intended effective degrees of freedom \(df_{\lambda}\): sum(eigS$values) = 11.0018443.
So, the question is, what do the eigen-vectors corresponding to the largest eigen-values look like? And how do they compare to those corresponding to the small eigen-values? To that end, we plot them as a curve evaluated at the \(x\) values in our data set.
First, let’s look at the curve for some small values.
plotEigenBases <- function(x,
eigenDecomp,
indices){
Xorder <- order(x)
orderedX <- x[Xorder]
vectors <- eigenDecomp$vectors
ylim <- range(vectors)
values <- eigenDecomp$values
for (i in indices){
plot(orderedX, vectors[Xorder, i],
type="l", col=adjustcolor("steelblue", 0.7),
xlab="x", ylab="eigen function",
ylim = ylim,
main=paste("i = ", i, "; rho =", round(values[i],4)))
}
}
parOptions <- par(mfrow=c(2,2))
plotEigenBases(x, eigS, 1:4)
par(parOptions)
The next few are
parOptions <- par(mfrow=c(2,2))
plotEigenBases(x, eigS, 5:8)
par(parOptions)
And how about a few farther out.
parOptions <- par(mfrow=c(2,2))
plotEigenBases(x, eigS, c(10, 20, 30, 40))
par(parOptions)
parOptions <- par(mfrow=c(2,2))
plotEigenBases(x, eigS, c(50, 60, 150, 200))
par(parOptions)
As these plots indicate, the eigen-vectors contribute bumpier and bumpier basis vectors as the associated eigen-values diminish. The smoother radically downweights according to the corresponding eigen-value \(\rho_i(\lambda)\).
It does seem that the smaller the effective degrees of freedom, the smoother the fitted function \(\widehat{\mu}(x)\), just as is the case with the usual model degrees of freedom.
More formally, axioms for an effective dimension of a finite collection of vectors, say the \(N\) column vectors of a matrix \({\bf M} = \left[ {\bf m}_1, \ldots, {\bf m}_N \right]\) have been proposed (Oldford 1985) which are satisfied by the function \[d_{\alpha}\left(\left\{{\bf m}_1, \ldots, {\bf m}_N\right\}\right) = d_{\alpha}\left({\bf M}\right) = \sum_{i=1}^{N}\left( \frac{\gamma_i}{\gamma_{max}}\right)^{\alpha}\] for at least \(\alpha=1,2\). Here \(\gamma_i\) is the \(i\)th largest singular value of \({\bf M}\) and \(\gamma_{max} = max_i ~ \gamma_i\). This “\(d_\alpha\) effective dimension” also arises in a variety of estimation and diagnostic problems related to linear response models. For example, \(d_{\alpha}({\bf H}) = p\) for all \(\alpha \ne 0\).
The effective degrees of freedom are therefore also an effective dimension of the set of column (or row) vectors of \({\bf S}_{\lambda}\) when \(\alpha=1\).
The splines we have used above have been designed to fit a curve to a single explanatory variate \(x\). But what if we have more than one explanatory variate?
Suppose for example that we have only two explanatory variates \(x\), and \(z\) say.
One way to proceed would be to use an additive model \[ \mu(x,z) = \beta_0 + f_1(x) + f_2(z) \] and use a penalty function \[\left(\int f_1^{\prime \prime}(t)^2 dt \right) + \left(\int f_2^{\prime \prime}(t)^2 dt \right).\] The resulting minimization is obtained when each \(f_i(\cdot)\) is itself a univariate spline. This result extends to any number of explanatory variates.
Another suggestion would be to choose a set of basis functions from existing univariate ones. For example, if we have a set of \(m\) basis functions \(b_1(x), \ldots, b_m(x)\) for \(x\) and another set of \(n\) basis functions \(c_1(z), \ldots, c_n(z)\) for \(z\), then we could introduce basis functions \[g_{jk}(x,z) = b_j(x) \times c_k(z)\] for \(j=1, \ldots, m\) and \(k=1, \ldots, n\), a so-called tensor product basis. Then \[\mu(x,z) = \beta_0 + \sum_{j=1}^m \sum_{k=1}^n \beta_{jk} g_{jk}(x,z)\] would be fitted via least-squares as before.
Clearly both of these methods generalize to any number of dimensions but the effective model degrees of freedom grows multiplicatively with the number of explanatory variates using the tensor product bases and only linearly using additive bases.
One dimensional splines can also be generalized to higher dimensions via an appropriate penalty for high curvature in \(\mu({\bf x})\) with \({\bf x} \in \Re^d\).
For example, when \(d=2\) we could choose our roughness penalty as \[ \int \int \left[ \left(\frac {\partial^2 \mu(x,z)} {\partial x^2} \right)^2 + 2 \left(\frac {\partial^2 \mu(x,z)} {\partial x \partial z} \right)^2 + \left(\frac {\partial^2 \mu(x,z)} {\partial z^2} \right)^2 \right] dx dz. \] Minimizing the residual sum of squares plus \(\lambda\) times this penalty function leads to a smooth two dimensional surface called a thin-plate spline.
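All three constructions are available in R, for example via the mgcv package. A sketch only (z is a hypothetical second explanatory variate):
library(mgcv)
fit_add <- gam(y ~ s(x) + s(z))        # additive model: one smooth per variate
fit_ten <- gam(y ~ te(x, z))           # tensor product basis
fit_tps <- gam(y ~ s(x, z, bs = "tp")) # thin-plate spline penalty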
Compared to fitting splines with prespecified knots, smoothing splines seem to fit the data more locally in that they have knots at every unique point \(x_i\) in the data. Another approach to finding a flexible function to fit the data would be to focus on each point \(x\) and fit a mean function at that point based on its local neighbourhood.
The simplest way to proceed might again be to fit a local average at every value \(x=x_i\) in the data set. Estimates at other values of \(x\) might be found by simple linear interpolation between these fitted values (i.e. simply “connect the dots”).
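For the interpolation step, R's approxfun will “connect the dots” for us. A sketch, assuming mu_hat holds the local averages fitted at each \(x_i\) (a hypothetical vector here):
# Piecewise linear interpolation between the fitted values;
# rule = 2 extends the end values flat beyond the data range.
mu_fun <- approxfun(x, mu_hat, rule = 2)
mu_fun(0.5)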
One way to define a local neighbourhood would be to find the \(k\) nearest neighbours of that \(x\) value. There exists a function called knn.reg in the R package FNN that will compute this average for every point \(x_i\) in the data.
require(FNN)
# Let's try a few values for k
#
knn.fit5 <- knn.reg(x, y=y, k=5)
knn.fit21 <- knn.reg(x, y=y, k=21)
knn.fit51 <- knn.reg(x, y=y, k=51)
Xorder <- order(x)
plot(x,y,
col="grey80", pch=19, cex=0.5,
main = "5 nearest neighbours")
lines(x[Xorder], knn.fit5$pred[Xorder],
col=adjustcolor("firebrick", 0.5),
lwd=2)
plot(x,y,
col="grey80", pch=19, cex=0.5,
main = "21 nearest neighbours")
lines(x[Xorder], knn.fit21$pred[Xorder],
col=adjustcolor("firebrick", 0.75),
lwd=2, lty=2)
plot(x,y,
col="grey80", pch=19, cex=0.5,
main = "51 nearest neighbours")
lines(x[Xorder], knn.fit51$pred[Xorder],
col=adjustcolor("firebrick", 0.75),
lwd=2, lty=5)
As was the case with piecewise neighbourhoods, the larger these local neighbourhoods are, the smoother the fitted function becomes.
Just as we did for the case of piecewise defined neighbourhoods, we might also replace the averages with any fitted model based on the \(k\) nearest neighbours of any location \(x\).
To that end, here is a little function that will allow us to experiment a bit.
library(FNN)
library(MASS) # rlm(), lmsreg(), ltsreg() and the psi functions come from MASS
# This function allows us to see how
# the local fits behave as we change
# various elements
#
KNNfit <- function(x, y, # the raw data
xloc=NULL, # the location where we want
# to fit.
k=5, # number of nearest neighbours
method=c("lm", # choice of fitting
"huber", # procedure
"hampel",
"bisquare",
"lms",
"lts"),
newplot=TRUE, # create a new plot or
# add to an existing one
showPoints=FALSE, # highlight points used?
pcol="red", # highlight colour
showWeights=FALSE, # the weights used to
# define the nbhd for
# the fit
wcol="pink", # weight colour
showLine=TRUE, # show the fitted
# line at xloc
fullLine=TRUE, # full or partial line
lcol="steelblue", # line colour
cex=0.5, # point size
pch=19, # point character
col="grey80", # point colour
lwd=2, # line width
... # other plot parameters
)
{
if (newplot) {
plot(x=x, y=y, cex=cex, pch=pch, col=col, ...)
}
data <- data.frame(x=x, y=y)
xrange <- extendrange(x)
yrange <- extendrange(y)
# default location will be the median
if (is.null(xloc)) {
xloc <- median(x)
}
# Get the k nearest neighbours here
neighbours <- get.knnx(x, query=xloc, k=k)$nn.index
# Get the fit
method <- match.arg(method)
fit <- switch(method,
lm = {lm(y~x, data=data, subset=neighbours)},
huber = {rlm(y~x, data=data, subset=neighbours, psi=psi.huber)},
hampel = {rlm(y~x, data=data, subset=neighbours, psi=psi.hampel)},
bisquare = {rlm(y~x, data=data, subset=neighbours, psi=psi.bisquare)},
lms = {lmsreg(y~x, data=data, subset=neighbours)},
lts = {ltsreg(y~x, data=data, subset=neighbours)},
stop(paste(method, "method not supported."))
)
# Predicted values
pred <- predict(fit, newdata=data.frame(x=xloc))
# We can select these points via weights
if (showWeights){
N <- length(y)
weights <- rep(0, N)
weights[neighbours] <- 1
weights <- weights/max(weights)
ymin <- min(yrange)
ymax <- diff(yrange)/3
for (i in 1:length(weights)) {
lines(x=rep(x[i],2),
y=c(ymin, ymin + weights[i] * ymax),
col=wcol,
lty=1, lwd=1)
}
}
# Add the line if requested
if (showLine){
if (fullLine){
abline(a=fit$coefficients[1],
b=fit$coefficients[2],
col= lcol,
lwd=lwd)
} else
{
xline <- range(x[neighbours])
yline <- predict(fit, newdata=data.frame(x=xline))
lines(xline, yline,
col= lcol,
lwd=lwd)
}
}
# Colour points that were used
if (showPoints) {
points(x[neighbours], y[neighbours], col=pcol)
}
# show the points fitted
points(xloc, pred, pch=19, col=lcol)
# invisibly return the location of the fitted point
invisible(list(x=xloc, y=pred))
}
The above function is meant to show how the fit would be determined, and on which points it was constructed.
For example, suppose we want to consider the fitted values at the \(x\) location \(x=0.25\). Then we would call
# Fit at x= 0.25 using least-squares (lm) on k=5 nearest neighbours
xloc <- 0.25
k <- 5
# Show the fitted line and the points on which it was based.
parOptions <- par(mfrow = c(1,2))
KNNfit(x, y, xloc=xloc, k=k, method="lm",
fullLine=TRUE, showPoints=TRUE,
main = paste("least-squares", "fit at x =", xloc, "with k =", k))
KNNfit(x, y, xloc=xloc, k=k, method="lms",
fullLine=TRUE, showPoints=TRUE,
main = paste("LMS", "fit at x =", xloc, "with k =", k))
par(parOptions)
Note that the fit might produce a good location, even though the slope of the line used to get that location is somewhat surprising. That is because it is responding to very local structure (it is based only on the red points in the plot).
So what happens if we increase the size of the neighbourhood? How about \(k=30\)?
xloc <- 0.25
k <- 30
# Show the fitted line and the points on which it was based.
parOptions <- par(mfrow = c(1,2))
KNNfit(x, y, xloc=xloc, k=k, method="lm",
fullLine=TRUE, showPoints=TRUE,
main = paste("least-squares", "fit at x =", xloc, "with k =", k))
KNNfit(x, y, xloc=xloc, k=k, method="lts",
fullLine=TRUE, showPoints=TRUE,
main = paste("LTS", "fit at x =", xloc, "with k =", k))
par(parOptions)
Although the lines are still different, there is much better agreement in this case. When the neighbourhood is sized right so that a few points do not influence the outcome, these fitting methods should be in near agreement (especially as regards the fitted line at the single location xloc). If it is too small, a few points could dominate a least-squares fit. If it is too large, the underlying function may have changed enough that a simple straight line model might be too simple.
We might compare the fits over a wide range of \(x\) locations. We’ll do that now, but using only a short line segment to represent the fitted line at each location. The horizontal range of each line segment covers the neighbourhood on which the line was constructed.
First for least-squares and \(k=30\):
k <- 30
method <- "lm"
xrange <- range(x)
xlocs <- seq(min(xrange), max(xrange), length.out = 30)
# Get the first plot:
KNNfit(x, y, xloc=xlocs[1], k=k, method=method,
fullLine=FALSE, showPoints=FALSE,
lcol=adjustcolor("steelblue", 0.5),
main = paste(method, "fit with k =", k))
# Now add the rest in turn
for (i in 2:length(xlocs)) {
KNNfit(x, y, xloc=xlocs[i], k=k, method=method,
fullLine=FALSE, showPoints=FALSE,
lcol=adjustcolor("steelblue", 0.5),
# Note that we are adding to the existing plot now
newplot = FALSE,
main = paste(method, "fit with k =", k))
}
And now for lts and \(k=30\):
We can see that both sets of fits find the middle peak of the plot, but that the peak is higher for “lts” here because it ignores the rightmost points in that neighbourhood as outliers.
Another way to think about this local fitting is to imagine that all points in the data set are being used, except that those outside of the local neighbourhood have a zero weight (in the sense of weighted least squares).
For example, our least-squares fit only on the \(k\) nearest neighbours looks like a least-squares fit on all of the data but with weights that are 1 for points in the neighbourhood and zero for points outside the neighbourhood. The two lines could be seen as follows:
#
# Our data here are simply x and y
#
ourdata <- data.frame(x=x, y=y)
# The least-squares fit to all of the data
fit <- lm(y~x, data=ourdata)
alpha_hat <- fit$coefficients[1]
beta_hat <- fit$coefficients[2]
plot(x,y,
col="grey80", pch=19, cex=0.5,
main = "Locally weighted least-squares")
abline(a=alpha_hat, b=beta_hat,
col=adjustcolor("steelblue", 0.75),
lty=2, lwd=3)
KNNfit(x, y, xloc=xloc, k=30, method="lm",
fullLine=TRUE, showPoints=FALSE,
showWeights = TRUE, newplot=FALSE,
lcol=adjustcolor("steelblue", 0.75),
lwd=3)
points(x,y, col="grey80",pch=19, cex=0.5)
legend("topleft",
legend=c("Weight 1 for all data","Non-zero red weights"),
lty=c(2,1), lwd=3,
col=c(adjustcolor("steelblue", 0.75),
adjustcolor("steelblue", 0.75))
)
When we plot the weights like this, a few things become obvious. First, \(\widehat{\mu}(x)\) is seen to be the value that minimizes a weighted sum of squares \[\sum_{i=1}^N w(x, x_i)~ r_i^2 = \sum_{i=1}^N w(x, x_i) (y_i - \mu(x_i))^2 \] where \[w(x, x_i) = \left\{ \begin{array}{lcl} 1 & ~~~~~& x_i \in Nbhd(x) \\ &&\\ 0 && \mbox{otherwise}. \end{array} \right.\] And second, the choice of weight function is a bit peculiar, namely the indicator function \(I_{Nbhd(x)}(x_i)\).
Note also that the neighbourhood of the indicator function is defined by the \(k\) nearest neighbours of \(x\) which has some obvious consequences. For example, for some \(x\), the neighbourhood could be very unbalanced in as much as most (even all) neighbours could be on one side of \(x\) and fewer (even none) could be on the other. If \(x\) happens to be located in a place where there are few data points, the neighbourhood could stretch some distance to find its \(k\) near neighbours. Clearly this would stabilize \(\widehat{\mu}(x)\) there but it would be at the expense of relying on points possibly quite far away.
Another approach we might try would be to base our neighbourhood on the Euclidean distance from the location of interest, \(x\). For example, we could choose our neighbourhood to consist of all points \(x_i\) within some distance \(h\) of \(x\), that is \[ Nbhd(x, h) = \left\{x_i ~:~ | x - x_i | \le h \right\}.\] Just as with \(k\) nearest neighbours, we might then give a weight of one to all points \(x_i\) within this neighbourhood and a weight of zero to points outside it.
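A sketch of the corresponding indicator weight function for this distance based neighbourhood (again, the name is ours):
# Indicator weights for a fixed-width neighbourhood:
# 1 if |x_i - xloc| <= h, and 0 otherwise.
nbhdWeight <- function(xloc, x, h) {
as.numeric(abs(x - xloc) <= h)
}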
One problem with this neighbourhood is that depending on its size it might include many points or few points. We’ll return to this later.
Alternatively, we might not worry so much about the neighbourhood but rather simply choose the weights more judiciously. Since we are trying to fit locally, we could choose higher weights for the closer points and lower weights for those farther away.
For example, we might consider weights that are proportional to a function, say \(K(t)\), having the following properties: \[\int K(t) dt = 1, ~~~~~ \int t~K(t) dt = 0, ~~~~~\mbox{and}~~~ \int t^2 ~ K(t) dt < \infty. \] The first two of these standardize \(K(t)\); the last ensures that there is some spread in the weight along the real line, but not too much weight in the extremes. The function \(K:\Re \rightarrow \Re\) is called a kernel function. (Aside: this is not to be confused with the “kernel” functions you may have seen in other courses, e.g. those of reproducing kernel Hilbert space methods, which map \(K: \Re \times \Re \rightarrow \Re\).)
To help get some intuition on these, imagine that we also have \(K(t) \ge 0\) for all \(t\). Then \(K(t)\) could be a density function (integrating to 1), with mean 0 and finite variance. Note that it need not be symmetric, though symmetric kernels are all that we will consider.
A Gaussian kernel is one such example, defined as \[K(t) = \frac{1}{\sqrt{2 \pi}}~\exp\left({-\frac{t^2}{2}}\right).\]
Some other examples include:
An Epanechnikov kernel \[K(t) = \left\{ \begin{array}{lcl} \frac{3}{4}\left(1 - t^2\right) & ~~~~~~ & \mbox{for }~~ |t| < 1 \\ &&\\ 0&& \mbox{otherwise.} \end{array} \right. \]
Tukey’s tri-cube weight \[K(t) = \left\{ \begin{array}{lcl} \left(1 - |t|^3 \right)^3 & ~~~~~~ & \mbox{for }~~ |t| \le 1 \\ &&\\ 0&& \mbox{otherwise.} \end{array} \right. \]
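To make these concrete, here are minimal R versions of the two kernels (function names are ours), together with a numerical check of the kernel conditions for the Epanechnikov kernel using integrate():
# The Epanechnikov kernel
Epanechnikov <- function(t) ifelse(abs(t) < 1, 0.75 * (1 - t^2), 0)
# Tukey's tri-cube weight
tricube <- function(t) ifelse(abs(t) <= 1, (1 - abs(t)^3)^3, 0)
# Check the three kernel conditions numerically
integrate(Epanechnikov, -1, 1) # integrates to 1
integrate(function(t) t * Epanechnikov(t), -1, 1) # mean zero
integrate(function(t) t^2 * Epanechnikov(t), -1, 1) # finite (= 0.2)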
Since we are interested in applying these functions locally, the kernel function \(K(\cdot)\) is applied not to each \(x_i\) directly, but rather to the difference \(x_i - x\). Similarly, a means of controlling how quickly the weights diminish for any kernel is to introduce a scale parameter, say \(h > 0\).
Our weight function would be calculated as \[w(x, x_i) = \frac{ K \left( \frac{x_i - x}{h} \right) } {\sum_{j=1}^N K \left( \frac{x_j - x}{h} \right) }. \]
To illustrate how this might work, we can construct a naive locally weighted sum of squares estimator as follows. (Since rescaling all of the weights by the same constant does not change the minimizer of a weighted sum of squares, the normalizing denominator can safely be dropped in the code.)
# Construct a weight function shaped like
# a Gaussian or Normal density.
#
GaussWeight <- function(xloc, x, h=1) {
# Normal density
dnorm(x, mean=xloc, sd=h)
}
Now we choose a point in the \(x\) range, say xloc = 0.5, and fit a straight line using all of the data, with the above weight function determining the weight given to each point in the estimation.
# location at which we are estimating.
xloc <- 0.5
# The bandwidth corresponds to the standard
# deviation to be used in
# the Gaussian density function.
# For our data (whose x values lie between 0 and 1)
# a bandwidth of 0.4 will give non-zero weights to all of the
# (xi, yi) pairs.
weights <- GaussWeight(xloc, x, h=0.4)
#
# And use these to fit a line
fitw <- lm(y~x, data=ourdata, weights=weights)
# yielding coefficient estimates
#
alpha_w_hat <- fitw$coefficients[1]
beta_w_hat <- fitw$coefficients[2]
#
# Here's the plot again
plot(x,y,
col="grey80", pch=19, cex=0.5,
main = "Gaussian weights")
#
# With the new fit added
#
abline(fitw$coefficients, col="steelblue", lwd=2)
# The line was fitted with weights centred around x = xloc
# We can get the fitted value there as
pred <- predict(fitw, newdata=data.frame(x=xloc))
#
# And plot it for emphasis
#
points(xloc, pred, col="red", pch=19)
# We can add the weights to the display as a little
# vertical line at each point
#
# First, we'll rescale the weights to use them for plotting
wts <- weights/max(weights)
# We need to know the range of the ys
yrange <- extendrange(y)
ybottom <- min(yrange)
yheight <- diff(yrange)/3 # only use the bottom third of the plot
#
# Draw the weight lines
for (i in 1:length(wts)) {
lines(x=rep(x[i],2),
y=c(ybottom, ybottom + wts[i] * yheight),
col="pink",
lty=1)
}
This doesn’t look that different from the least-squares line. For the original least-squares fit we had \(\widehat{\alpha} =\) -1.3343939 and \(\widehat{\beta} =\) 2.4593346. With our Gaussian weights, the weighted least-squares estimates were \(\widehat{\alpha} =\) -1.4725965 and \(\widehat{\beta} =\) 2.7143856, which is different, but not that different.
This is to be expected when you look at the weights for each point shown across the bottom of the plot. These are not that different from one another and no point has very small weight.
To be more responsive to the local structure, we need only change the scale parameter h that determines the standard deviation used in the Gaussian weight function.
To see the effect of changing this, and to illustrate a few other points, we’ll first wrap all of the above local fitting and drawing in a single demo function:
# This demo function allows us to see how
# the local fits behave as we change various elements
#
demoLoWeSS <- function(x, y, xloc=NULL,
weightFn=GaussWeight,
span=0.1,
newplot=TRUE,
plot=TRUE,
showWeights=FALSE,
fullLine=TRUE,
cex=0.5,
col="grey80",
pcol="red",
wcol="pink",
lcol="steelblue",
pch=19,
lwd=2
) {
data <- data.frame(x=x, y=y)
xrange <- extendrange(x)
yrange <- extendrange(y)
bandwidth <- (diff(xrange)*span)/4
if (is.null(xloc)) {
xloc <- median(x)
}
weights <- weightFn(xloc, x, bandwidth)
fit <- lm(y~x, data=data, weights=weights)
pred <- predict(fit, newdata=data.frame(x=xloc))
# logical plot allows us to skip the plotting if FALSE
if (plot) {
if (newplot) {
plot(x=x, y=y,
col=col, pch=pch, cex=cex)
}
if (fullLine){
abline(a=fit$coefficients[1],
b=fit$coefficients[2],
col= lcol,
lwd=lwd)
} else
{
xline <- xloc + c(-bandwidth, bandwidth)
yline <- predict(fit, newdata=data.frame(x=xline))
lines(xline, yline,
col= lcol,
lwd=lwd)
}
points(xloc, pred, pch=19, col=pcol)
if (showWeights){
weights <- weights/max(weights)
ymin <- min(yrange)
ymax <- diff(yrange)/3
for (i in 1:length(weights)) {
lines(x=rep(x[i],2),
y=c(ymin, ymin + weights[i] * ymax),
col=wcol,
lty=1)
}
}
}
# invisibly return the location of the fitted point
invisible(list(x=xloc, y=pred))
}
In the above, the argument span translates into the parameter h for the Gaussian weight function. Here span is roughly calibrated to be the proportion of the range of the x data that receives non-negligible weight; dividing by 4 in the bandwidth calculation means that roughly \(\pm 2\) standard deviations of the Gaussian weight cover that proportion of the range. So let’s see what happens as we change the value of span.
xloc <- 0.5 # as before
#
parOptions <- par(mfrow=c(2,2))
spans <- seq(0.95, 0.01, length.out=8)
for (s in spans) {
demoLoWeSS(x,y, xloc=xloc,
span=s, showWeights = TRUE
)
title(main = paste("span =", round(s,2)))
}
par(parOptions)
We can now easily imagine fitting lines locally at a whole series of \(x\) values.
#
#
demoPieces <- function(x,y,
nlocs=10,
span=0.1,
plot=TRUE,
...){
xrange <- range(x)
locs <- seq(xrange[1], xrange[2], length.out = nlocs)
dotLocsY <- vector("numeric", length=nlocs)
for (i in 1:nlocs) {
xlocs <- locs[1:i]
for (j in 1:i) {
xloc <- xlocs[j]
newplot <- if (j==1) TRUE else FALSE
if (j==i) {
lcol <- "red"
pcol <- "red"
showWeights <- TRUE
} else {
lcol <- "steelblue"
pcol <- "steelblue"
showWeights <- FALSE}
dotLoc <- demoLoWeSS(x,y,
xloc=xloc, span=span,
newplot=newplot,
lcol=lcol,
pcol=pcol,
showWeights = showWeights,
fullLine = FALSE,
plot=plot,
...)
}
dotLocsY[i] <- dotLoc$y
}
dotLocs <- data.frame(x=locs, y=dotLocsY)
invisible(dotLocs)
}
parOptions <- par(mfrow = c(2,2))
dotLocs <- demoPieces(x,y,nlocs=8,span=0.8)
par(parOptions)
# Here's the plot again
plot(x,y,
col="grey80", pch=19, cex=0.5,
main = "Local linear smooth: span=0.8")
#
lines(dotLocs, pch=19,col="steelblue", type="b", lwd=2)
#
# Decrease the span
parOptions <- par(mfrow=c(2,2))
dotLocs <- demoPieces(x,y,nlocs=10,span=0.4)
par(parOptions)
# Connect the dots
plot(x,y,
col="grey80", pch=19, cex=0.5,
main = "Local linear smooth: span=0.4")
#
lines(dotLocs, pch=19,col="steelblue", type="b", lwd=2)
#
parOptions <- par(mfrow=c(2,2))
dotLocs <- demoPieces(x,y,nlocs=20,span=0.1)
par(parOptions)
# Connect the dots
plot(x,y,
col="grey80", pch=19, cex=0.5,
main = "Local linear smooth: span=0.1")
#
lines(dotLocs, pch=19,col="steelblue", type="b", lwd=2)
#
#
#
# And finally, let's look at the small bandwidth
# with many more points
# Get the locations BUT do not plot lines yet.
dotLocs <- demoPieces(x,y,
nlocs=100,span=0.1, plot=FALSE)
# Here's the plot we want
plot(x,y,
col="grey80", pch=19, cex=0.5,
main = "100 locations: span = 0.1")
#
lines(dotLocs, pch=19,col="steelblue", type="b", lwd=2)
The obvious next step is to wrap all of this in a single function. Here is one such function that produces a smooth curve by minimizing the Locally Weighted Sum of Squares, or LoWeSS.
#
# Given a weight function the corresponding
# weighted least squares estimate at any point(s) x
# is easily constructed.
#
# It requires:
# x, y - the data
# xloc - x locations at which the estimate
# is to be calculated
# span - a bandwidth
# weightFn - a weighting function, default will be GaussWeight
# nlocs - number of equi-spaced locations at which estimate
# will be calculated if xloc=NULL, ignored otherwise.
#
LoWeSS <- function(x, y,
xloc=NULL,
span=0.1,
weightFn=GaussWeight,
nlocs=200) {
data <- data.frame(x=x, y=y)
xrange <- extendrange(x)
bandwidth <- (diff(xrange)*span)/4
if (is.null(xloc)) {
xloc <- seq(xrange[1], xrange[2],
length.out=nlocs)
}
estimates <- vector("numeric", length=length(xloc))
for (i in 1:length(xloc)) {
weights <- weightFn(xloc[i], x, bandwidth)
fit <- lm(y~x, data=data, weights=weights)
pred <- predict(fit, newdata=data.frame(x=xloc[i]))
estimates[i] <- pred
}
# return the estimates
list(x = xloc, y = estimates)
}
We can now apply this function to ourdata:
smooth <- LoWeSS(x,y)
plot(x,y,
col="grey80", pch=19, cex=0.5,
main = "Our data")
lines(smooth$x, smooth$y, col="steelblue", lwd=2)
Now ourdata was actually generated as \[ Y_i = \mu(x_i) + R_i \] with \(R_i \sim N(0, 0.2^2)\), that is, with standard deviation 0.2.
So we are in the completely artificial position of being able to compare the fitted smooth with the true \(\mu(x_i)\) that was used to produce these data. Here’s how the data were constructed.
# The fake data
#
# Get some x's
#
set.seed(12032345)
N <- 300
x <- runif(150, 0,1)
x <- c(x, rnorm(150, 0.5, 0.3))
#
# A function for the mean of ys
#
mu <- function(x) {
vals <- vector("numeric", length=length(x))
breaks <- quantile(x, probs=c(1/3,2/3))
first <- x <= breaks[1]
second <- (x > breaks[1]) & (x <= breaks[2])
third <- x > breaks[2]
vals[first] <- -(1 - x[first])^0.5 -0.1
vals[second] <- sin(x[second] * 4 * pi) + x[second]/10
vals[third] <- x[third]^2
vals
}
#
# generate some ys
#
y <- mu(x) + rnorm(N, 0, .2)
ourdata <- data.frame(x=x, y=y)
# Here's the data plotted
plot(x,y,
col="grey80", pch=19, cex=0.5,
main = "Our LoWeSS smoother")
# Add our naive "LoWeSS" smoother (previously calculated)
#
lines(smooth$x, smooth$y, col="steelblue", lwd=2)
# And overlay the true mean of Y, namely mu(x)
#
xordered <- sort(x)
lines(xordered, mu(xordered), col="firebrick", lwd=2, lty=3)
What we have constructed here is a naive locally weighted sum of squares estimator. Clearly, there are some difficulties that require attention, such as:
choosing the bandwidth. What value should it take? Also, ours was defined only in terms of distance along the \(x\) axis; an alternative might instead use a proportion of the nearest \(x\) values.
what weight function? Perhaps a more robust choice that actually gives zero weight to points that are far away.
what about the ends? It seems that there we can only estimate using data from one side. Should that affect the choice of bandwidth there? Or the weight function?
The naive locally weighted sum of squares above did not define neighbourhoods, but rather used the scale parameter \(h\) to determine how weights would diminish. Again this means that where the data is densest in \(x\), more points will appear in the estimation than where it is sparser.
So we now return to asserting that there will be the same specified number of points, \(k\), in a neighbourhood. Only points in that neighbourhood may have non-zero weights; all points outside of the neighbourhood will have zero weight. We will also use some kernel function to downweight points within the neighbourhood. The kernel will again be evaluated at \((x_i - x) / h\), but now we will choose \(h\) to be a function of the maximum distance \(|x_i - x|\) over all points in the neighbourhood.
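Combining the two ideas, a sketch of such a weight function (using Tukey's tri-cube weight from earlier; the name knnKernelWeight is ours) might be:
# Tri-cube weights on the k nearest neighbours of xloc.
# h is the largest distance within the neighbourhood, so
# points outside the neighbourhood get exactly zero weight.
knnKernelWeight <- function(xloc, x, k) {
distance <- abs(x - xloc)
h <- sort(distance)[k]
t <- distance/h
ifelse(t <= 1, (1 - t^3)^3, 0)
}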
In R there is a function called loess that fits a LOcally wEighted Sum of Squares estimate and pays a little more attention to some of these problems. For example,
fit <- loess(y~x, data=ourdata)
plot(x,y,
col="grey80", pch=19, cex=0.5,
main = "The loess smoother: default span")
#
# Get the values predicted by the loess
pred <- predict(fit)
#
# Get the order of the x
Xorder <- order(x)
#
lines(x[Xorder],pred[Xorder], lwd=2, col="steelblue")
We could also choose the span being used here:
fit <- loess(y~x, data=ourdata, span=0.15)
plot(x,y,
col="grey80", pch=19, cex=0.5,
main = "The loess smoother: span=0.15")
#
# Get the values predicted by the loess
pred <- predict(fit)
#
#
lines(x[Xorder],pred[Xorder], lwd=2, col="steelblue")
The loess function uses a local neighbourhood determined by either
its argument span (default 0.75), the proportion of points, say \(a\), taken as the local neighbourhood (roughly \(k\) nearest neighbours with \(k \approx N \times a\), presumably the ceiling of this), or
its argument enp.target, the “number of equivalent parameters” in the model – like the effective model degrees of freedom, a measure of the complexity of the model. This too translates into an equivalent proportion \(a\) of points in the local neighbourhood.
Given the parameter \(a < 1\), the default weighting is given by Tukey’s tricube weight \[K(t) = \left\{ \begin{array}{lcl}(1 - |t|^3)^3 &~~~~~~~~& \mbox{for } ~~ |t| \le 1\\ 0 && \mbox{elsewhere} \end{array} \right.\] For the \(i\)th point in the neighbourhood, we take \[t_i = \frac{|x_i - x|}{\max_{j \in Nbhd(x)}|x_j - x|}.\]
If \(a > 1\), the maximum distance in the above denominator is taken to be \(a^{1/p}\) times the maximum distance to \(x\) of all the \(x_i\)s. Note that here \(p\) is the number of explanatory variates in case there is more than one. (In this case, \(|x_i - x|\) is everywhere replaced by the Euclidean distance \(||{\bf x}_i -{\bf x}||\).)
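As a quick sketch of these two ways of specifying the neighbourhood (the equivalent number of parameters reported will depend on the data):
# Two ways to control the loess neighbourhood size
fitSpan <- loess(y ~ x, data=ourdata, span=0.3)
fitEnp <- loess(y ~ x, data=ourdata, enp.target=7)
# Each fitted object records its equivalent number of parameters
fitSpan$enp
fitEnp$enp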
Loess is not restricted to fitting local lines. loess
can fit any degree polynomial locally (though typically only degree 1 or 2 is used in practice). Its default is 2.
The fitting mechanism is given by the parameter family
and can be either gaussian
(the default), which will use least-squares to fit the local polynomial, or symmetric
which will begin with least squares and then perform a few iterations of an M-estimation using Tukey’s bisquare weight function.
As was suggested in the above discussion of loess
, all of these local fitting methods are easily extended to cases with more than one explanatory variate simply by using distances \(||{\bf x}_i -{\bf x}||\) in place of \(|x_i - x|\) wherever the latter appears. For example the kernel weighting functions all become \[K\left(\frac{||{\bf x}_i -{\bf x}||}{h} \right)\] for some span parameter \(h\).
A smoother is sometimes called a linear smoother if its vector of fitted values \(\widehat{\boldsymbol \mu}\) at the \(x_i\)s can be written as
\[\widehat{\boldsymbol \mu} = {\bf S} {\bf y}\] for some \(N \times N\) matrix \({\bf S}\) whose elements depend only on the values of \(x_1, \ldots, x_N\). The smoothers are called linear because they are linear in \(y\).
Nearly all of the smoothers discussed above are linear smoothers. The only exceptions are those where the fitting (locally or globally) has been done using weights that depend on the values of \(y\). This would be the case, for example, if any of the iteratively reweighted M-estimates or the high breakdown estimates were used in the fitting – the resulting smooths would no longer be strictly linear in \(y\). There are other examples as well, but we will not be considering them here.
We have really considered only three different classes of smoother. First are the splines based on a fixed set of \(K\) knots (or equivalently a specified number of effective degrees of freedom). Of these we considered two possibilities: \(p\)-degree splines and natural splines.
For \(p\)-degree splines we had \[\widehat{\boldsymbol \mu} = {\bf S}_b {\bf y} = {\bf B}\left({\bf B}^T{\bf B}\right)^{-1}{\bf B}^T ~ {\bf y}\] and similarly for natural splines, we had \[\widehat{\boldsymbol \mu} = {\bf S}_n {\bf y} = {\bf N}\left({\bf N}^T{\bf N}\right)^{-1}{\bf N}^T ~ {\bf y}.\] Both of these were called “projection smoothers” since they orthogonally project \({\bf y}\) onto the space spanned by the columns of \({\bf B}\) for \(p\)-degree splines or onto the space spanned by the columns of \({\bf N}\) for natural splines.
The second class was that of the “smoothing splines”. For these we had \[\widehat{\boldsymbol \mu} = {\bf S}_{\lambda} {\bf y} \] and showed that
\[ {\bf S}_{\lambda} = {\bf N}\left({\bf N}^T{\bf N} + \lambda \Omega_N \right)^{-1}{\bf N}^T = \left(I_N + \lambda {\bf K} \right)^{-1}
\]
The third class is the set of local polynomial regression smoothers. We include kernel smoothers in this class (where the local polynomial is simply a constant). In this case, at each point \(\bf x\) we fit a polynomial of some degree (typically 0, 1, or 2) by weighted least-squares with weights depending on the distance of the observed points \({\bf x}_i\) from the point of interest \(\bf x\). When we look at the fitted values at the observed \({\bf x}_i\) we have \[\widehat{\mu}_i = \widehat{\mu}({\bf x}_i) = {\bf x}_i^T {\widehat{\boldsymbol \beta}_i}\] where \({\widehat{\boldsymbol \beta}_i} = \left({\bf X}^T{\bf W}_i{\bf X}\right)^{-1}{\bf X}^T{\bf W}_i {\bf y}\) is the weighted least-squares estimate of the coefficient vector with a diagonal matrix \({\bf W}_i\) of weights that are peculiar to each observation \(i\). This means that each element \(\widehat{\mu}_i\) of \(\widehat{\boldsymbol \mu}\) can be written as \[\widehat{\mu}_i = {\bf s}_i^T {\bf y}\] where \[ {\bf s}_i^T = {\bf x}_i^T \left({\bf X}^T{\bf W}_i{\bf X}\right)^{-1}{\bf X}^T{\bf W}_i\] is a \(1 \times N\) vector depending only on the values of the \(x_i\)s. This in turn means we can write any local polynomial regression smoother’s fitted values as \[\widehat{\boldsymbol \mu} = {\bf S} {\bf y}\] for \[{\bf S}= \left[\begin{array}{c}{\bf s}_1^T\\{\bf s}_2^T\\ \vdots \\ {\bf s}_N^T \end{array} \right].\] Note that this matrix, unlike the others, need not be symmetric.
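As a small sketch of this (using the GaussWeight function from earlier; the name smootherRow is ours), the row \({\bf s}_i^T\) for a local linear fit at a single location could be computed directly:
# One row of a local linear smoother matrix:
# s^T = x_loc^T (X^T W X)^{-1} X^T W
# with Gaussian kernel weights centred at xloc
smootherRow <- function(xloc, x, h) {
X <- cbind(1, x) # local linear design matrix
W <- diag(GaussWeight(xloc, x, h)) # diagonal weight matrix
xvec <- c(1, xloc)
as.vector(t(xvec) %*% solve(t(X) %*% W %*% X, t(X) %*% W))
}
# The fitted value at xloc is then sum(smootherRow(xloc, x, h) * y)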
All of these smoothers are therefore linear smoothers in that they share this common structure: \[\widehat{\boldsymbol \mu} = {\bf S} {\bf y}\] for variously defined \({\bf S}\), each being defined independently of the values of \(y\).
Since all linear smoothers have the same form we should be able to look at them in much the same way – no matter how they were motivated or derived.
For example, local regression smoothers like loess
were built on kernel functions which gave higher weight to observations nearest the point \(x\) of interest. We might expect, therefore, that in computing the fit at any given \(x\), the coefficients multiplying the \(y_i\)s would be higher for a \(y_i\) whose corresponding \(x_i\) was closer to \(x\) than for one farther away. To check this, we might have a look at the coefficients of each \(y_i\) as a function of \(x_i\).
First we need to have the smoother matrix that corresponds to a loess
fit. As we did with the smoothing splines, we could compute this for any particular set of \(x\) values as follows:
smootherMatrixLoess <- function(x,
span=NULL,
enp.target=NULL,
...) {
n <- length(x)
S <- matrix(0, n, n)
for (i in 1:n) {
ei = rep(0, n)
ei[i] <- 1
# insert the fit into the i'th column of S
if (is.null(span) && is.null(enp.target))
{
S[,i] <- predict(loess(ei ~ x, ...))
} else {
if (is.null(span)) {
S[,i] <- predict(loess(ei ~ x,
enp.target=enp.target,
...))
} else {
S[,i] <- predict(loess(ei ~ x,
span=span,
...))
}
}
}
# For loess, the smoother matrix need
# not be a symmetric matrix
S
}
We determine this on our fake data as follows.
S_l <- smootherMatrixLoess(ourdata$x, enp.target=7)
S <- S_l
# We now just look at the coefficients for any one of the ys.
#
# Let's arbitrarily pick a couple of rows at random
set.seed(12231299)
row <- sample(1:nrow(S), 2)
# First get the order of the x
xOrder <- order(ourdata$x)
# now plot the first row's coefficients ordered by x value
i <- 1
plot(ourdata$x[xOrder], S[row[i],][xOrder], type="l",
xlim=extendrange(ourdata$x), ylim=extendrange(S[row,]),
xlab="x", ylab="Coefficient of y",
main="Two rows of smoother matrix",
lwd=3,
col=adjustcolor("steelblue", 0.5))
abline(v=ourdata$x[row[i]], col="steelblue")
# And another row
i <- 2
lines(ourdata$x[xOrder], S[row[i],][xOrder],
lwd=3, lty=2,
col=adjustcolor("steelblue", 0.5))
abline(v=ourdata$x[row[i]], col="steelblue", lty=2)
As can be seen, the coefficients concentrate around the \(x\) value at which the fit was made (marked by the vertical lines).
Even though smoothing splines were derived from a global minimization, because they turn out to be linear smoothers, we can look at their coefficients as well.
S_s <- smootherMatrix(ourdata$x, df=7)
S <- S_s
# We now just look at the coefficients for any one of the ys.
#
# Let's pick the same couple of rows as before
set.seed(12231299)
row <- sample(1:nrow(S), 2)
# First get the order of the x
xOrder <- order(ourdata$x)
# now plot the first row's coefficients ordered by x value
i <- 1
plot(ourdata$x[xOrder], S[row[i],][xOrder], type="l",
xlim=extendrange(ourdata$x), ylim=extendrange(S[row,]),
xlab="x", ylab="Coefficient of y",
main="Two rows of smoother matrix",
lwd=3,
col=adjustcolor("firebrick", 0.5))
abline(v=ourdata$x[row[i]], col="firebrick")
# And another row
i <- 2
lines(ourdata$x[xOrder], S[row[i],][xOrder],
lwd=3, lty=2,
col=adjustcolor("firebrick", 0.5))
abline(v=ourdata$x[row[i]], col="firebrick", lty=2)
As was the case with the local regression, the coefficient of \(y_i\) is higher the closer its \(x_i\) value is to the value of \(x\).
Similarly, when we determined the various spline smoothers we considered them as linear combinations of various basis functions. In particular, we looked at some of the orthogonal basis functions that went into determining any smoother.
We can do the same for any linear smoother. While the spline based smoother matrices \({\bf S}\) were symmetric, this need not be the case in general and it is not the case for the local regression estimators. So, rather than working with the eigen decomposition of the smoother matrix, we work with its singular value decomposition. That is, we decompose any smoother matrix \({\bf S}\) as \[{\bf S} = {\bf U}{\bf D}_{\rho}{\bf V}^T\] for \(N \times N\) matrices \({\bf U}=[{\bf U}_1, \ldots, {\bf U}_N]\), \({\bf V}=[{\bf V}_1, \ldots, {\bf V}_N]\), \({\bf D}_{\rho}=diag(\rho_1, \ldots, \rho_N)\) with \(\rho_1 \ge \rho_2 \ge \cdots \ge \rho_N \ge 0\) and \[{\bf U}^T{\bf U}=I_N = {\bf V}^T{\bf V}. \]
The smooth can now be written as \[\begin{array}{lcl} \widehat{\boldsymbol \mu} &=& {\bf U}{\bf D}_{\rho}{\bf V}^T {\bf y} \\ & & \\ &=& \sum_{i=1}^N {\bf U}_i \rho_i <{\bf V}_i , {\bf y}> \\ \end{array} \] which separates into the basis vectors \({\bf U}_i\), the singular values \(\rho_i\) and the orthogonal component of \({\bf y}\) along the direction vectors \({\bf V}_i\).
If we consider the smoothing spline, as we did before, we can plot these various components as follows
svd_s <- svd(S_s)
sum(svd_s$d)
## [1] 6.998441
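Note, in passing, that because the smoothing spline smoother matrix is symmetric and positive semi-definite, its singular values sum to its trace, i.e. to the effective degrees of freedom (about 7 here). As a quick sanity check, the SVD pieces should also reassemble the smooth exactly:
# Verify that S y and U D V^T y agree (up to rounding)
muhat_direct <- S_s %*% ourdata$y
muhat_svd <- svd_s$u %*% diag(svd_s$d) %*% t(svd_s$v) %*% ourdata$y
max(abs(muhat_direct - muhat_svd)) # essentially zero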
plot(svd_s$d, type="b",
col=adjustcolor("firebrick",0.5),
cex=0.5, pch=19,
main="singular values", xlab="index", ylab="rho")
plot(t(svd_s$v) %*% ourdata$y,
col=adjustcolor("firebrick",0.5),
cex=0.5, pch=19,
type="b", main="y components",
xlab="index", ylab="y-component")
parOptions <- par(mfrow=c(2,2))
k <- 12
ylim <- extendrange(svd_s$u[,1:k])
for (i in 1:k) {
plot(ourdata$x[xOrder],svd_s$u[xOrder,i],
type="l", ylim=ylim, col="firebrick",
main=paste("basis", i), xlab="x", ylab="u value")
}
par(parOptions)
As can be seen, the singular values and the \(y\) components die off quickly. These are the multipliers of the basis functions. The orthogonal basis functions increase in complexity as \(i\) increases. These higher frequency basis functions are largely obliterated by the small singular values and \(y\) components.
Now since loess
is also a linear smoother, we can do the same for loess
on this data.
svd_l <- svd(S_l)
sum(svd_l$d)
## [1] 8.347651
parOptions <- par(mfrow=c(1,2))
plot(svd_l$d,
type="b", main="singular values",
col=adjustcolor("steelblue",0.5),
cex=0.5, pch=19,
xlab="index", ylab="rho")
plot(t(svd_l$v) %*% ourdata$y,
col=adjustcolor("steelblue",0.5),
cex=0.5, pch=19,
type="b", main="y components",
xlab="index", ylab="y-component")
par(parOptions)
parOptions <- par(mfrow=c(2,2))
k <- 12
ylim <- extendrange(svd_l$u[,1:k])
for (i in 1:k) {
plot(ourdata$x[xOrder],
svd_l$u[xOrder,i],
type="l", ylim=ylim,
col="steelblue",
main=paste("basis", i),
xlab="x", ylab="u value")
}
par(parOptions)
The local weighted least squares estimate shows much the same pattern as the smoothing spline. It too has a set of orthogonal basis functions for which the singular values and the \(y\) components die out quickly. Again, the orthogonal basis functions increase in complexity as \(i\) increases and the higher frequency basis functions are largely obliterated by the small singular values and \(y\) components.
To make a few direct comparisons, we could overplot some of these as follows:
S <- S_s
# First get the order of the x
xOrder <- order(ourdata$x)
i <- 1
plot(ourdata$x[xOrder],
S[row[i],][xOrder],
type="l",
xlim=extendrange(ourdata$x),
ylim=extendrange(c(range(S_l[row,]),
range(S_s[row,]))),
xlab="x", ylab="Coefficient of y",
main="Two rows of smoother matrix",
lwd=3,
col=adjustcolor("firebrick", 0.5))
abline(v=ourdata$x[row[i]], col="grey80")
# And another row
i <- 2
lines(ourdata$x[xOrder], S[row[i],][xOrder],
lwd=3, lty=2,
col=adjustcolor("firebrick", 0.5))
abline(v=ourdata$x[row[i]], col="grey80", lty=2)
# now plot the first row's coefficients ordered by x value
i <- 1
S <- S_l
lines(ourdata$x[xOrder], S[row[i],][xOrder],
lwd=3,
col=adjustcolor("steelblue", 0.5))
# And another row
i <- 2
lines(ourdata$x[xOrder], S[row[i],][xOrder],
lwd=3, lty=2,
col=adjustcolor("steelblue", 0.5))
legend(-0.5, 0.025,
legend = c("spline", "loess"),
lwd=3,
col = adjustcolor(c("firebrick", "steelblue"),
0.5)
)
#
#
plot(svd_l$d,
type="b", main="singular values",
col=adjustcolor("steelblue",0.5),
cex=0.5, pch=19,
xlab="index", ylab="rho")
points(svd_s$d, type="b",
col=adjustcolor("firebrick",0.5),
cex=0.5, pch=19)
plot(t(svd_s$v) %*% ourdata$y,
col=adjustcolor("firebrick",0.5),
cex=0.5, pch=19, ylim=c(-5,15),
type="b", main="y components",
xlab="index", ylab="y-component")
points(t(svd_l$v) %*% ourdata$y,
col=adjustcolor("steelblue",0.5),
cex=0.5, pch=19,
type="b")
Suppose we have more than one explanatory variate. For example, consider the ozone
data set from the package ElemStatLearn
:
library(ElemStatLearn)
head(ozone)
## ozone radiation temperature wind
## 1 41 190 67 7.4
## 2 36 118 72 8.0
## 3 12 149 74 12.6
## 4 18 313 62 11.5
## 5 23 299 65 8.6
## 6 19 99 59 13.8
Here we have 111 daily measurements taken from May until September 1973 in New York on four variates:
ozone
, the ozone concentration in parts per billion (ppb),
radiation
, the solar radiation energy measured in langleys,
temperature
, the maximum temperature that day in degrees Fahrenheit, and
wind
, the wind speed in miles per hour.
Interest lies in modelling how ozone
depends on the other variates.
pairs(ozone, pch=19, col=adjustcolor("firebrick",0.4))
To begin, let’s just try modelling ozone levels as a function of only two variates. That will give three variates in total and allow us to see what’s going on using some three dimensional graphics called scatter3d
from the car
package.
For example, we could use a linear model for the mean ozone level: \[\mu(x_i, z_i) = \alpha + \beta (x_i - \overline{x}) + \gamma (z_i - \overline{z}) \] where \(x_i\) is the radiation and \(z_i\) is the temperature. In this case, we are fitting a plane to a three dimensional point cloud.
Execute the following code and explore.
library(rgl) # Access to all of open GL graphics
# Get some graphing code (very slightly) adapted from John Fox's "car" package to accommodate loess.
#
source("../../Code/scatter3d.R") # from the course home page
#
scatter3d(ozone ~ radiation + temperature, data=ozone)
# Get a png snap shot of this using rgl package's
# snapshot3d("ozoneRegressionPlane.png")
#
After trying some power transformations on the ozone variate, it appears better to model its cube root, ozone^(1/3), instead.
scatter3d(ozone^(1/3) ~ radiation + temperature,
data=ozone,
model.summary=TRUE # print out a summary of the fitted model
)
## $linear
##
## Call:
## lm(formula = y ~ x + z)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.23670 -0.08050 -0.00671 0.05930 0.38993
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.10578 0.02951 3.585 0.000508 ***
## x 0.14024 0.04010 3.497 0.000685 ***
## z 0.55325 0.05044 10.968 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1121 on 108 degrees of freedom
## Multiple R-squared: 0.6112, Adjusted R-squared: 0.604
## F-statistic: 84.88 on 2 and 108 DF, p-value: < 2.2e-16
# snapshot3d("ozone3RegressionPlane.png")
#
# Note that the above coefficient estimates
# do not match the ones below
lmfit <- lm(ozone^(1/3) ~ radiation + temperature,
data=ozone)
summary(lmfit)
##
## Call:
## lm(formula = ozone^(1/3) ~ radiation + temperature, data = ozone)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.18350 -0.40252 -0.03355 0.29652 1.94967
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.1530326 0.4398302 -4.895 3.46e-06 ***
## radiation 0.0021443 0.0006132 3.497 0.000685 ***
## temperature 0.0643317 0.0058653 10.968 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5603 on 108 degrees of freedom
## Multiple R-squared: 0.6112, Adjusted R-squared: 0.604
## F-statistic: 84.88 on 2 and 108 DF, p-value: < 2.2e-16
# The difference in the coefficients is because
# scatter3d standardizes all of the variates before fitting.
# Note that the R-squared, etc. is the same.
What happens if we add an interaction term?
scatter3d(ozone^(1/3) ~ radiation + temperature,
data=ozone,
fit = "interaction")
# snapshot3d("ozone3interaction.png")
The difference might best be seen when we have these two together on the same plot.
scatter3d(ozone^(1/3) ~ radiation + temperature,
data=ozone,
fit = c("linear", "interaction"))
# snapshot3d("ozone3linearandinteraction.png")
The plane (linear) fit is in blue, the one with an interaction in green.
The plane fitted a surface of the form \[\widehat{\mu}(x,z) = \widehat{\beta}_0 + \widehat{\beta}_1 (x - \overline{x}) + \widehat{\beta}_2 (z - \overline{z})\] which is strictly additive in each of \(x\) and \(z\). Moreover, each additive component for \(x\) and for \(z\) is itself linear (in \(x\) and in \(z\) respectively). Note that we choose to centre each of the explanatory variates. The “intercept” term \(\widehat{\beta}_0\) in this model is interpreted as the point on the surface (or average of the \(y\)s) where \(x=\overline{x}\) and \(z=\overline{z}\).
When we add an interaction term, we lose the additivity. The surface being fit now has the form \[\widehat{\mu}(x,z) = \widehat{\beta}_0 + \widehat{\beta}_1 (x - \overline{x}) + \widehat{\beta}_2 (z - \overline{z}) + \widehat{\beta}_3 (x - \overline{x}) (z - \overline{z}).\] This is no longer additive in \(x\) or in \(z\). Instead, the interaction term means that the component for \(x\) will depend on the value of \(z\) and conversely the component for \(z\) will depend on the value of \(x\). That is, for any choice of value \(z_0\) of \(z\), the component for \(x\) is still a straight line \[(\widehat{\beta}_0 + \widehat{\beta}_2 (z_0 - \overline{z})) + (\widehat{\beta}_1 + \widehat{\beta}_3 (z_0 - \overline{z})) (x - \overline{x}) \] except that now both the intercept and the slope depend on the value of \(z_0\). A similar line holds for the component of \(z\) at any value \(x_0\) of \(x\).
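As a sketch of the same (unstandardized) interaction fit using lm() directly, with each explanatory variate centred first (the names xc and zc are ours):
# Centre the two explanatory variates
xc <- with(ozone, radiation - mean(radiation))
zc <- with(ozone, temperature - mean(temperature))
# In the formula, xc * zc expands to xc + zc + xc:zc,
# i.e. both main effects plus their interaction
fitInt <- lm(ozone^(1/3) ~ xc * zc, data=ozone)
coef(fitInt)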
Suppose we add quadratic components for each variate.
scatter3d(ozone^(1/3) ~ radiation + temperature,
data=ozone,
fit=c("quadratic")
)
# snapshot3d("ozone3RegressionQuadratic.png")
The fitted surface has the form \[\widehat{\mu}(x,z) = \widehat{\beta}_0 + \widehat{\beta}_1 (x - \overline{x}) + \widehat{\beta}_2 (z - \overline{z}) + \widehat{\beta}_3 (x - \overline{x})^2 + \widehat{\beta}_4 (z - \overline{z})^2. \] As was the case with the simple planar surface, this surface is also additive in each of \(x\) and \(z\).
Instead of a straight line though, each additive function is now a quadratic.
A non-additive quadratic fit would also include a cross product (or interaction) term \(xz\):
scatter3d(ozone^(1/3) ~ radiation + temperature,
data=ozone,
fit=c("quadint"))
# snapshot3d("ozone3quadint.png")
#
The effect of the additional interaction term is best appreciated by having both surfaces appear together:
scatter3d(ozone^(1/3) ~ radiation + temperature,
data=ozone,
fit=c("quadratic", "quadint")
)
Again, blue is the additive quadratic, green the quadratic with interaction. The effect of the interaction term on the quadratic additive fit is very much like that of the same term on the linear additive fit.
Rather than use simple polynomial models, we could also use smoothers. And, in the same fashion, we can choose between additive models and non-additive, or interaction, models.
Two smoothing approaches which generalize to the multiple regression case are local polynomial regression using loess
and thin-plate splines using the smoothing splines with the basis functins described earlier.
We can have a look at each of these in turn.
We begin with using local polynomial regression via loess
. With default settings we have:
scatter3d(ozone^(1/3) ~ radiation + temperature,
data=ozone,
fit=c("loess"))
# snapshot3d("ozone3loess.png")
We can see the effect of a more flexible fit by increasing the effective degrees of freedom for the smooth.
scatter3d(ozone^(1/3) ~ radiation + temperature,
data=ozone,
fit=c("loess"),
df.loess=20)
# snapshot3d("ozone3loessdf20.png")
scatter3d(ozone^(1/3) ~ radiation + temperature,
data=ozone,
fit=c("loess"),
df.loess=30)
# snapshot3d("ozone3loessdf30.png")
The default smooth here is a “thin-plate spline” as described earlier.
scatter3d(ozone^(1/3) ~ radiation + temperature,
data=ozone,
fit=c("smooth"))
# snapshot3d("ozone3smooth.png")
We can compare this flexible fit with loess
by trying to match the effective degrees of freedom for each smooth.
scatter3d(ozone^(1/3) ~ radiation + temperature,
data=ozone,
fit=c("smooth"),
df.smooth=20)
# snapshot3d("ozone3smoothdf20.png")
We can compare this directly with loess by placing them on the same plot.
scatter3d(ozone^(1/3) ~ radiation + temperature,
data=ozone,
fit=c("loess","smooth"),
df.loess=20, df.smooth=20)
Here loess is blue, thin-plate green. The major differences appear to be at the corners where there are no points – differences reflect different strategies for extrapolation.
Again, for comparison, we look at the thin-plate smooth for 30 effective degrees of freedom.
scatter3d(ozone^(1/3) ~ radiation + temperature,
data=ozone,
fit=c("smooth"),
df.smooth=30)
# snapshot3d("ozone3smoothdf30.png")
And compare the thin-plate spline (green) with the analogous loess (blue):
scatter3d(ozone^(1/3) ~ radiation + temperature,
data=ozone,
fit=c("loess", "smooth"),
df.loess=30, df.smooth=30)
# snapshot3d("ozone3bothdf30.png")
Again the major differences are in locations with no points in the dataset.
One of the challenges of multiple regression, particularly for thin-plate splines and local regression methods, is that as the dimensionality increases so must the size of the local neighbourhood in order to capture enough points to construct the fit.
One way around this is to use additive models. Since each additive component is a function only of a single variate, the neighbourhoods needed for estimation need not be as large.
The model surface fitted now has the form: \[\widehat{\mu}(x,z) = \widehat{\beta}_0 + \widehat{\beta}_1 f_1(x - \overline{x}) + \widehat{\beta}_2 f_2(z - \overline{z}) \] where \(f_1(\cdot)\) and \(f_2(\cdot)\) are splines of some sort (typically either regression or smoothing splines). The default in our demonstration software scatter3d
uses cubic regression splines.
These behave much like additive models in the usual linear model, except now the functional form of each component is very flexible.
scatter3d(ozone^(1/3) ~ radiation + temperature,
data=ozone,
fit=c("additive"))
# snapshot3d("ozone3additive.png")
As with adding an interaction term, we might compare this with, for example, a thin plate spline:
scatter3d(ozone^(1/3) ~ radiation + temperature,
data=ozone,
fit=c("additive", "smooth"))
What we are giving up in flexibility, we are gaining in estimability and in interpretability.
We can make this a more flexible fit by specifying the span or, to make it comparable to the other smoothers, by specifying the effective degrees of freedom.
Here the df.additive argument specifies the effective or target degrees of freedom for each component. This means that the total degrees of freedom for the smooth part (aside from the intercept) will be the sum of the degrees of freedom across the smooth components.
scatter3d(ozone^(1/3) ~ radiation + temperature,
data=ozone,
fit=c("additive"), df.additive=10)
# snapshot3d("ozone3additive10.png")
The corresponding thin-plate spline, for example, would have effective degrees of freedom of 20.
scatter3d(ozone^(1/3) ~ radiation + temperature,
data=ozone,
fit=c("additive", "smooth"),
df.additive=10, df.smooth=20)
# snapshot3d("ozone3additive10sm20.png")
The differences can become more severe as the effective degrees of freedom increase.
scatter3d(ozone^(1/3) ~ radiation + temperature,
data=ozone,
fit=c("additive", "smooth"),
df.additive=15, df.smooth=30)
And finally, comparing the additive fits to local polynomial regression.
scatter3d(ozone^(1/3) ~ radiation + temperature,
data=ozone,
fit=c("additive", "loess"),
df.additive=10, df.loess=20)
Again the greatest discrepancies appear to be where there is little or no data in the \(x\)-\(z\) plane.
Like other smoothing methods for multiple explanatory variates, additive models have the ability to flexibly fit the data without committing to a parametric functional form. In this sense, smoothing methods are sometimes called nonparametric (although clearly some parameters remain, the few that are left offer a very large set of possibilities).
The principal advantages of additive models over the other smoothing models (above) are
Ease of interpretation - being additive in each explanatory variate, the dependence of the response on each variate is described by a single smooth function; no interaction between variates is present.
The “curse of dimensionality” is also ameliorated somewhat, since neighbourhoods used for smoothing purposes are selected only in a one-dimensional space and not in some \(p\) dimensional neighbourhood, which can be sparse for large \(p\).
The principal disadvantage is the assumption of additivity. If there is significant interaction (i.e. any additive component depends on two or more explanatory variates) then the fitted model can be misleading.
Interaction can of course be added as in the linear model by adding, for example, a term \(f (x_i \times z_i)\) which treats the product \(x_iz_i\) as a new variate. Alternatively, a component \(f(x_i, z_i)\) could be added and fitted with a thin-plate spline or a local-polynomial regression.
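For instance, a sketch using the mgcv package (introduced just below) might add a joint smooth of radiation and temperature via s(x, z), an isotropic thin-plate smooth (a tensor product smooth te(x, z) is another option):
library(mgcv)
# An additive component in wind plus a two-dimensional
# (interaction) smooth in radiation and temperature
gamint <- gam(ozone^(1/3) ~ s(radiation, temperature) + s(wind),
data=ozone)
summary(gamint)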
Suppose we fit ozone to all three explanatory variates, that is our fitted model of the surface (now in 4 dimensions) is of the form \[\widehat{\mu}(x,t,w) = \widehat{\beta}_0 + \widehat{\beta}_1 f_1(x - \overline{x}) + \widehat{\beta}_2 f_2(t - \overline{t}) + \widehat{\beta}_3 f_3(w - \overline{w}) \] where \(x\), \(t\), and \(w\) are radiation
, temperature
, and wind
respectively. We fit this model using the function gam
(for generalized additive model) from the mgcv
package. This is the same function that was used in scatter3d
.
The function behaves as if it were a simple extension of lm(...)
. This means that if it is given a formula which looks like a linear model, then it will build a linear model of that form. For example,
lmgam <- gam(ozone^(1/3) ~ radiation + temperature + wind,
data=ozone)
summary(lmgam)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## ozone^(1/3) ~ radiation + temperature + wind
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.2973296 0.5552139 -0.536 0.593400
## radiation 0.0022055 0.0005585 3.949 0.000141 ***
## temperature 0.0500443 0.0061062 8.196 5.85e-13 ***
## wind -0.0760220 0.0157548 -4.825 4.67e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## R-sq.(adj) = 0.672 Deviance explained = 68.1%
## GCV = 0.26999 Scale est. = 0.26026 n = 111
If we would like to have the additive components modelled via flexible smooths, then we need to specify smooth functions for those components. This is done by wrapping each such term in a smoothing function s(...) as follows:
gamfit <- gam(ozone^(1/3) ~ s(radiation) +
s(temperature) +
s(wind),
data=ozone)
summary(gamfit)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## ozone^(1/3) ~ s(radiation) + s(temperature) + s(wind)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.24778 0.04282 75.84 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(radiation) 2.113 2.661 8.447 0.000114 ***
## s(temperature) 4.279 5.282 13.980 3.10e-11 ***
## s(wind) 2.509 3.179 10.627 2.55e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.743 Deviance explained = 76.4%
## GCV = 0.22348 Scale est. = 0.20355 n = 111
Each component can now be plotted separately and the dependence of the response on that variate interpreted separately from the other variates.
plot(gamfit, main="Ozone",
ylab="partial residuals", jit=TRUE,
residuals =TRUE, pch=19)
Note that on the \(y\) axis the partial residuals are plotted. These are the residuals after subtracting the other fitted components in the model. The short vertical marks at the bottom of the plot are the values for that explanatory variate. Because jit=TRUE
the locations of these values have been “jittered” by adding a small amount of random noise; this allows the number of points to be more easily seen when there are ties.
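To focus on one component at a time, the plot method for gam objects accepts a select argument; for example, to show only the temperature component:
# Plot only the second smooth term, s(temperature),
# with its partial residuals
plot(gamfit, select=2, residuals=TRUE, pch=19,
ylab="partial residuals", main="Ozone")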
Note also that these models can be compared much like the usual linear models via the anova(...) function in R. For example, comparing the additive model to the linear model we have
anova(lmgam, gamfit)
## Analysis of Deviance Table
##
## Model 1: ozone^(1/3) ~ radiation + temperature + wind
## Model 2: ozone^(1/3) ~ s(radiation) + s(temperature) + s(wind)
## Resid. Df Resid. Dev Df Deviance
## 1 107.000 27.848
## 2 98.879 20.578 8.1214 7.27
This produces a deviance which can be compared to a \(\chi^2\) random variate on df.residual(lmgam) - df.residual(gamfit) degrees of freedom. The following calculation gives the significance level:
d <- deviance(lmgam) - deviance(gamfit)
df <- df.residual(lmgam) - df.residual(gamfit)
SL <- 1 - pchisq(d, df=df)
# Significance level is
SL
## [1] 0.2866793
This is large enough that there is no evidence against the hypothesis that all three additive components are strictly linear.