
Preliminaries on notation and nomenclature

Throughout these notes:


Linear regression

Recall the linear regression model: y_i = {\bf x}_i^T {\boldsymbol \beta} + r_i \hspace{10em} (1) with {\bf x}_i \mbox{ and } {\boldsymbol \beta} \in \Re^p for all i =1,\ldots, n. Here y_i is the i’th observed value of the response variate y, {\bf x}_i is the i’th observed value of a p \times 1 column vector of explanatory variates, {\boldsymbol \beta} = (\beta_1, \ldots, \beta_p)^T is a p \times 1 column vector of unknown parameters, and r_i is the i’th realized residual (realized in that it has a value even though it has not been observed in contrast with y_i which has been both realized and observed).

Having realized (and observed) values for y_1, y_2, \ldots, y_n and \bf{x}_1, \bf{x}_2, \ldots, \bf{x}_n, we use some method (e.g. least-squares) to arrive at an estimate of the unknown parameter vector {\boldsymbol \beta}.
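
For later reference, least-squares chooses \widehat{{\boldsymbol \beta}} to minimize the sum of squared residuals \sum_{i=1}^n r_i^2 = \sum_{i=1}^n \left(y_i - {\bf x}_i^T {\boldsymbol \beta}\right)^2. Writing {\bf X} for the n \times p matrix whose ith row is {\bf x}_i^T and {\bf y} = (y_1, \ldots, y_n)^T, the minimizer (whenever {\bf X}^T{\bf X} is invertible) is \widehat{{\boldsymbol \beta}} = \left({\bf X}^T{\bf X}\right)^{-1}{\bf X}^T{\bf y}.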

An example: Swiss fertility data

For example, consider the swiss data from R. This data consists of a standardized fertility measurement and five other standardized socio-economic measurements taken on each of 47 different French-speaking Swiss cantons/districts in 1888. Around this time Switzerland was entering a demographic transition period in which its fertility was beginning to drop from the high level typical of underdeveloped countries.

The variates are:

  • Fertility, here a ‘common standardized fertility measure’ denoted `Ig’ and scaled to be in [0, 100],

  • Agriculture, as % of males involved in agriculture as occupation,

  • Examination, as % of draftees receiving the highest mark on the army examination,

  • Education, as % of draftees with education beyond primary school,

  • Catholic, as % of ‘catholics’ (as opposed to ‘protestants’) in the canton, and

  • Infant.Mortality, a rate: the number of live births who die before reaching 1 year of age, per 1,000 live births.

Source info https://opr.princeton.edu/archive/pefp/switz.aspx

The first few rows of the data look like:

head(swiss)
##              Fertility Agriculture Examination Education Catholic
## Courtelary        80.2        17.0          15        12     9.96
## Delemont          83.1        45.1           6         9    84.84
## Franches-Mnt      92.5        39.7           5         5    93.40
## Moutier           85.8        36.5          12         7    33.77
## Neuveville        76.9        43.5          17        15     5.16
## Porrentruy        76.1        35.3           9         7    90.57
##              Infant.Mortality
## Courtelary               22.2
## Delemont                 22.2
## Franches-Mnt             20.2
## Moutier                  20.3
## Neuveville               20.6
## Porrentruy               26.6

A straight line model

Interest might reasonably lie in whether values of the fertility index are related to values of infant mortality. That being the case, we might plot pairs (x_i, y_i) for i=1, \ldots, 47 where x_i is the infant mortality and y_i the fertility for canton/district i. This would look like:

with(swiss,
     plot(x=Infant.Mortality, y=Fertility)
     )

A straight line could be fitted to these data using least-squares and the fitted line then added to the plot as follows:

# Construct the fit of the linear model (using lm)
myfit <- lm(Fertility ~ Infant.Mortality, data=swiss)

# Show the value of `myfit'
myfit
## 
## Call:
## lm(formula = Fertility ~ Infant.Mortality, data = swiss)
## 
## Coefficients:
##      (Intercept)  Infant.Mortality  
##           34.515             1.786
# plot the data
with(swiss,
     plot(x=Infant.Mortality, y=Fertility)
     )


# `myfit' actually has a lot of named components
names(myfit)
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"
# We can use the `coefficients' component to
# add the fitted line to the existing plot
abline(myfit$coefficients, col="grey")

With respect to equation (1), we have n=47 for the 47 points (cantons/districts) and p=2 with i=1 corresponding to the “Courtelary” district/canton having y_1=80.2 and {\bf x}_1 = (1,~ 22.2)^T.

Note: in the code above, specification of equation (1) is simply Fertility ~ Infant.Mortality. No parameters are mentioned, just the response variate on the left of the tilde and the explanatory variates on the right. Moreover, the intercept is not mentioned at all. This is because an intercept term is nearly always wanted and so is included by default. If one did not want an intercept term, it would have to be explicitly removed from the specification, as in Fertility ~ Infant.Mortality - 1. Note that 1 is the constant value of the explanatory variate associated with the intercept parameter.
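
As a small sketch (using only functions already shown), the two specifications can be compared directly; only the coefficients are printed here:

# default: intercept included
coef(lm(Fertility ~ Infant.Mortality, data=swiss))
# intercept explicitly removed with `- 1'
coef(lm(Fertility ~ Infant.Mortality - 1, data=swiss))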

The least-squares estimate \widehat{{\boldsymbol \beta}} of the parameter vector {\boldsymbol \beta} is \widehat{{\boldsymbol \beta}} = \left(\begin{array}{c} \widehat{\beta}_1\\ \\ \widehat{\beta}_2 \end{array}\right) = \left(\begin{array}{c} 34.515\\ \\ 1.786 \end{array}\right)
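
These values can also be computed directly from the normal equations, \widehat{{\boldsymbol \beta}} = ({\bf X}^T{\bf X})^{-1}{\bf X}^T{\bf y}. A small sketch, where X and y below are temporary names for the model matrix (an intercept column and Infant.Mortality) and the response:

# least-squares estimate computed directly from the normal equations
X <- cbind(1, swiss$Infant.Mortality)   # 47 x 2 model matrix
y <- swiss$Fertility
solve(t(X) %*% X, t(X) %*% y)           # should match coef(myfit)
coef(myfit)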

From these estimates and using equation (1), the estimated residuals, \widehat{r}_i, are given as \begin{array}{rcl} \widehat{r}_i &=& y_i - {\bf x}_i^T \widehat{{\boldsymbol \beta}} \\ &&\\ &=& y_i - (34.515 + 1.786 \times x_i)\\ \end{array} where here x_i denotes the Infant.Mortality for the i’th canton/district and y_i its Fertility.
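
As a quick check (a small sketch; beta_hat and r_hat are temporary names), the residuals computed by hand from these estimates agree with those stored in the fit:

# residuals computed from the estimated coefficients
beta_hat <- coef(myfit)
r_hat <- swiss$Fertility - (beta_hat[1] + beta_hat[2]*swiss$Infant.Mortality)
all.equal(unname(r_hat), unname(residuals(myfit)))   # should be TRUE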

The entire set of estimated residuals may be had from the fit and then viewed in a histogram as follows:

resids <- myfit$residuals

hist(resids)

Alternatively, a better view might be that of a density estimate:

dens <- density(resids)
plot(dens, main="An estimate of the residual density")

One might ask what sort of distribution the estimated residuals appear to be a sample from. Could it be a normal (or Gaussian) distribution?

One way to find out would be to look at a normal quantile-quantile plot:

qqnorm(resids)

Thoughts? What are these things?

Here’s another picture of a qqplot from the R package qqtest (you might have to install it first, e.g. using install.packages("qqtest"); and afterwards try help(qqtest) ).

require(qqtest)
## Loading required package: qqtest
qqtest(resids)

Note the differences. Now what can you say about the distribution of the estimated residuals?

The fitted-model object myfit contains a good deal more information.

summary(myfit)
## 
## Call:
## lm(formula = Fertility ~ Infant.Mortality, data = swiss)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.672  -5.687  -0.381   7.239  28.565 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)   
## (Intercept)       34.5155    11.7113   2.947  0.00507 **
## Infant.Mortality   1.7865     0.5812   3.074  0.00359 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.48 on 45 degrees of freedom
## Multiple R-squared:  0.1735, Adjusted R-squared:  0.1552 
## F-statistic: 9.448 on 1 and 45 DF,  p-value: 0.003585

So … what is all this … stuff?

And what do we do with it?

A generative probability model

What model assumptions are needed for the above output to be interpreted properly?

Equation (1) was about the realized (though possibly unknown) values y_i = {\bf x}_i^T {\boldsymbol \beta} + r_i \hspace{10em} (1) We also want a model for how the response was generated: Y_i ={\bf x}_i^T {\boldsymbol \beta} + R_i \hspace{10em} (2) for i= 1, \ldots, n.

Here, the upper case notation indicates that Y_i is a random variate representing the set of possible values which the i’th response might take. The lower case y_i is a number, a realized value from the random distribution of Y_i. So too for R_i and r_i.

Note that n, \bf{x}_i and {\boldsymbol \beta} are not random variates in this model.

Further model assumptions often include:

  • E(R_i) = 0 for all i=1, \ldots, n (common and zero residual mean)

  • SD(R_i) = \sigma for all i=1, \ldots, n (common constant residual standard deviation)

  • R_1, R_2, \ldots, R_n are independently distributed

  • R_1, R_2, \ldots, R_n are identically distributed

  • R_i is normally distributed (or, equivalently, distributed as Gaussian)

Putting this all together, the normal or Gaussian linear model can be written as \begin{array}{rcl} Y_i &=&{\bf x}_i^T {\boldsymbol \beta} + R_i \hspace{10em} (3) \\ && \\ \mbox{and} & & ~~ R_i \sim N(0,\sigma^2) ~~\mbox{independently} \end{array} for i= 1, \ldots, n.

Alternatively, an equivalent expression of this generative model, one which simply emphasizes the mean structure, is \begin{array}{rcl} Y_i &=&\mu_i + R_i \hspace{10em} (4) \\ && \\ \mu_i &=&{\bf x}_i^T {\boldsymbol \beta} \\ && \\ \mbox{and} & & ~~ R_i \sim N(0,\sigma^2) ~~\mbox{independently} \end{array} for i= 1, \ldots, n.
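
A small simulation sketch can make the distinction between the random Y_i and the realized y_i concrete. The parameter values used below (intercept 35, slope 1.8, and \sigma = 11) are made up purely for illustration; the x values are taken from the Swiss data.

# simulate one realization of the responses from the generative model
set.seed(314)
x <- swiss$Infant.Mortality
y_sim <- 35 + 1.8 * x + rnorm(length(x), mean=0, sd=11)   # made-up "true" values
# compare the estimates with the values used to generate y_sim
coef(lm(y_sim ~ x))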

The straight line model and interpretability of parameters

When the model is simply a straight line we have p=2, {\bf x}_i^T =(1, x_i), and the vector of unknown linear parameters {\boldsymbol \beta} is often written as {\boldsymbol \beta}^T = (\beta_0, \beta_1).

This lets us write the model simply as \begin{array}{rcl} Y_i &=&\mu_i + R_i \hspace{10em} (5) \\ && \\ \mu_i &=& \beta_0 + \beta_1 {x}_i \\ && \\ \mbox{and} & & ~~ R_i \sim N(0,\sigma^2) ~~\mbox{independently} \end{array} for i= 1, \ldots, n.

Questions

  • What does \beta_1 mean in this model? What does it mean for the straight line model for the Swiss fertility data?

  • What does \beta_0 mean? In terms of the straight line model for the fertility data?

We can always reparameterize the straight line model as follows: \begin{array}{rcl} Y_i &=&\mu_i + R_i \hspace{10em} (6) \\ && \\ \mu_i &=& \alpha + \gamma \left({x}_i - \bar{x} \right) \\ && \\ \mbox{and} & & ~~ R_i \sim N(0,\sigma^2) ~~\mbox{independently} \end{array} for i= 1, \ldots, n and where \bar{x} = \sum x_i / n is the arithmetic mean of the x_is.
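
This reparameterization is easy to fit directly; a small sketch for the Swiss data (the I(...) wrapper just constructs the centred explanatory variate, and fit_centred is a temporary name):

# fit the centred straight line model
xbar <- mean(swiss$Infant.Mortality)
fit_centred <- lm(Fertility ~ I(Infant.Mortality - xbar), data=swiss)
# the slope (gamma) is unchanged; the intercept (alpha) is the fitted mean at xbar
coef(fit_centred)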

Questions

  • What does \gamma mean in this model? What does it mean for the straight line model for the Swiss fertility data?

  • What does \alpha mean? In terms of the straight line model for the fertility data?

On the multivariate normal distribution

Building the standard multivariate Gaussian or normal

Suppose, then, that we have p independently distributed standard Gaussian random variates Z_1, Z_2, \ldots, Z_p. That is, Z_i \sim N(0,1) independently of one another for i=1, 2, \ldots, p.

Denote the marginal density as \phi_{Z_i}(z_i) for each i, then the joint density is \phi_{\mathbf{Z}}(\mathbf{z}) = \phi_{Z_1} (z_1) \times \phi_{Z_2} (z_2) \times \cdots \times \phi_{Z_p}(z_p) Recalling that for the standard Gaussian, \phi_{Z}(z) = \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}z^2} we have \begin{array}{rcl} \phi_{\mathbf{Z}}(\mathbf{z}) & = & \left(\frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}z_1^2}\right) \times \left(\frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}z_2^2}\right) \times \cdots \times \left(\frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}z_p^2}\right) \\ &&\\ & = & \frac{1}{(2\pi)^{p/2}}e^{-\frac{1}{2} \sum_{i=1}^p z_i^2}\\ &&\\ & = & \frac{1}{(2\pi)^{p/2}}e^{-\frac{1}{2} \|\mathbf{z}\|^2}\\ &&\\ & = & \frac{1}{(2\pi)^{p/2}}e^{-\frac{1}{2} \mathbf{z}^T\mathbf{z}} \end{array}

So what does this look like?

The p vector of random variates \mathbf{Z} is said to have a standard p-dimensional Gaussian distribution, written \mathbf{Z} \sim N_p(\mathbf{0}, \mathbf{I}_p), if its density is \phi(\mathbf{z}) = \frac{1}{(2\pi)^{p/2} } e ^{-\frac{1}{2}(\mathbf{z}^T\mathbf{z})} for all \mathbf{z} \in \Re^p.

Question: What would this function look like (say when p=2)?

Or, what would contours of this function look like when p=2? For larger p?

#################
#
#  
#  A bivariate normal density
#
#
#  Begin with a standard bivariate Gaussian density:
#  They're independent so we can just 
#  multiply the densities

phi <- function(x,y){dnorm(x)*dnorm(y)}

#
#  We need to get a grid of coordinates
m <- 150
n <- 150
x <- seq(-3,3,length.out = m)
y <- seq(-3,3,length.out = n)
z<- matrix(0,m,n)
for (i in 1:m) {for (j in 1:n) {
                    z[i,j]<-  phi(x[i],y[j])}}


#
plot_colour <- "darkgrey"
#
contour(x,y,z, xlim=c(-3,3),ylim=c(-3,3),
        nlev=5, labcex=1, col=plot_colour, 
        lwd=2, asp=1
                )

#  increase the number of contour levels
contour(x,y,z, xlim=c(-3,3),ylim=c(-3,3),
        nlev=15, labcex=1, col=plot_colour, 
        lwd=2, asp=1
                )

Transforming \mathbf{Z}

What happens if we transform \mathbf{Z}?

We can build a function in R for a bivariate normal density which we will call dBivnorm (to be revealed in a bit). We can show the effect of various affine transformations of \mathbf{Z}.

Rescale: \mathbf{X} = \mathbf{D}\mathbf{Z} where \mathbf{D} is a p \times p diagonal matrix of positive scales.
E.g. \mathbf{D} = \left( \begin{array}{cc} 3 & 0 \\ 0 & 1 \end{array} \right)

makes the density of \mathbf{X} become:

m <- n <- 100
x <- seq(-5,5,length.out = m)
y <- seq(-5,5,length.out = n)

# 
z<- matrix(0,m,n)
for (i in 1:m) {for (j in 1:n) {
                    z[i,j]<-  dBivnorm(x[i],y[j])}}
#
# This should be as before
#
contour(x,y,z, xlim=range(x), ylim=range(y),
        nlev=10, labcex=1, col=plot_colour,
        lwd=2, asp=1
                )
points(0,0, pch=19, cex=2, col= plot_colour)

# Add scale
z<- matrix(0,m,n)
for (i in 1:m) {
    for (j in 1:n) {
        z[i,j]<- dBivnorm(x[i],y[j],  
                         scales=c(3,1))}}
                    

contour(x,y,z, xlim=range(x), ylim=range(y),
        nlev=10, labcex=1, col=plot_colour,
        lwd=2, asp=1
                )
points(0,0, pch=19, cex=2, col= plot_colour)

Rotate

Rescale then apply a rotation matrix \mathbf{R}: \mathbf{X} = \mathbf{R} \mathbf{D}\mathbf{Z} where \mathbf{R} is a p \times p rotation matrix, e.g. \mathbf{R} = \left( \begin{array}{cr} \cos(\theta) & -\sin(\theta) \\ \sin(\theta) & \cos(\theta) \end{array} \right) with \theta = \frac{\pi}{4} makes the density of \mathbf{X} become:

# Add scale and rotation
z<- matrix(0,m,n)
for (i in 1:m) {
    for (j in 1:n) {
        z[i,j]<- dBivnorm(x[i],y[j],  
                         scales=c(3,1),
                         angle=45)}}
                    

contour(x,y,z, xlim=range(x), ylim=range(y),
        nlev=10, labcex=1, col=plot_colour,
        lwd=2, asp=1
                )
points(0,0, pch=19, cex=2, col= plot_colour)

Rescale, rotate, and shift locations: \mathbf{X} = \mathbf{\mu} + \mathbf{R} \mathbf{D}\mathbf{Z} where \mathbf{\mu} is a p \times 1 vector, e.g. \mathbf{\mu} = (1,2)^T makes the density of \mathbf{X} finally become:

# Add scale  and rotation and location
z<- matrix(0,m,n)
for (i in 1:m) {
    for (j in 1:n) {
        z[i,j]<- dBivnorm(x[i],y[j], 
                         loc=c(1,2), 
                         scales=c(3,1),
                         angle=45)}}
                    

contour(x,y,z, xlim=range(x), ylim=range(y),
        nlev=10, labcex=1, col=plot_colour,
        lwd=2, asp=1
      )
points(1,2, pch=19, cex=2, col= plot_colour)

which suggests a means of generating realizations from the random variate vector \mathbf{X} beginning with \mathbf{Z} whose coordinates are independently distributed as N(0,1) random variates.

To get the density of \mathbf{X} from that of \mathbf{Z} via a change of variables (multivariable calculus), we rewrite this as \mathbf{Z} = \mathbf{D}^{-1} \mathbf{R}^T(\mathbf{X} - \mathbf{\mu}) and so \mathbf{Z}^T\mathbf{Z} = (\mathbf{X} - \mathbf{\mu})^T\mathbf{R}\mathbf{D}^{-2}\mathbf{R}^T(\mathbf{X} - \mathbf{\mu}) = (\mathbf{X} - \mathbf{\mu})^T\Sigma^{-1}(\mathbf{X} - \mathbf{\mu}) ~~~ \mbox{, say,} where \Sigma = RD^2R^T is the eigen decomposition of the variance-covariance matrix of \mathbf{X}. Putting this together with the Jacobian of the transformation (the absolute value of the determinant of \mathbf{D}^{-1}\mathbf{R}^T, which is |\Sigma|^{-1/2}) yields the general multivariate normal density: \phi(\mathbf{X}) = \frac{1}{(2\pi)^{p/2} |\Sigma|^\frac{1}{2}} e ^{-\frac{1}{2}(\mathbf{X}-\mathbf{\mu})^T\Sigma^{-1} (\mathbf{X}-\mathbf{\mu})}
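
A small numerical sketch of this formula: evaluate the general density directly at one arbitrary point and compare it with the value obtained from the construction \mathbf{Z} = \mathbf{D}^{-1}\mathbf{R}^T(\mathbf{X} - \mathbf{\mu}). The scales, angle, and location are the illustrative choices used above; the point itself is arbitrary.

# check the general bivariate normal density against the construction
theta <- pi/4
R <- matrix(c(cos(theta), -sin(theta),
              sin(theta),  cos(theta)), nrow=2, byrow=TRUE)
D <- diag(c(3, 1))
Sigma <- R %*% D %*% D %*% t(R)           # Sigma = R D^2 R^T
mu <- c(1, 2)
x <- c(2, 0)                              # an arbitrary point
q <- t(x - mu) %*% solve(Sigma) %*% (x - mu)
(1/(2*pi*sqrt(det(Sigma)))) * exp(-q/2)   # the formula above with p = 2
z <- solve(D) %*% t(R) %*% (x - mu)
prod(dnorm(z)) / abs(det(D))              # standard density divided by the Jacobian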

So here’s that R function

#
# General bivariate Gaussian density:
#
dBivnorm <- function(x,y=NULL, 
                    loc=c(0,0),
                    scales=c(1,1),
                    angle=0,
                    degrees=TRUE){
                
        # y is optional, if NULL then assume x has 2 columns
        if (!is.null(y)){
               data <- cbind(x,y)
           } else {
            data <- x
           }
           
           
        # remove the locations (centre the data on origin)
        data <- sweep(data, 2, loc)
        
        # does a rotation need to be undone?
        if (degrees) {
               # convert to radians
               angle <- (2*angle/ 360)*pi
        }
        if ((angle %% (2*pi)) != 0) {
                cosine <- cos(angle)
                sine <- sin(angle)
                rot <- matrix(c(cosine, -sine,
                                 sine, cosine),
                               # right-multiplying the rows of `data'
                               # by this matrix undoes the rotation
                               byrow=TRUE,
                               nrow=2)
                                
                 data <- data %*% rot             
                 
             }
         # undo any scaling
         for (i in 1:length(scales)){
            if (scales[i]==0) {
                stop("scale cannot be zero")
                }
            if (scales[i]!=1) {
             data[,i] <- data[,i]/scales[i]
                }
         }
        # data has now been transformed back to standard
        # bivariate normal coordinates; dividing by the
        # Jacobian |det(D)| = prod(abs(scales)) gives the
        # density of X
        return(dnorm(data[,1])*dnorm(data[,2])/prod(abs(scales)))
        }

On the construction

Note that this construction can be used more generally.

  • Rescale, rotate, and shift locations: \mathbf{X} = \mathbf{\mu} + \mathbf{R} \mathbf{D}\mathbf{Z} has the same effect as with the Gaussian, just start with any joint distribution of \mathbf{Z}.

  • The eigen decomposition of the (variance-covariance) matrix \Sigma = \mathbf{R} \mathbf{D}^2\mathbf{R}^T gives information about the principal directions of the orientation (the rotation \mathbf{R}) of the data and the stretching (the scaling \mathbf{D}) along these principal directions.
    • The eigenvalues provide the scalings (the scales in \mathbf{D} are the square roots of the eigenvalues); the eigenvectors provide the directions of the principal axes. A small simulation check follows below.
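
Here is the promised simulation check, a small sketch using the same illustrative scales 3 and 1, angle \pi/4, and location (1,2)^T as before:

# generate realizations of X = mu + R D Z and compare the sample
# variance-covariance matrix (and its eigen decomposition) with Sigma = R D^2 R^T
set.seed(123)
theta <- pi/4
R <- matrix(c(cos(theta), -sin(theta),
              sin(theta),  cos(theta)), nrow=2, byrow=TRUE)
D <- diag(c(3, 1))
Sigma <- R %*% D %*% D %*% t(R)
Z <- matrix(rnorm(2 * 10000), nrow=2)     # standard bivariate normal Z
X <- c(1, 2) + R %*% D %*% Z              # columns are realizations of X
round(var(t(X)), 2)                       # should be close to Sigma
round(Sigma, 2)
sqrt(eigen(Sigma)$values)                 # recovers the scales 3 and 1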

Linear combinations of independent Gaussian random variates

Recall that if Z_1, Z_2, \ldots, Z_n are independent Gaussian random variables, say Z_i \sim N(\mu_i, \sigma_i^2), then for any real-valued fixed weights, w_1, w_2, \ldots, w_n the distribution of the weighted sum: \sum_{i=1}^n w_i Z_i is also Gaussian. We need only determine its mean and variance. The first is simply \begin{array}{rcl} E\left(\sum_{i=1}^n w_i Z_i \right) &=& \sum_{i=1}^n E\left(w_i Z_i \right)\\ && \\ &=& \sum_{i=1}^n w_i E\left(Z_i \right)\\ && \\ &=& \sum_{i=1}^n w_i \mu_i\\ \end{array} The variance is similarly determined: \begin{array}{rcl} Var\left(\sum_{i=1}^n w_i Z_i \right) &=& \sum_{i=1}^n Var\left(w_i Z_i \right) \hspace{10em} \mbox{Why?}\\ && \\ &=& \sum_{i=1}^n w_i^2 Var\left(Z_i \right) \hspace{10em} \mbox{Why?}\\ && \\ &=& \sum_{i=1}^n w_i^2 \sigma^2_i. \\ \end{array}

So given these mathematical assumptions we have that \sum w_i Z_i \sim N(\sum w_i\mu_i, ~~ \sum w_i^2\sigma_i^2).

This is sometimes more simply written by introducing vectors {\bf w}^T = (w_1, \ldots, w_n), {\bf Z}^T = (Z_1, \ldots, Z_n), and {\boldsymbol \mu}^T = (\mu_1, \ldots, \mu_n). The result on expectation can now be expressed as \begin{array}{rcl} E \left({\bf w}^T {\bf Z}\right) &=& {\bf w}^T E\left( {\bf Z}\right)\\ && \\ &=& {\bf w}^T {\boldsymbol \mu}\\ \end{array} Similarly, Var\left({\bf w}^T {\bf Z}\right) = {\bf w}^T Var\left( {\bf Z}\right) {\bf w} where the middle term on the right denotes the n \times n variance-covariance matrix of {\bf Z} = (Z_1, \ldots, Z_n)^T: Var\left({\bf Z}\right) = \left(\begin{array}{ccccc} \sigma_1^2 & 0 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 & 0 & \cdots & 0 \\ 0 & 0 & \sigma_3^2 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & \sigma_n^2 \\ \end{array} \right) The (i,j)th off-diagonal entries are the covariance terms \sigma_{ij} = Cov(Z_i, Z_j) and are all identically zero since Z_i and Z_j are independently distributed for i \ne j. (Note that \sigma_{ii} = \sigma_i^2.) In this case, we will sometimes write Var \left({\bf Z} \right) = diag(\sigma_1^2, \ldots, \sigma_n^2) to indicate this diagonal matrix.
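
A quick simulation sketch of this result, using arbitrary made-up weights, means, and standard deviations:

# check the mean and variance of w^T Z by simulation
set.seed(42)
w  <- c(2, -1, 0.5)    # arbitrary weights
mu <- c(1, 0, 3)       # arbitrary means
s  <- c(1, 2, 0.5)     # arbitrary standard deviations
wZ <- replicate(10000, sum(w * rnorm(3, mean=mu, sd=s)))
c(mean(wZ), var(wZ))             # simulated mean and variance
c(sum(w*mu), sum(w^2 * s^2))     # theoretical mean and variance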

Questions

  • What does the density of {\bf w}^T {\bf Z} look like? Write it down, then re-express it in terms only of the above vectors and matrix.

  • What does the joint density of Z_1, Z_2, \ldots, Z_n look like? Write it down, then re-express it in terms only of the above vectors and matrix.

  • What does the likelihood function look like for the unknown parameters of the Gaussian linear model?

This distributional result is at the core of the derivation of all distributions associated with the Gaussian linear model.

Warning on relying only on summary statistics

Using a probability model, it becomes possible to produce (for example):

  • maximum likelihood estimates

  • test statistics and their distributions

  • confidence intervals

  • prediction intervals

The output of a statistical package usually shows some or all of the above information based on the particular data and model.
Here it is again for our straight line fit:

summary(myfit)
## 
## Call:
## lm(formula = Fertility ~ Infant.Mortality, data = swiss)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.672  -5.687  -0.381   7.239  28.565 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)   
## (Intercept)       34.5155    11.7113   2.947  0.00507 **
## Infant.Mortality   1.7865     0.5812   3.074  0.00359 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.48 on 45 degrees of freedom
## Multiple R-squared:  0.1735, Adjusted R-squared:  0.1552 
## F-statistic: 9.448 on 1 and 45 DF,  p-value: 0.003585

But is it enough? The output just shown is typical of that from many statistical packages. It shows a number of statistics that summarize the linear model just fitted. The statistics shown are chosen because they proved useful in the past in interpreting fitted models like these.

Note that this set of summaries (or one very similar to it) is chosen largely because of tradition and, though interesting (and useful), these summaries cannot tell the whole story of the model and its interpretation.

For example, consider the following summary of a straight line fit:

with(anscombe,
     {
       fit1 <- lm(y1 ~ x1)
       summary(fit1)
     }
     )
## 
## Call:
## lm(formula = y1 ~ x1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.92127 -0.45577 -0.04136  0.70941  1.83882 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   3.0001     1.1247   2.667  0.02573 * 
## x1            0.5001     0.1179   4.241  0.00217 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared:  0.6665, Adjusted R-squared:  0.6295 
## F-statistic: 17.99 on 1 and 9 DF,  p-value: 0.00217

What can you say about this fitted line?

How about this one?:

with(anscombe,
     {
       fit2 <- lm(y2 ~ x2)
       summary(fit2)
     }
     )
## 
## Call:
## lm(formula = y2 ~ x2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9009 -0.7609  0.1291  0.9491  1.2691 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.001      1.125   2.667  0.02576 * 
## x2             0.500      0.118   4.239  0.00218 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared:  0.6662, Adjusted R-squared:  0.6292 
## F-statistic: 17.97 on 1 and 9 DF,  p-value: 0.002179

This one?:

with(anscombe,
     {
       fit3 <- lm(y3 ~ x3)
       summary(fit3)
     }
     )
## 
## Call:
## lm(formula = y3 ~ x3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1586 -0.6146 -0.2303  0.1540  3.2411 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   3.0025     1.1245   2.670  0.02562 * 
## x3            0.4997     0.1179   4.239  0.00218 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.236 on 9 degrees of freedom
## Multiple R-squared:  0.6663, Adjusted R-squared:  0.6292 
## F-statistic: 17.97 on 1 and 9 DF,  p-value: 0.002176

And finally, this one?:

with(anscombe,
     {
       fit4 <- lm(y4 ~ x4)
       summary(fit4)
     }
     )
## 
## Call:
## lm(formula = y4 ~ x4)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.751 -0.831  0.000  0.809  1.839 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   3.0017     1.1239   2.671  0.02559 * 
## x4            0.4999     0.1178   4.243  0.00216 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.236 on 9 degrees of freedom
## Multiple R-squared:  0.6667, Adjusted R-squared:  0.6297 
## F-statistic:    18 on 1 and 9 DF,  p-value: 0.002165

Comments? Thoughts?

What should we do?

Look at residuals?

with(anscombe,
     { 
       fit1 <- lm(y1 ~ x1)
       qqtest(fit1$residuals, main="Anscombe data 1")
     }
     )

Comments?

with(anscombe,
     { 
       fit2 <- lm(y2 ~ x2)
       qqtest(fit2$residuals, main="Anscombe data 2")
     }
     )

Comments?

with(anscombe,
     { 
       fit3 <- lm(y3 ~ x3)
       qqtest(fit3$residuals, main="Anscombe data 3")
     }
     )

Comments?

with(anscombe,
     { 
       fit4 <- lm(y4 ~ x4)
       qqtest(fit4$residuals, main="Anscombe data 4")
     }
     )

Comments?

How about some other residual plots?

with(anscombe,
     { 
       fit1 <- lm(y1 ~ x1)
       plot(x1,fit1$residuals, 
            pch=19,
            col="steelblue",
            ylab="Estimated residuals",
            main="Anscombe data 1")
     }
     )

Comments?

with(anscombe,
     { 
       fit2 <- lm(y2 ~ x2)
       plot(x2,fit2$residuals,  
            pch=19,
            col="steelblue",
            ylab="Estimated residuals",
            main="Anscombe data 2")
     }
     )

Comments?

with(anscombe,
     { 
       fit3 <- lm(y3 ~ x3)
       plot(x3,fit3$residuals,  
            pch=19,
            col="steelblue",
            ylab="Estimated residuals",
            main="Anscombe data 3")
     }
     )

Comments?

with(anscombe,
     { 
       fit4 <- lm(y4 ~ x4)
       plot(x4,fit4$residuals,  
            pch=19,
            col="steelblue",
            ylab="Estimated residuals",
            main="Anscombe data 4")
     }
     )

Comments?

Plotting the original data reveals the differences:

par(mfrow=c(2,2))    # set up a 2 by 2 grid of plots

with(anscombe,
     { # The first data set
       plot(x1, y1,  
            pch=19,
            col="steelblue",
            main="Anscombe data 1")
       fit1 <- lm(y1 ~ x1)
       abline(fit1$coefficients, col="grey")
       
       # the second
       plot(x2, y2,  
            pch=19,
            col="steelblue",
            main="Anscombe data 2")
       fit2 <- lm(y2 ~ x2)
       abline(fit2$coefficients, col="grey")
       
       # the third
       plot(x3, y3,  
            pch=19,
            col="steelblue",
            main="Anscombe data 3")
       fit3 <- lm(y3 ~ x3)
       abline(fit3$coefficients, col="grey")
       
       # the fourth
       plot(x4, y4,  
            pch=19,
            col="steelblue",
            main="Anscombe data 3")
       fit4 <- lm(y4 ~ x4)
       abline(fit4$coefficients, col="grey")
     }
     )

How might we have changed our approach to each of these data sets?

How would this change be reflected in equation (1)?

We are always going to want to assess the quality of our model against the observed data.

And to adjust our model in light of the evidence.

Back to the Swiss fertility data

Let’s look at the residuals from our initial fit versus the explanatory variate Infant.Mortality:

plot(x=swiss$Infant.Mortality, y=resids,
     xlab="Infant mortality")
abline(h=0, col="grey")  # add a horizontal line at 0

Comments?

We could also fit Fertility as a function of all of the remaining variates:

fit <- lm(Fertility ~ Infant.Mortality + 
                      Agriculture + 
                      Examination + 
                      Education + 
                      Catholic,
          data=swiss)

summary(fit)
## 
## Call:
## lm(formula = Fertility ~ Infant.Mortality + Agriculture + Examination + 
##     Education + Catholic, data = swiss)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.2743  -5.2617   0.5032   4.1198  15.3213 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      66.91518   10.70604   6.250 1.91e-07 ***
## Infant.Mortality  1.07705    0.38172   2.822  0.00734 ** 
## Agriculture      -0.17211    0.07030  -2.448  0.01873 *  
## Examination      -0.25801    0.25388  -1.016  0.31546    
## Education        -0.87094    0.18303  -4.758 2.43e-05 ***
## Catholic          0.10412    0.03526   2.953  0.00519 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.165 on 41 degrees of freedom
## Multiple R-squared:  0.7067, Adjusted R-squared:  0.671 
## F-statistic: 19.76 on 5 and 41 DF,  p-value: 5.594e-10

Comments?

Checking the distributional assumption

qqtest(fit$residuals)

Looking for any remaining dependence on the mean structure (fitted values)

plot(x=fit$fitted.values, y=fit$residuals, 
     pch=19, col="steelblue",
     xlab="Fitted values", 
     ylab="Estimated residuals")
abline(h=0, col="grey")

Comments?

What might we look for in such a plot?

Added variable plots

Rather than plot the estimated residuals against each explanatory variate, we use added variable plots (also called partial regression plots).

If z is the explanatory variate of interest, we plot the estimated residuals from fitting the response to all other explanatory variates except z against the estimated residuals from fitting z as the response to the same set of explanatory variates.

In the plot, we should see whether there is any relationship between what remains of the response and what remains of z after the effects of all other explanatory variates are taken into account.

This plot should reveal the effect of adding the variable z to the model, hence the name added variable plot.

An added variable plot can be constructed for Infant.Mortality as follows:

# fit the model before but without Infant.Mortality
fit_y <- lm(Fertility ~ Agriculture +   
                        Examination + 
                        Education + 
                        Catholic,
          data=swiss)
resids_y <- fit_y$residuals

# Now fit Infant.Mortality as the response 
# **on the same explanatory variates**
fit_z <- lm(Infant.Mortality ~ Agriculture +   
                               Examination + 
                               Education + 
                               Catholic,
          data=swiss)
resids_z <- fit_z$residuals

#
# The added variable plot:
#
plot(x=resids_z, y=resids_y,
     pch=19, col="steelblue",
     main="Added Variable Plot",
     xlab="Infant.Mortality | others",
     ylab="Fertilty | others"
     )

An interesting consequence of least-squares fitting is that if we were to fit a straight line to the data in this plot, the estimated slope will be the same as that of Infant.Mortality in the model that includes all explanatory variates, the estimated intercept will be zero, and the estimated residuals will be identical to those of the full model.
This means that the t statistic will also be identical.

The plot itself, then, should tell whether there seems to be a reasonable linear fit (and hence whether z has anything to add to explaining the response over and above the other explanatory variates in the model).

This is easily verified empirically (up to rounding errors):

# Construct a new data set containing the two sets of residuals
avData <- data.frame(x=resids_z, y=resids_y)
avFit <- lm(y ~ x, data=avData )
summary(avFit)
## 
## Call:
## lm(formula = y ~ x, data = avData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.2743  -5.2617   0.5032   4.1198  15.3213 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 1.036e-15  9.976e-01   0.000  1.00000   
## x           1.077e+00  3.644e-01   2.956  0.00495 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.839 on 45 degrees of freedom
## Multiple R-squared:  0.1626, Adjusted R-squared:  0.144 
## F-statistic: 8.738 on 1 and 45 DF,  p-value: 0.004947
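
The claim about the estimated residuals can be checked directly as well (a small sketch):

# residuals from the added variable fit match those of the full model
all.equal(unname(avFit$residuals), unname(fit$residuals))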

And the plot can be supplemented with this fitted line:

plot(x=resids_z, y=resids_y,
     pch=19, col="steelblue",
     main="Added Variable Plot",
     xlab="Infant.Mortality | others",
     ylab="Fertilty | others"
     )
# add axes at (0,0)
abline(h=0, col="grey")
abline(v=0, col="grey")
# add the fitted line
abline(avFit$coefficients, col="red")

Fortunately there is a function avPlots in the R package called car (an acronym for “Companion to Applied Regression”) which will automatically construct all of the added variable plots at once. (Again, you may have to first install this package using install.packages("car").)

require("car")
## Loading required package: car
avPlots(fit, pch=19, col="steelblue")

Comments?

How does this compare to a simpler model? Say one without Examination?

fit1 <- lm(Fertility ~ Infant.Mortality + 
                       Agriculture + 
                       Education + 
                       Catholic,
          data=swiss)

How would we compare these?

Comparing nested models

Nested models can be compared using an F statistic. Suppose we have two fitted linear models M_{A} and M_{B}. We say that M_{B} is a submodel of M_{A} if it has the same response variate and the explanatory variates in M_{B} are a subset of those in M_{A}. We also say that M_{B} and M_{A} are nested, with M_{B} being nested within M_{A} (the former being a special case of the latter).

For linear models, a test statistic can be formed as \begin{array}{rcl} f &= & \left(\frac{SS_{fit}(M_{A}) - SS_{fit}(M_{B})} {df_{fit}(M_{A}) - df_{fit}(M_{B})} \right) \div \left(\frac{SS_{residual}(M_{A})} {df_{residual}(M_{A})} \right) \\ \\ &= & \left(\frac{SS_{residual}(M_{B}) - SS_{residual}(M_{A})} {df_{residual}(M_{B}) - df_{residual}(M_{A})} \right) \div \left(\frac{SS_{residual}(M_{A})} {df_{residual}(M_{A})} \right) \end{array} where SS(\cdot) and df(\cdot) denote the corresponding sum of squares and degrees of freedom, respectively.

When the realized residuals are independent realizations from a N(0,\sigma^2) distribution, the observed value of f is a realization from an F_{num,den} distribution where num and den are the numerator and denominator degrees of freedom respectively - namely num = df_{fit}(M_{A}) - df_{fit}(M_{B}) = df_{residual}(M_{B}) - df_{residual}(M_{A}) and den = df_{residual}(M_{A}).

The f value can be calculated as follows:

# for the full model
rss_A <- sum(fit$residuals^2)
rdf_A <- fit$df.residual

# for the nested model
rss_B <- sum(fit1$residuals^2)
rdf_B <- fit1$df.residual

# the statistic is
f <- ((rss_B - rss_A)/(rdf_B - rdf_A)) / (rss_A/rdf_A)
f
## [1] 1.0328

Large values of f indicate evidence against the hypothesis that the coefficients are zero for those explanatory variates appearing in M_A but not in M_B.

The observed significance level can be determined in R as

# use pf(.), the probability function
# (or cumulative distribution function)
# of an F distribution ... see help(pf)
sl <- 1-pf(f,              # value
           rdf_B - rdf_A,  # num df
           rdf_A)          # den df
sl
## [1] 0.3154617

Note that this is exactly the same significance level as we already had for the t-test of the coefficient of Examination in the full model.

Why?

And when would we be interested in using f instead?
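
One numerical connection worth noting (a small sketch reusing fit and f from above): when only a single explanatory variate is dropped, the F statistic equals the square of the corresponding t statistic, which is why the two significance levels agree.

# t statistic for Examination in the full model
tval <- summary(fit)$coefficients["Examination", "t value"]
tval^2    # equals f (about 1.03)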

Alternatively, R also supplies a simple analysis of variance function to compare these two models. Namely,

anova(fit, fit1)
## Analysis of Variance Table
## 
## Model 1: Fertility ~ Infant.Mortality + Agriculture + Examination + Education + 
##     Catholic
## Model 2: Fertility ~ Infant.Mortality + Agriculture + Education + Catholic
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     41 2105.0                           
## 2     42 2158.1 -1   -53.027 1.0328 0.3155

Prediction

Once you have a fitted model, it can be used to make predictions for different values of the explanatory variates. For the observed values of the explanatory variates, the predictions are just the fitted values.

plot(fit1$fitted.values,predict(fit1),
     pch=19, col="steelblue")

What does this mean then?

To predict at other values, the predict function must be handed a data frame that contains the values named just as in the original data.

predict(fit1, newdata=data.frame(Infant.Mortality = 25,
                                 Agriculture = 50,
                                 Education = 5,
                                 Catholic = 95)
        )
##        1 
## 88.27348

is the predicted Fertility.

Why should such predictions worry you?

How about?

predict(fit1, newdata=data.frame(Infant.Mortality = 25,
                                 Agriculture = 50,
                                 Education = 5,
                                 Catholic = 950)
        )
##        1 
## 194.8632

For simplicity, let’s go back to the first model we fitted (with only one explanatory variate, Infant.Mortality).

myfit <- lm(Fertility ~ Infant.Mortality, data=swiss)

# Get some new values for the range of possible Infant.Mortality values
newX <- 0:100
conf95 <- predict(myfit, interval="confidence",
                  newdata=data.frame(Infant.Mortality=newX))
plot(x=swiss$Infant.Mortality, 
     y=swiss$Fertility , 
     pch=19, col="steelblue",
     xlab="Infant mortality",
     # make sure the x range is right
     xlim=c(0,100),
     ylab="Fertility index Ig",
     # make sure the y range is right
     ylim=range(conf95)
     )

# Here are the predicted values (the means)
lines(newX, conf95[, "fit"],
      col="steelblue", lwd=2)

# The lower and upper 95% confidence intervals for the mean
lines(newX, conf95[, "lwr"],
      col = "grey", lwd=2,
      lty=2)
lines(newX, conf95[, "upr"],
      col = "grey", lwd=2,
      lty=2)

Comments?

How are these produced?

How about adding the corresponding prediction intervals?

conf95 <- predict(myfit, interval="confidence",
                  newdata=data.frame(Infant.Mortality=newX))
pred95 <- predict(myfit, interval="prediction",
                  newdata=data.frame(Infant.Mortality=newX))
plot(x=swiss$Infant.Mortality, 
     y=swiss$Fertility , 
     pch=19, col="steelblue",
     # make sure the x range is right
     xlab="Infant Mortality",
     xlim=c(0,100),
     ylab="Fertility index Ig",
     # make sure the y range is right
     ylim=range(c(range(conf95), 
                  range(pred95))
                )
     )

# Here are the predicted values (the means)
lines(newX, conf95[, "fit"],
      col="steelblue", lwd=2)

# The lower and upper 95% confidence intervals for
# the mean at that x
lines(newX, conf95[, "lwr"],
      col = "grey10", lwd=2,
      lty=2)
lines(newX, conf95[, "upr"],
      col = "grey10", lwd=2,
      lty=2)

# The lower and upper 95% prediction intervals for 
# the value of a new realization at that x
lines(newX, pred95[, "lwr"],
      col = "grey50", lwd=2,
      lty=3)
lines(newX, pred95[, "upr"],
      col = "grey50", lwd=2,
      lty=3)

legend(0,300,
       legend = c("fitted mean", "95% conf. interval for mean",
                  "95% prediction interval"),
       col=c("steelblue","grey10", "grey50"),
       lty=c(1,2,3),
       lwd=2
      )

Comments?


Aside: More info on the fertility index Ig http://ssh.dukejournals.org/content/25/4/589.full.pdf+html

From this source:

“If and Ig measure how closely any population comes to matching the fertility of the Hutterites, an Anabaptist sect from the upper Midwestern United States and Lower Canada who practiced universal marriage and no birth control in the early twentieth century. Interpretively, If and Ig, which range from 0 to 1, may be read as the proportions of Hutterite general and marital fertility that another population achieves. Thus an Ig of .532 indicates that the recorded marital fertility of the population under study is 53.2% of Hutterite marital fertility. Practically, Igs of .7 to .8 are considered high and Igs of .3 to .4 are low; values of Ig above .6 indicate a noncontrolling population and values below .6 show a population that is practicing marital fertility control.”

– Charles Wetherell (2001), p. 590 of “Another look at Coale’s indices of fertility, If and Ig”. Social Science History, 25:4, pp. 589-608.


In general, we have no idea whether the model holds outside the range of the data (nothing in the data itself can determine this!), let alone know that the model is not restricted to the known range of the variates.

Remember the truism that all models are wrong but some are useful; the trick is to know where they are useful.

Some diagnostic statistics

Leverage values: These are the diagonal elements of the projection or Hat matrix {\bf H}= \left[h_{ij}\right] = {\bf X}\left({\bf X}^T{\bf X} \right)^{-1}{\bf X}^T

Large values of h_{ii} indicate that the ith point {\bf x}_i has potentially large influence on the fit. Note that the h_{ii} lie between 0 and 1 and have average value trace({\bf H})/n =\frac{p}{n}.
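
A small sketch computing {\bf H} directly for the full Swiss fit (Xmat and H below are temporary names; here p = 6 coefficients, including the intercept, and n = 47):

# the hat matrix and its diagonal (the leverages)
Xmat <- model.matrix(fit)
H <- Xmat %*% solve(crossprod(Xmat)) %*% t(Xmat)
all.equal(diag(H), hatvalues(fit))   # same as R's built-in leverages
mean(diag(H))                        # trace(H)/n = p/n = 6/47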

Some prefer plotting the values \frac{h_{ii}} {1-h_{ii}} instead of the h_{ii}s.

Standardized residuals: \frac{\widehat{r}_i}{\sqrt{1-h_{ii}}} These all have common standard deviation (and distribution) though they are not independently distributed.

Similarly, one might also plot the (internally) studentized residuals: \frac{\widehat{r}_i}{\widehat{\sigma}\sqrt{1-h_{ii}}} where \widehat{\sigma} = \sqrt{\frac{\widehat{\bf r}^T \widehat{\bf r}}{n-p}}

Externally studentized residuals: \frac{\widehat{r}_i}{\widehat{\sigma}^{-(i)}\sqrt{1-h_{ii}}} where \widehat{\sigma}^{-(i)} is calculated as before but using the n-1 residuals from fitting the same model without the ith point.

Cook’s distance is a measure of the actual influence of the ith point on the fitted values: D_i = \frac{\left( \widehat{\boldsymbol \beta}_{(i)} - \widehat{\boldsymbol \beta} \right)^T {\bf X}^T{\bf X} \left( \widehat{\boldsymbol \beta}_{(i)} - \widehat{\boldsymbol \beta} \right)} {p\widehat{\sigma}^2} Where \widehat{\boldsymbol \beta}_{(i)} is the estimated coefficient vector using all but the ith observation.

This can, with a little work, be re-expressed in terms of the internally studentized residuals t_i = \frac{\widehat{r}_i}{\widehat{\sigma}\sqrt{1-h_{ii}}} defined above as D_i = \frac{1}{p} t_i^2 \frac{h_{ii}} {1-h_{ii}}
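
All of these diagnostics are available directly in R for an lm fit; a small sketch for the full Swiss model:

h <- hatvalues(fit)        # leverages h_ii
rs <- rstandard(fit)       # internally studentized residuals (R calls these "standardized")
rt <- rstudent(fit)        # externally studentized residuals
d <- cooks.distance(fit)   # Cook's distances
plot(h/(1 - h), d,
     pch=19, col="steelblue",
     xlab="h/(1-h)", ylab="Cook's distance")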

Default plots

# The default plot method for an lm fit shows residuals vs fitted values,
# a normal Q-Q plot, a scale-location plot, and residuals vs leverage.
plot(fit)