1 Preamble
- 1.1 The statistical narrative
2 Populations
3 Samples
- 3.1 All possible samples
- 3.2 Selecting samples
4 Inductive inference
5 Prediction

$\renewcommand{\tr}[1]{{#1}^{\mkern-1.5mu\mathsf{T}}}$ $\renewcommand{\ve}[1]{\mathbf{#1}}$ $\renewcommand{\sv}[1]{\boldsymbol{#1}}$ $\renewcommand{\pop}[1]{\mathcal{#1}}$ $\renewcommand{\samp}[1]{\mathcal{#1}}$ $\renewcommand{\imply}{\Longrightarrow}$ $\renewcommand{\leftgiven}{~\left\lvert~}$ $\renewcommand{\given}{~\vert~}$ $\renewcommand{\suchthat}{~:~}$ $\renewcommand{\widebar}[1]{\overline{#1}}$ $\renewcommand{\wig}[1]{\tilde{#1}}$ $\renewcommand{\bigwig}[1]{\widetilde{#1}}$ $\renewcommand{\field}[1]{\mathbb{#1}}$ $\renewcommand{\Reals}{\field{R}}$ $\renewcommand{\abs}[1]{\left\lvert ~{#1} ~\right\rvert}$ $\renewcommand{\size}[1]{\left\lvert {#1} \right\rvert}$ $\renewcommand{\tr}[1]{{#1}^{\mkern-1.5mu\mathsf{T}}}$ $\renewcommand{\norm}[1]{\left|\left|{#1}\right|\right|}$ $\renewcommand{\intersect}{\cap}$ $\renewcommand{\union}{\cup}$ $\renewcommand{\suchthat}{~:~}$ $\renewcommand{\st}{~:~}$

1 Preamble

The subject matter of computational statistics and data analysis is that of Statistics itself, but developed via computation rather than only through mathematics.

The goal of the course is to present essential statistical concepts. Simulation is used to illustrate the concepts and to provide understanding. Mathematical development provides an alternative presentation of the same ideas, when that is possible, and is used to develop a tool or get insight into a concept (N.B. it is not the only, not even the primary, route however).

Because simulation is the primary means to develop this understanding, the statistics/estimators/tests, etc. used should be of sufficient complexity that a complete mathematical treatment would be beyond the level of this course.
Because the statistics/estimators/tests etc. can be complex, several general fitting algorithms are introduced. E.g. Gradient descent, Newton’s method, iteratively reweighted least squares, etc.
Students are expected to program. The language will be R; other languages are not accepted at this point. The idea is to convey programming concepts with the statistical concepts. The purpose of the programming, like mathematic, is used to illustrate the statistical concepts. This means some rather general purpose code is being written taking advantage, for example, of a functional programming language like R.
Where possible, programming constructs should be those which will be seen again in higher level courses. For example, the R functions Map(...) and Reduce(...) are used to iterate or accumulate (respectively) over some list stucture. Such functions are commonplace in functional programming and will be seen a lot by those who go on to “big data” applications involving computing distributed over many machines.
Because the code, like mathematics, is being used to convey the statistical concepts, clarity and simplicity of the code is primary. That is the code is not production code but teaching code. Students might be exposed to production code through accessing built-in R core functions (possibly in assignments), though this is not strictly necessary in this course.
Many important topics (computational and statistical) are covered very lightly. For example, floating point arithmetic is passed over quickly, with some discussion of the potential consequences, just to give exposure to the difference between computational results and mathematical results. Similarly, for example, robust regression methods are treated superficially as just providing another set of equations to be solved by the methods here. The intention is to provide fairly generic concepts and algorithms and have the notes and presentations reflect that. Details of particular estimators, etc. can be explored in assignments as instances of the generic approach. This gives some latitude to tailor/adapt assignments over time.
In contrast to the computational approach, mathematical development will necessarily deal with simpler statistics/estimators/tests. Besides being more tractable, the simplicity of the mathematics should follow on from STAT 231, MATH 237 (multivariable calculus; infinite sequences and series), MATH 235 (linear algebra, matrix decompositions). There is no dependence on STAT 330, 331, 332, 333, 340, for example (although there will be occasional conceptual intersections with these courses, particularly 332 for foundations where the concepts are dealt with more fully).
Throughout, some potential exercise/assignment questions are noted as they would occur in the body of text. Much of the learning in the course is expected to occur through the assignments. These will involve programming in R, simulation, mathematical development/proof, and should both reinforce the classroom material and stretch the students understanding of the fundamental concepts.

1.1 The statistical narrative

There is a narrative in this course which begins with discussions about populations, the units which define them, and variates as measures on the units. All genuine populations are finite and these must be taken from real populations. Some populations that are used throughout the notes are presented as examples.

Summaries of finite populations (i.e. population attributes) will already involve a number of statistical concepts as to what might be interesting and why. How to determine these population attributes will already involve considerable calculation. Note that no probability appears for some time because we have the entire population. It is important that students realize that statistics is not just about probability modelling (though these arise as the narrative progresses). All estimation algorithms, e.g. gradient descent, etc., appear before any probability (even so-called stochastic gradient descent can appear as a pragmatic approach without any mention of probabiilty at this point … exercise).

Once the algorithms, etc. are in place and interesting population attributes described, samples are introduced. All possible samples of some size are first explored and characteristics of various population attributes examined within this context. No probability yet.

Because all possible samples entail a combinatorial explosion for calculation, the notion of selecting a subset of the samples is introduced. Here is where probability is first introduced in the sense of selecting a sample from a specified set of samples (e.g.from all possible samples) with some probability.

The notes now lean heavily on computation and also on the concepts (and some language) from survey sampling (Note that STAT 332 is not a pre-req or co-req so the concepts used are introduced here in the computational context). Again, a number of programs are given that implement and reinforce the statistical sampling concepts. These can be used, especially in conjunction with more complex attributes, to illustrate what amounts to the sampling distribution of any statistic and to compare different attributes, sample sizes, and sampling methods.

The notes pause for a reminder of the inferential path of induction that must be followed in practice. Namely from an attribute determined from values measured on units in a sample, to the corresponding attribute on the study population (which was available for sampling), to the attribute on the target population. This is a really important concept to get across and have students understand (esp. in this age of “big data”).

While considering inductive inference, the opportunity is taken to consider the anatomy of a significance test. This is illustrated through the comparison of two sub-populations of a given population. Permutation tests are used for a variety of discrepancy measures. The problem of multiple testing is raised and a solution given for the case of permutation tests. The section closes with a summary of the concepts from statistical inference that have so far appeared.

Having seen (through simulation) the sampling distribution of statistics based on samples probabilistically drawn from a finite population, the notes now transition to more traditional probability modelling concepts. The focus switches from probability of the samples, and of the units selected to be in a sample, to the probability that a variate takes some value. This allows introduction of the discrete (since the population is finite) distribution function over the population. Selecting units from the population and calculating a statistic (attribute) on that sample is seen to be identical to sampling from the population distribution function (univariate primarily but in genereal multivariate) and working with the empirical distribution. Attributes (statistics) can now be written as functions of the underlying distribution function.

Given only a single sample, the target–study–sample framework of induction can now be used to suggest that the single sample in hand could be used as if it were a study population. This allows the introduction of bootstrap methods.

2 Populations

Note that the approach here is to remain completely non-stochastic, more traditionally a descriptive statistics approach, but nevertheless computational and mathematical.

Introduce a population as being a finite (though possibly huge) set $\pop{P}$. Elements of a population are called units $u \in \pop{P}$, and variates are functions $x(u)$, $y(u)$, etc., on individual units $u \in \pop{P}$, or more simply as $x_u$, $y_u$, etc. when referring to the realized values of these variates for the unit, $u= 1, \ldots, N$.

We define and explore interesting population attributes, denoted generally as $a(\pop{P})$, how they can be calculated, and some of their (non-sampling) characteristics (e.g. interpretation of feature being captured, sensitivity to outlying points, …). All statistics are imagined/discussed as function on the finite population $\pop{P} = \left\{1, \ldots, N \right\}$.

2.1 Examples (real datasets)

Real datasets are used to firmly ground the idea of a population. These should be selected to be topical, show variety of applications, and of course be rich enough to illustrate the concepts. Datasets need to be obvious as populations in the sense of being finite, complete, and containing all that you want to learn about.

We have in place the following sets of data which can be used.

2.1.1 Agricultural census (USA)

US Census of Agriculture: decline in farms from 2007 to 2012

This is an older dataset are taken from the book Sampling: design and analysis by Sharon Lohr (Lohr 2009). The dataset has $N = 3,078$ where each unit is a county (or county equivalent) as defined by the US Census of Agriculture. Variates included are

Variate	Value
`county`	County name
`state`	State abbreviation
`acres92`	Number of acres devoted to farms in 1992
`acres87`	Number of acres devoted to farms in 1987
`acres82`	Number of acres devoted to farms in 1982
`farms92`	Number of farms in 1992
`farms87`	Number of farms in 1987
`farms82`	Number of farms in 1982
`largef92`	Number of farms, with 100 acres or more, in 1992
`largef87`	Number of farms, with 100 acres or more, in 1987
`largef82`	Number of farms, with 100 acres or more, in 1982
`smallf92`	Number of farms, with 9 acres or less, in 1992
`smallf87`	Number of farms, with 9 acres or less, in 1987
`smallf82`	Number of farms, with 9 acres or less, in 1982
`region`	S=South, W=west, NC=north central, and NE=northeast

Note Tons of interesting attributes from averages to regression lines. Also useful for graphical attributes. Lots of opportunities to look at sub-populations.

2.1.2 Facebook posts

Liking Facebook post

This is a subset of data, $N = 500$, collected by (Moro, Rita, and Vala 2016). This study was conducted for a cosmetics company who had a Facebook page and wanted to see the effectiveness of their various postings on that page. Quoting their paper:

“[…] we needed to collect a representative data set of published posts. All the posts published between the 1st of January and the 31th of December of 2014 in the Facebook’s page of a worldwide renowned cosmetic brand were included. As a result, the data set contained a total of 790 posts published. It should be noted that Facebook is the most used social network with an average of 1.28 billion monthly active users in 2014, followed by Youtube with 1 billion and Google+ with 540 million (Insights, 2014).” - (Moro, Rita, and Vala 2016)

The data were downloaded from the University of California (Irvine) “Machine Learning Repository”. The data set found on that site contains only 500 of the 790 posts and a subset of the variates analysed in (Moro, Rita, and Vala 2016). The data uploaded to the course website is a further reduction of the 19 variates available to only 13.

Variate	Value
`share`	the total (lifetime) number of times the post was shared
`like`	the total (lifetime) number of times the post “liked”
`comment`	the total (lifetime) number of comments attached to the post
`All.interactions`	the sum of `share`, `like`, and `comment`
`Page.likes`	the number of “likes” for the facebook page at the original time of the posting
`Impressions`	the total (lifetime) number of times the post has been displayed, whether the post is clicked or not. The same post may be seen by a facebook user several times (e.g. via a page update in their News Feed once, whenever a friend shares it, etc.).
`Impressions.when.page.like`	the total (lifetime) number of times the post has been displayed to someone who has “liked” the page
`Post.Hour`	the hour of the day at the original time of the posting (0-23)
`Post.Weekday`	the day of the week at the original time of the posting (1-7) beginning with Sunday
`Post.Month`	the month of the year at the original time of the posting (1-12)
`Category`	the category of the post (as determined by two separate human reviewers according to the campaign associated with the post), one of `Action` (special offers and contests), `Product` (direct advertisement, explicit brand content), or `Inspiration` (non-explicit brand related content)
`Type`	the type of content of the post, one of `Link`, `Photo`, `Status`, or `Video`
`Paid`	1 if the company paid Facebook for advertising, 0 otherwise

Note: Attributes of interest might include average Impressions depending on Paid or not. Also, for responses like Impressions, Page.likes, etc. transforming the data by square root or logarithms (add one first) will yield more interesting values.

2.1.3 Where’s Waldo?

Waldo appears somewhere in the store.

A small $N=68$ population is defined to the entire collection of “Where’s Waldo?” visual search puzzles taken from an internationally popular children’s book series which appeared from 1987 to 2009. There were 7 primary books in this series. (The books were called “Where’s Waldo” in North America and “Where’s Wally” in the home country of its British creator and illustrator Martin Handford. The character is called by many other names in other countries and languages. )

The character Waldo always wore the same shirt, hat, and pants and would appear somewhere in a picture spread across two pages of a book. The objective was to have the reader find Waldo in a picture like that above.

Determining a strategy for finding Waldo became an occupation of many with at least one self-described “foolproof strategy” being proposed by Ben Blatt in the online publication Slate (Blatt 2013). To provide data, Ben Blatt determined every location of Waldo (in the coordinates of a two page spread) in each of the 68 different appearences found in the seven primary “Where’s Waldo” books. This he did with a tape measure on the physical books!

Randal Olson uses Ben Blatt’s display of Waldo’s locations to produce the numerical coordinates needed to determine his own “optimal search strategy” for finding Waldo. We will now use Olson’s coordinates to illustrate various population attributes.

First the population is the set of all two page spreads, where Waldo is to be found, in the seven primary books by Martin Handford:

Population consists of all two page spreads from all “Where’s Waldo?” books.

which can be denoted as $\pop{P}_{Waldo}$, say. An individual unit in $\pop{P}_{Waldo}$ is any one of the two page spreads. Variates are

Variate	Value
`Book`	Book number (1 - 7) in which picture appears
`Page`	Page number of book
`X`	Waldo’s Horizontal location measured (in inches?)
`Y`	Waldo’s Vertical location measured (in inches?)

Note Possible attributes of interest include density of X values, of Y values, of the pair (X, Y), of a straight line relationship (or smooth) of Y on X. Relation between either of X or Y and even or odd page number?
Note The measurements of X and of Y is in error, for at least one point. (Check sources to find it.)

2.1.4 The Titanic

The great ship

From the help(Titanic) description in R:

“The sinking of the Titanic is a famous event, and new books are still being published about it. Many well-known facts—from the proportions of first-class passengers to the ‘women and children first’ policy, and the fact that that policy was not entirely successful in saving the women and children in the third class—are reflected in the survival rates for various classes of passenger.

These data were originally collected by the British Board of Trade in their investigation of the sinking. Note that there is not complete agreement among primary sources as to the exact numbers on board, rescued, or lost.

Due in particular to the very successful film ‘Titanic’, the last years saw a rise in public interest in the Titanic. Very detailed data about the passengers is now available on the Internet, at sites such as Encyclopedia Titanica.

The data set Titanic provides “information on the fate of passengers on the fatal maiden voyage of the ocean liner ‘Titanic’, summarized according to economic status (class), sex, age and survival.”

The population is the set of all people on board the Titanic’s maiden voyage in 1912. The variates are

Variate	Value
`Class`	`1st`, `2nd`, `3rd`, or `Crew`
`Sex`	`Male`, `Female`
`Age`	`Child`, `Adult`
`Survived`	`No`, `Yes`

2.1.5 Great white shark encounters

Friendly great white shark

Data on known great white shark encounters with humans has been gleaned by Prof. P-J Bergeron of the University of Ottawa from a variety of tables which appeared on the now defunct site `http://sharkattackinfo.com/shark_attack_news_sas.html.

Population is the set of $N=65$ encounters where a person was recorded on this site as having been bitten by a great white shark. The variates are

Variate	Value
`Year`	the year in which the encounter occurred
`Sex`	sex of the victim (`M` = male, `F` = female)
`Age`	age of the victim in years
`Time`	time of the encounter (`AM` or `PM`)
`Australia`	1 if encounter was in Australian waters, 0 if not
`USA`	1 if encounter was in USA waters, 0 if not
`Surfing`	1 if the victim was surfing at the time of the encounter, 0 otherwise (N.B. other unrecorded activities could have been “free diving”, “fishing”, “pearl diving”, etc.)
`Scuba`	1 if the victim was scuba diving at the time of the encounter, 0 othewise (N.B. other unrecorded activities might be “free diving”, “fishing”, “pearl diving”, etc.)
`Fatality`	1 if the victim died after being attacked (though not necessarily directly because of the attack), 0 if they survived
`Injury`	1 if the victim was injured by the encounter, 0 if not
`Length`	the recorded length in inches of the shark thought to have encountered the victim

Sharks have been vilified in the popular imagination. In fact, they are ancient species that are increasingly at risk from needless human activity. The interested student might consult the following links to help provide balance:

White sharks – Outside the cage, involves the same location that Discovery used
Shark Attacks
The visual art of Chris Jordan. Chris Jordan has created numerous works of art to convey large numbers related to important topics. Here the number is 270,000 which equals “the estimated number of sharks of all species killed around the world every day for their fins”. To get some sense of the immensity of this number, see Sharkfins; click on the picture at this link.
The critically acclaimed documentary film “Sharkwater” (trailer).

2.1.6 NASA Earth surface data

Earth Surface image

The data are geographic and atmospheric measures on a very coarse 24 by 24 grid covering Central America. Measurements at each grid are: temperature (surface and air), ozone, air pressure, and cloud cover (low, mid, and high). All variate values are monthly averages, with observations for Jan 1995 to Dec 2000.

The latitude and longitude of each geographic grid are recorded as well.

There are in total $N$ = 41,472 = (6 years * 12 months) * (24 * 24) individual locations in space and time. These constitute the units of the population. There are seven variates measured for each unit.

Variate	Value
`lat`	latitude: 24 evenly spaced values from latitude 36.2N to 21.2S
`long`	longitude: 24 evenly spaced values from longitude 113.8W to 56.2W
`year`	1995, 1996, 1997, 1998, 1999, 2000
`month`	1, 2, …, 12 (January to December)
`ozone`	monthly mean amount of total ozone in the atmospheric column (in dobson units)
`cloudlow`	monthly average low altitude cloud cover (percent of the sky covered by clouds roughly less than 3.24 km high)
`cloudmid`	monthly average mid altitude cloud cover (percent of the sky covered by clouds roughly 3.24 to 6.5 km high)
`cloudhigh`	monthly average high altitude cloud cover (percent of the sky covered by clouds roughly greater than 6.5 km high)
`pressure`	monthly mean atmospheric surface pressure at a given location on the Earth’s surface. (in millibars)
`surftemp`	monthly mean temperature based on the energy being emitted from the Earth’s surface under clear sky conditions (in degrees Kelvin)
`temperature`	monthly mean temperature of the air near the surface of the Earth (in degrees Kelvin).

These data were obtained from the NASA Langley Research Center Atmospheric Sciences Data Center.

The dataset is available as nasa in the R package dplyr

2.1.7 Gadsby

Gadsby: A novel of more than 50,000 words.

This is a novel, available online, containing approximately 50,000 words in 43 chapters. The population could be the collection of unique words in the novel itself ($N < 50,000$), or perhaps the collection of word positions in the book. Variates would include the number of letters of a word, the number of each vowels or consonants in each word, the number of times that word appears, the chapters of the novel in which that word appears, the word at that position, etc. Population attributes are up to your imagination.

Note This data set could be explored by the students in a guided exercise in extracting data from the web: Full text available from WikiSource. Or we could scrape it and allow the students to work on a local copy.
Note The messy nature of the data provides an opportunity for the student to construct the variate values and to better appreciate measurement error.
Note Perhaps the most amazing feature of this data is that nowhere in the text of the novel does the letter e appear! This can be for the students to discover, or to motivate counting the frequency of other vowels in the text.
Note We can introduce a variety of population attributes that are of interest in text analysis.

2.2 Explicitly defined population attributes

For any one of the above populations, begin with simple attributes:

the population total: $a(\pop{P}) = \sum_{u \in \pop{P}} y_u$
the population average: $a(\pop{P}) = \frac{1}{N} \sum_{u \in \pop{P}} y_u$
the population minimum: $a(\pop{P}) = \min_{u \in \pop{P}} ~~y_u$
the population median: $a(\pop{P}) = median_{u \in \pop{P}} ~~y_u$
the population max: $a(\pop{P}) = \max_{u \in \pop{P}} ~~y_u$
the population range: $a(\pop{P}) = \left(\max_{u \in \pop{P}} ~~y_u \right)~~~- ~~~ \left(\min_{u \in \pop{P}} ~~y_u \right)$
various counts over the population: $a(\pop{P}) = \sum_{u \in \pop{P}} I_A (y_u)$ where $I_A(y)$ is the indicator function \[ I_A(y) = \left\{\begin{array}{lcl} 1 &~~~& y \in A \\ &&\\ 0 && y \notin A \end{array} \right. \] for any set $A$. Note that we might also define the counts as indicator functions on the population units or labels as in $I_A(u)$ for $u \in \pop{P}$.

Such simple attributes can give a quick description of a population via the summary(...) given in R. This is highly recommended and can be revealing.

For example, consider the US Census of Agriculture data. First we read the data into R

### read the data from wherever it is stored, e.g. in some directory named Data
directory <- "Data"
dirsep <-"/"
filename <- paste(directory, "agpop_data.csv", sep=dirsep)
agpop <- read.csv(filename, header=TRUE)

Now a number of population attributes are had as summary(agpop)

summary(agpop)

##                county         state         acres92       
##  WASHINGTON COUNTY:  30   TX     : 254   Min.   :    -99  
##  JEFFERSON COUNTY :  25   GA     : 159   1st Qu.:  80903  
##  FRANKLIN COUNTY  :  24   KY     : 120   Median : 191648  
##  JACKSON COUNTY   :  23   MO     : 114   Mean   : 306677  
##  LINCOLN COUNTY   :  23   KS     : 105   3rd Qu.: 366886  
##  MADISON COUNTY   :  19   IL     : 102   Max.   :7229585  
##  (Other)          :2934   (Other):2224                    
##     acres87           acres82           farms92          farms87      
##  Min.   :    -99   Min.   :    -99   Min.   :   0.0   Min.   :   0.0  
##  1st Qu.:  86236   1st Qu.:  96397   1st Qu.: 295.0   1st Qu.: 318.5  
##  Median : 199864   Median : 207292   Median : 521.0   Median : 572.0  
##  Mean   : 313016   Mean   : 320194   Mean   : 625.5   Mean   : 678.3  
##  3rd Qu.: 372224   3rd Qu.: 377065   3rd Qu.: 838.0   3rd Qu.: 921.0  
##  Max.   :7687460   Max.   :7313958   Max.   :7021.0   Max.   :7590.0  
##                                                                       
##     farms82          largef92         largef87         largef82     
##  Min.   :   0.0   Min.   :  0.00   Min.   :  0.00   Min.   :  0.00  
##  1st Qu.: 345.0   1st Qu.:  8.00   1st Qu.:  8.00   1st Qu.:  8.00  
##  Median : 616.0   Median : 30.00   Median : 27.00   Median : 25.00  
##  Mean   : 728.1   Mean   : 56.18   Mean   : 54.86   Mean   : 52.62  
##  3rd Qu.: 991.0   3rd Qu.: 75.00   3rd Qu.: 70.00   3rd Qu.: 65.00  
##  Max.   :7394.0   Max.   :579.00   Max.   :596.00   Max.   :546.00  
##                                                                     
##     smallf92          smallf87          smallf82       region   
##  Min.   :   0.00   Min.   :   0.00   Min.   :   0.00   NC:1054  
##  1st Qu.:  13.00   1st Qu.:  17.00   1st Qu.:  16.00   NE: 220  
##  Median :  29.00   Median :  35.00   Median :  34.00   S :1382  
##  Mean   :  54.09   Mean   :  59.54   Mean   :  60.97   W : 422  
##  3rd Qu.:  59.00   3rd Qu.:  67.00   3rd Qu.:  67.00            
##  Max.   :4298.00   Max.   :3654.00   Max.   :3522.00            
##

As can be seen, the first two variates (county and state) are categorical, so the counts of the first few different values are shown in the summary. So too is the last variate, region, which takes only four different values (NC, NE, S, W) so that the count of each one can appear in the summary. The remaining variates are numeric and so summary produces the average (or Mean) of each together with five numbers giving the minimum and maximum, the first and third quartiles (points for which 25% and 75% of the observations are less than or equal to), and the median.

A quick look at the numerical attributes on the acreage variates (i.e. acres92, acres87, acres82) reveals something curious – the minimum of each is -99 which is a strange value for the number of acres! No acreage should be less than zero. The explanation is that missing data are encoded as -99 in this data set. These should be replaced by NA which is the standard representation for missing data in R.

### which values are missing can be determined with a logical query
missing92 <- agpop[,"acres92"] == -99
### missing92 is a logical vector of the same length 
### as agpop[,"acres92"] containing a TRUE in every
### position where a -99 appeared and FALSE everywhere else.
### The total number of missing values can be had by
### summing (because logical TRUE is treated as 1, and FALSE as 0)
sum(missing92)

## [1] 19

### Alternatively, the `which` function could be used 
### to identify the row numbers
rowNumsMissing <- which(agpop[,"acres92"] == -99)

### The values can be changed to NA by using these locations
### (either rowNumsMissing or missing92) to identify the rows
### and replace the values
agpop[missing92, "acres92"] <- NA

### The same can be done for the other two acreages
agpop[agpop[,"acres87"] == -99, "acres87"] <- NA
agpop[agpop[,"acres82"] == -99, "acres82"] <- NA

The summary of these variates will now reflect the changes.

summary(agpop[,c("acres92", "acres87", "acres82")])

##     acres92           acres87           acres82       
##  Min.   :      0   Min.   :      0   Min.   :      0  
##  1st Qu.:  82446   1st Qu.:  87530   1st Qu.:  97835  
##  Median : 193688   Median : 201728   Median : 209222  
##  Mean   : 308582   Mean   : 315374   Mean   : 321973  
##  3rd Qu.: 368482   3rd Qu.: 374576   3rd Qu.: 379172  
##  Max.   :7229585   Max.   :7687460   Max.   :7313958  
##  NA's   :19        NA's   :23        NA's   :17

Now the number of NAs are shown and the other attribute values calculated without the missing values. The agpop data is now in a more appropriate form.

Note that many programs in R accommodate missing data in NAs and do something appropriate (typically they omit them). For your own code, either you must consider what to do with NAs or ensure that the data do not have any NAs. For example, the function na.omit(...) will remove rows which contain an NA from a data set. For other possibilities see help("na.omit").

Exercise Look at one or more of the above conditional on the value of another binary (or categorical) variate $x_u$. Look at the differences in these for the different values of $x_u$ (e.g. total acreage in 1987 for farms for counties in the southern versus western regions)

2.2.1 Invariance and equivariance of attributes

Often variate values are reported in some unit of measurement: for example a length measurement in metres, millimetres, yards, or miles; a weight measurement in kilograms, grams, or pounds; a temperature measure in degrees Celsius, degrees Kelvin, or degrees Fahrenheit; or a liquid volume in imperial gallons, US gallons, or litres.

As units of measurement change, so too do the values to reflect that change. Sometimes only the scale of measurement is changed:

1 yard = 3 feet; 1 mile = 5280 feet; 1 metre = 1000 mm;
1 imperial gallon = 4.54609 litres; 1 US gallon = 0.832674 imperial gallon;
1 kilogram = 1000 grams = 2.20462 pounds.

Sometimes only the location of the zero for that measurement is changed:

absolute zero is $0^\circ$ Kelvin or $-273^\circ$ Celsius
1 Celsius degree = 1 Kelvin degree (no change in scale of measurement)

Sometimes both location and scale change:

water freezes at $0^\circ$ Celsius = $32^\circ$ Fahrenheit (location change)
1 Celsius degree = 1.8 Fahrenheit degrees (scale change)

Sometimes the change involves more than just a change in location and/or scale of measurement:

fuel economy might be reported in miles per gallon (US or Imperial) or litres per hundred kilometres.
the Richter scale for earthquakes is a logarithmic measure of the amplitude of seismic waves

For any attribute of a population that is a function of measured variates $y_u$ given in some measurement units, it is of interest to understand how that attribute changes (or not) to changes in the units of measurement. For an attribute \[a(\pop{P}) = a(y_1, \ldots, y_N)\] we say that for any $m > 0$ and $b \in \Reals$, that the attribute is

location invariant if $a(y_1 + b, \ldots, y_N + b) = a(y_1, \ldots, y_N)$;
location equivariant if $a(y_1 + b, \ldots, y_N + b) = a(y_1, \ldots, y_N) + b$;
scale invariant if $a(m \times y_1, \ldots, m \times y_N ) = a(y_1, \ldots, y_N)$;
scale equivariant if $a(m \times y_1, \ldots, m \times y_N ) = m \times a(y_1, \ldots, y_N)$;
location-scale invariant if it is both location invariant and scale invariant, i.e. $a(m \times y_1 + b, \ldots, m \times y_N + b ) = a(y_1, \ldots, y_N)$;
location-scale equivariant if it is both location equivariant and scale equivariant, i.e. $a(m \times y_1 + b, \ldots, m \times y_N + b ) = m \times a(y_1, \ldots, y_N) + b$.

For example,

the population average and the population median are both location-scale equivariant;
the population range is location invariant but is scale equivariant;
the ratio of the population average to the population median is scale invariant but is neither location invariant nor location equivariant.

Another invariance/equivariance property of interest for population attributes is replication invariance and replication equivariance. The idea here is that extending the population $\pop{P}$ by $k-1$ duplicates of itself which are added to $\pop{P}$ to produce a population $\pop{P}^k$, say, of size $k\times N$. That is very unit $u \in \pop{P}$ is replicated $k-1$ times, together with every one of its variate values and added to $\pop{P}$ in order to to get $\pop{P}^k$. (Of course, if one considers populations to be sets this means that the labels $u$ must be distinguished between replicates.) The attribute $a(\pop{P})$ is replication invariant whenever \[a(\pop{P}^k) = a(\pop{P})\] and replication equivariant whenever \[a(\pop{P}^k) = k \times a(\pop{P}).\] Replication invariant population attributes include the average, the median, and the inter-quartile range. An example of a replication equivariant attribute is the population total $\sum_{u \in \pop{P}} y_u$ for any variate $y$.

Exercise: Determine the effect of doubling the population on each of \[ a_1(\pop{P}) = \sqrt{ \frac{\sum_{u \in \pop{P}} \left(y_u - \bar{y}\right)^2} {N}} \] and \[ a_2(\pop{P}) = \sqrt{ \frac{\sum_{u \in \pop{P}} \left(y_u - \bar{y} \right)^2} {N-1}}. \] What does this say about the replication invariance/equivariance of $a_1(\pop{P})$ and $a_2(\pop{P})$?

2.2.2 Influence and sensitivity curves

There are other characteristics of any population attribute that are interest. An important one is its sensitivity to the value of the variate on a single unit in the population. Again, suppose the attribute of interest can be written as \[a(\pop{P}) = a (y_1, \ldots, y_N).\] To see how this might be affected by any one unit in the population we might look at the difference in the attribute when the variate value associated with that unit us removed.

For each unit $u$ in the population, the difference \[\Delta(a, u) = a (y_1, \ldots, y_{u-1}, y_u, y_{u+1}, \ldots, y_N) - a (y_1, \ldots, y_{u-1}, y_{u+1}, \ldots, y_N) \] is a measure of the effect, or influence, it has on the value of the attribute. Ideally, no one unit’s value would have far greater influence than any other. One which did would require further investigation as it might be in error, or it might be the most interesting unit in the population.

Exercise: Plot the values $(u, \Delta(a, u))$ for one of the populations and variate for two different attributes like the average and the median.

Some attributes may be less sensitive to the effect if individual values. One way to characterize an attribute is throught its sensitivity. Without loss of generality, we take the unit to be omitted that which is labelled $N$. To examine the effect of the value of this one unit, we define the sensitivity curve of our attribute as \[\begin{array}{rcl} SC (y ~;~ a(\pop{P})) & = & \frac{a (y_1, \ldots, y_{N-1}, y) - a (y_1, \ldots, y_{N-1})}{\frac{1}{N}} \\ &&\\ & = & N \left[~a (y_1, \ldots, y_{N-1}, y) - a (y_1, \ldots, y_{N-1}) ~\right] \end{array} \] Plotted as a function for all possible values of $y$ (i.e. for $y_N$), the sensitivity curve gives a scaled measure of the effect that a single variate value $y$ has on the value of a population attribute $a(\pop{P})$.

Explore the sensitivity curves of any of the above attributes. These are easily determined mathematically in general, but can also be determined computationally for any particular population.

Note: Note that other robustness characteristics such as the breakdown point could be introduced here and/or in assignments.

2.2.3 Graphical attributes

Population attributes can also be entirely graphical as in

histograms of $y_u$ values
bar plots of $y_u$ values
scatterplots of pairs $(x_u, y_u)$

Each of these is itself a summary, and hence an attribute (or statistic) of the whole population

Using the US Agricultural census data:

### read the data from wherever it is stored, e.g. in some directory named Data
directory <- "Data"
dirsep <-"/"
filename <- paste(directory, "agpop_data.csv", sep=dirsep)
agpop <- read.csv(filename, header=TRUE)

We can now construct any plot we want.

hist(agpop$farms87, col=adjustcolor("grey", alpha = 0.5),
     main="Number of farms per county in 1987",
     xlab="number of farms",
     breaks=100
     )

plot(agpop$farms87, agpop$acres87, pch = 19, cex=0.5, 
     col=adjustcolor("black", alpha = 0.3), 
     xlab = "Number of farms", 
     ylab = "Total acreage of farming",  
     main = "US counties 1987")

2.2.4 Power transformations

For any variate $y$, it is sometimes helpful to re-express the values in a non-linear way via a transformation $T(y)$, say, so that on the re-expressed scale attributes are easier to define, to understand, or simply to determine. A commonly used family of re-expressions are the power transformations which are defined for $y > 0$ and indexed by a power $\alpha$. The general form is \[T_\alpha(y) = \left\{ \begin{array}{cc} ay^\alpha + b & ( \alpha \ne 0 ) \\ c \log(y) +d & (\alpha=0) \end{array} \right. \] where $a,b,c,d$ and $\alpha$ are real numbers with $a > 0$ when $\alpha>0$, $a<0$ when $\alpha <0$, and $c>0$ when $\alpha=0$. The choices of $a, b, c$ and $d$ are somewhat arbitrary otherwise. Note that $\alpha = 1$ is at most a location-scale transformation of the values.

These transformations are monotonic, in the sense that $y_u < y_v \iff T_\alpha(y_u) < T_\alpha(y_v)$, that is they preserve the order of the variate values associated with the units $u$ and $v$. What does change, often dramatically, is the relative positions of the variate values.

For example, consider again the US agricultural census and let $y_u$ denote the 1 + farms87. Note that 1 has been added to every count of the number of farms. This is because the power transformations can only be applied to positive values, $y_u > 0 ~\forall ~u \in \pop{P}$. Suppose we apply a power transformation $z_u = T_\alpha(y_u)$ for a couple of values of $\alpha$. Some corresponding histograms are

Effect of power transformation on number of farms in 1987

Note how the histograms have changed shape as $\alpha$ has decreased. The “bump” in the histogram has moved from left to right as $\alpha$ has decreased. Note also that when there is only a single large bump, different values of $\alpha$ also change the symmetry of the histogram about that bump. For example, to have a more symmetric looking histogram one need only change the value of $\alpha$.

If interest lies primarily in changing the symmetry of the histogram, then location and scale could be set at any convenient values since they have no effect on symmetry. For example, for mathematical investigation, a convenient choice is to have the location and scale change with $\alpha$ in the following way: \[ T_\alpha(y) = \frac{y^\alpha -1}{\alpha} ~~~~ \forall ~ \alpha . \] This requires no separate equation for the $x=0$ case, since $lim_{\alpha \rightarrow 0} T_{\alpha}(y) = ln(y)$. (Exercise: Prove this.) The family is now smoothly indexed by the single parameter $\alpha$.

The function which implements it does of course require making a special case for $\alpha=0$:

powerfun <- function(x, alpha) {
  if(sum(x <= 0) > 1) stop("x must be positive")
  if (alpha == 0)
    log(x)
  else
    (x^alpha-1)/alpha
}

To save computation and minimize introduction of calculational errors, the following implementation is preferred:

powerfun <- function(x, alpha) {
  if(sum(x <= 0) > 1) stop("x must be positive")
  if (alpha == 0)
    log(x)
  else if (alpha > 0) {
    x^alpha
  } else -x^alpha
}

Although $\alpha$ can take any real value in principle, in practice one usually restricts the possible values to a small reasonable set. Besides reducing the number of possible powers to consider, the powers can restricted to those which are easily interpretable. John Tukey suggested (Tukey 1977) imagining that the set of powers were arranged in a “ladder” with the smallest powers on the bottom and the largest on the top. They might be arranged as

alpha	ladder
…	up
2	\|
1	<- original values
1/2	\|
1/3	\|
0	\|
-1/3	\|
-1/2	\|
-1	\|
-2	\|
…	down

Now one simply moves “up” or “down” on Tukey’s ladder of powers to arrive at a re-expression that achieves the desired effect on the data values.

Two different, but related, effects are often of interest. First, as above, interest might lie in producing a re-expression that has a more symmetric looking histogram. Second, in the case of two variates $x$ and $y$, we might like to have each variate re-expressed separately so that the all pairs of the values might be well described as being roughly linearly related. That is, imagine (for all $\u \in \pop{P}$) a scatterplot of all pairs $(x_u, y_u)$. Of interest is whether there are powers $\alpha_x$ and $\alpha_y$ for each such that the scatterplot of the re-expressed pairs $(T_{\alpha_x}(x), T_{\alpha_y}(y))$ lie more nearly on a straight line.

Fortunately, for each of these effects there is a corresponding bump rule that indicates the direction (up or down) to move on Tukey’s ladder to achieve it.

2.2.4.1 Bump rule 1: Making histograms more symmetric

The rule is that the location of the “bump” in the histogram (where the points are concentrated) tells you which way to “move” on the ladder. If the bump is on “lower” values, then move the power “lower” on the ladder; if it is on the “higher” values, then move the power “higher” on the ladder.

2.2.4.2 Bump rule 2: Straightening scatterplots

A scatterplot of $(x_u,y_u)$ for $u \in \pop{P}$ may be “straightened” by applying (possibly) different power transforms to each coordinate to give a new (hopefully straighter looking) scatterplot of the re-expressed data $(T_{\alpha_x}(x_u), T_{\alpha_y}(y_u))$.

Because each of the coordinates has its own power transformation there will be two different ladders of transformation – the $x$ ladder and the $y$ ladder – one for each $\alpha$.

Just as with histograms, there is a “bump” in the display that tells you which way to go on the ladder of transformations. In this case, the “bump” corresponds to the curvature appearing in the scatterplot. This curvature is only approximate in practice but, in idealized form, is only one of four different possibilities. The curve “bumps” up and right, down and right, down and left, or up and left. The idealized curves are shown as that part of the circle appearing in each of the four quadrants below together with the corresponding bump rule to straighten the plot.

Bump rule for straightening scatterplots

Exercise: Have students explore power transformations and Tukey’s bump rules for symmetrizing histograms and straightening scatterplots. Using loon and the following code, find a choice of $\alpha_x$ and $\alpha_y$ that symmetrizes the histograms and produces the straightest scatterplot. Identify the mammals which have the largest brain weights for their body weight.

### This requires that the loon package be installed.
### install.package("loon") will install the package from CRAN
### (requires R >= 3.4)
###
library(loon)
###
power <- function(x, y, 
                  linkingGroup="linkingGroup",
                  from=-5, to=5, ...) {
  ## Create histograms
  histX <- l_hist(x, linkingGroup = linkingGroup, 
                  yshows="density")
  histY <- l_hist(y, linkingGroup = linkingGroup, 
                  yshows="density", swapAxes = TRUE
                  )
  ## Now we build an interactive scatterplot
  ## with sliders for power transformations 
  ## on each of x and y
  tt <- tktoplevel()
  tktitle(tt) <- "Power Transformation"
  p <- l_plot(x=x, y=y, parent=tt, 
                  linkingGroup=linkingGroup,
              ...)
  ## Alpha values
  alpha_x <- tclVar('1')
  alpha_y <- tclVar('1')
  ## Sliders to change the alphas
  sx <- tkscale(tt, orient='horizontal',
                variable=alpha_x, 
                from=from, to=to, resolution=0.1)
  sy <- tkscale(tt, orient='vertical',
                variable=alpha_y, 
                from=to, to=from, resolution=0.1)
  ## Laying out the pieces in one window
  tkgrid(sy, row=0, column=0, sticky="ns")
  tkgrid(p, row=0, column=1, sticky="nswe")
  tkgrid(sx, row=1, column=1, sticky="we")
  tkgrid.columnconfigure(tt, 1, weight=1)
  tkgrid.rowconfigure(tt, 0, weight=1)
  
  ## This function redraws the plots with the alphas
  ## from the slider values whenever it is called.
  ## 
  update <- function(...) {
    ### get transformed x and y
    transformedX <- powerfun(x, as.numeric(tclvalue(alpha_x)))
    transformedY <- powerfun(y, as.numeric(tclvalue(alpha_y)))
    
    ## First the scatterplot
    l_configure(p,
                x = transformedX,
                y = transformedY)
    l_scaleto_world(p)
    ## Now the histograms
    l_configure(histX, x = transformedX)
    l_scaleto_world(histX)
    l_configure(histY, x = transformedY)
    l_scaleto_world(histY)
  }
  ## Set the function update to be called
  ## whenever the slider values are changed
  tkconfigure(sx, command=update)    
  tkconfigure(sy, command=update)    
  ## Return the scatterplot if assigned
  invisible(p)    
}

###
### Here's an example using the mammals data set
### from the MASS packages
library(MASS)
data("mammals")
p <- with(mammals, 
          power(body, brain,
                xlabel="body weight",
                ylabel="brain weight",
                title=
                  "Brain and Body Weights for 62 Species of Land Mammals",
                linkingGroup = "mammals",
                itemLabel=rownames(mammals),
                showItemLabels=TRUE)
)

2.2.5 Order and rank statistics

Population attributes can also be an indexed collection of values. For example, consider the following different attributes

the order statistic \[y_{(1)} \le y_{(2)} \le \cdots \le y_{(N)}\] These are the ordered values (including ties) of the variate values $y_u \in \pop{P}$
the rank statistic \[r_{1}, r_{2}, \ldots , r_{N}\] These are the ranks of the variate values $y_1, y_{2}, \ldots, y_{N}$ from the $y_u \in \pop{P}$. Each $r_i$ is the rank of $y_i$ in terms of its magnitude relative to all $N$ values of $y_u$ for $u \in \pop{P}$. For example, if $y_i = y_{(k)}$ then $y_i$ is the $k$th smallest value and so $y_i$ has rank $r_i = k$. In other words, this means that \[ y_{(u)} = y_{r_u} ~~~~ \forall ~~u \in \pop{P} \]

There are functions in R that calculate these statistics. For example,

y <-  c(3, 1, 4, 22, 12)
### The order statistic
yordered <- sort(y)
yordered

## [1]  1  3  4 12 22

# The rank statistic (Note, no ties to worry about)
yrank <- rank(y) 
yrank

## [1] 2 1 3 5 4

### The connection between them
y[yrank] == yordered

## [1] TRUE TRUE TRUE TRUE TRUE

These two attributes are often combined as a single graphical attribute by plotting the pairs $(r_u, y_u)$ (or equivalently $(u, y_{(u)})$ for all $u \in \pop{P}$).

For example, using the agricultural census data we have in R, we could look at the variate acres87

### read the data from wherever it is stored, e.g. in some directory named Data
directory <- "Data"
dirsep <-"/"
filename <- paste(directory, "agpop_data.csv", sep=dirsep)
agpop <- read.csv(filename, header=TRUE)
y <- agpop$acres87[agpop$region == "NE"]
yordered <- sort(y)
yrank <- rank(y, ties.method = "first")  # Now ensure ties appear in data set order

plot(yrank, y, pch = 19, col=adjustcolor("grey", alpha = 0.5), 
     xlab = "County rank by acreage", 
     ylab = "Farming acres in 1987",  
     main = "Counties in the North East USA")

Note

the height of the curve at any point tells the location of the value of $y$
the horizontal location identifies where in the order of the variate values that unit appears
by construction, this plot is monotonically non-decreasing from left to right.
flat spots indicate tied values of $y$; nearly flat spots are counties where the number of acres under farming are nearly the same.
rapidly rising spots are counties which, though near each other in order (rank), are very different in the actual values of $y$ (acreage).
the slope of the curve seems to indicate how spread out the points are there

Exercise: How do ranks change after a power transformation? Explain.

2.2.6 Quantiles

Rather than use rank, it can be more convenient to mark the horizontal axis with the proportion of units in the population having a smaller value of $y$. That is, instead of plotting the pairs $(r_u, y_u)$, we could equivalently plot the pairs $(p_u, y_u)$ where \[p_u = \frac{r_u}{N}\] is the proportion of the units $i \in \pop{P}$ whose value $y_i \le y_u$ (tied values have to be considered; here they will simply be given distinct values of $p_u$ in arbitrary order).

The plot becomes

N <- length(y)
p <- yrank/N
plot(p, y, pch = 19, col=adjustcolor("grey", alpha = 0.5), 
     xlim=c(0,1),
     xlab = "Proportion p", 
     ylab = "Farming acres in 1987",  
     main = "Counties in the North East USA")

and is identical except for the labelling of the horizontal axis.

In this form, the plotted points are denoted \[(p, Q_y(p)) \] for $p \in \{\frac{1}{N}, \frac{2}{N}, \ldots, 1 \}$ and $Q_y(p)$ is the $p$th quantile of $y$: \[Q_y(p) = y_{(N p)}.\] By simply linearly interpolating between these points, the $Q_y(p)$ can be thought of as the quantile function of $y$ for all $p\in[\frac{1}{N},1]$.

The quantile function of $y$ is itself a population attribute which in turn can be used to generate a number of other interesting population attributes. For example, any quantile $Q_y(p)$ for any $p$ locates the variate values in the population, and is called a measure of location. More typically location measures try to capture some more central measure of location. These include:

the median $Q_y(1/2)$
the average of the first and third quartiles $\frac{1}{2}\left(Q_y(1/4) + Q_y(3/4) \right)$
the trimean $\frac{1}{4}\left(Q_y(1/4) + 2\times Q_y(1/2) + Q_y(3/4) \right)$

as well as many others. Principally, reading off the vertical location of $Q_y(p)$ for any pre-determined $p$ provides some location of measure.

Note that even the population average is a function of the quantiles, namely their average as well, but is not easily determined from the plot. Moreover, unlike the above measures of location, the sensitivity curve of the arithmetic average is unbounded, making it very sensitive to outlying observations. (Exercise: determine the sensitivity curve for each of the above measures of location.) The above measures are also simple to construct from the plot (or from the ordered variate values).

The quantile function can also be used to provide some straightforward and natural measures of of the scale or spread of the variate values:

the range $Q_y(1) - Q_y(1/N)$
the inter-quartile range $IQR_y = Q_y(3/4) - Q_y(1/4)$
the central 100 $\times p$ percent range

Alternatively, any of these measures might be divided by the difference in the corresponding $p$ values. That is, the slope of the line segment joining any two points $(p_1, Q_y(p_1))$ and $(p_2, Q_y(p_2))$ for $p_1 < p_2$ provides a measure of scale. As with measures of location, the sensitivity curves of any of these measures of scale or spread can be determined and plotted.

Exercise: How do quantiles change after a power transformation? Explain.

2.2.6.1 Flat areas correspond to regions of concentration

Flatter regions in a quantile plot indicate areas where the variate values appear to be concentrated. To see just how concentrated the values are a box of fixed height could be drawn just wide enough that it included all of the individuals $u \in \pop{P}$ whose $y_u$ are within the box. For example, the following function will draw such a box having bottom left and top right corners given by coordinates $(x_i, y_i)$ for $i = 1,2$, respectively.

# Here's an R function that draws a single
# box between the pair of points (x[1],y[1]) and (x[2],y[2])
# 
drawbox <- function(x,y, ...) {
  rect(xleft = x[1], ybottom = y[1], xright = x[2], ytop = y[2], ...)
}

Using this function, we can add a box at any position on the quantile plot. For example,

### Quantiles:
qvals <- sort(y)
pvals <- ppoints(length(qvals))
plot(pvals, qvals, pch = 19, col=adjustcolor("grey", alpha = 0.5), 
     xlim=c(0,1),
     xlab = "Proportion p", 
     ylab = "Quantiles Q_y(p)",  
     main = "1987 farming acreage for north east counties")


# Need some boundaries for the qvals range
qrange <- extendrange(qvals)
bins <- seq(qrange[1], qrange[2], length.out=15)
col <- adjustcolor("steelblue", 0.2)
border <- adjustcolor("black", 0.7)

# Draw one 
i <- 1
drawbox(c(min(pvals),
          pvals[sum(qvals <= bins[i+1])]),
        bins[i:(i+1)], 
        lty=1,
        lwd=2,
        col= col, border = border)

Concentration box on quantile plot

The width of the box is proportional to the number of points between its bottom and its top. The greater the width, the greater the concentration of points between those two $y$ values.

We can produce all such boxes, one above the other on the vertical axis, to see how the concentration changes with $p$.

plot(pvals, qvals, pch = 19, col=adjustcolor("grey", alpha = 0.5), 
     xlim=c(0,1),
     xlab = "Proportion p", 
     ylab = "Quantiles Q_y(p)",  
     main = "1987 farming acreage for north east counties")

# Draw first one 
i <- 1
drawbox(c(min(pvals),
          pvals[sum(qvals <= bins[i+1])]),
        bins[i:(i+1)], 
        lty=1,
        lwd=2,
        col= col, border = border)

# Now the rest
for (i in 2:length(bins)) {
  biny <- c(sum(qvals <= bins[i]), 
            sum(qvals <= bins[i+1]))
  drawbox(pvals[biny],
          bins[c(i, i+1)], 
          lty=1,
          lwd=2,
          col= col, border = border)
}

Contiguous concentration boxes on quantile plot

Flat areas on the plot correspond to wide boxes, which in turn correspond to high concentrations of $y$ values. As can be seen from the wider boxes on the left, the $y$ values are concentrated more on low values of acreage than on high values.

Indeed, if all of the boxes are moved so that their left edge is on $p=0$, a familiar graphic appears.

plot(pvals, qvals, pch = 19, col=adjustcolor("grey", alpha = 0.5), 
     xlim=c(0,1),
     xlab = "Proportion p", 
     ylab = "Quantiles Q_y(p)",  
     main = "1987 farming acreage for north east counties")

# Draw first one 
i <- 1
drawbox(c(0,
          pvals[sum(qvals <= bins[i+1])]),
        bins[i:(i+1)], 
        lty=1,
        lwd=2,
        col= col, border = border)

# Now the rest
for (i in 2:length(qvals)) {
  biny <- c(sum(qvals <= bins[i]), 
            sum(qvals <= bins[i+1]))
  drawbox(c(0, diff(abs(pvals[biny]))),
          bins[c(i, i+1)], 
          lty=1,
          lwd=2,
          col= col, border = border)
}

Contiguous concentration boxes on quantile plot

A histogram of the acreage (or any $y$ variate) is formed from the boxes that identify concentrations on the quantile plot. This is an important connection.

Exercise: Draw the plot again as necessary to answer the questions below, except that this time switch the axes so that $p$ is on the vertical axis and $Q_y(p)$ on the horizontal. That is plot $(Q_y(p), p)$.

What population attribute does this graph represent? That is, as a function of $Q_y(p)$, what does the value $p$ describe?
Draw a single box for $i = 3$ where now the width is fixed and the height represents the concentration of points. What does the ratio of the height to the width describe in this plot?
Draw all boxes of concentration as in part (b).
Draw all boxes of concentration as in part (b) but now with their bottom on the horizontal ($p=0$) axis.
Comment on the relation between the histogram in (d) and the function in (a).

Exercise: Repeat the above for the duration time of the “Old faithful geyser” as given by the geyser data from the package MASS. Comment on how what features of the quantile plot correspond to a bimodal histogram of values $y$.

2.2.7 Calculation errors

A commonly used measure of scale of $y$ is the standard deviation ($SD$) which is traditionally defined as the square root of the average squared difference between the variate values and their average value. For our finite population $\pop{P}$ of size $N$ this is \[SD_{\pop{P}}(y) = \sqrt{\frac{\sum_{u \in \pop{P}} \left( y_u - \widebar{y} \right)^2}{N}}.\] Taking the square root ensures that this is a measure of the scale of the variate values $y$ in the same units of measurement as the values $y$. As with the population average, the sensitivity curve of the $SD$ (and hence of $Var$) is unbounded (Exercise: show this)

Note that the $SD$ is a complicated calculation but for some purposes can be more mathematically convenient when squared. This gives the population variance of the variate $y$: \[{Var}_{\pop{P}}(y) = {SD}^2_{\pop{P}}(y).\]
Of course, being expressed in squared units, the variance is less interpretable than the standard deviation.

As a measure of scale, the standard deviation requires much more calculation than say the interquartile range.

Note that all calculations are necessarily inexact since the numbers used are not all real numbers but rather all numbers available in a floating point system. Every real number can be expressed as an infinite sum: \[ \pm \left(\sum_{i=1}^{\infty} d_i \left(\frac{1}{\beta} \right)^i \right) \times \beta^k \] for some integer base $\beta \ge 2$ and suitable integer choices for the common power $k$ and the “digits” $0 \le d_i \le (\beta -1)$. A unique representation is ensured for each real by restricting $d_1 > 0$, which in turn forces the choice of $k$. The number $0$ is uniquely represented as $d_i = 0 ~ \forall ~i$. Our standard base 10 representation has $\beta= 10$; computers more often use base $\beta = 2$.

Numerical calculation does not use real numbers but floating point numbers. Numbers that are represented by a floating point number have the following form: \[ \pm \left(\sum_{i=1}^{m} d_i \left(\frac{1}{\beta} \right)^i \right) \times \beta^k \] a finite sum. The constraints on the $d_i$ are as before except that $k$ is also finite with $k_{min} \le k \le k_{max}$. The floating point number system is the set of all such numbers and is denoted $F(\beta, m, k_{min}, k_{max})$ to emphasize the four arguments which determine the set: the base $\beta$, the precision $m$ (number of terms in the sum), and the lower $k_{min} < 0$ and upper $k_{max} > 0$ limits of the common exponent.

The floating point numbers are clearly a finite, though potentially very large, subset of the reals. This means many real numbers map to the same floating point number. For example, denote by $fl(y)$ that function which maps a real number $y$ to its floating point representation. The function might round the $(m+1)$th digit in $y$ to determine the values $d_1, \ldots, d_m$ in $fl(y)$ or it might simply chop all digits from $m+1$ on from $y$. The mapping $y \mapsto fl(y)$ is a many to one mapping.

In R, these (and other) values can be found as part of the data structure .Machine.

floatSystem <- data.frame(base = .Machine$double.base, 
                          precision = .Machine$double.digits,
                          kmin = .Machine$double.min.exp,
                          kmax = .Machine$double.max.exp)
library(knitr)
kable(floatSystem, caption="Floating point number system in R")

Floating point number system in R
base	precision	kmin	kmax
2	53	-1022	1024

Comment kable(...) from the knitr package produces nice tables for named data frames.

Exercises: How many distinct numbers exist in $F(\beta, m, k_{min}, k_{max})$? What is the largest positive number in the floating point system? (Any number larger than this is said to overflow in the floating point number system.) What is the smallest positive number? What is the smallest number which when added to 1 produces a floating point number larger than 1 (assume chopped $fl$). Determine the values of all positive numbers in the floating point system $F(2,3,-1,2)$. Plot these on a line.

For any $y \in \Reals$ its mapping into $F(\beta, m, k_{min}, k_{max})$ will produce an error. The relative error of this mapping (for $y \ne 0$) is \[ \left| \frac{y - fl(y)}{y} \right| ~\le ~~ \left\{\begin{array}{rcl} \beta^{1-m} & ~~~&\mbox{if the computer chops} \\ &&\\ \frac{1}{2}\beta^{1-m} & ~~~&\mbox{if the computer rounds} \end{array} \right. \] which is independent of the exponent $k$.

Exercise: Prove the above result in the case of the chopped representation. The rounded is proved similarly.

This bound is called the unit rounding error $u$. \[ u = \left\{\begin{array}{rcl} \beta^{1-m} & ~~~&\mbox{for chopped arithmetic} \\ &&\\ \frac{1}{2}\beta^{1-m} & ~~~&\mbox{for rounded arithmetic} \end{array} \right. \]

Arithmetic in a floating point system is not the same as in the reals. For example, there is a smallest number $\epsilon > 0$ in $F(\beta, m, k_{min}, k_{max})$ which when added to $1$ produces a sum that is larger than $1$. That is $1 + \epsilon \ne 1$. In R this is the value of .Machine$double.eps (a similar constant, .Machine$double.neg.eps, exists for subtraction from 1).

epsilon <- .Machine$double.eps
epsilon

## [1] 2.220446e-16

### Adding this to 1 produces a number not equal to 1
1 + epsilon == 1

## [1] FALSE

### Something half is still in the floating point system
halfEpsilon <- epsilon/2
halfEpsilon

## [1] 1.110223e-16

### But adding it to 1 will not change 1
1 + halfEpsilon == 1

## [1] TRUE

This also shows that the order of calculation matters. Different values may be returned depending on the order. Floating point addition is not associative.

(1 + halfEpsilon) + halfEpsilon == 1

## [1] TRUE

1 + (halfEpsilon + halfEpsilon) == 1

## [1] FALSE

Errors can dramatically accumulate over many sums and differences. A little more numerical analysis sheds some light on this. Let $\circ$ denote any of the four basic operations $+$, $-$, $*$, or $/$ and that $w \circ v$ is the exact result for any real $w$ and $v$. Suppose that $x$ and $y$ are both numbers in $F(\beta, m, k_{min}, k_{max})$ and that $fl(x \circ y) \in F(\beta, m, k_{min}, k_{max})$ is the floating point representation of the exact result $x \circ y \in \Reals$. It then follows that \[ \left| \frac{(x \circ y) - fl(x \circ y)}{(x \circ y)} \right| ~\le ~ u\] from which we can write, for $x$ and $y$ in $F(\beta, m, k_{min}, k_{max})$, that \[fl(x \circ y) = (x \circ y)(1 + \delta) \] where $\abs{\delta} < u$.

Note: This relative error is for calculation with numbers $x$ and $y$ in $F(\beta, m, k_{min}, k_{max})$, not for any two real numbers $w$ and $v$ say. That is if $x=fl(w)$ and $y=fl(v)$, then the two relative errors \[\left| \frac{(w \circ v) - fl(x \circ y)}{(w \circ v)} \right| ~~~\mbox{ and } ~~~ \left| \frac{(x \circ y) - fl(x \circ y)}{(x \circ y)} \right| \] are not the same. The result is only giving information about the second of these.

Exercise: Consider real numbers $w=0.93214$ and $v=0.93221$ and assume we are working in $F(10, 4, -5, 5)$. Determine each of the above relative errors. Describe in words what each of the relative errors (and their values) is telling you about the computations.

It is now possible to prove, for $x$, $y$, $z$, in $F(\beta, m, k_{min}, k_{max})$, that \[ \frac{\abs{(x + y + z) - fl(x + y + z)}}{\abs{x+y+z}} \le \frac{(\abs{x} + \abs{y} + \abs{z})(2u +u^2)}{\abs{x+y+z}} \] which suggests that there may be problems in addition when the signs of the numbers summed change. More simply put, subtraction of two same-signed numbers has more relative error than does their addition.

Exercise: Prove this result for chopped arithmetic.

The result with respect to multiplication is \[ \frac{\abs{(x * y * z) - fl(x * y * z)}}{\abs{x * y * z}} \le (2u +u^2) \] Exercise: Prove this result for chopped arithmetic.

A good illustration of the effect of alternating signs is the calculation formula for the standard deviation commonly recommended by some authors: \[ \sum_{u \in \pop{P}} \left( y_u - \widebar{y} \right)^2 = \sum_{u \in \pop{P}} y_u^2 - N \widebar{y}^2 \] or \[ \sum_{u \in \pop{P}} \left( y_u - \widebar{y} \right)^2 = \sum_{u \in \pop{P}} y_u^2 - \frac{1}{N} \left( \sum_{u \in \pop{P}} y_u \right)^2 \] which, while mathematically correct when calculating over the reals, is a bad idea when it comes to calculation over the floating point numbers!

While it has the advantage of only needing one-pass through the data (each sum can be calculated in the same pass), it can give quite wrong (possibly even negative) values for the sum of squared differences.

For example, when both pieces on the right are large (and the left hand side small), the floating point representation of each piece will use most of its space to represent the large parts of that number. The important difference between the two numbers is relegated to occupy the least significant parts of of the floating representation. When the difference between the two large numbers is calculated much has been lost.

Calculation using the left hand side would be a two pass algorithm. In first pass $\widebar{y}$ is calculated; in the second pass this the squared differences $(y_u - \widebar{y})^2$ are calculated and summed. This is often more accurate than the one pass algorithm because the size of every term being squared and summed is smaller. The value can be made more accurate still with careful order in the summation (e.g. sum all small terms first).

A more accurate two-pass algorithm is \[ \sum_{u \in \pop{P}} \left( y_u - \widebar{y} \right)^2 - \frac{1}{N} \left( \sum_{u \in \pop{P}} (y_u - \widebar{y}) \right)^2 \] The first term is the two pass algorithm. The second term is zero mathematically but not necessarily computationally. In practice, the right hand term provides some correction to the floating point error of the first.

The various methods can be implemented in R as follows

# First the built in function in R (which is carefully implemented)
builtIn <- function (y) {
  N <- length(y)
  (N-1) * var(y)
}

twoPass <- function (y) {
  N <- length(y)
  ybar <- Reduce(function(y0, y1) {y0 + y1}, y, init=0) / N
  result <- Reduce(function(y0, y1) {y0 + (y1 - ybar)^2}, y, init=0)
  result
}

onePass <- function (y) {
  N <- length(y)
  result <- Reduce(function(y0, y1) {y0 + c(y1^2, y1)}, y, init=c(0,0))
  result[1] - (result[2]^2 / N)
}

twoPassCorrected <- function (y) {
  N <- length(y)
  ybar <- Reduce(function(y0, y1) {y0 + y1}, y, init=0) / N
  result <- Reduce(function(y0, y1) {y0 + c((y1 - ybar)^2, (y1-ybar))}, 
                   y, init=c(0,0))
  result[1] - (result[2]^2 / N)
}

Comment: Each Reduce(...) call is a single pass through the data. Note that the first argument to Reduce is itself an anonymous function (i.e. unnamed) that is applied to each new value y1 which is combined with the previous value y0. The value from each such function call provides the next value of y0; the argument init provides the very first value. In this way, Reduce reduces the collection of y values to a single value.

We can compare the three methods

set.seed(12345)
y <- runif(100000)
### They all agree for this data
results <- data.frame(builtIn = builtIn(y),
                      onePass = onePass(y),
                      twoPass = twoPass(y),
                      twoPassCorrected = twoPassCorrected(y))
library(knitr)
kable(results, caption = "All methods agree for this data")

All methods agree for this data
builtIn	onePass	twoPass	twoPassCorrected
8313.825	8313.825	8313.825	8313.825

But look at what happens when the numbers get large. Here we simply apply a location scale transformation of the same data. The sum of squares should simply rescale with the squared scale.

scale <- 10^10
location <- 1234123552341341234
ybig <- y * scale  +  location
### 
results <- data.frame(builtIn = builtIn(ybig),
                      onePass = onePass(ybig),
                      twoPass = twoPass(ybig),
                      twoPassCorrected = twoPassCorrected(ybig))
kable(round(results, digits=3), caption = "One pass is negative!!!")

One pass is negative!!!
builtIn	onePass	twoPass	twoPassCorrected
8.313825e+23	-3.617106e+27	8.313825e+23	8.313825e+23

The above should only have rescaled the results by $10^{20}$; the location change should have no effect. Yet the textbook formula now produces a negative sum of squares!

Here we add a much larger location (which should have no effect mathematically on the reals)

scale <- 10^10
location <- 1234123552341341234132434
ybig <- y * scale  +  location
### 
results <- data.frame(builtIn = builtIn(ybig),
                      onePass = onePass(ybig),
                      twoPass = twoPass(ybig),
                      twoPassCorrected = twoPassCorrected(ybig))
kable(round(results, digits=3), 
      caption = "Both the one pass and the uncorrected two pass fail")

Both the one pass and the uncorrected two pass fail
builtIn	onePass	twoPass	twoPassCorrected
8.330865e+23	-1.781804e+41	2.929722e+28	8.321753e+23

Again, the one pass produces an even larger negative value for the sum of squares; the two pass now also fails, but the corrected two pass remains close to the built in result. The two pass corrected could be improved more by carefully arranging the order of summation so that small terms are not lost.

Contary to many textbook recommendations, the one pass “calculation formula” should be avoided! In floating point systems it is important to find algorithms which have small calculational error.

Exercise: Assuming chopped arithmetic, some simple theorems can be proved here which show that summing values with alternating signs is more error prone than same signs. Could also illustrate this programmatically (e.g. via Taylor series expansion for $e^y$).

2.3 Implicitly defined population attributes

In many cases, attributes of a population are defined implicitly, typically as the solution to some equation or set of equations.

2.3.1 The minimum of a function

For example we are interested in a (possibly vector-valued) attribute $\sv{\theta}$ which minimizes some function of $\rho(\cdots)$ of the variates in the population. That is, we want the value $\widehat{\sv{\theta}}$ which satisfies \[ \widehat{\sv{\theta}} = \arg\min_{\sv{\theta} ~ \in ~ \sv{\Theta}}\rho(\sv{\theta}; {\pop{P}}) \] where the possible values of $~\sv{\theta}$ may be constrained to be in some set $\sv{\Theta}$ and $\arg\min$ means the value of the argument $\sv{\theta}$ which minimizes the function $\rho$.

Note that maximizing a function is the same as minimizing its negation, so that we need only consider minimization here.

The most common choice for $\rho$ is a sum of functions $\rho$ evaluated at each unit $u \in \pop{P}$. That is

\[ \widehat{\sv{\theta}} = \arg\min_{\sv{\theta} ~ \in ~ \sv{\Theta}}\sum_{u ~\in ~\pop{P}} \rho(\sv{\theta}; u). \]

Familiar examples for a scalar valued attribute $\theta \in \Reals$ include:

Least-squares: \[\rho({\theta}; u) = (y_u - {\theta})^2 \] yields the average $\widehat{\theta} = \widebar{y}$ as the solution to \[ \widehat{\theta} = \arg\min_{\theta ~ \in ~ \Reals}\sum_{u ~\in ~\pop{P}} (y_u - {\theta})^2 . \]
Weighted least-squares: \[\rho({\theta}; u) = w_u ~ (y_u - {\theta})^2 \] yields the weighted average \[ \widehat{\theta} = \frac{\sum_{u \in \pop{P}} w_u y_u}{ \sum_{u \in \pop{P}} w_u}\] as the solution to \[ \widehat{\theta} = \arg\min_{\theta ~ \in ~ \Reals}\sum_{u ~\in ~\pop{P}} w_u ~ (y_u - {\theta})^2 . \] Least-squares is weighted least squares with equal weights.
Least absolute deviations: \[\rho({\theta}; u) = \abs{y_u - {\theta}}\] yields the median $\widehat{\theta} = Q_y (1/2)$ as the solution to \[ \widehat{\theta} = \arg\min_{\theta ~ \in ~ \Reals}\sum_{u ~\in ~\pop{P}} \abs{y_u - {\theta} }. \]

A familiar vector valued attribute $\sv{\theta} \in \Reals^2$ is the pair of coefficients of a line $\sv{\theta} = \tr{(\alpha, \beta)}$ being fitted to two variate values $y$ and $x$ as in \[y_u = \alpha + \beta x_u + r_u \] for all $u \in \pop{P}$. Here $r_u = y_u - \alpha - \beta x_u$ is called the residual and is the signed vertical distance between the point $(x_u, y_u)$ and the line given by the values of $\sv{\theta}$. Note that, in terms of interpretability, it is preferable to have chosen the variate $x_u$ such that its value at $x=0$ is somewhere meaningful in the data set (e.g. in the middle). One way to achieve this for any $x_u$ is to write the line as \[y_u = \alpha + \beta ~( x_u - c) + r_u \] for a fixed constant $c$ (e.g. choose $c = \widebar{x}$). Then the coefficient $\alpha$ has some interpretation as the value of the line when $x_u = c$; the choice $c = 0$ can be outside the range of the $x$ values and so might have no meaningful interpretation. Note that whatever the value of $c$ chosen, only the interpretation of the intercept $\alpha$ changes.

When $\rho(\sv{\theta}; u) =r_u^2$, the coefficients are determined as \[ \widehat{\sv{\theta}} = \arg\min_{\sv{\theta} ~ \in ~ \Reals^2}\sum_{u ~\in ~\pop{P}} (y_u - \sv{\theta})^2 \] or equivalently with $\widehat{\sv{\theta}} = \tr{(\widehat{\alpha}, \widehat{\beta})}$ as \[ (\widehat{\alpha}, \widehat{\beta}) = \arg\min_{(\alpha, \beta) ~ \in ~ \Reals^2}\sum_{u ~\in ~\pop{P}} (y_u - \alpha - \beta (x -c))^2 \] the resulting fitted line is called the least-squares line.

Similarly, we could have a weighted least squares line, or a least absolute deviations line. Minimizing a function $\rho(y_u - \alpha - \beta (x -c))$ would produce a fitted line.

2.3.1.1 Gradient descent

For the examples considered above the solution can be determined in closed form. A more general approach is to develop an algorithm which will produce the solution by a sequence of specific steps.

A common approach is to have an iterative procedure which produces a sequence of iterates \[ \widehat{\sv{\theta}}_0,~ \widehat{\sv{\theta}}_1,~ \widehat{\sv{\theta}}_2,~ \ldots,~ \widehat{\sv{\theta}}_i,~ \widehat{\sv{\theta}}_{i+1},~ \ldots \] such that the limit of this sequence as $i \rightarrow \infty$ converges to the solution $\widehat{\sv{\theta}}$. Ideally, each iterate is closer to the solution than is the one before it.

Since the intention is to minimize a function $\rho(\sv{\theta})$ we can imagine this function as a surface in $\Reals^p$ say (for $p$ dimensional $\sv{\theta}$). Any value $\widehat{\sv{\theta}}_i$ is a point on that surface. We can improve on this estimate by moving away from $\widehat{\sv{\theta}}_i$ in a direction which is lower on the surface than is $\rho(\widehat{\sv{\theta}}_i)$. If $\rho(\sv{\theta})$ is a differentiable function of $\sv{\theta}$ then we know that the gradient \[\ve{g}_i = \ve{g}(\widehat{\sv{\theta}}_i) = \left. \frac{\partial~ \rho ({\sv{\theta}})}{\partial~ \sv{\theta}} \right|_{\sv{\theta} = \widehat{\sv{\theta}}_i}. \] Normalizing $\ve{g}_i$ to unit length provides the direction in which $\rho(\sv{\theta})$ rises, or ascends, fastest.
\[ \ve{g}_i \leftarrow \frac{\ve{g}_i} {\norm{\ve{g}_i}} \] The gradient vector $\ve{g}_i$ is now the direction of steepest ascent. Its negation, $-\ve{g}_i$ is therefore in the direction of steepest descent. We could correct the $i$th iterate $\widehat{\sv{\theta}}_i$ by taking downhill step of length $\lambda > 0$ in the direction of $-\ve{g}_i$ to get the next iterate as \[ \widehat{\sv{\theta}}_{i+1} = \widehat{\sv{\theta}}_i - \lambda ~\ve{g}_i. \] We choose the step size $\lambda$ for each $i$ to be that value which minimizes \[ \rho(\widehat{\sv{\theta}}_i - \lambda ~ \ve{g}_i) \] as a function of $\lambda$ (i.e. for fixed $\widehat{\sv{\theta}}_i$). Finding the value of $\lambda$ in this way is called a line search.

Putting it all together gives the gradient descent algorithm for minimizing a differentiable function $\rho(\sv{\theta})$:

Initialize ; $i \leftarrow 0$; determine an initial estimate $\widehat{\sv{\theta}}_0$ (how, depends on problem)
LOOP:
1. Gradient: \[\ve{g}_i = \left. \frac{\partial ~\rho(\sv{\theta})}{\partial \sv{\theta}} \right|_{\sv{\theta} = \widehat{\sv{\theta}}_i} \]
2. Gradient direction: \[\ve{g}_i \leftarrow \frac{\ve{g}_i}{\norm{\ve{g}_i}} \]
3. Line search: Find the value $\widehat{\lambda}$ \[ \widehat{\lambda} = \arg\min_{\lambda > 0} \rho(\widehat{\sv{\theta}}_i - \lambda ~ \ve{g}_i) \]
4. Update the iterate: \[ \widehat{\sv{\theta}}_{i+1} \leftarrow \widehat{\sv{\theta}}_i - \lambda ~\ve{g}_i \]
5. Converged?
  
  if the iterates are not changing, then return
  
  else $i \leftarrow i + 1$ and repeat LOOP.
Return: $~\widehat{\sv{\theta}} = \widehat{\sv{\theta}}_{i}$

This simplifies somewhat depending on $\rho(\cdot)$.

Because R is a functional programming language, we can easily implement gradient descent function minimization as a simple function which takes functions as arguments. This allows the structure of the general algorithm to be seen in the code itself.

For example, here is a simple implementation:

gradientDescent <- function(theta = 0, 
                            
                            rhoFn, gradientFn, 
                            lineSearchFn, testConvergenceFn,
                            
                            maxIterations = 100,   # maximum number of iterations 
                                                   # in gradient descent loop
                                                   # 
                            tolerance = 1E-6,      # parameters for the test
                            relative = FALSE,      # for convergence function
                                                   #
                            lambdaStepsize = 0.01, # parameters for the line search
                            lambdaMax = 0.5        # to determine lambda
                            ) {
  ## Initialize
  converged <- FALSE
  i <- 0
  ## LOOP
  while (!converged & i <= maxIterations) {
    ## gradient
    g <- gradientFn(theta)
    ## gradient direction
    glength <-  sqrt(sum(g^2))
    if (glength > 0) g <- g /glength
    
    ## line search for lambda
    lambda <- lineSearchFn(theta, rhoFn, g, 
                           lambdaStepsize = lambdaStepsize, 
                           lambdaMax = lambdaMax)
    ## Update theta
    thetaNew <- theta - lambda * g
    ##
    ## Check convergence
    converged <- testConvergenceFn(thetaNew, theta,
                                   tolerance = tolerance,
                                   relative = relative)
    ## Update
    theta <- thetaNew
    i <- i + 1
  }
  ## Return last value and whether converged or not
  list(theta = theta, 
       converged = converged, 
       iteration = i,
       fnValue = rhoFn(theta)
       )
}

Comment If this were production code, then more error checking on the suitability of the incoming arguments would be required. We are avoiding this here to better illustrate the algorithm.

Note that the gradient descent algorithm is specified above even though none of the functions it requires exist! These will need to be implemented for each particular function $\rho(\cdots)$ we wish to minimize.

Both the lineSearchFn(...) and the testConvergence(...) functions can be given fairly general implementations. For example, we

### line searching could be done as a simple grid search
gridLineSearch <- function(theta, rhoFn, g, 
                       lambdaStepsize = 0.01, 
                       lambdaMax = 1) {
  ## grid of lambda values to search
  lambdas <- seq(from = 0, 
                 by = lambdaStepsize,
                 to = lambdaMax)
  
  ## line search
  rhoVals <- Map(function(lambda) {rhoFn(theta - lambda * g)},
                 lambdas)
  ## Return the lambda that gave the minimum
  lambdas[which.min(rhoVals)]
}

### Where testCovergence might be (relative or absolute)
testConvergence <- function(thetaNew, thetaOld, tolerance = 1E-10, relative=FALSE) {
   sum(abs(thetaNew - thetaOld)) < if (relative) tolerance * sum(abs(thetaOld)) else tolerance
}

Suppose for example that we were finding that $\theta$ which minimized \[\rho(\theta) = 2 \theta^2 -5 \theta +3 .\] Then \[ g = 4 \theta - 5\] and we need only write the corresponding R functions

rho <- function(theta) { 2 * theta^2 - 5 * theta +3}
g <- function(theta) {4 * theta - 5}

We find the value $\widehat{\theta}$ as

gradientDescent(rhoFn = rho, gradientFn = g, 
                lineSearchFn = gridLineSearch,
                testConvergenceFn = testConvergence)

## $theta
## [1] 1.25
## 
## $converged
## [1] TRUE
## 
## $iteration
## [1] 4
## 
## $fnValue
## [1] -0.125

As written, the gradientDescent(...) function is fairly general and is not confined, for example, to scalar values attributes $\theta$.

For example, a gradient descent algorithm to determine the least-squares fitted line has $\sv{\theta} = \tr{(\alpha, \beta)}$ and using $c=\widebar{x}$ we have \[ \rho(\alpha, \beta) = \sum_{u \in \pop{P}} (y_u - \alpha - \beta (x_u - \widebar{x}))^2 \] so that \[ \begin{array}{rcl} \ve{g}_i &=& \left(\begin{array}{c}\sum_{u \in \pop{P}}-2 (y_u -\alpha - \beta (x_u - \widebar{x})) \\ \\ \sum_{u \in \pop{P}} -2(y_u -\alpha - \beta (x_u - \widebar{x})) (x_u - \widebar{x}) \end{array} \right)_{\alpha = \widehat{\alpha}_i ~;~ \beta = \widehat{\beta}_i} \\ &&\\ &&\\ &=& -2 \left( \begin{array}{c} N~(\widebar{y} - \widehat{\alpha}_i ) \\ \\ \sum_{u \in \pop{P}} (x_u - \widebar{x}) ~y_u ~-~ \widehat{\beta}_i \sum_{u \in \pop{P}} (x_u - \widebar{x})^2 \end{array} \right). \end{array} \]

For a least-squares line, the $\rho$ function depends on the values of the variates $x$ and $y$. For illustration, we can take $\pop{P}$ to the set of Where's Waldo? pages and $x$ and $y$ to be the horizontal and vertical locations of Waldo.

waldofile <- paste(directory, "WheresWaldo", "wheres-waldo-locations.csv", sep=dirsep)
waldo <- read.csv(waldofile)

For this problem, we need only write the corresponding R functions rho and gradient.

rho <- function(theta) { 
  alpha <- theta[1]
  beta <- theta[2]
  ## Note that we are accessing waldo from the globalEnv
  sum( (waldo$Y - alpha - beta * (waldo$X - mean(waldo$X)) )^2 )
}
 
gradient <- function(theta) {
  alpha <- theta[1]
  beta <- theta[2]
  ## Note that we are accessing waldo from the globalEnv
  x <- waldo$X
  y <- waldo$Y
  N <- length(x)
  xbar <- mean(x)
  ybar <- mean(y)
  g <-  -2 * c(N * (ybar - alpha),
               sum((x - xbar) * y) - beta * sum((x - xbar)^2))
  # Return g
  g
}

Comment: Note that in the above global variables like waldo$X and waldo$Y appear inside the functions rho(...) and gradient(...). This will work provided that waldo is available in the global environment when rho and gradient are called. This may not be reliable and so should be avoided in general.

An obvious solution, and one which generally works, is to pass the needed variables (here x and y) to the function as arguments. Unfortunately, for our purposes, this would clutter up the code somewhat and necessitate rewriting the nice and very general gradient descent function. A better way to proceed, which still keeps clutter to a minimum, is to write functions which will return the appropriate functions.

Functions containing their own data environment are called closures. Every function has a local environment where variables may be defined; this is the closure of the function. Functions also have access to the environment in which they were created (that’s why functions can access values in the global environment). We use this property to have one function define another function within itself and return it. The interior function is enclosed within the function that created it, hence the word closure. The returned function (or closure) has access to any variables defined within the enclosing function that defined it. Encapsulation of data within a function is an important and powerful construct.

For example, we can write a function that will take x and y and return an appropriate rho (or gradient) function as follows:

createLeastSquaresRho <- function(x,y) {
  ## local variable
  xbar <- mean(x)
  ## Return this function
  function(theta) { 
    alpha <- theta[1]
    beta <- theta[2]
    sum( (y - alpha - beta * (x - xbar) )^2 )
  }
}

### We now get the rho function for the waldo data
rho <- createLeastSquaresRho(waldo$X, waldo$Y)

### Similarly for the gradient function
createLeastSquaresGradient <- function(x,y) {
  ## local variables
  xbar <- mean(x)
  ybar <- mean(y)
  N <- length(x)
  function(theta) {
    alpha <- theta[1]
    beta <- theta[2]
    -2 * c(N * (ybar - alpha),
           sum((x - xbar) * y) - beta * sum((x - xbar)^2)
           )
  }
}
gradient <- createLeastSquaresGradient(waldo$X, waldo$Y)

These functions now have access to their own data as the values of x and y. The creator functions allow us to simply implement gradient descent for least squares for any x and y.

The functions rho and gradient can now be passed as arguments to gradientDescent(...) to find the value $\widehat{\theta}$.

result <- gradientDescent(theta = c(0,0), 
                          rhoFn = rho, gradientFn = gradient,
                          lineSearchFn = gridLineSearch,
                          testConvergenceFn = testConvergence)
### Print the results
Map(function(x){if (is.numeric(x)) round(x,3) else x}, result)

## $theta
## [1] 3.857 0.141
## 
## $converged
## [1] TRUE
## 
## $iteration
## [1] 29
## 
## $fnValue
## [1] 233.774

And we can plot the resulting line

The line might be used, for example, to guide the eye across any two page spread in a Where’s Waldo? book to help find him. That said, a more complex summary than a simple line might be better.

Note that, we could have used a lineSearchFn(...) that was tailored to $\rho(...)$. After all, in step 2(c) of the gradient descent algorithm, the function now being minimized is a function of a scalar parameter $\lambda$ (no matter what the dimensionality of $\theta$). So any suitable method for minimizing a real-valued scalar function could be used. For the least squares $\rho(...)$ function, we can determine the solution $\widehat{\lambda}$ to be \[ \widehat{\lambda} = \frac{g_{i~2}(\widehat{\beta}_i S_{xx} - S_{xy}) -g_{i~1}N(\widebar{y} - \widehat{\alpha}_i)} {g_{i~2}^2 S_{xx} + N g_{i~1}^2}\] where $\ve{g}_i = \tr{(g_{i~1}, g_{i~2})}$, $S_{xx} = \sum_{u \in \pop{P}} (x_u - \widebar{x})^2$, and $S_{xy} = \sum_{u \in \pop{P}} (x_u - \widebar{x}) y_u$.

This can be implemented as

createExactLambdaFn <- function(x,y) {
  ## local variables
  N <- length(x)
  xbar <- mean(x)
  ybar <- mean(y)
  Sxx <- sum((x - xbar)^2)
  Sxy<- sum((x - xbar)*y)
  ## Return this function
  function(theta, rhoFn, g, ...){
    ## ignore remaining arguments in ...
    g1 <- g[1]
    g2 <- g[2]
    alpha <- theta[1]
    beta <- theta[2]
    lambda <- (g2 * (beta * Sxx - Sxy) - g1 * N * (ybar - alpha)) / (g2^2 * Sxx + N * g1^2)
    lambda
  }
}
exactLambda <- createExactLambdaFn(waldo$X, waldo$Y)

Using this as the lineSearchFn(...) gives

result2 <- gradientDescent(theta = c(0,0), 
                          rhoFn = rho, gradientFn = gradient,
                          lineSearchFn = exactLambda,
                          testConvergenceFn = testConvergence)

Map(function(x){if (is.numeric(x)) round(x,3) else x}, result2)

## $theta
## [1] 3.875 0.143
## 
## $converged
## [1] TRUE
## 
## $iteration
## [1] 66
## 
## $fnValue
## [1] 233.748

Surprisingly, for this problem the simple gridLineSearch(...) took fewer iterations than did the exactLambda(...) function. The latter did however produce a more accurate estimate, in the sense of smaller $\rho(\widehat{\theta})$ (see fnValue).

When $\rho(\cdot)$ is a sum over $u \in\pop{P}$, that is when \[ \rho(\sv{\theta}, \pop{P}) = \sum_{u \in \pop{P}} \rho(\sv{\theta}; u) \] then the gradient descent algorithm can be tailored in a way that can be usefully extended in important (big data) applications.

Putting it all together gives the gradient descent algorithm for minimizing a differentiable function $\rho(\sv{\theta})$:

Initialize ; $i \leftarrow 0$; determine an initial estimate $\widehat{\sv{\theta}}_0$ (how, depends on problem)
LOOP:
1. Gradient: \[\ve{g}_i = \left. \frac{\partial ~\sum_{u \in \pop{P}}\rho(\sv{\theta};u)}{\partial \sv{\theta}} \right|_{\sv{\theta} = \widehat{\sv{\theta}}_i} = \sum_{u \in \pop{P}} \left. \frac{\partial ~\rho(\sv{\theta};u)}{\partial \sv{\theta}} \right|_{\sv{\theta} = \widehat{\sv{\theta}}_i} = \sum_{u \in \pop{P}} \ve{g}_i(u), ~~\text{say}\]
2. Gradient directions: \[\ve{g}_i(u) \leftarrow \frac{\ve{g}_i(u)}{\norm{\ve{g}_i}} \] \[\ve{g}_i \leftarrow \frac{1}{N} \sum_{u \in \pop{P}} \ve{g}_i(u)\] Note that each component is a weighted direction (their sum is a direction).
3. Line search: Find the value $\widehat{\lambda}$ \[ \widehat{\lambda} = \arg\min_{\lambda > 0} \rho(\widehat{\sv{\theta}}_i - \lambda ~ \ve{g}_i) \]
4. Update the iterate: \[ \widehat{\sv{\theta}}_{i+1} \leftarrow \widehat{\sv{\theta}}_i - ~ \frac{\lambda}{N} \sum_{u \in \pop{P}} \ve{g}_i(u) = \widehat{\sv{\theta}}_i - ~ \lambda^\star \sum_{u \in \pop{P}} \ve{g}_i(u) , ~~\text{say}\]
5. Converged?
  
  if the iterates are not changing, then return
  
  else $i \leftarrow i + 1$ and repeat LOOP.
Return: $~\widehat{\sv{\theta}} = \widehat{\sv{\theta}}_{i}$

Each of steps 2(a), (b), and (d) break down into sums of individual components. This can be very handy when $N$ is large. As written above, this is sometimes called the batch gradient descent. Typically, the value of $\lambda$ is fixed as a “learning rate” and the line search step 2(c) omitted as well as the normalization of step 2(b).

This form of the algorithm lends itself to interchanging the order of computation between units. That is step 2(d) could be broken into a loop of updates. Loop over $u \in \pop{P}$ in any order and update the iterate for every unit $u$. That is beginning with \[\widehat{\sv{\theta}} \leftarrow \widehat{\sv{\theta}}_i \] and loop over $u \in \pop{P}$ updating $\widehat{\sv{\theta}}$ as \[\widehat{\sv{\theta}} \leftarrow \widehat{\sv{\theta}} - \lambda^\star \ve{g}_i(u) \] finally setting \[\widehat{\sv{\theta}}_{i+1} \leftarrow \widehat{\sv{\theta}}.\] This is called stochastic gradient descent, especially when the order of $u \in \pop{P}$ is randomized at each iteration. This often naturally comes up when $N$ is so large that calculations have been farmed out to (say) $N$ machines over a “cloud” of computers. Some of these computers will finish before others and in an order that may change (beyond our control) from iteration to interation. By using stochastic gradient descent, we update from each computer as it finishes its calculation without waiting for all computers (or units) to finish.

2.3.2 Solving a system of equations

A population attribute might also be defined implicitly as the solution $\sv{\theta} \in \sv{\Theta}$ that satisfies a system of equations \[\sv{\psi}(\sv{\theta}; \pop{P}) = \ve{0}\] where there are as many independent equations as there are unknowns $p$, say. That is the dimension of the vector valued function $\sv{\psi}(\cdots)$ is the same, $p$, as the dimensionality of the vector $\sv{\theta} \in \sv{\Theta}$.

As with the $\rho(\cdot)$ which we previously considered minimizing, a common choice for $\sv{\psi}(\sv{\theta}; \pop{P})$ is a sum over the population $\pop{P}$, namely find $\widehat{\sv{\theta}}$ to be that $\sv{\theta} ~\in\ \sv{\Theta}$ that solves \[\sum_{u\in \pop{P}}\psi(\sv{\theta}; u) = \ve{0}.\]

There are numerous ways in which this solution might be found. Many fall under the category of root finding methods.

2.3.2.1 Newton-Raphson

If, for example, $\sv{\psi}(\sv{\theta}; \pop{P})$ is differentiable in $\sv{\theta}$, then we can use a first order Taylor series approximation to construct an iterative method to find a root.

First, let \[ \sv{\psi}^\prime(\sv{\theta}_i; \pop{P}) = \left. \frac{\partial ~\sv{\psi}(\sv{\theta}; \pop{P}) }{\partial \sv{\theta}} \right|_{\sv{\theta} = \sv{\theta}_i} \] be the $p \times p$ matrix of partial derivatives.

Then, a first order approximation can be written as \[\sv{\psi}(\sv{\theta}_i + \sv{\Delta}; \pop{P}) \approx \sv{\psi}(\sv{\theta}_i; \pop{P}) + \sv{\psi}^\prime(\sv{\theta}_i; \pop{P}) \times \sv{\Delta}. \] Now assume that the value $\sv{\theta}_i$ is our current best guess (or approximation) of the root $\sv{\theta}$. We use this first order approximation as a way to improve the guess by setting $\sv{\Delta} = \sv{\theta} - \sv{\theta}_i$. The approximation becomes \[\sv{\psi}(\sv{\theta}; \pop{P}) \approx \sv{\psi}(\sv{\theta}_i; \pop{P}) + \sv{\psi}^\prime(\sv{\theta}_i; \pop{P}) \times (\sv{\theta} - \sv{\theta}_i) \] the left hand side of which $\sv{\psi}(\sv{\theta}; \pop{P}) = \ve{0}$. Solving for $\sv{\theta}$ gives \[ \sv{\theta} \approx \sv{\theta}_i - \left[ \sv{\psi}^\prime(\sv{\theta}_i; \pop{P}) \right]^{-1}~ \sv{\psi}(\sv{\theta}_i; \pop{P})\] which suggests the iterative Newton-Raphson algorithm for root finding.

The algorithm is as follows:

Initialize ; $i \leftarrow 0$; determine an initial estimate $\widehat{\sv{\theta}}_0$ (how, depends on problem)
LOOP:
1. Update the iterate: \[ \widehat{\sv{\theta}}_{i+1} \leftarrow \widehat{\sv{\theta}}_i - \left[ \sv{\psi}^\prime(\widehat{\sv{\theta}}_i; \pop{P}) \right]^{-1}~ \sv{\psi}(\widehat{\sv{\theta}}_i; \pop{P})\]
2. Converged?
  
  if the iterates are not changing, then return
  
  else $i \leftarrow i + 1$ and repeat LOOP.
Return: $~\widehat{\sv{\theta}} = \widehat{\sv{\theta}}_{i}$

When $p=1$, all vectors above are scalars, the iterative step 2(a) simplifies to \[ \widehat{\theta}_{i+1} \leftarrow \widehat{\theta}_i - \frac{ \psi(\theta_i; \pop{P})}{ \psi^\prime(\theta_i; \pop{P})}\] and the method is known simply as Newton’s method.

For example, here is a simple implementation of Newton’s method:

Newton <- function(theta = 0, 
                   psiFn, psiPrimeFn, 
                   testConvergenceFn = testConvergence,
                   maxIterations = 100,   # maximum number of iterations 
                   tolerance = 1E-6,      # parameters for the test
                   relative = FALSE       # for convergence function
) {
  ## Initialize
  converged <- FALSE
  i <- 0
  ## LOOP
  while (!converged & i <= maxIterations) {
    ## Update theta
    thetaNew <- theta - psiFn(theta)/psiPrimeFn(theta)
    ##
    ## Check convergence
    converged <- testConvergenceFn(thetaNew, theta,
                                   tolerance = tolerance,
                                   relative = relative)
    ## Update iteration
    theta <- thetaNew
    i <- i + 1
  }
  ## Return last value and whether converged or not
  list(theta = theta, 
       converged = converged, 
       iteration = i,
       fnValue = psiFn(theta)
       )
}

The solution to the weighted sum \[ \psi(\theta) = \sum_{u \in \pop{P}} w_u (y_u - \theta) = 0 \] can now be found via Newton’s method by defining the appropriate psi(...) and psiPrime(...). For this function \[\psi^\prime(\theta) = - \sum_{u \in \pop{P}} w_u \]

For example, suppose we look at the facebook data set and the number of likes a posting receives. It might make sense to weight the likes inversely proportional to the number of Impressions the post has. That is the more often it is seen, the lower the weight we give to the likes it receives.

facebookfile <- paste(directory, "FacebookMetrics", 
                      "facebook.csv", sep=dirsep)
facebook <- read.csv(facebookfile)
### Remove all rows with missing data
fb <- na.omit(facebook)

### Here we create both functions at once
createPsiFns <- function(y, wt) {
  psi <- function(theta = 0) {sum(wt * (y -theta))}
  psiPrime <- function(theta = 0) {-sum(wt)}
  list(psi = psi, psiPrime = psiPrime)
}

### Create them for the particular data
psiFns <- createPsiFns(y = fb$like, wt = 1/fb$Impressions)
psi <- psiFns$psi
psiPrime <- psiFns$psiPrime

Use these functions together with Newton’s method

result <- Newton(theta = mean(fb$like), 
                 psiFn = psi, psiPrimeFn = psiPrime)
kable(as.data.frame(result))

theta	converged	iteration	fnValue
78.46845	TRUE	2	0

### The theta of which should be the same as the weighted average
y <- fb$like
wt <- 1/fb$Impressions
kable(data.frame(theta = result$theta, weightedAve = (sum(wt*y) / sum(wt))))

theta	weightedAve
78.46845	78.46845

Exercise Why does this converge in two iterations?

Exercise Try out some other $\psi$ functions for location estimation and have students write the code to implement them. E.g. the bisquare weight function, Andrews’s sine function, Huber’s psi or Hampel’s psi for more complex coding challenges (not everywhere differentiabele $\psi$s).

Exercise Introduce secant method (with bracketing) as an approximate Newton method. Sometimes called a quasi-Newton method.

The more general Newton-Raphson function can be written similarly as

NewtonRaphson <- function(theta, 
                          psiFn, psiPrimeFn, 
                          dim, 
                          testConvergenceFn = testConvergence,
                          maxIterations = 100,   # maximum number of iterations
                          tolerance = 1E-6,      # parameters for the test
                          relative = FALSE       # for convergence function
) {
  if (missing(theta)) {
    ## need to figure out the dimensionality
    if (missing(dim)) {dim <- length(psiFn())}
    theta <- rep(0, dim)
  }
  ## Initialize
  converged <- FALSE
  i <- 0
  ## LOOP
  while (!converged & i <= maxIterations) {
    ## Update theta
    thetaNew <- theta - solve(psiPrimeFn(theta), psiFn(theta))
    ##
    ## Check convergence
    converged <- testConvergenceFn(thetaNew, theta,
                                   tolerance = tolerance,
                                   relative = relative)
    ## Update iteration
    theta <- thetaNew
    i <- i + 1
  }
  ## Return last value and whether converged or not
  list(theta = theta, 
       converged = converged, 
       iteration = i,
       fnValue = psiFn(theta)
       )
}

NewtonRaphson(...) is identical to Newton(...) with the two important exceptions. First, since this is Newton’s method in arbitrary dimensions, we must be given the dimensionality either implicitly from theta or explicitly via dim. If neither of these arguments are given, it is inferred by evaluating the psiFn(...) once with no arguments. The second important difference appears in the line updating theta where the R function solve(...) is used. This follows because \[ \left[ \sv{\psi}^\prime(\widehat{\sv{\theta}}_i; \pop{P}) \right]^{-1}~ \sv{\psi}(\widehat{\sv{\theta}}_i; \pop{P}) \] is the value of $\ve{x}$ that solves the system of equations \[ \left[ \sv{\psi}^\prime(\widehat{\sv{\theta}}_i; \pop{P}) \right]~ \ve{x} = \sv{\psi}(\widehat{\sv{\theta}}_i; \pop{P})\] which is precisely what the function solve(...) does and does so in a numerically reliable way (e.g. forming an $LU$ decomposition and backsolving twice).

Exercise: Could review/introduce a Cholesky decomposition and backsolving to have students implement their own solve function using the functions chol(...), backsolve(...), and forwardsolve(...). This should remind them of row-reduction operations and the construction of matrix inverses from linear algebra. This is a nice way to connect the simple linear system of linear algebra back to the more general $\psi$ problem introduced here. It also reinforces that the methods for solving linear systems that are not-parameterized are fundamentally at the bottom of those which are parameterized.

An example where Newton-Raphson might be used is in fitting a straight line via least-squares. We did this via gradient descent in the previous section. Newton-Raphson will be very similar. To make this slightly different, let’s suppose that we wish to minimize a weighted sum of squares. That is we find weighted least squares estimates of the coefficients $\alpha$ and $\beta$ by minimizing \[ \sum_{u \in \pop{P}} w_u (y_u - \alpha - \beta (x_u - c))^2 \] for weights $w_u \ge 0$ and a constant $c$ of our choice. We will choose $c = \sum_{u \in \pop{P}} w_u x_u /\sum_{u \in \pop{P}} w_u = \widebar{x}_w$ say. It will be convenient to similarly define $\widebar{y}_w = \sum_{u \in \pop{P}} w_u y_u /\sum_{u \in \pop{P}} w_u$.

Differentiating the weighted sum of squares with respect to $\alpha$ and $\beta$ gives \[ \begin{array}{rcl} \sv{\psi}(\alpha, \beta) &=& \left(\begin{array}{c}\sum_{u \in \pop{P}}-2 w_u (y_u -\alpha - \beta (x_u - \widebar{x}_w)) \\ \\ \sum_{u \in \pop{P}} -2 w_u (y_u -\alpha - \beta (x_u - \widebar{x}_w)) (x_u - \widebar{x}_w) \end{array} \right) \\ &&\\ &&\\ &=& -2 \left( \begin{array}{c} (\sum_{u \in \pop{P}}w_u)~(\widebar{y}_w - \alpha ) \\ \\ \sum_{u \in \pop{P}} w_u (x_u - \widebar{x}_w) ~y_u ~-~ \beta \sum_{u \in \pop{P}} w_u (x_u - \widebar{x}_w)^2 \end{array} \right). \end{array} \] To find $\sv{\psi}^\prime(\sv{\theta}) = \sv{\psi}^\prime(\alpha, \beta)$ we differentiate again with respect to each of $\alpha$ and $\beta$ to get the matrix \[ \begin{array}{rcl} \sv{\psi}^\prime(\alpha, \beta) &=& \frac{\partial}{\partial \tr{\sv{\theta}}} \sv{\psi}(\sv{\theta}) \\ &&\\ && \\ &=& \left( \begin{array}{ccc} \frac{\partial}{\partial \alpha} \sv{\psi}(\alpha, \beta) &,& \frac{\partial}{\partial \beta} \sv{\psi}(\alpha, \beta) \end{array} \right) \\ &&\\&&\\ &=& 2 \left( \begin{array}{cc} \sum_{u \in \pop{P}}w_u & 0 \\ & \\ 0 & \sum_{u \in \pop{P}} w_u (x_u - \widebar{x}_w)^2 \end{array} \right) \end{array} \] which, thanks to our choice of $c = \widebar{x}_w$, is a diagonal matrix. A diagonal matrix will simplify calculations but only if the solve(..) function takes advantage of this since our Newton-Raphson function is written to handle more general cases.

We can now use these functions to calculate the weighted least-squares estimate of a line via Newton-Raphson. For illustration, we calculate this for the waldo locations and least-squares only (i.e. $w_u =1 ~ \forall u \in \pop{P}$).

### Here we create both functions for 
### the weighted least squares line
### 
createPsiFnsWLSline <- function(y, x, wt) {
  ## Create a little helper function to be used 
  weightedAve <- function(z, w) {sum(w * z) / sum(w)}
  ## Local variables
  xbarw <- weightedAve(x, wt)
  ybarw <- weightedAve(y, wt)
  sumW <- sum(wt)
  
  ## the functions to be returned
  psi <- function(theta = c(0,0)) {
    alpha <- theta[1]
    beta <- theta[2]
    -2 * c(sumW * (ybarw - alpha),
           sum(wt*(x-xbarw)*y) - beta * sum(wt * (x-xbarw)^2) )
  }
  
  psiPrime <- function(theta = c(0,0)) {
    2 * diag(c(sumW,
               sum(wt*(x - xbarw)^2)))
  }
  
  ## return the functions in a named list
  list(psi = psi, psiPrime = psiPrime)
}

# Create these functions for the particular data
psiFns <- createPsiFnsWLSline(y = waldo$Y, x = waldo$X, 
                              # weights all 1 here
                              wt = rep(1, length(waldo$X)))

psi <- psiFns$psi
psiPrime <- psiFns$psiPrime

Use these functions together with Newton-Raphson

result <- NewtonRaphson(psiFn = psi, psiPrimeFn = psiPrime)
kable(as.data.frame(result))

theta	converged	iteration	fnValue
3.8753064	TRUE	2	0
0.1429041	TRUE	2	0

As can be seen, the estimates agree with those determined by gradient descent.

Exercise Why does this converge in 2 iterations?

Exercise Introduce some general quasi-Newton methods and have students code them up.

2.3.2.2 Iteratively reweighted least-squares.

Different straight lines could be fit to a pair of $x$ and $y$ values by finding a solution to \[\sv{\psi}(\alpha, \beta; x, y, \pop{P}) = \ve{0} \] for different functions $\sv{\psi}(\cdots)$. Alternatively we might minimize \[\rho(\alpha, \beta x;, y, \pop{P}) \] for some function $\rho(\cdots)$ Following the pattern of least squares, the $\rho(\alpha, \beta x;, y, \pop{P})$ to be minimized has the following form \[ \sum_{u \in \pop{P}} \rho(y_u - \alpha - \beta (x_u - c))\] for some function $\rho(...)$. Differentiating this with respect to $\alpha$ and $\beta$ gives the minimum as a solution to \[ \sum_{u \in \pop{P}} \psi(y_u - \alpha - \beta (x_u - c))~\left(\begin{array}{c} 1 \\ \\ x_u - c \end{array}\right) = \ve{0} \] where $\psi(\cdots) = \rho^\prime(\cdots)$. The structure of these equations is more readily apparent if we let $r_u = y_u - \alpha - \beta (x_u - c)$ and $\ve{z}_u = \tr{(1,~ x_u - c)}$. The equation is then \[ \sum_{u \in \pop{P}} \psi(r_u) ~\ve{z}_u = \ve{0}.\] In the case of weighted least-squares, $\rho(r) = w r^2$ and $\psi(r) = 2 w r$ and the equation to be solved becomes \[ \sum_{u \in \pop{P}} w_u r_u ~\ve{z}_u = \ve{0} \] and we have seen how to solve this in a variety of ways. The more general equation can be made to look like the weighted least squares equation as follows: \[ \begin{array}{rcl} \ve{0} &=& \sum_{u \in \pop{P}} \psi(r_u) ~\ve{z}_u \\ &&\\ &=& \sum_{u \in \pop{P}}\left( \frac{\psi(r_u)}{r_u} \right)~ r_u ~\ve{z}_u\\ && \\ &=& \sum_{u \in \pop{P}} w_u r_u ~\ve{z}_u \end{array} \] where the weight is $w_u = {\psi(r_u)}/{r_u}$ (provided $r_u \ne 0$). This suggests another possible algorithm with which we might solve for the unknown parameters. Namely, begin with initial estimates of $\alpha$ and $\beta$. Use these to construct the residuals $r_u$. Construct weights $w_u = \psi(r_u)/r_u$. Find new estimates of $\alpha$ and $\beta$ by solving this weighted least squares problem. Use these to get new $r_u$ values and new $w_u$. Solve the weighted least squares problem to get new $\alpha$ and $\beta$. Repeat until the estimates of $\alpha$ and $\beta$ converge. This algorithm is called iteratively reweighted least squares.

We can make this algorithm a little more general by expressing the problem in terms of the vector parameter $\sv{\theta} = \tr{(\alpha, ~\beta)}$. Then $r_u = y_u - \tr{\ve{z}_u} \sv{\theta}$.

Iteratively reweighted least squares can be expressed as

Initialize ; $i \leftarrow 0$; determine an initial estimate $\widehat{\sv{\theta}}_0$
LOOP:
1. Construct residuals for all $u \in \pop{P}$ \[r_u = y_u - \tr{\ve{z}_u} \sv{\theta}_i \]
2. Construct weights for all $u \in \pop{P}$ \[w_u = \frac{\psi(r_u)}{r_u} \]
3. Solve the weighted least squares problem \[\sum_{u \in \pop{P}} w_u (y_u - \tr{\ve{z}_u} \sv{\theta}) ~\ve{z}_u \] for $\widehat{\sv{\theta}}$
4. Update the iterate: \[ \widehat{\sv{\theta}}_{i+1} \leftarrow \widehat{\sv{\theta}} \]
5. Converged?
  
  if the iterates are not changing, then return
  
  else $i \leftarrow i + 1$ and repeat LOOP.
Return: $~\widehat{\sv{\theta}} = \widehat{\sv{\theta}}_{i}$

All that is needed to make this algorithm work is the ability to solve the appropriate weighted least squares problem. It is completely general for any attribute (not just a straight line) that can be expressed as \[ y_u = \tr{\ve{z}_u} \sv{\theta} + r_u.\] Such attributes are called linear response models which are fitted to the population according to the definition of $\psi(\cdot)$. Here $y_u$ is the response and the model is linear in the unknown parameters of $\sv{\theta}$.

For the straight line model (with choice $c = \widebar{x}_w$, the weighted average of the $x$s), the solution to the weighted least squares problem can be had in closed form. Namely, \[\widehat{\sv{\theta}} = \tr{(\widehat{\alpha}, ~\widehat{\beta})}\] with \[\widehat{\alpha} = \widebar{y}_w ~~\mbox{ and }~~\widehat{\beta} = \frac{\sum_{u \in \pop{P}} w_u (x_u - \widebar{x}_w) y_u} {\sum_{u \in \pop{P}} w_u (x_u - \widebar{x}_w)^2} \] where $\widebar{y}_w$ is the weighted average of the $y$s.

We can implement iteratively weighted least squares for a straight-line model as follows:

irls <- function(y, x, theta, psiFn, 
                 dim = 2, delta = 1E-10,
                 testConvergenceFn = testConvergence,
                 maxIterations = 100,   # maximum number of iterations 
                 tolerance = 1E-6,      # parameters for the test
                 relative = FALSE       # for convergence function
) {
  if (missing(theta)) {theta <- rep(0, dim)}
  ## Initialize
  converged <- FALSE
  i <- 0
  N <- length(y)
  wt <- rep(1,N)
  ## LOOP
  while (!converged & i <= maxIterations) {
    ## get residuals
    resids <- getResids(y, x, wt, theta)
    ## update weights  (should check for zero resids)
    wt <- getWeights(resids, psiFn, delta)
    ## solve the least squares problem
    thetaNew <- getTheta(y, x, wt)
    ##
    ## Check convergence
    converged <- testConvergenceFn(thetaNew, theta,
                                   tolerance = tolerance,
                                   relative = relative)
    ## Update iteration
    theta <- thetaNew
    i <- i + 1
  }
  ## Return last value and whether converged or not
  list(theta = theta, 
       converged = converged, 
       iteration = i
       )
}

It remains to write the functions getResids(...), getWeights(...), and getTheta(...).

getResids <- function(y, x, wt, theta) {
  xbarw <- sum(wt*x)/sum(wt)
  alpha <- theta[1]
  beta <- theta[2]
  ## resids are
  y - alpha - beta * (x - xbarw)
}

getWeights <- function(resids, psiFn, delta = 1E-10) {
  ## for calculating weights,
  ## minimum |residual| will be delta
  smallResids <- abs(resids) <= delta
  ## take care to preserve sign (in case psi not symmetric)
  resids[smallResids] <- delta * sign(resids[smallResids])
  ## calculate and return weights
  psiFn(resids)/resids
}


getTheta <- function(y, x, wt) {
  theta <- numeric(length = 2)
  ybarw <- sum(wt * y)/sum(wt)
  xbarw <- sum(wt * x)/sum(wt)
  theta[1] <- ybarw
  theta[2] <- sum(wt * (x - xbarw) * y) / sum(wt * (x - xbarw)^2)
  ## return theta
  theta
}

We can try these out for the least-squares problem

psi <- function(resid) {resid}

result <- irls(waldo$Y, waldo$X, theta = c(0,0), psiFn = psi)
kable(as.data.frame(result))

theta	converged	iteration
3.8753064	TRUE	2
0.1429041	TRUE	2

Exercise: Explain how zero residuals are handled in the construction of weights and why it might make sense.

Exercise: Try the iteratively reweighted least squares for some other $\psi$ functions.

Exercise: Show how the above iteratively reweighted least squares could be made to work to fit a parabola in $x$.

Exercise: Using the R function lm(...) (and any related functions) show how the above iteratively reweighted least squares can be used for any linear model (and $\psi$).

3 Samples

It may not be possible to calculate an attribute for the population $\pop{P}$. For example, we might not have access to the entire population. Or perhaps the calculation is not feasible because the population is too large or the attribute too complex. Whatever the reason, only a sample $\samp{S}$ of $n << N$ units might be available and the attribute $a(\samp{S})$ calculated only on this sample.

The value $a(\samp{S})$ is an estimate of its population counterpart $a(\pop{P})$. To emphasise this relationship we could write \[a(\samp{S}) = \widehat{a}(\pop{P}) = a(\widehat{\pop{P}})\] as an estimate of $a(\pop{P})$. The second equality emphasises that we are explicitly thinking of $\samp{S}$ as an estimate of $\pop{P}$ to achieve this.

There are two things to note from this simple relationship.

First, any difference between the actual values of the estimate $a(\samp{S})$ and the thing being estimated (the estimand) $a(\samp{S})$ is an error. In this particular case we call this error the sample error. The error will depend both on the actual sample and on the attribute being evaluated. With some abuse of notation we write \[\mbox{sample error } ~~ = ~~ a(\samp{S}) - a(\pop{P}). \] For numerical attributes, this is easily determined mathematically; for graphical attributes it is not precise and meant to be taken notionally (with possibly more precise quantification to be added). In the latter case, it at least suggests an awareness that the concept of sample error must apply here as well.

Second, that the sample error would be zero (or non-existent) when the sample ${\samp S}$ is replaced by the population ${\pop P}$ means that the estimation is in some sense consistent. That is, if we actually had the population itself, then we would get the right answer – the estimation is consistent. More technically, this type of consistency is sometimes called Fisher consistency in the statistical literature, named after the statistical scientist Ronald A. Fisher who in 1922 identified this consistency as an important criterion for estimation.

In every respect ${\samp S}$ could be considered a population itself and might even sensibly be called a “sample population”. Such nomenclature, while arguably legitimate, does unfortunately fly in the face of traditional statistical language and common English usage – it is to be avoided therefore and will not be used here. Nevertheless, treating ${\samp S}$ as its own population we can evaluate any population attribute on the sample in the same way we would for ${\pop P}$.

Some samples will have a small sample error and some will have a large one. To see how large sample errors can be we could look at all possible samples of size $n$.

3.1 All possible samples

Suppose the population $\pop{P}$ was of size $N$ and that the sample $\samp{S}$ was of size $n$. Suppose also that the $\samp{S} \subset \pop{P}$. Then there would be $N \choose n$ different possible samples $\samp{S}$ of size $n$.

To make matters concrete, suppose we consider the population $\pop{P}$ to be all the encounters we have recorded with a great white shark reported from 1999 to 2014 worldwide. There are $N=65$ such encounters in our population. To get some sense of the potential magnitude of the sample error for an attribute $a(\samp{S})$, we might like to compute $\samp{S} \subset \pop{P}$ is of size $n$, then the number of different possible samples there are shown in the table below

Number of samples of size n
n = 5	n=10	n=15	n=20
8259888	179013799328	2.073747e+14	2.83396e+16

Even with a population having as few as $N=65$ units, there are 8,259,888 different samples of size $n=5$ and 179,013,799,328 different samples of size $n=10$, and even more for sample sizes of $n=15$ and $n=20$.

sharkfile <- paste(directory, "Sharks", "sharks.csv", sep=dirsep)
sharks <- read.csv(sharkfile)
kable(head(sharks))

Year	Sex	Age	Time	Australia	USA	Surfing	Fatality	Injury	Length
2014	M	35	AM	1	0	1	0	0	180
2013	M	19	AM	0	0	1	0	1	140
2013	M	74	AM	0	0	0	1	1	144
2013	M	45	AM	0	1	1	0	1	95
2013	M	46	PM	0	0	0	1	1	156
2012	M	24	AM	1	0	1	1	1	196

Even for this small population, generating all possible samples of size $n=5$ can be computationally prohibitive. To reduce the computation, we focus on a sub-population these encounters, just those which occurred in Australian waters (sharks$Australia == 1).

### Units in the large population of all encounters
popSharks <- rownames(sharks)
### get the sub-population that is just those encounters in Australian waters
popSharksAustralia <- popSharks[sharks$Australia == 1]
### the units in the sub-population are
popSharksAustralia

##  [1] "1"  "6"  "7"  "9"  "10" "11" "14" "16" "18" "19" "20" "21" "22" "24"
## [15] "25" "30" "33" "34" "37" "38" "40" "41" "48" "54" "55" "58" "59" "61"

This population contains only $N =$ units. There are now only 98,280 possible samples of size $n=5$ from this population, a still large but much more manageable number.

We can generate the indices of all possible samples of size $n$ from a population of size $N$ in R using the combination function combn(...).

### Get all possible samples of size 5
samples <- combn(popSharksAustralia, 5)
### Number of samples
N_s <- ncol(samples)
### Each column of samples contains the units of a sample
### The first five and last samples are
kable(data.frame(first = samples[,1], second = samples[,2], 
                 third = samples[,3], fourth = samples[,4], 
                 fifth = samples[,5], last = samples[,N_s]),
      caption = "First five and last samples of size 5")

First five and last samples of size 5
first	second	third	fourth	fifth	last
1	1	1	1	1	54
6	6	6	6	6	55
7	7	7	7	7	58
9	9	9	9	9	59
10	11	14	16	18	61

Note again that the samples in the above table show the unit labels $u$ for each sample and that these labels were defined within the larger population of all great white shark encounters (of which there were 65). That is why, although there are only 28 different units in the population of Australian water encounters, we see unit labels as high as 61 and why labels such as 2 through 5 do not appear in any samples (because those encounters were not in Australian waters).

Suppose we are interested in the average length (in inches) of the great white sharks encountering humans in Australian waters. Then the attribute of interest is \[a(\pop{P}_{Australia}) = \frac{1}{28} \sum_{u \in \pop{P}_{Australia}} y_u \] and the sample error for a sample $\samp{S}$ is \[a(S) - a(\pop{P}_{Australia}) = \frac{1}{28} \sum_{u \in \samp{S}} y_u - \frac{1}{28} \sum_{u \in \pop{P}_{Australia}} y_u .\] This can be calculated for all possible samples.

avePop <- mean(sharks[popSharksAustralia, "Length"])
### Because the samples are stored in a matrix,
### use the apply function to apply FUN over its columns 
### (i.e. its second dimension; margin = 2)
### Each column provides the row indices 
### for that sample in the original population
avesSamp <- apply(samples, MARGIN = 2, 
                  FUN = function(s){mean(sharks[s,"Length"])})
sampleErrors <- avesSamp - avePop

The average sample error over all possible samples of size $n$ is \[\mbox{Average sample error}~~= \frac{\sum_{i=1}^{N_s} a(\samp{S}_i)}{N_s} - a(\pop{P}) \] where $N_s$ (= 98,280 here) is the number of possible samples $\samp{S}_i$. For the average shark length the average sample error was actually round(mean(avesSamp) - avePop,5) = 0. At least for this attribute, the sample error is zero on average.

Exercise: Prove that the average sample error must be zero when $a(\pop{P})$ is the arithmetic average.

However, the sample error also ranges from about -77 to about 59 inches. Depending on the sample we selected, the average length of a great white shark reported encountered by someone in Australia’s waters could have been as short as about 79 or as large an average length as 214 inches whereas the population attribute is about 156. This is quite a range of possible average lengths.

More information can be had if we look at the distribution of the sample attributes via a histogram.

hist(avesSamp, col=adjustcolor("grey", alpha = 0.5),
     main="All possible sample attribute values (n = 5)",
     xlab="Average shark length (inches)",
     breaks=25
     )
### Mark the population attribute in red
abline(v=avePop, col="red", lty=3, lwd=2)

All possible samples: great white encounters in Australia

The red dotted line is the value of the attribute on the population. As can be seen, depending on the sample, the corresponding sample attribute could be very close to the population value or very far away. However there are many fewer samples that produce a value far away than there are samples which produce a value nearer the population value. The sample values concentrate near the population value. The other values are distributed nearly symmetrically about the population value. The left side howewver is slightly more spread out than the right hand side; alternatively, the right side is pulled in closer to the middle than is the left. For a fixed absolute sample error, the proportion of samples wih that absolute error or greater is larger if the error is negative than if it is positive. This distribution is slightly “negatively” skewed, also called “left” skewed (a possible mnemonic: in Latin, left is “sinister” or “negative”).

A numerical summary of the sample averages is the five number summary containg the minimum, maximum, and the three quartiles (as defined by boxplot.stats(...)).

fivenum(avesSamp)

## [1]  79.2 142.4 156.8 169.8 214.4

Fully half of the samples will produce an average shark length between the first and third quartiles: between 142.4 and 169.8 inches. This is somewhat reassuring, especially given the sample is of size 5 (which is little more than 1/7 the population size).

3.1.1 Effect of increasing sample size

As the sample size increases, it becomes more difficult for any sample to be so different from the population. And the attribute values will concentrate even more about the population value, matching it exactly when the sample becomes the population as $n \rightarrow N$.

In the Figure below, the average shark length histogram for every sample is shown for varying samples sizes.

All possible samples for different sample sizes

As might be expected, the histograms concentrate more about the population average as the sample size increases, whatever the value of $c$.

This too indicates some kind of consistency for the particular atribute here (viz. the arithmetic average). Namely, if we take any value $c > 0$ and the proportion $p_c$ of samples for which the absolute sample error is larger than $c$, that is for which \[ \abs{a(S) - a(\pop{P}_{Australia})} = \abs{\frac{1}{28} \sum_{u \in \samp{S}} y_u - \frac{1}{28} \sum_{u \in \pop{P}_{Australia}} y_u} > c \] then this proportion appears to decrease as the sample size $n$ increases. Consistency of this sort is not the same as Fisher consistency (which applies only when $n = N$). This consistency seems to capture another desirable characteristic of an attribute (or estimator).

More formally, consider the case where for every sample size $n$, we have the set of all possible samples of size $n$ from a population $\pop{P}$ of size $N < \infty$. For each $n$, let \[\pop{P}_{\samp{S}}(n) = \left\{ \samp{S} \suchthat \samp{S} \subset \pop{P} \mbox{ and } \size{S} =n \right\} \] For any $c > 0$, let \[\samp{P}_{a} (c, n) = \left\{ \samp{S} \suchthat \samp{S} \subset \pop{P}_{\samp{S}}(n) \mbox{ and } \abs{a(\samp{S}) - a(\pop{P})} < c \right\} \] and define the proportion \[p_{a}(c,n) = \frac{\size{\samp{P}_{a} (c, n)}}{\size{\pop{P}_{\samp{S}}(n)}} \] for all $c > 0$, and $n \le N$. What we have observed in this example is that for a fixed $c >0$, $p_{a}(c,n)$ increases with $n$.

Proportion sample error curves of the average over all possible samples for different sample sizes

Another feature worth noting is how much each histogram resembles a Gaussian density. This is largely an effect due to the attribute being an average. Still, the Gaussian shape appears to be there for even $n=5$. Note however that there is much data in each histogram (i.e of all $N \choose n$ sample averages) that it is not hard to detect that the shape is not exactly Gaussian. Indeed, each histogram is demonstrably different from that which would be produced by a typical sample of the same size from a Gaussian distribution. Moreover, in this example, the difference is just as detectable when $n=23$ as when $n=5$. Though the shapes agree very well with a Gaussian distribution for the great central bulk of the data, they do not in the extremes. The Gaussian tails appear a bit heavier than the histogram in that there is a greater proportion of the data under a Gaussian tail than appears under the histogram tail.

3.1.2 Other attributes

Here we consider some other simple population attributes, two more location attributes, and three scale attributes.

The two additional location attributes are the median and the trimmed average. The $100p$% trimmed average removes the $100p$% largest and the $100p$% smallest variate values $y$ before averaging the remaining $y$ values. That is the $100p$% trimmed average is the arithmetic average of the central $100(1-2p)$% of the sorted values. For example, in R the function mean(y, trim = 0.10) returns the average of the central 80% of the variate values $y$.

The scale estimates are the range $\abs{y_{max} - y_{min}}$, the inter-quartile range $\abs{Q_y(3/4) - Q_y(1/4)}$, and the population standard deviation \[\sqrt{\frac{\sum_{u \in \pop{P}} \left( y_u - \widebar{y}\right)^2}{N}} \]

To see the effect, we need to calculate the values analogous to avePop, avesSamp, and avesSampSets for each of these new population attributes. Once calculated, we can look at the histograms for each attribute.

Exercise: Calculate these values and reproduce the histograms below.

First for the location attributes for samples of size $n = 5$. Note that these are all plotted on the same scale to aid the comparisons. The value of the attribute calculated on the whole population is marked.

Different location attributes: over all possible samples (n = 5)

The average is as before. The trimmed average behaves much like the average. The median however is quite different. When there are an odd number like $n=5$ units in the sample, the median will be one of the $y$ values in the sample. This is why we see such distinct bars in the histogram. Nevertheless, the sample attribute values do concentrate around the population value, even more so than does either average.

Similarly, the scale attributes can be compared for samples of size $n = 5$.

Different scale attributes: over all possible samples (n = 5)

The sample values for the range are quite far from the population range, considerably underestimating its value. The sample errors would be mostly negative. Clearly, the right most bar, where the range is correct, are those samples which happen to contain the population $y_{min}$ and $y_{max}$.

The interquartile range performs much better with some positive sample errors in addition to the negative sample errors. Still, the population interquartile range appears to be far more freauently underestimated than over estimated.

The standard deviation appears much more like an average. The sample values concentrate (roughly symmetrically compared to the interquartile range) about the population value.

We might also investigate the effect of sample size on the histograms of each attribute’s sample values.

The following function is used to produce a png file with the histograms for the fours samples sizes arranged in a $2\times2$ grid.

doit <- function(valsSamp1, valsRestSamps, 
                 sizeSamp1, sizeRestSamps, 
                 popVal, FUN = mean, 
                 xlab = "Average shark length (inches)",
                 fname = "SampleSizeOtherAttributes")
{
  ## Get the range of length averages
  xlims <- extendrange(list(valsSamp1, valsRestSamps))
  
  ## Set up output file
  outDir <- "img"
  png(filename = paste(outDir, fname, sep=dirsep),
      width = 600, height = 500)
  ## All plots on one image
  par(mfrow = c(2,2))
  
  ## Include n=5 case from earlier but with density vertical axis
  hist(valsSamp1, col=adjustcolor("grey", alpha = 0.5),
       main=paste("n =", sizeSamp1),
       xlab=xlab,
       probability = TRUE,
       xlim= xlims,
       breaks=25
  )
  ### Mark the population attribute in red
  abline(v=popVal, col="red", lty=3, lwd=2)
  
  ### Now the rest
  for (i in 1:length(sizeRestSamps)) {
    sampSize <- sizeRestSamps[i]
    sampSet <- valsRestSamps[[i]]
    hist(sampSet, col=adjustcolor("grey", alpha = 0.5),
         xlim=xlims,
         main=paste("n =", sampSize),
         xlab=xlab,
         probability = TRUE,
         breaks=25
    )
    ### Mark the population attribute in red
    abline(v=popVal, col="red", lty=3, lwd=2)
  }
  dev.off()
}

Assuming that the arguments exist, the following calls will produce the files.

doit(trimmedMeansSamp, trimmedMeanSampSets, 
     5, sampleSizes, 
     trimmedMeanPop, 
     xlab = "Trimmed average shark length (inches)",
     fname = "TrimmedMeanSampleSizeEffect.png")
doit(mediansSamp, mediansSampSets, 
     5, sampleSizes, 
     medianPop, 
     xlab = "Median shark length (inches)",
     fname = "MedianSampleSizeEffect.png")
doit(rangesSamp, rangesSampSets, 
     5, sampleSizes, 
     rangePop, 
     xlab = "Range of shark lengths (inches)",
     fname = "RangeSampleSizeEffect.png")
doit(iqrsSamp, iqrsSampSets, 
     5, sampleSizes, 
     iqrPop, 
     xlab = "IQR of shark lengths (inches)",
     fname = "IQRSampleSizeEffect.png")
doit(sdsSamp, sdsSampSets, 
     5, sampleSizes, 
     sdPop, 
     xlab = "SD of shark length (inches)",
     fname = "sdsSampleSizeEffect.png")

The results are as follows. First for the two additional location attributes.

Trimmed averages (trim = 0.10) over all possible samples for different sample sizes

The trimmed average behaves much like the ordinary average.

Medians over all possible samples for different sample sizes

The median behaviour is much as it was before. There is a greater variety of possible values when the sample size $n$ is even, since the median is then the average of the two middle values. As the sample size increases, there is a greater concentration of the sample values about the population values.

Ranges over all possible samples for different sample sizes

The range shows a consistent underestimation of the population value. The average sample error will be negative. As the sample size increases, more samples will contain both $y_{min}$ and $y_{max}$ and so will match the population value of the range.

Interquartile ranges over all possible samples for different sample sizes

The interquartile range histogram becimes more symmetric and increasingly concentrated about the population value as $n$ increases.

Standard deviations over all possible samples for different sample sizes

The standard deviations concentrate about the population value as the sample size increases. The histogram is quite skewed.

Note: All of the calculations for each attribute are for this particular population. Results will vary depending on the population values as well as the attributes.

3.1.3 Comparisons across attributes

To help us compare across different attributes, we can use the relative absolute sample error. For any $c > 0$, let \[\samp{P}^\star_{a} (c, n) = \left\{ \samp{S} \suchthat \samp{S} \subset \pop{P}_{\samp{S}}(n) ~\mbox{ and }~ \frac{\abs{a(\samp{S}) - a(\pop{P})}} {\abs{a(\pop{P})}} < c \right\} \] and define the corresponding proportion \[p^\star_{a}(c,n) = \frac{\size{\samp{P}^\star_{a} (c, n)}}{\size{\pop{P}_{\samp{S}}(n)}} \] for all $c > 0$, and $n \le N$. It is important to realize that this measures the consistency of the sample attribute with respect to the same population attribute. When making comparisons between attributes, then, we are evaluating each attribute on how well its sample values track its population value on the same scale.

For each sample size $n$, we can plot the proportion $p^\star_{a}(c,n)$ as a function of $c$.

For the location attributes we have

Locations: proportion sample relative error curves over all possible samples for different sample sizes

The 10% trimmed average performs the same as the average itself for this population. The median also performs as well. Note that the median for sampe size $n=23$ never achieves zero relative error. The reason is that for odd sample sizes the median will be a the value of $y$ for a point in the sample. If, as in this case, the population size $N$ is even, then the population attribute is an average of two $y$ values. Unlesss these are identical, the median for an odd sample size cannot exactly reproduce the population median.

Scales: proportion sample relative error curves over all possible samples for different sample sizes

The range has zero sample error for any sample that includes the minimum and maximum values of $y$ in the population. When $n=23$, for a population of $N=28$, a great many samples will include these values, hence the vertical line at left for the range. The range outperforms the interquartile range in the sense that relative error curves for the range are consistently to the left and above those of the interquartile range. Similarly, with the exception of the proportion of zero error samples for the range when $n=23$, the standard deviation outperforms both the range and the interquartile range.

It is important to note that these findings hold for this particular population. To see how things might change dramatically when the population is slightly different, we could introduce a single outlier into the population as in the exercise below.

Exercise: The “Discovery Channel” has been one of the worst offenders of demonizing sharks with its “shark week”. It has even produced fake documentaries to attract ratings. For example, in 2014 the Discovery Channel produced the following film and, though entirely faked, passed it off as “documentary evidence” about a supposed 35-40 foot “cunning”, “intelligent”, and “stealthy” killer great white called Submarine Shark of Darkness – Wrath of Submarine. While fake, suppose that a great white shark the size of “submarine” was encountered in Australia waters. In this exercise, reconstruct the above relative error curves for the six attributes (three location, three scale) after substituting the length of 480 inches for one of the shark lengths in the Australian waters. That is, instead of the original sharks data, use the data constructed as in the following R code:

sharksBigSubmarine <- sharks
set.seed(12345564)
replaceShark <- sample(length(popSharksAustralia), 1)
rownameReplaceShark <- popSharksAustralia[replaceShark]
sharksBigSubmarine[rownameReplaceShark, "Length"] <- 480

Exercise: How sensitive are the various relative errors to the presence of the single outlier “submarine”? Comment on any differences or similarities you notice in the the curves you produced compared with those produced above. (Instructor note code in Rmarkdown file.)

3.2 Selecting samples

The concentration of the sample attribute values about the population attribute, and the consistency seen in that attribute as the sample size increased reassure us that estimating the population attribute from a sample attribute may not be too misleading (at least in terms of its sample error). Unfortunately, as the previous results demonstrate, it is also possible that the sample in hand is such that it calculation of any attribute value from it could be grossly in error.

For any sample, there is little to suggest whether it is good or bad in itself. The attribute determined on that sample might be identical to that determine on the population. Or, it might be so different we would be completely misled about the true nature of the population attribute from the sample attribute. Worse, there is no way to tell which is the case.

This is why it is important to understand **how* the sample is selected, and if it is within our power to do so to have a hand in selecting the sample itself. Even when the latter is possible, enormous care must be taken so that our own prejudices and pre-conceptions about the population do not render a sample that is misleading.

A simple route out of the problem is suggested by the histograms seen in the previous sections. Imagine that every sample of size $n$ is itself a unit in some population, namely the set of all possible samples of size $n$. That is, consider the population \[ \pop{P}_{\samp{S}} = \left\{\samp{S}_1, \samp{S}_2, \ldots, \samp{S}_M \right\} \] where each $\samp{S}_i$ is a sample of size $n$ from the original population $\pop{P}$. The samples have become units of a population and any attribute $a(\samp{S}_i)$ is now just a variate on that unit! If we select our sample from $\pop{P}_{\samp{S}}$ with probability $\frac{1}{M}$ then the histograms seen in above show the distribution for the variate values $a(\samp{S})$. This is good news! It means that by randomly selecting a sample from $\pop{P}_{\samp{S}}$ we are able to make statements about the probability that the resulting attribute $a(\samp{S})$ will take on any value. In particular, for the population of samples of size $n=5$, we know that with probability $\frac{1}{2}$ the attribute that results will be within the range $[142.4, 169.8]$ inches. That is, we know that \[Pr\left(~a(\samp{S}) \in [142.4, 169.8]~\right) = \frac{1}{2} \] because we are selecting $\samp{S}$ from $\pop{P}_{\samp{S}}$ with probability $p(\samp{S}) = \frac{1}{M}$. We can similarly read off many other probabilities about $a(\samp{S})$ from the histogram.

Suppose we draw a sample of $m$ units $\samp{S}_{u_1}, \ldots, \samp{S}_{u_m}$ from $\pop{P}_{\samp{S}}$

### Select a number of samples
m <- 10000
### from the N_s all possible which have already been created
### we randomly select m with equal probability
set.seed(312345)
selection <- sample(N_s, m)
### Now pick them out of the population of all samples
samplesSelected <- samples[,selection]
### Calculate the averages (could have just picked these too)
avesSampSel <- apply(samplesSelected, MARGIN = 2, 
                     FUN = function(s){mean(sharks[s,"Length"])})

From this randomly selected sample of $m$ = 10,000 samples $\samp{S}_{u_i}$ from the population $\pop{P}_{\samp{S}}$ of ${N \choose n} = {28 \choose 5} =$ 98,280 possible samples, a histogram of the 10,000 sample attribute values may be constructed. In the Figure below, the leftmost histogram is of all 98,280 attributes from all possible samples. Contrast with the middle histogram constructed only from the 10,000 randomly selected samples. The latter, in blue, is overlaid the former in the third histogram.

All versus 10,000 randomly selected samples (n = 5)

Comparing the two left most histograms, the 10,000 samples selected produced a histogram very much like that produced by all 98,280 samples. This may seem surprising, given that only about 10% of all possible samples were used. More surprising perhaps is that no knowledge about which samples should be selected and which might be better avoided was used. Rather amazingly, the samples were selected at random with equal probability from all available samples.

As the rightmost figure shows, by selecting the samples with equal probability, the smaller histogram reproduces the larger histogram approximately proportionately. This point can be made more precisely since we are selecting each sample with probability \[p(\samp{S}) = \frac{1}{M}.\] Suppose the histograms have $K$ bins $B_1 = (b_0, b_1], B_2 = (b_1, b_2], \ldots, (b_{K-1}, b_K]~$ and that the $k$th bin $B_k$ contains $M_k \ge 0$ of the attribute values $a(S_i)$ $i =1, \ldots, M$. The bins contain the attribute values of all of the $\samp{S}_i \in \pop{P}_{\samp{S}}$ so that $\sum_{k=1}^K M_k = M$. Let $m_k$ be the number of the $m$ selected samples whose attribute value falls in $B_k$, with $m = \sum_{k=1}^K m_k$. With this notation, the grey histogram bars of the rightmost figure have heights $M_1, \ldots, M_K$ and the blue histogram heights $m_1, \ldots, m_K$ for one randomly selected sample of size $m$ of samples $\samp{S}$ from $\pop{P}_{\samp{S}}$.

The probability of any particular histogram arising from a random selection of $m$ samples is therefore a multivariate hypergeometric probability \[ \frac{{M_1 \choose m_1} {M_2 \choose m_2} \cdots {M_K \choose m_K}} {{M \choose m}} \] which, when $m << M$ can be approximated by the multinomial probability \[ {M \choose m_1 ~~ m_2 ~~ \cdots ~~ m_K}~ p_1^{m_1} ~ p_2^{m_2} ~ \cdots ~ p_K^{m_K} \] with probabilities $p_k = \frac{M_k}{M}~$ for $~k = 1, \ldots, K$. From the multinomial, the expected value of the number of attribute values in each bin $B_k$ is proportional to $M_k$ (i.e. $m p_k = \frac{m}{M}~M_k$).

The resulting frequency histogram is (in expectation at least) a scaled version of that of all possible samples. An instance (as opposed to the expectation) is shown in the right most display above. On a density scale, where the vertical axis is scaled so that the area of the histogram is 1, the histograms are the same (in expectation). Again, comparisons of the two leftmost displays in the above plot are essentially density scaled (not quite but very nearly).

Exercise Reproduce the two histograms above but with the argument freq = FALSE.

Exercise Plot the quantiles of the attribute values of all $M$ available samples. In a separate plot (or overlaid with legend) plot the quantiles of a randomly selected sample of size $m$ from all possible samples for $m=10000$, $m=1000$, and $m=100$.

Exercise For each value of $m$ and the corresponding sample, produce a quantile-quantile plot to compare the attribute values of $m$ samples selected randomly from $M$ to those of all $M$. Hint: Use the function quantile(...) in R and the probs = ppoints(m) argument to pair quantiles when producing a quantile-quantile plot.

3.2.1 Sampling

We select a sample $\samp{S}$ from the population $\pop{P}_{\samp{S}}$ of size $M$ containing all available samples. Suppose, as above, we make that selection randomly so that every sample $\samp{S}$ has some probability $p(\samp{S}) \ge 0$ of being selected. We require of course that the probabilities also sum to one, that is \[ \sum_{\samp{S} \in \pop{P}_{\samp{S}}} p(\samp{S}) = 1.\] For any sample, $\samp{S} \in \pop{P}_{\samp{S}}$, we have its sample error \[ \mbox{Sample Error } ~ = a(\samp{S}) - a(\pop{P}).\] For any collection of samples (or population of samples) $\pop{P}_{\samp{S}}$, we have the average sample error \[ \mbox{Average Sample Error } ~ = \frac{1}{M} \sum_{\samp{S} \in \pop{P}_{\samp{S}}} \left(a(\samp{S}) - a(\pop{P})\right) .\] By sampling $\samp{S}$ randomly from $\pop{P}_{\samp{S}}$, we also have the sampling bias \[\begin{array}{rcl} \mbox{Sampling Bias} &=& E(a(\samp{S})) - a(\pop{P}) \\ &&\\ &=& \sum_{\samp{S} \in \pop{P}_{\samp{S}}} a(\samp{S})p(\samp{S}) - a(\pop{P}) \\ &&\\ &=& \sum_{\samp{S} \in \pop{P}_{\samp{S}}} \left(a(\samp{S}) - a(\pop{P})\right)p(\samp{S}) \end{array} \] which is just an expected sample error induced by the repeated random sampling of $\samp{S}$ from $\pop{P}_{\samp{S}}$. Of course, when $p(\samp{S}) = \frac{1}{M}$, the sampling bias is identical to the average sample error of $a(\samp{P})$.

We could similarly define other characteristics of the sampling such as the sampling variance \[Var\left(a(\samp{S})\right) = E\left(~ \left(~ a(\samp{S}) - E(a(\samp{S})) ~\right)^2 ~\right) \] where all expectations are taken with respect to the probabilities $p(\samp{S})$ of the samples $\samp{S}$ from $\pop{P}_{\samp{S}}$.

Clearly, the sampling bias depends on the attribute $a(\cdot)$, the set of possible samples $\pop{P}_{\samp{S}}$, and the sample probabilities $p(\samp{S})$. Ideally we would like to choose these, particularly $p(\samp{S})$ and/or $\pop{P}_{\samp{S}}$, so that both the square of the sampling bias and the sampling variance are as small as possible.

Notation could be simplified somewhat by introducing a random variate, say $A$, that takes values $a$ from the distinct values of $a(\samp{S})$ for all $\samp{S} \in \pop{P}_{\samp{S}}$. The induced probability distribution has \[Pr(A = a) = \sum_{\samp{S}\in \pop{P}_{\samp{S}} } p(\samp{S}) \times I_{\{a\}}(a(\samp{S})) \] where $I_X(x)$ is the usual indicator function defined for any $x$ and set $X$ as \[I_X(x) = \left\{ \begin{array}{rcl} 1 &~~~&\mbox{if }~ x \in X \\ 0 && \mbox{otherwise.} \end{array} \right. \]

Exercise: If there are only $K \le M$ distinct values, say $a_1, \ldots, a_K$, then show that $A$, as defined above, is a discrete random variate with probabilities $Pr(A=a_i)$. Express the sampling bias and the sampling variance in terms of this random variate.

It follows that $A$ is a discrete random variate like any other. Probability statements about its values can be made using its distribution, including its expectation, variance, et cetera.

3.2.2 Large populations

If we select $m$ samples from $\pop{P}_{\samp{S}}$ with probability $p(\samp{S}) = \frac{1}{M}$ we should be able to reproduce, using only the $m$ samples selected, many characteristics of the $a(\samp{S})$ for all $\samp{S}$ in $\pop{P}_{\samp{S}}$ provided $m$ is sufficiently large. Rather than calculate the attribute value on all $N \choose n$ possible samples, we might only determine the values for a much smaller number of these.

For example, consider the agricultural census of US counties whose population consists of only $N=3078$ counties. To determine the distribution of an attribute for sample sizes $n=100$, there are $3078 \choose 100$ or about $1.4 \times 10^{190}$ possible samples. To put this into some perspective, to determine $a(\samp{S})$ for all possible samples would require a computer capable of determining about $1 \times 10^{170}$ new attribute values every second about 4.5 billion years (or the approximate age of the Earth) to complete. Clearly all possible samples is not possible for even moderate populations sizes (e.g. see the Wikipedia page on future developments of number of floating point operations per second, of FLOPS, are currently being imagined.)

The combinatorial explosion is avoided if we examine only $m$, say $m =$ 10,000, samples. Selecting these at random gives us some hope of calculating the values which in aggregate will be much like those of all possible samples. This would provide us a means of learning something about much larger populations than the $N=28$ case we have so far exclusively examined.

Unfortunately, if we have to enumerate all possible samples just to select from them we are no farther ahead. The combinatorial explosion is now entirely in the enumeration.

3.2.3 Sampling mechanisms

Rather than select samples at random from all possible samples, the same outcome is effected by sampling the units that will appear in any particular sample. That is rather than select $\samp{S}$ with probability $p(\samp{S})$ from $\pop{P}_{\samp{S}} = \left\{\samp{S}_1, \samp{S}_2, \ldots, \samp{S}_N \right\}$, we form $\samp{S}$ by selecting $n$ units $u_{i_1}, u_{i_2}, \ldots, u_{i_n}$ directly from the population of units $\pop{P} = \left\{u_1, u_2, \ldots, u_N \right\}$.

The simplest way to think about this is to imagine that each unit $u$ in a sample $\samp{S}$ is selected one at a time from the population $\pop{P}$. Let \[ s_k = (u_{i_1}, u_{i_2}, \ldots, u_{i_k}) \] be the sequence of the first $k$ units $u_{i}$ selected from $\pop{P}$. Then a sampling mechanism is defined by the probabilities \[ Pr(u \given k, s_{k-1}) ~~~\mbox{and}~~~ Pr(u). \] The first unit is selected with probability $Pr(u)$ and the probability of the sequence of the first $k$ units selected is \[ Pr(s_k) = Pr(u_{i_1}) \times Pr(u_{i_2} \given 2, s_1) \times Pr(u_{i_3} \given 3, s_2) \times \cdots \times Pr(u_{i_k} \given k-1, s_{k-1}).\] Now for a sample $\samp{S}$ of size $n$, the order in which the units appeared does not matter. Any permutation of the elements of $s_n$ counts as $\samp{S}$ provided $s_n$ and $\samp{S}$ contain exactly the same units. It follows then that $p(\samp{S})$ is simply the sum of $Pr(s_n)$ over all permutations $s_n$ having the same units as $\samp{S}$.

A sampling mechanism which comes immediately to mind is simple random sampling without replacement. In this case \[Pr(u) = \frac{1}{N} ~~~\mbox{and}~~~ Pr(u \given k, s_{k-1}) = \frac{1}{N-k+1} \] yielding \[Pr(s_n) = \frac{1}{N} \times \frac{1}{N-1} \times \frac{1}{N-2} \times \cdots \times \frac{1}{N-n+1}\] which is the same for all $n!$ permutations of the units in $s_n$. This gives the sum over all permutations as \[p(\samp{S}) = \frac{n!}{N(N-1)(N-2)\cdots (N-n+1)} = \frac{1}{{N \choose n}}, \] which is the same probability we had before for selecting $n$ distinct units from a population of $N$ distinct units. However, we now have a mechanism that allows us to select a sample without first enumerating all $M = N \choose n$ possible samples in $\pop{P}_{\samp{S}}$.

In R, the indices of a simple random sample of size $n$ from indices $1, \ldots,N$ generated without replacement is returned from the function call sample(N,n). If rather than indices, the units were identified by the (assumed unique) contents of a vector Pop, then sample(Pop,n) would return the vector of units in the sample.

This also suggests another sampling mechanism, namely simple random sampling with replacement. Here \[Pr(u) = \frac{1}{N} = Pr(u \given k, s_{k-1}) \] since every unit is eligible for selection at each turn. Samples $\samp{S}$ can now contain anywhere from 1 to $n$ distinct units; typically, one or more units will be repeated in the sample. For any sequence $s_n$ of $n$ selections \[Pr(s_n) = \left(\frac{1}{N}\right)^n.\] We might now proceed in either of two ways to describe $\samp{S}$, $p(\samp{S})$, and $\pop{P}_{\samp{S}}$.

The first and easiest is to distinguish each sequence $s_n$ as a different sample $\samp{S}$, and so preserving the importance of the order. In this case \[p(\samp{S}) = \frac{1}{N^n}\] and the population of all populations $\pop{P}_{\samp{S}}$ contains $M= N^n$ different samples.

The second is to take take every sequence $s_n$ containing the same distinct units and the same number of each as producing the same sample however the units are ordered in $s_n$. If there are $k$ distinct units and $n_1 > 0, n_2 > 0, \ldots, n_k >0$ of each (with $\sum_{i=1}^k n_i =n$) then \[p(\samp{S}) = \frac{{n \choose n_1 n_2 \cdots n_k}}{N^N} \] and $\pop{P}_{\samp{S}}$ consists of all such possible samples.

For any attribute that is invariant to the order of the units in the sample, the two ways of thinking about all possible samples will yield identical results.

To generate simple random amples with replacement in R, the previous calls are adjusted to include the argument replace = TRUE as in sample(N, m, replace = TRUE).

The difference between these two methods can be seen on the Australian shark encounter population. This time, we take $n=15$ which for sampling without replacement yields a population $\pop{P}_{\samp{S}}$ of size $M =$ 37,442,160. For sampling with replacement, $\pop{P}_{\samp{S}}$ is much larger containing $M = 5.097655 \times 10^{21} =$ 5,097,655,000,000,000,000,000 different sequences $s_{15}$ (from following the first interpretation). We will sample $m = 10,000$ from each of these and compare the averages as attribues.

### sample size
n <- 15
### number of samples
m <- 10000

### reproducibiity
set.seed(123415)
### samples without replacement
sampsWithout <- Map(function(i){sample(popSharksAustralia, size=n, replace = FALSE)}, 
                    1:m)
### attribute evaluated on each sample
aveWithout <- Map(function(s){mean(sharks[s,"Length"])}, sampsWithout)


### samples with replacement
sampsWith <- Map(function(i){sample(popSharksAustralia, size=n, replace = TRUE)}, 
                 1:m)
### attribute evaluated on each sample
aveWith <- Map(function(s){mean(sharks[s,"Length"])}, sampsWith)

### Note that in both cases, there are so many samples to choose from
### that we are not going to worry about whether we have repeated any
### in the m we have selected from M
###

### Now prepare to plot histograms
### 
### Use the same x scale in the plots
xlim <- extendrange(c(aveWith, aveWithout))
### and bins
bins <- hist(c(as.numeric(aveWithout), as.numeric(aveWith)),
             breaks = 30, plot=FALSE)
### And heights
ylim <- c(0, 2000)
  
### Without replacement
### 
hist(as.numeric(aveWithout), main = "Average without replacement", 
     xlim = xlim, xlab = "length (inches)", ylim = ylim,
     breaks = bins$breaks, col =  adjustcolor("grey", 0.75))
abline(v=avePop, col="red", lty=2)

### and with
hist(as.numeric(aveWith), main = "Average with replacement", 
     xlim = xlim, xlab = "length (inches)", ylim = ylim,
     breaks = bins$breaks, col =  adjustcolor("grey", 0.75))
abline(v=avePop, col="red", lty=2)

Simple random sampling without replacement produces a slightly more concentrated histogram. This is evidenced as well by the five number summaries:

### Without replacement
fivenum(as.numeric(aveWithout))

## [1] 128.0000 149.8000 155.7333 161.7333 183.6000

### With replacement
fivenum(as.numeric(aveWith))

## [1] 106.8000 147.4667 156.0000 164.4000 203.6000

3.2.3.1 A curious sampling mechanism

The following mechanism was first explored by (Basu 1958). Suppose we proceed as in simple random sampling with replacement except that we remove any duplicate units. The samples produced will have sizes anywhere from $1$ to $n$ according to how many distinct units were selected in a sample (sampling with replacement). For our example, this results in the following histogram

aveWithUnique <- Map(function(s){mean(sharks[unique(s),"Length"])}, sampsWith)

hist(as.numeric(aveWithUnique), main = "Average with replacement unique", 
     xlim = xlim, xlab = "length (inches)", ylim = ylim,
     breaks = bins$breaks, col =  adjustcolor("grey", 0.75))
abline(v=avePop, col="red", lty=2)

### With replacement
fivenum(as.numeric(aveWithUnique))

## [1] 107.8750 148.3846 155.8182 163.4306 192.2500

The strange thing is that this sampling mechanism produces a histogram that is more concentrated than simple random sampling with replacement!

To make this result as simple as possible, suppose that we had a box containing $N$ different balls that are either white or black. We wish to estimate the proportion of balls in the box which are black by drawing $n$ balls at random from the box. We proceed according to the three different sampling mechanisms.

Simple random sampling without replacement. Randomly draw $n$ balls from the box one after another, without replacing any at any time. For each black ball selected score $y = 1$ and for each white ball score $y = 0$. The estimate of the proportion of black balls in the box will be the average of the scores $\widebar{y}$.
Simple random sampling with replacement. Randomly draw $n$ balls from the box one after another, each time replacing the ball after scoring it as 1 or 0, as described above. Again the estimate of the proportion of black balls in the box will be the average of the scores $\widebar{y}$.
Randomly varying sample sizes. Proceed as in 2, selecting one ball at a time and recording its score before returning it to the box for possible future random selection. Except now, every time a ball is drawn mark it with an $X$ before returning it to the box, potentially to be randomly drawn again. If a ball drawn already has an $X$ marked on it, then it counts as a draw, is returned to the box, but no score is recorded for it. Continue in this way until $n$ draws have been made. The estimate of the proportion of black balls in the box is the average of the recorded scores (anywhere from 1 to $n$ such scores).
Note: the size of the sample of distinct units is itself a random variate taking values $1, \ldots, n$ with different probabilities summing to 1.

Exercise Using all of the shark encounters (not just those in Australian waters), generate 10,000 samples of size 20 according to the three different sampling mechanisms. In place of black and white balls, whether a shark encounter was fatal or not will be the $y$ recorded (1 if fatal, 0 otherwise). Compare how these three methods perform in estimating the proportion of the shark encounters in the population which were fatal.

3.2.3.2 Implementation of sampling mechanisms

We could implement any of the above sampling mechanisms as a single call to a creator function.

### This will create a sampling mechanism
createSamplingMechanism <- function (pop, method = c("withoutReplacement",
                                                     "withReplacement", 
                                                     "withUnique")) {

  method = match.arg(method)
  switch (
    method,
    "withReplacement"  = function (sampSize) {
      sample(pop, sampSize, replace=TRUE)
    },
    "withoutReplacement" = function (sampSize) {
      sample(pop, sampSize, replace=FALSE)
    },
    "withUnique" = function (sampSize) {
      unique(sample(pop, sampSize, replace=TRUE))
    },
    stop(paste("No sampling mechanism:", method))
      )
}

For example, for simple random sampling without replacement on the population of all sharks, we might define a function srswor(sampSize) as

### without replacement is the default method.
srswor <- createSamplingMechanism(popSharks)

which now allows us to generate a sample of any size containing units selected without replacement from the population of all sharks.

set.seed(354661)
### A sample of size 5
srswor(5)

## [1] "45" "30" "5"  "46" "40"

### of size 10
srswor(10)

##  [1] "1"  "50" "54" "31" "61" "28" "55" "32" "23" "26"

### Of size 30
srswor(30)

##  [1] "22" "4"  "3"  "8"  "58" "20" "52" "32" "31" "49" "30" "47" "33" "23"
## [15] "39" "28" "41" "15" "44" "21" "62" "57" "10" "56" "9"  "19" "54" "51"
## [29] "63" "36"

Similarly, for the unique units from a sample with replacement of some size.

set.seed(354661)
### create the sampling mechnism
uniquewr <- createSamplingMechanism(popSharks, method="withUnique") 
### A sample uniquely from size 30 with replacement
uniquewr(30)

##  [1] "45" "31" "5"  "48" "42" "1"  "51" "56" "32" "65" "60" "36" "26" "30"
## [15] "22" "4"  "3"  "8"  "62" "21" "57" "35" "41" "29" "49"

### again
uniquewr(30)

##  [1] "37" "55" "21" "61" "29" "11" "46" "16" "62" "13" "30" "51" "53" "5" 
## [15] "65" "63" "39" "12" "58" "15" "1"  "19" "43" "36" "52" "20"

### again
uniquewr(30)

##  [1] "15" "3"  "28" "55" "12" "25" "2"  "56" "49" "50" "45" "22" "6"  "61"
## [15] "20" "26" "8"  "48" "40" "1"  "36" "47" "13" "30"

Note that different sample sizes can result for this method.

Note also, that for all mechanisms the population has been specified so that the created function will only generate samples from that population. This allows us to write different sampling mechnisms that might actually depend on some features of the population.

3.2.4 Probability of a unit being in a sample

In addition to the probability, $p(\samp{S})$, of selecting a sample $\samp{S}$ from $\pop{P}_{\samp{S}}$, it can be of interest to determine the probability that any unit $u$ will appear in the sample. This can be derived from $p(\samp{S})$.

First, consider the function \[ D(u) = \left\{ \begin{array}{lcl} 1 & ~~~~~ &\text{if} ~~ u \in \samp{S} \\ 0 && \text{otherwise}. \end{array} \right.\] $D(u)$ is a binary random variate that takes value 1 with probability $Pr(\samp{S} \ni u)$ (read as the probability that the sample $\samp{S}$ contains $u$) and 0 otherwise. The probability of inclusion of $u$ in $\samp{S}$ is \[\begin{array}{rcl} \pi_u & = & E(D(u))\\ & = & 1 \times Pr(D(u) = 1) + 0 \times Pr(D(u) = 0) \\ & = & Pr(\samp{S} \ni u) \\ & = & \sum_{\samp{S} ~\ni~ u} ~p(\samp{S}) \end{array} \] This is called the inclusion probability of $u$ in the sample $\samp{S}$; it is the probability that the unit $u$ will be in a sample $\samp{S}$ selected according to $p(\samp{S})$. In the same way, \[\begin{array}{rcl} \pi_{uv} & = & Pr(~ \samp{S} \ni u \mbox{ and } \samp{S} \ni v~ ) \\ & = & E\left(~D(u) \times D(v)~ \right) \\ & = & \sum_{\samp{S} ~\ni~ u, v} ~p(\samp{S}) \end{array} \] denotes the probability that the sample $\samp{S}$ contains both units $u$ and $v$ from $\pop{P}$. Sums are over all $\samp{S} \in \pop{P}_{\samp{S}}$ containing the designated units.

Note that \[ \sum_{u \in \pop{P}} \pi_u = n, ~\mbox{ the size of } ~ \samp{S} \] and \[ \sum_{v \in \pop{P}} \pi_{uv} = n \pi_u. \] Exercise: Prove these two results.

In some sense, since $\sum_{u \in \pop{P}} \pi_u = n$, $\pi_u$ is the expected contribution of unit $u$ to the sample $\samp{S}$ that is randomly selected with probability $p(\samp{S})$.

Note also that when the sampling mechanism is simple random sampling without replacement the probability that a unit $u$ will be in the sample is \[ \pi_u = \frac{n}{N}. \] Exercise: Prove this.

For joint inclusion probabilities, we have for simple random sampling without replacement (assuming $u \ne v$) \[ \pi_{uv} = \frac{n (n-1)}{N (N-1)}.\]

Exercise: Prove this.

More challenging is the determination of the inclusion probabilities for simple random sampling with replacement. In this case \[ \pi_u = 1 - \left(\frac{N-1}{N} \right)^n. \] Exercise: Prove this.

For joint inclusion probabilities, we have for simple random sampling with replacement (assuming $u \ne v$) –> \[ \pi_{uv} = 1 - 2\left(\frac{N-1}{N}\right)^n + \left(\frac{N-2}{N}\right)^n.\]

Exercise: Prove this. Hint: Apply the same trick as in finding $\pi_u$. Also determinations of probabilities are become simpler if conditioned on units known not to be in the sample.

Exercise: The inclusion probabilities for sampling with replacement but using only the unique units selected (i.e. the “curious” mechanism discussed earlier and investigated by Basu) are identical to simple random sampling with replacement. Why?

We could also write functions to produce whatever inclusion probabilities are implied by the sampling design. For example, the inclusion probabilities for simple random sampling could be written as follows.

### some utility functions
popSize <- function(pop) {nrow(as.data.frame(pop))}
sampSize <- function(samp) {popSize(samp)}

### This function returns the function that will 
### give the inclusion probability for any unit in
### the population, when samples have been
### selected without replacement
### 
createInclusionProbFn <- function(pop, sampSize) {
  N <- popSize(pop)
  n <- sampSize
  function(u) { n/N }
}

### This function returns the function that will 
### give the inclusion probability for any pair of
### units in the population when samples have been
### selected without replacement
### 
createJointInclusionProbFn <- function(pop, sampSize) {
  N <- popSize(pop)
  n <- sampSize
  function(u,v) { 
    ## Note that the answer depends on whether u and v
    ## are the same or different
    if (u == v) {n/N} else {(n * (n-1)) / (N * (N-1))}
  }
}

3.2.5 Estimating totals

Many attributes are either a total \[ a(\pop{P}) = \sum_{u \in \pop{P}} y_u\] for some variate $y_u$ defined for any unit $u \in \pop{P}$, or a function of such a total. A population average $\widebar{z}$, for example, could be expressed either way depending on whether the variate is $y_u = z_u/N$, or $y_u = z_u$.

The attribute might focus on any subpopulation $\pop{A} \subset \pop{P}$ by simply taking the variate to be the variate multiplied by the appropriate indicator function: \[y_u \times I_{\pop{A}}(u).\] Note that, such a sub-population could be defined in a variety of ways, including having it depend on the value of another variate $x_u$ as in $y_u\times I_B(x_u)$.

If interest is in the size of a subpopulation $\pop{A}$, then the variate would be $y_u = I_{\pop{A}}(u)$. If $\pop{A} = \pop{P}$ this is the size of the population and $y_u =1$ for all $u \in \pop{P}$. (Recall that a variate $y$ is any function that when applied to any unit $u \in \pop{P}$ returns a value $y_u = y(u)$.)

Another important example is the population cumulative distribution function (cdf) $F_{\pop{P}}(y)$ at a specific value $y$ defined as \[F_{\pop{P}}(y) = \frac{1}{N} a(\pop{P}) = \frac{1}{N} \sum_{u \in \pop{P}} I_{(-\infty, ~y]}(y_u), \] where the total is now over the binary variate given by the indicator function.

If we have $F_{\pop{P}}(y)$, then a number of other attributes may be calculated from it such as any quantile via the inverse $Q_y(p) = F^{-1}_{\pop{P}}(p)$. For mathematical purposes, we could define the quantile (and hence “inverse”) as \[Q_y(p) = \inf \left\{ y_u \suchthat p \le F_{\pop{P}}(y_u) ~\mbox{ and }~ u \pop{P} \right\}\] although in practice we might also choose instead to interpolate between two successive ordered values $y_{(i)} \le y_{(i+1)}$ whenever $F_{\pop{P}}(y_{(i)}) \le p \le F_{\pop{P}}(y_{(i+1)})$.

Exercise: If $F_{\pop{P}}(y_{(i)}) \le p \le F_{\pop{P}}(y_{(i+1)})$ for $y_{(i)} < y_{(i+1)}$, give a mathematical expression for the value that would be returned by a simple linear interpolation.

A natural estimate of a population total $a(\pop{P}) = \sum_{u\in\pop{P}}y_u$, called the Horvitz-Thompson estimate after (Horvitz and Thompson 1952), is \[ \widehat{a}(\pop{P}) = a_{HT}(\samp{S}) = \sum_{u \in \samp{S}} \frac{y_u}{\pi_u}\] where each value in the sample sum is weighted inversely to its probability of inclusion in $\samp{S}$. So if the probability of inclusion is small, the weight will be high, and if it is large then the weight will be low.

Note that the Horvitz-Thompson estimate is not necessarily Fisher consistent, in the sense that it is not necessarily the same as $a(\samp{S})$ – hence the use of the subscript $HT$ and the “hat” above $a(\pop{P})$ in the definition.

Exercise: Determine whether the Horvitz-Thompson estimator is location-scale invariant.

Exercise: Determine whether the Horvitz-Thompson estimator is location-scale equivariant.

Exercise: Determine the sensitivity curve for the Horvitz-Thompson estimator (assume $y$ has inclusion probability $\pi$). Draw the curve for various values of $\pi$. Comment on the results.

Some of the repeated sampling properties of the Horvitz-Thompson estimator can be derived. As before, it will be convenient to work with the random variate \[ D(u) = \left\{ \begin{array}{lcl} 1 & ~~~~~ &\text{if} ~~ u \in \samp{S} \\ 0 && \text{otherwise}. \end{array} \right.\]

For example, the expectation of the estimator (“wig” or “tilde” now used to distinguish its random nature under repeated sampling) \[ \begin{array}{rcl} E(\bigwig{a}(\pop{P}) ) &=& E\left( \bigwig{a}_{HT}(\samp{S}) \right) \\ &&\\ &=& E\left(\sum_{u \in \samp{S}} \frac{y_u}{\pi_u} \right) \\ &&\\ &=& E\left(\sum_{u \in \pop{P}} D(u) \times \frac{y_u}{\pi_u}\right) \\ &&\\ &=&\sum_{u \in \pop{P}} \frac{y_u}{\pi_u} E\left(D(u)\right)\\ &&\\ &=&\sum_{u \in \pop{P}} \frac{y_u}{\pi_u} \pi_u\\ &&\\ &=&\sum_{u \in \pop{P}}y_u\\ &&\\ &=&a(\pop{P}).\\ \end{array} \] So the Horvitz-Thompson estimator is unbiased for an attribute that is a population total.

The estimator is very easily implemented in general, given the functions previously written. Namely, as

createHTestimator <- function(pi_u_fn) {
  function(samp, variateFn) {
    Reduce(`+`, 
           Map(function(u) {variateFn(u)/ pi_u_fn(u)}, samp),
           init = 0
    )
  }
}

Note that because of the generality of the Horvitz-Thompson estimator, the above implementation assumes that the estimator will be handed some function variateFn(u) that when applied to a unit $u$ in the population $\pop{P}$ will return the value $y_u$ of the variate whose total we are trying to estimate in the population.

Here is the simplest of variateFn creator.

### Assuming we can have a general variate
### whose total over the population we wish to estimate.
### It needs all of the population data in general
### for possibly complicately derived variates.
createvariateFn <- function(popData, variate) {
  function (u){popData[u, variate]}
}

For the population of shark encounters in Australian waters, the Horvitz-Thompson estimator for any sample of specified size can now be calculated using these functions.

### We'll use samples of size 5, and so will need inclusion
### probability functions for this sample size for 
### the appropriate samplingMechanism (here simple random
### sampling without replacement)
inclusionProb <- createInclusionProbFn(popSharksAustralia, 
                                       sampSize = 5)

### And that is all that is needed to create the
### Horvitz-Thompson estimator
sharksHTestimator <- createHTestimator(inclusionProb)

To make use of the estimator to actually calculate some estimates, a few more functions are needed.

### We need the sampling mechanism for 
### simple random sampling without replacement for this population
selectSharksWoR  <- createSamplingMechanism(popSharksAustralia, 
                                            method = "withoutReplacement")
### And we need a variate whose total we are interested
### in estimating.
sharkLength <- createvariateFn(sharks[popSharksAustralia,], "Length")

The calculation of an estimate for a particular sample is now straight forward.

### Here's a sample of size 5 (to match the inclusion probabilities)
sharkSample <- selectSharksWoR(sampSize = 5)
### And we need a variate whose total we are interested
### in estimating.
sharkLength <- createvariateFn(sharks[popSharksAustralia,], "Length")
###
### And the estimate is
sharksHTestimator(sharkSample, sharkLength)

## [1] 5370.4

which is an estimate of the total of all shark lengths for encounters in Australian waters. Compare this to the population total for 10,000 such samples:

popTotal <- sum(sharks[popSharksAustralia, "Length"])

totals <- Map(function(rep) {
  sharksHTestimator(selectSharksWoR(sampSize = 5), 
                    sharkLength)},
  1:10000)

hist(as.numeric(totals), col=adjustcolor("grey", alpha = 0.5),
     main="Horvitz-Thompson estimates (n = 5)",
     xlab="Total shark lengths in Australian encounters  (inches)",
     breaks=25
     )
### Mark the population attribute in red
abline(v=popTotal, col="red", lty=3, lwd=2)

The histogram looks a lot like the one calculated for the same number of samples selected with equal probability, except for the scale of the horizontal axis these are estimates of the total of shark lengths and not of the average. Of course dividing these estimates by the known population size $N=28$ would produce estimates of the average shark length, as before.

Exercise: Implement the Horvitz-Thompson estimators for the value of the cumulative proportion of the shark lengths in Australian encounters less than 175 inches. How might you estimate the whole cumulative distribution function?

Exercise: Implement an estimator of the median based on an appropriate Horvitz-Thompson estimator of a total.

As we did with the expectation, the variance of the Horvitz-Thompson estimator can also be determined. First note that \[ \begin{array}{rcl} Var(\bigwig{a}(\pop{P}) ) &=& Var\left(\bigwig{a}_{HT}(\samp{S}) \right) \\ &&\\ &=& E\left(~(\bigwig{a}_{HT}(\samp{S}))^2~ \right) -\left(~ E\left( \bigwig{a}_{HT}(\samp{S}) \right)~\right)^2 \\ &&\\ &=& E\left(~(\bigwig{a}_{HT}(\samp{S}))^2~ \right) -\sum_{u \in \pop{P}}\sum_{v \in \pop{P}} y_u y_v\\ \end{array} \] using the unbiasedness of the Horvitz-Thompson estimator just proved. The remaining expectation can be calculated as \[ \begin{array}{rcl} E\left(~(\bigwig{a}_{HT}(\samp{S}))^2~ \right) &=& E\left(~ \sum_{u \in \samp{S}}\sum_{v \in \samp{S}} \frac{y_u}{\pi_u} \frac{y_v}{\pi_v} ~ \right)\\ &&\\ &=& E\left(~ \sum_{u \in \pop{P}}\sum_{v \in \pop{P}} \frac{y_u}{\pi_u} \frac{y_v}{\pi_v} D(u)D(v) ~ \right)\\ &&\\ &=& \sum_{u \in \pop{P}}\sum_{v \in \pop{P}} \frac{y_u}{\pi_u} \frac{y_v}{\pi_v} E\left(~ D(u)D(v) ~ \right)\\ &&\\ &=& \sum_{u \in \pop{P}}\sum_{v \in \pop{P}} \frac{y_u}{\pi_u} \frac{y_v}{\pi_v} \pi_{uv}.\\ \end{array} \] Putting these two results together we have \[ Var\left( \bigwig{a}_{HT}(\samp{S}) \right) = \sum_{u \in \pop{P}}\sum_{v \in \pop{P}} (\pi_{uv} - \pi_{u}\pi_{v}) \frac{y_u}{\pi_u} \frac{y_v}{\pi_v} = \sum_{u \in \pop{P}}\sum_{v \in \pop{P}} \Delta_{uv}\frac{y_u}{\pi_u} \frac{y_v}{\pi_v} \] as the variance of the Horvitz-Thompson estimator, where $\Delta_{uv} = \pi_{uv} - \pi_{u}\pi_{v}$. The latter is just the covariance $Cov(D(u), D(v)) = \Delta_{uv}$.

Note that the joint inclusion probability can be written using a conditional inclusion probability $\pi_{v \given u}$ as $\pi_{uv} = \pi_u \pi_{v \given u}$. Whenever $u=v$, $\pi_{v \given u} = 1$ and $\Delta_{uu} = \pi_u(1- \pi_u)$, the latter being the variance $Var(D(u))$.

The variance of the Horvitz-Thompson estimator may now be equivalently written as \[ Var\left( a_{HT}(\samp{S}) \right) = \sum_{u \in \pop{P}} (1 - \pi_{u}) \frac{y_u^2}{\pi_u} + \sum_{u \in \pop{P}}\sum_{\begin{array}{c}v \in \pop{P} \\ v \ne u \end{array}} \Delta_{uv}\frac{y_u}{\pi_u} \frac{y_v}{\pi_v} \] The variance of the Horvitz-Thompson estimator can also be rewritten as \[ Var\left( \bigwig{a}_{HT}(\samp{S}) \right) = -\frac{1}{2}\sum_{u \in \pop{P}}\sum_{v \in \pop{P}}\Delta_{uv} \left(\frac{y_u}{\pi_u} - \frac{y_v}{\pi_v}\right)^2 \] which is sometimes called the Yates-Grundy formulation, or the Sen-Yates-Grundy formula after the authors Sen, and Yates and Grundy who independently showed it held.

Exercise: Prove the Sen-Yates-Grundy formulation gives the variance of the Horvitz-Thompson estimate.

Exercise: Show that when the sampling mechanism is simple random sampling without replacement that

(easy) the Horvitz-Thompson estimator of the population total is \[ a_{HT}(\samp{S}) = \frac{N}{n} \sum_{u \in \samp{S}} y_u\]
(harder) with variance \[ Var(\bigwig{a}_{HT}(\samp{S})) = N^2\left(1 - \frac{n}{N} \right) \frac{1}{n} \left( \frac{\sum_{u \in \pop{P}} (y_u - \widebar{y})^2}{N-1} \right) \]
Note that dividing this by $N^2$ gives the variance of the sample average estimator $\sum_{u \in \samp{S}} y_u /n$ for the population average $\sum_{u \in \samp{P}} y_u /N$. The variance formula should look somewhat familiar except for a finite population correction $\left(1 - \frac{n}{N} \right)$. Seeing this, explain the formula (divided by $N^2$) in words. What if $N>>n$?

Exercise: When the sampling mechanism is simple random sampling with replacement, mathematically determine the Horvitz-Thompson estimate and its variance. Comment on the differences (or not) between these values and those obtained when sampling without replacement.

Note that the variance of the Horvitz-Thompson estimator is a function of the units $u$ in the population $\pop{P}$. Hence it too is an attribute of that population. More importantly, if we consider the population $\pop{P}_{uv}$ to be the population of size $N^2$ consisting of all pairs $(u,v)$ where $u, v \in \pop{P}$, then the variance of the Horvitz-Thompson estimator can be written as \[ Var(\bigwig{a}_{HT}(\samp{S})) = \sum_{(u,v) \in \pop{P}_{uv}} q_{u,v} \] where \[ q_{u,v} = \Delta_{uv}\frac{y_u}{\pi_u}\frac{y_v}{\pi_v}. \] That is $Var(\bigwig{a}_{HT}(\samp{S}))$ is itself a total! Only now it is over a population of pairs $(u,v)$. That means that we can use a Horvitz-Thompson estimator of this total to get an unbiased estimator!

We take a sample from this population to be If we let $S_{uv} = \samp{S} \times \samp{S}$ where $\samp{S}$ is selected from $\pop{P}$ with probability $p(\samp{S})$ as usual. Then inclusion probability for each pair $(u,v)$ is simply $\pi_{uv} > 0$. The Horvitz-Thompson estimate of this is
\[ \begin{array}{rcl} \widehat{Var}(\bigwig{a}_{HT}(\samp{S})) &=& \sum_{(u,v) \in \samp{S}_{uv}}\frac{q_{u,v}}{\pi_{uv}}\\ &&\\ &=& \sum_{(u,v) \in \samp{S}_{uv}} \frac{\Delta_{uv}}{\pi_{uv}} \frac{y_u}{\pi_u}\frac{y_v}{\pi_v}\\ &&\\ &=& \sum_{u \in \samp{S}}\sum_{v \in \samp{S}} \frac{\Delta_{uv}}{\pi_{uv}} \frac{y_u}{\pi_u}\frac{y_v}{\pi_v}\\ &&\\ &=& \sum_{u \in \samp{S}}\sum_{v \in \samp{S}} \left(\frac{\pi_{uv} - \pi_u\pi_v}{\pi_{uv}}\right) \frac{y_u}{\pi_u}\frac{y_v}{\pi_v}.\\ \end{array} \] It follows that the estimator is unbiased for the variance! (Of course, the Sen-Yates-Grundy formula might also be used to derive an equivalent formula.)

Using Horvitz-Thompson estimation, we are able to construct an estimate of the population total and an estimate of the variance of these estimators. Both estimators are unbiased.

As with the Horvitz-Thompson estimator, we can write a function that produces the estimator of the variance of the Horvitz-THompson estimator.

createHTVarianceEstimator <- function(pop, pi_u_fn, pi_uv_fn) {
  function(samp, variateFn) {
    Reduce(`+`,
           Map(function(u) {
             pi_u <- pi_u_fn(u)
             y_u <- variateFn(u)
             Reduce(`+`, 
                    Map(function(v) {
                      pi_v <- pi_u_fn(v)
                      pi_uv <- pi_uv_fn(u, v)
                      y_v <- variateFn(v)
                      Delta_uv <- pi_uv - pi_u * pi_v
                      
                      result <- (Delta_uv  * y_u * y_v) 
                      result <- result/(pi_uv * pi_u * pi_v)
                      result
                    }, 
                    samp),
                    init = 0) 
           },
           samp
           ),
           init = 0)
    
  }
}

This allows an estimate of the variance of the estimator based on the same sample used to construct the original Horvitz-Thompson estimate.

We illustrate this on the Australian sharks.

### we have the sharkSample from before
sharkSample

## [1] "6"  "25" "61" "40" "22"

###
### To estimate the variance we need the joint inclusion probabilities
inclusionJointProb <- createJointInclusionProbFn(popSharksAustralia, 
                                                 sampSize = 5)
###
### The estimator of the variance is had
HTVarianceEstimator <- createHTVarianceEstimator(popSharksAustralia, 
                                                 pi_u_fn = inclusionProb, 
                                                 pi_uv_fn = inclusionJointProb)
###
### This function can now be used to calculate the
### variance estimate from the sample
HTVarianceEstimator(sharkSample, sharkLength)

## [1] 130242.6

which can be compared to the variance calculated on the 10,000 Horvitz-Thompson estimates of total shark length. We can also generate samples to see what the distribution of the variance estimator would be.

### The variance of the 10,000 estimates
var(as.numeric(totals))

## [1] 305610.4

### Similarly, 10,000 variance estimates can be produced
### 
variances <- Map(function(rep) {
  HTVarianceEstimator(selectSharksWoR(sampSize = 5), sharkLength)},
  1:10000)

The results of which can be plotted (with the value for the totals overlaid in red).

As can be seen, there is considerable variation in the estimates of the variance (or standard deviation).

Exercise: Have students generate 10,000 samples and for each $\samp{S}$ determine the interval \[ \left[ a_{HT}(\samp{S}) - 2 \widehat{SD}(\bigwig{a}_{HT}(\samp{S})), a_{HT}(\samp{S}) + 2 \widehat{SD}(\bigwig{a}_{HT}(\samp{S})) \right] \] and determine the proportion of these intevals that contain $a(\pop{P})$. They should comment on their findings.

Exercise: Write the above as one or more functions that can be reused with any population, sample size, and sampling design.

3.2.5.1 Some estimators which are not Horvitz Thompson estimators

Note that the Horvitz-Thompson estimator constructed above was for sampling without replacement. Using the sample average to estimate the population average when sampling with replacement is not a Horvitz-Thompson estimator.

Exercise Construct a Horvitz-Thompson estimator using the inclusion probabilities for sampling with replacement. How does this compare to using the sample average of the same sample? How does this compare to the sample average of the values from the unique units in the sample?

3.2.6 Sampling design

The pair $(\pop{P}_{\samp{S}}, p(\samp{S}))$ together determine which samples are possible and with what probability they are selected. Together they are called a sampling design. To avoid redundant designs we can, without loss of generality, take $p(\samp{S}) > 0$ for all $\samp{S} \in \pop{P}_{\samp{S}}$ by removing from $\pop{P}_{\samp{S}}$ any $\samp{S}$ for which $p(\samp{S})=0$.

The sampling design is ours to choose. We could choose $\pop{P}_{\samp{S}}$ so that the values $a(\samp{S})$ for $\samp{S} \in \pop{P}_{\samp{S}}$ are constrained to be near $a(\pop{P})$. Alternatively, or additionally, we could choose $p(\samp{S})$ so that samples $\samp{S} \in \pop{P}_{\samp{S}}$ that have $a(\samp{S})$ close to $a(\pop{P})$ have higher probability, $p(\samp{S})$, of being selected.

One measure of closeness is the mean squared error (MSE) of an estimator $\bigwig{a}(\samp{S})$: \[ \begin{array}{rcl} MSE(\bigwig{a}(\samp{S})) &=& E\left(~(\bigwig{a}(\samp{S}) - a(\pop{P})^2 \right)\\ &&\\ &=& E\left(~(\bigwig{a}(\samp{S}) - E(\bigwig{a}(\samp{S})))^2 \right) + \left(~(E(\bigwig{a}(\samp{S})) - a(\pop{P})^2 \right) \\ &&\\ &=& Var(~\bigwig{a}(\samp{S})~) + \left(Bias(~ \bigwig{a}(\samp{S})~) \right)^2\ \end{array} \] which depends on the sampling design $(\pop{P}_{\samp{S}})$, as well as the estimator $\bigwig{a}(\samp{S})$ and attribute of interest $a(\pop{P})$. All are in our control.

Exercise: Prove that the MSE decomposes into the sum of the variance and the squared bias.

For a Horvitz-Thompson estimator, the bias term is zero and its variance is \[ Var\left( a_{HT}(\samp{S}) \right) = -\frac{1}{2}\sum_{u \in \pop{P}}\sum_{v \in \pop{P}}\Delta_{uv} \left(\frac{y_u}{\pi_u} - \frac{y_v}{\pi_v}\right)^2. \] An advantage to this Sen-Yates-Grundy formulation is that it gives some insight into how we might best choose choose a design.

For example, the formulation suggests that if we could choose $\pi_u \propto y_u$ then the variance will be zero! Unfortunately, this is not the case since $y_u$ is unknown and, moreover, would violate the assumptions used to derive these formulas (e.g. the “unbiased Horvitz-Thompson estimate” of the population total would always be $n$!).

However, it does suggest that if there were auxiliary information whereby the difference \[\left(\frac{y_u}{\pi_u} - \frac{y_v}{\pi_v} \right)^2\] could be made smaller for all $u, v \samp{S}$ then the mean-square error would be smaller. For example, perhaps we have auxiliary information on all units in the population so that whenever $y_u \approx y_v$ we could arrange that $\pi_u \approx \pi_v$ (e.g. stratified sampling tries to effect this). Or perhaps there is another variate $x_u$ that is highly positively correlated with $y_u$ for all $u \in \pop{P}$. Then choosing $\pi_u \propto x_u$ could result in a smaller mean squared error.

Much of survey sampling is concerned with how best to choose the sampling design $(\pop{P}_{\samp{S}}, p(\samp{S}))$ to reduce the MSE. Different designs can be conceived, largely depending on the nature and quality of available auxiliary information.

TODO Write up a function that uses a data structure (named list will do) that represents a sampling design. This should then be used to simplify code.

3.2.7 Large populations revisited

Look examples of simple random sampling with replacement compared to population attribute for some larger population where all possible samples are impossible to calculate. The US census of agriculture is a good one.

Generate stratified random sampling (proportional to size) and show how it reduces the variance.

Exercise: Determine inclusion probabilities for stratified random sampling. Implement the corresponding functions in R using above code (tests programming ability).

Exercise: Introduce Neyman optimal allocation and do derivations and/or implement code. Compare by simulation with proportional allocation. Could also use another variate’s variance to determine allocations (compare variates that are correlated with $y$ to those that are not.)

4 Inductive inference

As has been seen, probabilistic reasoning can be brought to bear on discussing the potential magnitude of a sample error provided a probabilistic sampling mechanism is used to select a sample from the population. That is, the sampling mechanism constructs a sample $\samp{S}$ say from a set of possible samples, say $\left\{ \samp{S}_1, \samp{S}_2, \ldots , \samp{S}_M \right\}$ where each sample $\samp{S}_j$ has some probability $p_j \ge 0$ of being selected (with $\sum_{j=1}^M p_j = 1$).

In this framework, the sampling behaviour of any population attribute of interest can be examined by repeatedly drawing samples according to the given mechanism. By independently drawing samples according to the sampling mechanism, some sense of the variability of any attribute from sample to sample is easily obtained. This allows us to compare attributes and sampling mechanisms. We have already seen this for a variety of attributes and sampling mechanisms.

Whenever the attribute takes numerical values, the sampling behaviour of that attribute can also be described by a variety of numerical measures. One important such measure is the attribute’s sampling bias: \[ \begin{array}{rcl} \mbox{sampling bias} ~~&=& E_{\ve{p}}\left(a(\samp{S}) - a (\pop{P}) \right) \\ &&\\ &=& E_{\ve{p}}\left(a(\samp{S})\right) - a (\pop{P}) \\ &&\\ &=& \left( \sum_{j=1}^M a(\samp{S}_j) \times p_j \right) - a(\pop{P})\\ \end{array} \] for the given attribute $a(\cdot)$ where $E_{\ve{p}}(\cdots)$ indicates expectation over the probability distribution of the samples. Similarly we could define a measure of the sampling variability for the mechanism and attribute of interest (for example, the variance of the attribute values over the samples selected randomly according to their probabilities). The mean squared error of the attribute for a given population and sampling mechanism combines both the sampling bias and the sampling variability in a single summary (as was shown previously).

Note that measures such as sampling bias and variability necessarily depend on both the attribute $a(\cdot)$ and the probabilities $p_j$ of samples produced by the sampling mechanism. Knowing this, we might be able to choose an attribute and/or sampling mechanism to make this bias and/or variability as small as possible. For example, attributes that correspond to Horvitz-Thompson estimators have a sampling bias of zero; for other attributes this need not be the case. In other instances, we might wish to have some bias provided the resulting variability was much smaller so as to achieve an overall smaller mean squared error.

Probabilistic sampling allows us to quantify the relative frequency with which any sample might appear and hence the relative frequency with which any sample attribute value might be realized. Knowing the relative frequency with which sample attribute values might differ by some amount from the population attribute provides us a mechanism to quantify the uncertainty of how close a given sample attribute value might be to the unknown population attribute we hope to infer. Probabilistic sampling provides an insurance policy that what is learned about the attributes of the samples may be applied to the attributes of the population with some confidence. While there is no guarantee that this application will be without error, by careful planning the probability that the error is small can be made large.

When the sampling is not probabilistic, the insurance policy no longer applies. Of course, the nearer the sampling mechanism is to being probabilistic, the more it might be argued that the benefits of probabilistic sampling apply. Such an argument must be done with considerable care. Perhaps the most compelling case is where a deterministic mechanism is used to produce pseudo-random numbers that are in turn used to select samples. While strictly speaking these are not probabilistic, these mechanisms have been developed to be close enough that they are indistinguishable for most practical purposes.

Possible Exercise: Have students construct a simple linear congruential generator and oberve empirically how much care must be taken in choosing the parameter values. Alternatively, an examination of the randu data in R could reinforce the point.

4.1 Target and Study populations

Carefully designed probabilistic sampling can provide considerable assurance that the conclusions drawn from a sample will not likely be that different from those which would have been drawn were we able to access the entire population. In most applications however there is almost always another source of error in our inferences that is not resolved by probabilistic sampling.

The problem is that in most applications, the population which we are able to draw samples from is not the population about which we would like to draw inferences.

For example, in medical studies interest often lies in the progression of a disease or the efficacy of its treatment in humans. The set of all humans, then, is a target population. However, for ethical and other reasons, the study cannot be conducted on humans but must instead be conducted on some other animal, such as mice, which serve as a model for humans. The population of mice available from which we can select is a study population.

While probabilistic sampling from this population provides some assurance about the quality and uncertainty of our inferences about study population’s attributes, this assurance does not carry over to inferences about the corresponding attributes of the *target population. Mice, after all, might be fundamentally different from humans for these particular attributes. The quality of the inference about the attributes of the target population, $\pop{P}_{target}$, necessarily also depends upon how closely the attributes of the study population, $\pop{P}_{study}$, match those of the target population.

The entire inductive path is as shown below:

Path of Inductive inference

The difference between the attribute evaluated on the two populations is the study error: \[ \mbox{Study Error} = a(\pop{P}_{study}) - a(\pop{P}_{target}). \] Taking a sample $\samp{S}$ to draw inferences about the target population $\pop{P}_{target}$, the error for a given attribute $a(\cdot)$ is \[ \begin{array}{rcl} a(\samp{S}) - a(\pop{P}_{target}) &=& \left( a(\samp{S}) - a(\pop{P}_{study}) \right) + \left( a(\pop{P}_{study}) - a(\pop{P}_{target}) \right) \\ &&\\ &=& \left( \mbox{Sample error}\right) ~+~ \left( \mbox{Study error}\right). \end{array} \] Probabilistic sampling from the study population allows control of the sample error but not of the study error. Making the case that the study error is small remains a challenge. Unless the study population is itself a probabilistically selected sample from the target population, that case must rely on something else.

Perhaps the historically most familiar case is that where the target population includes future realizations of units which are not available at the time of study. For example, in the case of a natural phenomenon such as the number of sunspots appearing on the sun in a given earth month our study population consists of the historical record and the target population is the future. More human phenomena such as the monthly average of a financial index like the Dow-Jones Index, look very similar in that a historical record is the study population and the target population consists of months in the future. In either case, arguing that the study error must be small requires that the future should be much like the past (at least for these attributes). This is the problem of induction that notably troubled David Hume, amongst other great thinkers. Hume famously wrote in 1748:

For all inferences from experience suppose, as their foundation, that the future will resemble the past, and that similar powers will be conjoined with similar sensible qualities. If there be any suspicion that the course of nature may change, and that the past may be no rule for the future, all experience becomes useless, and can give rise to no inference or conclusion. It is impossible, therefore, that any arguments from experience can prove this resemblance of the past to the future; since all these arguments are founded on the supposition of that resemblance.

From Hume (1748) “An Enquiry Concerning Human Understanding” (Section IV, Part II) Also accessible as an ebook including here on page 37 of the archive edition

To get past this difficulty, many writers after Hume (including John Stuart Mill, Immanuel Kant, John Venn, and Bertrand Russell) have appealed to the uniformity of nature, either as observed by experience (Mill), or as an a fundamental truth (Kant), or as a postulate that is simply required to make progress (Venn, Russell).

While all others might agree on its necessity, as Hume pointed out centuries ago, there is no forceful argument that it must hold in general. Arguments based on the uniformity of nature are likely more plausible for natural phenomena such as the number of sunspots occurring in a given time period. If the phenomenon is human in nature, such as in financial markets or social networks, the argument via uniformity of nature is far less compelling. In particular, wherever there is feedback in that the phenomenon can change because it is being studied (e.g. people and investors change behaviour in light of what is observed about past behaviour) appeal to the uniformity of nature can be specious.

In practice, each case must be argued itself. For our purposes, it needs to be argued that the study error be small. Such argument is typically beyond any statistical argument; probabilisti sampling from the study population will not improve matters if the study population is very different from the target population in the attributes of interest.

That this need not always be the case is illustrated by a story from the Second World War of the twentieth century. During that war, the statistician Abraham Wald fled persecution in Hungary to immigrate to the United States where he helped U.S. military determine how to minimize bomber losses. Earlier analyses of damage to returning planes suggested that the bombers have heavy armour plating to those parts of the plane which were most damaged. Investigators would of course have access only to those planes which returned from bombing missions so that the returning planes constitute the study population. Unfortunately, these are not the planes of interest precisely because they survived whatever damage they received. In contrast, the target population are those planes which did not return; whatever damage they received caused them to not survive the mission. Given their sample of returning planes, the bullet holes and damage were marked on a diagram of the plane as shown below.

Dark areas mark damage found on the returning planes sampled

At left is the outline of the plane, at right the same outline with blackened areas indicating where damage had occurred on any plane in the sample. The diagram at right is then a sample (graphical) attribute of interest, one which provides an estimate of the same for the study population – the black area summarizes over all of the planes in the sample the damage they received. The light areas indicate those places where no damage was seen on any plane in the sample. It is likely to be a good estimate of the same attribute on the study population, that is the sample error is likely small, but possibly a very poor estimate of the same attribute on the target population (those planes which were shot down).

However, there is a clear relationship between the study population attribute and the target population attribute. Those planes which did not return likely had damage in the light areas of the right hand figure. These correspond roughly to where the crew sat and to where the fuel tank lay. Knowing the relation between the target population and the study population, we also know the nature of the study error involved and how to correct for it. The recommendation becomes clear. Armour only those areas which show no damage in the returning planes – the cockpit and fuel tank.

4.1.1 Example: Shark encounters

A few concrete examples of possible target and study populations are easily constructed from the dataset on the shark encounters.

Worldwide there were a total of 65 encounters. These might constitute the target population and the attribute of interest might be the average length of sharks involved in these encounters worldwide. We might, for a variety of reasons, have access only to those encounters which occurred in one part of the world, say those in Australian waters. In this case, the study population is the set of all encounters in Australian waters. The extent to which the average shark length in Australian encounters differs from the average shark length of all encounters worldwide is the study error.

Arguing that this error would be small would mean arguing that the great white sharks involved in encounters in Australian waters are very the same size as those involved in encounters anywhere. This might require, for example, some discussion of the nature of great whites and whether they are more or less likely to be involved in encounters depending on their size and the waters in which encounters might occur.

Alternatively, since in this case the study population is a subset of the target population, it might even be that the bulk of the encounters are in Australian waters in which case they could dominate the average anyway. Then the study error could not be too large because most of the target population would be in the study population. Unfortunately fewer than half of the encounters ($N =$ 28 ) occurred in Australian waters so this argument loses some force and we are back to arguments based on the nature of great white sharks, their habitats, and the behaviour of humans in the various waters.

Here the study error is $a(\pop{P}_{study}) - a(\pop{P}_{target}) \approx$ -4.03 inches. Great whites in Australian encounters appear to be larger on average than those in all encounters worldwide.
Suppose we are interested in the average length of sharks in great white shark encounters with humans in US waters but have access only to those encounters which were in Australian waters. Now the target population is the set of all encounters in US waters and the study population the set of all encounters in Australian waters.

There is no intersection between the two populations. Arguments that the study error is likely to be small (or large) will depend entirely upon arguing that the difference in stock, in habitat, and in human and shark behaviour is so small (or large) as to make no (some) differences. For example, the migratory pattern of great white sharks might be such that the stock was in fact identical (e.g. see migration patterns).

Here the study error is $a(\pop{P}_{study}) - a(\pop{P}_{target}) \approx$ -5.52 inches. Great whites in Australian encounters appear to be larger on average than those in US waters. Note that, as we might reasonably expect, this study error is greater than that between all encounters worldwide and only those in Australian waters.

Exercise Why is it the case that this study error need not be necessarily larger (in absolute value) than that when the target population is all encounters worldwide (as in 1)? When would it necessarily be larger in magnitude?
Suppose we are interested in the average length of great white sharks in all future encounters. The target population can not be observed at all, since it consists of all encounters yet to be realized anywhere in the world. Our entire data set of 65 is all the data we can access and so constitutes the study population.

Arguing that the study error is small here is much more difficult than either of the previous cases. Assuming a uniformity of nature postulate also seems specious, in spite of this being a natural phenomenon. It is not hard, for example, to imagine that humans through their intentional actions (e.g. killing sharks, decreasing or increasing activity in known shark waters) or otherwise (destroying shark habitat) might affect the number and/or nature of encounters, and hence change the target population. The change might be in response to what is learned about the study population or it might not.

Here the study error cannot be determined because the target population involves the future.

Exercise: Suppose the population attribute is the proportion of encounters which resulted in a fatality.

Determine the study error for this population attribute for cases 1 and 2 above. How do they compare?
Determine the study error when the target population is the set of all encounters involving females and the study population is all encounters involving males. Comment on your finding.

Exercise: Suppose the population attribute is average shark length.

Determine the study error when the target population is all encounters and the study population is only those encounters involving surfers. Comment on your finding.

4.2 Measurements

The inductive path (as shown in the figure) includes the set of measured values. It is important to not forget that this is part of the induction. Errors made in measurement can also affect conclusions drawn about attributes.

For example, in the population of shark encounters, there is the measurement of the length of the shark involved. How this measurement was taken is never described. For the sake of accuracy, one might imagine that this was a measure taken on shore with the shark involved measured while hanging vertically along a fixed scale. Of course, this would not likely be the case as this would involve capturing and likely killing the great white shark involved in each encounter. Even so, there would presumably be some uncertainty in some cases about whether even the correct shark had been identified and sacrificed. The figure below shows a histogram of the lengths of all sharks involved in encounters except that it shows $length \mod 12$.

As can be seen, the great bulk of measurements are divisible by 12. Since the measurements are in inches, this suggests that the length measurements were often given to the nearest foot. This in turn suggests that the measurements were likely taken with different accuracies.

Exercise: Using a common vertical scale, and the same 12 bins (one for each inch), construct and compare the histograms of shark lengths (mod 12) for encounters in each of a. Australian waters, b. US waters, and c. Other waters (neither Australian nor USA). Label the histograms appropriately. How do the histograms compare? What does it say about potential differences (if any) in measuring systens in the three regions? How might the measurement system differences affect the study error in cases 1 and 2 discussed above?

Exercise: In the “Where’s Waldo” dataset, the process for taking measurements was twofold. First Ben Blatt took the measurements from a collection of books himself (he describes the process here). These measurements he published online in the following figure:

Waldo’s locations as determined by Ben Blatt

The second step in the measuring process was then taken by Randal Olson. He took the locations published online by Ben Blatt (shown above), measured these himself and provided his measurements from that scatterplot as the data set we have here.

Plot the scatterplot (suitably labelled) of $(x, y)$ coordinates from the waldo dataset produced by Olson. How does this scatterplot compare to that given by Ben Blatt and what does this say about the measuring system?

4.2.1 Measuring systems

Every measuring system has at least three sources of potential error. These are

the measuring device, or gauge, used,
the person reading or recording the measurement, and
the method followed to take the measurement.

The last of these excludes qualities of the gauge and the person taking the measurement; for example, shark length measurements taken on sharks hanging vertically could differ systematically from those taken on sharks laying horizontally on the ground. Even the measurement of fatality, which should be straightforward, could vary. In the case of the shark encounters, all fatalities were recorded, whether they were immediately due to the severity of the attack or occurred later in hospital for whatever reason.

4.3 Comparing sub-populations

Oftentimes, interest lies in two or more sub-populations. For example, the encounters that occurred in Australian waters and those that occurred in US waters together constitute two sub-populations of all encounters worldwide. How different these are from one another, in terms of length of the sharks involved etc., can be of considerable interest. We might like to know whether encounters with great white sharks are essentially the same wherever they occur, in the USA or in Australia.

If the encounters are essentially the same, then the sub-populations observed should not look too different if we were to mix them up with one another. Suppose that the population is represented as a data structure pop containing two sub-populations, pop1 and pop2 say, that are accessible from pop as pop$pop1 and pop$pop2. Then we can write a function which takes this population and mixes the two sub-populations. The following function assumes additionally that each sub-population has units as rows and variates as columns.

### The population could have its sub-populations redefined
### at random via the following function
mixRandomly <- function(pop) {
  ## Pop is expected to be a list with a pop1 and a pop2 component
  ## (possibly of different numbers of rows)
  ## 
  ## Extract the first sub-population and size
  pop1 <- pop$pop1
  n_pop1 <- nrow(pop1)
  ## Same for the second sub-population
  pop2 <- pop$pop2
  n_pop2 <- nrow(pop2)
  ## Now put these together into a single structure
  ## assumed to be matrix or data frame
  mix <- rbind(pop1,pop2)
  ## Now sample without replacement to get
  ## the units to be in the first sub-population
  select4pop1 <- sample(1:(n_pop1 + n_pop2),
                     n_pop1,
                     replace = FALSE)
  ## Construct the first sub-population from the random selection
  new_pop1 <- mix[select4pop1,]  
  ## The second sub-population is all the rest.
  new_pop2 <- mix[-select4pop1,]
  ## Return the population with the sub-populations mixed
  list(pop1=new_pop1, pop2=new_pop2)
}

Note that the mixing of the two sub-populations maintains the population sizes. That is, the mixed population still has the same size pop1 and pop2 as before.

The population data structure can be a simple list. For the shark encounters consisting only of those from Australia or the USA, this can be constructed as:

### For the great white shark encounter data, we take our population
### to be just those two sub-populations of encounters from
### Australia and USA waters.  
### 
pop <- list(pop1 = sharks[sharks[,"Australia"] ==1, ],
            pop2 = sharks[sharks[,"USA"] ==1, ])
### 
### The two sub populations are now easily mixed, 
### returning the same population but divided randomly 
### between the two sub-populations as 
mixedPop <- mixRandomly(pop)

If the two sub-populations are essentially the same, then they should appear much like any shuffled pair of sub-populations.

What is needed, then, is some attribute of the combined population which measures each of the two sub-populations. A natural attribute that compares the two sub-populations is the difference between the averages of the two sub-populations for the shark lengths. Another might be the ratio of the standard deviations of the shark lengths from each sub-populations. These are easily calculated as

### The difference in the average shark lengths 
mean(pop$pop1[,"Length"]) - mean(pop$pop2[,"Length"])

## [1] 5.524436

### The ratio of the standard deviations of shark lengths 
sd(pop$pop1[,"Length"])/sd(pop$pop2[,"Length"])

## [1] 1.056418

It will be convenient to write functions that return functions which in turn calculate these attributes for any of the variates in the population, not just length.

### A function that returns the difference in averages for any
### variate in the population:
getAveDiffsFn <- function(variate) {
  function(pop) {mean(pop$pop1[, variate]) - mean(pop$pop2[,variate])}
}

### A function that returns the ratio of sds for any
### variate in the population:
getSDRatioFn <- function(variate) {
  function(pop) {sd(pop$pop1[, variate])/sd(pop$pop2[, variate])}
}

These functions can be used to create functions that will calculate the corresponding attribute for any variate. The difference in averages of the shark lengths, for example, is calculated as follows.

### Get the attribute function
diffAveLengths <- getAveDiffsFn("Length")
### which on the population reveals
diffAveLengths(pop)

## [1] 5.524436

These functions can now be put together to see how unusual the given pair of sub-populations are to any randomly shuffled pair. Ideally, we could look at all possible shufflings. This is the same as all possible permutations of the numbers 1 to $N$ where $N = N_1 + N_2$ is the sum of the two sub-population sizes. However in the case of the shark encounters, this would require about $2.6 \times 10^{59}$ shuffles. We can make do with many fewer simply by sampling. Here, we will use 5,000 shuffles (some might be repeats).

### Together, these two functions can be used to generate
### a sample of differences from random permutations of the
### two sub populations.

diffLengths <- sapply(1:5000, 
                      FUN = function(...){diffAveLengths(mixRandomly(pop))})
hist(diffLengths, breaks=20,
     main = "Randomly mixed populations", xlab="difference in averages",
     col="lightgrey")
abline(v=diffAveLengths(pop), col = "red", lwd=2)
legend("topright", legend=("Australia - USA"), lwd = c(2), col = c("red"))

As the histogram shows, at least by the measure of this attribute (the difference in sub-population average shark lengths), the separation of the population into the Australia and USA sub-populations (marked in red on the histogram) is very much like what we might observe were the two sub-populations formed at random.

Exercise: Construct the corresponding histograms for comparing the populations $\pop{P}_{Australia}$ and $\pop{P}_{USA}$ but using the variate “Surfing” rather than “Length”. Are the two populations more or less similar with respect to this variate than with respect to “Length”? Justify your conclusions.

Exercise: Construct the corresponding histograms for comparing the populations $\pop{P}_{Australia}$ and $\pop{P}_{USA}$ via the ratio of the standard deviations of the shark lengths. Show your code.

Exercise: Construct the same sorts of histograms for comparing the number of acres devoted to farms in 1992 according to the US Census of Agriculture data for the following pairs of sub-populations:

the south and northeast regions,
the north central and the northeast regions

4.3.1 Anatomy of a test of significance

The same data can be used to construct a formal measure of how unusual the difference between the average shark lengths in Australia and the same average in the USA, and least in comparison to the sub-populations created by randomly mixing the original ones. This measure is called the observed significance level and can be calculated as:

\[ SL = Pr\left( ~\abs{a(\pop{P}_1) - a(\pop{P}_2)} \ge \abs{a(\pop{P}_{Australia}) - a(\pop{P}_{USA})} \right) \] where the populations $\pop{P}_1$ and $\pop{P}_2$ are randomly drawn (with equal probability) from the set of all pairs $(\pop{P}_1, \pop{P}_2)$ where \[ \pop{P}_1 \union \pop{P}_2 = \pop{P}_{Australia} \union \pop{P}_{USA}, \] \[ \pop{P}_1 \intersect \pop{P}_2 = \varnothing, \] \[ size(\pop{P}_1) = size(\pop{P}_{Australia}), ~~ \mbox{ and } ~~ size(\pop{P}_2) = size(\pop{P}_{USA}). \] The smaller is the value, $SL$, of this probability the more unususal is the observed pair of populations $(\pop{P}_{Australia}), \pop{P}_{USA})$.

Because we do not enumerate all possible permutations, we do not have the exact value of $SL$. It can of course be well approximated by using the sample of 5,000 pairs $(\pop{P}_1, \pop{P}_2)$ that we generated according to this probability mechanism. Calculating this approximation as sum(abs(diffLengths) >= abs(diffAveLengths(pop))) / length(diffLengths) gives $SL \approx \widehat{SL} =$ 0.6976.

The interpretation is that, supposing that the pair $(\pop{P}_{Australia}, \pop{P}_{USA})$ constituted a random draw from the above set of pairs $(\pop{P}_1, \pop{P}_2)$, then the probability of seeing at least as large a difference as we observed in $(\pop{P}_{Australia}, \pop{P}_{USA})$ is approximately 0.6976. A chance of seeing $(\pop{P}_{Australia}, \pop{P}_{USA})$ being as large as that indicates no evidence against the hypothesis that the pair $(\pop{P}_{Australia}, \pop{P}_{USA})$ was randomly drawn. We have no evidence against the hypothesis that the two populations $\pop{P}_{Australia}$ and $\pop{P}_{USA}$ are indistinguishable.

This is called a test of significance and we note that it has three parts.

A hypothesis, here expressed equivalently as
- $\pop{P}_{Australia}$ and $\pop{P}_{USA}$ are drawn from the same population of shark encounters, or
- the pair of sub-populations $(\pop{P}_{Australia}, \pop{P}_{USA})$ were created by randomly assigning units in the same population to one or other of the sub-populations,
A measure of discrepancy $D = D(\pop{P}_1, \pop{P}_2)$ where large values indicate evidence against the hypothesis,
The observed discrepancy $d = D(\pop{P}_{Australia}, \pop{P}_{USA})$, and
The probability of $D \ge d$ when the hypothesis is true.

The observed significance level, $SL$, is then \[ SL = Pr \left( D \ge d \given ~\mbox{the hypothesis is true} \right). \] If $SL$ is very small then either the hypothesis is true and we have observed a very unusual value of $d$, or, the hypothesis is false. Hence, the smaller is $SL$ the greater the evidence against the hypothesis.
In the extreme case where $SL = 0$, then we have observed something impossible and the hypothesis must therefore be false – this would be a proof by contradiction.

Note that $SL$ is also called the $p$-value by many writers.

There are also a few scientifically important things to note:

the observed significance level provides a common (probabilisty) scale on which to measure the evidence against the hypothesis assumed;
the observed significance level does not therefore measure evidence in favour of the hypothesis (in science, we try to falsify hypotheses and entertain only those which remain standing);
a test of significance therefore neither accepts nor rejects a hypothesis but simply provides a measure of the evidence against;
there is no magic level for $SL$ such as 0.05 or 0.01, there being no practical or scientific difference between $SL = 0.048$ and $SL = 0.051$ for example;
if the significance level was observed to be very near 1, that is the discrepancy observed is almost certain were the hypothesis to be true, then this too could be unusual and might legitimately cast doubt on whether the observed data were fraudulent or otherwise somehow tainted (it would certainly suggest investigating that possibility);
the fact that the evidence against the hypothesis is statistically significant based on some discrepancy measure does not imply that the discrepancy is practically significant (that is, the $SL$ measures how unusual a discrepancy of that size might be when the hypothesis holds, it says nothing about whether a discrepancy of that size matters for any practical or scientific purpose);
every test of significance is based on some measure of discrepancy and different discrepancy measures can detect different departures from the hypothesis, so one needs to understand the nature of the departure from the hypothesis that the discrepancy is trying to measure.

On the last of these points, the discrepancy measure focused only on the difference in the average shark length of the to populations of encounters. Any other departure such as the ratio of the standard deviations of the shark lengths is completely ignored. So too is any difference in the population attributable to values of any other variates.

Exercise: Perform the same significance test comparing $a(\pop{P}_{Australia})$ and $a(\pop{P}_{USA})$ when

the discrepancy measure is the difference in averages as before but with the variate “Fatality” rather than “Length”
$a(\pop{P})$ is the standard deviation of the “Length” for that population and the discrepancy measure is the ratio of standard deviations for $\pop{P}_{Australia}$ (numerator) to $\pop{P}_{USA}$ (denominator).

4.3.2 A t-like discrepancy measure

Another discrepancy measure, one that is very much like the differences in the average shark lengths is

\[D(\pop{P}_1, \pop{P}_2) = \frac{a(\pop{P}_1) - a(\pop{P}_2)}{SD(a(\pop{P}_1) - a(\pop{P}_2))}. \] This discrepancy measure is “physically dimensionless” in that whatever scale the numerator is measured in (e.g. inches as in the shark lengths), the scale of the denominator will match, leaving the ratio free of any measurement scale. This naturally makes this discrepancy measure scale-invariant.

Exercise: What conditions on $a(...)$ would be required for the measure to also be location-invariant?

This discrepancy measure will lead to the same conclusions as the difference in attribute values alone whenever the denominator is known (and constant), since it would simply be constant rescaling of that difference. However, should the denominator need to be estimated, say using information from $\pop{P}_1$ and $\pop{P}_2$, then the results might be different.

Suppose that this is the case and that we are imagining that the populations $\pop{P}_1$ and $\pop{P}_2$ are independently drawn from the same larger population. Then the discrepancy measure would become

\[ \begin{array}{rcl} D(\pop{P}_1, \pop{P}_2) &=& \frac{a(\pop{P}_1) ~-~ a(\pop{P}_2)}{\bigwig{SD}\left(a(\pop{P}_1) ~-~ a(\pop{P}_2)\right)}\\ &&\\ &=& \frac{a(\pop{P}_1) ~-~ a(\pop{P}_2)}{\left(\bigwig{SD}^2(a(\pop{P}_1)) ~ + ~ \bigwig{SD}^2(a(\pop{P}_2)) \right)^{\frac{1}{2}}} \end{array} \] where $\bigwig{SD}(\cdots)$ denotes an estimator of the standard deviation of its argument.

Suppose that $\pop{P}_1$ is always of size $n_1$, $\pop{P}_2$ is always of size $n_2$, and $a(\pop{P}_i) = \widebar{Y}_i$ is the arithmetic average of the variate $Y$ for units of $\pop{P}_i$, $i = 1, 2$. In this special case, the discrepancy measure can be written as \[ \begin{array}{rcl} D(\pop{P}_1, \pop{P}_2) &=& \frac{\widebar{Y}_1 ~-~ \widebar{Y}_2}{\bigwig{\sigma} \left(\frac{1}{n_1} ~ + ~ \frac{1}{n_2} \right)^{\frac{1}{2}}} \end{array} \] where $\bigwig{\sigma}$ is an estimator of the standard deviation of the $Y$ values in the population $\pop{P}$ from which the units of each of $\pop{P}_1$ and $\pop{P}_2$ were randomly and independently drawn. If $\bigwig{\sigma}_1$ and $\bigwig{\sigma}_2$ denote the estimators of the standard deviations of the $Y$ values from each of $\pop{P}_1$ and $\pop{P}_2$ respectively, then the standard pooled estimator of $\sigma$ would be \[ \bigwig{\sigma} = \left(\frac{(n_1 - 1) \bigwig{\sigma}_1^2 + (n_2 - 1) \bigwig{\sigma}_2^2}{(n_1 - 1) + (n_2 - 1) } \right)^{\frac{1}{2}}. \] In this special case, the resulting discrepancy measure will look like the standard “two-sample” Student $t$ statistic used to test the equality of the means of two Gaussian (or “normal”) distributions with common (but unknown) standard deviation $\sigma$ based on a sample of size $n_1$ from the first and of $n_2$ from the second. If the $Y$ values were in fact Gaussian distributed, the discrepancy would follow a Student $t$ distribution on $n_1 + n_2 - 2$ degrees of freedom under the hypothesis that the means were identical.

Note, however, that we are making no such Gaussian assumption. Nevertheless, we could proceed with this discrepancy measure just as we did with the earlier measures. A function that will return this discrepancy measure for any variate var is

###
### The t statistic

getDiscrepancyFn <- function(var) {
  function(pop) {
    ## First sub-population
    pop1 <- pop$pop1
    n1 <- nrow(pop1)
    m1 <- mean(pop1[, var])
    v1 <- var(pop1[, var])
    ## Second sub-population
    pop2 <- pop$pop2
    n2 <- nrow(pop2)
    m2 <- mean(pop2[, var])
    v2 <- var(pop2[, var])
    ## Pool the variances
    v <- ((n1 - 1) * v1 + (n2 - 1) * v2)/(n1 + n2 - 2)
    ## Determine the t-statistic
    t <- (m1 - m2) / sqrt(v * ( (1/n1) + (1/n2) ) )
    ## Return the t-value
    t
  }
}
### Get the t-function for "Length"
tStatLengths <- getDiscrepancyFn("Length")

### The value for the two sub-populations is
tStatLengths(pop)

## [1] 0.3886752

Now generate 5,000 mixtures from our specific sub-populations $\pop{P}_{Australia}$ and $\pop{P}_{USA}$ and plot the histogram as before. Only this time, we will overlay the histogram with the Student $t$ density on $n_1 + n_2 -2$ degrees of freedom which we would use if the Gaussian models applied.

tVals <- sapply(1:5000, FUN = function(...){tStatLengths(mixRandomly(pop))})
xvals <- extendrange(tVals)
xvals <- seq(from = min(xvals), to = max(xvals), length.out = 200)

### We will overlay the histogram with the theoretical t-density
n1 <- nrow(pop$pop1)
n2 <- nrow(pop$pop2)
densityVals <-dt(xvals, df = (n1 +  n2 - 2))
histHeights <- hist(tVals, breaks=20, plot = FALSE)$density
heightRange <- c(0, max(densityVals, histHeights))

### Plot the histogram
hist(tVals, breaks=20, probability = TRUE, 
     ylim = heightRange,
     main = "Permuted populations", xlab="t-statistic",
     col="lightgrey")
abline(v=tStatLengths(pop), col = "red", lwd=2)
### Add the density to the plot
lines(xvals, densityVals, col = "black")
legend("topright", 
       legend=c("Observed t value", "t density"),
       lwd = c(2, 2), col = c("red", "black"))

Remarkably, the Student $t$ density closely approximates the histogram! In many instances, even when no Gaussian distribution is assumed, the Student $t$ distribution will roughly approximate the histogram that arises from randomly mixing the sub-populations. This in fact was one of the early justifications (by R.A. Fisher) for using the $t$ distribution broadly in application; namely that it approximated the randomly mixed distribution.

The significance level observed for this discrepancy measure in this example is 0.698. This is so large that the observed deiscrepancy measure is not at all unusual when the hypothesis is true. This test provides no evidence against the hypothesis.

4.3.3 Multiple testing

In testing any hypothesis, we might consider any number of discrepancy measures, $D_1, D_2, \ldots, D_K$ say, each of which has an associated observed significance level say ${SL}_1, {SL}_2, \ldots, {SL}_K$. Unlike the several discrepancy measures, these significance levels are on a common and interpretable scale, namely that of probability. Each ${SL}_k$ is the probability of observing something at least as unusual as was observed given that the hypothesis was true.

Because each discrepancy measure measures the evidence against the hypothesis in a different way, it makes sense to look at all of the significance levels to assess the evidence against the hypothesis. Since the significance levels are on a common and interpretable scale, we might consider the smallest of these as measuring the combined evidence against the hypothesis. That is, consider \[ {SL}_{min} = \min_{k = 1, \ldots, K} SL_k . \] The smaller is ${SL}_{min}$ the greater is the evidence against the hypothesis.

This is perfectly legitimate, provided one does not now interpret ${SL}_{min}$ as a significance level; it is not. Still, ${SL}_{min}$ is a measure of the evidence against the hypothesis; it is just no longer the probability of observing something as unusual as we did observe assuming that the hypotheis is true. Rather, it is more like a test statistic than a significance level. That is, if we define \[ D^\star = 1 - {SL}_{min}\] then $D^{\star}$ is itself a discrepancy measure (arranged so that large values again indicate evidence against the hypothesis). If its observed value is $d^{\star}$, then the significance that describes this combined evidence is \[ SL = Pr\left( D^{\star} \ge d^{\star} ~\leftgiven~ \mbox{Hypothesis is true} \right. \right) = {SL}^{\star} , ~\mbox{say}, \] which will necessarily be larger than ${SL}_{min}$ ( i.e. ${SL}_{min}$ necessarily exaggerates the evidence against the hypothesis and so is misleading as a significance level).

The hypothesis being tested by the various discrepancy measures here is that the two sub-populations could each have been a random selection, without replacement, from the large population made up of their union. As was done for any of the individual discrepancy measures, the significance level ${SL}^{\star}$ can be determined (and approximated) by a randomly mixing the two sub-populations and making the appropriate calculation.

Some simplification in the notation will help make the logic clear. First, denote the $n_1$ units of $\pop{P}_1$ by an $n_1 \times 1$ vector $\ve{u}$ whose elements are ordered from smallest to largest ( $\ve{u}$ describes a set so using a unique order simply allows the vector to be uniquely identified with that set). Similarly, denote by $\ve{u}$ the ordered $n_2$ elements of $\pop{P}_2$. The $k^{th}$ discrepancy measure is now a function of $\ve{u}$ and $\ve{v}$, or equivalently and more simply as a function of $\ve{u}$ alone (since $\pop{P}_1 \union \pop{P}_2$ always stays the same allowing $\ve{v}$ to always be determinable from $\ve{u}$). Denote this function as $D_k(\ve{u})$ for $k=1, \ldots, K$. The corresponding signifcance levels are also functions of $\ve{u}$ and denoted as ${SL}_k(\ve{u})$ for $k=1, \ldots, K$. Finally, the functions $D^{\star}(\ve{u})$ and ${SL}^{\star}(\ve{u})$ are analogously as \[D^{\star}(\ve{u}) = 1 - \min_{k=1, \ldots, K} {SL}_k(\ve{u}) \] and \[ {SL}^{\star}(\ve{u}) = Pr\left( D^{\star}(\ve{U}) \ge D^{\star}(\ve{u}) ~\leftgiven~ \begin{array}{l} \ve{U}~ \mbox{ is a random sample of size } n_1 \mbox{ selected} \\ \mbox{ without replacement from } ~ \pop{P}_1 \union \pop{P}_2 \end{array} \right. \right) . \] This significance level will need to be determined exactly as with any individual discrepancy measure, from all possible samples (without replacement) of $n_1$ units from $\left\{ 1, 2, \ldots, n_1, (n_1 + 1), \ldots , (n_1 + n_2) \right\}$. Supposing there to be $B$ of these, denote the samples as $\ve{u}_1, \ldots, \ve{u}_B$. (Exercise: What is the value of $B$?) Then the significance level $SL^{\star}(\ve{u})$ is \[ \begin{array}{rcl} SL^{\star}(\ve{u}) &=& \frac{1}{B} \#\{ \ve{u}_b \suchthat D^{\star}(\ve{u}_b) \ge D^{\star}(\ve{u}) \mbox{ for } b= 1, \ldots, B \} \\ &&\\ &=& \frac{1}{B} \#\{ \ve{u}_b \suchthat 1 - {SL}_{min}^{\star}(\ve{u}_b) \ge 1 - {SL}_{min}(\ve{u}) \mbox{ for } b= 1, \ldots, B \} \\ &&\\ &=& \frac{1}{B} \#\{ \ve{u}_b \suchthat {SL}_{min}^{\star}(\ve{u}_b) \le {SL}_{min}(\ve{u}) \mbox{ for } b= 1, \ldots, B \} \\ \end{array} \] which suggests how the correct significant level can be calculated:

Determine the significance level ${SL}_k(\ve{u})$ for each discrepancy measure $D_k(\ve{u})$ for the observed $\ve{u}$ from $\pop{P}_1$ as before. This will involve a loop over all $\ve{u}_i$ for $i=1, \ldots, B$. Find ${SL}_{min}(\ve{u})$.
For $b=1, \ldots, B$, determine ${SL}_{min}(\ve{u}_b)$ for each $\ve{u}_b$, exactly as was done for the observed $\ve{u}$ in step 1.
Determine the significance level \[ SL^{\star}(\ve{u}) = \frac{1}{B} \#\{ \ve{u}_b \suchthat {SL}_{min}^{\star}(\ve{u}_b) \le {SL}_{min}(\ve{u}) \mbox{ for } b= 1, \ldots, B \}.\]

As with the individual measures, it will rarely be practicable to have $B$ be all possible mixtures of the two sub-populations so that a random sample will have to suffice. Indeed, given the double looping that is now occurring, it will be even more impractical to look at all possible mixtures. Instead, $B$ random samples will be used for a large but computationally tractable value of $B$ such as 1,000 to provide an estimate $\widehat{SL}^{\star}(\ve{u})$ of the significance level.

This can be implemented as follows

calculateSLmulti <- function(pop, discrepancies, B_outer = 1000, B_inner){
  if (missing(B_inner)) B_inner <- B_outer
  ## Local function to calculate the significance levels
  ## over the discrepancies and return their minimum
  ## 
  getSLmin <- function(basePop, discrepanies, B) {
    
    ## Determine observed vals assuming this basePop
    observedVals <- sapply(discrepancies, 
                           FUN = function(discrepancy) {discrepancy(basePop)})
    
    ## Number of discrepancy functions
    K <- length(discrepancies)
    
    ## calculate the results over the set of sub-populations
    ## shuffled from the basePop
    total <- Reduce(function(counts, i){
      NewPop <- mixRandomly(basePop)
      ## calculate the discrepancy values (K of them)
      ## and update counts
      Map(function(k) {
        ## Note how the functions are simply called
        Dk <- discrepancies[[k]](NewPop)
        ## Note the <<- used to ensure that counts
        ## is updated in the lexical closure and not
        ## just as a local variable in this function
        if (Dk >= observedVals[k]) counts[k] <<- counts[k] +1 },
        1:K) # end discrepancies
      counts
    }, 
    1:B, init = numeric(length=K)) # end set of shuffled sub-populations
    ## calculate significance levels
    SLs <- total/B
    ## return the minimum
    min(SLs)
  }
  
  ## First step, get results for given population
  SLmin <- getSLmin(pop, discrepancies, B_inner)
  ## Second and third steps
  total <- Reduce(function(count, b){
    basePop <- mixRandomly(pop)
    if (getSLmin(basePop, discrepancies, B_inner) <= SLmin) count + 1 else count
  },
  1:B_outer, init = 0)
  ## Step 3
  SLstar <- total/B_outer
  SLstar
}

This function can now be given a list of discrepancy measures and applied to the sub-populations to get an estimate of the true significance level from taking the minimum of several significance levels. Note that the discrepancy measures are expected to be constructed so that large values indicate greater evidence against the hypothesis (i.e. greater discrepancy between the observed data and the hypothesis).

### Some discrepancy functions
### These must now be such that larger values
### correspond to greater evidence against the hyothesis

getAbsAveDiffsFn <- function(variate) {
  function(pop) {abs(mean(pop$pop1[, variate]) - mean(pop$pop2[,variate]))}
}

discrepancies <- list(getAbsAveDiffsFn("Length"), getSDRatioFn("Length"))

### The following takes a long time (about 20 minutes)
### for B_outer = B_inner = 1,000 say
### So for illustration much smaller values than would be sensible are
### used here
calculateSLmulti(pop, discrepancies, B_outer = 100, B_inner=100)

## [1] 0.68

Exercise: Since whether a value exceeds a cut off or not is a Bernoulli variate, the coefficient of variation of the estimated significance level could be determined as a function of the number of replicates $B$. Explain how this might be used to choose $B$.

4.3.4 An important variation on comparisons

Consider again the population of US counties from the agricultural census. Interest might lie in comparing regions, say northeast versus north central. Each region is a sub-population and we might, for example, consider comparing the two regions on acres92, the number of acres devoted to farms in 1992. This is the same type of comparison as was made with Length and the great white shark encounters in Australia and the US.

Alternatively, interest might lie in comparing acreage to farms between years. Consider, for example, only those counties that are in the northeast (NE) and suppose interest lies in how the number of acres devoted to farms compares between 1982 qnd 1992. While the counties now constitute a single sub-population there still seems to be two sub-populations in play, namely the first being the counties in 1982 and the second the counties in 1992. And indeed there are, provided we take the year to be part of the definition of the two populations (and hence part of the definition of the units in the population).

These two sub-populations might therefore be mixed as various significance tests of their differences be performed just as before. Except there is something clearly wrong with this. For example, having the same county from two different years appear in the same sub-population after mixing does not seem to take advantage of the fact that it is indeed the same county. It would be better to preserve the all counties in each sub-population and instead randomly swap the variate values of a county in 1982 and with those of the same county in 1992. That way, we will have matched the counties appear in the same sub-population seems to Having, for example, the acreage devoted to farming appear for a county in 1982 and for the same county in 1992 appear in the same sub-population seems to

Same population two different variates. Example would be acreage of farms in two different years. The randomization would require pairing.

Exercise: Implement this kind of significance test for any population attribute.

4.4 Random intervals and coverage probabilities

Recall that when we look at $a(\samp{S})$ for all possible samples $\samp{S}$ of some size $n$ from a population $\pop{P}$ that the values of $a(\samp{S})$ have some distribution. For example, the average shark length of samples from the sharks data produced a histogram which was fairly symmetric and centred about the population average $a(\pop{P})$.

hist(avesSamp, col=adjustcolor("grey", alpha = 0.5), freq = FALSE,
     main="Gaussian over histogram of averages (n = 5)",
     xlab="Average shark length (inches)",
     ylim=c(0, 0.022),
     breaks=25
     )
### Mark the population attribute in red
abline(v=avePop, col="red", lty=3, lwd=2)

### Add a Gaussian density
tmpAve <- mean(avesSamp)
tmpSD <- sd(avesSamp)
tmpX <- extendrange(avesSamp)
tmpX <- seq(tmpX[1], tmpX[2], length.out = 200)
lines(tmpX, dnorm(tmpX, mean = tmpAve, sd = tmpSD))

All possible samples: great white encounters in Australia

This can also be the case for many other attributes, but is especially so for those attibutes expressible as (weighted) averages. The latter should be no surprise if one recalls the ``Central Limit Theorem’’ and its consequence that the distribution of sample averages tend to become more like a Gaussian distribution as $n$ increases.

In the above Figure, the Gaussian density is overlaid a histogram of average shark lengths for all 98280 samples of size $n = 5$ from 28 possible shark lengths. To get the right Gaussian, we need the average and the standard deviation of all of the averages. These are mean(avesSamp) = 155.8928571, and sd(avesSamp) = 19.7985268 respectively. The Gaussian density provides a model for the histogram of averages and so might be used in place of it to simplify calculations.

For example, if the Gaussian model holds then we can think of the average shark length for a sample of size $n = 5$ as a random variate $\widebar{Y} \sim G(\mu, \sigma/\sqrt(n))$ where (in this case) $n = 5$, $\mu =$ 155.8928571, and $SD(\widebar{Y}) = \sigma/\sqrt{n}$. This means we can find a constant $c >0$ for any $p \in (0,1)$ such that \[ \begin{array}{rcl} 1 - p &=& Pr \left( -c \le \frac{\widebar{Y} - \mu}{\sigma /\sqrt{n}} \le c \right) \\ &&\\ &=& Pr \left( \left[\widebar{Y} - c \times \frac{\sigma}{\sqrt{n}}, \widebar{Y} + c \times \frac{\sigma}{\sqrt{n}} \right] \ni \mu \right). \end{array} \] That is, the random interval $\left[\widebar{Y} - c \times \frac{\sigma}{\sqrt{n}}, \widebar{Y} + c \times \frac{\sigma}{\sqrt{n}} \right]$ contains $\mu$ with probability $1 - p$. Intervals generated according to this random interval generating mechnism will contain $\mu$, $100(1-p)\%$ of the time. The probablity $1-p$ is therefore called the coverage probability of these random intervals. Notice that these particular intervals all have the same width, just different (and random) centres.

Note that the Gaussian is symmetric about its mean so that $p$ and $c$ are related through \[(1 - p) + p/2 = 1 - p/2 = Pr(Z \le c) \] where $Z \sim G(0, 1)$ is a standard Gaussian random variate. This allows us to determine $p$ (and hence $(1-p)$) for any $c$. In R pnorm(c) will give $Pr(Z \le c)$ for any constant c. Similarly, for any $p$ the value of $c$ can be had from the quantile function of a standard Gaussian random variate in that \[ c = Q_Z\left(1 - \frac{p}{2}\right)\] which in R is calculated as qnorm(1 - p/2). For example $c \approx 1.96$ when $1 - p = 0.95$ for a standard Gaussian random variate.

In practice, we will have only one sample, and hence the single numerical average $\widebar{y}$ for that sample. Which means that we will have only one of these randomly generated intervals, namely \[\left[\widebar{y} - c \times \frac{\sigma}{\sqrt{n}}, \widebar{y} + c \times \frac{\sigma}{\sqrt{n}} \right].\] Because we know, according to the Gaussian model, that $100(1-p)\%$ of the intervals generated in this way will contain $\mu$ we have some confidence that this particular interval will as well. The larger is $1 - p$, the more confident we are that the interval will contain $\mu$. Note that the probability statement is attached to the method used to generate the intervals and not to the particular interval in hand. The one in hand is therefore called a $100(1-p)\%$ confidence interval and not a probability interval.

Knowing that $\mu =$ 155.8928571 and sd(avesSamp) = 19.7985268 we could generate intervals by randomly selecting 100 samples of size $n$, calculate the corresponding $\widebar{y}$ and form the interval for any level of confidence measured by the coverage probability $(1 - p)$. Approximately $100(1-p)\%$ of the intervals should contain (or cover) $\mu =$ 155.8928571.

The Figure below shows 100 such intervals when $1 - p = 0.95$.

100 confidence intervals

Fully 93 of these 100 intervals cover the value $\mu$. Those which do not cover $\mu$ (the red dashed line in the centre) are marked with a star and the $\widebar{y}$ value and the sample to which it corresponds marked by a red diamond and solid vertical line, respectively. Those samples which produce confidence intervals that do not contain $\mu$ are taken from either tail of the distribution of possible $\widebar{y}$ values and have low probability of being selected.

Note: Any particular interval (or set of 100 intervals) generated in this way may or may not contain the value $\mu$. We only know that $100(1-p)\%$ of the intervals constructed in this way will contain the value $\mu$. There is no probability attached with any particular interval, just a confidence level.

Exercise: What confidence interval would give us $100\%$ confidence? What would give us $0\%$?

Exercise: Give a mathematical expression for the width of the confidence interval. Plot this as a function of $(1-p)$. Comment on your observations.

Note: Suppose that for some scientific purpose, the value $\mu = a$ is of interest and we would like to assess the plausibility of this particular value – that is, we would like to test the hypothesis $H: \mu = a$. Confidence intervals can be used to construct an appropriate test. The reasoning is as follows. We know that only a small proportion, $p$, of the $100(1-p)\%$ random intervals will not contain the true value of $\mu$. If our $100(1-p)\%$ confidence interval does not contain $a$ then we have reason to suspect that the hypothesis does not hold. Somehow $(1-p)$ or equivalently $p$, measures this suspicion. Whatever the value of $a$, there will be some confidence level $(1-p)$ such that $a$ is an end point of the confidence interval. The value of $p$ defining this confidence interval is called, unimaginatively, the p-value of this test. The smaller is $p$, the larger $(1-p)$ had to be for the $100(1-p)\%$ confidence interval to contain the hypothesized value $\mu = a$. The smaller $p$ is, the greater is the evidence against the hypothesis $H: \mu = a$. The p-value is an observed level of significance $SL$ and this is a test of significance. (To reinforce that this is a significance test, these notes will only use $SL$ for the observed significance and will not use the term “p-value” for this concept; “p-value” is common usage by others.)

Exercise: Write down a formal discrepancy measure that corresponds to using confidence intervals in this way to test the hypothesis $H: \mu = a$. Derive the formula for the observed significance level $SL$ and show that its value is $p$ for some $100(1-p)\%$ confidence interval for $\mu$.

4.4.1 Student $t$ based intervals

The confidence intervals just calculated presupposed that we actually knew the value of $SD(\widebar{Y})$. This will not be so in general. However, for many sample attributes $a(\samp{S})$ (e.g. Horvitz-Thompson estimators), we have an estimate of the standard deviation, $SD(\bigwig{a}(\samp{S}))$, of its corresponding estimator $\bigwig{a}(\samp{S})$. Using this estimated $SD$ in place of the known $SD$ will introduce greater variability into any random interval.

In particular, suppose we have a sample attribute $a(\samp{S})$ and its corresponding estimat${\bf or}$ $\bigwig{a}(\samp{S})$. Similarly, let $\bigwig{SD}(\bigwig{a}(\samp{S}))$ be the estimator corresponding the the estimated standard deviation $\widehat{SD}(\bigwig{a}(\samp{S}))$. Then we might construct intervals in much the same way as before, choosing $p \in (0,1)$ and a corresponding $c > 0$ with \[ \begin{array}{rcl} p &=& Pr \left( -c \le \frac{\bigwig{a}(\samp{S}) - a(\pop{P})}{\bigwig{SD}(\bigwig{a}(\samp{S}))} \le c \right) \\ &&\\ &=& Pr \left( \left[\bigwig{a}(\samp{S}) - c \times \bigwig{SD}(\bigwig{a}(\samp{S})), \bigwig{a}(\samp{S}) + c \times \bigwig{SD}(\bigwig{a}(\samp{S})) \right] \ni \mu \right). \end{array} \] Note that this random interval now has a random centre and random length.

Perhaps the simplest set of intervals are again for the sample average $a(\samp{S}) = \widebar{y}$ and its corresponding estimator $\widetilde{a}(\samp{S}) = \widebar{Y}$. If $\sigma$ denotes the standard deviation of the $y_u$s for $u \in \pop{P}$, then $SD(\widebar{Y}) = \sigma/n$. An unbiased estimate of $\sigma^2$ is $\widehat{\sigma}^2 = \sum_{u \in \samp{S}} (y_u - \widebar{y})^2/(n-1)$; the corresponding estimator of $\sigma$ is \[ \bigwig{\sigma} = \sqrt{\frac{\sum_{u \in \samp{S}}(Y_u - \widebar{Y})^2}{n - 1}}.\] The random intervals constructed from these estimators resemble those constructed from the Gaussian model: \[ \begin{array}{rcl} p &=& Pr \left( \left[\widebar{Y} - c \times \frac{\bigwig{\sigma}}{\sqrt{n}}, \widebar{Y} + c \times \frac{\bigwig{\sigma}}{\sqrt{n}} \right] \ni \mu \right). \end{array} \] except that $\sigma$ is replaced by $\bigwig{\sigma}$, thus making the ends (and hence width) of the intervals random as well as their centres. Of course the value of $c$ for any $p$ will now also be different from that determined before since the quantity \[ \frac{\widebar{Y} - \mu}{\bigwig{\sigma}/n} \] is no longer a standard Gaussian random variate. Nevertheless, under the Gaussian model, this ratio has a known distribution, namely the Student~$t_{n-1}$ distribution. This means that this ration is still a pivotal statistic in that it is a function of the unknown parameter $\mu$ and the sample values $Y_u$ (for $u \in \samp{S}$) whose sampling distribution is completely known.

Now the values of $p$ and $c$ are determine using the $t$ distribution on $n$ degrees of freedom. For any $c$, $p$ is found in R as $pt(c, df = n-1)$. And for any $p$ the value of $c$ can be had from the quantile function of a Student $t$ random variate in R as qt((p+1)/2, df = n-1). For example $c \approx 2.78$ when $p = 0.95$ for a $t_4$ random variate.

We can produce confidence intervals as \[\left[\widebar{y} - c \times \frac{\widehat{\sigma}}{\sqrt{n}}, \widebar{y} + c \times \frac{\widehat{\sigma}}{\sqrt{n}} \right].\] Note that this interval is a function of variate values on a single sample $\samp{S}$ and so is itself an attribute of the sample, just an interval valued one. As with any other sample attribute, some sense of the sampling distribution of its values can be had by generating a large number of samples and evaluating the attribute on each one. Now, rather than a histogram, we are interested in how many such intervals contain the population value $\mu$ – about $100(1-p)\%$ of the intervals should.

The Figure below shows 100 such intervals when $p = 0.95$ and $n=5$ as before.

100 confidence intervals

93 of these 100 intervals cover the value $\mu$. Again, those which do not cover $\mu$ (the red dashed line in the centre) are marked with a star and the $\widebar{y}$ value and the sample to which it corresponds marked by a red diamond and solid vertical line, respectively. These intervals are typically wider than those of the Gaussian but are of random length due to the different estimates of the $SD$ from each sample.

Exercise: plot the pairs $(\widebar{y}_i, \widebar{\sigma}_i)$ (the data are available as avesSamp and sdsSampabove). Comment on what you observe about the relationship between these two estimates over all possible samples.

Exercise: This exercise continues the one above by investigating the relationship between sample averages and sample variances (or standard deviations, their square root). As above, the study population will be $\pop{P}_{Australia}$ (great white shark encounters that occurred in Australian waters) whose values $u$ are stored as the variable popSharksAustralia .

From these, randomly generate (without replacement) a single sample, $\samp{S}_{common}$ of size 4 ($= n - 1$). This set will be common to all samples considered in this exercise. Let $\samp{S}_{rest} = \pop{P}_{Australia} - \samp{S}_{common}$ denote the set difference containing the rest of the population. For every $u \in \samp{S}_{rest}$, calculate the average (mean(...)) and standard deviation (sd(...)) of the shark length over the $\samp{S} = \samp{S}_{common} \union \left\{u\right\}$. Plot the results as plot(aves, sds, xlim=range(avesSamp), ylim=range(sdsSamp)) Repeat this for several different choices of $\samp{S}_{common}$.
Now, take $\samp{S}_{common}$ to consist of any single unit $u \in \pop{P}$ (in general) and denote the value $y_u$ by $y$ for $u \in \samp{S}_{common}$ (just to simplify notation). Let $t = \sum_{u \in \samp{S}_{rest}} y_u$ and $q = \sum_{u \in \samp{S}_{rest}} y_u^2$ denote the total and total squared values of $y_u$ for $u \in \samp{S}_{rest}$ respectively. For sample size $n$, derive a mathematical formula for the estimated standard deviation \[ \widehat{\sigma} = \sqrt{\sum_{u \in \samp{S}} \frac{(y_u - \widebar{y})^2}{n-1}} \] as a function $f(t,q)$ for fixed $y$ and $n$. Describe $f(t,q)$ as a function of $t$ for a fixed $q$ (N.B. $[f(t,q)]^2$ provides the same information).
Use your findings to describe the structure seen in the plot of the pairs $(\widebar{y}_i, \widebar{\sigma}_i)$ produced in the previous exercise.

Intervals like these can be constructed for any ratio \[ \frac{\bigwig{a}(\samp{S}) - a(\pop{P})}{\bigwig{SD}(\bigwig{a}(\samp{S}))} \] provided it is a pivotal function or even approximately pivotal. Key to this of course is that the sampling distribution is known or approximately known. (Note: the word “pivotal” is meant to be suggestive that with this measure it is possible to “pivot” from information about the sample attribute to inference about the population attribute.) When conducting significance tests on comparing sub-population, we observed that $t$-like discrepancy measures sometimes appeared to be approximately distributed as a Student $t$ random variate. This also sometimes holds for the above ratios as well, especially when the $a(\samp{S})$ is an average.

4.5 Resampling

As previous sections have shown, understanding the sampling behaviour of sample attributes is essential to making inferences about any population attribute. This is true in the case of a discrepancy measure whose sampling distribution allows us to test hypotheses and is true about a pivotal quantity whose sampling distribution allows us to construct confidence intervals.

So far, however, we have been able to undertake repeated sampling from the population ourselves, or mathematically through distributional model that approximates that sampling distribution. In practice, however, the population from which our sample was taken cannot be repeatedly sampled for our purposes, we have only one sample. Or, there may be a technical or scientific concern about the mathematical model for the sampling distribution which could render suspect any inferences which relied on the correctness of the model.

One possibility is to mimic the sampling procedure and hence inferences by drawing samples from the sample $\samp{S}$ in hand as if it were the population $\pop{P}$ from which it was drawn.

Recall that a $\samp{S}$ of size $n$ has been drawn from a study population $\pop{P}$ according to some sampling mechanism and the sample attribute $a(\samp{S})$ used, for example, to estimate its population counterpart $a(\pop{P})$. To understand the sampling distribution of any attribute $a(\samp{S})$, we draw $m$ samples $\samp{S}_1, \ldots, \samp{S}_m$ independently using the same mechanism from $\pop{P}$ and use the values $a(\samp{S}_1), \ldots, a(\samp{S}_m)$ to inform us about the sampling distribution of $a(\samp{S})$.

We propose to mimic this process by drawing a $B$ samples $\samp{S}^{\star}_1, \ldots, \samp{S}^{\star}_B$ of size $n$ independently from a population $\pop{P}^\star$. (Note, typically $B=m$ but we use $B$ here to distinguish that the number of samples here could be different.) Ideally, $\pop{P}^\star$ will be the study population $\pop{P}$, and the sampling mechanism to select each $\samp{S}_i^{\star}$ will be identical to that used to select $\samp{S}$ from $\pop{P}$.

Since the only data available is that associated with the sample $\samp{S}$, we take $\pop{P}^{\star} = \samp{S}$. Now this population has only $n$ units, so samples of size $n$ using any without replacement sampling mechanism will immediately exhaust the population. To get around this, simply replicate all units in $\pop{P}^{\star}$ many times over so that this cannot happen. If each unit is replicated $k$ times, say, then the resulting size of $\pop{P}^\star$ would be $N^\star = k \times n$. (Note replication invariant attributes preserve their value $a(\pop{P}^{\star})$ on this replicated population.) Alternatively, we could restrict the mechanism to sampling with replacement only (effectively equivalent to having $k \rightarrow \infty$). Either way, there would have to be repeated units $u$ in any sample $\samp{S}_i^{\star}$ of size $n$ drawn from $\pop{P}^{\star}$. Independently drawing $m$ samples $\samp{S}^{\star}_1, \ldots, \samp{S}^{\star}_m$ of size $n$ from $\pop{P}^\star$, the sample attribute values $a(\samp{S}_1^\star), \ldots, a(\samp{S}_B^\star)$ now provide the information on the sampling distribution of interest.

This approach to mimicking the sampling distribution was named the bootstrap method when it was first proposed in 1977 by Brad Efron (B. Efron 1979). The word “bootstrap” conveys the notion of starting something up from nothing as in “pulling oneself over a fence by one’s bootstraps”. The image is of lifting yourself off the groud by grabbing your boots (via attached straps) and lifting them (and yourself) into the air. It suggests something for nothing, or something impossible to achieve, and is also the source of the more familiar modern phrase" booting a computer" (and “rebooting” a crashed one) in which the very complex computer system starts from a relatively simple initial program as the first in a chain of increasingly complex programs, each creating the next complex link in the chain until the operating system is in place and running.

The boostrap method effects a similar chaining of the inductive path. as shown in the Figure below. Bootstrap path of Inductive inference
The usual inductive path $\samp{S} \rightarrow \pop{P} \rightarrow \pop{P}_{Target}$ remains with all of its attendant sources of potential error. To this the bootstrap method adds the link $\samp{S}^\star \rightarrow \pop{P}^\star \rightarrow \pop{P}$ on the righthand side of the Figure. The bootstrap addition $\pop{P}^\star$ is not connected by a dashed arrow as is the study population $\pop{P}$ to the target population $\pop{P}_{Target}$ but rather by a solid arrow. This is because $\pop{P}^\star$ is $\samp{S}$ and the $\samp{S}$ is a sample selected from $\pop{P}$ according to some random sampling mechanism. We know, for example, that as $n \rightarrow N$, that $\pop{P}^\star \rightarrow \pop{P}$ and so any error introduced by the bootstrap method will decrease as $n \rightarrow N$.

4.5.1 Bootstrap distributions

For a concrete example suppose we consider again the average shark lengths for encounters in Australian waters where the sample size is $n=5$. One would not expect the bootstrap method to work so well when $n$ is so small. Recall however, that the population has only $N=28$ units so the sample size is a little under $20\%$ of the population.

There will only be one sample $\samp{S}$ drawn from $\pop{P}_{Australia}$ using simple random sampling without replacement. The $B$ bootstrap samples will be drawn from this one sample.

### First a helper function to calculate the size of the population
popSize <- function(pop) {
  if (is.vector(pop))
  {if (is.logical(pop)) 
    ## then assume TRUE values identify units
    sum(pop) else length(pop)}
  else nrow(pop)
}
### The function getSample will return a sample
### from the pop
getSample <- function(pop, size, replace=FALSE) {
  N <- popSize(pop)
  pop[sample(1:N, size, replace = replace)]
}

### Get the sample S
set.seed(1435345)
n <- 5
S <- getSample(popSharksAustralia, n, replace = FALSE)

### The sample unit labels are in S
S

## [1] "6"  "34" "20" "58" "59"

This sample will be the bootstrap population $\pop{P}^\star = \samp{S}$. There are only $n^n =$ 3125 possible (with replacement) bootstrap samples of size $n = 5$ to select. Suppose we choose $B=1000$ bootstrap samples $S_1^\star, \ldots, S_{1000}^\star$:

Pstar <- S
B <- 1000
Sstar <- sapply(1:B, FUN =function(b)  getSample(Pstar, n, replace = TRUE))
### Every bootstrap sample will contain repeated units from Pstar
### For example, the first boostrap sample, Sstar_1, is
Sstar[,1]

## [1] "59" "34" "6"  "58" "59"

The samples $\samp{S}_1^\star, \ldots, \samp{S}_B^\star$ appear as columns of Sstar. We can now use these bootstrap samples and compute whichever attribute might be of interest in each.

For example, we might compute the averages of the shark length in each sample.

avesBootSamp <- sapply(1:B, FUN = function(i) mean(sharks[Sstar[,i], "Length"]))

The distribution of the attribute over these bootstrap samples $\samp{S}^\star_i$ from $\pop{P}^\star$ is a bootstrap estimate of the distribution of the same attribute over all possible samples $\samp{S}_i$ from $\pop{P}$. In this example, we know both. The histograms of these two are shown below:

Bootstrap versus all sample distibutions: averages

As can been seen, the bootstrap distribution gives a sense of how variable the attribute value can be. For example, to get some idea of the variability we can take the standard deviation of the bootstrap distribution. An estimate of the standard deviation of the sample attribute can be had directly from the bootstrap distribution. For any attribute $a(\pop{P})$, the standard deviation of the corresponding sample attribute can be estimated from the bootstrap distributions as \[\widehat{SD}_{\star}\left(\bigwig{a}(\samp{S}^{\star})\right) = \sqrt{\frac{\sum_{b=1}^B \left( a(\samp{S}_b^{\star}) - \widebar{a}^{\star} \right)^2 } {B-1} }\] where $\widebar{a}^{\star} = \sum_{b=1}^B a(\samp{S}_b^{\star}) /B$ is the average of the attribute on the bootstrap samples $\samp{S}_1^\star, \ldots, \samp{S}_B^\star$. This is the bootstrap estimate of the standard deviation of the sample attribute $a(\samp{S})$.

For the present example, the attribute is the arithmetic average of the shark lengths and the bootstrap estimate of the its standard deviation is \[\widehat{SD}_{\star}(\widebar{Y})= \sqrt{\frac{\sum_{b=1}^B \left( \widebar{y}_b{\star} - \widebar{y}^{\star} \right)^2 } {B-1} }\] where $\widebar{y}^{\star} = \sum_{b=1}^B \widebar{y}_b{\star} /B$. Evaluating this on the bootstrap sample yields

sd(avesBootSamp)

## [1] 11.08546

In the case of the arithmetic average $a(\samp{S}) = \sum_{u\in \samp{S}} y_u / n$, the standard deviation can also be estimated as
\[\widehat{SD}(\widebar{Y}) = \frac{\widehat{\sigma}}{\sqrt{n}}\] where \[\widehat{\sigma} = \sqrt{\frac{\sum_{u \in \samp{S}}\left(y_u - \widebar{y}\right)^2}{n-1}}. \] Evaluating this on the sample $\samp{S}$ ($= \pop{P}^{\star}$) gives

sd(sharks[S, "Length"])/sqrt(5)

## [1] 12.47638

The two values are not that different from one another. The huge advantage that the bootstrap estimate has is that it applies to any sample attribute not just the arithmetic average.

Comparing estimators of the standard deviation of the average

The two histograms are not too dissimilar in shape but note that the estimates of $SD(\widebar{Y})$ based on bootstrap samples seem to be slightly lower in value than those of the standard estimate $\widehat{\sigma}/\sqrt{n}$. This can be seen more clearly from the histogram of the differences between these two estimates over different samples.

Paired differences in SD(Ybar) esimates. The bootstrap estimator of standard deviation has produced smaller values than the standard.

Exercise: Reproduce these plots for all shark encounters. That is, generate 200 samples $\samp{S}$ and from each form both the standard estimate of the standard deviation of $\widebar{Y}$ and a bootstrap estimate (with $B = 1000$) of the standard deviation. Plot the histogram of the 200 values for each estimator (use the same value of xlim in both histograms). On each histogram add a vertical line marking $\sigma/\sqrt{n}$ where $\sigma$ is the value standard deviation of the shark lengths for all encounters in Australian waters. Comment on any similarities or differences between the two histograms.

4.5.2 Bootstrap confidence intervals

The bootstrap distribution provides a proxy for the distribution of any sample attribute $a(\samp{S})$. As such, it might be used to construct (at least approximate) confidence intervals for the unknown population attribute $\pop{P}$.

Noting that confidence intervals for sample averages, for example, have the following structure \[ [ \widebar{y} - c \widehat{SD}(\widebar{Y}), ~~ \widebar{y} + c \widehat{SD}(\widebar{Y}) ] \] we might just construct intervals like this with the constants $c$ chosen from a Gaussian distribution. Rather than use the usual $\widehat{\sigma}/\sqrt{n}$ for $\widehat{SD}(\widebar{Y})$, we might use standard deviation of the bootstrap distribution of $\widebar{Y}$ to produce the standard deviation estimate $\widehat{SD}(\widebar{Y})$. The attraction of this approach, if it works, is that the same approach could be used for any attribute $a(\samp{S})$. This is also one of the earliest proposals for constructing a bootstrap confidence interval for the population average.

The reasoning is that if $\widebar{Y}$ is (approximately) Gaussian, then the above interval would, for example, correspond to a $95\%$ (approximate) confidence interval for $c = 1.96$. We have some assurance from the Central Limit Theorem that this will be approximately correct for $\widebar{Y}$ provided $n$ is large enough. To assess the quality of these intervals, their coverage probability can be determined whenever $a(\pop{P})$ is known.

For example, the average shark length of great whites involved in Australian water encounters was 155.8928571 inches. To get an estimate of the coverage probability in this case, randomly and independently select $100$ different samples $\samp{S}_i$ and, for each, produce a bootstrap distribution of the attribute $\widebar{Y}$ and evaluate its standard deviation. Use this bootstrap standard deviation as $\widehat{SD}(\widebar{Y})$ and construct the interval with $c=1.96$. Ideally, about $95\%$ of the samples $\samp{S}_i$ will produce an interval which contains 155.8928571, the average shark length.

The experiment can be carried out as follows:

### Set a seed to use later as well
OurBootSeed <- 543341241
set.seed(OurBootSeed)

### Confidence interval parameters
confidence <- 0.95
cValue <- qnorm((confidence +1)/2)

### Get samples
numIntervals <- 100
sampleSize <- 5
samps <- sapply(1:numIntervals, 
                FUN = function(i) getSample(popSharksAustralia, sampleSize))

### Get the bootstrap intervals
B <- 1000
intervals <- sapply(1:numIntervals,
                    FUN = function (i) {
                      Pstar <- samps[,i]
                      ybar <- mean(sharks[Pstar, "Length"])
                      ybarStars <- sapply(1:B, 
                                       FUN = function(b) {
                                         Sstar <- getSample(Pstar, sampleSize, 
                                                            replace = TRUE)
                                         ybarStar <- mean(sharks[Sstar, "Length"])
                                         ybarStar
                                         })
                      SDhat <- sd(ybarStars)
                      interval <- c(lower = ybar - cValue * SDhat,
                                    middle = ybar,
                                    upper = ybar + cValue * SDhat)
                      interval
                      
                    })

### Plot the intervals over all of the averages
### 
### Determine the histogram heights, etc.
xlim <- extendrange(intervals)
histInfo <- hist(intervals["middle",],  breaks=25, plot=FALSE)
ylim <- c(0, max(extendrange(histInfo$density)))

### Place the histogram of the ybars from
### the samples
hist(intervals["middle",], col=adjustcolor("grey", alpha = 0.5), freq = FALSE,
     main=paste0(numIntervals, " individual ", round(100 * confidence), "% bootstrap confidence intervals"),
     xlab="Average shark length (inches)",
     ylim=ylim, xlim = xlim,
     breaks=25
)
### Mark the population attribute in red
abline(v=avePop, col="red", lty=3, lwd=2)
numIntervalsMissed <- 0

### Add the intervals and mark those which miss
heights <- seq(diff(ylim)/numIntervals, max(ylim),
               length.out = numIntervals)

for(i in 1:numIntervals) {
lines(intervals[c("lower", "upper"), i], rep(heights[i],2),
      col = "steelblue")
  if (avePop > intervals["upper", i]) {
    points(intervals["lower", i], heights[i], pch=8, cex=1.2, col="red")
    points(intervals["middle", i], heights[i], pch=18, cex=1.2, col="red")
    lines(rep(intervals["middle", i], 2), c(0, heights[i]), col = "red")
    numIntervalsMissed <- numIntervalsMissed + 1
  } else if (avePop < intervals["lower", i]) {
    points(intervals["upper", i], heights[i], pch=8, cex=1.2, col="red")
    points(intervals["middle", i], heights[i], pch=18, cex=1.2, col="red")
    lines(rep(intervals["middle", i], 2), c(0, heights[i]), col = "red")
    numIntervalsMissed <- numIntervalsMissed + 1
  }  
}

As can be seen, only 85 of the 100 bootstrap intervals actually cover the average in the population. This is a much lower coverage proportion than the putative 95 $\%$ suggested by $c = $ 1.96.

Exercise: Using the above results as data, formally test the hypothesis that the true coverage probability of these bootstrap intervals is 0.95. That is, formally test $H:(1-p) =$ 0.95.

This undercoverage is perhaps not unexpected for two reasons. First, the choice of $c$ value is based on a Gaussian model which may not apply. If for example, the $c$ value was based on a Student $t$ distribution then $c >$ 1.96 and the resulting intervals would have been wider. Note however, that previous experiments suggested that even for this small samples size ($n = 5$) that the all possible sample values of $\widebar{y}$ was not too far from a Gaussian distribution. Second, it has already been observed that the bootstrap estimate of the standard deviation seems to often be on the low side. If, for example, the standard estimate of $\widehat{\sigma}/\sqrt{n}$ had been used, the intervals would have been typically wider.

Note: In the above interval calculations the function sd(...) was used which is an implementation of \[\widehat{\sigma} = \sqrt{\frac{\sum_{u \in \samp{S}} \left(y_u - \widebar{y}\right)^2 } {n-1} } \] which has $n-1$ as a divisor. For bootstrap interval calculations, a divisor of $n$ is preferred giving the estimate \[\widehat{\sigma} = \sqrt{\frac{\sum_{u \in \samp{S}} \left(y_u - \widebar{y}\right)^2}{n}}. \] This estimate has the advantage of being replication invariant. Replication invariant estimates are preferred and are often called plug in estimates in the bootstrap literature (e.g. see Efron and Tibshirani (Bradley Efron and Tibshirani 1994)). When $n$ is reasonably large, there will be little practical difference between the two. For this reason, and to make the code above (and below) more readable, we will use the first estimate rather than the second.

4.5.2.1 Bootstrap-$t$ confidence intervals

Previous sampling experiments on this population also showed that \[ Z = \frac{\bigwig{a}(\samp{S}) - a(\pop{P})}{\bigwig{SD}(\bigwig{a}(\samp{S}))} \] was approximately pivotal for $a(\samp{S}) = \widebar{y}$, its histogram (over all possible samples) being fairly well approximated by a $t$-density. This suggests that rather than looking up the $c$-values from a Gaussian distribution, it might be better to use a $t$-distribution on $n-1$ degrees of freedom. Of course better still would be to use its actual distribution to determine the $c$ value.

Just as a $t$ approximates the actual distribution, so too could a bootstrap distribution of this potential pivotal. The $t$ approximation is much less general than the bootstrap distribution approximation, requiring constraints on $a(\samp{S})$ for it to reasonably hold (e.g. $\bigwig{a}(\samp{S})$ is approximately Gaussian over all possible samples). The great advantage of the bootstrap distribution is that it automatically adjusts its shape (and hence quantiles, etc.) to the form of $a(\samp{S})$.

For our example, the bootstrap distribution of the attribute $Z$ can be had from the bootstrap samples drawn from our original sample $\samp{S}$ as

ZBoot <- (avesBootSamp - mean(sharks[Pstar,"Length"]))/sd(avesBootSamp)

and compared to the corresponding distribution over all possible samples.

ZPop <- (avesSamp - mean(sharks[popSharksAustralia,"Length"]))/sd(avesSamp)

Together these look like

We would like to use the bootstrap distribution of $T$ to find values $c_{lower}$ and $c_{upper}$ such that \[ Pr(c_{lower} \le Z \le c_{upper} ) = (1-p) \] with $(1-p)$ being the intended coverage probability.

The bootstrap-$t$ method is as follows. For a given sample $\samp{S}$:

Calculate $a(\samp{S})$ and $\widehat{SD}\left(a(\samp{S})\right)$.
Set $\pop{P}^{\star} = \samp{S}$.
Generate $B$ bootstrap samples $\samp{S}_1^\star, \ldots, \samp{S}_B^\star$ from $\pop{P}^{\star}$ (with replacement).
For the $b$th bootstrap sample:
1. Calculate $a(\samp{S}_b^{\star})$ and $\widehat{SD}\left(a(\samp{S}_b^{\star})\right)$.
2. Determine \[ z_b = \frac{a(\samp{S}_b^{\star}) - a(\samp{S})} {\widehat{SD}\left(a(\samp{S}_b^{\star})\right)}. \]
From the values $z_1, \ldots, z_B$ find constants $c_{lower}$ and $c_{upper}$ such that \[ \frac{\left\{\# z_b \le c_{lower} \right\}}{B} = \frac{p}{2} ~~ \mbox{and}\] \[ \frac{\left\{\# z_b \le c_{upper} \right\}}{B} = 1 -\frac{p}{2}.\]
The $100(1-p)\%$ bootstrap-t confidence interval is \[ \left[ a(\samp{S}) - c_{upper} \times \widehat{SD}\left(a(\samp{S})\right), a(\samp{S}) - c_{lower} \times \widehat{SD}\left(a(\samp{S})\right) \right]. \] (Note the signs and positions of the constants $c_{lower}$ and $c_{upper}$ in the interval defiition.)

We can check the coverage probability of these intervals just as before. In this case, $a(\samp{S}) = \widebar{y}$ is the sample average and so $\widehat{SD}\left(a(\samp{S})\right) = \widehat{\sigma}/\sqrt{n}$ can be calculated from the sample values $y_u$ for $u \in \samp{S}$. This simplifies things considerably.

The code for the experiment is as follows:

### Use the same seed so before
set.seed(OurBootSeed)
### Confidence interval parameters and samps are calculated earlier
### Get the bootstrap intervals
B <- 1000
intervals <- sapply(1:numIntervals,
                    FUN = function (i) {
                      Pstar <- samps[,i]
                      aPstar <- mean(sharks[Pstar, "Length"])
                      SDhat <- sd(sharks[Pstar, "Length"])/sqrt(sampleSize)
                      ## get bootstrap z values
                      zVals <- sapply(
                        1:B, 
                        FUN = function(b) {
                          Sstar <- getSample(Pstar, sampleSize, 
                                             replace = TRUE)
                          aSstar <- mean(sharks[Sstar, "Length"])
                          SD_aSstar <- sd(sharks[Sstar,
                                                 "Length"])/sqrt(sampleSize)
                          z <- (aSstar - aPstar)/SD_aSstar
                          z
                        })
                      ## Now use these zVals to get the lower and upper
                      ## c values.
                      cValues <- quantile(zVals,
                                          probs = c( (1 - confidence)/2,
                                                     (confidence +1)/2),
                                          na.rm = TRUE)
                      cLower <- min(cValues)
                      cUpper <- max(cValues)
                       interval <- c(lower = aPstar - cUpper * SDhat,
                                    middle = aPstar,
                                    upper = aPstar - cLower  * SDhat)
                      interval
                    })

Now 93 of the 100 bootstrap intervals cover the average in the population. This is an improvement on those seen earlier and arises because these intervals will be wider (due to the bootstrap estimates of $c_{lower}$ and $c_{upper}$). Note that in this particular instance, many of these intervals could be made shorter simply from the scientific context. For example, great white sharks will never be less than $0$ inches in length so that no interval should have a negative lower value. Being impossible physically, left-truncating an interval at $0$ will have no effect on whether that interval covers the true average shark length ot not.

Exercise: Using the above results as data, formally test the hypothesis that the true coverage probability of these bootstrap intervals is 0.95. That is, formally test $H:(1-p) =$ 0.95.

In the case just considered $a(\samp{S}) = \widebar{y}$ and we have an analytic form for its standard deviation, namely $SD(\widebar{Y}) = \sigma/\sqrt{n}$. Replacing $\sigma$ by $\widehat{\sigma}$ gives an estimate $\widehat{SD}(\widebar{Y})$ based on the sample values $y_u$ for $u \in \samp{S}$. However whenever, as is often the case in practice, an analytic solution is not available for $\widehat{SD}(a(\samp{S}))$ in step 1 of the bootstrap-$t$ method, an estimate can be had from the standard deviation of the bootstrap values $a(\samp{S}_1^\star), \ldots, a(\samp{S}_B^\star)$.

Similarly, at step 4a, an estimate $\widehat{SD}(a(\samp{S}_b^\star))$ has to be determined for each bootstrap sample $\samp{S}_b^{\star}$. Again, this is straightforward for $a(\samp{S}) = \widebar{y}$ because an analytic form is available but is not so for general $a(\samp{S})$.

A general solution is to appeal again to the bootstrap method. For each bootstrap sample $\samp{S}_b^{\star}$ we calculate a bootstrap estimate of the standard deviation – a bootstrap within a bootstrap. Just as with the original bootstrap, this additional bootstrap adds another link in the chain of the inductive path as shown below.

Path of Inductive inference: bootstrap within a bootstrap

In step 4a of the bootstrap-$t$ method, then, we generate $D$ samples $\samp{S}_1^{\star\star}, \ldots, \samp{S}_{D}^{\star\star}$, each with replacement from a population now defined as $\pop{P}^{\star\star} = \samp{S}_b^\star$. The standard deviation of the corresponding values $a(\samp{S}_1^{\star\star}), \ldots, a(\samp{S}_{D}^{\star\star})$ will provide the estimate $\widehat{SD}(a(\samp{S}_b^\star))$.

While this provides bootstrap-$t$ confidence intervals for any attribute $a(\pop{P})$, it does so at the computational cost of another bootstrap within every original bootstrap replicate.

A fairly general function can now be written to calculate a bootstrap-$t$ confidence interval for any attribute $a(\samp{S})$.

bootstrap_t_interval <- function(S, a,
                                  confidence = 0.95,  
                                  B = 1000, D = 30){
  ## Here S is the sample, a is a scalar-valued function a(S) of a sample S 
  ## which returns the value for S of that attribute of interest
  ## confidence is the level of confidence
  ## B is the outer bootstrap count of replicates used to
  ## calculate the lower and upper limits
  ## D the inner bootstrap count of replicates used to
  ## estimate the standard deviation of the sample attribute
  ## for each (outer) bootstrap sample
  Pstar <- S
  aPstar <- a(Pstar)
  ## get (outer) bootstrap values
  bVals <- sapply(
    1:B, 
    FUN = function(b) {
      Sstar <- getSample(Pstar, sampleSize, replace = TRUE)
      aSstar <- a(Sstar)
      ## get (inner) bootstrap values to
      ## estimate the SD
      Pstarstar <- Sstar
      SD_aSstar <- sd(
        sapply(1:D,
               FUN = function(d){
                 Sstarstar <- getSample(Pstarstar, sampleSize, replace = TRUE)
                 ## return the attribute value
                 a(Sstarstar)
               }
        )
      )
      z <- (aSstar - aPstar)/SD_aSstar
      ## Return the two values
      c(aSstar = aSstar, z = z)
    }) 
  SDhat <- sd(bVals["aSstar",])
  zVals <- bVals["z",]
  ## Now use these zVals to get the lower and upper
  ## c values.
  cValues <- quantile(zVals,
                      probs = c((1 - confidence)/2,  (confidence +1)/2),
                      na.rm = TRUE)
  cLower <- min(cValues)
  cUpper <- max(cValues)
  interval <- c(lower = aPstar - cUpper * SDhat,
                middle = aPstar,
                upper = aPstar - cLower  * SDhat)
  interval
}

Exercise: Rewrite the above function to use Map(...) rather than sapply(...).

The bootstrap_t_interval function requires another function a which will calculate $a(\samp{S})$ for any $\samp{S} \subset \pop{P}$ as an argument. In the case of the average shark length, this can be written as

### For average shark length
### the attribute function a(S) is defined as
a <- function(S) {mean(sharks[S, "Length"])}

A single bootstrap-$t$ $95\%$ confidence interval is had as

S <- getSample(popSharksAustralia, size = 5, replace = TRUE)

bootstrap_t_interval(S, a)

##   lower  middle   upper 
## 111.621 169.600 218.909

As before, we can conduct an experiment to assess the coverage probability for these intervals

confidence <- 0.95
B <- 1000
### Use the same seed so before
set.seed(OurBootSeed)
### Get the bootstrap intervals
intervals <- sapply(1:numIntervals,
                    FUN = function (i) {
                      ## Use the same samples as before, namely `samps`
                      bootstrap_t_interval(samps[,i], a = a, 
                                           confidence = confidence,
                                           B=B, D=30)
                    }
                    )

This experiment takes a fair time to

General bootstrap-t intervals

As luck would have it, exactly 95 of the 100 bootstrap intervals cover the population attribute.

Using either $SD(a(\samp{S})))$ estimate, the bootstrap-$t$ intervals perform better than the simple Gaussian based boostrap intervals. According to (Bradley Efron and Tibshirani 1994, 161–62) the bootstrap-$t$ intervals are best suited to attributes which measure “location”, attributes like the average, the median, a particular quantile, et cetera.

Exercise: By appropriately defining a function a(...), conduct an experiment to examine the coverage probabilities for the bootstrap-$t$ intervals for the correlation coefficient $a(\samp{S}) = r$ with \[ r = \frac{\sum_{u \in \samp{S}} (x_u - \widebar{x})(y_u - \widebar{y})} {\sqrt{\sum_{u \in \samp{S}}(x_u - \widebar{x})^2} \times \sqrt{\sum_{u \in \samp{S}}(y_u - \widebar{y})^2} } \] where $\pop{P}$ is the US Census of Agriculture with $x$ and $y$ denoting the variates acres82 and acres92, respectively. In this experiment, fix the sample size at $n = 50$.

For attributes that do not measure location, it may be necessary to first transform the attribute to a scale that produces bootstrap-$t$ that have good coverage probabilities. This too can be automated as described for example by Algorithm 12.1 of (Bradley Efron and Tibshirani 1994).

4.5.2.2 Percentile method

So far all of the bootstrap confidence intervals for $a(\pop{P})$ have been based on intevals inspired by the familiar structure \[ \left[ a(\samp{S}) - c_{upper} \times \widehat{SD}(a(\samp{S})),~ a(\samp{S}) - c_{lower} \times \widehat{SD}(a(\samp{S})) \right] \] where $c_{lower}$ and $c_{upper}$ are constants associated with the confidence level $100(1-p)\%$ through an (approximate) probability statement \[ (1 - p) = Pr\left(c_{lower} \le Z \le c_{upper} \right) \] for some (approximate) pivotal $Z$. The constants $c_{lower}$ and $c_{upper}$ are typically taken to be the $p/2$ and $1 - p/2$ quantiles of the distribution of $Z$ so that the random intervals cover the population value $a(\pop{P})$ with probability $(1-p)$.

However, had we not had some familiarity with confidence intervals constructed in this way, it might be more natural to go directly to the distribution of all possible values $a(\samp{S})$ itself. We saw that when exploring all possible samples for a variety of attributes $a(\pop{P})$ the distribution of $a(\samp{S})$ often covered the value $a(\pop{P})$.

This suggests that if we had the values $a(\samp{S}_1), \ldots, a(\samp{S}_{N_{\samp{S}}})$ for all $N_{\samp{S}}$ possible samples $\samp{S}_1, \ldots, \samp{S}_{N_{\samp{S}}}$ that we might choose $a_{lower}$ and $a_{upper}$ to be the smallest values of the sample attribute such that \[ \frac{p}{2} = \frac{\# \{a(\samp{S}_i) \le a_{lower} \}} {N_{\samp{S}}} \] and \[ 1 - \frac{p}{2} = \frac{\# \{a(\samp{S}_i) \le a_{upper} \}} {N_{\samp{S}}}. \] The interval $[a_{lower}, a_{upper}]$ covers the value $a(\pop{P})$ for many attributes. A notable exception would be an attribute like the range $a(\pop{P}) = y_{max} - y_{\min}$. In this case, for all samples of size $n$ from a population of size $N$, (assuming no ties on the $y_{min}$ and $y_{max}$ values) fully \[ 1 - \frac{{{N-2}\choose{n-2}}}{{N \choose n}} \] of the samples will have $a(\samp{S}) < a(\pop{P})$. For $N=28$ and $n=5$, this is about 0.974, implying that a $95\%$ interval constructed as above, would not contain $a(\pop{P}) = y_{max} - y_{min}$.

Exercise: What value, in general, must $(1-p)$ take for the corresponding interval $[a_{lower}, a_{upper}]$ to contain the range $a(\pop{P}) = y_{max} - y_{\min}$?

One important feature of this approach is that the interval will be equivariant to any one to one transformation of the attribute, say $T(a(\pop{P}))$. That is, the corresponding interval for $T(a(\pop{P}))$ is simply $[T(a_{lower}), T(a_{upper})]$ ! So, we only need to determine the values $a_{lower}$ and $a_{upper}$ once for $a(\pop{P})$ and we have them for any $T(a(\pop{P}))$.

Exercise: The intervals as described have equal probability $p/2$ in each tail. Suppose instead we had $p_{lower}$ in the lower tail and $p_{upper}$ in the higher tail, with $p_{lower} + p_{upper} = p$. Explain why the corresponding intervals would no longer be equivariant to all one to one transformations $T(a(\pop{P}))$. What restriction, if any, must be placed on $T(\cdots)$ for the equivariance to be reinstated?

4.5.2.3 Corrected bootstrap confidence intervals

BCa and ABC intervals.

4.5.3 Bootstrap hypothesis test

5 Prediction

Oftentimes interest lies in predicting the value of a variate (the response variate) given the value of some explanatory variates. We build a response model that encodes how that prediction is to be carried out. Response models separate variates into two groups, those that are explanatory and those that are response. The values of the explanatory variates are used to explain or predict the values of the response. To predict, we use our observed data to construct a function, $\mu(\ve{x})$, which can be used to predict $y$ at any given value $\ve{x}$.

For example, one might want to predict the number of acres in some county devoted to farms in a given year based only on the number of acres being farmed in the same county 5, or 10, years prior. Using the US Census of Agriculture data, the response variate $y$ might then be acres92 and the explanatory variate vector $\tr{\ve{x}}=(x, z)$ where $x$ is acres87 and $z$ acres82. In this case the predictor function could be as simple as \[\mu(\ve{x}) = \mu(x,z) = \alpha + \beta x + \gamma z \] for some unknown values of the parameters $\alpha$, $\beta$, and $\gamma$. The predictor function will be estimated as \[ \widehat{\mu}(\ve{x}) = \widehat{\mu}(x,z) = \widehat{\alpha} + \widehat{\beta} x + \widehat{\gamma} z \] typically by least-squares based on some sample $\samp{S}$ of observed values of $\ve{x}_u, y_u$ for all $u \in \samp{S}$. For any new value $\ve{x}_{New}$, we would predict the value of $y_{New}$ as $\widehat{\mu}(\ve{x}_{New})$.

The predictor function can be more complicated. For example, for the great white shark data we might be interested predicting whether an encounter will be fatal based on the length of the shark and whether the person was surfing or not. Here the response variate, $y$, takes binary outcome values 1 for Fatality and 0 otherwise, and the explanatory variates given by $\ve{x} = \tr{(x, z)}$ is a vector of two values, $x$ being the shark Length in inches and $z$ the binary indicator Surfing (1 for yes, 0 otherwise). Since $y$ is binary, a common response model would be to model the probability $\mu(\ve{x})$ that the encounter is fatal. We imagine a random variate $Y$ taking the value 1 with probability $\mu(\ve{x})$ and value 0 with probability $1 - \mu(\ve{x})$; that is, $Y$ is a Bernoulli random variate with probability function \[Pr(Y=y) = \mu(\ve{x})^y ~(1 - \mu(\ve{x}))^{(1-y)} ~ \mbox{ for } ~ ~y \in \left\{ 0, 1 \right\}. \] For this model, $\mu(\ve{x}) = E(Y \given \ve{x}) \in \left[0, 1 \right]$ and \[ \log \left( \frac{\mu(\ve{x})}{1-\mu(\ve{x})} \right) = \alpha + \beta x + \gamma z \] which is sometimes called a logistic function. The logistic function has the advantage that it maps the interval $\left[0, 1 \right]$ to the extended real line $\left[ -\infty, +\infty \right]$. This response model is an example of a generalized linear model (or glm) – generalized in the sense that a function of $\mu(\ve{x})$ is modelled as a linear function of unknown parameters.

If we have observed $\ve{x}_u$, $y_u$ for $u \in \samp{S}$ and let $\sv{\theta} = \tr{(\alpha, \beta, \gamma)}$ denote the vector of unknown parameters, then $\mu(\ve{x})$ could be estimated by maximizing the log-likelihood function: \[\ell(\alpha, \beta, \gamma) = \sum_{u \in \samp{S}} y_u (\alpha + \beta x_u + \gamma z_u) + \sum_{u \in \samp{S}} \log (1 - \mu(\ve{x}_u)) ~. \] The resulting estimates $\widehat{\alpha}$, $\widehat{\beta}$, and $\widehat{\gamma}$ determine the value $\widehat{\mu}(\ve{x}_{New})$, for any $\ve{x} = \ve{x}_{New}$, provides an estimate of the probability of a fatal encounter when $\ve{x}= \ve{x}_{New}$. Based on this estimated probability, we might predict a fatality if and only if $\widehat{\mu}(\ve{x}_{New}) \ge \frac{1}{2}$.

These are just two examples of possible prediction functions.

5.1 Accuracy of prediction

A predictor function $\mu(\ve{x})$ is a function intended to predict the response variate value $y_u$ for any unit $u$ selected from a population $\pop{P}$. There are numerous possibilities for the set $\pop{P}$ – imagine for example, any study or target population. How well it predicts will be a function of the choice of $\mu(\ve{x})$ and the variate values for the units in the population $\pop{P}$.

Of course, the problems previously outlined on inductive inference do not go away. Typically, we do not have $\mu(\ve{x})$, but rather an estimate of it, $\widehat{\mu}(\ve{x})$, constructed from the variate values of the units in some sample $\samp{S}$ drawn according to some sampling mechanism from a study population $\pop{P}_{study}$. Moreover, predicting $y$ values for units from a target population has all the difficulties usually associated with study error plus the extraordinary difficulty that each prediction has its own study error which may or may not be related to that of any other prediction.

Any measure of the quality of the predictions will be confined to some measure taken over the units of a population $\pop{P}$. Rather than ask how accurate predictions are, we might ask instead how inaccurate are the predictions – this is usually easier to measure. For example, we could measure the inaccuracy by the average squared error over $\pop{P}$, that is \[Ave_{u \in \pop{P}}~(y_u - \widehat{\mu}(\ve{x}_u))^2\] which treats every unit in $\pop{P}$ with equal weight.

So how might we determine this? A simple approach would be to take $\pop{P}$ to be the observed set for which we have measurements on both $\ve{x}$ and $y$. That is calculate \[Ave_{i = 1, \ldots, N}~(y_i - \widehat{\mu}(\ve{x}_i))^2.\] This is proportional to the more familiar residual sum of squares $\sum_{i = 1, \ldots, N} \widehat{r}_i^2$ with estimated residuals $\widehat{r}_i = y_i - \widehat{\mu}(\ve{x}_i)$. This average is sometimes called the estimated “residual mean squared error”.

The problem with this approach is that the estimate of the predictor function is based on the same set of observations and the data pairs $(\ve{x}_1, y_1), (\ve{x}_2, y_2), \ldots , (\ve{x}_N, y_N)$ as the set $\pop{P}$ over which the average is taken. This doesn’t seem to be the most honest way to estimate a prediction’s inaccuracy. After all, we are using the same $x$ values and the values $y$ we want to predict to construct the predictor function. Surely this will typically underestimate the average squared error for prediction of any other values of $x$.

The average prediction squared error (APSE) of $\widehat{\mu}$ will be written as \[APSE(\pop{P}, \widehat{\mu}_{\samp{S}}) = {Ave}_{u \in \pop{P}}(y_u - \widehat{\mu}_\samp{S}(\ve{x}_u))^2.\] The notation emphasizes that the average prediction squared error depends on the the predictor function $\widehat{\mu}$, itself based on a set of observed $(\ve{x}, y)$ pairs for units in some set $\samp{S}$, and on the set $\pop{P}$ over which the predictions are being evaluated.

Typically $\samp{S}$ is a sample of size $n$ and $\pop{P}$ is a study population of size $n \le N$. With only some abuse of notation, the same formulation could also have $\samp{S}$ represent an entire study population and $\pop{P}$ the target population, in which case the APSE would give some measure prediction quality that included study error.

The problem with the simple residual mean squared error is that it is essentially the following estimate of average prediction squared error \[\widehat{APSE}(\pop{P}, \widehat{\mu}_{\samp{S}}) = APSE(\widehat{\pop{P}}, \widehat{\mu}_{\samp{S}}) = APSE(\samp{S}, \widehat{\mu}_{\samp{S}}) \] and this estimate uses $\pop{S}$ twice over – once as the population $\pop{P}$ and once as the sample used to construct $\widehat{\mu}$. The result is very likely an overly optimistic value for the inaccuracy of $\widehat{\mu}$ on $\pop{P}$.

5.2 Examples

Here we will look at some examples where we have complete knowledge of the population $\pop{P}$ and the values of $x_u$ and $y_u$ for all $u \in \pop{P}$.

5.2.1 US census: predicting farm acreage

Suppose we consider predicting the acres under agrigulture in 1992, acres92, from the acreage under agriculture ten years earlier in 1982, acres82.

### The data is in agpop
xlim <- extendrange(agpop$acres82)
ylim <- extendrange(agpop$acres92)
### First plot the data
plot(x = agpop$acres82, y = agpop$acres92, pch = 19, 
     col = adjustcolor("black", 0.5), xlim = xlim, ylim = ylim,
     xlab = "Acreage in 1982",  ylab = "Acreage in 1992",
     main = "Predictor based on the whole population")

### Now fit a straight line 
### We could use our own least-squares fitting procedures or 
### use the built-in function for fitting linear models: lm(...)
fit <- lm(acres92 ~ acres82, data = agpop)
### Add the fit to the plot
abline(fit, col = adjustcolor("firebrick", 0.5), lwd = 4 )

Clearly this line, fitted on the whole population, would provide a pretty good prediction for all points in the population. The average squared residuals is mean(fit$residuals^2) $\approx$ 2.9 $\times 10^9$. (Note that the residual mean square error traditionally divides the sum of squared residuals by $N-2$ here instead of by $N$. Exercise: Why? What is the equation for $\widehat{\mu}(\ve{x})$?)

But what if the predictor is instead $\widehat{\mu}_{\samp{S}}(\ve{x})$, the fitted line based on some sample?

### Get a sample of size 100
N <- nrow(agpop)
set.seed(12346423)  # for reproducibility
samp <- rep(FALSE, N)
samp[sample(1:N, 100)] <- TRUE

sampData <- agpop[samp,]
### First plot the sample (on the same scale as the population)
plot(x = sampData$acres82, y = sampData$acres92, pch = 19, 
     col = adjustcolor("blue", 0.5), xlim = xlim, ylim = ylim,
     xlab = "Acreage in 1982",  ylab = "Acreage in 1992",
     main = "Prediction from a sample")

### Now fit a straight line 
### We could use our own least-squares fitting procedures or 
### use the built-in function for fitting linear models: lm(...)
fitSamp <- lm(acres92 ~ acres82, data = sampData)
### Add the fit to the plot
abline(fitSamp, col = adjustcolor("steelblue", 0.5), lwd = 4 , lty = 3)
### Compare with the predictor on the whole population
abline(fit, col = adjustcolor("firebrick", 0.5), lwd = 4 )
legend("topleft", title = "Fitted predictor", 
       legend = c("sample", "population"),
       lwd = c(4,4), lty = c(3,1),
       col = adjustcolor(c("blue", "firebrick"), 0.5)
       )

As can be seen, the fitted predictor on the sample can be quite different from that based on the entire population.

A predictor function for lm models can be defined programmatically as follows

lmPredictor <- function(formula, data) {
  ## Get the fit for the given 
  fit <- lm(as.formula(formula), data=data, na.action = na.exclude)
  ## Return a predictor function
  function(data) {
    predict(fit, newdata = data)
  }
}

### The predictor function for our sample
muhat <- lmPredictor("acres92 ~ acres82", data=sampData)

The function muhat(...) can now be used to calculate the prediction for any new data that has acres82 as a named variate.

Similarly, we can write a function to calculate the average predicted squared error (apse) as

apse <- function(y, data, predfun) {
  {mean((y - predfun(data))^2, na.rm = TRUE)}
}

The average prediction squared error is now easily calculated. First, over the sample \[APSE(\samp{S}, \widehat{\mu}_{\samp{S}}) = Ave_{u \in \samp{S}}(y_u - \widehat{\mu}_\samp{S}(\ve{x}_u))^2\] we have

### First over the sample
apseSamp <- apse(y = sampData[,"acres92"], data = sampData, predfun = muhat)
apseSamp

## [1] 4124709601

And now over the population \[APSE(\pop{P}, \widehat{\mu}_{\samp{S}}) = Ave_{u \in \pop{P}}(y_u - \widehat{\mu}_\samp{S}(\ve{x}_u))^2\] we have

### and then over the population 
apsePop <- apse(y = agpop[,"acres92"], data = agpop, predfun = muhat)
apsePop

## [1] 7732387450

The result is considerably larger than that estimated from the sample alone, namely about 1.9 times as large. Clearly, the average prediction squared error over the sample would be misleading.

5.2.2 US census: Comparing predictors of acreage

Rather than use the acreage from 1982, to predict that of 1992, we might use the more recent acreage of 1987. The predictor is easily constructed and its average prediction squared error evaluated over the population:

### The predictor function for our sample
muhat87 <- lmPredictor("acres92 ~ acres87", data=sampData)
apsePop87 <- apse(y = agpop[,"acres92"], data = agpop, predfun = muhat87)
apsePop87

## [1] 3347267402

which is a considerably smaller value, namely 43 percent of the predictor based on the 1982 acreage. By this measure, the better predictor of the acreage in 1992 is had from the 1987 acreage than from the 1982 acreage.

We might also consider a more complicated predictor that included both the 1982 and the 1987 acreages. This would be constructed as follows:

### The predictor function for our sample
muhat8287 <- lmPredictor("acres92 ~ acres82 + acres87", data=sampData)
apsePop8287 <- apse(y = agpop[,"acres92"], data = agpop, predfun = muhat8287)
apsePop8287

## [1] 3243069169

which produces an even smaller average prediction squared error.

5.2.3 Alternative: Out of sample only assessment

Note that $\pop{P}$ includes the sample $\samp{S}$ and we could write the average prediction squared error as \[ \begin{array}{rcl} APSE(\pop{P}, \widehat{\mu}_{\samp{S}}) &=& Ave_{u \in \pop{P}}(y_u - \widehat{\mu}_{\samp{S}}(\ve{x}_u))^2 \\ &&\\ &=&\left(\frac{n}{N} \right) Ave_{u \in \samp{S}}(y_u - \widehat{\mu}_{\samp{S}}(\ve{x}_u))^2 ~+~ \left(\frac{N-n}{N} \right) Ave_{u \in \samp{T}}(y_u - \widehat{\mu}_{\samp{S}}(\ve{x}_u))^2 \\ &&\\ &=&\left(\frac{n}{N} \right) APSE(\samp{S}, \widehat{\mu}_{\samp{S}}) ~+~ \left(\frac{N-n}{N} \right) APSE(\pop{T}, \widehat{\mu}_{\samp{S}}) \end{array} \] where $\pop{T} = \pop{P} - \samp{S}$ is the complement set of the sample in the population and $n$ and $N$ are the number of units in $\samp{S}$ and $\pop{P}$ respectively.

Given that interest often lies in the quality of the predictions outside of the sample, rather than over the whole population one might choose to evaluate a predictor’s accuracy based only on its average prediction squared error over $\pop{T}$, that is, based only on \[APSE(\pop{T}, \widehat{\mu}_{\samp{S}}) = Ave_{u \in \samp{T}}(y_u - \widehat{\mu}_\samp{S}(\ve{x}_u))^2 .\] Clearly, if $n<<N$ the value will not be that different from that averaged over the whole population $\pop{P}$.

For the three predictor functions of the 1992 acreage this would be evaluated as

### First, get the complement of the sample
Tpop <- !samp
Tdata <- agpop[Tpop,]
results <- data.frame( acres82 =  apse(y = Tdata[,"acres92"], 
                                       data = Tdata, predfun = muhat),
                       acres87 =  apse(y = Tdata[,"acres92"], 
                                       data = Tdata, predfun = muhat87),
                       acres8287 =  apse(y = Tdata[,"acres92"], 
                                       data = Tdata, predfun = muhat8287)
                       )
kable(100 * round(results/max(results), 2), 
      caption = "Error on out of sample population as a percent of the maximum")

Error on out of sample population as a percent of the maximum
acres82	acres87	acres8287
100	43	42

Clearly, the predictor based on both the 1982 and the 1987 acreages has the smallest average prediction squared error over the population that is outside the sample. Note however that the predictor based only on the 1987 acreage is a close second and might be preferred on grounds of simplicity.

5.2.4 Increasing complexity of predictors

A predictor function based only on acres82 is simpler than the one based on both acres82 and acres87. The latter is a more complex predictor in that it has more parameters to be estimated. This can be seen from the mathematical definition of the two predictor functions. Letting $x$ denote acres82 and $z$ denote acres87, two predictor functions are \[ \widehat{\mu}(x) = \widehat{\alpha} + \widehat{\beta} x \] and \[ \widehat{\mu}(x) = \widehat{\alpha} + \widehat{\beta} x + \widehat{\gamma} x \] respectively. The former is simpler in having fewer parameters estimated from the sample $\samp{S}$. In this sense, the complexity of the predictor increases with the number of parameters estimated.

The complexity can be increased without adding more variates. For example, the variate acres82 may be the only variate in the predictor but the complexity can be made arbitrarily large by using a higher order polynomial in $x$: \[ \widehat{\mu}(x) = \widehat{\beta}_0 + \widehat{\beta}_1 x + \widehat{\beta}_2 x^2 + \widehat{\beta}_3 x^3 + \cdots + + \widehat{\beta}_p x^p . \] A $p$-degree polynomial has $p+1$ parameters to be estimated from the sample.

Being more complex, the predictor is also more flexible and able to fit the data more closely. For example, if $p=3$ a cubic predictor function can constructed from our sample:

### First plot the sample (on the same scale as the population)
plot(x = sampData$acres82, y = sampData$acres92, pch = 19, 
     col = adjustcolor("blue", 0.5), xlim = xlim, ylim = ylim,
     xlab = "Acreage in 1982",  ylab = "Acreage in 1992",
     main = "A cubic predictor function")

### Now fit a cubic polynomial using the "poly(x, 3)" formula
### NOTE: the argument "raw=TRUE" appears so that ordinary (not orthogonal)
### polynomials are used.  This is important whenever there may be NAs in the
### data.
muhat82cubic <- lmPredictor("acres92 ~ poly(acres82,3,raw=TRUE)", data = sampData)

### Need to get newdata to draw the curve
newXvals <- seq(xlim[1], xlim[2], length.out = 200)

### Draw the curve
lines(newXvals, 
      ## predictor fun needs a data frame with the same variate name
      muhat82cubic(data.frame(acres82 = newXvals)), 
      col = adjustcolor("steelblue", 0.5), lwd = 4 , lty = 3)

As can be seen from the above plot, the cubic certainly fits the sample data well and will predict the values in the sample better than would the simple straight line predictor used earlier. The question of course is how well it predicts on the population. To check we need only calculate its $APSE$.

### On the whole population, the value is
apsePop82cubic <- apse(y = agpop[,"acres92"], data = agpop, predfun = muhat82cubic)
apsePop8287

## [1] 3243069169

which is about 21 times the size of the simpler straight line predictor (for the out of sample predictions the multplier is slightly larger). In this case, the increased complexity has not brought about a better predictor.

We might investigate all possible powers $p \le3$, calculating the value of the $APSE$ for each. The higher the degree the greater the complexity of the predictor function, so in what follows the argument specifying the degree will be named complexity.

### First get the possible values of p
p <- 0:3
### Now get the predictor functions for each power
predFuns <- Map(function(complexity) {
  formula <- if (complexity==0) {
    "acres92 ~ 1"
  } else 
      { paste0("acres92 ~ poly(acres82,", complexity, ", raw = TRUE)") }
  lmPredictor(formula, data = sampData)
  }, 
  p)
### And the average prediction squared errors for each
apsePop82poly <- Map(function(predictor) apse(y = agpop[,"acres92"], 
                                              data = agpop, 
                                              predfun = predictor),
                     predFuns)
### Turn the result from a list 
apsePop82poly <- unlist(apsePop82poly)

### which we can plot as a function of the complexity of the predictor
### Numbers are so large that we plot the logarithms of the APSE values
plot(p, log10(apsePop82poly),
     xlab="complexity (degree of polynomial)", 
     ylab="log_10 APSE", 
     main="APSE as a function of complexity",
     type="l", col="grey75",
     ## suppress x-axis tics temporarily
     xaxt="n")
### Add the x axis
axis(side = 1, at  = p)
points(p, log10(apsePop82poly), pch = 19)

As the plot shows, the average prediction squared error is high for the constant ($p=0$) predictor, much lower for a straight line predictor ($p=1$), slightly higher for a quadratic predictor ($p=2$), and much larger again for a cubic predictor ($p=3$). This pattern is typical of predictors. Increasing complexity can improve the predictor but only to some point (here $p=1$). Beyond that point, increasing complexity causes the predictions to degrade. Increasing complexity will continue to improve predictions on the sample but not on the rest of the population. This effect is called overfitting in that the predictor has been too closely tailored to the peculiarities of the sample to be value in predicting out of the sample.

5.3 Comparison over multiple samples.

The function $\widehat{\mu}_\samp{S}(\ve{x})$ is based on a single sample $\samp{S}$. Its average prediction squared error over $\pop{P}$ provides a measure of how well it predicts the values in the population $\pop{P}$ and can be used to assess its performance compared to other predictor functions.

However, this performance might be peculiar to the particular choice of sample. The average prediction errors might be very different for another choice as might which predictors perform best, or worst. It would be of interest, then, to choose predictor functions that perform well no matter which sample was used to construct the predictor.

Suppose that we have many samples, say $N_\samp{S}$ samples $\samp{S}_j$ for $j=1, \ldots, N_\samp{S}$. For each sample, $\samp{S}_j$, there will be a predictor function $\widehat{\mu}_{\samp{S}_j}(\ve{x})$ and a corresponding average prediction squared error \[APSE(\pop{P}, \widehat{\mu}_{\samp{S}_j}) \] The average over all $N_{\samp{S}}$ samples of the individual average prediction squared errors would be a better measure of the quality of a predictor function.

Recall that the predictor function $\widehat{\mu}_{\samp{S}_j}$ is an estimate based on a single samples $\samp{S}_j$. We now average over many (ideally all) samples $\samp{S}_j$: \[\begin{array}{rcl} APSE(\pop{P}, \widetilde{\mu}) &=& Ave_{j=1, \ldots, N_\samp{S}}~~ APSE(\pop{P}, \widehat{\mu}_{\samp{S}_j}) \\ &&\\ &=& \frac{1}{N_\samp{S}} \sum_{j=1}^{N_\samp{S}} ~APSE(\pop{P}, \widehat{\mu}_{\samp{S}_j})\\ &&\\ &=& \frac{1}{N_\samp{S}} \sum_{j=1}^{N_\samp{S}} ~\frac{1}{N}\sum_{i \in \pop{P}}(y_i - \widehat{\mu}_{\samp{S}_j}(\ve{x}_i))^2.\\ \end{array} \] Note that in $APSE(\pop{P}, \widetilde{\mu})$ the estimator notation $\widetilde{\mu}$ is used in the second argument in place of the estimate notation $\widehat{\mu}$; this is to emphasize that the function is looking at the values of $\widehat{\mu}$ for all possible samples ${cal S}_j$ of $\pop{P}$ for $j=1, \ldots, N_\samp{S}$.

5.4 Decomposing the $APSE({\cal P},\widetilde{\mu})$

A better understanding of the $APSE$ can be had by breaking it into interpretable parts. The average predicted squared error for the estimator $\widetilde{\mu}(\ve{x})$ can be shown to be composed of three separate and interpretable pieces.

First, let $\mu(\ve{x})$ denote a conditional average of $y$ given $\ve{x}$ defined as the average of all of the $y$ values in $\pop{P}$ that share that value of $\ve{x}$. To be precise, suppose that there are $K$ different values of $\ve{x}$ in the population $\pop{P}$ so that $\pop{P}$ can be partitioned according to the different values of $k$ as \[ \pop{P} = \bigcup_{k=1}^{K} \pop{A}_k \] where the unique values of $\ve{x}$ are $\ve{x}_1, \ldots, \ve{x}_K$ and \[\pop{A}_k = \left\{ u \st u \in \pop{P}, ~ \ve{x}_u = \ve{x}_k \right\}. \] (Note that $\pop{A}_1 \ldots \pop{A}_K$ partition $\pop{P}$ since $\pop{A}_k \bigcap \pop{A}_j = \varnothing$ for all $k \ne j$.)

The conditional average $\mu(\ve{x})$ can now be expressed for each distinct $x_k$ as \[\mu(\ve{x}_k) = Ave_{i \in \pop{A}_k} ~y_i.\]

Second, let $\overline{\widehat \mu} (\ve{x})$ denote the average of the estimated predictor function over all samples \[\overline{\widehat \mu} (\ve{x}) = \frac{1}{N_\samp{S}} \sum_{j=1}^{N_\samp{S}}\widehat{\mu}_{\samp{S}_j}(\ve{x}).\]

With these two together, the average predicted squared error for the estimator $\widetilde{\mu}(\ve{x})$ can now be written as \[\begin{array}{rcl} APSE(\pop{P}, \widetilde{\mu}) &=& \frac{1}{N_\samp{S}} \sum_{j=1}^{N_\samp{S}} ~\frac{1}{N}\sum_{i \in \pop{P}}(y_i - \widehat{\mu}_{\samp{S}_j}(\ve{x}_i))^2\\ &&\\ &=& \frac{1}{N_\samp{S}} \sum_{j=1}^{N_\samp{S}} ~\frac{1}{N}\sum_{i \in \pop{P}}(y_i - {\mu}(\ve{x}_i))^2\\ &&\\ && ~~~~~~~~~~+ \frac{1}{N_\samp{S}} \sum_{j=1}^{N_\samp{S}} ~\frac{1}{N}\sum_{i \in \pop{P}}(\widehat{\mu}_{\samp{S}_j}(\ve{x}_i) - {\mu}(\ve{x}_i))^2\\ &&\\ &&\\ &=& \frac{1}{N_\samp{S}} \sum_{j=1}^{N_\samp{S}} ~\frac{1}{N}\sum_{i \in \pop{P}}(y_i - {\mu}(\ve{x}_i))^2\\ &&\\ && ~~~~~~~~~~+ \frac{1}{N_\samp{S}} \sum_{j=1}^{N_\samp{S}} ~\frac{1}{N}\sum_{i \in \pop{P}}(\widehat{\mu}_{\samp{S}_j}(\ve{x}_i) - \overline{\widehat{\mu}}(\ve{x}_i))^2\\ &&\\ && ~~~~~~~~~~+ \frac{1}{N_\samp{S}} \sum_{j=1}^{N_\samp{S}} ~\frac{1}{N}\sum_{i \in \pop{P}}(\overline{\widehat{\mu}}(\ve{x}_i) - {\mu}(\ve{x}_i))^2\\ &&\\ &&\\ &=& \frac{1}{N}\sum_{i \in \pop{P}}(y_i - {\mu}(\ve{x}_i))^2\\ &&\\ && ~~~~~~~~~~+ \frac{1}{N_\samp{S}} \sum_{j=1}^{N_\samp{S}} ~\frac{1}{N}\sum_{i \in \pop{P}}(\widehat{\mu}_{\samp{S}_j}(\ve{x}_i) - \overline{\widehat{\mu}}(\ve{x}_i))^2\\ &&\\ && ~~~~~~~~~~+ \frac{1}{N}\sum_{i \in \pop{P}}(\overline{\widehat{\mu}}(\ve{x}_i) - {\mu}(\ve{x}_i))^2\\ &&\\ &&\\ &=& \sum_{k=1}^{K}\frac{n_k}{N}\sum_{i \in \pop{A}_k}\frac{1}{n_k}(y_i - {\mu}(\ve{x}_k))^2\\ &&\\ && ~~~~~~~~~~+ ~\sum_{k=1}^{K}\frac{n_k}{N}~ \sum_{j=1}^{N_\samp{S}}\frac{1}{N_\samp{S}}(\widehat{\mu}_{\samp{S}_j}(\ve{x}_k) - \overline{\widehat{\mu}}(\ve{x}_k))^2\\ &&\\ && ~~~~~~~~~~+ \sum_{k=1}^{K}\frac{n_k}{N}(\overline{\widehat{\mu}}(\ve{x}_k) - {\mu}(\ve{x}_k))^2\\ &&\\ &&\\ &=& Ave_{\ve{x}}(Var(y|\ve{x})) + Var(\widetilde{\mu}) + Bias^2(\widetilde{\mu}).\\ \end{array} \] Note that each term is an average over all $\ve{x}$ values and over all samples. The first is the average of the conditional variance of the response $y$, the second the average of the variance of the estimator and the last the squared bias. Different estimators $\bigwig{\mu}_{\samp{S}}$ will produce different values of each of the last two terms. So too will different sets of samples (so that the choice of sampling plan used to sample from the population will also affect the predictive accuracy). No matter what the estimator, the first term (the residual sampling variability) remains unchanged.

Note also that we could separate $\pop{P}$ into the parts which are used to construct the estimates (viz. samples $\samp{S}$) and the parts which are not (viz. $\pop{T} - \pop{P} - \samp{S}$). The average predictive accuracy could then be written as

\[\begin{array}{rcl} APSE(\pop{P}, \widetilde{\mu}) &=& \frac{1}{N_\samp{S}} \sum_{j=1}^{N_\samp{S}} ~\frac{1}{N}\sum_{i \in \pop{P}}(y_i - {\mu}({\bf x_i}))^2\\ &&\\ && ~~~~~~~~~~+ \frac{1}{N_\samp{S}} \sum_{j=1}^{N_\samp{S}} ~\frac{1}{N}\sum_{i \in \pop{P}}(\widehat{\mu}_{\samp{S}_j}({\bf x_i}) - \overline{\widehat{\mu}}({\bf x_i}))^2\\ &&\\ && ~~~~~~~~~~+ \frac{1}{N_\samp{S}} \sum_{j=1}^{N_\samp{S}} ~\frac{1}{N}\sum_{i \in \pop{P}}(\overline{\widehat{\mu}}({\bf x_i}) - {\mu}({\bf x_i}))^2\\ &&\\ &&\\ &=& \frac{1}{N_\samp{S}} \sum_{j=1}^{N_\samp{S}} ~\frac{1}{N}\left(\sum_{i \in \samp{S}_j}(y_i - {\mu}({\bf x_i}))^2 + \sum_{i \in \pop{T}_j}(y_i - {\mu}({\bf x_i}))^2 \right)\\ &&\\ && ~~~~~~~~~~+ \frac{1}{N_\samp{S}} \sum_{j=1}^{N_\samp{S}} ~\frac{1}{N} \left(\sum_{i \in \samp{S}_j}(\widehat{\mu}_{\samp{S}_j}({\bf x_i}) - \overline{\widehat{\mu}}({\bf x_i}))^2 + \sum_{i \in \pop{T}_j}(\widehat{\mu}_{\samp{S}_j}({\bf x_i}) - \overline{\widehat{\mu}}({\bf x_i}))^2\right)\\ &&\\ && ~~~~~~~~~~+ \frac{1}{N_\samp{S}} \sum_{j=1}^{N_\samp{S}} ~\frac{1}{N}\left( \sum_{i \in \samp{S}_j}(\overline{\widehat{\mu}}({\bf x_i}) - {\mu}({\bf x_i}))^2 + \sum_{i \in \pop{T}_j}(\overline{\widehat{\mu}}({\bf x_i}) - {\mu}({\bf x_i}))^2 \right) \\ &&\\ &&\\ &=& \left\{ \frac{1}{N_\samp{S}} \sum_{j=1}^{N_\samp{S}} ~\frac{N - N_{\pop{T}_j}}{N} \left( \frac{1}{N-N_{\pop{T}_j}}\sum_{i \in \samp{S}_j}(y_i - {\mu}({\bf x_i}))^2 \right) \right.\\ &&\\ && ~~~~~~~~~~+ \frac{1}{N_\samp{S}} \sum_{j=1}^{N_\samp{S}} ~\frac{N-N_{\pop{T}_j}}{N} \left( \frac{1}{N-N_{\pop{T}_j}}\sum_{i \in \samp{S}_j}(\widehat{\mu}_{\samp{S}_j}({\bf x_i}) - \overline{\widehat{\mu}}({\bf x_i}))^2 \right) \\ &&\\ && ~~~~~~~~~~+ \left. \frac{1}{N_\samp{S}} \sum_{j=1}^{N_\samp{S}} ~\frac{N-N_{\pop{T}_j}}{N} \left( \frac{1}{N-N_{\pop{T}_j}} \sum_{i \in \samp{S}_j} (\overline{\widehat{\mu}}({\bf x_i}) - {\mu}({\bf x_i}))^2 \right) \right\} \\ &&\\ && + \left\{ \frac{1}{N_\samp{S}} \sum_{j=1}^{N_\samp{S}} ~\frac{N_{\pop{T}_j}}{N} \left( \frac{1}{N_{\pop{T}_j}}\sum_{i \in \pop{T}_j} (y_i - {\mu}({\bf x_i}))^2 \right) \right.\\ &&\\ && ~~~~~~~~~~+ \frac{1}{N_\samp{S}} \sum_{j=1}^{N_\samp{S}} ~\frac{N_{\pop{T}_j}}{N} \left(\frac{1}{N_{\pop{T}_j}}\sum_{i \in \pop{T}_j} (\widehat{\mu}_{\samp{S}_j}({\bf x_i}) - \overline{\widehat{\mu}}({\bf x_i}))^2 \right) \\ &&\\ && ~~~~~~~~~~+ \left. \frac{1}{N_\samp{S}} \sum_{j=1}^{N_\samp{S}} ~\frac{N_{\pop{T}_j}}{N} \left( \frac{1}{N_{\pop{T}_j}}\sum_{i \in \pop{T}_j} (\overline{\widehat{\mu}}({\bf x_i}) - {\mu}({\bf x_i}))^2 \right) \right\} \\ &&\\ &&\\ &=& \left\{\mbox{something based on the} ~\mathbf{same}~ \mbox{samples used by } ~\widehat{\mu} \right\} \\ &&\\ && ~~~~~~~~~~+ \left\{\mbox{something based on samples} ~\mathbf{not}~ \mbox{used by } ~\widehat{\mu} \right\}. \end{array} \]

If, for example, all samples $\samp{S}_j$ were of the same size $n$ and similarly $N - n$ for $\pop{T}_j$, then this could be written as

\[ \begin{array}{rcl} APSE(\pop{P}, \widetilde{\mu}) &=& \frac{n}{N} \left\{ \frac{1}{N_\samp{S}} \sum_{j=1}^{N_\samp{S}} \left( \frac{1}{n}\sum_{i \in \samp{S}_j}(y_i - {\mu}({\bf x_i}))^2 \right) \right.\\ &&\\ && ~~~~~~~~~~+ \frac{1}{N_\samp{S}} \sum_{j=1}^{N_\samp{S}} \left( \frac{1}{n}\sum_{i \in \samp{S}_j}(\widehat{\mu}_{\samp{S}_j}({\bf x_i}) - \overline{\widehat{\mu}}({\bf x_i}))^2 \right)\\ &&\\ && ~~~~~~~~~~+ \left. \frac{1}{N_\samp{S}} \sum_{j=1}^{N_\samp{S}} \left( \frac{1}{n} \sum_{i \in \samp{S}_j} (\overline{\widehat{\mu}}({\bf x_i}) - {\mu}({\bf x_i}))^2 \right) \right\} \\ &&\\ && + \left( 1 - \frac{n}{N} \right) \left\{ \frac{1}{N_\samp{S}} \sum_{j=1}^{N_\samp{S}} \left( \frac{1}{N -n}\sum_{i \in \pop{T}_j} (y_i - {\mu}({\bf x_i}))^2 \right) \right.\\ &&\\ && ~~~~~~~~~~~~~~~~~~~~+ \frac{1}{N_\samp{S}} \sum_{j=1}^{N_\samp{S}} \left(\frac{1}{N - n}\sum_{i \in \pop{T}_j} (\widehat{\mu}_{\samp{S}_j}({\bf x_i}) - \overline{\widehat{\mu}}({\bf x_i}))^2 \right)\\ &&\\ && ~~~~~~~~~~~~~~~~~~~~+ \left. \frac{1}{N_\samp{S}} \sum_{j=1}^{N_\samp{S}} \left( \frac{1}{N-n}\sum_{i \in \pop{T}_j} (\overline{\widehat{\mu}}({\bf x_i}) - {\mu}({\bf x_i}))^2 \right) \right\}\\ &&\\ &&\\ &=& \left(\frac{n}{N}\right) \left\{\widehat{APSE}(\pop{P},\bigwig{\mu})~~ \mbox{based on the} ~\mathbf{same}~ \mbox{samples used by } ~\widehat{\mu} \right\} \\ &&\\ && ~~~~~~~~~~+ \left(1-\frac{n}{N}\right) \left\{\widehat{APSE}(\pop{P},\bigwig{\mu}) ~~~ \mbox{based on samples} ~\mathbf{not}~ \mbox{used by } ~\widehat{\mu} \right\} \end{array} \] with the last line being a small abuse of the notation.

Clearly, if $n << N$, then the second term dominates. Sometimes, even if $n \approx N$ we might also want to focus our evaluation only on the second term since this evaluation is based on values not used in the actual estimation process.

We can use these quantities to compare predictors as before.

5.5 Calculating the $APSE({\cal P},\widetilde{\mu})$

Having decomposed $APSE({\cal P},\widetilde{\mu})$ into a number of interpretable parts, we can write a number of functions that will allow us to investigate the behaviour of a predictor according to each of these several parts.

5.5.1 Samples

To begin, we need some function to generate a sample $\samp{S}$ and its complement $\pop{T}$ from a population $\pop{P}$.

### The function getSampleComp will return a logical vector
### (that way the complement is also recorded in the sample)
getSampleComp <- function(pop, size, replace=FALSE) {
  N <- popSize(pop)
  samp <- rep(FALSE, N)
  samp[sample(1:N, size, replace = replace)] <- TRUE
  samp
}

### This function will return a data frame containing
### only two variates, an x and a y
getXYSample <- function(xvarname, yvarname, samp, pop) {
  sampData <- pop[samp, c(xvarname, yvarname)]
  names(sampData) <- c("x", "y")
  sampData
}

5.5.2 Predictor functions

The predictor function could be created by any function. Here focus will be on polynomial functions as considered in the acres92 prediction example. To make this appear more general, we write a getmuhat(...) function which will in this instance just produce a least-squares polynomial fit.

getmuhat <- function(sampleXY, complexity = 1) {
  formula <- paste0("y ~ ",
                    if (complexity==0) {
                      "1"
                      } else 
                      paste0("poly(x, ", complexity, ", raw = TRUE)") 
  )
  
  fit <- lm(as.formula(formula), data = sampleXY)
  
  ## From this we construct the predictor function
  muhat <- function(x){
    if ("x" %in% names(x)) {
      ## x is a dataframe containing the variate named
      ## by xvarname
      newdata <- x
    } else 
      ## x is a vector of values that needs to be a data.frame
    {newdata <- data.frame(x = x) }
    ## The prediction
    predict(fit, newdata = newdata)
  }
  ## muhat is the function that we need to calculate values 
  ## at any x, so we return this function from getmuhat
  muhat
}

To illustrate the point, using the sample we previously generated, we can get a function $\widehat{\mu}_{\samp{S}_j}(\ve{x})$ and evaluate it on any $x$.

### Set the variate names once
xvarname <- "acres82"
yvarname <- "acres92"

### Set the population once
pop <- agpop

### Construct the sample
samp <- getSampleComp(pop, 100)
sampData <- getXYSample(xvarname, yvarname, samp, pop)

### Set the degree of the polynomial
complexity <- 5

### Get the predictor function
muhat <- getmuhat(sampData, complexity)

### have a look
xlim <- extendrange(pop[, xvarname])

plot(sampData, 
     main=paste0("muhat (p=", complexity,") and its sample"),
     xlab = xvarname, ylab = yvarname,
     pch=19, col= adjustcolor("black", 0.5))
curve(muhat, from = xlim[1], to = xlim[2], 
      add = TRUE, col="steelblue", lwd=2)

The degree 5 polynomial predictor tracks the sample data very well (as might be expected). More complex (i.e. higher degree) polynomials have more ups and downs in their curves which provides them with more shapes to choose from so that they might be fit to predict better.

Given many samples $\samp{S}_j$, $j=1, \ldots, N_S$, and hence many $\widehat{\mu}_{\samp{S}_j}(\ve{x})$, we also need the function \[\overline{\widehat \mu} (\ve{x}) = \frac{1}{N_\samp{S}} \sum_{j=1}^{N_\samp{S}}\widehat{\mu}_{\samp{S}_j}(\ve{x}).\] The following function takes a list of the functions $\widehat{\mu}_{\samp{S}_j}(\ve{x})$ and returns the function $\overline{\widehat \mu} (\ve{x})$.

# We also need a function that will calculate the average
# of the values over a set of functions ... mubar(x) = ave (muhats(x)) 
# The following function does that.
# It returns a function mubar(x) that will calculate the average
# of the functions it is given at the value x.
getmubar <- function(muhats) {
  # the muhats must be a list of muhat functions
  # We build and return mubar, the function that 
  # is the average of the functions in muhats
  # Here is mubar:
  function(x) {
    # x here is a vector of x values on which the
    # average of the muhats is to be determined.
    # 
    # sapply applies the function given by FUN
    # to each muhat in the list muhats
    Ans <- sapply(muhats, FUN=function(muhat){muhat(x)})
    # FUN calculates muhat(x) for every muhat and
    # returns the answer Ans as a matrix having
    # as many rows as there are values of x and
    # as many columns as there are muhats.
    # We now just need to get the average 
    # across rows (first dimension)
    # to find mubar(x) and return it
    apply(Ans, MARGIN=1, FUN=mean)
  }
}

To illustrate this, we need many samples $\samp{S}_j$ and the corresponding predictor functions $\widehat{\mu}_{\samp{S}_j}({\bf x})$.

N_S <- 3
set.seed(36756781)  # for reproducibility
Ssamples <- lapply(1:N_S, FUN= function(i) {getSampleComp(pop, size = 100)})
muhats <- lapply(Ssamples, 
                 FUN=function(sample){
                   getmuhat(getXYSample(xvarname, yvarname, sample, pop),
                            complexity)
                   }
                 )
mubar <- getmubar(muhats)
# have a look
xvals <- seq(xlim[1], xlim[2], length.out = 200)

ylim <- range(sapply(muhats, 
                     FUN=function(muhat) range(muhat(xvals))
                     )
              )

ylim <- extendrange(c(ylim, pop[, yvarname]))

cols <- colorRampPalette(c(rgb(0,0,1,1), rgb(0,0,1,0.5)), 
                         alpha = TRUE)(N_S)

# plot the points from one sample
plot(pop[,c(xvarname, yvarname)], 
     pch=19, col= adjustcolor("black", 0.5),
     xlab="x", ylab="predictions",
     main= paste0(N_S, " muhats (degree = ", complexity, ") and mubar")
)

for (i in 1:N_S) {
  curveFn <- muhats[[i]]
  curve(curveFn, from = xlim[1], to = xlim[2],
        add=TRUE, col=cols[i], lwd=3, lty=(i+1))
}

curve(mubar,  from = xlim[1], to = xlim[2],
      add=TRUE, col="firebrick", lwd=3)

legend("topleft", 
       legend=c(paste0("muhat", 1:N_S),"mubar"),
       col=c(cols, "firebrick"), 
       ncol=2, cex=0.5,
       lwd=3, lty=c(2:(N_S + 1),1))

As can be seen, the accuracy of the predictor function will depend on the sample on which it is based. In the above plot, one of the three appears to be much better than the other two. The average $\overline{\widehat{\mu}}$ is largely affected by one of the three $\widehat{\mu}$ functions.

5.5.3 Average squared differences

Finally, we need some functions to calculate the average squared differences over any given sample. One of these will be based on the difference between $y$ and any predictor function, the other will be between any two predictor functions.

### We also need some average squared error functions.
### 
### The first one we need is one that will return the
### average squared error of the predicted predfun(x) 
### compared to the actual y values for those x values.
### 
ave_y_mu_sq <- function(sample, predfun, na.rm = TRUE){
  mean((sample$y - predfun(sample$x))^2, na.rm = na.rm)
}

### We will also need to calculate the average difference
### between two different predictor functions over some set
### of x values: Ave ( predfun1(x) - predfun2(x))^2
### 
ave_mu_mu_sq <- function(predfun1, predfun2, x, na.rm = TRUE){
  mean((predfun1(x) - predfun2(x))^2, na.rm = na.rm)
}

These would be used as follows:

### For  y - muhat(x)
ave_y_mu_sq(sampData, muhat)

## [1] 1478953743

### For muhat - mubar
ave_mu_mu_sq(muhat, mubar, sampData$x)

## [1] 524325140565

These functions are then used to determine the average predicted square errors and its three components.

5.5.4 $APSE$ and its components

The average predicted squared errors over many samples from $\pop{P}$, $APSE(\pop{P}, \widetilde{\mu})$ will be a function of the samples $\samp{S}_j$ and the samples $\pop{T}_j$, as well as the predictor function, which for our examples is determined entirely by the complexity parameter (here the degree of the polynomial that is the predictor).

### To determine APSE(P, mu_tilde) we need to average over all samples
### the average over all x and y in the test sample
### the squared error of the prediction by muhat based on each sample.
### Since each mutilde in our examples has a complexity parameter
### (degree of the polynomial degree) so too will apse.

apse <- function(Ssamples, Tsamples, complexity){
  ## average over the samples S
  ## 
  N_S <- length(Ssamples)
  mean(sapply(1:N_S, 
              FUN=function(j){
                S_j <- Ssamples[[j]]
                ## get the muhat function based on 
                ## the sample S_j
                muhat <- getmuhat(S_j, complexity = complexity)
                ## average over (x_i,y_i) in a
                ## single sample T_j the squares
                ## (y - muhat(x))^2
                T_j <- Tsamples[[j]]
                ave_y_mu_sq(T_j,muhat)
              }
  )
  )
}

Similar functions can be defined for each of the components of $APSE(\pop{P}, \widetilde{\mu})$.

For $Ave_{\bf x}\left\{ Var(y|{\bf x}) \right\}$ we have

### To determine Var(y) we need to average over all samples
### the average over all x and y in the test sample
### the squared error of the prediction by mu.

var_y <- function(Ssamples, Tsamples, mu){
  ## average over the samples S
  ## 
  N_S <- length(Ssamples)
  mean(sapply(1:N_S,
              FUN=function(j){
                ## average over (x_i,y_i) in a
                ## single sample T_j the squares
                ## (y - muhat(x))^2
                T_j <- Tsamples[[j]]
                ave_y_mu_sq(T_j, mu)
              }
  )
  )
}

Note that var_y(...) requires the conditional mean function $\mu(\ve{x})$ to be implemented and passed as the argument mu.

### For a given population, and choice of x and y
### mu(x) returns the average of the ys 

getmuFun <- function(pop, xvarname, yvarname){
  ## First remove NAs
  pop <- na.omit(pop[, c(xvarname, yvarname)])
  x <- pop[, xvarname]
  y <- pop[, yvarname]
  xks <- unique(x)
  muVals <- sapply(xks,
                   FUN = function(xk) {
                     mean(y[x==xk])
                   })
  ## Put the values in the order of xks
  ord <- order(xks)
  xks <- xks[ord]
  xkRange <-xks[c(1,length(xks))]
  minxk <- min(xkRange) 
  maxxk <- max(xkRange)
  ## mu values
  muVals <- muVals[ord]
  muRange <- muVals[c(1, length(muVals))]
  muFun <- function(xVals){
    ## vector of predictions
    ## same size as xVals and NA in same locations
    predictions <- xVals
    ## Take care of NAs
    xValsLocs <- !is.na(xVals)
    ## Just predict non-NA xVals
    predictions[xValsLocs] <- sapply(xVals[xValsLocs],
                                    FUN = function(xVal) {
                                      if (xVal < minxk) {
                                        result <- muRange[1]
                                      } else
                                        if(xVal > maxxk) {
                                          result <- muRange[2]
                                        } else
                                        {
                                          xlower <- max(c(minxk, xks[xks < xVal]))
                                          xhigher <- min(c(maxxk, xks[xks > xVal]))
                                          mulower <- muVals[xks == xlower]
                                          muhigher <- muVals[xks == xhigher]
                                          interpolateFn <- approxfun(x=c(xlower, xhigher),
                                                                     y=c(mulower, muhigher))
                                          result <- interpolateFn(xVal)
                                        }
                                      result
                                    }
    )
    ## Now return the predictions (including NAs)
    predictions
  }
  muFun
}

For $Var(\widetilde{\mu})$:

### To determine Var(mutilde) we need to average over all samples
### the average over all x and y in the test sample
### the squared difference of the muhat and mubar.

var_mutilde <- function(Ssamples, Tsamples, complexity){
  ## get the predictor function for every sample S
  muhats <- lapply(Ssamples, 
                   FUN=function(sample){
                     getmuhat(sample, complexity)
                   }
  )
  ## get the average of these, mubar
  mubar <- getmubar(muhats)
  
  ## average over all samples S
  N_S <- length(Ssamples)
  mean(sapply(1:N_S, 
              FUN=function(j){
                ## get muhat based on sample S_j
                muhat <- muhats[[j]]
                ## average over (x_i,y_i) in a
                ## single sample T_j the squares
                ## (y - muhat(x))^2
                T_j <- Tsamples[[j]]
                ave_mu_mu_sq(muhat, mubar, T_j$x)
              }
  )
  )
}

And for $Bias^2(\widetilde{\mu})$:

### To determine bias(mutilde)^2 we need to average over all samples
### the average over all x and y in the test sample
### the squared difference of the muhat and mubar.

bias2_mutilde <- function(Ssamples, Tsamples, mu, complexity){
  ## get the predictor function for every sample S
  muhats <- lapply(Ssamples, 
                   FUN=function(sample) getmuhat(sample, complexity)
  )
  ## get the average of these, mubar
  mubar <- getmubar(muhats)
  
  ## average over all samples S
  N_S <- length(Ssamples)
  mean(sapply(1:N_S, 
              FUN=function(j){
                ## average over (x_i,y_i) in a
                ## single sample T_j the squares
                ## (y - muhat(x))^2
                T_j <- Tsamples[[j]]
                ave_mu_mu_sq(mubar, mu, T_j$x)
              }
  )
  )
}

If all three components and the total $APSE$ are wanted at once, then it would be best to try to minimize the number of loops. The following function will be more efficient than running each of the four functions separately.

apse_all <- function(Ssamples, Tsamples, complexity, mu){
  ## average over the samples S
  ##
  N_S <- length(Ssamples)
  muhats <- lapply(Ssamples, 
                   FUN=function(sample) getmuhat(sample, complexity)
  )
  ## get the average of these, mubar
  mubar <- getmubar(muhats)
  
  rowMeans(sapply(1:N_S, 
                  FUN=function(j){
                    T_j <- Tsamples[[j]]
                    muhat <- muhats[[j]]
                    ## Take care of any NAs
                    T_j <- na.omit(T_j)
                    y <- T_j$y
                    x <- T_j$x
                    mu_x <- mu(x)
                    muhat_x <- muhat(x)
                    mubar_x <- mubar(x)
                    
                    ## apse
                    ## average over (x_i,y_i) in a
                    ## single sample T_j the squares
                    ## (y - muhat(x))^2
                    apse <- (y - muhat_x)
                    
                    ## bias2:
                    ## average over (x_i,y_i) in a
                    ## single sample T_j the squares
                    ## (y - muhat(x))^2
                    bias2 <- (mubar_x -mu_x)
                    
                    ## var_mutilde
                    ## average over (x_i,y_i) in a
                    ## single sample T_j the squares
                    ## (y - muhat(x))^2
                    var_mutilde <-  (muhat_x - mubar_x)
                    
                    ## var_y :
                    ## average over (x_i,y_i) in a
                    ## single sample T_j the squares
                    ## (y - muhat(x))^2
                    var_y <- (y - mu_x)
                    
                    ## Put them together and square them
                    squares <- rbind(apse, var_mutilde, bias2, var_y)^2
                    
                    ## return means
                    rowMeans(squares)
                  }
  ))
}

5.5.5 Putting it all together

The above set of functions can now be put together to evaluate the $APSE(\pop{P}, \widetilde{\mu})$ and its components for any predictor function. We need to determine each of the following:

choice of xvarname, yvarname, population, and sample size $n$
mu the conditional mean of $Y|X=x$
$N_S$, the number of samples $\samp{S}_j$ to generate
$N_T$, the number of observations in each sample $\pop{T}_j$
Details of the predictor function, which will depend on a complexity parameter (i.e. the degree of the polynomial)

For the sake of illustration, we take $n=1000$ to be 10 times the previous example. Also $N_{\samp{S}} = 30$ samples and their complement “test” samples will be generated.

### The generative model details
### 
### n, the size of each sample S_j
n <- 1000
### N_S, the number of samples of S_j
N_S <- 30
### The function mu(x) will be as before
mu <- getmuFun(pop, xvarname, yvarname)

With these values, samples are generated as

set.seed(34243411)
### We get N_S samples S_j of size n,  j=1, ..., N_S
samps <- lapply(1:N_S, FUN= function(i){getSampleComp(pop, n)})
Ssamples <- lapply(samps, 
                   FUN= function(Si){getXYSample(xvarname, yvarname, Si, pop)})

### And then matching N_S samples T_j, each of size N_T,  j=1, ..., N_S
Tsamples <- lapply(samps, 
                   FUN= function(Si){getXYSample(xvarname, yvarname, !Si, pop)})

And using these samples, $APSE(\pop{P}, \widetilde{\mu})$ and its components can be calculated for any predictor function (by specifying complexity).

First $APSE(\pop{P}, \widetilde{\mu})$

complexity <- 5
apse(Ssamples, Tsamples, complexity)

## [1] 3.045007e+12

The $Ave_{\bf x}\left\{Var(y | {\bf x}) \right\}$

var_y(Ssamples, Tsamples, mu=mu)

## [1] 4898337403

The $Var(\widetilde{\mu})$

var_mutilde(Ssamples, Tsamples, complexity)

## [1] 2.56281e+12

and finally $Bias^2(\widetilde{\mu})$

bias2_mutilde(Ssamples, Tsamples, mu, complexity)

## [1] 280997489373

the last three of which should (approximately) sum to $APSE(\pop{P}, \widetilde{\mu})$. (Differences can be due to floating point calculations, sample sizes changing due to missing values, and use of the $\pop{T}$s rather than $\pop{P}$.)

Again, if we want all of these, it is more efficient to calculate them at all on the same pass through the data as in

apse_all(Ssamples, Tsamples, complexity = complexity, mu = mu)

##         apse  var_mutilde        bias2        var_y 
## 3.045007e+12 2.562810e+12 2.809975e+11 4.898337e+09

The result will be the same. Note that the last three components (approximately) sum to the first.

5.6 Complexity example revisited

We now have the tools to use $APSE(\pop{P}, \widetilde{\mu})$ as a means to compare different polynomial complexity predictor functions for the complexity example. The same samples will be used as above but now the values for several degrees of polynomial predictor will be compareed.
We need just calculate the values of $APSE(\pop{P}, \widetilde{\mu})$ and its constituent components of which $Var(\widetilde{\mu})$ and $Bias^2(\widetilde{\mu})$ will be of most interest.

### the degrees of the polynomial
complexities <- 0:5

apse_vals <-  sapply(complexities, 
                     FUN = function(complexity){
                       apse_all(Ssamples, Tsamples, 
                                complexity = complexity, mu = mu)
                     }
)

# Print out the results
t(rbind(complexities, apse=round(apse_vals,5)))

##      complexities         apse  var_mutilde        bias2      var_y
## [1,]            0 1.843789e+11 1.111654e+08 182644565298 4898337403
## [2,]            1 2.968278e+09 2.608415e+07   1785439231 4898337403
## [3,]            2 3.247282e+09 2.500862e+08   1802728832 4898337403
## [4,]            3 1.172492e+10 8.146902e+09   2151540789 4898337403
## [5,]            4 1.708310e+11 1.668576e+11   2290318245 4898337403
## [6,]            5 3.045007e+12 2.562810e+12 280997489373 4898337403

Plotting the values of $APSE(\pop{P}, \widetilde{\mu})$ as a function of the complexity (as measured by df) will allow a visual comparison of the predictors.

Note that the values of $APSE$ are plotted here on a (base 10) logarithmic scale. Again complexity = 1 has the smallest $APSE$ and more complex models have much poorer predictions over the various samples.

5.6.1 Components

The components of the $APSE$ provide some insight into the effect of increasing the complexity of the predictor function. The components can be plotted on a logarithmic scale again.

The $APSE$ function and its four components are shown in the figure. $APSE$ follows the``U’’ shape seen earlier: beginning high at degree 0, quickly dropping to its minumum at degree 1, and then rising thereafter as the degree of the polynomial increases. This suggests that degree 1 (a straight line predictor) is the best choice of polynomial predictor function.

Of the three components, the average conditional variance var_y is least interesting and is constant regardless of the degree (as it must be matematically). More interesting is the behaviour of the other two components.

The red triangles show the variance of the predictor function from sample to sample. This variance decreases to its minimum at degree 1 and rises rapidly thereafter. As the degree increases the estimated predictor becomes more variable as it adapts more to the peculiarities of each sample.

In contrast, the blue squares show the squared bias of the predictor function. It too decreases quickly at the beginning and then tends to level off. This is typical of the bias; it decreases as the complexity of the predictor function increases. The reason for this is that more complex functions are typically able to adapt to the data in the sample.

The $APSE$ function combines both the variablility and the bias in the predictor function. Choosing the predictor function by the smallest $APSE$ value trades bias and variance off against one another, allowing one to increase provided the other decreases sufficiently to achieve a lower sum.

An interesting feature of this example is that the bias does not strictly decrease as the complexity increases. It gradually increases and at some point does so dramatically, much like the variance. Two things combine to cause this. First, the acreage data has a small number of outliers, counties that have a great many acres. The scatterplot of all counties is reproduced below.

If large acreage counties are not included in the sample then the polynomial predictor will use most of its complexity to better fit the small acreage (appearing in the lower left corner of the scatterplot). Second, polynomials notoriously head off to $\pm \infty$ at their left and right extremes so that if the outlying points are not in the sample they are likely to be very poorly fit by a polynomial. The greater the degree of the polynomial the more it adapts to the data used to fit it allowing it to head to $\pm \infty$ as soon as it leaves the range of the sample.

Exercise Repeat the above but with sample size $n=100$. Draw the graph of the components of $APSE$ and describe how it differs from that when $n=1000$. Explain whatever differences you see.

Exercise Use the sample $\samp{S}$ as the population $\pop{P}$ and calculate $APSE(\samp{S}, \widehat{\mu}_{\samp{S}}(\ve{x}))$ (i.e. instead of $APSE(\pop{P}, \widehat{\mu}_{\samp{S}}(\ve{x}))$) for the polynomial predictors. Graph the resulting value of $APSE(\samp{S}, \widehat{\mu}_{\samp{S}}(\ve{x}))$ as a function of increasing complexity. Describe what you see. Explain why this is different from the graph for $APSE(\pop{P}, \widehat{\mu}_{\samp{S}}(\ve{x}))$.

Exercise: Instead of polynomials, consider predictors which are “smoothing splines”. These adapt much more to the local behaviour of the relationship between a response variate $y$ and an explanatory variate $x$. They are fit in R using smooth.spline(x, y, df = df). The parameter df is the “effective degrees of freedom” and is a measure of the complexity of the smoother (and hence its adaptability to the data). Rewrite whatever functions (including getmuhat(...) ) so that the $APSE$ function, and its components, may be graphed for this predictor function and a number of complexity values (here df). Describe the bias-variance tradeoff for these predictors (try both $n=100$ and $n=1000$).

Exercise: Change the getSampleComp(...) function so that it does stratified random sampling where strata are the states. How does this change (if at all) the behaviour of $APSE$ and its components for the polynomial predictors of acreage? Comment on your findings.

5.7 The harsh reality

As the examples show, predictive accuracy provides insight into the performance of a predictor and can be used to choose between competing ones. The key to this usefulness however is that the predictive accuracy can be measured on population $\pop{P}$ about which we we want to make inference.

Unfortunately, reality rarely provides us with data in $\pop{P}$ that is not in our sample $\samp{S}$. That is, we typically have no more than $\samp{S}$ and so have neither $\pop{P}$ nor $\pop{T}= \pop{P} - \samp{S}$.

The situation is not however unfamiliar – it is the basic problem of inductive inference. Experience there says that whenever interest lies in some attribute of the population, $a(\pop{P})$ say, we might use $a(\samp{S})$ as an estimate of that attribute. This can be fairly reliable for many attributes based on properly selected samples $\samp{S}$.

5.7.1 Predictive accuracy as a population attribute

Analogously, we cast predictive accuracy as an attribute of the $\pop{P}$ and then use the corresponding attribute evaluated on $\samp{S}$ as an its estimate.

We could, for example choose to make comparisons based on the population attribute: \[ APSE(\pop{P}, \widehat{\mu}_\samp{S}) = \frac{1}{N} \sum_{i \in \pop{P}} (y_i - \widehat{\mu}_\samp{S}({\bf x}_i))^2 \] which has, as part of its definition, the particular sample $\samp{S}$ that was chosen. Call this the single subset version.

Alternatively, we might choose to population attribute based on multiple (e.g. all possible) samples $\samp{S}$ from $\pop{P}$. This can be written as \[ APSE(\pop{P}, \widetilde{\mu}) = \frac{1}{N_\samp{S}} \sum_{j=1}^{N_\samp{S}} APSE(\pop{P}, \widehat{\mu}_{\samp{S}_j}). \] This depends on all of the samples $\samp{S}_1, \ldots, \samp{S}_{N_\samp{S}}$ but remains a population attribute nevertheless. Call this the multiple subset version.

In either case, the attribute is a function of both the estimated predictor function $\widehat{\mu}(\ve{x})$ and of the samples used. These are two distinct population attributes, each a slightly different measure of an average prediction squared error.

Finally, since we are really more concerned with how well each estimator performs on that part of the population which was not used to construct the estimate, rather than average over all units in the population $\pop{P}$ we average over those units in $\pop{T}=\pop{P} - \pop{S}$. This change can be particularly valuable when $N_\pop{T} << N$.

Even with $\pop{P}$ replaced by $\pop{T}$ (or by $\pop{T}_j$) in the above, the two accuracy measures remain attributes of the population $\pop{P}$.

We now visit each of these in turn.

5.7.1.1 The single subset version

The first is based on a single subset $\samp{S}$ and a single predictor function $\widehat{\mu}_\samp{S}({\bf x})$ constructed from that one sample $\samp{S}$. The prediction errors are evaluated on the single set $\pop{T} = \pop{P} - \samp{S}$. \[ APSE(\pop{T}, \widehat{\mu}_\samp{S}) = \frac{1}{N_\pop{T}} \sum_{i \in \pop{T}} (y_i - \widehat{\mu}_\samp{S}({\bf x}_i))^2 \] is seen to be a population attribute like any other $a(\pop{P})$. It might seem unusual because it distinguishes subsets of $\pop{P}$ and uses them in different ways.

Following the same logic, we use the sample $\samp{S}$ in place of $\pop{P}$ to estimate the $APSE(\pop{P}, \widehat{\mu}_\samp{S})$. To emphasize that we now want to think of $\samp{S}$ as if it were the population $\pop{P}$, we let $\pop{P}_0$ denote $\samp{S}$.

Given our population substitute $\pop{P}_0$, we choose a subset $\samp{S}_0$ from $\pop{P}_0$ and denote its complement in $\pop{P}_0$ by $\pop{T}_0$. Then evaluating the attribute using $\pop{T}_0$ gives \[ APSE(\pop{T}_0, \widehat{\mu}_{\samp{S}_0}) = \frac{1}{N_{\pop{T}_0}} \sum_{i \in {\pop{T}_0}} (y_i - \widehat{\mu}_{\samp{S}_0}({\bf x}_i))^2. \] This could then serve as an estimate of $APSE(\pop{T}, \widehat{\mu}_\samp{S})$ and so used to choose between competing predictors $\widehat{\mu}$ as before. Given this notation, we could also write \[ \widehat{APSE}(\pop{P}, \widehat{\mu}_\samp{S}) = APSE(\pop{P}_0, \widehat{\mu}_{\samp{S}_0}). \]

Because the estimate $\widehat{\mu}_{\samp{S}_0}({\bf x})$ is determined only from observations in $\samp{S}_0$, this collection of observations is sometimes called the training set. The language is based on the metaphor that estimation of a prediction function is like learning the predictor function from the data (and we sometimes say $\samp{S}_0$ is used to “train” the predictor function).

Analogously, the out of sample set $\pop{T}_0$ is also often called the test set since it is used to asess the quality of the “learning”. The test set is also more traditionally called a hold-out sample to not be used in estimation but to assess the quality of prediction. For the same reason it has also long been called a validation set.

Of course arises the question as to how to pick $\samp{S}_0$ from $\pop{P}_0$ now arises.

5.7.1.2 The multiple subset version

The second predictive accuracy measure differs from the first principally in that it is based on many subsets, $\samp{S}_j$ of $\pop{P}$, for each of which a predictor $\widehat{\mu}_{\samp{S}_j}({\bf x})$ is constructed and its performance evaluated on the complement set, $\pop{T}_j$. The average is taken of these performances over all $N_\samp{S}$ possible samples.
\[ APSE(\pop{P}, \widetilde{\mu}) = \frac{1}{N_\samp{S}} \sum_{j=1}^{N_\samp{S}} APSE(\pop{P}, \widehat{\mu}_{\samp{S}_j}) \] is a population attribute that uses many subsets in its definition.

The sample version of this simply replaces $\pop{P}$ by the sample in hand, again now denoted as $\pop{P}_0 = \samp{S}$. That is we write it as \[ APSE(\pop{P}_0, \widetilde{\mu}) = \frac{1}{N_\samp{S}} \sum_{j=1}^{N_\samp{S}} APSE(\pop{P}_0, \widehat{\mu}_{\samp{S}_j}). \] Notationally the only difference here is that the first argument of the $APSE$ is now $\pop{P}_0$ and not $\pop{P}$. The consequence is that each $S_j$ in the definition is now a subset of $\pop{P}_0$ and $\pop{T}_j$ is its complement in $\pop{P}_0$ not in $\pop{P}$. Given this notation, we could write \[ \widehat{APSE}(\pop{P}, \widetilde{\mu}) = APSE(\pop{P}_0, \widetilde{\mu}). \]

As with a single subset, the question remains as to how to pick the subsets $\samp{S}_j$ from $\pop{P}_0$ for $j=1, \ldots, N_\samp{S}$.

5.7.2 Choosing the subsets

It is not always obvious how one should choose $\samp{S}_j$ and $\pop{T}_j$ in a given situation.

One guide is that the method of selecting $\samp{S}_j$ from $\pop{P}_0$ should be as similar as possible to that of selecting the sample $\samp{S}$ from the study population $\pop{P}$. That is, the same sampling mechanism would be used. For example, if $\samp{S}$ is a sample chosen at random from $\pop{P}$, then so should $\samp{S}_j$ be one chosen at random from $\pop{P}_0 = \samp{S}$. Typically this is what is done. However in general there could be different choices made depending on other aspects of the scientific context.

Suppose each subset $\samp{S}_j$ is to be chosen at random from $\pop{P}_0$. There are still several questions to ask. First, should the sampling be done with, or without, replacement? Second, how large should each sample $\samp{S}_j$ be? Should $\pop{T}_j$ be the full complement of $\samp{S}_j$ or just a sample from the complement? If the latter, how large should that $\pop{T}_j$ be? Finally, how many samples $S_j$ should we take? One? Many? How many?

To address the first, sampling with replacement would allow the possibility that the observations used in constructing the predictor function might also be used to assess its performance. Since the predictive accuracy is meant to be an “out-of-sample” assessment, it would seem more prudent to restrict ourselves to sampling without replacement. Sampling without replacement reduces the possibility of overestimating the predictor’s accuracy.

As to how large the sample should be, we can get some insight from the fact that the predicted squared errors are averaged. For example, we know that if we have $N_{\pop{T}_j}$ observations in a test set $\pop{T}_j$, then the standard deviation of an average over the set will decrease as $N_{\pop{T}_j}$ increases proportionately to $1/\sqrt{N_{\pop{T}_j}}$. The larger is $N_{\pop{T}_j}$ the better (i.e. less variable) will be our estimate of the average squared error.

Conversely, the larger is $\pop{T}_j$ the smaller is $\samp{S}_j$, the size of the training set. The smaller the training set is, the lower is the quality of the predictor function $\widehat{\mu}_{\samp{S}_1}({\bf x})$ on that training set. That could easily lead to systematically underestimating the predictor accuracy for the full population.

Choosing a sample size requires some tradeoff between the variability and the bias of the estimate. The sample size of $\samp{S}_j$ needs to be large enough to ensure that the predictor function will have stabilized and have low squared error, yet small enough so that its complement $\pop{T}_j$ is large enough to have small variability in estimating the prediction error over $\pop{T}_j$.

5.7.2.1 By partitioning ${\cal P}_0$

A simple way to create a sample $\samp{S}_j$ is to partition $\pop{P}_0$ into pieces, or groups, and then select some groups to form $\samp{S}_j$ and the remainder to form $\pop{T}_j$.

Typically, $\pop{P}_0$ is partitioned into $k$ groups $G_1, G_2, \ldots, G_k$ of equal size (approximately equal in practice). We call this a $k$-fold partition of $\pop{P}_0$:

k-fold partition figure

Selecting any set of groups from the partition will define a sample $\samp{S}_j$ and the remaining groups will define its complement $\pop{T}_j$.

The most common (and simplest) means of selecting the groups would be to select $k-1$ groups to form $\samp{S}_j$ and the remaining groups to form $\pop{T}_j$. For example, when $k=5$ we have the following partition of $\pop{P}_0$ with the green groups forming the sample and the red group forming the test.

5-fold partition figure

In this case, $\samp{S}_j = G_1 \cup G_2 \cup G_3 \cup G_5$ and $\pop{T}_j = G_4$.

Note that for a $k$-fold partition there can only be $k$ different pairs of sample $\samp{S}_j$ and test set $\pop{T}_j$. That is $N_\samp{S}=k$.

Calculating $APSE(\pop{P}_0, \widetilde{\mu})$ using sampling that selects all $k-1$ groups from a $k$-fold partition is known as $k$-fold cross-validation in the literature.

Several questions remain. For example,

What value should $k$ take?
Clearly a large value of $k$ will produce a large sample $\samp{S}_j$ but a smaller test set $\pop{T}_j$. A predictor based on a larger $\samp{S}_j$ (i.e. larger $k$) should be closer to that based on all of $\samp{S}$; one based on smaller $\samp{S}_j$ (i.e. smaller $k$) should perform more poorly (being based on fewer observations) and so tend to systematically overestimate the prediction error.
How should the partition be constructed?
Simple random sampling is the obvious choice, but there may be contexts where other sampling protocols might also be considered.
Should we consider only one such partition?
If we have $p$ partitions, then $N_\samp{S} = p \times k$ and our estimate $~APSE(\pop{P}_0, \widetilde{\mu})$ of $~APSE(\pop{P}, \widetilde{\mu})$ should be less variable. Note, however, the larger the overlap in the samples $\samp{S}_j$ (i.e. the larger the value of $k$) the greater will be the correlation between the estimated predictor functions in each element of the sum. In determining the variance of the sum then, a $2 \times Cov(\widehat{\mu}_{\samp{S}_j}, \widehat{\mu}_{\samp{S}_k})$ term would be positive and inflating the variance of the average.

The first and third points suggest that even here a bias-variance trade-off must exist for selecting $k$.

Basu, D. 1958. “On Sampling with and Without Replacement.” SankhyÄ: The Indian Journal of Statistics (1933-1960) 20 (3/4). Springer: 287–94. http://www.jstor.org/stable/25048396.

Blatt, Ben. 2013. “Here’s Waldo.” Slate. http://www.slate.com/articles/arts/culturebox/2013/11/where_s_waldo_a_new_strategy_for_locating_the_missing_man_in_martin_hanford.html.

Efron, B. 1979. “Bootstrap Methods: Another Look at the Jackknife.” The Annals of Statistics 7 (1): 1–26. www.jstor.org/stable/2958830.

Efron, Bradley, and Robert J Tibshirani. 1994. An Introduction to the Bootstrap. CRC press.

Horvitz, Daniel G, and Donovan J Thompson. 1952. “A Generalization of Sampling Without Replacement from a Finite Universe.” Journal of the American Statistical Association 47 (260). Taylor & Francis Group: 663–85.

Lohr, Sharon. 2009. Sampling: Design and Analysis. Nelson Education.

Moro, Sérgio, Paulo Rita, and Bernardo Vala. 2016. “Predicting Social Media Performance Metrics and Evaluation of the Impact on Brand Building: A Data Mining Approach.” Journal of Business Research 69 (9). Elsevier: 3341–51. http://dx.doi.org/10.1016/j.jbusres.2016.02.010.

Tukey, John W. 1977. Exploratory Data Analysis. Reading, Mass.

STAT 341: Computational Statistics & Data Analysis

R.W. Oldford