1.1 Sampling

We sample from a population[1] in order to make inferences about the population as a whole. How are we to collect the sample? What are the consequences of sampling for the inferences? People are often surprised to hear that the best way to pick samples is randomly. The process has two steps. (There are more complicated arrangements, but we won’t consider them here.)

Step 1: Identify and delineate the population of interest. This identification is not done at random and often involves some expert knowledge about the subject of study. In a political poll, for instance, the population might consist of people who answer their phones and who respond affirmatively when asked if they are registered voters. In a study of cancer treatments, the population might consist of all the patients who present themselves at one of the participating clinics and who are diagnosed as having a particular type of tumor.

Step 2: Pick the sample at random from the population identified in Step 1. Care must be taken to ensure that the sample really is taken at random. In a cancer experiment, the population might be the first 1000 qualified patients who happen to come to the clinic and who sign a form indicating willingness to participate in the experiment.[2] Then, for each such patient, a computer is used to generate a random code indicating whether the patient will be in the treatment group or in the control group.

Television and radio stations sometimes conduct polls by asking their audience to telephone the station. Such polls say little about the overall population; the respondents are self-selected rather than randomly selected, and self-selected respondents tend to be those with strong opinions. The result can be a poll that is wildly misleading about the overall population.

[1] The word “population” suggests that we are talking only about people or animals. In statistics, though, “population” has a broader meaning and can refer to any collection.
[2] This constitutes a random sampling from a particular population: those people who go for treatment to cancer clinics. If, however, the clinic announces its planned experiment publicly, and patients who have previously proved untreatable converge on the clinic, the sample would not be randomly drawn from all patients seeking cancer treatment.
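The random assignment in Step 2 is easy to carry out on a computer. The sketch below is in Python rather than the book’s MATLAB, and the patient IDs are made up for illustration; any source of random numbers works the same way.

```python
import random

# Hypothetical patient IDs (made up for illustration); in the study these
# would be the qualified patients who signed the consent form.
patients = ["patient%03d" % i for i in range(1, 11)]

random.seed(0)  # fixed seed so the sketch is reproducible

# For each patient, generate a random code: treatment or control,
# each equally likely, independent of everything about the patient.
assignment = {p: random.choice(["treatment", "control"]) for p in patients}

for p in patients:
    print(p, assignment[p])
```

Because the code is generated independently for each patient, nothing about the patient (or the clinic staff) can influence which group a patient lands in.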
Random sampling is important because it helps in avoiding the systematic influence of confounding variables. A famous example is the 1954 polio study described in reference [?]. As part of this study, the parents of children in second grade were asked to give permission for their children to be injected with an experimental polio vaccine. The children whose parents said yes were given the vaccine. The children whose parents said no were put into the control group. This is not random sampling; the two groups differ systematically in terms of the confounding variable of parental permission. You might find it hard to believe that parental permission has anything to do with polio. As things turned out, however, there was a strong connection between the willingness of parents to give permission and the risk of polio.[3] How do we know that parental permission actually is associated with the risk of polio? Because there was a simultaneous second study which randomly selected children for actual vaccination from the sub-group of children whose parents gave permission; the other children with permission were injected with a sterile placebo.

One important consequence of using random sampling is that we encounter sampling variability: different samples provide somewhat different results. Consider a thought experiment in which we imagine a population of 1000 people, 10 of whom have a certain genetic trait, a 1% prevalence rate. Take a random sample of 100 people from the population. This sample might have no people with the trait, or one, or two, or up to 10. The exact number is random. Suppose that the sample has 2 people with the trait. If we knew only about the sample, and not about the whole population, would we be justified in concluding that the population has a 2% prevalence rate? To some extent the answer is yes, but we also realize that the rate in the population might not be exactly the same as the rate in our random sample.
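The thought experiment above is easy to carry out on a computer. A sketch in Python (illustrative only; the book’s own examples use MATLAB): build a population of 1000 people, 10 of whom carry the trait, draw a random sample of 100, and count the carriers. Repeating the draw many times shows the sampling variability directly.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

# A population of 1000 people: 10 carry the trait (coded 1), 990 do not (0).
population = [1] * 10 + [0] * 990

# Draw one random sample of 100 people, without replacement, and count carriers.
sample = random.sample(population, 100)
print("carriers in this sample:", sum(sample))

# Different samples give different counts; repeat the draw to see the spread.
counts = [sum(random.sample(population, 100)) for _ in range(1000)]
for k in range(4):
    print(k, "carriers appeared in", counts.count(k), "of 1000 samples")
```

The tallies printed at the end are the point: no single sample count can be trusted as the exact population rate, but the spread of counts is itself quite regular.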
In addition to information about our sample itself, we also have other knowledge: our knowledge of the sampling process itself and the way it leads to random variability. One of the main tasks of statistical inference is to characterize sampling variability in a way that puts reasonable bounds on what we can conclude about the population from our sample of it. For instance, if our sample of 100 shows a 2% prevalence, we would be quite confident that the prevalence in the population as a whole is not 90%, but we might not be so sure that it isn’t 5%.

By using random sampling when collecting data we gain the ability to use mathematical techniques to analyze the sampling variability. Often, we simply assume that the sample is random because this makes it easy to do statistical inference calculations. Of course, to the extent that the assumption is wrong, the results of our calculations may be misleading and unreliable. Our use of random sampling produces an uncertainty about our results that can be quantified and analyzed objectively. Failure to use random sampling results in uncertainties that are themselves unknown and subjective, and leads overall to a higher level of ignorance.

[3] This has been explained as resulting from the higher educational levels of parents who give permission. (See reference [?].) Higher educational level is associated with more sanitary living conditions and consequently a reduced exposure to the polio virus during early childhood. An early exposure to polio can result in a natural immunity without noticeable symptoms.

1.2 Resampling

The basic idea of resampling is very simple: mimic the process of sampling by picking another sample at random from a hypothetical population of interest, usually based on data from your sample. Much of what we study in statistical inference is the variability introduced by taking a sample of a population. When it is too expensive or impractical to sample more data from the population itself, we can study sampling variability by a simple expedient: we sample instead from an artificial population, constructed on our computer, that embodies everything we know about the population of interest. In many, but not all, of the examples that follow, this artificial population is the very data set from which we seek to draw inferences. Since the data set is itself a sample of the whole population, we are taking a sample from the sample: resampling. This doesn’t, of course, provide more information about the population, but it does provide us with a way of understanding the consequences of sampling variability for drawing inferences about the population based on our data.

One problem in mimicking the sampling process is that generally the population is much bigger than the sample.
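Using the data set itself as the artificial population takes only a few lines of code. In this Python sketch (illustrative only; the data values are hypothetical) the new sample is drawn with replacement, so that the small data set can stand in for a large population.

```python
import random

random.seed(2)  # fixed seed so the sketch is reproducible

# A hypothetical data set: the one sample we actually have.
data = [7.1, 4.8, 5.5, 6.2, 5.9, 4.4, 6.8, 5.1, 5.6, 6.0]

# Use the data itself as the artificial population: draw a new sample of
# the same size, with replacement, so values may repeat or be left out.
resample = [random.choice(data) for _ in range(len(data))]
print(resample)
```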
When carrying out resampling, though, our data set is often exactly the same size as the resample to be taken. In order to make the data set seem larger than it is, we resample with replacement. For example, when resampling 10 points from a data set of 10 values, we keep all 10 values available when selecting each of the resampled points. It is as if we wrote all 10 values on individual slips of paper and put them in a hat. To resample 10 points, we select one slip at random from the hat and write its value on a tally sheet. We then replace the slip in the hat and repeat the process until we have 10 values on the tally sheet. This means that some of the original data values might appear more than once on the tally sheet; others might appear not at all.

Based on this simple process of resampling, a number of techniques have been developed for answering questions in statistical inference. It is only in the last 20 years that resampling has achieved widespread use. This is actually a substantial fraction of the 90 or so years since the advent of modern statistical inference, but historically it is true that most statisticians have been trained, and most statistics students have learned, non-resampling ways of doing statistical inference. Resampling is generally impractical without access to computers. But fast, inexpensive computing is now a fact of life, and this allows us to exploit an important advantage of resampling in teaching statistics: it is a natural and intuitive way of solving problems in statistical inference.

1.3 An Introduction to Probability

We are surrounded in everyday life by statements involving probability. Most people have learned how to translate these statements into operational terms. “The chance of rain is 10% today” means that it is OK to plan a picnic. “This surgery has a survival rate of 98%” means that the surgery is risky, but that it is worthwhile if it will resolve a fatal disorder. “The probability of a fatal car accident is 1 per 5,000,000 vehicle miles” means that driving a car is fairly safe. “The odds of winning the PowerBall lottery are 1 in 180 million” means not to quit your job because you think your ticket is lucky.

Where people have difficulty is in using probabilities to make quantitative judgements or to analyze complicated situations. This is where Resampling Stats can help.

Where do probabilities come from? When flipping a coin, what is the probability of getting heads? Everyone knows that heads and tails are equally likely: the probability of each is 50%. How do we know this?
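One empirical answer is to flip the coin many times and tally the heads, and a simulation makes this painless. A Python sketch for illustration; note that the fairness of the simulated coin (heads with probability 0.5) is an assumption built into the code, which is exactly what we want to examine in a real coin.

```python
import random

random.seed(3)  # fixed seed so the sketch is reproducible

# Simulate n flips of a fair coin and report the fraction of heads.
# Each flip is heads when a uniform random number falls below 0.5.
for n in (10, 100, 1000):
    heads = sum(random.random() < 0.5 for _ in range(n))
    print(n, "flips:", heads, "heads, fraction", heads / n)
```

As n grows, the fraction of heads settles down near one half, which is the operational meaning of “the probability of heads is 50%.”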
We could test the hypothesis by flipping the coin many times (10? 100? 1000?) and seeing whether heads comes up about half the time. But few of us have ever done this, and certainly we are not in the habit of vetting the coin to be used in a game by testing it many times. We know that the probability of heads is 50% because 1) this seems reasonable (“There’s no difference between heads and tails from the coin’s perspective, so why should one or the other come up more often?”) and 2) we have never read anywhere an exposé of the