Resampling Stats in MATLAB

This document is an excerpt from
Resampling Stats in MATLAB
Daniel T. Kaplan
Copyright (c) 1999 by Daniel T. Kaplan, All Rights Reserved

This document differs from the published book in pagination and in the omission (unintentional, but unavoidable for technical reasons) of figures and cross-references from the book. It is provided as a courtesy to those who wish to examine the book, but is not intended as a replacement for the published book, which is available from Resampling Stats, Inc. www.resample.com 703-522-2713

Chapter 1: Sampling, Resampling, and Inference

If one could go out and measure exactly and completely the quantities one is interested in, there would be no need for statistics. There are in fact many cases where this can be done; perhaps this is why so many people get so far in life without knowing anything about statistics. Want to know whether you have a fever? Use a thermometer to take your temperature. Want to know the price of an item in a store? Look at the price tag or ask the clerk. Want to know how many people there are in the United States? Go out and count them, a procedure stipulated by the census provisions of the U.S. Constitution to be performed every 10 years.

In reality, the census situation is not so simple. A count of approximately 280 million U.S. residents cannot be performed instantly. In the process of doing it some people who have already been counted will die, and others will be born without being counted. Some people will be counted twice since they will move from one residence to another during the counting period; others will not be counted for the same reasons. People in some segments of the population, such as homeless people and illegal immigrants, are just hard to count.
For these reasons, the Census Bureau has recently proposed, in the face of considerable controversy and opposition, not to make an explicit person-by-person count, but to sample the population and use statistical techniques to estimate the size of the population.

In many cases we have no choice but to base our conclusions on a sample rather than a full measurement. Suppose we want to find out which of two treatments for breast cancer is more effective. It would
be impossible to apply both treatments to every person who has breast cancer. Instead, we select a sample of people with the disease and apply one treatment. We take another sample, made up of different people, and apply the other treatment. We then compare the treatment outcomes in the two sample groups.

This procedure raises some important practical questions. How should the two samples be picked? How large should the samples be? If we do find a difference between the treatment outcomes in the two groups, how confident are we that it is not just a chance outcome, the way a flip of the coin will randomly favor one player or the other? If we find no difference in outcome, how sure are we that this is not because our sample groups are too small?

These are all questions of statistical inference: how we reason from a sample to the entire population of interest. Some other examples:

• A biologist studies the ecology of freshwater mussels. He is interested in whether the diversity of species is decreasing in the face of deterioration of the environment from pollution and from the introduction of rapidly proliferating non-native species such as the zebra mussel. He cannot conduct a complete census: doing so would kill all the mussels. Instead, he takes small samples and infers from the sample what is happening to the population as a whole. For example: if 8 different species are found this year as opposed to 12 in the previous year, is one justified in concluding that species diversity is falling?

• The price of health insurance is based on a calculation of how likely you are to get sick and how much it will cost to treat you if you do fall ill. The estimate of sickness rates is not primarily based on your own personal history (although factors such as smoking, age, gender, and so on play a role) but on data collected from a sample of the population.
Insurance companies and health maintenance organizations need to use data from a past sample of the population in order to draw conclusions about their present and future customers. How much money does the insurance company need to have in reserve in order to ensure that it can pay the bills for its customers?
1.1 Sampling

We sample from a population¹ in order to make inferences about the population as a whole. How are we to collect the sample? What are the consequences of sampling for the inferences?

People are often surprised to hear that the best way to pick samples is randomly. The process has two steps. (There are more complicated arrangements, but we won’t consider them here.)

Step 1: Identify and delineate the population of interest. This identification is not done at random and often involves some expert knowledge about the subject of study. In a political poll, for instance, the population might consist of people who answer their phones and who respond affirmatively when asked whether they are registered voters. In a study of cancer treatments, the population might consist of all the patients who present themselves at one of the participating clinics and who are diagnosed as having a particular type of tumor.

Step 2: Pick the sample at random from the population identified in Step 1. Care must be taken to ensure that the sample really is taken at random. In a cancer experiment, the population might be the first 1000 qualified patients who happen to come to the clinic and who sign a form indicating willingness to participate in the experiment.² Then, for each such patient, a computer is used to generate a random code indicating whether the patient will be in the treatment group or in the control group.

Television and radio stations sometimes conduct polls by asking their audience to telephone the station. Such polls say little about the overall population; the respondents are self-selected rather than randomly selected, and the self-selected respondents tend to be those with strong opinions. The result can be a poll that is wildly misleading about the overall population.

¹ The word “population” suggests that we are talking only about people or animals. In statistics, though, “population” has a broader meaning and can refer to any collection.
² This constitutes a random sampling from a particular population: those people who go for treatment to cancer clinics. If, however, the clinic announces its planned experiment publicly, and patients who have previously proved untreatable converge on the clinic, the sample would not be randomly drawn from all patients seeking cancer treatment.
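The random assignment described in Step 2 can be sketched in a few lines. This is only an illustration, written in Python rather than with the book's own software; the function name `assign_groups`, the patient numbering, and the fixed seed are assumptions made for the example, not part of the original study design.

```python
import random

def assign_groups(patient_ids, seed=None):
    """Give each patient a random code: 'treatment' or 'control'.

    Hypothetical helper for illustration. Each patient is assigned
    independently, like a fair coin flip, so the two groups will
    usually be close to, but not exactly, equal in size.
    """
    rng = random.Random(seed)  # a seed makes the example repeatable
    return {pid: rng.choice(["treatment", "control"]) for pid in patient_ids}

# The first 1000 qualified patients, numbered 1..1000 (assumed labels).
patients = list(range(1, 1001))
assignment = assign_groups(patients, seed=1)

n_treatment = sum(1 for group in assignment.values() if group == "treatment")
n_control = len(patients) - n_treatment
print("treatment:", n_treatment, "control:", n_control)
```

The point of the design is that nothing about the patient (age, severity of illness, willingness to volunteer) influences the code, so the two groups can differ only by chance.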
Random sampling is important because it helps in avoiding the systematic influence of confounding variables. A famous example is the 1954 polio study as described in reference [?]. As part of this study, the parents of children in second grade were asked to give permission for their children to be injected with an experimental polio vaccine. The children whose parents said yes were given the vaccine. The children whose parents said no were put into the control group. This is not random sampling; the two groups differ systematically in terms of the confounding variable of parental permission. You might find it hard to believe that parental permission has anything to do with polio. As things turned out, however, there was a strong connection between the willingness of parents to give permission and the risk of polio.³

How do we know in actuality that parental permission is associated with risk of polio? Because there was a simultaneous second study which randomly selected children for actual vaccination from the sub-group of children whose parents gave permission; the other children with permission were injected with a sterile placebo.

One important consequence of using random sampling is that we encounter sampling variability: different samples provide somewhat different results. Consider a thought experiment in which we imagine a population of 1000 people, 10 of whom have a certain genetic trait, a 1% prevalence rate. Take a random sample of 100 people from the population. This sample might have no people with the trait, or one, or two, or up to 10. The exact number is random. Suppose that the sample has 2 people with the trait. If we knew only about the sample, and not about the whole population, would we be justified in concluding that the population has a 2% prevalence rate? To some extent the answer is yes, but we also realize that the rate in the population might not be exactly the same as the rate in our random sample.
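The thought experiment can be tried on a computer by drawing many random samples and tallying how many people with the trait turn up in each. The following is a minimal sketch in Python, not the book's own software; the variable names, the number of trials, and the seed are assumptions chosen for illustration.

```python
import random
from collections import Counter

POPULATION_SIZE = 1000  # people in the imagined population
N_WITH_TRAIT = 10       # people who carry the trait (1% prevalence)
SAMPLE_SIZE = 100       # people drawn in each random sample
N_TRIALS = 10000        # number of repeated samples (an arbitrary choice)

# Represent the population as 1 (has the trait) or 0 (does not).
population = [1] * N_WITH_TRAIT + [0] * (POPULATION_SIZE - N_WITH_TRAIT)

def count_trait_in_sample(rng):
    """Draw one sample without replacement and count trait carriers."""
    return sum(rng.sample(population, SAMPLE_SIZE))

rng = random.Random(0)  # fixed seed so the run is repeatable
counts = Counter(count_trait_in_sample(rng) for _ in range(N_TRIALS))

# Show the fraction of samples that contained each number of carriers.
for k in sorted(counts):
    print(f"{k:2d} with trait: {counts[k] / N_TRIALS:.3f} of samples")
```

The exact fractions depend on the seed, but in runs of this kind most samples should contain zero, one, or two people with the trait. A sample showing a 2% rate is therefore not surprising even though the population rate is 1%.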
In addition to information about the sample itself, we also have other knowledge: our knowledge of the sampling process and the way it leads to random variability. One of the main tasks of statistical inference is to characterize sampling variability in a way that puts reasonable bounds on what we can conclude about the population from our sample of it. For instance, if our sample of 100 shows a 2% prevalence, we would be quite confident that

³ This has been explained as resulting from the higher educational levels of parents who give permission. (See reference [?].) This is associated with more sanitary living conditions and consequently a reduced exposure to the polio virus during early childhood. An early exposure to polio can result in a natural immunity without noticeable symptoms.