
CHAPTER 5: EVALUATING HYPOTHESES

Empirically evaluating the accuracy of hypotheses is fundamental to machine learning. This chapter presents an introduction to statistical methods for estimating hypothesis accuracy, focusing on three questions. First, given the observed accuracy of a hypothesis over a limited sample of data, how well does this estimate its accuracy over additional examples? Second, given that one hypothesis outperforms another over some sample of data, how probable is it that this hypothesis is more accurate in general? Third, when data is limited, what is the best way to use this data to both learn a hypothesis and estimate its accuracy?

Because limited samples of data might misrepresent the general distribution of data, estimating true accuracy from such samples can be misleading. Statistical methods, together with assumptions about the underlying distributions of data, allow one to bound the difference between observed accuracy over the sample of available data and the true accuracy over the entire distribution of data.

5.1 MOTIVATION

In many cases it is important to evaluate the performance of learned hypotheses as precisely as possible. One reason is simply to understand whether to use the hypothesis. For instance, when learning from a limited-size database indicating the effectiveness of different medical treatments, it is important to understand as precisely as possible the accuracy of the learned hypotheses. A second reason is that evaluating hypotheses is an integral component of many learning methods. For example, in post-pruning decision trees to avoid overfitting, we must evaluate the impact of possible pruning steps on the accuracy of the resulting decision tree. Therefore it is important to understand the likely errors inherent in estimating the accuracy of the pruned and unpruned tree.

Estimating the accuracy of a hypothesis is relatively straightforward when data is plentiful. However, when we must learn a hypothesis and estimate its future accuracy given only a limited set of data, two key difficulties arise:

- Bias in the estimate. First, the observed accuracy of the learned hypothesis over the training examples is often a poor estimator of its accuracy over future examples. Because the learned hypothesis was derived from these examples, they will typically provide an optimistically biased estimate of hypothesis accuracy over future examples. This is especially likely when the learner considers a very rich hypothesis space, enabling it to overfit the training examples. To obtain an unbiased estimate of future accuracy, we typically test the hypothesis on some set of test examples chosen independently of the training examples and the hypothesis.
- Variance in the estimate. Second, even if the hypothesis accuracy is measured over an unbiased set of test examples independent of the training examples, the measured accuracy can still vary from the true accuracy, depending on the makeup of the particular set of test examples. The smaller the set of test examples, the greater the expected variance.

This chapter discusses methods for evaluating learned hypotheses, methods for comparing the accuracy of two hypotheses, and methods for comparing the accuracy of two learning algorithms when only limited data is available. Much of the discussion centers on basic principles from statistics and sampling theory, though the chapter assumes no special background in statistics on the part of the reader. The literature on statistical tests for hypotheses is very large.
This chapter provides an introductory overview that focuses only on the issues most directly relevant to learning, evaluating, and comparing hypotheses.

5.2 ESTIMATING HYPOTHESIS ACCURACY

When evaluating a learned hypothesis we are most often interested in estimating the accuracy with which it will classify future instances. At the same time, we would like to know the probable error in this accuracy estimate (i.e., what error bars to associate with this estimate).

Throughout this chapter we consider the following setting for the learning problem. There is some space of possible instances $X$ (e.g., the set of all people) over which various target functions may be defined (e.g., people who plan to purchase new skis this year). We assume that different instances in $X$ may be encountered with different frequencies. A convenient way to model this is to assume there is some unknown probability distribution $\mathcal{D}$ that defines the probability of encountering each instance in $X$ (e.g., $\mathcal{D}$ might assign a higher probability to encountering 19-year-old people than 109-year-old people). Notice that $\mathcal{D}$ says nothing about whether $x$ is a positive or negative example; it only determines the probability that $x$ will be encountered. The learning task is to learn the target concept or target function $f$ by considering a space $H$ of possible hypotheses. Training examples of the target function $f$ are provided to the learner by a trainer who draws each instance independently, according to the distribution $\mathcal{D}$, and who then forwards the instance $x$ along with its correct target value $f(x)$ to the learner.

To illustrate, consider learning the target function "people who plan to purchase new skis this year," given a sample of training data collected by surveying people as they arrive at a ski resort. In this case the instance space $X$ is the space of all people, who might be described by attributes such as their age, occupation, how many times they skied last year, etc. The distribution $\mathcal{D}$ specifies for each person $x$ the probability that $x$ will be encountered as the next person arriving at the ski resort. The target function $f : X \to \{0, 1\}$ classifies each person according to whether or not they plan to purchase skis this year.

Within this general setting we are interested in the following two questions:

1. Given a hypothesis $h$ and a data sample containing $n$ examples drawn at random according to the distribution $\mathcal{D}$, what is the best estimate of the accuracy of $h$ over future instances drawn from the same distribution?
2. What is the probable error in this accuracy estimate?

5.2.1 Sample Error and True Error

To answer these questions, we need to distinguish carefully between two notions of accuracy or, equivalently, error. One is the error rate of the hypothesis over the sample of data that is available. The other is the error rate of the hypothesis over the entire unknown distribution $\mathcal{D}$ of examples. We will call these the sample error and the true error respectively.
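The difference between the two error rates can be sketched with a small simulation. Everything in it is an assumption made for illustration: the instance space is reduced to ages 10 through 89, the distribution over ages is invented (and, unlike in practice, known, which is what lets the true error be computed exactly), and the target function `f` and hypothesis `h` are hypothetical age-range rules.

```python
import random

# Hypothetical toy setting; every name and number here is an assumption.
INSTANCES = list(range(10, 90))  # instance space X: ages 10..89
# Unnormalized distribution over X: ages under 50 are twice as likely.
WEIGHTS = [2.0 if age < 50 else 1.0 for age in INSTANCES]


def f(age):
    """Assumed target function: 1 if the person plans to buy skis."""
    return 1 if 20 <= age < 40 else 0


def h(age):
    """Assumed learned hypothesis, slightly different from f."""
    return 1 if 25 <= age < 45 else 0


def true_error(hyp, target, instances, weights):
    """Probability, under the distribution, that hyp and target disagree."""
    total = sum(weights)
    wrong = sum(w for x, w in zip(instances, weights) if hyp(x) != target(x))
    return wrong / total


def sample_error(hyp, target, sample):
    """Fraction of the sample on which hyp and target disagree."""
    return sum(1 for x in sample if hyp(x) != target(x)) / len(sample)


def draw_sample(n, rng):
    """Draw n instances independently according to the distribution."""
    return rng.choices(INSTANCES, weights=WEIGHTS, k=n)


if __name__ == "__main__":
    rng = random.Random(0)
    print("true error:", true_error(h, f, INSTANCES, WEIGHTS))
    for n in (40, 400, 4000):
        print(n, "examples, sample error:", sample_error(h, f, draw_sample(n, rng)))
```

The true error is a fixed property of `h` and the distribution, while the sample error changes from one drawn sample to the next and tends to settle near the true error as the sample grows.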
The sample error of a hypothesis with respect to some sample $S$ of instances drawn from $X$ is the fraction of $S$ that it misclassifies:

Definition: The sample error (denoted $error_S(h)$) of hypothesis $h$ with respect to target function $f$ and data sample $S$ is

$$error_S(h) \equiv \frac{1}{n} \sum_{x \in S} \delta(f(x), h(x))$$

where $n$ is the number of examples in $S$, and the quantity $\delta(f(x), h(x))$ is 1 if $f(x) \neq h(x)$, and 0 otherwise.

The true error of a hypothesis is the probability that it will misclassify a single randomly drawn instance from the distribution $\mathcal{D}$.

Definition: The true error (denoted $error_{\mathcal{D}}(h)$) of hypothesis $h$ with respect to target function $f$ and distribution $\mathcal{D}$ is the probability that $h$ will misclassify an instance drawn at random according to $\mathcal{D}$:

$$error_{\mathcal{D}}(h) \equiv \Pr_{x \in \mathcal{D}}[f(x) \neq h(x)]$$

Here the notation $\Pr_{x \in \mathcal{D}}$ denotes that the probability is taken over the instance distribution $\mathcal{D}$.

What we usually wish to know is the true error $error_{\mathcal{D}}(h)$ of the hypothesis, because this is the error we can expect when applying the hypothesis to future examples. All we can measure, however, is the sample error $error_S(h)$ of the hypothesis for the data sample $S$ that we happen to have in hand. The main question considered in this section is "How good an estimate of $error_{\mathcal{D}}(h)$ is provided by $error_S(h)$?"

5.2.2 Confidence Intervals for Discrete-Valued Hypotheses

Here we give an answer to the question "How good an estimate of $error_{\mathcal{D}}(h)$ is provided by $error_S(h)$?" for the case in which $h$ is a discrete-valued hypothesis. More specifically, suppose we wish to estimate the true error for some discrete-valued hypothesis $h$, based on its observed sample error over a sample $S$, where

- the sample $S$ contains $n$ examples drawn independent of one another, and independent of $h$, according to the probability distribution $\mathcal{D}$
- $n \geq 30$
- hypothesis $h$ commits $r$ errors over these $n$ examples (i.e., $error_S(h) = r/n$).

Under these conditions, statistical theory allows us to make the following assertions:

1. Given no other information, the most probable value of $error_{\mathcal{D}}(h)$ is $error_S(h)$.
2. With approximately 95% probability, the true error $error_{\mathcal{D}}(h)$ lies in the interval

$$error_S(h) \pm 1.96 \sqrt{\frac{error_S(h)\,(1 - error_S(h))}{n}}$$

To illustrate, suppose the data sample $S$ contains $n = 40$ examples and that hypothesis $h$ commits $r = 12$ errors over this data. In this case, the sample error $error_S(h) = 12/40 = 0.30$. Given no other information, the best estimate of the true error $error_{\mathcal{D}}(h)$ is the observed sample error 0.30. However, we do not expect this to be a perfect estimate of the true error. If we were to collect a second sample $S'$ containing 40 new randomly drawn examples, we might expect the sample error $error_{S'}(h)$ to vary slightly from the sample error $error_S(h)$.
We expect a difference due to the random differences in the makeup of $S$ and $S'$. In fact, if we repeated this experiment over and over, each time drawing a new sample $S_i$ containing 40 new examples, we would find that for approximately 95% of these experiments, the calculated interval would contain the true error. For this reason, we call this interval the 95% confidence interval estimate for $error_{\mathcal{D}}(h)$. In the current example, where $r = 12$ and $n = 40$, the 95% confidence interval is, according to the above expression, $0.30 \pm (1.96 \cdot 0.07) = 0.30 \pm 0.14$.
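The interval calculation and the repeated-sampling reading of it can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `confidence_interval` and `coverage` are hypothetical, and the simulation assumes a known true error rate, which is never available in practice, so that it can count how often the computed interval contains it.

```python
import math
import random


def confidence_interval(r, n, z=1.96):
    """Approximate interval for the true error from r errors on n examples.

    z = 1.96 corresponds to the 95% level used in the text.
    Returns (sample_error, lower, upper).
    """
    if n <= 0 or not 0 <= r <= n:
        raise ValueError("need n > 0 and 0 <= r <= n")
    p = r / n  # sample error
    half_width = z * math.sqrt(p * (1.0 - p) / n)
    return p, p - half_width, p + half_width


def coverage(true_err, n, trials, rng):
    """Fraction of simulated samples whose interval contains true_err.

    Each trial draws n independent examples, each misclassified with
    probability true_err (assumed known here for illustration only).
    """
    hits = 0
    for _ in range(trials):
        r = sum(1 for _ in range(n) if rng.random() < true_err)
        _, lower, upper = confidence_interval(r, n)
        if lower <= true_err <= upper:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    p, lower, upper = confidence_interval(12, 40)
    print(f"sample error {p:.2f}, interval [{lower:.2f}, {upper:.2f}]")
    rng = random.Random(0)
    print("simulated coverage:", coverage(0.30, 40, 2000, rng))
```

For the worked example, `confidence_interval(12, 40)` gives a half-width of about 0.14, matching the text. Because the interval rests on an approximation, the simulated coverage should come out close to 95% but not exactly equal to it.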
