CSE176 Introduction to Machine Learning — Lecture notes
Miguel Á. Carreira-Perpiñán
EECS, University of California, Merced
November 28, 2016
These are notes for a one-semester undergraduate course on machine learning given by Prof. Miguel Á. Carreira-Perpiñán at the University of California, Merced. The notes are largely based on the book “Introduction to Machine Learning” by Ethem Alpaydın (MIT Press, 3rd ed., 2014), with some additions.
These notes may be used for educational, non-commercial purposes.
© 2015–2016 Miguel Á. Carreira-Perpiñán

1 Introduction

1.1 What is machine learning (ML)?
• Data is being produced and stored continuously (“big data”):
– science: genomics, astronomy, materials science, particle accelerators. . .
– sensor networks: weather measurements, traffic. . .
– people: social networks, blogs, mobile phones, purchases, bank transactions. . .
– etc.
• Data is not random; it contains structure that can be used to predict outcomes, or gain knowledge in some way.
Ex: patterns of Amazon purchases can be used to recommend items.
• It is more difficult to design algorithms for such tasks (compared to, say, sorting an array or
calculating a payroll). Such algorithms need data.
Ex: construct a spam filter, using a collection of email messages labelled as spam/not spam (a toy sketch follows this list).
• Data mining: the application of ML methods to large databases.
• Ex of ML applications: fraud detection, medical diagnosis, speech or face recognition. . .
• ML is programming computers using data (past experience) to optimize a performance criterion.
• ML relies on:
– Statistics: making inferences from sample data.
– Numerical algorithms (linear algebra, optimization): optimize criteria, manipulate models.
– Computer science: data structures and programs that solve an ML problem efficiently.
• A model:
– is a compressed version of a database;
– extracts knowledge from it;
– does not have perfect performance but is a useful approximation to the data.
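As a toy illustration of the spam-filter example above, here is a minimal naive Bayes sketch on a bag-of-words representation. The messages and labels are made up for illustration; a real filter would need a much larger labelled corpus.

```python
import math
from collections import Counter

# Toy labelled corpus (made up for illustration): 1 = spam, 0 = not spam.
messages = [
    ("win money now", 1),
    ("cheap pills win prize", 1),
    ("meeting agenda attached", 0),
    ("lunch tomorrow in the office", 0),
]

# Per-class word counts (bag-of-words) and class frequencies.
counts = {0: Counter(), 1: Counter()}
class_totals = Counter()
for text, label in messages:
    counts[label].update(text.lower().split())
    class_totals[label] += 1

vocab = set(counts[0]) | set(counts[1])

def predict(text):
    """Naive Bayes with add-one smoothing: return the class with the
    highest (unnormalized) log posterior."""
    scores = {}
    for label in (0, 1):
        score = math.log(class_totals[label] / len(messages))  # log prior
        total = sum(counts[label].values())
        for word in text.lower().split():
            score += math.log((counts[label][word] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(predict("win a prize"))   # 1 (spam) on this toy data
```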
1.2 Examples of ML problems
• Supervised learning: labels provided.
– Classification (pattern recognition):
∗ Face recognition. Difficult because of the complex variability in the data: pose and
illumination in a face image, occlusions, glasses/beard/make-up/etc.
(Figure: rows of training examples and test images of faces.)
∗ Optical character recognition: different styles, slant. . .
∗ Medical diagnosis: often, variables are missing (tests are costly).
∗ Speech recognition, machine translation, biometrics. . .
∗ Credit scoring: classify customers into high- and low-risk, based on their income and
savings, using data about past loans (whether they were paid or not).
– Regression: the labels to be predicted are continuous (a least-squares sketch follows the figure below):
∗ Predict the price of a car from its mileage.
∗ Navigating a car: predict the steering angle.
∗ Kinematics of a robot arm: predict the workspace location from the joint angles.
(Figures: credit scoring as classification in the (income, savings) plane, with rule “if income > θ1 and savings > θ2 then low-risk else high-risk”; regression of y: price against x: mileage, with fitted line y = wx + w0.)
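Here is a minimal sketch of the price-from-mileage regression: an ordinary least-squares fit of y = wx + w0 in closed form. The mileage/price figures are invented for illustration.

```python
# Ordinary least squares for y = w*x + w0, using the closed form
# w = cov(x, y) / var(x), w0 = mean(y) - w * mean(x).
# The data below are invented for illustration.
mileage = [10_000, 30_000, 60_000, 90_000, 120_000]   # x
price   = [18_000, 15_500, 12_000,  9_000,   6_500]   # y

n = len(mileage)
mx = sum(mileage) / n
my = sum(price) / n
w = (sum((x - mx) * (y - my) for x, y in zip(mileage, price))
     / sum((x - mx) ** 2 for x in mileage))
w0 = my - w * mx

print(f"price ≈ {w:.4f} * mileage + {w0:.1f}")
print("predicted price at 50,000 miles:", round(w * 50_000 + w0))
```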
• Unsupervised learning: no labels provided, only input data.
– Learning associations:
∗ Basket analysis: let p(Y |X) = “probability that a customer who buys product X also buys product Y ”, estimated from past purchases. If p(Y |X) is large (say 0.7), associate “X → Y ”. When someone buys X, recommend them Y (a small sketch follows at the end of this list).
– Clustering: group similar data points.
– Density estimation: where are data points likely to lie?
– Dimensionality reduction: data lies in a low-dimensional manifold.
– Feature selection: keep only useful features.
– Outlier/novelty detection.
• Semisupervised learning: labels provided for some points only.
• Reinforcement learning: find a sequence of actions (policy) that reaches a goal. No supervised
output but delayed reward.
Ex: playing chess or a computer game, robot in a maze.
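To make the basket-analysis rule above concrete, here is a toy sketch that estimates p(Y |X) from a list of past baskets. The transactions are made up, and the 0.7 threshold is the illustrative value from the notes.

```python
# Estimate p(Y|X) = #(baskets with both X and Y) / #(baskets with X)
# from past purchases. The baskets below are made up for illustration.
baskets = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"milk", "diapers"},
    {"bread", "butter"},
]

def p_given(y, x):
    """Empirical conditional probability p(y | x) over the baskets."""
    with_x = [b for b in baskets if x in b]
    if not with_x:
        return 0.0
    return sum(1 for b in with_x if y in b) / len(with_x)

p = p_given("bread", "milk")
print(f"p(bread | milk) = {p:.2f}")
if p > 0.7:   # association threshold from the notes
    print("associate: milk -> bread; recommend bread to milk buyers")
```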
2 Supervised learning

2.1 Learning a class from examples: two-class problems
• We are given a training set of labeled examples (positive and negative) and want to learn a classifier that we can use to classify unseen examples, or to understand the data.
• Input representation: we need to decide what attributes (features) to use to describe the input
patterns (examples, instances). This implies ignoring other attributes as irrelevant.
(Figures, in the plane of x1: price and x2: engine power: left, a training set for a “family car”, with the true class region C; right, the hypothesis class of rectangles, (p1 ≤ price ≤ p2) AND (e1 ≤ engine power ≤ e2), where p1, p2, e1, e2 ∈ R.)
• Training set: X = {(xn, yn)}_{n=1}^N, where xn ∈ R^D is the nth input vector and yn ∈ {0, 1} its class label.
• Hypothesis (model) class H: the set of classifier functions we will use. Ideally, the true class
distribution C can be represented by a function in H (exactly, or with a small error).
• Having selected H, learning the class reduces to finding an optimal h ∈ H. We don’t know the true class regions C, but we can measure how well h approximates them with the empirical error:
E(h; X) = ∑_{n=1}^N I(h(xn) ≠ yn) = number of misclassified instances
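As an illustrative sketch (data invented): learn the tightest axis-aligned rectangle around the positive examples (one particular h from the rectangle class above) and evaluate its empirical error E(h; X).

```python
# Hypothesis: axis-aligned rectangle (p1 <= price <= p2) AND (e1 <= power <= e2).
# Learn the tightest rectangle around the positive examples, then compute
# E(h; X) = number of misclassified instances. Data invented for illustration.
X = [(12_000, 90), (15_000, 110), (18_000, 120),    # family cars (y = 1)
     (5_000, 60), (40_000, 200), (14_000, 250)]     # others (y = 0)
y = [1, 1, 1, 0, 0, 0]

pos = [x for x, label in zip(X, y) if label == 1]
p1, p2 = min(p for p, _ in pos), max(p for p, _ in pos)
e1, e2 = min(e for _, e in pos), max(e for _, e in pos)

def h(x):
    """1 if x falls inside the learned rectangle, else 0."""
    price, power = x
    return int(p1 <= price <= p2 and e1 <= power <= e2)

# Empirical error: E(h; X) = sum_n I(h(x_n) != y_n).
E = sum(h(x) != label for x, label in zip(X, y))
print(f"rectangle: price in [{p1}, {p2}], power in [{e1}, {e2}]; E = {E}")
```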
There may be more than one optimal h ∈ H. In that case, we achieve better generalization by maximizing the margin (the distance
between the boundary of h and the instances closest to it).
(Figures, in the (x1, x2) plane: the hypothesis with the largest margin; and, with noise, a simple hypothesis h1 versus a more complex hypothesis h2.)