Learning from Observations

Outline
Learning
Hypothesis Spaces
Decision Trees
Naïve Bayes (not in the text)
Training and Testing

What is Learning?
Memorizing something
Learning facts through observation and exploration
Generalizing a concept from experience
“Learning denotes changes in the system that are adaptive in the sense that they enable the system to do the task or tasks drawn from the same population more efficiently and more effectively the next time.” – Herb Simon

Why is it necessary?
Three reasons:
Unknown environment – we may need to deploy an agent in unfamiliar territory
Save labor – we may not have the resources to encode knowledge
Can’t explicitly encode knowledge – we may lack the ability to articulate the necessary knowledge

Learning agents

Learning element
Design of a learning element is affected by:
Which components of the performance element are to be learned
What feedback is available to learn these components
What representation is used for the components
Type of feedback:
Supervised learning: correct answers for each example
Unsupervised learning: correct answers not given
Reinforcement learning: occasional rewards

Induction
Making predictions about the future based on the past.
If asked why we believe the sun will rise tomorrow, we shall naturally answer, “Because it has always risen every day.” We have a firm belief that it will rise in the future, because it has risen in the past. – Bertrand Russell
Is induction sound? Why believe that the future will look similar to the past?

Inductive learning
Simplest form: learn a function from examples
f is the target function
An example is a pair (x, f(x))
Problem: find a hypothesis h such that h ≈ f, given a training set of examples
This is a highly simplified model of real learning:
Ignores prior knowledge
Assumes examples are given

Inductive learning method
Memorization
Noise
Unreliable function
Unreliable sensors

Inductive learning method
Construct/adjust h to agree with f on the training set
(h is consistent if it agrees with f on all examples)
E.g., curve fitting (figures not reproduced)
Ockham’s razor: prefer the simplest hypothesis consistent with the data

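As a minimal illustration of this idea (not from the slides; the data and degrees below are made up), one can fit polynomial hypotheses of increasing degree to the same small training set: the higher-degree hypotheses also agree with the data, but Ockham’s razor prefers the simplest one.

# A minimal curve-fitting sketch (illustrative; data and degrees are made up).
# Polynomial hypotheses h of increasing degree are fit to the same examples
# (x, f(x)); the degree-4 curve matches all five points exactly, but the
# nearly linear data is explained just as well by the simplest hypothesis.
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.1, 1.1, 1.9, 3.2, 3.9])

for degree in (1, 2, 4):
    h = np.polyfit(x, y, degree)                       # hypothesis: degree-d polynomial
    train_error = np.mean((np.polyval(h, x) - y) ** 2)
    print(f"degree {degree}: training error = {train_error:.4f}")
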
An application: Ad blocking

Learning Ad blocking
Width and height of the image
Binary classification: Ad or ¬Ad?

Nearest Neighbor
A type of instance-based learning
Remember all of the past instances
Use the nearest old data point as the answer
Generalize to kNN: take the majority class among the k closest neighbors.

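A minimal sketch of this, assuming numeric feature vectors such as the image width/height above (function and variable names are illustrative, not from the slides):

# A minimal k-nearest-neighbor sketch (illustrative only).
# Instances are (feature_vector, label) pairs that are simply remembered;
# a new point is labeled by a majority vote over its k closest neighbors.
from collections import Counter
import math

def knn_classify(instances, query, k=3):
    # Sort remembered instances by Euclidean distance to the query point.
    neighbors = sorted(instances,
                       key=lambda inst: math.dist(inst[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# E.g., (width, height) of images labeled as ads (True) or non-ads (False).
data = [((468, 60), True), ((728, 90), True), ((64, 64), False),
        ((120, 600), True), ((32, 32), False)]
print(knn_classify(data, (480, 70), k=3))   # -> True (looks like a banner ad)
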
Application: Eating out
Problem: decide whether to wait for a table at a restaurant, based on the following attributes:
Alternate: is there an alternative restaurant nearby?
Bar: is there a comfortable bar area to wait in?
Fri/Sat: is today Friday or Saturday?
Hungry: are we hungry?
Patrons: number of people in the restaurant (None, Some, Full)
Price: price range ($, $$, $$$)
Raining: is it raining outside?
Reservation: have we made a reservation?
Type: kind of restaurant (French, Italian, Thai, Burger)
WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60 minutes)

Attribute representation
Examples are described by attribute or feature values (Boolean, discrete, continuous)
E.g., situations where I will/won't wait for a table:
[table of 12 example situations not reproduced]
Classification of examples is positive (T) or negative (F)

Bayes Rule
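For reference, Bayes' rule in the notation used on the next slide:
P(vj | a1, a2, …, an) = P(a1, a2, …, an | vj) P(vj) / P(a1, a2, …, an)
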
Naïve Bayes Classifier
Calculate the most probable function value:
vMAP = argmax_vj P(vj | a1, a2, …, an)
     = argmax_vj P(a1, a2, …, an | vj) P(vj) / P(a1, a2, …, an)
     = argmax_vj P(a1, a2, …, an | vj) P(vj)
Naïve assumption (conditional independence): P(a1, a2, …, an | vj) = P(a1|vj) P(a2|vj) … P(an|vj)

Naïve Bayes Algorithm

NaïveBayesLearn(examples)
  For each target value vj
    P’(vj) ← estimate P(vj)
    For each attribute value ai of each attribute a
      P’(ai|vj) ← estimate P(ai|vj)

ClassifyNewInstance(x)
  vNB = argmax_vj P’(vj) Π P’(ai|vj)

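A minimal Python sketch of this algorithm for discrete attributes (illustrative only; the names, the toy data, and the lack of smoothing are my own simplifications, not from the slides):

# Minimal Naive Bayes sketch for discrete attributes (illustrative only).
# Probabilities are estimated by simple frequency counts, as in NaiveBayesLearn
# above; classification returns argmax_v P'(v) * prod_i P'(a_i | v).
from collections import Counter, defaultdict

def naive_bayes_learn(examples):
    # examples: list of (attribute_tuple, target_value) pairs
    class_counts = Counter(v for _, v in examples)
    prior = {v: c / len(examples) for v, c in class_counts.items()}
    cond = defaultdict(Counter)                  # (attr_index, v) -> value counts
    for attrs, v in examples:
        for i, a in enumerate(attrs):
            cond[(i, v)][a] += 1
    def p_cond(i, a, v):                         # P'(a_i | v), frequency estimate
        return cond[(i, v)][a] / class_counts[v]
    return prior, p_cond

def naive_bayes_classify(prior, p_cond, attrs):
    def score(v):
        s = prior[v]
        for i, a in enumerate(attrs):
            s *= p_cond(i, a, v)
        return s
    return max(prior, key=score)

# Hypothetical usage with made-up data:
examples = [(("sunny", "hot"), "no"), (("rain", "mild"), "yes"),
            (("sunny", "mild"), "yes"), (("rain", "hot"), "no")]
prior, p_cond = naive_bayes_learn(examples)
print(naive_bayes_classify(prior, p_cond, ("sunny", "mild")))   # -> yes

In practice the frequency estimates would be smoothed (e.g., Laplace or m-estimates) so that an unseen attribute value does not zero out the whole product.
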
An Example
(due to MIT’s OpenCourseWare slides)

Decision trees
Developed in parallel in statistics and in AI
E.g., here is the “true” tree for deciding whether to wait (figure not reproduced)

Expressiveness
Decision trees can express any function of the input attributes.
E.g., for Boolean functions, truth table row → path to leaf
Trivially, there is a consistent decision tree for any training set, with one path to a leaf for each example (unless f is nondeterministic in x), but it probably won't generalize to new examples
Prefer to find more compact decision trees

Hypothesis spaces
How many distinct decision trees with n Boolean attributes?
= number of Boolean functions
= number of distinct truth tables with 2^n rows = 2^(2^n)
E.g., with 6 Boolean attributes, there are 2^64 = 18,446,744,073,709,551,616 trees
How many purely conjunctive hypotheses (e.g., Hungry ∧ ¬Rain)?
Each attribute can be in (positive), in (negative), or out
⇒ 3^n distinct conjunctive hypotheses
A more expressive hypothesis space:
increases the chance that the target function can be expressed
increases the number of hypotheses consistent with the training set
⇒ may get worse predictions

The best hypothesis
Find the best function that models the given data.
How do we define the best function?
Fidelity to the data – error on the existing data: E(h,D)
Simplicity – how complicated the solution is: C(h)
One measure: how many possible hypotheses are there in the class?
There is an inevitable tradeoff between the complexity of the hypothesis and the degree of fit to the data
Minimize α E(h,D) + (1-α) C(h), where α is a tuning parameter

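A tiny sketch of this criterion (illustrative; the data, the choice of α, and using polynomial degree as the complexity measure C(h) are all assumptions, not from the slides):

# Choose among hypotheses by minimizing alpha * E(h,D) + (1 - alpha) * C(h).
# Here: polynomial hypotheses, squared training error as E, degree as C.
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.1, 1.1, 1.9, 3.2, 3.9])
alpha = 0.9                                         # tuning parameter (assumed)

def objective(degree):
    h = np.polyfit(x, y, degree)
    error = np.mean((np.polyval(h, x) - y) ** 2)    # E(h, D): fit to the data
    return alpha * error + (1 - alpha) * degree     # C(h) = degree (a simple proxy)

print(min((1, 2, 3, 4), key=objective))             # degree 1 wins on this data
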
Decision tree learning
Aim: find a small tree consistent with the training examples
Idea: (recursively) choose the “most significant” attribute as the root of the (sub)tree

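A rough sketch of this recursion (illustrative; it is not the textbook's exact pseudocode, and the importance function is assumed here, e.g. the information gain defined a few slides below):

# A rough sketch of recursive decision-tree learning (illustrative only).
# importance(attr, examples) is assumed to score attributes, e.g. by
# the information gain defined later in these notes.
from collections import Counter

def decision_tree_learn(examples, attributes, importance):
    # examples: list of (dict attr -> value, label); returns a nested-dict tree
    labels = [label for _, label in examples]
    if len(set(labels)) == 1:                    # all positive or all negative
        return labels[0]
    if not attributes:                           # no attributes left: majority label
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: importance(a, examples))
    tree = {"attribute": best, "branches": {}}
    values = {attrs[best] for attrs, _ in examples}
    for v in values:                             # one subtree per attribute value
        subset = [(attrs, label) for attrs, label in examples if attrs[best] == v]
        remaining = [a for a in attributes if a != best]
        tree["branches"][v] = decision_tree_learn(subset, remaining, importance)
    return tree
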
Choosing an attribute
Idea: a good attribute splits the training set into subsets that are (ideally) “all positive” or “all negative”
Patrons? is a better choice

Information Content
Entropy measures the purity of a set of examples
Normally denoted H(x)
Or as information content: the less you need to know (to determine the class of a new case), the more information you have
With two classes (P, N) and t = p + n:
IC(S) = -(p/t) log2(p/t) - (n/t) log2(n/t)
E.g., p = 9, n = 5: IC([9,5]) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
Also, IC([14,0]) = 0 and IC([7,7]) = 1

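A small sketch of this computation (illustrative):

# Two-class information content / entropy of a set with p positive and
# n negative examples (0 * log2(0) is treated as 0).
import math

def ic(p, n):
    t = p + n
    total = 0.0
    for c in (p, n):
        if c:
            total -= (c / t) * math.log2(c / t)
    return total

print(round(ic(9, 5), 3))     # 0.94
print(ic(14, 0), ic(7, 7))    # 0.0 1.0
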
Entropy curve
For p/(p+n) between 0 and 1, the 2-class entropy is:
0 when p/(p+n) is 0
1 when p/(p+n) is 0.5
0 when p/(p+n) is 1
monotonically increasing between 0 and 0.5
monotonically decreasing between 0.5 and 1
When the data is pure, the entropy is 0, so essentially no information is needed to determine the class of a new case

Using information theory
To implement Choose-Attribute in the DTL algorithm
Entropy:
I(P(v1), …, P(vn)) = Σ_{i=1..n} -P(vi) log2 P(vi)
For a training set containing p positive examples and n negative examples:
I(p/(p+n), n/(p+n)) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n))

Information gain
A chosen attribute A divides the training set E into subsets E1, …, Ev according to their values for A, where A has v distinct values; subset Ei contains pi positive and ni negative examples.
remainder(A) = Σ_{i=1..v} ((pi + ni)/(p + n)) I(pi/(pi + ni), ni/(pi + ni))
Information Gain (IG), i.e. the reduction in entropy from the attribute test:
IG(A) = I(p/(p+n), n/(p+n)) - remainder(A)
Choose the attribute with the largest IG

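A small sketch of these two quantities, reusing the ic() function above (illustrative; the split in the usage line is hypothetical):

# remainder(A) and information gain for a split given as (p_i, n_i) counts,
# one pair per value of the tested attribute (uses ic() from above).
def remainder(split):
    p = sum(pi for pi, _ in split)
    n = sum(ni for _, ni in split)
    return sum((pi + ni) / (p + n) * ic(pi, ni) for pi, ni in split)

def information_gain(split):
    p = sum(pi for pi, _ in split)
    n = sum(ni for _, ni in split)
    return ic(p, n) - remainder(split)

# Hypothetical attribute splitting 3 positive / 3 negative examples:
print(round(information_gain([(2, 0), (1, 3)]), 3))    # 0.459
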
Information gain
For the training set, p = n = 6, so I(6/12, 6/12) = 1 bit
Consider the attributes Patrons and Type (and the others too; figure not reproduced)
Patrons has the highest IG of all the attributes and so is chosen by the DTL algorithm as the root

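Working this out with the example counts from the standard AIMA restaurant data (assumed here, since the figure is not reproduced): Patrons splits the 12 examples into None (0+, 2-), Some (4+, 0-) and Full (2+, 4-), while Type splits them into French (1+, 1-), Italian (1+, 1-), Thai (2+, 2-) and Burger (2+, 2-), so:
IG(Patrons) = 1 - [2/12 I(0,1) + 4/12 I(1,0) + 6/12 I(2/6, 4/6)] = 1 - 6/12 × 0.918 ≈ 0.541 bits
IG(Type) = 1 - [2/12 I(1/2,1/2) + 2/12 I(1/2,1/2) + 4/12 I(1/2,1/2) + 4/12 I(1/2,1/2)] = 1 - 1 = 0 bits
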
Example contd.
Decision tree learned from the 12 examples (figure not reproduced)
Substantially simpler than the “true” tree: a more complex hypothesis isn't justified by the small amount of data

Performance measurement
How do we know that h ≈ f?
Try h on a new test set of examples
Learning curve = % correct on the test set as a function of training set size

Training and testing sets
Where does the test set come from?
1. Collect a large set of examples
2. Divide into training and testing data
3. Train on the training data, assess on the testing data
4. Repeat 1-3 for different splits of the set
The training and test data should come from the same distribution:
“Learning … enable[s] the system to do the task or tasks drawn from the same population” – Herb Simon
To think about: why?

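A minimal sketch of steps 1-4 (illustrative; learn and accuracy stand in for whatever learner and metric are being evaluated):

# A minimal holdout-evaluation sketch for steps 1-4 above (illustrative only).
import random

def holdout_evaluate(examples, learn, accuracy, train_fraction=0.8, repeats=5):
    scores = []
    for _ in range(repeats):                           # step 4: repeat for different splits
        shuffled = examples[:]
        random.shuffle(shuffled)
        cut = int(train_fraction * len(shuffled))
        train, test = shuffled[:cut], shuffled[cut:]   # step 2: divide the data
        h = learn(train)                               # step 3: train on training data...
        scores.append(accuracy(h, test))               # ...assess on the test data
    return sum(scores) / len(scores)
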
Overfitting
Does better training performance imply better test performance?
No. Why not?
The hypothesis may be too specific
It may model noise
Pruning: keep the complexity of the hypothesis low
Stop splitting when:
IC is below a threshold
there are too few data points in a node

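In the recursive sketch from earlier, these rules could be added as an extra base case (threshold values below are made up for illustration, not from the slides):

# Possible early-stopping check for decision_tree_learn above (illustrative).
def should_stop(examples, best_attribute, importance,
                min_examples=4, min_score=0.01):
    too_few = len(examples) < min_examples                          # too few data points in the node
    too_little = importance(best_attribute, examples) < min_score   # IC / gain below a threshold
    return too_few or too_little
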
Summary
Learning is needed for unknown environments and for lazy designers
Learning agent = performance element + learning element
For supervised learning, the aim is to find a simple hypothesis approximately consistent with the training examples
Decision tree learning using information gain
Learning performance = prediction accuracy measured on a test set