learning theory
references: (1) Machine Learning - Tom Mitchell, (2) An Introduction to Computational Learning Theory - Kearns & Vazirani
evolution
- performance is correlation: Perf_D(h,c) = ∑_x h(x)⋅c(x)⋅P(x)
- want P(Perf_D(h,c) < Perf_D(c,c) − ϵ) < δ
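A minimal sketch of the correlation formula over a finite instance space, assuming ±1-valued hypotheses and concepts so that the sum really is a correlation; the toy distribution and functions below are illustrative, not from the references.

```python
def perf(h, c, distribution):
    """Perf_D(h, c) = sum_x h(x) * c(x) * P(x) over a finite instance space."""
    return sum(h(x) * c(x) * p for x, p in distribution.items())

# toy instance space: integers 0..3 with a uniform distribution (illustrative)
D = {x: 0.25 for x in range(4)}
c = lambda x: 1 if x >= 2 else -1   # target concept, ±1-valued
h = lambda x: 1 if x >= 1 else -1   # candidate hypothesis, ±1-valued

print(perf(h, c, D))   # 0.5: h agrees with c on 3 of the 4 equally likely instances
```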
sample problems
- ex: N marbles in a bag. How many draws with replacement needed before we draw all N marbles?
- write P_i = (N − (i−1))/N, the probability of drawing a new marble once i−1 distinct marbles have been seen
- the number of draws needed to go from i−1 to i distinct marbles is geometric with success probability P_i
- the mean total number of draws is the sum of the geometric means: ∑_{i=1}^{N} 1/P_i = ∑_{i=1}^{N} N/(N−(i−1)) = N⋅H_N
- to get the probability of seeing all the marbles, rather than just E[# draws], use Markov's inequality
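A small sketch of the calculation in the bullets above (coupon-collector style); the function names are mine.

```python
from math import fsum

def expected_draws(n):
    """Expected draws with replacement to see all n marbles:
    sum over i of the mean of a Geometric(P_i) with P_i = (n - (i-1)) / n."""
    return fsum(n / (n - (i - 1)) for i in range(1, n + 1))

def markov_bound(n, m):
    """Markov's inequality: P(#draws >= m) <= E[#draws] / m, so after m draws
    the probability that some marble is still unseen is at most this value."""
    return expected_draws(n) / m

print(expected_draws(10))     # ~29.3 draws on average for 10 marbles
print(markov_bound(10, 100))  # crude tail bound after 100 draws (~0.29)
```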
- box full of 1e6 marbles
- if we have 10 evenly distributed classes of marbles, what is probability we identify all 10 classes of marbles after 100 draws?
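A sketch for this question, assuming draws are effectively i.i.d. uniform over the 10 classes (reasonable with 1e6 marbles): the exact answer by inclusion-exclusion over the set of missed classes, plus the cruder union bound for comparison.

```python
from math import comb

def p_all_classes_seen(k, m):
    """Exact probability that m i.i.d. uniform draws over k classes hit every class."""
    return sum((-1) ** j * comb(k, j) * (1 - j / k) ** m for j in range(k + 1))

def union_bound_miss(k, m):
    """Union bound: P(some class unseen) <= k * (1 - 1/k)^m."""
    return k * (1 - 1 / k) ** m

print(p_all_classes_seen(10, 100))    # ~0.9997
print(1 - union_bound_miss(10, 100))  # slightly looser lower bound, also ~0.9997
```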
computational learning theory
- frameworks
- PAC
- mistake-bound - split into b processes which each fail with probability at most δ/b
- questions
- sample complexity - how many training examples needed to converge
- computational complexity - how much computational effort needed to converge
- mistake bound - how many training examples will learner misclassify before converging
- must define convergence based on some probability
PAC - probably learning an approximately correct hypothesis - Mitchell
- want to learn C
- instances are drawn from X according to a fixed distribution D
- learner L considers set H of possible hypotheses
- true error err_D(h) of hypothesis h with respect to target concept c and distribution D is the probability that h will misclassify an instance drawn at random according to D
- err_D(h) = Pr_{x∼D}[c(x) ≠ h(x)]
- requiring err_D(h) = 0 is infeasible: the learner only sees a finite random sample, so hypotheses that agree on the sample but differ elsewhere cannot be distinguished
- PAC learnable - consider concept class C defined over set of instances X of length n and a learner L using hypothesis space H
- C is PAC-learnable by L using H if, for all c∈C, all distributions D over X, all ϵ s.t. 0 < ϵ < 1/2, and all δ s.t. 0 < δ < 1/2, learner L will with probability at least (1−δ) output a hypothesis h∈H s.t. err_D(h) ≤ ϵ
- efficiently PAC learnable - time that is polynomial in 1/ϵ,1/δ,n,size(c)
- probably - probability of failure bounded by some constant δ
- approximately correct - err bounded by some constant ϵ
- assumes H contains a hypothesis with arbitrarily small error for every target concept in C
sample complexity for finite hypothesis space - Mitchell
- sample complexity - growth in the number of required training examples with problem size
- consistent learner - outputs hypotheses that perfectly fit training data whenever possible
- outputs a hypothesis belonging to the version space
- consider hypothesis space H, target concept c, instance distribution D, and a set of training examples D of c. The version space VS_{H,D} is ϵ-exhausted with respect to c and D if every hypothesis h in VS_{H,D} has error less than ϵ with respect to c and D: (∀h∈VS_{H,D}) err_D(h) < ϵ
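Where ϵ-exhaustion leads for a finite H (Mitchell, Theorem 7.1), stated as a worked bound that applies to any consistent learner:

```latex
% probability that the version space is not \epsilon-exhausted after m i.i.d. examples
\Pr\big[\exists\, h \in VS_{H,D} : \mathrm{err}_D(h) \ge \epsilon\big] \;\le\; |H|\, e^{-\epsilon m}
% setting the right-hand side \le \delta gives the sample-complexity bound
m \;\ge\; \frac{1}{\epsilon}\Big(\ln|H| + \ln\frac{1}{\delta}\Big)
```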
rectangle learning game - Kearns
- instances are drawn from X according to a fixed distribution D
- simple solution: the tightest-fit rectangle (smallest axis-aligned rectangle enclosing all positive examples)
- define a strip T of probability weight ϵ/4 along each side of the target rectangle, so the probability a single draw misses T is 1 − ϵ/4
- then all m draws miss T with probability (1 − ϵ/4)^m
- union-bounding over the four strips, choose m to satisfy 4(1 − ϵ/4)^m ≤ δ; since 1 − x ≤ e^{−x}, any m ≥ (4/ϵ)⋅ln(4/δ) works
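A simulation sketch of the tightest-fit learner; the target rectangle, the uniform distribution on the unit square, and the sample size below are illustrative choices, not from the text.

```python
import random

def tightest_fit(samples):
    """Smallest axis-aligned rectangle containing all positive examples;
    None if there are no positives (then predict negative everywhere)."""
    pos = [p for p, label in samples if label]
    if not pos:
        return None
    xs, ys = zip(*pos)
    return (min(xs), max(xs), min(ys), max(ys))

def inside(rect, p):
    if rect is None:
        return False
    x0, x1, y0, y1 = rect
    return x0 <= p[0] <= x1 and y0 <= p[1] <= y1

# illustrative target rectangle and distribution (uniform on the unit square)
target = (0.2, 0.7, 0.3, 0.9)

def draw():
    return (random.random(), random.random())

m = 500
train = [(p, inside(target, p)) for p in (draw() for _ in range(m))]
h = tightest_fit(train)

# estimate err_D(h): the tightest fit only errs inside the target but outside h
test = [draw() for _ in range(100_000)]
err = sum(inside(target, p) != inside(h, p) for p in test) / len(test)
print(err)   # typically small; the analysis above says m ~ (4/ϵ)·ln(4/δ) suffices
```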
VC dimension
- VC dimension measures the capacity of a space of functions that can be learned by a statistical classification algorithm
- let H be set of sets and C be a set
- H ∩ C := {h ∩ C : h ∈ H}
- a set C is shattered by H if H∩C contains all subsets of C
- The VC dimension of H is the largest integer D such that there exists a set C with cardinality D that is shattered by H
- VC (Vapnik-Chervonenkis) dimension - if data is mapped into a sufficiently high dimension, then the samples become linearly separable (N points in general position are separable in N−1 dimensions)
- VC dimension 0 -> H cannot shatter even a single point, e.g. a class containing only the always-true hypothesis or only the always-false hypothesis
- Sauer's lemma - let d ≥ 0, m ≥ 1, H a hypothesis space with VC-dim(H) = d. Then Π_H(m) ≤ Φ(d,m), where Φ(d,m) = ∑_{i=0}^{d} (m choose i)
- the fundamental theorem of learning theory provides a bound on m that guarantees learning: m ≥ (4/ϵ)⋅(d⋅ln(12/ϵ) + ln(2/δ))
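A brute-force sketch tying these definitions together: a shattering check for a toy threshold class (VC dimension 1), Sauer's Φ(d, m), and the sample bound above; the hypothesis class and function names are mine.

```python
from itertools import combinations
from math import comb, ceil, log

def shatters(hypotheses, points):
    """H shatters C if {h ∩ C : h ∈ H} contains all 2^|C| subsets of C."""
    realized = {frozenset(p for p in points if h(p)) for h in hypotheses}
    return len(realized) == 2 ** len(points)

def vc_dim(hypotheses, domain, max_d=4):
    """Largest d such that some size-d subset of the domain is shattered (brute force)."""
    d = 0
    for k in range(1, max_d + 1):
        if any(shatters(hypotheses, c) for c in combinations(domain, k)):
            d = k
    return d

def phi(d, m):
    """Sauer's lemma bound: Phi(d, m) = sum_{i=0}^{d} C(m, i), so Pi_H(m) <= Phi(d, m)."""
    return sum(comb(m, i) for i in range(d + 1))

def sample_bound(d, eps, delta):
    """m >= (4/eps) * (d * ln(12/eps) + ln(2/delta))."""
    return ceil(4 / eps * (d * log(12 / eps) + log(2 / delta)))

# toy hypothesis class: thresholds h_t(x) = [x >= t] over a small integer domain
domain = range(5)
H = [lambda x, t=t: x >= t for t in range(6)]

print(vc_dim(H, domain))           # 1: any single point is shattered, no pair is
print(phi(1, 10))                  # 11: max number of labelings of 10 points
print(sample_bound(1, 0.1, 0.05))  # sufficient sample size per the bound above
```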
concept learning and the general-to-specific ordering
- definitions
- concept learning - acquiring the definition of a general category given a sample of positive and negative training examples of the category
- a concept is a boolean-valued function over the instance space that returns true exactly for the instances belonging to the category
- a hypothesis can be represented as a vector of constraints on the features: an acceptable value, ? (any value acceptable), or ∅ (no value acceptable); if any constraint is ∅, the hypothesis classifies every instance as negative
- more general hypothesis - h_j is more general than or equal to h_k if every instance that satisfies h_k also satisfies h_j
- the more-general-than relation defines a partial ordering over H
- a hypothesis is consistent with the training examples if it correctly classifies them
- an example x satisfies a hypothesis h if h(x) = 1
- find-S - finding a maximally specific hypothesis
- start with most specific possible
- generalize each time it fails to cover an observed positive training example
- flaws
- ignores negative examples
- only guaranteed to find the target concept if the training data is perfect, i.e.:
- no errors
- there exists a hypothesis in H that describes target concept c
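A minimal sketch of find-S under the vector-of-constraints representation above, with None standing in for the most-specific ∅ constraint; the EnjoySport-style data is illustrative.

```python
def find_s(examples, n_features):
    """Find-S: start with the most specific hypothesis (all ∅, here None) and
    minimally generalize on each positive example; negatives are ignored."""
    h = [None] * n_features                 # most specific: rejects everything
    for x, label in examples:
        if not label:
            continue                        # find-S ignores negative examples
        for i, value in enumerate(x):
            if h[i] is None:
                h[i] = value                # first positive: copy its values
            elif h[i] != value:
                h[i] = '?'                  # conflict: generalize to "any value"
    return h

# illustrative EnjoySport-style data (attribute vectors + boolean label)
data = [
    (('sunny', 'warm', 'normal'), True),
    (('sunny', 'warm', 'high'), True),
    (('rainy', 'cold', 'high'), False),
]
print(find_s(data, 3))   # ['sunny', 'warm', '?']
```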
- version space - set of all hypotheses consistent with the training examples
- list-then-eliminate - list all hypotheses and eliminate any that are inconsistent (slow)
- candidate-elimination - represent most general (G) and specific (S) members of version space
- version space representation theorem - the version space can be recovered from its most general (G) and most specific (S) members: it is exactly the set of hypotheses lying between them in the general-to-specific ordering
- for positive examples
- make S more general
- fix G
- for negative examples
- fix S
- make G more specific
- in general, optimal query strategy is to generate instances that satisfy exactly half the hypotheses in the current version space
- classifying new instances with a partially learned concept
- classify as positive if it satisfies every member of S
- classify as negative if it satisfies no member of G
- inductive bias of candidate-elimination - target concept c is contained in H
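A condensed sketch of the S/G updates described above for conjunctive vector hypotheses; it tracks a single maximally specific hypothesis and skips the full boundary pruning of candidate-elimination, which is enough to illustrate the update rules.

```python
def match(h, x):
    """h labels x positive iff every constraint is '?' or equals the attribute value."""
    return all(c == '?' or c == xi for c, xi in zip(h, x))

def candidate_elimination(examples, n):
    S = tuple([None] * n)           # single maximally specific hypothesis (∅ everywhere)
    G = {tuple(['?'] * n)}          # maximally general boundary

    for x, label in examples:
        if label:                                   # positive: fix G, generalize S
            G = {g for g in G if match(g, x)}       # drop general members that miss it
            S = tuple(xi if s is None else (s if s == xi else '?')
                      for s, xi in zip(S, x))       # minimal generalization of S
        else:                                       # negative: fix S, specialize G
            if match(S, x):
                raise ValueError("no conjunctive hypothesis is consistent with the data")
            new_G = set()
            for g in G:
                if not match(g, x):
                    new_G.add(g)                    # already excludes the negative
                    continue
                for i in range(n):                  # minimal specializations guided by S
                    # (assumes a positive example has been seen before this negative)
                    if g[i] == '?' and S[i] not in (None, '?', x[i]):
                        new_G.add(g[:i] + (S[i],) + g[i + 1:])
            G = new_G
    return S, G

# illustrative EnjoySport-style run (3 attributes)
data = [
    (('sunny', 'warm', 'normal'), True),
    (('sunny', 'warm', 'high'), True),
    (('rainy', 'cold', 'high'), False),
]
S, G = candidate_elimination(data, 3)
print(S)   # ('sunny', 'warm', '?')
print(G)   # {('sunny', '?', '?'), ('?', 'warm', '?')}
```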