Perceptron

Binary Classification

Given $n$ points $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, learning a function $h$ such that

$$h(x_i) = y_i$$

where $x_i \in \mathbb{R}^d$ is a vector of length $d$ and $y_i \in \{-1, 1\}$ is a binary label, is known as Binary Classification.

Statistical Learning

Given $n$ points $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n) \overset{\text{i.i.d.}}{\sim} P$, learn a function $h : \mathbb{R}^d \to \{-1, 1\}$ such that

$$\Pr_{(x, y) \sim P}[h(x) = y] \gg 0$$

where $x_i \in \mathbb{R}^d$ is a vector of length $d$, $y \in \{-1, 1\}$ is a binary label, and $P$ is an unknown distribution.

A collection of random variables is independent and identically distributed if each random variable has the same probability distribution as the others and all are mutually independent.

Online Learning

Online learning involves learning the function using one data observation at a time. At each step $i = 1, 2, \ldots$:

  1. Receive feature vector $x_i$
  2. Choose prediction function $h_i$, predict label $\hat{y}_i = h_i(x_i)$
  3. View true label $y_i$, suffer a mistake if $y_i \neq \hat{y}_i$
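The three-step protocol can be sketched as a loop. The tiny data stream and the perceptron-style update below are assumptions for illustration; the protocol itself does not fix how $h_i$ is chosen.

```python
import numpy as np

# Assumed toy data stream; in true online learning the examples arrive
# one at a time and the learner commits to a prediction before seeing y_i.
stream = [
    (np.array([1.0, 2.0]), 1),
    (np.array([-1.0, 1.5]), -1),
    (np.array([2.0, -1.0]), 1),
    (np.array([-2.0, -0.5]), -1),
]

w = np.zeros(2)   # current hypothesis h_i(x) = sign(<w, x>)
mistakes = 0
for x_i, y_i in stream:                 # step i of the protocol
    y_hat = 1 if w @ x_i >= 0 else -1   # 2. predict with the current h_i
    if y_hat != y_i:                    # 3. observe y_i, count a mistake
        mistakes += 1
        w += y_i * x_i                  # update rule (perceptron-style placeholder)
print("mistakes:", mistakes)            # → mistakes: 1
```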

Perceptron Algorithm

Input

Training examples $(x_1, y_1), \ldots, (x_n, y_n)$ and a margin parameter $\delta \ge 0$; initialize $w = 0$, $b = 0$.

Output

Weights $w \in \mathbb{R}^d$ and bias $b \in \mathbb{R}$.

$$
\begin{aligned}
&\textbf{for } t = 1, 2, \ldots \textbf{ do} \\
&\quad \text{select training example index } I_t \in \{1, \ldots, n\} \\
&\quad \textbf{if } y_{I_t}(w^T x_{I_t} + b) \le \delta \textbf{ then} \\
&\qquad w \leftarrow w + y_{I_t} x_{I_t} \\
&\qquad b \leftarrow b + y_{I_t}
\end{aligned}
$$
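A minimal NumPy sketch of the algorithm; the cyclic pass order over the indices, the toy data, and the stopping rule are assumptions for illustration.

```python
import numpy as np

def perceptron(X, y, delta=0.0, max_epochs=100):
    """Cycle through the training examples and update (w, b) whenever
    y_i (w . x_i + b) <= delta. Stop once a full pass makes no update
    (or after max_epochs passes)."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(max_epochs):
        updated = False
        for i in range(n):  # I_t cycles over {1, ..., n}
            if y[i] * (w @ X[i] + b) <= delta:
                w += y[i] * X[i]
                b += y[i]
                updated = True
        if not updated:     # no mistakes in a full pass: converged
            return w, b
    return w, b

# Linearly separable toy data: labels follow the first coordinate.
X = np.array([[2.0, 1.0], [1.0, -1.0], [-2.0, 1.0], [-1.0, -2.0]])
y = np.array([1, 1, -1, -1])
w, b = perceptron(X, y)
print(np.all(np.sign(X @ w + b) == y))  # → True
```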

Uniqueness

The perceptron algorithm only guarantees (given the right convergence condition) finding some solution, which may not necessarily be the best one.

Termination

The perceptron algorithm can terminate when one of the following is true:

  1. A full pass over the training data produces no update (every example satisfies $y_i(w^T x_i + b) > \delta$)
  2. A preset maximum number of iterations has been reached

Padding & Pre-Multiplication

Goal: Find $w, b$ such that, for all $i \in \{1, 2, \ldots, n\}$,

$$y_i = \text{sign}((w \cdot x_i) + b) = \text{sign}(\langle w, x_i \rangle + b)$$

where $\langle A, B \rangle$ represents the dot product of vectors $A$ and $B$.

Thus,

$$y_i = \text{sign}(\langle w, x_i \rangle + b) = \text{sign}(\langle (w, b), (x_i, 1) \rangle)$$

where $(A, b)$ represents the vector $A$ padded with the scalar $b$.

$$
(w, b) = \begin{bmatrix} w_1 \\ w_2 \\ w_3 \\ \vdots \\ w_d \\ b \end{bmatrix}, \quad
(x_i, 1) = \begin{bmatrix} x_{i,1} \\ x_{i,2} \\ x_{i,3} \\ \vdots \\ x_{i,d} \\ 1 \end{bmatrix}, \quad
\langle (w, b), (x_i, 1) \rangle = w_1 x_{i,1} + w_2 x_{i,2} + \cdots + w_d x_{i,d} + b
$$

Thus,

$$y_i = \text{sign}(\langle (w, b), (x_i, 1) \rangle) = \text{sign}(\langle z, (x_i, 1) \rangle)$$

where $z = (w, b)$.

Thus,

$$y_i = \text{sign}(\langle z, (x_i, 1) \rangle) \iff y_i \langle z, (x_i, 1) \rangle > 0 \iff \langle z, y_i (x_i, 1) \rangle > 0 \iff \langle z, a_i \rangle > 0$$

where $a_i = y_i (x_i, 1)$.

Thus, our goal is,

$$Az > 0$$

where A is the matrix with rows ai:

$$
A = \begin{bmatrix} a_1^T \\ a_2^T \\ \vdots \\ a_n^T \end{bmatrix}
  = \begin{bmatrix}
      y_1 x_{1,1} & y_1 x_{1,2} & \cdots & y_1 x_{1,d} & y_1 \\
      y_2 x_{2,1} & y_2 x_{2,2} & \cdots & y_2 x_{2,d} & y_2 \\
      \vdots & \vdots & \ddots & \vdots & \vdots \\
      y_n x_{n,1} & y_n x_{n,2} & \cdots & y_n x_{n,d} & y_n
    \end{bmatrix}
$$
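The padded, pre-multiplied construction of $A$ can be checked numerically. The dataset and the candidate $z = (w, b)$ below are assumptions chosen by hand for illustration.

```python
import numpy as np

# Assumed toy dataset: labels follow the first coordinate.
X = np.array([[2.0, 1.0], [1.0, -1.0], [-2.0, 1.0], [-1.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Rows a_i = y_i * (x_i, 1): pad each x_i with 1, then pre-multiply by y_i.
A = y[:, None] * np.hstack([X, np.ones((len(X), 1))])

# Any z = (w, b) that classifies every point correctly satisfies Az > 0;
# this z is a hand-picked separator for the data above.
z = np.array([2.0, 1.0, 1.0])
print(A @ z)              # → [6. 2. 2. 3.]
print(np.all(A @ z > 0))  # → True
```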

Linear Separability

A dataset is linearly separable if and only if there exists

$$z = (w, b)$$

such that

$$\langle z, a_i \rangle \ge s > 0$$

for all $i \in \{1, 2, \ldots, n\}$, for some constant $s \in \mathbb{R}^+$.

Equivalently,

$$Az \ge s\mathbf{1}$$

where $\mathbf{1}$ is the all-ones vector.

Error Bound & Margin

For a linearly separable dataset, the perceptron will converge within a number of steps (mistakes) bounded by

$$R^2 \left( \frac{\|z\|_2^2}{s^2} \right)$$

where $\|\cdot\|_2$ is the $L_2$ norm operator and $R = \max_i \|a_i\|_2$.

Given our goal is to minimize the number of steps,

$$
\min_{(z,s) :\ Az \ge s\mathbf{1}} \frac{\|z\|_2^2}{s^2}
= \min_{(z,s) :\ \|z\|_2 = 1,\ Az \ge s\mathbf{1}} \frac{1}{s^2}
= \frac{1}{\left( \max_{(z,s) :\ \|z\|_2 = 1,\ Az \ge s\mathbf{1}} s \right)^2}
= \left( \frac{1}{\underbrace{\max_{\|z\|_2 = 1} \min_i \langle z, a_i \rangle}_{\text{margin } \gamma}} \right)^2
$$
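These quantities can be sketched numerically on an assumed toy dataset. The margin $\gamma$ is approximated here by brute-force search over random unit vectors, not an exact solver, so the computed bound is only an estimate.

```python
import numpy as np

# Assumed toy dataset, with rows a_i = y_i * (x_i, 1) as before.
X = np.array([[2.0, 1.0], [1.0, -1.0], [-2.0, 1.0], [-1.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
A = y[:, None] * np.hstack([X, np.ones((len(X), 1))])

R = np.max(np.linalg.norm(A, axis=1))  # R = max_i ||a_i||_2

# gamma = max_{||z||_2 = 1} min_i <z, a_i>, approximated by sampling
# many random unit vectors z and keeping the best one.
rng = np.random.default_rng(0)
Z = rng.normal(size=(100_000, 3))
Z /= np.linalg.norm(Z, axis=1, keepdims=True)
gamma = np.max(np.min(Z @ A.T, axis=1))

bound = (R / gamma) ** 2  # mistake bound R^2 / gamma^2
print(f"R = {R:.3f}, gamma ~ {gamma:.3f}, bound ~ {bound:.1f}")
```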

Multiclass Classification

Multiclass Classification is the process of classifying data when there are more than two label classes. Two approaches are commonly taken in that case.

One vs All

Train one binary classifier per class, treating that class as $+1$ and all other classes as $-1$; predict the class whose classifier produces the largest score.
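A minimal sketch of One vs All using the perceptron as the base binary learner; the toy three-class data and the fixed-epoch training loop are assumptions for illustration.

```python
import numpy as np

def train_perceptron(X, y, epochs=100):
    # Basic perceptron on labels in {-1, +1}.
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:
                w, b = w + yi * xi, b + yi
    return w, b

def one_vs_all(X, y, classes):
    # One binary problem per class: class k vs the rest.
    return {k: train_perceptron(X, np.where(y == k, 1.0, -1.0)) for k in classes}

def predict(models, x):
    # Predict the class with the largest score w_k . x + b_k.
    return max(models, key=lambda k: models[k][0] @ x + models[k][1])

# Assumed toy 3-class data: one cluster per direction.
X = np.array([[3.0, 0.0], [2.5, 0.5], [0.0, 3.0],
              [0.5, 2.5], [-3.0, -3.0], [-2.5, -2.0]])
y = np.array([0, 0, 1, 1, 2, 2])
models = one_vs_all(X, y, classes=[0, 1, 2])
print([predict(models, x) for x in X])  # → [0, 0, 1, 1, 2, 2]
```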

One vs One

Train one binary classifier for each pair of classes ($\binom{K}{2}$ classifiers for $K$ classes); predict the class that wins the most pairwise comparisons.