where is a vector of length and is a binary label, is known as Binary Classification.
Statistical Learning
Given points learn a function such that,
where is a vector of length , is a binary label, and is an unknown distribution.
A collection of random variables is independent and identically distributed if each random variable has the same probability distribution as the others and all are mutually independent.
Online Learning
``Online learning involves learning the function using one data observation at a time. At each step :
Receive feature vector
Choose prediction function , predict label
View true label , suffer mistake if
Perceptron Algorithm
Input
Dataset,
Initial weights, (typically )
Initial bias, (typically )
Threshold, (typically )
Output
Trained weights,
Trained bias,
Uniqueness
The perceptron algorithm only guarantees (given the right convergence condition) finding some solution, which may not necessarily be the best one.
Termination
The perceptron algorithm can terminate when one of the following is true:
all points are classified correctly
validation error stops decreasing (validation dataset is not used for training, but to avoid overfitting)
some iteration budget is exhausted
weights and bias are not changing much
Padding & Pre-Multiplication
Goal: Find such that, for all
where represents the dot product of vectors and .
Thus,
where represents the vector padded with scalar .
Thus,
where ,
Thus,
where .
Thus, our goal is,
where is the matrix with rows :
Linear Separability
A dataset has linear separability if and only if there exists,
such that,
for all , for some constant .
Equivalently,
Error Bound & Margin
For a linearly separable dataset, a perceptron will converge in a number of steps (or mistakes) given by,
where is the L2 norm operator and .
Given our goal is to minimize the number of steps,
Multiclass Classification
Multiclass Classification is the process of classifying data when there are more than 2 classes of labels. There are two approaches taken in that case.
One vs All
Train binary classifiers, each predicting whether the data is of a certain label or not
Output prediction label based on
One vs One
Train binary classifiers, one for each combination of labels