Perceptron

Binary Classification

Given $n$ points $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, learning a function $h$ such that

$$h(x_i) = y_i$$

where $x_i \in \mathbb{R}^d$ is a vector of length $d$ and $y_i \in \{-1, 1\}$ is a binary label, is known as Binary Classification.

Statistical Learning

Given $n$ points $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n) \overset{\text{i.i.d.}}{\sim} P$, learn a function $h : \mathbb{R}^d \to \{-1, 1\}$ such that

$$\Pr_{(x, y) \sim P}[h(x) = y] \gg 0$$

where $x_i \in \mathbb{R}^d$ is a vector of length $d$, $y \in \{-1, 1\}$ is a binary label, and $P$ is an unknown distribution.

A collection of random variables is independent and identically distributed if each random variable has the same probability distribution as the others and all are mutually independent.

Online Learning

Online learning involves learning the function using one data observation at a time. At each step $i = 1, 2, \ldots$:

  1. Receive feature vector $x_i$
  2. Choose prediction function $h_i$, predict label $\hat{y}_i = h_i(x_i)$
  3. View true label $y_i$, suffer a mistake if $y_i \neq \hat{y}_i$
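The three-step protocol can be sketched as a loop. The tiny data stream and the perceptron-style update below are assumptions for illustration; the protocol itself does not fix how $h_i$ is chosen.

```python
import numpy as np

# Assumed toy data stream; in true online learning the examples arrive
# one at a time and the learner commits to a prediction before seeing y_i.
stream = [
    (np.array([1.0, 2.0]), 1),
    (np.array([-1.0, 1.5]), -1),
    (np.array([2.0, -1.0]), 1),
    (np.array([-2.0, -0.5]), -1),
]

w = np.zeros(2)   # current hypothesis h_i(x) = sign(<w, x>)
mistakes = 0
for x_i, y_i in stream:                 # step i of the protocol
    y_hat = 1 if w @ x_i >= 0 else -1   # 2. predict with the current h_i
    if y_hat != y_i:                    # 3. observe y_i, count a mistake
        mistakes += 1
        w += y_i * x_i                  # update rule (perceptron-style placeholder)
print("mistakes:", mistakes)            # → mistakes: 1
```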

Perceptron Algorithm

Input

Training examples $(x_1, y_1), \ldots, (x_n, y_n)$ and a margin parameter $\delta \ge 0$; initialize $w = 0$, $b = 0$.

Output

Weights $w \in \mathbb{R}^d$ and bias $b \in \mathbb{R}$.

$$
\begin{aligned}
&\textbf{for } t = 1, 2, \ldots \textbf{ do} \\
&\quad \text{select training example index } I_t \in \{1, \ldots, n\} \\
&\quad \textbf{if } y_{I_t}(w^T x_{I_t} + b) \le \delta \textbf{ then} \\
&\qquad w \leftarrow w + y_{I_t} x_{I_t} \\
&\qquad b \leftarrow b + y_{I_t}
\end{aligned}
$$
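A minimal NumPy sketch of the algorithm; the cyclic pass order over the indices, the toy data, and the stopping rule are assumptions for illustration.

```python
import numpy as np

def perceptron(X, y, delta=0.0, max_epochs=100):
    """Cycle through the training examples and update (w, b) whenever
    y_i (w . x_i + b) <= delta. Stop once a full pass makes no update
    (or after max_epochs passes)."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(max_epochs):
        updated = False
        for i in range(n):  # I_t cycles over {1, ..., n}
            if y[i] * (w @ X[i] + b) <= delta:
                w += y[i] * X[i]
                b += y[i]
                updated = True
        if not updated:     # no mistakes in a full pass: converged
            return w, b
    return w, b

# Linearly separable toy data: labels follow the first coordinate.
X = np.array([[2.0, 1.0], [1.0, -1.0], [-2.0, 1.0], [-1.0, -2.0]])
y = np.array([1, 1, -1, -1])
w, b = perceptron(X, y)
print(np.all(np.sign(X @ w + b) == y))  # → True
```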

Uniqueness

The perceptron algorithm only guarantees (given the right convergence condition) finding some solution, which may not necessarily be the best one.

Termination

The perceptron algorithm can terminate when one of the following is true:

  1. A full pass over the training data produces no update (every example satisfies $y_i(w^T x_i + b) > \delta$)
  2. A preset maximum number of iterations has been reached

Padding & Pre-Multiplication

Goal: Find $w, b$ such that, for all $i \in \{1, 2, \ldots, n\}$,

$$y_i = \text{sign}((w \cdot x_i) + b) = \text{sign}(\langle w, x_i \rangle + b)$$

where $\langle A, B \rangle$ represents the dot product of vectors $A$ and $B$.

Thus,

$$y_i = \text{sign}(\langle w, x_i \rangle + b) = \text{sign}(\langle (w, b), (x_i, 1) \rangle)$$

where $(A, b)$ represents the vector $A$ padded with the scalar $b$.

$$
(w, b) = \begin{bmatrix} w_1 \\ w_2 \\ w_3 \\ \vdots \\ w_d \\ b \end{bmatrix}, \quad
(x_i, 1) = \begin{bmatrix} x_{i,1} \\ x_{i,2} \\ x_{i,3} \\ \vdots \\ x_{i,d} \\ 1 \end{bmatrix}, \quad
\langle (w, b), (x_i, 1) \rangle = w_1 x_{i,1} + w_2 x_{i,2} + \cdots + w_d x_{i,d} + b
$$

Thus,

$$y_i = \text{sign}(\langle (w, b), (x_i, 1) \rangle) = \text{sign}(\langle z, (x_i, 1) \rangle)$$

where $z = (w, b)$.

Thus,

$$y_i = \text{sign}(\langle z, (x_i, 1) \rangle) \iff y_i \langle z, (x_i, 1) \rangle > 0 \iff \langle z, y_i (x_i, 1) \rangle > 0 \iff \langle z, a_i \rangle > 0$$

where $a_i = y_i (x_i, 1)$.

Thus, our goal is,

$$Az > 0$$

where A is the matrix with rows ai:

$$
A = \begin{bmatrix} a_1^T \\ a_2^T \\ \vdots \\ a_n^T \end{bmatrix}
  = \begin{bmatrix}
      y_1 x_{1,1} & y_1 x_{1,2} & \cdots & y_1 x_{1,d} & y_1 \\
      y_2 x_{2,1} & y_2 x_{2,2} & \cdots & y_2 x_{2,d} & y_2 \\
      \vdots & \vdots & \ddots & \vdots & \vdots \\
      y_n x_{n,1} & y_n x_{n,2} & \cdots & y_n x_{n,d} & y_n
    \end{bmatrix}
$$
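The padded, pre-multiplied construction of $A$ can be checked numerically. The dataset and the candidate $z = (w, b)$ below are assumptions chosen by hand for illustration.

```python
import numpy as np

# Assumed toy dataset: labels follow the first coordinate.
X = np.array([[2.0, 1.0], [1.0, -1.0], [-2.0, 1.0], [-1.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Rows a_i = y_i * (x_i, 1): pad each x_i with 1, then pre-multiply by y_i.
A = y[:, None] * np.hstack([X, np.ones((len(X), 1))])

# Any z = (w, b) that classifies every point correctly satisfies Az > 0;
# this z is a hand-picked separator for the data above.
z = np.array([2.0, 1.0, 1.0])
print(A @ z)              # → [6. 2. 2. 3.]
print(np.all(A @ z > 0))  # → True
```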

Linear Separability

A dataset is linearly separable if and only if there exists

$$z = (w, b)$$

such that

$$\langle z, a_i \rangle \ge s > 0$$

for all $i \in \{1, 2, \ldots, n\}$, for some constant $s \in \mathbb{R}^+$.

Equivalently,

$$Az \ge s\mathbf{1}$$

where $\mathbf{1}$ is the all-ones vector.

Error Bound & Margin

For a linearly separable dataset, the perceptron will converge within a number of steps (mistakes) bounded by

$$R^2 \left( \frac{\|z\|_2^2}{s^2} \right)$$

where $\|\cdot\|_2$ is the $L_2$ norm operator and $R = \max_i \|a_i\|_2$.

Given our goal is to minimize the number of steps,

$$
\min_{(z,s) :\ Az \ge s\mathbf{1}} \frac{\|z\|_2^2}{s^2}
= \min_{(z,s) :\ \|z\|_2 = 1,\ Az \ge s\mathbf{1}} \frac{1}{s^2}
= \frac{1}{\left( \max_{(z,s) :\ \|z\|_2 = 1,\ Az \ge s\mathbf{1}} s \right)^2}
= \left( \frac{1}{\underbrace{\max_{\|z\|_2 = 1} \min_i \langle z, a_i \rangle}_{\text{margin } \gamma}} \right)^2
$$
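These quantities can be sketched numerically on an assumed toy dataset. The margin $\gamma$ is approximated here by brute-force search over random unit vectors, not an exact solver, so the computed bound is only an estimate.

```python
import numpy as np

# Assumed toy dataset, with rows a_i = y_i * (x_i, 1) as before.
X = np.array([[2.0, 1.0], [1.0, -1.0], [-2.0, 1.0], [-1.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
A = y[:, None] * np.hstack([X, np.ones((len(X), 1))])

R = np.max(np.linalg.norm(A, axis=1))  # R = max_i ||a_i||_2

# gamma = max_{||z||_2 = 1} min_i <z, a_i>, approximated by sampling
# many random unit vectors z and keeping the best one.
rng = np.random.default_rng(0)
Z = rng.normal(size=(100_000, 3))
Z /= np.linalg.norm(Z, axis=1, keepdims=True)
gamma = np.max(np.min(Z @ A.T, axis=1))

bound = (R / gamma) ** 2  # mistake bound R^2 / gamma^2
print(f"R = {R:.3f}, gamma ~ {gamma:.3f}, bound ~ {bound:.1f}")
```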

Multiclass Classification

Multiclass Classification is the process of classifying data when there are more than two label classes. Two approaches are commonly taken in that case.

One vs All

Train one binary classifier per class, treating that class as $+1$ and all other classes as $-1$; predict the class whose classifier produces the largest score.
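A minimal sketch of One vs All using the perceptron as the base binary learner; the toy three-class data and the fixed-epoch training loop are assumptions for illustration.

```python
import numpy as np

def train_perceptron(X, y, epochs=100):
    # Basic perceptron on labels in {-1, +1}.
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:
                w, b = w + yi * xi, b + yi
    return w, b

def one_vs_all(X, y, classes):
    # One binary problem per class: class k vs the rest.
    return {k: train_perceptron(X, np.where(y == k, 1.0, -1.0)) for k in classes}

def predict(models, x):
    # Predict the class with the largest score w_k . x + b_k.
    return max(models, key=lambda k: models[k][0] @ x + models[k][1])

# Assumed toy 3-class data: one cluster per direction.
X = np.array([[3.0, 0.0], [2.5, 0.5], [0.0, 3.0],
              [0.5, 2.5], [-3.0, -3.0], [-2.5, -2.0]])
y = np.array([0, 0, 1, 1, 2, 2])
models = one_vs_all(X, y, classes=[0, 1, 2])
print([predict(models, x) for x in X])  # → [0, 0, 1, 1, 2, 2]
```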

One vs One

Train one binary classifier for each pair of classes ($\binom{K}{2}$ classifiers for $K$ classes); predict the class that wins the most pairwise comparisons.