Information Theory

Shannon Entropy

Shannon Entropy H(X) of a distribution is the expected amount of information in an event drawn from that distribution. It gives a lower bound on the average number of bits needed to encode symbols drawn from the distribution P.

H(X) = -\sum_{x \in X} p(x) \log_2 [p(x)]

The Shannon Entropy therefore measures the average uncertainty, or surprise, of outcomes drawn from the distribution.
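
As a minimal sketch of the formula above, the snippet below computes the entropy in bits of a discrete distribution given as a probability vector. The helper name shannon_entropy and the example probabilities are illustrative assumptions, not taken from the text.

```python
import numpy as np

def shannon_entropy(p):
    """Entropy in bits of a discrete distribution given as a probability vector."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # skip zero-probability outcomes (0 * log 0 = 0)
    return -np.sum(p * np.log2(p))

# A fair coin carries 1 bit of information per flip; a biased coin carries less.
print(shannon_entropy([0.5, 0.5]))   # 1.0
print(shannon_entropy([0.9, 0.1]))   # ~0.469
```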

Kullback-Leibler Divergence

Kullback-Leibler Divergence KL is a method for measuring the dissimilarity between two probability distributions P and Q. It can also be seen as the Relative Entropy between the two distributions.

KL(P \,||\, Q) = \sum_{i=1}^{N} P(i) \log \left[ \frac{P(i)}{Q(i)} \right]
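The sum translates directly into a short NumPy sketch. The name kl_divergence and the example distributions are assumptions for illustration; the code also assumes Q(i) > 0 wherever P(i) > 0.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(P || Q) in nats for discrete distributions p and q over the same support."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0                      # terms with P(i) = 0 contribute nothing
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = [0.4, 0.4, 0.2]
q = [0.3, 0.3, 0.4]
print(kl_divergence(p, q))           # not symmetric: KL(P||Q) != KL(Q||P)
print(kl_divergence(q, p))
```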

Mutual Information

Mutual Information MI between two random variables X and Y measures the dissimilarity between the joint distribution p(X, Y) and the factored distribution p(X)p(Y). Mutual Information also measures the reduction in uncertainty about one variable given a known value of the other variable.

MI(X, Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \left[ \frac{p(x, y)}{p(x)\, p(y)} \right]
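A sketch of the double sum, assuming the joint distribution is supplied as a probability table with rows indexed by x and columns by y. The name mutual_information and the toy tables are illustrative assumptions.

```python
import numpy as np

def mutual_information(joint):
    """MI(X, Y) in nats from a joint probability table p(x, y) (rows = x, cols = y)."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)   # marginal p(x)
    py = joint.sum(axis=0, keepdims=True)   # marginal p(y)
    mask = joint > 0
    return np.sum(joint[mask] * np.log(joint[mask] / (px * py)[mask]))

# Independent variables give MI = 0; perfectly dependent variables give MI = H(X).
independent = np.outer([0.5, 0.5], [0.5, 0.5])
dependent = np.array([[0.5, 0.0],
                      [0.0, 0.5]])
print(mutual_information(independent))  # 0.0
print(mutual_information(dependent))    # ln(2) ~ 0.693
```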

Information Gain

Information Gain IG measures the reduction in entropy, or surprise, achieved by splitting a dataset according to a given value of a random variable.

IG(Y, X) = H(Y) - H(Y \mid X)
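
A sketch of this difference computed from a vector of labels y and a splitting feature x, as in a decision-tree split. The helper names (entropy, information_gain) and the toy data are assumptions for illustration.

```python
import numpy as np

def entropy(labels):
    """H(Y) in bits from a sequence of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(y, x):
    """IG(Y, X) = H(Y) - H(Y | X), where x is the feature used to split y."""
    y, x = np.asarray(y), np.asarray(x)
    h_y_given_x = 0.0
    for value in np.unique(x):
        subset = y[x == value]                          # labels falling in this split
        h_y_given_x += len(subset) / len(y) * entropy(subset)
    return entropy(y) - h_y_given_x

# A feature that perfectly separates the labels recovers all of H(Y).
y = [0, 0, 1, 1]
x = ['a', 'a', 'b', 'b']
print(information_gain(y, x))   # 1.0
```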