

In information theory, the entropy of a random variable is the average level of "information", "surprise", or "uncertainty" inherent to the variable's possible outcomes. Given a discrete random variable $X$, which takes values in the alphabet $\mathcal{X}$ and is distributed according to $p: \mathcal{X} \to [0, 1]$:

$$H(X) := -\sum_{x \in \mathcal{X}} p(x) \log p(x) = \mathbb{E}[-\log p(X)]$$

— Wikipedia, Entropy (information theory)
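The definition translates directly into code. A minimal sketch (the helper name `entropy` is mine; log base 2 is used so the result is in bits):

```python
import math

def entropy(p):
    """Shannon entropy H(X) = -sum p(x) log2 p(x), in bits.
    p is a list of probabilities over the alphabet; terms with
    p(x) = 0 contribute 0, by the convention 0 log 0 = 0."""
    return -sum(px * math.log2(px) for px in p if px > 0)

# A fair coin carries exactly 1 bit of surprise per flip.
print(entropy([0.5, 0.5]))  # → 1.0
```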

If $P = 0$, the code will be all zeros.

What information can we send to the friend? Very little. The entropy $H$ is very low, and the information $I$ is also very low.

If $P = 1$, the code will be all ones, so again $H$ and $I$ are very low.
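The two degenerate cases above can be checked numerically. A sketch with a hypothetical `binary_entropy` helper for a coin with $P(\text{one}) = p$:

```python
import math

def binary_entropy(p):
    """H for a binary source with P(1) = p, in bits; 0 log 0 := 0."""
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)

for p in (0.0, 0.5, 1.0):
    print(p, binary_entropy(p))
# P = 0 and P = 1 give H = 0 (the message is fully predictable);
# P = 0.5 gives the maximum, 1 bit per symbol.
```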

$$H(x) = -\sum_{x \in \mathcal{X}} p(x) \ln p(x)$$

The joint entropy of random variables $X$ and $Y$ is

$$H(x, y) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \ln p(x, y)$$
  • Log base 2 for computer science (entropy in bits).
  • Log base $e$ for physics and mathematics (entropy in nats).
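The joint entropy is just the single-variable formula applied to the table of joint probabilities. A sketch (natural log, matching the formulas here; the helper name is illustrative):

```python
import math

def joint_entropy(pxy):
    """H(X,Y) = -sum_{x,y} p(x,y) ln p(x,y), in nats.
    pxy is a 2-D list: rows indexed by x, columns by y."""
    return -sum(p * math.log(p) for row in pxy for p in row if p > 0)

# Two independent fair bits: H(X,Y) = H(X) + H(Y) = 2 ln 2 nats.
pxy = [[0.25, 0.25],
       [0.25, 0.25]]
print(joint_entropy(pxy))  # → 2 ln 2 ≈ 1.386
```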

The conditional entropy is defined similarly:

$$H(y \mid x) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \ln p(y \mid x)$$
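Since $p(y \mid x) = p(x, y) / p(x)$, the conditional entropy can be computed from the joint table alone. A sketch (helper name illustrative):

```python
import math

def conditional_entropy(pxy):
    """H(Y|X) = -sum_{x,y} p(x,y) ln p(y|x), with p(y|x) = p(x,y)/p(x)."""
    h = 0.0
    for row in pxy:        # one row per value of x
        px = sum(row)      # marginal p(x)
        for p in row:
            if p > 0:
                h -= p * math.log(p / px)
    return h

# Y is an exact copy of X: knowing X leaves no uncertainty about Y.
pxy = [[0.5, 0.0],
       [0.0, 0.5]]
print(conditional_entropy(pxy))  # → 0.0
```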

Then we can calculate the mutual information:

$$I(x, y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \ln \frac{p(x, y)}{p(x)\, p(y)}$$

This is exactly the KL divergence between the joint distribution and the product of the marginals: $I(x, y) = \mathrm{KL}\big(p(x, y) \,\|\, p(x)\, p(y)\big)$.

How close are $X$ and $Y$ to being independent? If the mutual information is small, then they are almost independent.

Also, $I(X, Y) = H(Y) - H(Y \mid X)$.
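The identity $I(X, Y) = H(Y) - H(Y \mid X)$ can be verified numerically on any joint table. A sketch (all helper names illustrative; natural log throughout):

```python
import math

def mutual_information(pxy):
    """I(X,Y) = sum_{x,y} p(x,y) ln [ p(x,y) / (p(x) p(y)) ]."""
    px = [sum(row) for row in pxy]
    py = [sum(col) for col in zip(*pxy)]
    return sum(p * math.log(p / (px[i] * py[j]))
               for i, row in enumerate(pxy)
               for j, p in enumerate(row) if p > 0)

def entropy(p):
    return -sum(q * math.log(q) for q in p if q > 0)

def conditional_entropy(pxy):
    return -sum(p * math.log(p / sum(row)) for row in pxy for p in row if p > 0)

# A correlated joint distribution: X and Y agree 80% of the time.
pxy = [[0.4, 0.1],
       [0.1, 0.4]]
py = [sum(col) for col in zip(*pxy)]
print(mutual_information(pxy))                    # ≈ 0.193 nats
print(entropy(py) - conditional_entropy(pxy))     # same value
```

Note that for an independent table the ratio $p(x,y)/(p(x)p(y))$ is 1 everywhere, so $I = 0$, matching the independence interpretation above.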