Cross Entropy
In information theory, low-probability events carry more information, and high-probability events carry less information.
- Low Probability Event (surprising): More information.
- Higher Probability Event (unsurprising): Less information.
The information content of an event $x$ can be expressed as $h(x) = -\log(P(x))$. For example, using log base 2, a fair coin flip carries $-\log_2(0.5) = 1$ bit of information.
Entropy: the average number of bits needed to transmit a random event drawn from a probability distribution. The more uniform the distribution, the higher its entropy.
Entropy can be written as: $H(X) = - \sum_x p(x)\log(p(x))$
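A minimal NumPy sketch of the entropy formula above (the helper name `entropy` and the choice of base-2 logarithms for bits are illustrative assumptions, not from the original), showing that a more uniform distribution has higher entropy:

```python
import numpy as np

def entropy(p):
    """Entropy in bits: H(X) = -sum_x p(x) * log2(p(x))."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # treat 0 * log(0) as 0
    return -np.sum(p * np.log2(p))

print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits -- uniform, maximal for 4 outcomes
print(entropy([0.7, 0.1, 0.1, 0.1]))      # ~1.36 bits -- less uniform, lower entropy
```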
… the cross entropy is the average number of bits needed to encode data coming from a source with distribution p when we use model q …
If we think of a distribution as the tool we use to encode symbols, then entropy measures the number of bits we'll need if we use the correct tool $y$. This is optimal, in that we can't encode the symbols using fewer bits on average.
In contrast, cross entropy is the number of bits we'll need if we encode symbols from $y$ using the wrong tool $\hat{y}$. This consists of encoding the $i$-th symbol using $\log\frac{1}{\hat{y}_i}$ bits instead of $\log\frac{1}{y_i}$ bits. We of course still take the expected value with respect to the true distribution $y$, since it's the distribution that truly generates the symbols. Cross entropy can be computed as: $$ H(P, Q) = - \sum_x p(x)\log(q(x)) $$
where $P$ is the target (true) distribution and $Q$ is the approximating distribution.
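A small sketch of the cross entropy formula above, again assuming NumPy and base-2 logs; the helper name `cross_entropy` and the example distributions are illustrative only:

```python
import numpy as np

def cross_entropy(p, q):
    """Cross entropy in bits: H(P, Q) = -sum_x p(x) * log2(q(x))."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with p(x) = 0 contribute nothing
    return -np.sum(p[mask] * np.log2(q[mask]))

p = [0.25, 0.25, 0.25, 0.25]  # target (true) distribution P
q = [0.7, 0.1, 0.1, 0.1]      # approximating distribution Q
print(cross_entropy(p, p))  # 2.0 bits -- equals the entropy of P (the "correct tool")
print(cross_entropy(p, q))  # ~2.62 bits -- encoding with the wrong tool Q costs more
```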
KL Divergence
The KL divergence from $\hat{y}$ to $y$ is simply the difference between cross entropy and entropy:
$$\mathrm{KL}(y \,\|\, \hat{y}) = \sum_i y_i \log\frac{1}{\hat{y}_i} - \sum_i y_i \log\frac{1}{y_i} = \sum_i y_i \log\frac{y_i}{\hat{y}_i}$$
It measures the number of extra bits we'll need on average if we encode symbols from $y$ according to $\hat{y}$; you can think of it as a bit tax for encoding symbols from $y$ with an inappropriate distribution $\hat{y}$. It's never negative, and it's 0 only when $y$ and $\hat{y}$ are the same.
Note that minimizing cross entropy is the same as minimizing the KL divergence from $\hat{y}$ to $y$. (They're equivalent up to an additive constant, the entropy of $y$, which doesn't depend on $\hat{y}$.)
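A minimal sketch of the KL divergence under the same assumptions (NumPy, base-2 logs, hypothetical helper name `kl_divergence`), showing the "bit tax" equals cross entropy minus entropy and vanishes when the two distributions coincide:

```python
import numpy as np

def kl_divergence(p, q):
    """KL(P || Q) in bits: sum_x p(x) * log2(p(x) / q(x))."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with p(x) = 0 contribute nothing
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

p = [0.25, 0.25, 0.25, 0.25]  # true distribution
q = [0.7, 0.1, 0.1, 0.1]      # approximating distribution
print(kl_divergence(p, q))  # ~0.62 bits -- cross entropy (~2.62) minus entropy (2.0)
print(kl_divergence(p, p))  # 0.0 -- no extra cost when the distributions match
```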