# Entropy

The most basic notion of entropy is the minimum number of bits required by identifiers, called **codes**, if we assign a unique code to each element of a set \(\mathcal{U}\) and all the codes are of the same length.

notation:

- \(\mathcal{U}\) := a set of elements
- \(\mathcal{l}\) := length

## Worst Case entropy

This basic notion is the worst case entropy of \(\mathcal{U}\) and is denoted \(\mathcal{H}_{uc}(\mathcal{U})\). See that \[ \mathcal{H}_{wc}(\mathcal{U}) = \log|\mathcal{U}| \] bits (log is logarithm in base 2): if we used codes of length \(\mathcal{l} < \mathcal{H}_{wc}(\mathcal{U})\), we would only have \[ 2^{\mathcal{l}} < 2^{\mathcal{H}_{wc}(\mathcal{U})} = 2^{\log|\mathcal{U}|} = |\mathcal{U}| \] distinct codes, which would be insufficient for giving a distinct code to each element in \(\mathcal{U}\).

Therefore, if all the codes have the same length, this length must be at least \(\lceil\mathcal{H}_{wc}(\mathcal{U})\rceil\) bits. If they are of different lengths, then the longest ones still must use at least \(\lceil\mathcal{H}_{wc}(\mathcal{U})\rceil\) bits. This explains the adjective “worst-case.”

Unlike the examples of bit sequences and strings, the classical representations of trees are very far from this number; actually they use at least n pointers. Since such pointers must distinguish among the n different nodes, by the same reasoning on the entropy they must use at least log n bits. Therefore, classical representations of trees use at least n log n bits, whereas 2n bits should be sufficient to represent any tree of n nodes.

Indeed, it is not very difficult to encode a general tree using 2n bits: traverse the tree in depth-first-order, writing a 1 upon arriving at a node, and a 0 when leaving it in the recursion. It is an easy exercise to see that one can recover the tree from the resulting stream of 2n 1s and 0s.

It is much more difficult, however, to navigate this representation in constant time per operation; for example, going to the parent or to a child of a node. This is where Compact Data Structures are handy.

### Shannon Entropy

In classical information theory, one assumes that there is an infinite source that emits elements \(u \in U\) with probabilities \(Pr(u)\). By using codes of varying lengtsh of \(\mathcal{l}(u)\) bits (**shorter codes for more probable elements**), one can reduce the average length of the codes:
\[
\bar{\mathcal{l}} = \sum_{u \in U}Pr(u) \cdot \mathcal{l}(u)
\]
**The natural question is how to assign codes that minimize the average code length.** Fundamentally the minimum possible average code length for codes cna be univocally decoded as
\[
\mathcal{H}(Pr) = \sum_{u \in U}Pr(u) \cdot \log\frac{1}{Pr(u)}
\]
which is the ** Shannon entropy** of the

**\(Pr : U \rightarrow [0.0, 1.0]\). The measure \(H(Pr)\) can then be interpreted as how many bits of**

**probability distribution****information**are contained in

**each element emitted by the source**. The more biased the probability distribution, the more predictable, or less “surprising” the output of the source is, and the

**less information is carried by its elements**. In particular, if one probability tends to 1.0 and all the others tend to 0.0, then the measure tends to be 0.

\(\mathcal{H}(Pr)\) also suggests that an optimal code for \(u\) should be \(\log\frac{1}{Pr(u)}\) bits long. It can further be proved that no other choice of code lengths reaches the optimal average code length \(\mathcal{H}(Pr)\). In other words, more probable elements should receive shorter codes.

No matter how codes are assigned to reduce the average code length, the length of the longest code is still at least \(\lceil\mathcal{H}_{wc}(U)\rceil\).

When all probabilities are equal, shannon entropy is **maximized**, reaching precisely \(\log|U|\). This means the optimum code length is the same length for all elements, and is said to be **incompressible**. This coincides with the worst case entropy measure.