----------------------------------------------
S13j. Priors and entropy in probability theory
----------------------------------------------
For a probability distribution on a finite set of alternatives,
given by probabilities p_n summing to 1, the Shannon entropy is
defined by
S = - sum p_n log_2 p_n.
The main use of the entropy concept is the maximum entropy principle,
used to define various interesting ensembles by maximizing the entropy
subject to constraints defined by known expectation values
<f> = sum p_n f(n)
for certain key observables f.
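As a minimal numerical sketch (the function name is mine, not the source's), the Shannon entropy of a finite distribution, with the convention that terms with p_n = 0 contribute zero:

```python
import math

def shannon_entropy(p):
    """Shannon entropy in bits: S = -sum p_n log2 p_n (zero terms skipped)."""
    return -sum(q * math.log2(q) for q in p if q > 0)

# With no constraints beyond normalization, the uniform distribution
# maximizes the entropy: log2(N) bits for N alternatives.
print(shannon_entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0
print(shannon_entropy([0.7, 0.1, 0.1, 0.1]))      # strictly less than 2.0
```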
If the number of alternatives is infinite, this formula must be
appropriately generalized. In the literature, one finds various
possibilities, the most common being, for random vectors with
probability density p(x), the absolute entropy
S = - k_B integral dx p(x) log p(x)
with the Boltzmann constant k_B and Lebesgue measure dx.
The value of the Boltzmann constant k_B is conventional and has no
effect on the use of entropy in applications.
There is also the relative entropy
S = - k_B integral dx p(x) log (p(x)/p_0(x)),
which involves an arbitrary positive function p_0(x). If p_0(x)
is a probability density then the relative entropy is nonpositive
(it is minus the Kullback-Leibler divergence), vanishing precisely
when p(x)=p_0(x).
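A discrete analogue of this integral can be sketched as follows (the function name is mine). Since the quantity is minus the Kullback-Leibler divergence, it is at most zero whenever p_0 is itself normalized:

```python
import math

def relative_entropy(p, p0, k_B=1.0):
    """Discrete analogue of S = -k_B integral dx p(x) log(p(x)/p_0(x))."""
    return -k_B * sum(pi * math.log(pi / qi) for pi, qi in zip(p, p0) if pi > 0)

# Equals zero exactly when p = p0; otherwise negative (p0 normalized).
print(relative_entropy([0.5, 0.5], [0.5, 0.5]))  # 0.0
print(relative_entropy([0.9, 0.1], [0.5, 0.5]))  # negative
```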
For a probability distribution over an _arbitrary_ sigma algebra
of events, the absolute entropy makes no sense since there is no
distinguished measure and hence no meaningful absolute probability
density. One needs to assume a measure to be able to define a
probability density (namely as the Radon-Nikodym derivative,
assuming it exists). This measure is called the prior (it is often
improper = not normalizable to a probability density).
Once one has specified a prior dmu,
<f> = integral dmu(x) rho(x) f(x)
defines the density rho(x), and then
S(rho)= <-k_B log(rho(x))>
defines the entropy with respect to this prior. Note that the
condition for rho to define a probability density is
integral dmu(x) rho(x) = <1> = 1.
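In a discrete setting this bookkeeping can be sketched as follows (function names and the example prior are mine, not the source's): the prior is a list of weights mu_n, the density is rho(n) = p_n/mu_n, and the entropy with respect to the prior is the expectation of -k_B log rho:

```python
import math

def density(p, mu):
    """Density rho(n) = p_n / mu_n of probabilities p w.r.t. prior weights mu."""
    return [pi / mi for pi, mi in zip(p, mu)]

def entropy_wrt_prior(p, mu, k_B=1.0):
    """S(rho) = <-k_B log rho(x)> = -k_B sum_n p_n log(p_n / mu_n)."""
    return -k_B * sum(pi * math.log(pi / mi) for pi, mi in zip(p, mu) if pi > 0)

p = [0.5, 0.3, 0.2]
mu = [1.0, 1.0, 1.0]   # counting-measure prior, chosen for the example
rho = density(p, mu)

# Normalization check: integral dmu(x) rho(x) = sum_n mu_n rho_n = <1> = 1
print(sum(m * r for m, r in zip(mu, rho)))  # 1.0
```

With the counting-measure prior the entropy reduces to -k_B sum_n p_n log p_n, i.e., the Shannon entropy up to the choice of k_B.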
In many cases, symmetry considerations suggest a unique natural prior.
For random variables on a locally compact homogeneous space (such as
the real line, the circle, n-dimensional space or the n-dimensional
sphere), the conventional measure is the invariant Haar measure.
In particular, for probability theory of finitely many alternatives,
it is conventional to consider the symmetric group on the set of
alternatives and take as the (proper) prior the uniform measure, giving
<f> = sum_x rho(x) f(x)
The density rho(x) agrees with the probability p_x, and the
corresponding entropy is the Shannon entropy if one takes k_B=1/log 2.
For random variables whose support is R or R^n, the conventional
symmetry group is the translation group, and the corresponding
(improper) prior is the Lebesgue measure. In this case one obtains
the absolute entropy given above. But one could also take as prior
a noninvariant measure
dmu(x) = dx p_0(x);
then the density becomes rho(x)=p(x)/p_0(x), and one arrives at the
relative entropy.
If there is no natural transitive symmetry group, there is no natural
prior, and one has to make other useful choices. In particular, this
is the case for random natural numbers.
Choice A. Treating the natural numbers as a limiting case of the
finite intervals [0:n] suggests using the measure with
integral dmu(x) phi(x) = sum_n phi(n)
as (improper) prior, making
<f> = sum_n rho(n) f(n)
the definition of the density; in this case, p_n=rho(n) is the
probability of getting n.
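A small numerical sketch of choice A (the function name is mine): by a standard maximum entropy computation, the counting prior with a prescribed mean m yields a geometric distribution p_n = (1-q)q^n, with q = m/(1+m) so that the mean q/(1-q) equals m. The infinite sum is truncated where the tail is negligible:

```python
def geometric_pmf(mean, nmax):
    """Geometric distribution p_n = (1-q) q^n on n = 0, 1, ..., nmax,
    with q = mean/(1+mean) so that the (untruncated) mean is q/(1-q)."""
    q = mean / (1.0 + mean)
    return [(1 - q) * q**n for n in range(nmax + 1)]

p = geometric_pmf(2.0, 200)
print(sum(p))                                  # ~1.0 (tiny truncation error)
print(sum(n * pn for n, pn in enumerate(p)))   # ~2.0, the prescribed mean
```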
Choice B. Statistical mechanics suggests using instead as (proper)
prior a measure with
integral dmu(x) phi(x) = sum_n h^n phi(n)/n!,
where h is Planck's constant, making
<f> = sum_n rho(n) h^n f(n)/n!
the definition of the density; in this case, p_n=h^n rho(n)/n! is the
probability of getting n.
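Similarly for choice B (names are mine; h = 1 is an arbitrary convention for this sketch): the maximum entropy distribution at prescribed mean is Poisson, and the corresponding density rho(n) = p_n n!/h^n is exponential in n, the characteristic form of a maximum entropy solution with a single mean constraint:

```python
import math

def poisson_pmf(mean, nmax):
    """Poisson distribution p_n = exp(-lam) lam^n / n! with mean lam."""
    lam = mean
    return [math.exp(-lam) * lam**n / math.factorial(n) for n in range(nmax + 1)]

h = 1.0    # arbitrary unit for the sketch
p = poisson_pmf(2.0, 60)
rho = [pn * math.factorial(n) / h**n for n, pn in enumerate(p)]

# rho(n) = exp(-lam) (lam/h)^n grows by a constant factor lam/h per step,
# i.e., it has the exponential form exp(a + b*n) of a maxent solution.
print(rho[5] / rho[4])                         # lam/h = 2.0
print(sum(p))                                  # ~1.0
print(sum(n * pn for n, pn in enumerate(p)))   # ~2.0, the prescribed mean
```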
The maximum entropy ensemble defined by given expectations depends on
the prior chosen. In particular, if the mean of a random natural number
is given, choice A leads to a geometric distribution, while
choice B leads to a Poisson distribution. The latter is the one
relevant for statistical mechanics. Indeed, choice B is the prior
needed in statistical mechanics of systems with an indefinite
number n of particles to get the 'correct Boltzmann counting' in the
grand canonical ensemble. With choice A, the maximum entropy
solution is unrelated to the distributions arising in statistical
mechanics.
Thus while, for a given mean, the geometric distribution has greater
Shannon entropy than the Poisson distribution, this is irrelevant for
classical physics.
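The entropy comparison at equal mean can be checked numerically (the function name is mine; both distributions are truncated where their tails are negligible):

```python
import math

def shannon_entropy_bits(p):
    """Shannon entropy in bits, skipping zero terms."""
    return -sum(q * math.log2(q) for q in p if q > 0)

mean = 2.0
qg = mean / (1 + mean)
geom = [(1 - qg) * qg**n for n in range(201)]
poiss = [math.exp(-mean) * mean**n / math.factorial(n) for n in range(61)]

# The geometric distribution maximizes Shannon entropy among all
# distributions on the natural numbers with the given mean.
print(shannon_entropy_bits(geom) > shannon_entropy_bits(poiss))  # True
```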
In statistical physics with an indeterminate number of particles,
only the relative entropy corresponding to choice B is meaningful.
(In the quantum physics of systems with discrete spectrum, however,
the microcanonical ensemble is the right prior, and then Shannon's
entropy is the correct one.)
The identification of 'information' and 'Shannon entropy'
is dubious for situations with infinitely many alternatives.
Shannon assumes in his analysis that without knowledge, all
alternatives are equally likely, which makes no sense in the infinite
case, and may even be debated in the finite case.
(One of the problems of a subjective, Bayesian approach to
probability is that one always needs a prior before information
theoretic arguments make sense. If there is doubt about the prior,
the results become doubtful, too. Since information theory in
statistical mechanics works out correctly _only_ if one uses the
right prior (choice B) and the right knowledge (expectations of
the additive conserved quantities in the equilibrium case),
both the prior and the knowledge are objectively determined.
But this is strange for a subjective approach such as the information
theoretic one, and casts doubt on the relevance of information
theory in the foundations.)