![]() |
|
|||||||||||||||||
Minimum description lengthThe minimum description length principle is a formalization of Occam's Razor in which the best hypothesis for a given set of data is the one that leads to the largest compression of the data. MDL is important in information theory and learning theory. Any set of data can be represented by a string of symbols from a finite (say, binary) alphabet. "The fundamental idea behind the MDL Principle is that any regularity in a given set of data can be used to compress the data, i.e. to describe it using fewer symbols than needed to describe the data literally." (Grünwald, 1998. See the link below.) Since we want to select the hypothesis that captures the most regularity in the data, we look for the hypothesis with which the best compression can be achieved. In order to do this, we must first fix a code to compress the data. The most general way to do this is to choose a (Turing-complete) computer language. We then write a program in that language, that outputs the data. This program thus represents the data. The length of the shortest program that outputs the data is called the Kolmogorov complexity of the data. This is the central idea of Ray Solomonoff's idealized theory of inductive inference. However, this mathematical theory does not provide a practical way of doing inference. The most important reasons for this are:
MDL is an attempt to remedy these, by:
Rather than "programs", in MDL theory one usually speaks of candidate hypotheses, models or codes. The set of allowed codes is then called the model class. (To confuse matters, some authors refer to the model class as the model.) The code of which the description, together with the description of the data, has the shortest length is then selected. MDL was not the first attempt to do hypothesis selection through minimizing description length; as early as 1968 Wallace and Boulton pioneered a related concept called Minimum Message Length (MML). MDL was introduced by Jorma Rissanen in 1978; it differs from MML in several ways, most notably (at least in most of J. Rissanen's early MDL papers) in its extensive use of one-part rather than two-part codes. Central to MDL theory is the 1-1 correspondence between code length functions and probability distributions. (The lemma involved is the Kraft-McMillan inequality.) For any probability distribution P, it is possible to construct a code C such that the length (in bits) of C(x) is equal to - log2P(x); this code minimizes the expected code length. Vice versa, given a code C, one can construct a probability distribution P such that the same holds. (Rounding issues are ignored here.) In other words, searching for an efficient code reduces to searching for a good probability distribution, and vice versa. This has led some researchers to view MDL as being equivalent to Bayesian inference. Code length of the model and code length of model and data together in the MDL framework correspond to prior probability and marginal likelihood respectively in the Bayesian framework. This point of view is expressed for example in David MacKay's Information Theory, Inference, and Learning Algorithms (see link below). However, while Bayesian machinery is often useful in constructing efficient MDL codes, the MDL framework sometimes uses other codes that do not fit into a Bayesian framework. An example is the Shtarkov `normalized maximum likelihood code', which plays a central role in current MDL theory, but has no equivalent in Bayesian inference. Furthermore, the MDL Principle prefers some priors over others. While the same priors tend to be favored in so-called objective Bayesian analysis, they are favored for different reasons. External links
The contents of this article are licensed from Wikipedia.org under the GNU Free Documentation License.
How to see transparent copy 01-04-2007 01:21:04 |
|





