Minimum description length

Model selection principle

Minimum Description Length (MDL) is a model selection principle where the shortest description of the data is the best model. MDL methods learn through a data compression perspective and are sometimes described as mathematical applications of Occam's razor. The MDL principle can be extended to other forms of inductive inference and learning, for example to estimation and sequential prediction, without explicitly identifying a single model of the data.

MDL has its origins mostly in information theory and has been further developed within the general fields of statistics, theoretical computer science and machine learning, and more narrowly computational learning theory.

Historically, there are different, yet interrelated, usages of the definite noun phrase "the minimum description length principle" that vary in what is meant by description:

Within Jorma Rissanen's theory of learning, a central concept of information theory, models are statistical hypotheses and descriptions are defined as universal codes.
Rissanen's pragmatic first attempt (1978)^[1] to automatically derive short descriptions relates to the Bayesian Information Criterion (BIC).
Within Algorithmic Information Theory, the description length of a data sequence is the length of the smallest program that outputs that data set. In this context, it is also known as the 'idealized' MDL principle, and it is closely related to Solomonoff's theory of inductive inference, which holds that the best model of a data set is represented by its shortest self-extracting archive.

Statistical MDL learning

Any set of data can be represented by a string of symbols from a finite (say, binary) alphabet.

[The MDL Principle] is based on the following insight: any regularity in a given set of data can be used to compress the data, i.e. to describe it using fewer symbols than needed to describe the data literally. (Grünwald, 2004)^[5]
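As a rough illustration of this insight, the sketch below uses Python's general-purpose `zlib` compressor as a stand-in for a description method. This is an assumption made purely for illustration: MDL does not prescribe zlib, and the variable names are hypothetical. A highly regular byte string shrinks to a small fraction of its literal size, whereas pseudo-random bytes of the same length hardly compress.

```python
import random
import zlib

# Two byte strings of equal literal length (10,000 bytes each).
rng = random.Random(0)  # fixed seed so the run is repeatable
regular = b"01" * 5000  # strong regularity: one pattern repeated
irregular = bytes(rng.getrandbits(8) for _ in range(10000))  # no exploitable pattern

# Compressed size in bytes serves as a crude description length.
regular_size = len(zlib.compress(regular, 9))
irregular_size = len(zlib.compress(irregular, 9))

print("literal size:", len(regular), "bytes")
print("regular data compresses to:", regular_size, "bytes")
print("irregular data compresses to:", irregular_size, "bytes")
```

The compressed sizes depend on the compressor, so they are only upper bounds on how briefly the data could be described.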

Based on this, in 1978, Jorma Rissanen published an MDL learning algorithm using the statistical notion of information rather than algorithmic information. Over the past 40 years this has developed into a rich theory of statistical and machine learning procedures with connections to Bayesian model selection and averaging, and to penalization methods such as Lasso and Ridge; Grünwald and Roos (2020)^[6] give an introduction that includes the modern developments.

Rissanen started out with this idea: all statistical learning is about finding regularities in data, and the best hypothesis to describe the regularities in data is also the one that is able to statistically compress the data most. Like other statistical methods, MDL can be used for learning the parameters of a model from data. Standard statistical methods, however, usually assume that the general form of a model is fixed. MDL's main strength is that it can also be used for selecting the general form of a model as well as its parameters. The quantity of interest (sometimes just a model, sometimes just parameters, sometimes both at the same time) is called a hypothesis.

The basic idea is then to consider the (lossless) two-stage code that encodes data $D$ with length $L(D)$ by first encoding a hypothesis $H$ in the set of considered hypotheses ${\cal {H}}$ and then coding $D$ "with the help of" $H$; in the simplest context this just means "encoding the deviations of the data from the predictions made by $H$":

$$L(D)=\min _{H\in {\cal {H}}}\left(L(H)+L(D|H)\right)$$

The $H$ achieving this minimum is then viewed as the best explanation of data $D$. As a simple example, take a regression problem: the data $D$ could consist of a sequence of points $D=(x_{1},y_{1}),\ldots ,(x_{n},y_{n})$, and the set ${\cal {H}}$ could be the set of all polynomials from $X$ to $Y$. To describe a polynomial $H$ of degree (say) $k$, one would first have to discretize the parameters to some precision; one would then have to describe this precision (a natural number); next, one would have to describe the degree $k$ (another natural number); and in the final step, one would have to describe the $k+1$ parameters. The total length would be $L(H)$. One would then describe the points in $D$ using some fixed code for the x-values and a code for the $n$ deviations $y_{i}-H(x_{i})$.
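The regression example can be sketched in code. The following is a minimal illustration under several assumptions that are not part of the text above: the precision is fixed in advance (`PRECISION`) rather than encoded, integers are encoded with the Elias gamma code, the deviations are coded with a Gaussian-shaped code of known spread (`SIGMA`) at a fixed data resolution (`RESOLUTION`), and the cost of the x-values is omitted because it is the same for every hypothesis. All names and the synthetic data are hypothetical.

```python
import math
import random

SIGMA = 0.1        # assumed known spread of the deviations
RESOLUTION = 0.01  # assumed resolution at which y-values are recorded
PRECISION = 8      # assumed: coefficients rounded to multiples of 2**-8


def integer_code_bits(n):
    """Length in bits of the Elias gamma code for an integer n >= 1."""
    return 2 * (n.bit_length() - 1) + 1


def fit_polynomial(xs, ys, degree):
    """Least-squares coefficients (lowest order first) via the normal equations."""
    m = degree + 1
    rows = [[sum(x ** (i + j) for x in xs) for j in range(m)]
            + [sum(y * x ** i for x, y in zip(xs, ys))] for i in range(m)]
    for col in range(m):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(col, m), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(m):
            if r != col:
                factor = rows[r][col] / rows[col][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    return [rows[i][m] / rows[i][i] for i in range(m)]


def two_part_length(xs, ys, degree):
    """L(H) + L(D|H) in bits for the best-fitting polynomial of this degree."""
    scale = 2 ** PRECISION
    quantized = [round(c * scale) for c in fit_polynomial(xs, ys, degree)]
    # L(H): the degree, then a sign bit and a magnitude for each coefficient.
    model_bits = integer_code_bits(degree + 1)
    model_bits += sum(1 + integer_code_bits(abs(q) + 1) for q in quantized)
    # L(D|H): each deviation costs -log2(Gaussian density * RESOLUTION),
    # computed in the log domain to avoid underflow for poor fits.
    data_bits = 0.0
    for x, y in zip(xs, ys):
        prediction = sum((q / scale) * x ** i for i, q in enumerate(quantized))
        nats = ((y - prediction) ** 2 / (2 * SIGMA ** 2)
                + math.log(SIGMA * math.sqrt(2 * math.pi))
                - math.log(RESOLUTION))
        data_bits += nats / math.log(2)
    return model_bits + data_bits


# Synthetic data from a quadratic plus noise (illustrative only).
rng = random.Random(0)
xs = [-1 + 2 * i / 49 for i in range(50)]
ys = [1.5 - 2.5 * x + 3.0 * x ** 2 + rng.gauss(0, SIGMA) for x in xs]

lengths = {k: two_part_length(xs, ys, k) for k in range(7)}
best_degree = min(lengths, key=lengths.get)
for k, bits in lengths.items():
    print(f"degree {k}: {bits:.1f} bits")
print("selected degree:", best_degree)
```

Degrees that are too low pay heavily in the deviation part, while degrees that are too high pay for extra coefficients that barely shorten it; the minimum balances the two.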

In practice, one often (but not always) uses a probabilistic model. For example, one associates each polynomial $H$ with the corresponding conditional distribution expressing that given $X$, $Y$ is normally distributed with mean $H(X)$ and some variance $\sigma ^{2}$, which could either be fixed or added as a free parameter. Then the set of hypotheses ${\cal {H}}$ reduces to the assumption of a linear model, $Y=H(X)+\epsilon$, with $H$ a polynomial.
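Under this Gaussian assumption the data part of the code length can be written out explicitly. The following is a sketch that treats $\sigma ^{2}$ as fixed and ignores the constant contributed by discretizing the $y$-values:

$$
L(D|H)=-\log _{2}\prod _{i=1}^{n}{\frac {1}{\sqrt {2\pi \sigma ^{2}}}}\exp \left(-{\frac {(y_{i}-H(x_{i}))^{2}}{2\sigma ^{2}}}\right)={\frac {n}{2}}\log _{2}(2\pi \sigma ^{2})+{\frac {1}{2\sigma ^{2}\ln 2}}\sum _{i=1}^{n}(y_{i}-H(x_{i}))^{2}
$$

For a fixed degree, minimizing $L(D|H)$ is therefore the same as least-squares fitting; it is the term $L(H)$ that penalizes polynomials of higher degree.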

Furthermore, one is often not directly interested in specific parameter values, but just, for example, the degree of the polynomial. In that case, one sets ${\cal {H}}$ to be ${\cal {H}}=\{{\cal {H}}_{0},{\cal {H}}_{1},\ldots \}$ where each ${\cal {H}}_{j}$ represents the hypothesis that the data is best described as a j-th degree polynomial. One then codes data $D$ given hypothesis ${\cal {H}}_{j}$ using a one-part code designed such that, whenever some hypothesis $H\in {\cal {H}}_{j}$ fits the data well, the codelength $L(D|H)$ is short. The design of such codes is called universal coding. There are various types of universal codes one could use, often giving similar lengths for long data sequences but differing for short ones. The 'best' (in the sense that they have a minimax optimality property) are the normalized maximum likelihood (NML) or Shtarkov codes. A quite useful class of codes are the Bayesian marginal likelihood codes. For exponential families of distributions, when Jeffreys' prior is used and the parameter space is suitably restricted, these asymptotically coincide with the NML codes; this brings MDL theory in close contact with objective Bayes model selection, in which one also sometimes adopts Jeffreys' prior, albeit for different reasons. The MDL approach to model selection "gives a selection criterion formally identical to the BIC approach"^[7] for a large number of samples.
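The relation between the NML code and the Bayesian marginal likelihood code with Jeffreys' prior can be checked numerically on a small case. The sketch below assumes a Bernoulli model class for binary sequences instead of polynomials, chosen only because its NML normalizer can be summed exactly; the function names are hypothetical. It also compares the one-part code length of the Bernoulli class against a single fixed hypothesis (a fair coin, costing one bit per outcome).

```python
import math


def ml_bits(k, n):
    """-log2 of the maximized Bernoulli likelihood of a sequence with k ones."""
    bits = 0.0
    for count in (k, n - k):
        if count > 0:  # a zero count contributes a factor of 1
            bits -= count * math.log2(count / n)
    return bits


def nml_bits(k, n):
    """NML (Shtarkov) code length in bits of a sequence with k ones out of n."""
    # Normalizer: maximized likelihood summed over all 2**n sequences,
    # grouped by their number of ones j.
    complexity = sum(math.comb(n, j) * (j / n) ** j * ((n - j) / n) ** (n - j)
                     for j in range(n + 1))
    return ml_bits(k, n) + math.log2(complexity)


def jeffreys_bits(k, n):
    """Bayesian marginal likelihood code length under the Beta(1/2, 1/2) prior."""
    log_p = (math.lgamma(k + 0.5) + math.lgamma(n - k + 0.5)
             - math.lgamma(n + 1) - math.log(math.pi))
    return -log_p / math.log(2)


n = 100
fair_coin_bits = float(n)  # fixed hypothesis: one bit per outcome
for k in (30, 50):
    print(f"k={k}: NML {nml_bits(k, n):.2f} bits, "
          f"Jeffreys {jeffreys_bits(k, n):.2f} bits, "
          f"fair coin {fair_coin_bits:.2f} bits")
```

With 30 ones out of 100 the Bernoulli class gives the shorter code, while for a perfectly balanced sequence the fair-coin hypothesis does; in both cases the two universal codes differ by only a fraction of a bit.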

Statistical MDL Notation

Central to MDL theory is the one-to-one correspondence between code length functions and probability distributions (this follows from the Kraft–McMillan inequality). For any probability distribution $P$ , it is possible to construct a code $C$ such that the length (in bits) of $C(x)$ is equal to $-\log _{2}P(x)$ ; this code minimizes the expected code length. Conversely, given a code $C$ , one can construct a probability distribution $P$ such that the same holds. (Rounding issues are ignored here.) In other words, searching for an efficient code is equivalent to searching for a good probability distribution.
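The correspondence can be made concrete with a short sketch. It assumes Shannon code lengths, that is $-\log _{2}P(x)$ rounded up to a whole number of bits, which is one standard way to handle the rounding mentioned above; the function names and the example distributions are hypothetical.

```python
import math


def shannon_code_lengths(distribution):
    """Code length ceil(-log2 P(x)) in bits for every outcome with P(x) > 0."""
    return {x: math.ceil(-math.log2(p)) for x, p in distribution.items() if p > 0}


def kraft_sum(lengths):
    """Sum of 2**-L(x); a prefix code with these lengths exists iff this is <= 1."""
    return sum(2.0 ** -length for length in lengths.values())


def implied_distribution(lengths):
    """Converse direction: Q(x) = 2**-L(x), which sums to at most 1."""
    return {x: 2.0 ** -length for x, length in lengths.items()}


def expected_length(distribution, lengths):
    return sum(p * lengths[x] for x, p in distribution.items() if p > 0)


def entropy(distribution):
    return -sum(p * math.log2(p) for p in distribution.values() if p > 0)


dyadic = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
skewed = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}

for name, dist in (("dyadic", dyadic), ("skewed", skewed)):
    lengths = shannon_code_lengths(dist)
    print(name, lengths,
          "Kraft sum:", kraft_sum(lengths),
          "expected length:", round(expected_length(dist, lengths), 3),
          "entropy:", round(entropy(dist), 3))
```

When all probabilities are powers of two, no rounding is needed and the expected code length equals the entropy; otherwise rounding up costs less than one extra bit per symbol on average.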

Other systems

Rissanen's was not the first information-theoretic approach to learning; as early as 1968 Wallace and Boulton pioneered a related concept called minimum message length (MML). The difference between MDL and MML is a source of ongoing confusion. Superficially, the methods appear mostly equivalent, but there are some significant differences, especially in interpretation:

MML is a fully subjective Bayesian approach: it starts from the idea that one represents one's beliefs about the data-generating process in the form of a prior distribution. MDL avoids assumptions about the data-generating process.
Both methods make use of two-part codes: the first part always represents the information that one is trying to learn, such as the index of a model class (model selection) or parameter values (parameter estimation); the second part is an encoding of the data given the information in the first part. The difference between the methods is that, in the MDL literature, it is advocated that unwanted parameters should be moved to the second part of the code, where they can be represented with the data by using a so-called one-part code, which is often more efficient than a two-part code. In the original description of MML, all parameters are encoded in the first part, so all parameters are learned.
Within the MML framework, each parameter is stated to exactly the precision which results in the optimal overall message length. The case of an unwanted parameter, described above, might arise if some parameter was originally considered "possibly useful" to a model but was subsequently found to be unable to help explain the data; such a parameter will be assigned a code length corresponding to the (Bayesian) prior probability that the parameter would be found to be unhelpful. In the MDL framework, the focus is more on comparing model classes than models, and it is more natural to approach the same question by comparing the class of models that explicitly include such a parameter against some other class that does not. The difference lies in the machinery applied to reach the same conclusion.

This article uses material from the Wikipedia article Minimum_description_length, and is written by contributors. Text is available under a CC BY-SA 4.0 International License; additional terms may apply. Images, videos and audio are available under their respective licenses.

[1] Rissanen, J. (September 1978). "Modeling by shortest data description". Automatica. 14 (5): 465–471. doi:10.1016/0005-1098(78)90005-5.

[2] Zenil, Hector; Kiani, Narsis A.; Zea, Allan A.; Tegnér, Jesper (January 2019). "Causal deconvolution by algorithmic generative models". Nature Machine Intelligence. 1 (1): 58–66. doi:10.1038/s42256-018-0005-0. hdl:10754/630919. S2CID 86562557.

[3] "Remodelling machine learning: An AI that thinks like a scientist". Nature Machine Intelligence: 1. 28 January 2019. doi:10.1038/s42256-019-0026-3. S2CID 189929110.

[4] Archived at Ghostarchive and the Wayback Machine: "The Limits of Understanding". YouTube.

[5] Grünwald, Peter (June 2004). "A tutorial introduction to the minimum description length principle". arXiv:math/0406077. Bibcode:2004math......6077G.

[6] Grünwald, Peter; Roos, Teemu (2020). "Minimum Description Length Revisited". International Journal of Mathematics for Industry. 11 (1). doi:10.1142/S2661335219300018. hdl:10138/317252. S2CID 201314867.

[7] Hastie, Trevor; Tibshirani, Robert; Friedman, Jerome (2009). "Model Assessment and Selection". The Elements of Statistical Learning. Springer Series in Statistics. pp. 219–259. doi:10.1007/978-0-387-84858-7_7. ISBN 978-0-387-84857-0.

[8] MacKay, David J. C. (2003). Information Theory, Inference and Learning Algorithms. Cambridge University Press. ISBN 978-0-521-64298-9.

[9] Rissanen, Jorma. "Homepage of Jorma Rissanen". Archived from the original on 2015-12-10. Retrieved 2010-07-03.

[10] Rissanen, J. (2007). Information and Complexity in Statistical Modeling. Springer. Retrieved 2010-07-03.

[11] Nannen, Volker (May 2010). "A Short Introduction to Model Selection, Kolmogorov Complexity and Minimum Description Length (MDL)". arXiv:1005.2364. Bibcode:2010arXiv1005.2364N.
