Maximum-entropy Markov model

In statistics, a maximum-entropy Markov model (MEMM), or conditional Markov model (CMM), is a graphical model for sequence labeling that combines features of hidden Markov models (HMMs) and maximum entropy (MaxEnt) models. An MEMM is a discriminative model that extends a standard maximum entropy classifier by assuming that the unknown values to be learnt are connected in a Markov chain rather than being conditionally independent of each other. MEMMs find applications in natural language processing, specifically in part-of-speech tagging^[1] and information extraction.^[2]

Model

Suppose we have a sequence of observations $O_{1},\dots ,O_{n}$ that we seek to tag with the labels $S_{1},\dots ,S_{n}$ that maximize the conditional probability $P(S_{1},\dots ,S_{n}\mid O_{1},\dots ,O_{n})$. In an MEMM, this probability is factored into Markov transition probabilities, where the probability of transitioning to a particular label depends only on the observation at that position and the label at the previous position:

$$P(S_{1},\dots ,S_{n}\mid O_{1},\dots ,O_{n})=\prod _{t=1}^{n}P(S_{t}\mid S_{t-1},O_{t}).$$
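As a minimal sketch of this factorization, the following Python snippet scores one candidate label sequence by summing log transition probabilities. The function names, the `<START>` symbol standing in for $S_{0}$, and the hand-written probability table are illustrative assumptions, not part of the model definition.

```python
import math

def sequence_log_prob(labels, observations, transition, start="<START>"):
    """Return log P(S_1..S_n | O_1..O_n) = sum_t log P(S_t | S_{t-1}, O_t).

    `transition(prev_label, obs)` must return a dict mapping each label s
    to P(s | prev_label, obs). `start` is an assumed dummy label for S_0.
    """
    if len(labels) != len(observations):
        raise ValueError("labels and observations must have the same length")
    log_p = 0.0
    prev = start
    for s, o in zip(labels, observations):
        log_p += math.log(transition(prev, o)[s])
        prev = s
    return log_p

def toy_transition(prev_label, obs):
    """Hypothetical hand-written P(s | prev_label, obs); not a trained model."""
    if obs[0].isupper():
        return {"NOUN": 0.8, "VERB": 0.2}
    if prev_label == "NOUN":
        return {"NOUN": 0.3, "VERB": 0.7}
    return {"NOUN": 0.6, "VERB": 0.4}

# Probability of tagging "Dogs bark" as NOUN VERB under the toy table.
prob = math.exp(sequence_log_prob(["NOUN", "VERB"], ["Dogs", "bark"], toy_transition))
print(prob)  # 0.8 * 0.7, up to floating-point rounding
```

Working in log space avoids underflow when the product runs over long sequences.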

Each of these transition probabilities comes from the same general distribution $P(s\mid s',o)$. For each possible value of the previous label $s'$, the probability of a certain label $s$ is modeled in the same way as a maximum entropy classifier:^[3]

$$P(s\mid s',o)=P_{s'}(s\mid o)={\frac {1}{Z(o,s')}}\exp \left(\sum _{a}\lambda _{a}f_{a}(o,s)\right).$$

Here, the $f_{a}(o,s)$ are real-valued or categorical feature functions, and $Z(o,s')$ is a normalization term ensuring that the distribution sums to one. This form of the distribution corresponds to the maximum entropy probability distribution satisfying the constraint that the empirical expectation of each feature equals its expectation under the model:

$$\operatorname {E} _{e}\left[f_{a}(o,s)\right]=\operatorname {E} _{p}\left[f_{a}(o,s)\right]\quad {\text{ for all }}a.$$
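The exponential form of $P_{s'}(s\mid o)$ can be sketched in a few lines of Python. This is an illustration under stated assumptions: the two binary features, the label set, and the weight values are invented for the example, and keeping a separate weight list for each previous label is one reading of the $P_{s'}$ notation.

```python
import math

LABELS = ["NOUN", "VERB"]

# Hypothetical binary feature functions f_a(o, s).
def f_cap_noun(obs, s):
    return 1.0 if obs[0].isupper() and s == "NOUN" else 0.0

def f_lower_verb(obs, s):
    return 1.0 if obs[0].islower() and s == "VERB" else 0.0

FEATURES = [f_cap_noun, f_lower_verb]

def memm_transition(weights, prev_label, obs):
    """Return {s: P(s | prev_label, obs)} using the exponential form.

    weights[prev_label] holds one lambda_a per feature (assumed layout).
    """
    lam = weights[prev_label]
    scores = {s: sum(l * f(obs, s) for l, f in zip(lam, FEATURES)) for s in LABELS}
    m = max(scores.values())  # subtract the max score for numerical stability
    exps = {s: math.exp(v - m) for s, v in scores.items()}
    z = sum(exps.values())    # normalizer Z(o, s'), up to the factor exp(m)
    return {s: e / z for s, e in exps.items()}

# Made-up weights, one list per previous label.
weights = {"<START>": [1.0, 0.5], "NOUN": [0.2, 1.5], "VERB": [0.7, 0.1]}
p = memm_transition(weights, "<START>", "Dogs")
print(p)
```

For the observation "Dogs" only the first feature fires, and only for NOUN, so NOUN scores 1.0 against 0.0 for VERB and the distribution is $e/(e+1)$ versus $1/(e+1)$.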

The parameters $\lambda _{a}$ can be estimated using generalized iterative scaling.^[4] Furthermore, a variant of the Baum–Welch algorithm, which is used for training HMMs, can be used to estimate parameters when training data has incomplete or missing labels.^[2]
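A rough sketch of generalized iterative scaling for a single previous label $s'$ might look as follows. It assumes non-negative features, adds a slack feature so that the feature values sum to a constant $C$ for every (observation, label) pair, and uses an invented six-example data set. It simply skips features whose expectation is zero, which a careful implementation would handle explicitly.

```python
import math

def gis_train(data, labels, features, iterations=50):
    """Fit lambda_a on (obs, label) pairs that share one previous label.

    Returns (lam, predict), where predict(obs) lists P(s | obs) in the
    order of `labels`. Features must be non-negative.
    """
    # GIS needs sum_a f_a(o, s) == C everywhere; a slack feature enforces it.
    C = max(sum(f(o, s) for f in features) for o, _ in data for s in labels)

    def slack(o, s):
        return C - sum(f(o, s) for f in features)

    feats = list(features) + [slack]
    n = len(data)
    # Empirical expectation of each feature over the training pairs.
    empirical = [sum(f(o, s) for o, s in data) / n for f in feats]
    lam = [0.0] * len(feats)

    def predict(o):
        scores = [sum(l * f(o, s) for l, f in zip(lam, feats)) for s in labels]
        m = max(scores)
        exps = [math.exp(v - m) for v in scores]
        z = sum(exps)
        return [e / z for e in exps]

    for _ in range(iterations):
        # Expectation of each feature under the current model.
        model = [0.0] * len(feats)
        for o, _ in data:
            for prob, s in zip(predict(o), labels):
                for a, f in enumerate(feats):
                    model[a] += prob * f(o, s) / n
        for a in range(len(feats)):
            if empirical[a] > 0 and model[a] > 0:  # simplification: skip zeros
                lam[a] += math.log(empirical[a] / model[a]) / C
    return lam, predict

# Hypothetical features and toy data, for illustration only.
def f_cap_noun(o, s):
    return 1.0 if o[0].isupper() and s == "NOUN" else 0.0

def f_lower_verb(o, s):
    return 1.0 if o[0].islower() and s == "VERB" else 0.0

labels = ["NOUN", "VERB"]
data = [("Dogs", "NOUN"), ("Cats", "NOUN"), ("Run", "VERB"),
        ("bark", "VERB"), ("run", "VERB"), ("walk", "NOUN")]
lam, predict = gis_train(data, labels, [f_cap_noun, f_lower_verb])
print(predict("Dogs"))  # P(NOUN) comes out near 2/3, matching the data
```

In this toy data two of the three capitalized words are nouns, so matching the model expectation to the empirical one forces $P(\text{NOUN}\mid \text{capitalized})$ to 2/3.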

The optimal state sequence $S_{1},\dots ,S_{n}$ can be found using a Viterbi algorithm very similar to the one used for HMMs. The dynamic program uses the forward probability, with the sum replaced by a maximum over $s'$ when decoding the single best sequence:

$$\alpha _{t+1}(s)=\sum _{s'\in S}\alpha _{t}(s')P_{s'}(s\mid o_{t+1}).$$
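The recursion can be sketched in Python alongside its Viterbi counterpart, which takes a maximum over previous labels and records back-pointers. The `<START>` symbol, the function names, and the hand-written transition table are assumptions made for the example.

```python
def forward(observations, labels, transition, start="<START>"):
    """alpha_{t+1}(s) = sum_{s'} alpha_t(s') * P_{s'}(s | o_{t+1})."""
    if not observations:
        raise ValueError("need at least one observation")
    first = transition(start, observations[0])
    alpha = {s: first[s] for s in labels}
    for o in observations[1:]:
        alpha = {s: sum(alpha[sp] * transition(sp, o)[s] for sp in labels)
                 for s in labels}
    return alpha

def viterbi(observations, labels, transition, start="<START>"):
    """Return (best label sequence, its probability) using max instead of sum."""
    if not observations:
        raise ValueError("need at least one observation")
    first = transition(start, observations[0])
    delta = {s: first[s] for s in labels}
    backpointers = []
    for o in observations[1:]:
        new_delta, pointers = {}, {}
        for s in labels:
            best_prev = max(labels, key=lambda sp: delta[sp] * transition(sp, o)[s])
            new_delta[s] = delta[best_prev] * transition(best_prev, o)[s]
            pointers[s] = best_prev
        delta = new_delta
        backpointers.append(pointers)
    last = max(labels, key=lambda s: delta[s])
    path = [last]
    for pointers in reversed(backpointers):  # follow back-pointers to the start
        path.append(pointers[path[-1]])
    path.reverse()
    return path, delta[last]

def toy_transition(prev_label, obs):
    """Hypothetical hand-written P(s | prev_label, obs); not a trained model."""
    if obs[0].isupper():
        return {"NOUN": 0.8, "VERB": 0.2}
    if prev_label == "NOUN":
        return {"NOUN": 0.3, "VERB": 0.7}
    return {"NOUN": 0.6, "VERB": 0.4}

labels = ["NOUN", "VERB"]
alpha = forward(["Dogs", "bark"], labels, toy_transition)
path, best = viterbi(["Dogs", "bark"], labels, toy_transition)
print(alpha, path, best)
```

Because every transition distribution is normalized on its own, the forward values sum to one at each position. In this toy run they are about 0.36 and 0.64, and the best path is NOUN, VERB with probability about 0.56.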

This article uses material from the Wikipedia article Maximum-entropy_Markov_model, and is written by contributors. Text is available under a CC BY-SA 4.0 International License; additional terms may apply. Images, videos and audio are available under their respective licenses.

References

[1] Toutanova, Kristina; Manning, Christopher D. (2000). "Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech Tagger". Proc. J. SIGDAT Conf. on Empirical Methods in NLP and Very Large Corpora (EMNLP/VLC-2000). pp. 63–70.

[2] McCallum, Andrew; Freitag, Dayne; Pereira, Fernando (2000). "Maximum Entropy Markov Models for Information Extraction and Segmentation" (PDF). Proc. ICML 2000. pp. 591–598.

[3] Berger, A.L.; Della Pietra, V.J.; Della Pietra, S.A. (1996). "A maximum entropy approach to natural language processing". Computational Linguistics. 22 (1). MIT Press: 39–71.

[4] Darroch, J.N.; Ratcliff, D. (1972). "Generalized iterative scaling for log-linear models". The Annals of Mathematical Statistics. 43 (5). Institute of Mathematical Statistics: 1470–1480. doi:10.1214/aoms/1177692379.

[5] Lafferty, John; McCallum, Andrew; Pereira, Fernando (2001). "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data". Proc. ICML 2001.

[6] Bottou, Léon (1991). Une Approche théorique de l'Apprentissage Connexionniste: Applications à la Reconnaissance de la Parole (Ph.D.). Université de Paris XI.
