Baum-Welch documentation


Training is performed using the Train function. It takes as arguments a FAR or WFST representing the plaintext, and channel model and a FAR representing the ciphertext.

Normalization strategies

Training takes a template argument argument representing the desired expectation table type. Two types are available. The state expectation table normalizes so that all arcs leaving a state sum to semiring one. The state/input-label expectation table normalizes so that the weights of all arcs leaving a state with the same input label sum to semiring one; this is arguably the most intuitive normalization strategy and so it used as the default.

Semirings and count collection strategies

In a semiring without the _ path property_, like the log semiring, the E-step sums over all state sequences and thus performs true Baum-Welch training. In the case the input is a FST language model with backoffs, attach a $$\phi$$-matcher (fstspecial --fst_type=phi) to ensure proper interpretation.

In a semiring with the path property, like the tropical semiring, the E-step sums over only the most likely state sequence, an approximation known as Viterbi training (Brown et al. 1993:293). In the case the input is an FST language model with backoffs, backoffs can be encoded exactly using a $$\phi$$-matcher or approximated by $$\epsilon$$ using the default matcher. In practice, Baum-Welch training is slightly slower than Viterbi training, though somewhat more accurate (Jansche 2003).


Train takes an struct representing options for training. These define the size of the batch, the maximum number of iterations for training, the learning rate $$\alpha$$, and the comparison/quantization $$\delta$$ used to detect convergence.

Since expectation maximization training is only locally optimal, _random (re)starts_ can be used to avoid local minima. For particularly difficult problems, it may be beneficial to perform a very large number of random starts (e.g., Berg-Kilpatrick and Klein 2013). In this library, we randomly initialize weights by sampling from the uniform distribution $$(\texttt{kDelta}, 1]$$ in the real numbers and then mapping these values onto the appropriate values in the desired semiring. Randomize can be used to generate random restarts.

Stepwise interpolation

The actual training method uses a form of stepwise interpolation due to Liang and Klein (2009). At time $$k$$ a weight $$W_k$$ is given by

$$W_k = (1 - \nu_k) W_{k - 1} + \nu_k M_k$$

where $$\nu_k = (k - 2)^{-\alpha}$$, $$\alpha$$ is a fixed hyperparameter controlling exponential decay of $$\nu_k$$, and $$M_k$$ is the maximized expectation at time $$k$$.


`Decode` is the primary driver for decoding. Decoding with the trained model is essentially a process of constructing a weighted lattice followed by a shortest path computation.

In a semiring with the path property, decoding is simple:

  1. Compose the input, channel model, and output, and materialize the cascade.
  2. Compute the shortest path using the Viterbi algorithm.
  3. Input-project (for decipherment only), then remove epsilons and weights.

In a semiring without the path property, exact decoding is much more challenging, and in fact we do not have an algorithm for the pair case since we normally need to preserve the input-output correspondences. In the decipherment case, it is possible to perform with a DFA lattice, but determinizing the entire lattice is rarely feasible for interesting problems. Therefore we use the following strategy:

  1. Compose the plaintext model, channel model, and ciphertext, and materialize the cascade, then input-project and remove epsilons to produce an acyclic, epsilon-free non-deterministic weighted acceptor (henceforth, the NFA).
  2. Compute $$\beta_n$$, the costs to the future in the NFA.
  3. Compute, on the fly, a deterministic acceptor (henceforth, the DFA); expansion of the DFA converts the NFA future costs $$\beta_n$$ to the DFA future costs $$\beta_d$$.
  4. Compute, on the fly, a mapping of the DFA to a semiring with the path property by converting the weights to tropical.
  5. Compute the shortest path of the on-the-fly tropical-semiring DFA using A*search (Hart et al. 1968) with $$\beta_d$$ as a heuristic, halting immediately after the shortest path is found.
  6. Convert the shortest path back to the input semiring and remove weights.


Pair modeling

In the pair scenario, we observe a set of paired input/output strings $$(i, o)$$ where $$i \in {I}^{*}$$ and $$o \in {O}^{*}$$ and wish to construct a conditional model. This is useful for many monotonic string-to-string transduction problems, such as grapheme-to-phoneme conversion or abbreviation expansion.

Probabilistic formulation

We would like to estimate a conditional probability model over input-output pairs, henceforth $$\textrm{P}(o \mid i)$$.

Finite-state formulation

We first construct FARs containing all $$i$$ and all $$o$$ pairs, respectively.

We then construct a model of the alignment $$ \Gamma \subseteq {I}^{*} \times {O}^{*}$$. This serves as the initial (usually uniform) model for $$(i, o)$$; we henceforth refer to it as the channel model.

Given an estimate for the channel model, $$\hat{\Gamma}$$, we can compute the best alignment by decoding

$$\mathrm{ShortestPath}\left[i \circ \hat{\Gamma} \circ o\right]$$

where $$\textrm{ShortestPath}$$ represents the Viterbi algorithm.

Pair language modeling

Once the alignment model $$\hat{\Gamma}$$ is estimated, we can use it to construct a so-called pair language model (Novak et al. 2012, 2016). This is much like a classic language model but the actual symbols represent pairs in $${I} \times {O}$$; unlike $$\hat{\Gamma}$$, it is a joint model. To construct such a model:

  1. Compute the best $$i, o$$ alignments by decoding with $$\hat{\Gamma}$$.
  2. Rewrite the FSTs in the decoded alignments FAR as acceptors over a pair symbol vocabulary $${I} \times {O}$$.
  3. Compute a higher-order language model (e.g., using the NGram tools), with standard smoothing and shrinking options, producing a weighted finite-state acceptor.
  4. Convert the weighted finite-state acceptor to a finite-state transducer by rewriting the "acceptor" $${I} \times {O}$$ arcs with the corresponding "transducer" $${I} : {O}$$ arcs.

Then the pair language model can be applied using composition and decoded using the Viterbi algorithm.


In the decipherment scenario, we observe a corpus of ciphertext $$c \in {C}^{*}$$. We imagine that there exists a corpus of plaintext $$p \in {P}^{*}$$ which was used to generate the ciphertext $$c$$. Our goal is to decode the ciphertext; i.e., to uncover the corresponding plaintext. We assume that the ciphertext $$c$$ has been generated (i.e., encoded) by $$\pi_o\left(p \circ \gamma\right)$$ where $$\pi_o$$ is the output projection operator and $$\gamma$$ is the encoder (i.e., inverse key). We further assume that the ciphertext can be recovered (i.e., decoded) using $$\pi_i\left(\gamma \circ c\right)$$ where $$\pi_i$$ is the input projection operator. However, we do not observe either $$\gamma$$, and must instead estimate it from data.

Probabilistic formulation

We would like to estimate a conditional probability model which predicts the plaintext $$p$$ given $$c$$; i.e., $$\textrm{P}(p \mid c)$$. By Bayes' rule

$$\textrm{P}(p \mid c) \propto \textrm{P}(p) \textrm{P}(c \mid p).$$

That is; the conditional probability of the plaintext given observed ciphertext is proportional to the product of the probability of the plaintext and the conditional probability of the ciphertext given the plaintext; for any corpus of ciphertext the denominator $$\textrm{P}(c)$$ is constant and thus can be ignored.

We then express decoding as

$$\hat{p} = \arg\max_p \textrm{P}(p) \textrm{P}(c \mid p).$$

Finite-state formulation

We first construct a probabilistic model $$\Lambda \subseteq {P}^{*}$$, a superset of the true plaintext $$p$$ and a model of $$(p)$$. We henceforth refer to this as the language model.

We then construct a model of the inverse keyspace, $$ \Gamma \subseteq {P}^{*} \times {C}^{*}$$. This is a superset of the true encoder relation $$\gamma$$ and serves as an initial (usually uniform) model for $$(c \mid p)$$; we henceforth refer to it as the channel model.

Given an estimate for the channel model, $$\hat{\Gamma}$$, we can express decoding as

$$\mathrm{ShortestPath}\left[\pi_i\left(\Lambda \circ \hat{\Gamma} \circ c\right)\right]$$

where $$\textrm{ShortestPath}$$ represents the Viterbi algorithm.

Sparsity penalties

Knight et al. (2006) suggest maximizing

$$\hat{p} = \arg\max_p \textrm{P}(p) \textrm{P}(c \mid p)^k$$

where $$k$$ is some real number $$> 1$$ during decoding. This strategy increases the sparsity of the hypotheses generated by the trained model. This penalty can be applied using fstmap --map_type=power --power=$K before decoding, if desired.


Berg-Kilpatrick, T., and Klein, D. 2013. Decipherment with a million random restarts. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 874-878.

Brown, P., Della Pietra, S. A., Della Pietra, V. J., and Mercer, R. L. 1993. The mathematics of statistical machine translation: parameter estimation. Computational Linguistics 19(2): 263-311.

Dempster, A. P., Laird, N. M., and Rubin, D. B. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1): 1-38.

Hart, P. E., Nilsson, N. J., Raphael, B. 1968. A formal basis for the heuristic determination of minimal cost paths. IEEE Transactions on Systems Science and Cybernetics 4(2): 100-107.

Jansche, M. 2003. Inference of string mappings for language technology. Doctoral dissertation, Ohio State University.

Knight, K., Nair, A., Rashod, N., and Yamada, K. 2006. Unsupervised analysis for decipherment problems. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 499-506.

Liang, P., and Klein, D. 2009. Online EM for unsupervised models. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 611-619.

Novak, J. R., Minematsu, N., and Hirose, K. 2012. WFST-based grapheme-to-phoneme conversion: Open source tools for alignment, model-building and decoding. In Proceedings of the 10th International Workshop on Finite State Methods and Natural Language Processing, pages 45-49.

Novak, J. R., Minematsu, N., and Hirose, K. 2016. Phonetisaurus: exploring grapheme-to-phoneme conversion with joint n-gram models in the WFST framework. Natural Language Engineering 22(6): 907-938.

Edit | Attach | Watch | Print version | History: r4 < r3 < r2 < r1 | Backlinks | Raw View | WYSIWYG | More topic actions
Topic revision: r4 - 2022-04-06 - KyleGorman
This site is powered by the TWiki collaboration platform Powered by PerlCopyright © 2008-2024 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback