Baum-Welch documentation
Training
Training is performed using the
TrainBaumWelch
function. It takes as arguments a FAR or WFST representing the plaintext, and channel model and a FAR representing the ciphertext.
Normalization strategies
Training takes a template argument argument representing the desired expectation table type. Two types are available. The state expectation table normalizes so that all arcs leaving a state sum to semiring one. The state/input-label expectation table normalizes so that the weights of all arcs leaving a state with the same input label sum to semiring one; this is arguably the most intuitive normalization strategy and so it used as the default.
Semirings and count collection strategies
Training function is controlled in part by the arc type of the inputs.
In a semiring without the
path property, like the log semiring, the E-step sums over all state sequences and thus performs true Baum-Welch training. In the case the input is a FST language model with backoffs, attach a $\phi$-matcher (
fstspecial --fst_type phi
) to ensure proper interpretation.
In a semiring with the path property, like the tropical semiring, the E-step sums over only the most likely state sequence, an approximation known as
Viterbi training (Brown et al. 1993:293). In the case the input is an FST language model with backoffs, backoffs can be encoded exactly using a $\phi$-matcher or approximated by $\epsilon$ using the default matcher. In practice, Baum-Welch training is slightly slower than Viterbi training, though somewhat more accurate (Jansche 2003).
Options
TrainBaumWelch
takes an struct representing options for training. These define the size of the batch, the maximum number of iterations for training, the learning rate $$\alpha$$, and the comparison/quantization $$\delta$$ used to detect convergence.
Since expectation maximization training is only locally optimal,
random (re)starts can be used to avoid local minima. For particularly difficult problems, it may be beneficial to perform a very large number of random starts (e.g., Berg-Kilpatrick and Klein 2013). In this library, we randomly initialize weights by sampling from the uniform distribution $(\delta, 1]$ in the real numbers and then mapping these values onto the appropriate values in the desired semiring.
RandomizeBaumWelch
can be used to generate random restarts.
References
Berg-Kilpatrick, T., and Klein, D. 2013. Decipherment with a million random
restarts. In _Proceedings of the 2013 Conference on Empirical Methods in Natural
Language Processing_, pages 874-878. Seattle.
Brown, P., Della Pietra, S. A., Della Pietra, V. J., and Mercer, R. L. 1993. The
mathematics of statistical machine translation: parameter estimation.
Computational Linguistics 19(2): 263-311.
Dempster, A. P., Laird, N. M., and Rubin, D. B. 1977. Maximum likelihood from
incomplete data via the EM algorithm. _Journal of the Royal Statistical Society,
Series B_, 39(1): 1-38.
Hart, P. E., Nilsson, N. J., Raphael, B. 1968. A formal basis for the heuristic
determination of minimal cost paths. _IEEE Transactions on Systems Science and
Cybernetics_ 4(2): 100-107.
Jansche, M. 2003. Inference of string mappings for language technology. Doctoral
dissertation, Ohio State University.
Knight, K., Nair, A., Rashod, N., and Yamada, K. 2006. Unsupervised analysis for
decipherment problems. _Proceedings of the COLING/ACL 2006 Main Conference
Poster Sessions_, pages 499-506. Sydney.
Liang, P., and Klein, D. 2009. Online EM for unsupervised models. In
_Proceedings of Human Language Technologies: The 2009 Annual Conference of the
North American Chapter of the Association for Computational Linguistics_, pages
611-619. Boulder.
Novak, J. R., Minematsu, N., and Hirose, K. 2012. WFST-based grapheme-to-phoneme
conversion: Open source tools for alignment, model-building and decoding.
_Proceedings of the 10th International Workshop on Finite State Methods and
Natural Language Processing_, pages 45-49. Donostia–San Sebastián.
Novak, J. R., Minematsu, N., and Hirose, K. 2016. Phonetisaurus: exploring
grapheme-to-phoneme conversion with joint n-gram models in the WFST framework.
Natural Language Engineering 22(6): 907-938.