NGram Model Format

The following gives the encoding of all n-gram models produced by the utilities here, including those with unnormalized counts, as a cyclic weighted finite-state transducer in OpenFst format.

An n-gram is a sequence of k symbols: w1 ... wk. Let N be the set of n-grams in the model.

  • There is a unigram state in every model, representing the empty string.
  • Every proper prefix of every n-gram in N has an associated state in the model.
  • The state associated with an n-gram w1 ... wk has a backoff transition (labeled with <epsilon>) to the state associated with its suffix w2 ... wk.
  • An n-gram w1 ... wk is represented as a transition, labeled with wk, from the state associated with its prefix w1 ... wk-1 to a destination state defined as follows:
    • If w1 ... wk is a proper prefix of an n-gram in the model, then the destination of the transition is the state associated with w1 ... wk
    • Otherwise, the destination of the transition is the state associated with the suffix w2 ... wk.
  • Start and end of the sequence are not represented via transitions in the automaton or symbols in the symbol table. Rather
    • The start state of the automaton encodes the "start of sequence" n-gram prefix (commonly denoted <s>).
    • The end of the sequence (often denoted </s>) is included in the model through state final weights, i.e., for a state associated with an n-gram prefix w1 ... wk, the final weight of that state represents the weight of the n-gram w1 ... wk </s>.
Edit | Attach | Watch | Print version | History: r3 < r2 < r1 | Backlinks | Raw View | WYSIWYG | More topic actions
Topic revision: r3 - 2012-03-04 - BrianRoark
This site is powered by the TWiki collaboration platform Powered by PerlCopyright © 2008-2024 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback