Description
This utility counts n-grams from an input FST archive (FAR file). It produces a count FST with the same topology as the eventual normalized model, complete with backoff transitions. The option --order specifies the maximum n-gram order to count; the utility counts all n-grams of that order and below. The option --epsilon_as_backoff causes the counter to interpret <epsilon> as a backoff transition while counting, which is only appropriate in very specialized circumstances (see Caveats below).
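To make "all n-gram orders less than or equal to the maximum order" concrete, here is a minimal standalone sketch that counts every n-gram of length 1 up to max_order in one tokenized sentence. It is not the OpenGrm implementation: CountNGrams and NGram are hypothetical names introduced for illustration, the counts are kept in a std::map rather than an FST, and sentence-boundary contexts and backoff transitions are ignored.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical type alias, not part of the OpenGrm API.
using NGram = std::vector<std::string>;

// Hypothetical helper: counts every n-gram of length 1..max_order that
// occurs in a single tokenized sentence. Sentence-boundary contexts and
// backoff structure are deliberately left out of this sketch.
std::map<NGram, int> CountNGrams(const std::vector<std::string> &tokens,
                                 std::size_t max_order) {
  std::map<NGram, int> counts;
  for (std::size_t start = 0; start < tokens.size(); ++start) {
    NGram ngram;
    // Extend the n-gram one token at a time, up to max_order tokens.
    for (std::size_t len = 1;
         len <= max_order && start + len <= tokens.size(); ++len) {
      ngram.push_back(tokens[start + len - 1]);
      ++counts[ngram];
    }
  }
  return counts;
}
```

With max_order set to 3, the tokens a b a b yield unigram, bigram and trigram counts in one pass, mirroring the default --order of 3.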
Usage
ngramcount [--options] [in.far [out.fst]]
--order: type = int64, default = 3
--epsilon_as_backoff: type = bool, default = false
class NGramCounter(size_t order);
In addition to the simple C++ usage above, optional arguments permit the passing of non-default values for various parameters similar to the command-line version.
Examples
The default counts trigrams, bigrams and unigrams from an input corpus:
ngramcount earnest.far >earnest.3g.cnts
To count trigrams, bigrams and unigrams from a single FST using the library functions:
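NGramCounter<Log64Weight> ngram_counter(3);
// Read the input FST; Read() returns a null pointer on failure.
StdMutableFst *in_fst = StdMutableFst::Read("in.fst", true);
ngram_counter.Count(*in_fst);
delete in_fst;
// Retrieve the accumulated counts as an FST and write it out.
VectorFst<StdArc> out_fst;
ngram_counter.GetFst(&out_fst);
out_fst.Write("out.fst");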
Caveats
Backoff transitions, labeled with <epsilon>, have weight One() in the semiring. By default, the count FSTs are in the tropical semiring, so backoff transitions have weight 0 and n-gram transitions have weight -log(count).
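The weight convention above can be sketched in a few lines of standalone C++. The names CountToWeight, WeightToCount and kBackoffWeight are hypothetical helpers introduced here for illustration and are not OpenGrm or OpenFst functions; the sketch only assumes what the text states, namely that an n-gram arc stores -log(count) and that One() in the tropical semiring is 0.

```cpp
#include <cmath>

// Hypothetical helper: the weight stored on an n-gram transition for a
// raw count, following the -log(count) convention described in the text.
double CountToWeight(double count) { return -std::log(count); }

// Hypothetical helper: recovers the raw count from a stored weight.
double WeightToCount(double weight) { return std::exp(-weight); }

// One() in the tropical semiring, i.e. the weight on a backoff transition.
constexpr double kBackoffWeight = 0.0;
```

One consequence worth keeping in mind when reading a count FST: a weight of 0 on an n-gram transition corresponds to a count of 1, while the same value on an <epsilon> transition is simply the backoff weight One() and not a count.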
The --epsilon_as_backoff switch interprets <epsilon> in the input FST archive as a backoff transition. This is only appropriate when the corpus is randomly sampled from a model and shows where backoff transitions were taken. It allows for the use of the presmoothed method in ngrammake. This is not a typical scenario, so the option should be used with care.
References