Description
Command line utility to calculate the perplexity of a corpus given a model. Verbose mode gives the per-word contribution to the perplexity.
Out-of-vocabulary (OOV) items can be handled in several ways. If the model already contains an OOV token, and that token therefore has probability mass, it can be specified with the switch --OOV_symbol. Every symbol not found in the vocabulary is then mapped to that symbol. If the model has no OOV symbol with allocated probability mass, the option --OOV_probability allows unigram probability mass to be allocated to the class of OOVs. Note that any OOV symbol represents a class of words. To assign an appropriate probability to any given instance, the class probability should be shared among the members of the class. This requires an OOV class size, set with --OOV_class_size, which defaults to 10000.
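The sharing of class probability among OOV instances can be illustrated with a small Python sketch. It assumes the class mass is split uniformly, that is, each OOV instance receives OOV_probability / OOV_class_size. The function name oov_instance_probability is hypothetical and is not part of the tool, and the tool's internal computation may differ in detail.

```python
def oov_instance_probability(oov_probability: float,
                             oov_class_size: float = 10000.0) -> float:
    """Probability given to a single OOV word.

    Assumption: the class mass is shared uniformly among
    oov_class_size members (hypothetical helper, not the tool's API).
    """
    if oov_class_size <= 0:
        raise ValueError("OOV class size must be positive")
    if not 0.0 <= oov_probability <= 1.0:
        raise ValueError("OOV probability must lie in [0, 1]")
    return oov_probability / oov_class_size


if __name__ == "__main__":
    # With 1% of unigram mass reserved for OOVs and the default class size,
    # each individual OOV word gets a very small share of that mass.
    print(oov_instance_probability(0.01))
```

Under this assumption, a larger class size lowers the probability of each OOV instance and therefore raises the reported perplexity on text that contains many OOVs.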
Usage
ngramperplexity [--options] ngram.fst [in.far [out.txt]]
--OOV_symbol: type = string, default = ""
--OOV_class_size: type = double, default = 10000
--OOV_probability: type = double, default = 0
Examples
$ ngramperplexity earnest.aa.mod earnest.ab.far
Caveats
If no --OOV_symbol is specified and --OOV_probability is zero, any OOVs encountered are ignored in the perplexity calculation, since they would otherwise receive zero probability under these settings.
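To see why ignoring zero-probability OOVs matters, here is a minimal Python sketch using the textbook definition of perplexity (the exponential of the average negative log probability per scored token). It is an illustration only: the function name perplexity_ignoring_zeros is hypothetical, sentence-boundary handling is omitted, and the tool's exact normalization is not described in this page.

```python
import math


def perplexity_ignoring_zeros(token_probs):
    """Return (perplexity, number_of_skipped_tokens).

    Tokens with zero probability (for example, unmodelled OOVs) are
    dropped from both the log-probability sum and the token count.
    Illustrative helper only, not the tool's implementation.
    """
    scored = [p for p in token_probs if p > 0.0]
    skipped = len(token_probs) - len(scored)
    if not scored:
        raise ValueError("no tokens with non-zero probability")
    avg_neg_log = -sum(math.log(p) for p in scored) / len(scored)
    return math.exp(avg_neg_log), skipped


if __name__ == "__main__":
    # Two in-vocabulary tokens and one OOV with zero probability.
    ppl, skipped = perplexity_ignoring_zeros([0.25, 0.25, 0.0])
    print(ppl, skipped)
```

Because skipped tokens contribute nothing to the total, a corpus with many ignored OOVs can show a lower perplexity than a comparable run in which OOVs are assigned probability mass, so results from the two settings are not directly comparable.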