PyniniDocs < GRM

GRM Web>Pynini>PyniniDocs (2022-07-10, KyleGorman) (raw view)
---+ Pynini documentation
<verbatim>

        ,ggggggggggg,
       dP"""88""""""Y8,
       Yb,  88      `8b
        `"  88      ,8P                        gg                  gg
            88aaaad8P"                         ""                  ""
            88"""""  gg     gg   ,ggg,,ggg,    gg    ,ggg,,ggg,    gg
            88       I8     8I  ,8" "8P" "8,   88   ,8" "8P" "8,   88
            88       I8,   ,8I  I8   8I   8I   88   I8   8I   8I   88
            88      ,d8b, ,d8I ,dP   8I   Yb,_,88,_,dP   8I   Yb,_,88,_
            88      P""Y88P"8888P'   8I   `Y88P""Y88P'   8I   `Y88P""Y8
                          ,d8I'
                        ,dP'8I
                       ,8"  8I
                       I8   8I
                       `8, ,8I
                        `Y8P"

</verbatim>

---++ Introduction

Pynini (Gorman 2016, Gorman & Sproat 2021) is a library for compiling a grammar of strings, regular expressions, and context-dependent rewrite rules into _weighted finite-state transducers_.

---+++ Design

Pynini supports much of the functionality of [[Thrax]], but whereas Thrax is essentially a compiler for a domain-specific language, Pynini is implemented as a Python extension module. This provides several advantages:

   * Pynini users can exploit Python's rich tooling ecosystem, including logging and unit testing frameworks.
   * Pynini users can incorporate Thrax primitives like string compilation into executables.

Pynini is based on the [[http://openfst.org][OpenFst library]], a C++17 template library for weighted finite-state transducers, and particularly its [[http://python.openfst.org][Python extension]].

---+++ Outline

This document covers the following areas:

   1 [[#FormalDoc][formal preliminaries underlying finite-state transducers]],
   1 [[#IntroDoc][getting started with Pynini]],
   1 [[#ExamplesDoc][examples of Pynini in action]],
   1 [[#ApiDoc][API reference]], and
   1 [[#AdvancedDoc][advanced topics]].

Those who are familiar with FSTs may prefer to jump straight to [[#IntroDoc][getting started with Pynini]]. Or, for a quick start, you may also wish to read our
[[https://www.oreilly.com/ideas/how-to-get-superior-text-processing-in-python-with-pynini][Pynini tutorial for O'Reilly]].

#FormalDoc
---++ Formal preliminaries

_Finite-state transducers_ (FSTs) are models of computation widely used for speech and language processing:

   * *Speech recognition*: language models, the pronunciation dictionaries, and the lattices produced by the acoustic model are all represented as FSTs; combined, these define the hypothesis space for the recognizer.
   * *Speech synthesis*: FSTs are used for _text normalization_, the phase of linguistic analysis that converts written text like "meet me at St. Catherine on Bergen St. @ 9:30" to a pronounceable utterance like "meet me at the Saint Catherine on Bergen Street at nine thirty".
   * *Natural language processing*: FSTs are used to represent sequence models such as those used for part of speech tagging, noun chunking, and named entity recognition.
   * *Information retrieval*: finite automata are used to detect dates, times, etc., in web text.

FSTs are less "expressive" than e.g., arbitrary C++ functions or neural networks. However, the formal constraints that limit their expressivity guarantee that FSTs can be efficiently combined, optimized, and searched.

---+++ State machines

Finite automata are simply _state machines_ with a finite number of states. And, a state machine is any devices whose behavior can be modeled using transitions between a number of states. While our finite automata are implemented in software, there are also hardware state machines, like the traditional gumball machine (image credit: Wikimedia Commons):

https://upload.wikimedia.org/wikipedia/commons/thumb/c/cd/Kosher_gumballs.JPG/360px-Kosher_gumballs.JPG

A working gumball machine can be in one of two states: either its coin slot does, or does not, contain a coin. Initially the gumball machine has an empty coin slot; and turning the knob has no effect on the state of the machine. But once a coin is inserted into the coin slot, turning the knob dispenses a gumball and empties the coin slot. If the knob is turned while the coin slot is vacant, no change of state occurs. However, if the knob is turned while the coin slot is occupied, a gumball is dispensed and the coin slot is cleared. We can represent this schematic description of a gumball machine as a directed graph in which the states are represented by circles and the transitions between states—henceforth, <i>arcs</i>—are represented by arrows between states. Each arc has a pair of labels representing the input consumed, and output emitted, respectively, when that arc is traversed. By convention, the Greek letter &epsilon; (_epsilon_) is used to represent null input and/or output along an arc.

<img alt="gumball.png" src="%ATTACHURLPATH%/gumball.png" />

The bold double-circle state labeled =0= simply indicates that that state is "final", whereas the single-circle state =1= is non-final. Here, this encodes the notion that no user would walk away from a gumball machine while there's still a coin in the slot.

---+++ Finite-state acceptors

In what follows, we draw heavily from chapter 1 of [[#ReferenceDoc][Roark & Sproat 2008]], and interested readers are advised to consult that resource for further details.

---++++ Definition

_Finite-state acceptors_ (or FSAs) are the simplest type of FST. Each FSA represents a set of strings (henceforth, a _language_), similar to a regular expression, as follows. An FSA consists of a five-tuple (_Q_, _S_, _F_, &Sigma; &delta;) where:

   1 _Q_ is a finite set of states,
   1 _S_ is the set of _start_ states,
   1 _F_ is the set of _final_ states,
   1 &Sigma; is the alphabet, and
   1 &delta; : _Q_ × (&Sigma; &cup; &epsilon;) &rarr; _Q_ is the transition relation.

A _path_ through an FSA (_Q_, _S_, _F_, &Sigma; &delta;) is simply a sequence of transitions starting with a start state _s_ &isin; _S_, taking only transitions licensed by &delta;, and terminating at a final state _f_ &isin; _F_. More formally, let a path be a sequence _P_ = <i>p</i><sub>0</sub>, <i>p</i><sub>1</sub>, …, <i>p<sub>n</sub></i> where each element _p_ &isin; _P_ is a triple over _Q_ × (&Sigma; &cup; &epsilon;) × Q. Each triple consists of a source state, a _label_ in the alphabet (or the null label &epsilon;) and a destination state. Then, a path is valid if and only if all of the following conditions hold:

   1 The source state for <i>p</i><sub>0</sub> is in _S_,
   1 &delta;(<i>q<sub>i</sub></i> <i>l<sub>i</sub></i>) = <i>q<sub>i + 1</sub></i> for any <i>p<sub>i</sub></i> = (<i>q<sub>i</sub></i>, <i>l<sub>i</sub></i>, <i>q<sub>i + 1</sub></i>) &isin; _P_, and
   1 the destination state for <i>p<sub>n</sub></i> is in _F_.

The first condition requires that the path begin at a start state, the second condition requires that all transitions be licensed by &delta;, and the third condition requires that the final transition be to a final state.

Let the _string_ of a path be the (possibly empty) sequence of labels. For instance, if _P_ is (_s_, <i>l<sub>0</sub></i>, <i>q<sub>1</sub></i>, (<i>q</i><sub>1</sub>, <i>l</i><sub>1</sub>, <i>q</i><sub>2</sub>), (<i>q</i><sub>2</sub>, <i>l</i><sub>2</sub>, <i>q</i><sub>3</sub>), then the string of _P_ is simply <i>l</i><sub>0</sub>, <i>l</i><sub>1</sub>, <i>l</i><sub>2</sub>. The _language_ described by an FSA is simply the union of all strings of all its paths.

---++++ Example

The following FSA represents the Perl-style regular expression =ab*cd+e=:

<img alt="abcde.png" src="%ATTACHURLPATH%/abcde.png" />

Here, the bold circle labeled =0= indicates the start state, and the double
circle labeled =4= indicates the final state.

---+++ Finite-state transducers

---++++ Definition

_Finite-state transducers_ (FSTs) are generalization of FSAs. In the normal case of a _two-way_ transducer, &delta; is instead a relation from _Q_ × (&Sigma;<sub>i</sub> &cup; &epsilon;) × (&Sigma;<sub>o</sub> &cup; &epsilon;) &rarr; _Q_ where &Sigma;<sub>i</sub>and &Sigma;<sub>o</sub> are the input and output alphabets, respectively. Paths through a FST are defined similarly to the definition given for FSAs above, except that each path corresponds to a set of two strings, an input string over &Sigma;<sub>i</sub><sup>*</sup> and an output string over &Sigma;<sub>o</sub><sup>*</sup>. Whereas FSAs describe sets of strings, FSTs describe relations _between_ sets of strings.

When the relation described by an FST is such that each input string corresponds to at most one output string, we say that the FST is _functional_.

---++++ Example

The following FST represents the relation =(a:a)(b:b)*(c:g)(d:f)+(e:e)=:

<img alt="abcgdfe.png" src="%ATTACHURLPATH%/abcgdfe.png" />

---+++ Weighted finite-state transducers

---++++ Definition

_Weighted finite-state transducers_ (WFSTs) are generalizations of FSTs which use an alternative definition of both _F_ and &delta; incorporating the notion of _weights_. FST weights and their operations can be understood by first defining the notion of _semiring_, which require us to first define the notion _monoid_.

A monoid is a pair (_M_, •) where _M_ is a set and • is a
binary operation on _M_ such that:

   1 • is _closed_ over _M_: for all _a_, _b_ in _M_, _a_ • _b_ &isin; _M_,
   1 there is an identity element _e_ &isin; _M_ such that _a_ • _e_ = _e_ • = _a_ for all _a_ &isin; _M_, and
   1 • is _associative_; for all _a_, _b_, _c_ &isin; _M_, (_a_ • _b_) • _c_ = _a_ • (_b_ • _c_).

A monoid (_M_, •) is said to be _commutative_ if for all _a_, _b_ &isin; _M_, _a_ • _b_ = _b_ • _a_.

Then, a semiring is a triple (&#x1D542;, &oplus;, &otimes;) such that:

   1 (&#x1D542;, &oplus;) is a commutative monoid,
   1 (&#x1D542;, &otimes;) is a monoid,
   1 for all _a_, _b_, _c_ &isin; &#x1D542;, _a_ &otimes; (_b_ &oplus; _c_) = (_a_ &otimes; b) &oplus; (_a_ &otimes; _c_), and
   1 for all _a_ &isin; &#x1D542;, _a_ &otimes; <span style="text-decoration: overline;">0</span> = <span style="text-decoration: overline;">0</span> &otimes; _a_ = <span style="text-decoration: overline;">0</span> where <span style="text-decoration: overline;">0</span> is the identity element for the monoid (&#x1D542;, &oplus;).

In many cases, &#x1D542; is the set of real numbers, so a semiring can be denoted simply by specifying the pair of operators (&oplus; &otimes;). The so-called _real semiring_ is simply (+, ×).

When working with probabilities as weights, we often use the _tropical semiring_ (min, +) and negative log probabilities, taking advantage of the logarithmic identity log(_x_) + log(_y_) = log(_xy_). The tropical semiring (and the associated =standard= arc type) is the default semiring in Pynini.

At last, we can give the modified definitions for _F_ and delta; for WFSTs. Whereas for unweighted FSTs, _F_ is a set of final states, for WFSTs _F_ is a set of pairs over _Q_ × &#x1D542;, where the second element is the _final weight_ for that state. And, the transition relation &delta; for a WFST is from _Q_ × (&Sigma;<sub>i</sub> &cup; &epsilon;) × (&Sigma;<sub>o</sub> &cup; &epsilon;) × &#x1D542; to _Q_. The definition of paths is parallel to those for unweighted FSTs except that each element in the path is also associated with a weight in &#x1D542;.

---++++ Example

WFSTs are a natural representation for conditional probability distributions from strings to strings. For example, consider a text normalization rule which verbalizes =2:00= as =two= with _P_ = .2 and as =two o'clock= with _P_ = .8. The following _probability_ (i.e., real-valued) _semiring_ WFST encodes this distribution:

<img alt="time.png" height="100" src="%ATTACHURLPATH%/time.png" width="816" />

---+++ Conventions used in Pynini

   * Pynini represents all acceptors and transducers, weighted or unweighted, as
    WFSTs. Thus, for example, a weighted finite-state acceptor (WFSA) is represented as a WFST in which input and output labels match in all cases, and an unweighted finite-state transducer is represented by a WFST in which all weights are <span style="text-decoration: overline;">1</span> or <span style="text-decoration: overline;">0</span>.
   * Pynini only permits one state to be designated the start state.
   * Pynini assigns a final weight to all states; a _nonfinal_ state is just one which has a final weight of <span style="text-decoration: overline;">0</span>.

#IntroDoc
---++ Getting started with Pynini

---+++ Starting Pynini

[[http://openfst.cs.nyu.edu/twiki/pub/GRM/PyniniDownload/README.rst][Install Pynini]] then simply import the module as follows:
<verbatim>
import pynini</verbatim>

---+++ Building FSTs

In Pynini, all FST objects are an instance of a class called =pynini.Fst=, representing a mutable WFST. The user must specify the arc type at construction time; by default, the =standard= arc type (and the associated tropical semiring) is are used.

Pynini provides several functions for compiling strings into FSTs; we proceed to review a few of these methods.

#AcceptorDoc
---++++ Constructing acceptors

The =acceptor= function compiles a (Unicode or byte) string into a deterministic acceptor. The user may specify a final weight using the =weight= keyword argument; by default, the final weight is <span style="text-decoration: overline;">1</span>. The user may also specify the desired arc type using the =arc_type= keyword argument. The user may also specify how the characters in the string are to be translated into labels of the acceptor. By default (=token_type="byte"=), each arc in the acceptor corresponds to byte in the input string. Finally, the user may specify whether or not the symbol table used should be attached to the FST using the =attach_symbols= keyword argument, which defaults to =True=.
<verbatim>
# UTF-8 bytestring, byte arcs.
print(pynini.accep("Pont l'Evêque"))
# Output below...

0	1	80	80
1	2	111	111
2	3	110	110
3	4	116	116
4	5	32	32
5	6	108	108
6	7	39	39
7	8	69	69
8	9	118	118
9	10	195	195
10	11	170	170
11	12	113	113
12	13	117	117
13	14	101	101
14</verbatim>

If the user specifies =token_type="utf8"= then each arc in the FST corresponds to a Unicode codepoint.
<verbatim>
# UTF-8 bytestring, Unicode codepoint arcs.
print(pynini.accep("Pont l'Evêque", token_type="utf8"))
# Output below...

0	1	80	80
1	2	111	111
2	3	110	110
3	4	116	116
4	5	32	32
5	6	108	108
6	7	39	39
7	8	69	69
8	9	118	118
9	10	195	195
10	11	170	170
11	12	113	113
12	13	117	117
13	14	101	101
14</verbatim>

Sequences of characters enclosed by =[= and =]= receive special interpretations in =byte= or =utf8= mode. Pynini first attempts to parse any such sequence as an integer using the C standard library function =strtoll=.
<verbatim>
assert pynini.accept("b[0x61][97]") == pynini.accep("baa")  # OK.</verbatim>

If this fails, then Pynini treats the sequence as a sequence of one or more whitespace-delimited "generated" symbols, each of which is given a unique identifier in the resulting FST's symbol table.
<verbatim>
x = pynini.accep("[It's not much of a cheese shop really]")
y = pynini.accep("[It's][not][much][of][a][cheese][shop][really]")
assert x == y   # OK.</verbatim>

A bracket character to be interpreted literally *must* be escaped.
<verbatim>
bracket = pynini.accep("\[")  # OK.

unused = pynini.accep("[")  # Not OK: Raises FstStringCompilationError.</verbatim>

Finally, if the user specifies a =SymbolTable= as the =token_type=, then the input string is parsed according and the FST labeled according to that table. String parsing failures are logged and raise a =FstStringCompilationError= exception from within the Python interpreter.

Nearly all FST operations permit a string to be passed in place of an =Fst= argument; in that case, =pynini.acceptor= is used (with default arguments) to compile the string into an FST.

#TransducerDoc
---++++ Constructing transducers

The =cross= function creates a transducer from two FSTs, compiling string arguments into FSTs if necessary. The result is a cross-product of the two input FSTs, interpreted as acceptors. Specifically, the transducer is constructed by mapping output arcs in the first FST to epsilon, mapping the input arcs in the second FST to epsilon, then concatenating the two. In the case that both FSTs are strings, the resulting transducer will simply contain the input and output string converted into labels, and the shorter of the two will be padded with epsilons.

As with =acceptor=, the user may specify a final weight using the =weight= keyword argument; the final weight is <span style="text-decoration: overline;">1</span>. The user also specify the desired arc type using the =arc_type= keyword argument. The user may also specify how the characters in the input and/or output strings are to be translated into labels of the transducer, if strings are passed in place of FST arguments. By default (=input_token_type="byte"= and =output_token_type="byte"=), each arc corresponds to a byte in the input and/or output string. Finally, the user may specify whether or not symbol tables should be attached to the FST.

Whereas =cross= is used to construct the cross-product of two strings or automata, the union of many string pair cross-products is compiled using the =string_map= and =string_file=  functions. The former takes iterables of strings as an argument; the latter reads string pairs from one-to-three [[https://en.wikipedia.org/wiki/Tab-separated_values][TSV file]]. Both functions use a compact prefix tree representation which can be used to represent a (multi)map.

As with the above, the user may specify the desired arc type using the =arc_type= keyword argument.

#WorkedExamplesDoc
---+++ Worked examples

The following examples, taken from [[#ReferenceDoc][Gorman 2016]] and [[#ReferenceDoc][Gorman & Sproat 2016]], show features of Pynini in action.

---++++ Finnish vowel harmony

[[#ReferenceDoc][Koskenniemi (1983)]] provides a number of manually-compiled FSTs modeling Finnish morphophonological patterns. One of these concerns the well-known pattern of Finnish vowel harmony. Many Finnish suffixes have two allomorphs differing only in the backness specification of their vowel. For example, the adessive suffix is usually realized as _-llä_ [l&#x2D0;æ&#x2D0;] except when the preceding stem contains one of _u_, _o_, and _a_ and there is no intervening _y_, _ö_, or _ä_; in this case, it is _-lla_ [l&#x2D0;&#x251;&#x2D0;]. For example, _käde_ 'hand' has an adessive _kädellä_, whereas _vero_ 'tax' forms the adessive _verolla_ because the nearest stem vowel is _o_ ( [[#ReferenceDoc][Ringen & Heinämäki 1999]]).

The following gives a Pynini function =make_adessive= which generates the appropriate form of the noun stem. It first concatenates the stem with an abstract representation of the suffix, and then composes the result with a context-dependent rewrite rule =adessive_harmony=.
<verbatim>
back_vowel = pynini.union("u", "o", "a")
neutral_vowel = pynini.union("i", "e")
front_vowel = pynini.union("y", "ö", "ä")
vowel = pynini.union(back_vowel, neutral_vowel, front_vowel)
archiphoneme = pynini.union("A", "I", "E", "O", "U")
consonant = pynini.union("b", "c", "d", "f", "g", "h", "j", "k", "l", "m", "n",
                         "p", "q", "r", "s", "t", "v", "w", "x", "z")
sigma_star = pynini.union(vowel, consonant, archiphoneme).closure()
adessive = "llA"
intervener = pynini.union(consonant, neutral_vowel).closure()
adessive_harmony = (pynini.cdrewrite(pynini.cross("A", "a"),
                                     back_vowel + intervener, "", sigma_star) @
                    pynini.cdrewrite(pynini.cross("A", "ä"), "", "", sigma_star)
                   ).optimize()


def make_adessive(stem):
  ur = stem + adessive
  sr = ur @ adessive_harmony
  return sr.string()</verbatim>

---++++ T9 disambiguation

T9 (short for "Text on 9 keys"; [[#ReferenceDoc][Grover et al. 1998]]) is a patented predictive text entry system. In T9, each character in the "plaintext" alphabet is assigned to one of the 9 digit keys (0 is usually reserved to represent space) of the traditional 3x4 touch-tone phone grid. For instance, the message _GO HOME_ is entered as _4604663_. Since the resulting "ciphertext" may be highly ambiguous—this sequence could also be read as _GO HOOD_ or as many nonsensical expressions—a hybrid language model/lexicon is used for decoding.

The following snippet implements T9 encoding and decoding in Pynini:
<verbatim>
LM = pynini.Fst.read("charlm.fst")
T9_ENCODER = pynini.string_file("t9.tsv").closure()
T9_DECODER = pynini.invert(T9_ENCODER)


def encode_string(plaintext):
  return (plaintext @ T9_ENCODER).string()


def k_best(ciphertext, k):
  lattice = ciphertext @ T9_DECODER
  lattice.project("output")
  lattice @= LM
  return pynini.shortestpath(lattice, nshortest=k, unique=True)</verbatim>

The first line reads a language model (LM), represented as a WFSA. The second line reads the encoder table from a TSV file: in this file, each line contains an alphabetic character and the corresponding digit key. By computing the concatenative closure of this map, one obtains an FST =T9_ENCODER= which can encode arbitrarily-long plaintext strings. The =encode_string= function applies this FST to arbitrary plaintext strings: _application_ here refers to composition of a string with a transducer followed by output projection. The
=k_best= function first applies a ciphertext string—a bytestring of digits—to the inverse of the encoder FST (=T9_DECODER=). This creates an intermediate lattice of all possible plaintexts consistent with the T9 ciphertext. This is then scored with—that is, composed with—the character LM. Finally, this function returns the _k_ highest probability plaintexts in the lattice. For the following example, the highest probability plaintext is in fact the correct one:
<verbatim>
pt = "THE SINGLE MOST POPULAR CHEESE IN THE WORLD"
ct = encode_string(pt)
print("CIPHERTEXT:", ct)
print("DECIPHERED PLAINTEXT:", k_best(ct, 1).string())
---+ Output below...

CIPHERTEXT: 8430746453066780767852702433730460843096753
DECIPHERED PLAINTEXT: THE SINGLE MOST POPULAR CHEESE IN THE WORLD</verbatim>

#ApiDoc
---++ API reference %ICON{wip}%

The APIs are documented in-module. To learn about a function or method, use Python's built-in =help= function (or in Jupyter, use the =?= prefix operator); e.g., =import pynini; help(pynini)=.

#ReferenceDoc
---++ References

Gorman, Kyle. 2016. [[http://openfst.cs.nyu.edu/twiki/pub/GRM/Pynini/pynini-paper.pdf][Pynini: A Python library for weighted finite-state grammar compilation]]. In _Proceedings of the ACL Workshop on Statistical NLP and Weighted Automata_, 75-80.

Gorman, Kyle, and Richard Sproat. 2021. _Finite-State Text Processing_. Morgan & Claypool.

Gorman, Kyle, and Richard Sproat. 2016. [[https://www.oreilly.com/ideas/how-to-get-superior-text-processing-in-python-with-pynini][How to get superior text processing in Python with Pynini]]. O'Reilly Ideas blog, accessed 11/16/16.

Grover, Dale L., Martin T. King, and Clifford A. Kushler. 1998. Reduced keyboard disambiguating computer. US Patent 5,818,437.

Koskenniemi, Kimmo. 1983. Two-level morphology: A general computational model for word-form recognition and production. Doctoral dissertation, University of Helsinki.

Ringen, Catherine O. and Orvokki Heinämäki. 1999. Variation in Finnish vowel harmony: An OT account. _Natural Language and Linguistic Theory_ 17(2): 303-337.

Roark, Brian, and Richard Sproat. 2007. _Computational Approaches to Syntax and Morphology_. Oxford University Press.
Topic revision: r33 - 2022-07-10 - KyleGorman