TWiki
>
GRM Web
>
PyniniStringDoc
(2022-09-25,
KyleGorman
)
(raw view)
E
dit
A
ttach
---+ String (de)compilation This directory contains functions useful for mapping strings into FSAs (_compilation_) and for mapping string FSTs onto strings (_printing_). ---++ String token modes This library provides three string compilation/printing "modes" controlled by the enum =fst::TokenType=. For string compilation, C++ users should use =fst::StringCompile= in =stringcompile.h=. For string printing, C++ users should use =fst::StringPrint= in =stringprint.h=. A compilation or printing mode is said to be _safe_ when the =Label= template parameter has enough precision to losslessly encode all possible labels generated by that mode. If a mode is not safe for the given =Label=, silent conversion may occur. =BYTE= and =UTF8= modes follow special conventions for interpreting the ASCII square bracket characters [ and ], which are also described below. ---+++ =BYTE= mode =BYTE= mode is a sensible default for most users. When compiling with =BYTE= mode, one arc is generated for each byte in the input string and that byte is used as both input and output label for that arc. When printing with =BYTE= mode, each arc label is converted to a byte in the output bytestring. Since this mode maps between =Label= and bytes, this mode is safe (modulo bracket parsing) when the =Label= template parameter has at least 8 bits of (positive) integer precision. ---+++ =UTF8= mode When compiling with =UTF8= mode, the input string is assumed to be encoded UTF-8. Each arc is then labeled with a Unicode codepoint. This may produce a somewhat smaller machine than =BYTE= mode since multiple bytes in the input may correspond to single arcs in the resulting machine in this mode. However, if the input is ASCII, both =BYTE= and =UTF8= modes will produce equivalent FSAs (though the former will be slightly faster). When printing with =UTF8= mode, each arc is interpreted as a Unicode codepoint and the result is encoded as a UTF-8 bytestring. Since Unicode guarantees no more than 2<sup>32</sup> unique codepoints, this mode is safe (modulo bracket parsing) when the Label template parameter has at least 32 bits of (positive) integer precision. However, in the common case where all codepoints in the input fall within the [[https://en.wikipedia.org/wiki/Plane_(Unicode)#Basic_Multilingual_Plane][Basic Multilingual Plane (BMP)]], no conversions will occur if the Label template parameter has at least 16 bits of (positive) integer precision. And, in the common case that all codepoints in the input are ASCII, no conversions will occur if the Label template parameter has at least 8 bits of (positive) integer precision. ---+++ =SYMBOL= mode When compiling with =SYMBOL= mode, the input string is parsed into tokens separated by space character, and then each input is looked up in a provided =fst::SymbolTable=. Compilation fails if any token is not present in the symbol table. When printing with =SYMBOL= mode, each arc's olabel (output-side label) is looked up in a provided =fst::SymbolTable= and each symbol string is written to the output string, once again separated by a space character. If the input is a transducer rather than an acceptor, it is implicitly reinterpreted as the acceptor given by its output projection. Printing fails if the FSA is not a string FSA (i.e., is not =fst::kString=). Since =fst::SymbolTable= is keyed with =int64=, this mode is safe when the =Label= template parameter has as much precision as an =int64=. However, no conversions will occur if the Label template parameter has at least as much precision as every key in the the symbol table. ---+++ Square bracket parsing =BYTE= and =UTF8= modes follow special conventions for parsing (unescaped) ASCII square bracket characters [ and ]. When a string is enclosed by a set of (unescaped) square brackets, we refer to it as a _bracketed span_. The delimiting brackets are subsequentely ignored and the bracketed span is interpreted as follows: 1 If the bracketed span is empty, an error results. 1 If the bracketed span contains one or more space characters, each space-delimited token is treated as a "generated" symbol and added to the symbol table attached to the resulting FSA (at a suitably high index so as to avoid any collisions with BMP codepoints). 1 Otherwise, the bracketed span is run through stroll. If strtoll can parse the entire span as a (octal, hexidecimal, or decimal) integer, the span is labeled using that integer, cast to the =Label= template type. b. Otherwise, the span is treated as a "generated" symbol and stored in a global symbol table in a private use area. Note that bracket parsing thus has the same safety guarantees as =SYMBOL= mode. However, no conversions will occur during bracket parsing so long as: * bracketed integers are within the range of the Label template parameter, and * bracketed "generated" symbols are assigned keys within the range of the =Label= template parameter. When compiling with =BYTE= or =UTF8= mode, one may use a backslash to "escape" a bracket literal; one can use =fst::EscapeString= to do this automatically. When printing, no special treatment of bracket literals is necessary, as non-literal brackets are lost during the compilation process. The following pairs of input strings and corresponding labels illustrate how brackets are parsed in =BYTE= or =UTF8= mode: * "[0146][0x6f][111]": {102, 111, 111} * "[foo bar baz]": {1048577, 1048578, 1048579} * "foo[foo foo]": {102, 111, 111, 1048577, 1048577} * "foo[foo][foo]": {102, 111, 111, 1048577, 1048577} * "foo\[bar\]": {102, 111, 111, 91, 98, 97, 114, 93} One can use =fst::GeneratedSymbols= to retrieve a view of the generated symbol table. ---++ String maps =fst::StringMapCompile= compiles an FST from a vector of vector of strings. The first element in each inner vector is interpreted as the input string. The optional second element is interpreted as the output string for the transduction; if not specified it defaults to the value of the first element. An optional third element is interpreted as a weight for the transduction; if not specified it defaults to semiring One. These elements are used to form a cross-product transducer using the user-specified compilation mode (as above). The resulting FST represents the union of all pairs of weighted cross-products. =fst::StringFileCompile= is quite similar to =fst::StringMapCompile=, except that it reads pairs of input strings directly from a one-to-three-column TSV (tab-separated values) file. Both functions use a prefix-tree construction, so the resulting FST may be significantly more compact than would be generated by repeatedly computing the union of many transducers. In the case that all inputs are strings rather than string pairs, the resulting transducer is an epsilon-free acceptor.
E
dit
|
A
ttach
|
Watch
|
P
rint version
|
H
istory
: r9
<
r8
<
r7
<
r6
<
r5
|
B
acklinks
|
V
iew topic
|
WYSIWYG
|
M
ore topic actions
Topic revision: r9 - 2022-09-25
-
KyleGorman
GRM
Log In
or
Register
GRM Web
Create New Topic
Index
Search
Changes
Notifications
Statistics
Preferences
Webs
Contrib
FST
Forum
GRM
Kernel
Main
Sandbox
TWiki
Main
Copyright © 2008-2024 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki?
Send feedback