GRM Web>GrmGrammarCompiler (2011-02-08, RichardSproat)

OpenGrm Grammar Compiler

The OpenGrm Grammar Compiler is a set of tools for compiling grammars expressed as regular expressions and context-dependent rewrite rules into weighted finite-state transducers using the OpenFst format.

As a starting point we recommend taking a look at the sample grammar provided with the distribution, which will be installed in /usr/local/share/grm/stdlib. Copy the contents of that directory to a directory where you have write permissions and in that directory do the following:

$ grmmakedep example.grm
$ make

This will build an FST archive (far) from the grammar called "example.far". You can test the grammar as follows:

$ grmrewrite-tester --far=example.far --rule=TOKENIZER
Input string: Well, I can't eat muffins in an agitated manner. 
Output string: Well , I ca n't eat muffins in an agitated manner .
Input string: Mr. Ernest Worthing, B. 4, The Albany.
Output string: Mr. Ernest Worthing , B. four , The Albany .
Input string: Lieutenant 1840,
Output string: Lieutenant eighteen forty ,
Input string: Uncle Jack!
Output string: Uncle Ernest !

The comments in example.grm give an introduction to many of the features in the toolkit.

Specific Details

Each source file should have the extension .grm. The compiler compiles each source file to a single FST archive with the extension .far. The FST archive and the original source file then form a pair which can be used in other files.

Each grammar file has three sections: Imports, Functions, and Body, each of which is optional. Comments begin with a single pound sign (#) and last until the next newline.

Imports

The import section consists of a list of import statements. An import statement looks like:

import 'path/to/file.grm' as alias;

This tells the compiler to first find the provided file and load from that all functions (from the second section, described below) and all exported symbols. The symbols (FSTs) are loaded from the compiled target FST archive (file.far) which must be in the same directory as the imported source file. After this, we can reference functions and symbols from the imported file using the alias. Therefore, if file.grm provides a function named foo, then we can call on that function using the syntax alias.foo[...]. Note that an imported file may itself import other files. To access those functions and FSTs, we can just append additional aliases:

alias.utils.common.kUpperCaseLetters.

Functions

The second section involves functions. Functions in the language can be specified inline in a GRM file. A function definition uses the func keyword, then a function identifier, followed by the argument list, enclosed by square brackets and delimited by commas, and then the function body, which consists of a list of statements enclosed by curly braces. In a function, a single return object (usually an FST) must be returned using the return statement. An example is as follows:

func UnionWithTriple[fst] {
  fst3 = fst fst fst;
  result = fst | fst3;
  return result;
}

To call this function, we can then use the function name and bind the arguments as one might expect in any other popular language today:

export a_or_a3 = UnionWithTriple["a"];

Alternatively, functions in the GRM language can also be written in C++. To do this, one should overload the grm_language::function::Function class and call the proper registration macros. This is easily done by following the examples of already-written functions in src/include/grm:

arcsort.h, cdrewrite.h, closure.h, compose.h, concat.h, connect.h, determinize.h, difference.h, expand.h, function.h,
invert.h, minimize.h, optimize.h, reverse.h, rewrite.h, rmepsilon.h, stringfile.h, stringfst.h, symboltable.h, union.h

It is also necessary to #include your new header in src/lib/walker/loader.cc and add a line that registers your function in that file in the obvious place:

REGISTER_GRM_FUNCTION(YourFunction);

The majority of the built-in functions (such as union and closure) are implemented in this manner. The libraries then need to be rebuilt.

Body

The final section is the main body, which consists of the statements to generate the FSTs that are exported to the FST archive as output. Each statement consists of an assignment terminating with a semicolon:

foo = "abc";
export bar = foo | "xyz";

Statements can be preceded with the export keyword; such statements will then be written to the final output archive. Statements lacking this keyword define temporaries that be used later, but are themselves not output.

FST String Input

Basic string FSTs are defined by text enclosed by double quotes ("). (Note that raw strings, such as those used in filenames, are enclosed by single quotes (') instead.) They can be parsed in one of three ways, which are denoted using a dot (.) followed by either byte, utf8, or an identifier holding a symbol table. Note that within strings, the backslash character (\) is used to escape the next character. Of particular note, \n translates into a newline, \r into a line feed, and \t into the tab character. Literal left and right square brackets will also need escaping, as they are used natively to generate symbols (see below). All other characters following the backslash are uninterpreted, so that we can use \" and \' to insert an actual quote (double) quote symbol instead of terminating the string.

The basic way is to treat it as a pure byte string; each arc of the resulting FST will correspond to a single 1-byte character of the input. This can be specified either by leaving off the parse mode ("abc") or by explicitly using the byte mode ("abc"/byte).
The second way is to use UTF8 parsing by using the special keyword ("abc"/utf8).
Finally, we can load up a symbol table and split the string using the fst_field_separator flag (found in fst/src/lib/symbol-table.cc) and then perform symbol table lookups. Symbol tables can be loaded using the SymbolTable built-in function.

      arctic_symbol_table = SymbolTable['/path/to/bears.symtab'];
      pb = "polar bear"/arctic_symbol_table;

We can also create temporary symbols on the fly by enclosing a symbol name inside brackets within an FST string. All of the text inside the brackets will be taken to be part of the symbol name, and future encounters of the same symbol name will map to the same label. By default, labels use "Private Use Area B" of the unicode table (0x100000 - 0x10FFFD), except that the last two code points 0x10FFFC and 0x10FFFD are reserved for the "[BOS]" and "[EOS]" tags discussed below.

      cross_pos = "cross" ("" : "_[s_noun]");
      pluralize_nouns = "_[s_noun]" : "es";

If the symbol name is a complete integer (i.e., a string successfully and completely consumed by strtol() from stdlib.h) then instead of a temporary symbol, we use that number as the arc label directly. This can be used to explicitly specify arcs. For example, "[32][0x20][040]" will translate into a string FST that matches three space characters, the same as " ".

Keywords & Symbols

A list of reserved keywords and symbols (and examples where appropriate) are listed below.

=: used to assign the object on the right side (most commonly an FST) to the identifier on the left side.
export: used to specify that a particular FST symbol (rule) be written out to the FST archive.

         export foo = "abc";

(): used to group an expression to be evaluated first, breaking normal precedence rules.
;: terminates a statement
#: begins a comment that lasts until the end of the line (until the first newline).
/: used to specify that the preceding string FST be parsed using a particular parse mode.
byte: used to explicitly specify that a string should be parsed byte-by-byte for FST arcs. This is the default option when no parse mode is specified.
utf8: used to specify that a string should be parsed using UTF8 characters for FST arcs.

     a = "avarice";                # Parse mode = byte.
     b = "brackish"/byte;          # Parse mode = byte.
     c = "calumny"/utf8;           # Parse mode = UTF8.
     d = "decrepitude"/my_symtab;  # Parse mode = symbol table.

<>: used to attach a weight to the FST specified. This should be after the FST in question and the contents are an arbitrary string (without angle brackets) which will be read as the appropriate weight type for the chosen arc.

    foo = "aaa" <1>;
    goo = "aaa" : "bbb" <-1>;

import: used to import functions and exported FSTs from another GRM file into the current one.
as: specifies the aliased name that the imported file uses in the current file.

     import 'path/to/utils.grm' as utils;

func: used to declare and define a function.
[,,]: used to specify and separate function arguments.
return: specifies the object to return from the function to the caller. This keyword is invalid in the main body. Within a function, statements after the return are ignored.

     func Cleanup[fst] {
       return RmEpsilon[Determinize[fst]];
     }

Standard Library Functions, Operations, and Constants

The following are built in functions with special syntax that operate on FSTs, presented here in order of decreasing binding strength. Within a level, all operations are left-associative unless otherwise specified. All functions are assumed to expand their arguments (to VectorFst) unless otherwise specified.

Closure: repeats the argument FST an arbitrary number of times.
- fst*: accepts fst 0 or more times.
- fst+: accepts fst 1 or more times.
- fst?: accepts fst 0 or 1 times.
- fst{x,y}: accepts fst at least x but no more than y times.
Concatenation: follows the first FST with the second. This operator is right-associative. There is also a delayed version which does not expand arguments:

          foo bar
          Concat[foo, bar]
          ConcatDelayed[foo, bar]

Difference: takes the difference of two FSTs, accepting only those accepted by the first and not the second.

 
         foo - bar
         Difference[foo, bar]

Composition: composes the first FST with the second. Using the explicit function name allows for option selection regarding the sorting of the input arguments. This function uses delayed FSTs.

          foo @ bar  # This option will sort the right FST's input arcs, exactly the same as would Compose[foo, bar, 'right'].
          Compose[foo, bar] # This way will not perform any arc sorting of the two input FSTs.
          Compose[foo, bar, 'left'|'right'|'both']  # This option will sort either the left FST's output arcs, the right FST's input arcs, or both of the above.

Union: accepts either of the two argument FSTs. The delayed version doesn't expand arguments.

          foo | bar
          Union[foo, bar]
          UnionDelayed[foo, bar]

Rewrite: rewrites strings matching the first FST to the second.

          foo : bar
          Rewrite[foo, bar]

Weight: attaches a weight to the particular FST.

          fst <weight>

We also have functions that operate on FSTs that can only be called via function name. These are very easy to extend. See the above section on functions for more information.

ArcSort: sorts the arcs from all states of an FST either on the input or output tape. This function uses delayed FSTs.

          ArcSort[fst, 'input'|'output']

Connect: connects a FST, removing unreachable states and paths.

          Connect[fst]

CDRewrite: given a transducer and two context acceptors (and the alphabet machine), this will generate a new FST that performs the rewrite everywhere in the provided contexts.
- the 4th argument (sigma_star) needs to be a minimized machine.
- the 5th argument selects the direction of rewrite. We can either rewrite left-to-right or right-to-left or simultaneously.
- the 6th argument selects whether the rewrite is optional. It can either be obligatory or optional.

          CDRewrite[tau, lambda, rho, sigma_star, 'ltr'|'rtl'|'sim', 'obl'|'opt']

Determinize: determinizes an FST.

          Determinize[fst]

RmEpsilon: removes all epsilon (label 0) arcs from the argument FST.

          RmEpsilon[fst]

Expand: explicitly expands the provided FST to VectorFst (if the FST was already expanded, this just does a reference counted copy and thus is fast).

          Expand[fst]

Invert: inverts the provided FST. This function uses delayed FSTs.

          Invert[fst]

Minimize: minimizes the provided FST.

          Minimize[fst]

Optimize: optimizes the provided FST. This involves a combination of removing epsilon arcs, summing arc weights, and determinizing and minimizing the machine (possibly after encoding if the input is a transducer).

          Optimize[fst]

Reverse: reverses the provided FST.

          Reverse[fst]

The following are built-in functions that do other tasks. Note that grmmakedep will construct appropriate dependencies for files indicated with these functions. See the compilation rules below for more information.

LoadFst: we can load FSTs either directly from a file or by extracting it from a FAR.

          LoadFst['path/to/fst']
          LoadFstFromFar['path/to/far', 'fst_name']

StringFile: loads file consisting of a list of strings and compiles it (in byte mode) to an FST that represents the union of those string. It is equivalent to the union of those strings ("string1 | string2 | string3 | ..."), but can be significantly more efficient for large lists.

          StringFile['strings_file']

Symbol Table: loads and returns the symbol table at the path specified by the string argument. Note that the indir flag (found in grm/src/flags/flags.cc) is prepended (if possible) as the initial path in which the compiler searches for the resource.

          SymbolTable['path/to/symtab']

There exist also some potentially useful constants. In the byte.grm file, distributed to /usr/local/share/grm/stdlib, we have the following pre-defined FSTs, which accept what the corresponding is*() functions from C++'s ctype.h library accept. These are built to work with the byte-style interpretation of strings.

kDigit: single digits from 0 to 9
kLower: lower case letters from a to z
kUpper: upper case letters from A to Z
kAlpha: the union of kLower and kUpper (all alphabet letters in either case)
kAlnum: the union of kDigit and kAlpha (all alphabet letters in either case plus digits)
kSpace: whitespace (space, tab, newline, carriage return)
kNotSpace: characters in the range 1 to 255 except for the above whitespaces
kPunct: the majority of printable "punctuation" characters
kGraph: the union of kAlnum and kPunct (all printable ASCII characters)
kBytes: the entire byte range from 1 to 255.

Tips, Tricks, and (Known) Bugs

When using generated symbols (e.g., "[sym1]foo[sym2]") across multiple files, it is a good idea to put a listing of all of them in a single file and import that one file first. This way, if we get two separate compilations, each using a different subset of the generated symbols, we do not get a mis-matched symbol error.
When using the composition, it is often a good idea to call Optimize[] on the arguments; some compositions can be massively sped up via argument optimization. However, calling Optimize[] across the board (which one can do via the flag --optimize_all_fsts flag) often results in redundant work and can slow down compilation speeds on the whole. Judicious use of optimization is a bit of a black art.
The final statement in an input grammar file needs to be syntax-error free. The parser dies mysteriously with no helpful debugging output if the last statement is poorly formed.

Using the Compiler

-- RichardSproat - 02 Feb 2011

Topic revision: r4 - 2011-02-08 - RichardSproat