User:Steamerandy/sandbox/metalanguage


Parser Programming Languages

THE DESCRIPTION OF THESE LANGUAGES IS BASED ON AVAILABLE DOCUMENTATION AND PERSONAL EXPERIENCE IMPLEMENTING SLIC (System of Languages for Implementing Compilers) AND COMPILERS WRITTEN IN SLIC.

THE AIM IS TO DETERMINE HOW OR WHERE THEY FIT INTO TODAY'S CLASSIFICATIONS OF COMPUTER LANGUAGES AND GRAMMARS. THE SUBJECT MATTER DISCUSSED HERE IS NOT INTENDED FOR THE LAYPERSON.

Parser programming languages were components of early compiler writing systems. Information is presented here in an attempt to foster a better understanding of these older technologies, which are not parser generators. They provide parser programming languages that are more like an imperative programming language than a declarative one.

History

Parser programming languages originated in 1964 with Dewey Val Schorre's META II. META II was just a proof of the concept, a very simple implementation producing only stack machine code. Schorre went on, working for SDC (System Development Corporation), to develop CWIC. In CWIC's SYNTAX language, tree and list construction operators replaced the simple output of stack machine code from the parser. CWIC's GENERATOR language is based on LISP 2, a list processing language having an ALGOL like syntax. CWIC directly generated IBM 360 code. In SLIC I added a macro like PSEUDO instruction language and a MACHOP machine instruction language. MACHOPs transform their input parameters into a sequence of bit fields. The SLIC linker combined the relocatable SLIC output files into a target machine load file.



Though general string and text processing applications can be programmed in these languages, an important point is that these are full compiler development systems capable of source code translation to machine executable object code.

Examples are coded in cc, the successor of the SLIC (System of Languages for Implementing Compilers) language.

cc was changed to use syntax more like C, including commenting. Basically, .BEGIN and .END in the procedural sub-languages are replaced by { and }, and the compiler compiler is renamed cc, short for compiler compiler. The generator, pseudo, and machop languages changed to use C mathematical and logical operators. Though cc is a personal development, I am only using it to illustrate the functionality of the older languages. The last public document is a paper on CWIC in the ACM archives. This is about their parser programming languages.

Compiler Compiler

The term compiler compiler, originally used in describing real complete compiler development systems such as META II, TREEMETA, and CWIC, was subverted by the developers of yacc, a simplistic parser generator taking in a BNF grammar and producing a parser for that language. This questionable, perhaps unintentional, naming has forever muddied the terminology. Before yacc, true compiler compilers, META II, TREEMETA, and CWIC, were producing working compilers. Those were all called metacompilers.




The parser languages are stack-based functional programming languages. There are two operational stacks plus the call stack. The tree is constructed on the parse stack. Token formulas create token objects and push them onto the parse stack. Node objects are created and pushed on the node stack using:

:<node name>.

then

!<number>

pops the top <number> of parse stack objects, combining them with the popped top node stack entry to create a tree that is pushed on the parse stack.

The need for left recursion is eliminated by the $ zero-or-more operator. An arithmetic example illustrates parsing with directed tree construction.

expr = term $(('+':ADD|'-':SUB) term!2);

Calling the term formula, it is expected to leave only a single term object on the parse stack. When we get to the !2 we have the term just recognized on the parse stack and an operator node on the node stack. The next-to-top parse stack entry is the term previously recognized, or an ADD or SUB tree created on a previous iteration of the loop. The !2 combines the top 2 parse stack entries with the top node stack entry to create an ADD or SUB tree. The term formula operates just as described above.

term = factor $(('*':MPY|'/':DIV) factor!2);
factor = (id | number | '(' expr ')') ('^' factor:EXP!2|--);

A larger example, a top driving program formula, illustrates parsing with programmed error recovery:

 program = $((declaration        // A program is a sequence of
                                 // declarations, with the compile
             |.EOF .STOP)        // terminated by .STOP on success
                                 // of .EOF (End Of File).
                                 // On declaration long-failing, or
                                 // .EOF failing, the backtrack
                                 // alternative:
            \ ERRORX["?? Error"] // backtracks, reporting the error
                                 // and flagging the furthest parse
                                 // point. Error recovery is
                                 // programmed using -";", a negative
                                 // peek ahead string test that never
                                 // advances the parse, successful or
                                 // otherwise. .ANY advances over any
                                 // character:
              $(-";" (.ANY | .STOP))
                                 // only failing on end of file and
                                 // taking the .STOP alternative,
                                 // terminating the compile. On
                                 // matching a semicolon the -";"
                                 // fails, terminating the loop. Then
              ";");              // is matched and the main
                                 // loop continues.

 declaration = "#" directive
             | comment
             | global          DECLAR[*1]
             | (id
                ( grammar    PARSER[*1]     // Parser
                | sequencer  GENERATOR[*1]  // Generator
                | optimizer  ISO[*1]        // Optimizer
                | pseudo_op  PRODUCTION[*1] // Pseudo
                | emitor_op  MACHOP[*1]     // Machine
                )  // All of the above start with an identifier.
               \   // Backtrack: no id, or a long failure.
                  ERRORX["Syntax error"]    // Report the error and
                  garbol);                  // skip over it.

 grammar = (":"  class  :CLASS   // character class define
           |".." token  :TOKEN   // token rule
           |"==" syntax :BCKTRAK // backtrack grammar rule
           |"="  syntax :SYNTAX  // grammar rule.
           	)";"!2	        // Combine node name and tree
          	$(-.NL -"/*" .ANY); Comment to line seperator?

 comment  =    // C style comments:
            "//" $(-.NL .ANY)     
          | "/*" $(-"*/" .ANY) "*/";

 syntax   = seq ("|" alt1|"\" alt2 |--);

 alt1     = seq:ALT!2 ('|' alt1|--);   // alternative sequence
         
 alt2     = seq:BKTK!2 ("\" alt2|--);  // backtrack alternative sequence

 seq      = +[oper $oper]+;

 oper     = test | action | "(" syntax ")";

 test     = string | id ("[" (arg_list|--,NILL) "]":GENR!2|--);

The rules above define part of the cc programming language. cc is based on SLIC. Where SLIC requires sub-language section headers, .SYNTAX, .GENERATOR, .PSEUDO, .ISO, and .MACHOP, cc does away with them. CWIC actually had separate compilers for each of its languages. SLIC was a single compiler but kept the language headers used in CWIC.

I use the term formula rather than rule. Formula better fits their function. A parsing formula defines how the parse is to be performed. Formulas define the grammar they are to parse, akin to grammar rules, but they specifically define the order in which alternatives are attempted. There are also two alternative operators: backtracking and non-backtracking. A non-backtracking alternative only attempts its right alternative when its left alternative has not advanced the parse. For example:

  A B | C
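Here, if A fails without advancing the parse, the alternative C is attempted. If A succeeds and B then fails, the sequence has advanced and C is not attempted; a long failure results instead (see Backup and Backtracking below). A minimal C sketch of the control flow such a formula might compile to (A, B, C and long_fail are hypothetical test helpers, illustrative only, not actual CWIC or SLIC output):

static int A(void), B(void), C(void);
static void long_fail(void);           /* triggers a long failure     */

static int formula(void) {             /* test: 1 success, 0 failure  */
    if (A()) {                         /* first test of sequence A B  */
        if (B()) return 1;             /* A B recognized              */
        long_fail();                   /* B failed after A advanced:  */
    }                                  /* long failure, C never tried */
    return C();                        /* A failed without advancing: */
}                                      /* attempt the alternative C   */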

Prefix

I present this not as a wiki article but as information needed to better understand and classify the early transformational metalanguage META II and those that followed based on it. The reader should have some expertise in the methods of compiler construction, but not be biased by them.

The term parser programming language used here was chosen to be descriptive of Schorre's transformational metalanguages' method of parser specification. In their documentation they were said to use reductive grammars, a term that seems to have vanished from history, except for the McGraw-Hill technical term dictionary. These parser languages documented as reductive grammars are a specific type of analytical grammar having a top driving rule (see: Parser program goal).

Parser languages are not the most significant feature to come out of the compiler-compilers using them. Very significant multi-phase compiler writing systems, using specialized list processing languages based on LISP 2, were eventually incorporated with these parser languages.

Having developed the SLIC (System of Languages for Implementing Compilers) compiler writing system, based on CWIC (Compiler for Writing and Implementing Compilers), I am most knowledgeable of CWIC's inner workings. The older source to source systems have only a subset of the features of CWIC's "SYNTAX" parser language. SLIC's parser language is almost identical to CWIC's. The most striking difference is that SLIC starts with the analysis of its initiating command string, parsing input and output file strings and calling system supplied functions to set up input and output. SLIC was implemented on a DEC-SYSTEM-10 running TOPS-10. CUSPs (Commonly Used System Programs) were all required to decipher their initiating command line. DEC supplied assembly code that could be included for command analysis. I chose instead to supply a command line analysis model written in SLIC. Strictly speaking that is not a difference in the actual language. Other language differences are subtle and can only be explained once detailed knowledge of programming these languages is understood.

I provide this information here as wiki rules frown on articles written on subjects the author created. As the sole developer of the SLIC compiler-compiler I wish to avoid conflict of interest. But on the other hand I think these early developments need explaining. In developing SLIC I studied CWIC's syntax and generator languages. I met with CWIC's developers Erwin Book and Val Schorre at SDC. I have only recently spent time studying Schorre's META II. But when you understand how to program parsers in these languages the commonality is apparent. What I am trying to address here is the relation between formal grammar and the early metacompiler parser languages. As they are analytical parser programming languages, the argument that they are production grammar rewrite rules does not hold. These languages are not written with an inferred transformation provided by a parser generator, but instead use explicit transformation operators and constructs as part of their transformational metalanguage.

The problem I have faced trying to explain these languages is that people argue their equivalence to production rules.

These languages define programming languages analytically, in the same manner as parsers were being coded at the time. Early compilers were coded by hand in assembly languages. FORTRAN had just been developed. Compilers were being talked about as one huge conditional (IF) statement. Originally BNF as proposed by Backus used "or" as the connective between alternatives. Naur changed it to | in applying BNF to the ALGOL specification. As originally proposed by Backus, BNF rules obviously were written as conditional expressions. There are papers explaining in detail the proposals and applications discussed by the IAL group.

Machine code examples

IA-86 assembly code may be used in illustrating code examples, as IA-86 instruction documentation is commonly available. Examples are commented for the reader's understanding. Actual generated code of course is not commented.

Objectives

  1. Explain programming in these languages, detailing their specifics to the point of understanding the program parsing formula above.
  2. Explain how these scannerless parser programming languages overcome disadvantages claimed of scannerless parsers.
  3. Explain why their transformational metalanguages were described as a reductive grammar.
  4. Compare these parser programming languages developed in the 1960s to parsing expression grammars.
  5. Show how they efficiently handle, avoiding backtracking, many languages that generated parsers cannot parse or take exponential time parsing.
  6. Explain why these grammar specification metalanguages can not be equated to Chomsky production grammars.
  7. Show that a conflict exists between these parser programming languages and production grammar classifications.

Questions and Suggestions

I am unsure how these Schorre parser languages should be explained. They were amongst the earliest metacompilers, taking the definition of a metacompiler to be a metalanguage compiler: a compiler whose input is a program written in a metalanguage. They all take as input an analytical transformational metalanguage. This is a work in progress. Questions or suggestions are welcome. Feel free to place them here.

1.

Introduction

Parser programming languages were components of early compiler writing systems. Information is presented here in an attempt to foster a better understanding of these older technologies. Though general string and text processing applications can be programmed in these languages, an important point is that they are parts of full compiler-compiler systems that are capable of full source code translation to machine executable object code.

Explaining these languages is not an easy task. Technology took a different path and terminology has changed since papers were published describing them. The term parser programming language, hopefully, distinguishes these top-down goal-oriented syntax analyzing languages from other parser development methods.

Originating in the early 1960s as part of compiler writing systems, parser programming languages were developed by members of the Los Angeles ACM SIGPLAN Working Group 1 on syntax driven compilers. Though there were other similar developments, the focus here is on the series of META "programming languages" kick-started by, and based on, the work of Dewey Val Schorre. It started simply in 1963 with META I, introduced by Schorre in his presentation on transformational metalanguages given at the national ACM meeting in Denver, Colorado. META II followed, with an ACM paper published in 1964. There were many developments based on META II, as in that paper Schorre gave license to use META II as a basis for further development:

META II is not presented as a standard language, but as a point of departure from which a user may develop his own META "language". - Dewey Val Schorre[1]

META III, META 4, META 5, TREE-META, BOOK1, and BOOK2 followed. By 1969, with the development of CWIC, they had evolved into full optimizing compiler writing systems. Schorre was a member of the CWIC development team.

Schorre described his META "language" concept as a programming language using syntax equations:

META II is a compiler writing language which consists of syntax equations resembling Backus normal form and into which instructions to output assembly language commands are inserted.--Dewey Val Schorre[1]

Note: Schorre was not talking about assembly that becomes part of a parser. The assembly output referred to above is a transformation of the input source code recognized by the syntax equations into assembly.

Schorre chose to describe META II as a META programming language, programmed using what he called syntax equations:

Each syntax equation is translated into a recursive subroutine which tests the input string for a particular phrase structure, and deletes it if found.--Dewey Val Schorre[1]

What was not so simply explained in the META II paper is the overall top-down goal-oriented programming paradigm. It only becomes apparent in reading the META II paper and understanding the examples. The META II-in-META II program's execution starts with the top driving PROGRAM formula:

PROGRAM = '.SYNTAX' .ID .OUT ('ADR ' *) $ST '.END' .OUT ('END').,

The ., sequence is used to represent a ; character, as programs were entered on punch cards and most keypunch machines did not have a key for the semicolon character. A program written in META II starts with a .SYNTAX statement specifying, by its .ID, the starting, top, driving formula; in its own case the PROGRAM formula. The first line of the META II-in-META II program is:

.SYNTAX PROGRAM

As Schorre described META II as using syntax equations, it makes more sense to write the META II PROGRAM equation as:

PROGRAM = '.SYNTAX' .ID .OUT ('ADR ' *) $EQUATION '.END' .OUT ('END').,

After the .SYNTAX statement, $ being the zero or more loop operator, any number of syntax equations may follow until the .END statement terminates the compile.

In the ALGOL 60 report BNF rules were called formulas. Formula better aligns with the function of what Schorre called syntax equations. An updated description could be stated:

Syntax formulas are mutually recursive test functions. They may be used to program top-down recursive descent parsers. Each formula is an analytical parser function returning success or failure. The term parsing formula is more descriptive considering later developments that also include token formulas. Parsing formulas, though resembling phrase structure (constituent) rules, should not be confused with them. Grammar rules are predominantly described as rewrite/production rules. It would not be wrong to call them constituent rules, as they do define language constituents, a constituent simply being a part of something. The problem is the productive definition of phrase structure, constituent, rules.

CWIC documentation defines test as a boolean type, success=true or failure=false. A more inclusive description is that a test is an action whose outcome is success or failure.

These languages were described by Schorre as transformational metalanguages. Token recognizing functions .ID, .NUMBER, and .STRING are part of the META II language. On success of .ID, .NUMBER, or .STRING the recognized token is saved on the *stack (star stack). A * placed in the <output list> of a .OUT(<output list>) pops the top *stack string and outputs it to the output file.

My experience with these languages comes from developing SLIC. SLIC is based on CWIC, having almost identical SYNTAX and GENERATOR languages. The GENERATOR language is a specialized dialect of LISP 2. Where SLIC departs from CWIC is in the code produced by the generator language. CWIC outputs binary machine code directly from generator procedures using <> bracketed plant expressions. Primarily designed for generating IBM 360 computer code, the expressions define 8, 16, and 32 bit values. CWIC could feasibly generate machine code for many computers in use today. Planted code is written out using a .FLUSH <section name> statement.

CWIC is a two phase compiler, having a syntax analysis phase producing intermediate abstract parse trees and a generator phase transforming trees into machine code. A phase should not be confused with a pass as used in a multi-pass compiler. The generator phase is initiated by calling a generator function from a syntax formula. Supplied library functions were available for reading and writing trees to and from temporary files, enabling multi-pass compiler construction.

CWIC and SLIC both use the terms plant or planting to refer to code production. Code is planted into named sections. Sections provide the ability to separate instructions, data, constants and whatever else.

SLIC has a linker and five compiler phases, each phase programmed in a specialized sub-language designed for its specific function:

  • SYNTAX. The parser sub-language transforms source code into abstract parse trees.
  • GENERATOR. A sub-language dialect of LISP 2 transforms abstract parse trees into (macro like) PSEUDO instructions. The PSEUDO sub-language is also a dialect of LISP 2. PSEUDO lists are executed when a section is flushed. ISOs may be run on flushing, before PSEUDO instructions are called (executed).
  • ISO. In Sequence Optimization processes PSEUDO instruction lists, looking for instruction patterns and replacing them with optimized instruction sequences. (Planned but never implemented.)
  • PSEUDO expansion. PSEUDO instructions, also written in a sub-language dialect of LISP 2, are executed and deleted on flushing a section. PSEUDOs call MACHOPs to output relocatable code to link files. PSEUDO procedures may also plant PSEUDO instructions.
  • MACHOP. MACHOPs define an assembler for the target machine. They are programmed functions that output machine code, as bit fields, into bit addressable named sections. They optionally output a symbolic assembly listing to the compiler output list file.
  • LINK. SLINK, the SLIC linker program, combines SLIC relocatable files and outputs target machine load files.

The idea is that a PSEUDO instruction set would be hardware independent, maybe language specific. The target machine specifics reside in the coding of the PSEUDO instructions and the machine instructions emitted: a division of labor, expertise and function. SLIC was not meant to be a universal compiler compiler that automatically produced code for various computer systems. It is a practical approach to multi-machine targeting. PSEUDO instruction design could be generalized or specific to a language and/or machine architecture. The idea is that PSEUDO instruction coders would be experienced assembly programmers for the target machine.

Though these languages are components of a compiler compiler, the purpose here is to explain their parser language technology.

The parser languages are a part of compiler writing systems that are commonly referred to as metacompilers. The origin of the term metacompiler is uncertain. Schorre's META "languages" are programming metalanguages. Schorre did not use the term metacompiler, instead describing META II as a programming language. Schorre's languages and similar developments were being referred to as metacompilers during SIGPLAN meetings in the 1960s. In the 1968 TREE-META document a metacompiler is defined as a computer program, a compiler, that takes as input a program written in a metalanguage. That definition, also expressed in CWIC documentation, points to metacompiler being a portmanteau of metalanguage compiler.

The 1968 TREE-META documentation states:

The metacompiler class of compiler-compiler have in common a top-down parsing algorithm, a common syntax and are expressable in their own language whence the prefix meta.

However that statement does not say they compiled themselves, as many claim to be the defining property setting metacompilers apart from other compiler-compilers. Many were not able to compile the running version of themselves. Several output assembly source code files that were hand modified after translation to assembly. Others were coded in assembly or LISP and never compiled themselves. This definition also does not define what function they actually perform. Metacompilers may be used for many language oriented functions. At SDC (System Development Corporation), a government financed think tank, wink wink, metacompilers were used in developing various non-compiler related "text" processing or analysis applications.

The early parser programming languages were source to source transformation languages. As the technology advanced, specialized code generation languages were developed and the parser language transformations were to intermediate abstract parse trees.

Being programming languages, they do not use BNF's < > bracketed symbol demarcation, instead using ALGOL like quoted strings, symbols and operators. The semantics of these parsing languages specify how entities are recognized and transformed.

The main focus here is on their parser language. The parser languages are analytical. CWIC and SLIC include powerful programmed look ahead and backtracking constructs:

Analytical parsing operators
(The look ahead operators ? and - and the backtrack alternative \ operator are only available in CWIC and SLIC.)

  operation                       operator   example
  alternative                     | or /     <e1>|<e2>  <e1>/<e2>
  backtrack alternative           \          <e1>\<e2>
  backtrack formula*              ==         <i>==<e>
  backtracking loop*              %          %<e>
  expression grouping             ( )        (<e1> <e2> ... <en>)
  indefinite zero or more loop    $          $<e>
  negative look ahead             -          -<e>
  positive look ahead             ?          ?<e>

<e>, <e1>, and <e2> are arbitrary expressions. <i> is an identifier.
* SLIC added the backtracking zero or more % operator and the == formula.
The terms backtracking and backup are used in describing specific forms of failure. Backup applies to a lexing failure that only involves the input stream state. Backtracking involves the parse state, which may have created objects. Backtracking releases those objects and resets the input stream state. However output may not be undone, and a backtrack to a point having produced output is a compiler error.
Compiler error refers to an error in the compiler program. Compile error refers to an error in the program being compiled.

BNF Notation and Conventions

META II was said to be BNF like. As originally defined and used, BNF was simply a metalanguage used in explaining the ALGOL language. Names contained within < > are linguistic variables defined by linguistic formulas. BNF was specifically designed to be used in the specification of the ALGOL programming language. That put specific constraints on its design: a grammar specification language that is simple and distinct when combined with textual commentary. Thus we have < and > around symbols that were called linguistic variables. The bracketing of linguistic variables distinguished them from surrounding commentary in textual discussions. One can talk about the <function> formula with no misunderstanding or confusion that function is a syntax formula.

These parser languages are not BNF. Being programming languages, they use common programming language notation. Where BNF used character sequences representing themselves, these languages use quoted strings. Where BNF uses <name> to distinguish a linguistic variable, these languages do not bracket the name.

These languages in other ways are much like BNF as used in the ALGOL report. A linguistic variable names a single linguistic formula.

expr = term $(('+'|'-')term);

Here the same rule applies. A named formula can have only one definition.

Here, within textual descriptions, the BNF bracketed form, <expr> for example, refers to the expr formula. The < >, as in the ALGOL report, distinguish the bracketed name.

In the ALGOL report textual descriptions were also used within < > to describe language constructs requiring look ahead and textual replacement. See the ALGOL 60 report commenting specifications.

General overview

These languages appear to many as production rewrite grammar rules that a parser generator might take as input, and indeed they have features similar to EBNF. However these languages are programming languages used to write top-down goal-oriented parsers. Schorre described META II as using syntax equations. Parsing formula or function are more precise descriptive terms. The parser languages of CWIC and SLIC are stack-oriented string processing functional programming metalanguages.

I do not intend to ignore the early languages; their documentation can be found online. It is the advanced features of CWIC and SLIC, whose documentation is not readily available, that need coverage.

The type of entities recognized by the parsing languages may be classed as:

Syntax. Syntax formulas are used to specify language structure.
Token. A token is a variable sequence of characters recognized by a token function.
String. A string literal is a constant sequence of characters recognized by a quoted string.
Lexeme. A lexeme is a character sequence recognized as a unit, either by a string literal constant or by a token function.
Character class. A character class is defined by a class formula. Class formulas are not executable functions.

The languages' abilities to define recognizable entities vary. META I, META II and TREE-META use built-in token functions: .ID, .NUMBER, and .STRING. CWIC and SLIC have character class defines and token formulas.

Syntax specifications written in these languages are fundamentally the same. Early languages used / as the alternative operator. Later | was used, and a \ backtracking alternative was added to the languages. The basic arithmetic expression may be defined in identically the same ways.

expr = term $('+' term|'-' term);
expr = term $(('+'|'-') term);
expr = term ('+' expr|'-' expr|.EMPTY);

The first version used in META II to generate stack machine code assumes that term generates code leaving the result on the stack. Adding code generation and analysis down to tokens:

EXPR = TERM $('+' TERM .OUTPUT('ADD') / '-' TERM .OUTPUT('SUB'));
TERM = FACTOR $('*' FACTOR .OUTPUT('MPY') / '/' FACTOR .OUTPUT('DIV'));
FACTOR = .ID .OUTPUT ('LD   ' *) / .NUMBER .OUTPUT('LDL ' *) / '(' EXPR ')';
ASGN = .ID '=' EXPR ';' .OUTPUT ('STO ' *);

The FACTOR formula generates code that pushes values onto the stack of a stack machine, in META II's case an emulated stack machine running on an IBM 1401. An assembler is used to assemble the code produced. LD and LDL are instructions that push their arguments onto the stack. Today PUSH and PUSHI would probably have been the mnemonics used. So when a number was recognized a

LDL  <.NUMBER>

would be generated. The notation <.NUMBER> representing the token recognized by .NUMBER in the FACTOR formula. Likewise:

LD   <.ID>

would be produced when .ID recognizes an identifier. If you cannot understand these equations producing stack machine code you probably should go no further. Get an HP RPN calculator and learn to use it.
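For example, applying the above equations to the assignment x = a + b * c ; yields, step by step, the following assembly output (a sketch derived from the .OUTPUT calls above; annotations added, column spacing approximate):

        LD   a        recognized by FACTOR
        LD   b
        LD   c
        MPY           '*' FACTOR recognized by TERM
        ADD           '+' TERM recognized by EXPR
        STO  x        output by ASGN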

Lexemes and Tokens

In these languages the terms lexeme and token differ from how they are described in compiler text books today.[2] As related to these languages, a lexeme is any sequence of characters recognized as a unit. A token is a lexeme recognized by a token formula. A quoted string in a syntax formula is a constant lexeme test used to recognize a specific sequence of characters in the immediate input stream. The recognized characters are not normally kept. A token is a variable character sequence that is kept. Constant lexemes may be significant to the input program's operation.

Initially the builtin token recognizing functions .ID, .NUMBER, and .STRING were used in META II and other early parser languages. These were later replaced by token and character class formulas in CWIC. CWIC used the term token as only referring to an entity recognized by a token formula. In CWIC and SLIC token formulas create (typed) token objects. Keeping the CWIC token terminology makes the distinction between lexeme and token:

All tokens are lexemes. Not all lexemes are tokens.

In all these languages, when matching a lexeme, leading white space, or skip-class, characters are commonly skipped. Lexeme matching is accomplished using literal/quoted strings and token functions.

A quoted string in a syntax formula is a constant lexeme test used to recognize a specific sequence of characters in the immediate input stream. The recognized characters are not normally kept, and thus no token object is created.

Constant lexemes may be significant to the input program's operation, but are usually handled by programmed transformations, as in this expr formula:

expr = term $(('+':ADD|'-':SUB)term !2);

specifying a sequence of terms separated by + or - lexemes. The '+' and '-' string tests are constant lexeme matching tests. With the tree transforming operators :ADD, :SUB and !2, an ADD node is created when the '+' string is recognized. Likewise when the '-' string is recognized a SUB node is created. Created nodes are pushed onto the node stack. As terms are recognized, the !2 tree operator creates a 2 branch tree, popping the top node stack entry and the top two parse stack (term) entries. The tree forming combines the popped entities into a "tree" list and pushes it onto the parse * stack. The '+' or '-' strings need not be kept as their recognition is followed by the creation of a corresponding node. Given the following:

a-b+c

the expr formula transform would produce:

ADD[SUB[a,b],c]
                ADD
               /   \
            SUB     c
           /   \
          a     b

Above, ADD[SUB[a,b],c] is the textual display form of a tree in CWIC and SLIC. Trees are lists whose first element is a node object.

In talking about these compiler-compilers terms can be confusing. For example a string literal in a syntax formula was said to be a constant lexeme test. However to the parser language compiler that string is a token. The terms as explained are with respect to the parser produced.

string .. (''' .ANY ''' | '"' $(-'"' .ANY|"""""",'"') '"') MAKSTR[];

The string (..) token formula above is programmed to recognize a single quote bounded single character string or a double quote bounded variable length string. The double quote bounded string may contain a double quote character by placing two consecutive double quotes together. The sequence """""",'"' recognizes two " characters and replaces them with a single " character (,'"') output to the string token object. Calling the MAKSTR[] function creates a string object. The single quote bounded single character string needs no transformation, as .ANY accepts any single character between the single quote delimiters.
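For example, given the input:

"say ""hi"""

the string formula would produce (a sketch of the expected behavior) the string object say "hi", the doubled quote pair having been replaced by a single " character.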

The CWIC and SLIC token (..) formulas recognize tokens, create typed token objects, and push them on the parse (or star) *stack. Token formulas use quoted strings and character class tests in discerning token character sequences. In SLIC a character class formula creates a class mask and generates a class table that is indexed by a character's ordinal. A non-zero result of ANDing the table entry, indexed by a character's ordinal, with a class mask determines class membership.

Character class formula

Character class defines are the simplest formula. They are defined by a sequence of alternatives that may only be a character constant or a previously defined character class. They do not compile into functions instead defining class bit masks. Classes are combined to create a class table. The class table is an array indexed by a character's ordinal.

Referencing a class by name tests the current input stream character for class membership. The class mask is tested against the class table entry indexed by the character's ordinal. A non-zero result of ANDing a class mask with a class table entry indicates class membership. Character class definitions use the : define class operator.

bin: '0'|'1';
oct: bin|'2'|'3'|'4'|'5'|'6'|'7';
dgt: oct|'8'|'9';
hex: dgt|'A'|'B'|'C'|'D'|'E'|'F';

upr: 'A'|'B'|'C'|'D'|'E'|'F'|'G'
    |'H'|'I'|'J'|'K'|'L'|'M'|'N'
    |'O'|'P'|'Q'|'R'|'S'|'T'|'U'
    |'V'|'W'|'X'|'Y'|'Z';
lwr: 'a'|'b'|'c'|'d'|'e'|'f'|'g'
    |'h'|'i'|'j'|'k'|'l'|'m'|'n'
    |'o'|'p'|'q'|'r'|'s'|'t'|'u'
    |'v'|'w'|'x'|'y'|'z';

let: upr|lwr;

alphanum: let|dgt;

Above, the hex character class includes the dgt character class, which includes the oct character class, which includes the bin character class. In SLIC character class formulas define a bit mask. A bit is assigned for a class having any character constant alternative. The let and alphanum classes include only other named classes, requiring that no mask bit be allocated. A character class_table is generated from the combined character classes. Each class generally has a unique class_bit_mask that is ANDed with the class_table entry indexed by a character ordinal. That test being non-zero indicates class membership.
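A minimal C sketch of this scheme (illustrative only; the names, mask values and table layout are assumptions, not SLIC's actual implementation):

enum {              /* one table bit per class with literal characters */
    BIN = 0x01, OCT = 0x02, DGT = 0x04, HEX = 0x08, UPR = 0x10, LWR = 0x20
};
/* A class's test mask is its own bit ORed with its included classes. */
#define MASK_bin      (BIN)
#define MASK_oct      (MASK_bin | OCT)
#define MASK_dgt      (MASK_oct | DGT)
#define MASK_hex      (MASK_dgt | HEX)
#define MASK_let      (UPR | LWR)            /* no bit of its own */
#define MASK_alphanum (MASK_let | MASK_dgt)

static unsigned char class_table[256];   /* indexed by character ordinal */

static void build_class_table(void) {
    class_table['0'] |= BIN;  class_table['1'] |= BIN;
    for (int c = '2'; c <= '7'; c++) class_table[c] |= OCT;
    class_table['8'] |= DGT;  class_table['9'] |= DGT;
    for (int c = 'A'; c <= 'F'; c++) class_table[c] |= HEX;
    for (int c = 'A'; c <= 'Z'; c++) class_table[c] |= UPR;
    for (int c = 'a'; c <= 'z'; c++) class_table[c] |= LWR;
}

/* A non-zero AND of mask and table entry indicates class membership. */
static int in_class(int c, unsigned mask) {
    return class_table[(unsigned char)c] & mask;
}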

Character classes are used in token formula. .ANY is a character matching operative that matches any character. Only failing when there is no character to match i.e. on end-of-file.

The bin class is defined as a 0 or 1. When bin is referenced in other formula it matches any character listed. i.e. The characters 0 or 1.

A default skip_class character class is predefined but may be overridden by defining a skip_class. When matching strings and tokens in syntax formulas, leading skip_class characters are skipped. Character classes are not functions: a character class defines the characters belonging to a class and is not compiled into a function. Referencing one generates inline code.

Token formula

Token formulas skip leading white space (skip_class characters) and allow only backup.

id .. let $(alpha_num|+'_' alpha_num);
str .. (''' .ANY '''|'"' $(-'"' .ANY | """""",'"') '"') MAKSTR[];
chr .. ''' .ANY ''' MAKESTR[];
num .. "0B" bin $bin MAKBIN[]
      |"0O" oct $oct MAKOCT[]
      |("0H"|"0X") hex $hex MAKEHEX[]
      |dgt $dgt MAKNUM[];

Backup only involves the input stream. Token formulas and string tests save the input stream state, and on failure the input state is restored. On success the saved state is discarded. A token formula may not call itself or any other formula, except that as the last operation before returning an interceding function may be called: usually a conversion function that will bypass the default cataloging of a symbol and instead create a typed object. MAKBIN, MAKOCT, MAKEHEX, MAKNUM and MAKSTR above are examples of intercede functions.

The dictionary is a structured symbol table stack. Symbol tables are objects manipulatable in the generator language. A function's symbol table, for example, could be assigned to an attribute of its name symbol.

Of note is the id formula.

id .. let $(alpha_num|+'_' alpha_num );

In the grouped sequence:

$(alpha_num|+'_' alpha_num )

the +'_' is an example of a string test. Normally with string tests the string is not kept. The + indicates the string is to be kept, i.e. included in the token string. Also of note is that string tests in token formulas do not skip leading white space (skip_class characters). The alternative:

+'_' alpha_num

also illustrates another feature of the parsing language: failure of a test other than the first in a sequence. Here if the '_' test succeeds and alpha_num fails, the loop fails and thus id fails. An id may not end with a '_' character; it must end with an alpha_num. If an id is allowed to end with a '_' then the rule:

id .. let $(alpha_num|+'_');

would be used. Or a symchr class including the '_' char could be defined:

symchr: alpha_num|'_';

And the id token defined as:

id .. let $symchr;

If the '_' is not to be kept the + is not used:

id .. let $(alpha_num|'_');

Today parsers commonly use regular expressions. The expression [a-zA-Z][a-zA-Z0-9_]* is used to recognize an id. Regular expressions are commonplace in many applications today. Most text editors support them in search strings.
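A minimal C sketch of the id formula's behavior, reusing the class table sketch above (the input cursor and token buffer are assumed helpers, not generated SLIC code; leading skip_class skipping is elided):

static const char *input;               /* current input position      */
static char token[64];                  /* token string being built    */

/* id .. let $(alpha_num|+'_' alpha_num); */
static int id_token(void) {
    const char *save = input;           /* save input state: backup    */
    int n = 0;
    if (!in_class(*input, MASK_let)) return 0;   /* fail, no advance   */
    token[n++] = *input++;
    while (n < (int)sizeof token - 3) {          /* the $( ... ) loop  */
        if (in_class(*input, MASK_alphanum)) {
            token[n++] = *input++;
        } else if (*input == '_') {
            token[n++] = *input++;      /* +'_' keeps the underscore   */
            if (!in_class(*input, MASK_alphanum)) {
                input = save;           /* non-first test failed: the  */
                return 0;               /* loop fails and id fails,    */
            }                           /* backing the input up        */
            token[n++] = *input++;
        } else
            break;                      /* loop ends, id succeeds      */
    }
    token[n] = '\0';                    /* cataloging would follow     */
    return 1;
}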

The advantages of these lexing formula are:

1. Consistent syntax of the parsing formula.
2. Creation of specialized tokens. Numeric tokens may be used in arithmetic calculations.
3. Lexical analysis is syntax directed allowing context sensitive lexing. Automatic cataloging of symbols. Cataloged symbols have attributes that may be accessed and tested.

Of note is that lexeme and token are not equivalent in these compiler writing systems. A token is a variable string recognized by a token formula. A lexeme is simply a recognized sequence of characters. A lexeme recognized by a string constant does not normally create a token. In the syntax formula:

expr = term $(('+':ADD|'-':SUB) term!2);

the string lexeme tests '+' and '-' are not tokens. They do not create tokens. In the formula above their success results in an ADD or SUB node being created and pushed on the node stack. Token formulas create typed objects, these systems having a LISP 2 sublanguage dialect. Character classes and token formulas must be defined before their use.

Scannerless Parsing - Lexemes and Tokens

Tokens are recognized by token rules. Lexemes are any sequence of characters recognized as a unit. A quoted string in a syntax formula is a constant lexeme test used to recognize a specific sequence of characters in the immediate input stream. The recognized characters are not normally kept. A token rule is used to recognize a token lexeme: a variable character sequence that is kept. Constant lexemes may be significant to the input program's operation, but are usually handled by programmed transformations in the parser language.

Taking a mathematical expression formula:

expr = term $(('+'|'-') term);

specifying a sequence of terms separated by + or - operators. The quoted strings '+' and '-' are constant lexemes. A term, if it turns out to be a single identifier or number, is itself a lexeme. Adding CWIC transform operators:

expr = term $(('+':ADD|'-':SUB)term!2);

we have an ADD node created when the '+' constant string lexeme is recognized. Likewise when the '-' string is recognized a SUB node is created. As terms are recognized the !2 tree operator creates a two branch tree using the most recently created node. Operator strings need not be kept as their recognition triggers the creation of a node. CWIC used the term token to refer to entities recognized by token formulas. Keeping the CWIC token terminology makes this distinction between lexeme and token.

When these languages were introduced they were simply META programming languages. Separate lexer programs were not so commonly used. Having specialized lexing functions, calling them lexerless parsers is not right. Lexing formulas are restricted to a subset of the syntax formula operations. White space is skipped when matching a lexeme. Token formulas may not call other formulas. Character class definitions appear to be formulas, but are not compiled into functions. Token formulas may only reference character classes and constant strings.

In all these languages, when matching a lexeme, leading white space, or skip-class, characters are commonly skipped. Lexeme matching is accomplished using literal/quoted strings and token functions.

Syntax formula

Syntax formulae test for language constructs. Operators and functions perform transformations on recognized tokens and previous construct transformations.

If we wish to parse an assignment statement, common in many programming languages, we start with an assignment formula.

asign = id '=' expr ';';

As it is written above it can be viewed as an analytical or productive grammar rule. These languages are analytical. Schorre explained them as transformational metalanguages. As expressed above the asign formula has no transformational operatives.


In META II transformation is to stack machine assembly code.

ASIGN = .ID '=' EXPR ';' .OUT('STO ' *);

META II's .OUT and .LABEL perform source to source transformations. The .OUT('STO ' *) outputs its argument list ('STO ' *): the string 'STO ' is output, followed by the most recently recognized .ID token, output by the * argument. Assuming EXPR generates code correctly, the line output would be:

 (label: columns 1-6, opcode: columns 8-10, operand: columns 12-70)

        STO <.ID>

.OUT outputs its operands starting in column 8, the opcode field. The assembly output goes to an assembler that produces stack machine interpreted code, so output is formatted to assembler requirements. .LABEL is used to output a label starting in column 1. The stack machine STO operation pops the top stack entry into its argument, written out by the .OUT's * argument. We assume that EXPR will have generated stack machine code calculating the expression and leaving it as the top stack entry, on the stack machine for which we are generating the assembly code. The * argument of the .OUT references the top entry of the compiler's parse stack, usually referred to as the star stack, i.e. *stack. That would be the identifier recognized by the .ID token function in the ASIGN equation.

TREE-META transforms to an intermediate abstract parse tree:

ASIGN = .ID '=' EXPR ';':ASGN[2];

The TREE-META formula above, using the operators :<node string> and [<number of branches>], transforms the .ID and EXPR into a two branch, binary, ASGN tree:

     ASGN
    /    \
 <.ID>   <EXPR>

<.ID> and <EXPR> represent their respective recognized components.

CWIC and SLIC operate similarly, only using !<number of branches> as their tree transformation operator:

asign = id '=' expr ';':ASGN!2;

    ASGN
   /    \
 <id>  <expr>

The generator languages of CWIC and SLIC are specialized LISP 2 dialects. A list structure is used to hold the abstract syntax tree. Node is an object type. A list whose first element is a node:

[ASGN,<id>,<expr>]

is recognized as a tree:

    ASGN
   /    \
 <id>  <expr>

and may be displayed in text form as:

ASGN[<id>,<expr>]

A full set of rules illustrates the tree transformations:

asign = id '=' expr ';':ASGN!2;
expr = term $(('+':ADD|'-':SUB) term!2);
term = factor $(('*':MPY|'/':DIV) factor!2);
factor = '(' expr ')'|id|num;

TREE-META has functionally equivalent transformation operators:

expr = term $('+' term:ADD[2] |'-' term:SUB[2]);
term = factor $('*' factor:MPY[2]|'/' factor:DIV[2]);
factor = '('expr')'|.NUM|.ID;

It only uses the builtin .ID, .NUM, and .STRING token matching functions. [2] is equivalent to !2, combining the node string and the top 2 parse stack entries into a 3 element list that is pushed on the parse stack.

Earlier languages used textual transformations to output assembly code to an intermediate file. The textual transformations were in the form of an output function call.

expr = term $('+' term .OUT('ADD')|'-' term .OUT('SUB'));
term = factor $('*' factor .OUT ('MPY')|'/' factor .OUT ('DIV'));
factor = '('expr')'
        |.NUM .OUT('LDL ' *)
        |.ID .OUT('LD  ' *);

LDL (load literal) pushes a numeric constant on the stack. LD pushes the content of a variable on the stack.

Unlike the rewrite rules of Chomsky grammars, these parser programming languages start with a top driving formula. SLIC, implemented on a TOPS-10 system that starts programs up with command line input, first parses its invoking command line. Others started by defining a valid program module.

program = $decl;

The above example program formula defines a program to be a sequence of decl. The $ is a repeat or loop operator. It literally means "zero or more of". The above formula defines program to be zero or more decl. In a procedural language one might declare variables, constants, functions and procedures.

decl = var|const|procedure|function;

Needy Things

Character class:

Character class formulas define character groups: a named class that may only include character constant alternatives or named character class alternatives.
bin: '0'|'1';
oct: bin|'2'|'3'|'4'|'5'|'6'|'7';
dgt: oct|'8'|'9';
hex: dgt|'A'|'B'|'C'|'D'|'E'|'F';

upr: 'A'|'B'|'C'|'D'|'E'|'F'|'G'
    |'H'|'I'|'J'|'K'|'L'|'M'|'N'
    |'O'|'P'|'Q'|'R'|'S'|'T'|'U'
    |'V'|'W'|'X'|'Y'|'Z';
lwr: 'a'|'b'|'c'|'d'|'e'|'f'|'g'
    |'h'|'i'|'j'|'k'|'l'|'m'|'n'
    |'o'|'p'|'q'|'r'|'s'|'t'|'u'
    |'v'|'w'|'x'|'y'|'z';

alpha: upr|lwr;

alphanum: alpha|dgt;
Above, the hex character class includes the dgt character class, which includes the oct character class, which includes the bin character class. In SLIC character class formulas define a bit mask. A bit is assigned for a class having any character constant alternative. The alpha and alphanum classes include only other named classes, requiring that no mask bit be allocated. A character class_table is generated from the combined character classes. Each class generally has a unique class_bit_mask that is ANDed with the class_table entry indexed by a character ordinal. That test being non-zero indicates class membership. Character classes are used in token formulas. The skip_class is a predefined character class. .ANY is a character matching operative that matches any character, only failing when there is no character to match, i.e. on end-of-file.

Token formula:

Token formula are used to define token lexemes.
num .. "0B" bin $bin MAKBIN[]         // binary integer
      |"0O" oct $oct MAKOCT[]         // octal integer
      |("0H"|"0X") hex $hex MAKEHEX[] // hexadecimal integer
                                // look for decimal number
      |('+'|+'-'|--) dgt $dgt          // integer part
         ('.' $dgt                     // fractional part
           (('E'|'e')('+'|+'-'|--) dgt (dgt|--) // exponent part
           |--) MAKFLOAT[]             // floating point
         |MAKINT[]);                   // decimal integer

id .. alpha $(+'_' alphanum|alphanum);

string .. ('''.ANY''' | '"' $(-'"' .ANY | """""","""") '"') MAKESTR[];

char .. ''' .ANY ''' MAKESTR[];
-- was introduced as a .EMPTY replacement in later versions of SLIC, basically reducing line lengths.

Token functions, compiled or built in, skip leading white space (skip_class characters). In SLIC the skipping of skip_class is combined with the first match, allowing leading skip_class characters in the token. Characters matched by character classes are copied to the token. .ANY is a special operator matching any character, only failing when there is no character to match, i.e. at end of input. Quoted string constant lexeme tests are not normally kept. A + preceding a quoted string indicates it is to be part of the token. By default a token is cataloged in the dictionary. Intercede functions MAKEBIN[], MAKEOCT[], MAKEHEX[], MAKEINT[], MAKEFLOAT[], MAKESTR[] and others intercede the cataloging of symbols and create typed objects. The <name>[<argument list>] is a call to the <name> function coded in another language.
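A sketch of the results the num formula above should produce (assumed behavior, for illustration):

0B1011   MAKBIN[]    creates integer 11
0O17     MAKOCT[]    creates integer 15
0X1F     MAKEHEX[]   creates integer 31
-42      MAKINT[]    creates integer -42 (the +'-' keeps the sign in the token)
2.5E-1   MAKFLOAT[]  creates floating point 0.25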


Lexical analysis

CWIC and SLIC use token (..) formulas that recognize tokens. Token formulas create objects and push them on the (parse) *stack. Character class formulas define character classes used in token formulas. Character classes are not functions. They instead create a class table and a class (bit)mask that is tested against the class table indexed by a character's ordinal. A non-zero result of ANDing a class mask with a class table entry indicates class membership. Character class definitions use the : define class operator.

// define character classes:
bin: '0'|'1';
oct: bin|'2'|'3'|'4'|'5'|'6'|'7';
dgt: oct|'8'|'9';
hex: dgt|'a'|'b'|'c'|'d'|'e'|'f'|'A'|'B'|'C'|'D'|'E'|'F';
let: 'a'|'b'|'c'|'d'|'e'|'f'|'g'|'h'|'i'|'j'|'k'|'l'|'m'|
     'n'|'o'|'p'|'q'|'r'|'s'|'t'|'u'|'v'|'w'|'x'|'y'|'z'|
     'A'|'B'|'C'|'D'|'E'|'F'|'G'|'H'|'I'|'J'|'K'|'L'|'M'|
     'N'|'O'|'P'|'Q'|'R'|'S'|'T'|'U'|'V'|'W'|'X'|'Y'|'Z';
alpha_num: let | dgt;

Character class defines are the simplest formula. They are defined by a sequence of alternatives that may be a character constant or previously defined character class. They are not compiled into functions. Instead they are combined and used to create a class table.

The bin class is defined as a 0 or 1. When bin is referenced in other formula it matches any character listed. i.e. The characters 0 or 1.

SLIC uses character class formulas to define a class mask constant. A class table, indexed by a character's ordinal, is generated. A non-zero result of ANDing the table entry, indexed by a character's ordinal, with a class mask determines class membership.

A default skip_class character class is predefined but may be overridden by defining a skip_class. When matching strings and tokens in syntax formula leading skip_class character are skipped.



Backup and Backtracking[edit]

As already explained, CWIC and SLIC are compiled into machine code. The term test is used to describe a boolean valued (success or failure) status, or a parsing action returning or evaluating to such a test status.

These parser languages are integrated components of compiler compilers, building abstract parse trees and/or generating code. Backup refers to resetting the input state on failure of a lexeme test. Backtracking refers to resetting the parse state to a previously saved state. The parse state includes the input state and any partial structures, dictionary entries, etc. Backtracking is implemented using a backtrack stack frame and long failure. A long failure occurs in syntax formula whenever a test other than the first of a sequence fails.

(a b c | d)\e

In the above, if a returns failure the alternative d is attempted. If b or c fails, a long failure results and the backtracking alternative e is attempted. A long failure could also result from within a's call chain. A long failure resets the stack to the top backtrack stack frame and, using that frame, resets the compile to the saved state; any stacked parse function returns are discarded. A long failure is in a way like a longjmp in c or c++. In the above, a could long-fail, returning to the backtrack stack frame's e alternative. Backtracking involves releasing object memory and undoing dictionary changes. In SLIC an unaccounted-for long failure is caught by the initial backtrack stack frame, which issues an internal compile error.
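
The long-failure mechanism can be sketched in C using setjmp/longjmp, to which the text above compares it. All names here are invented and the parse state is reduced to an input position; SLIC's real frames also track dictionary and object changes.

#include <setjmp.h>
#include <stdbool.h>

typedef struct backtrack_frame {
    jmp_buf catch_point;              /* where a long failure lands    */
    long saved_input_pos;             /* simplified saved parse state  */
    struct backtrack_frame *prev;     /* frames form a linked stack    */
} backtrack_frame;

static backtrack_frame *bt_top;       /* most recent backtrack frame   */
static long input_pos;                /* stand-in for the parse state  */

/* Called when a test other than the first of a sequence fails. */
void long_fail(void) {
    longjmp(bt_top->catch_point, 1);  /* stacked returns are discarded */
}

/* left \ right : right is the backtracking alternative */
bool backtrack_alt(bool (*left)(void), bool (*right)(void)) {
    backtrack_frame frame;
    frame.saved_input_pos = input_pos;
    frame.prev = bt_top;
    bt_top = &frame;
    if (setjmp(frame.catch_point) != 0) {  /* a long failure lands here */
        input_pos = frame.saved_input_pos; /* reset the parse state     */
        bt_top = frame.prev;
        return right();
    }
    bool ok = left();
    bt_top = frame.prev;
    if (ok) return true;
    input_pos = frame.saved_input_pos;     /* plain failure left the state unchanged */
    return right();
}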

Within a grouped sub-expression all alternatives must be of the same type. In a backtracking alternative sequence the compile state is saved before the first alternative.

x = (a|b)\c\d;

The backtrack parse state is saved before the (a|b) alternative is attempted.

When META II was developed, compilers already existed for FORTRAN, COBOL and ALGOL that did not use backtracking.

The earliest Schorre languages were designed to parse languages without backtracking. META II syntax equations disallow formula requiring backtracking, e.g.

x = a b / a c;

is an invalid formula. Alternatives having the same first test are disallowed, requiring the above to be written as:

x = a (b / c);

Multiple defining of symbols is also disallowed.

x = a b;
x = a c;

A formula is compiled into a function whose name is the dependent variable. Thus no two formula may have the same dependent variable identifier.

Most programming languages even today can be parsed without the complexity of backtracking. TREE-META in 1968 did not have backtracking. However, in the case of syntax errors backtracking is quite useful: backtracking to the start of a parse sequence in error and skipping to a point where parsing may resume is desirable.

In the late 1960s language analysis systems developed at SDC (Systems Development Corporation) were being used for general document analysis applications. Those languages had extensive backtracking algorithms in order to analyze text documents.

CWIC and SLIC have backtracking. The backtrack alternative saves the parse state on starting the first, left-most alternative. In SLIC a backtrack stack frame is created; backtrack stack frames are linked. A failure return to a backtrack frame resets the parse state and transfers control to the alternative. In the test:

a \ b

if the a, left alternative, test fails, the parse is backtracked and the right alternative attempted, no matter how far along the parse may have gotten before failure occurred. A non-backtrack alternative:

a b | c

will attempt its right alternative only when the first test of the left alternative returns failure. If the second test b fails, or a long failure occurs while performing a or c, a backtracking long failure results. A long failure first resets the stack to return to the backtrack stack frame; it then resets the parse state and transfers control to the backtrack alternative.

The mechanics of backtracking in SLIC are implemented by the backtrack stack-frame holding the parse state, that is, the parse state since the previous backtrack parse state. It could simply be called a parse-state stack-frame. It contains the input state, a dictionary header, etc. When a parse-state stack-frame is removed on success, the previous parse-state stack-frame is updated: created and changed dictionary entries are applied to the previous stack-frame. Not a big implementation problem.

Abstract Parse Tree[edit]

(3•x-5)/(x+4) ->

             DIV
            /   \
         SUB     ADD
        /   \   /   \
     MPY     5 x     4
    /   \
   3     x

The APT (Abstract Parse Tree) refers to the abstract syntax tree structure created in the CWIC and SLIC parser languages. CWIC and SLIC have procedural sub-languages based on LISP 2. Variables are not typed and may contain any atomic object or list object. A list element, like a variable, may contain any object type, including a list. A tree is simply a list whose first element is a node.

The SYNTAX sub-language utilizes stacks to create an APT that is constructed on the parse stack from parse-stack and node-stack entries. The parse stack is generally referred to as the *stack (star stack). Tokens are placed on the *stack by token (..) functions or by ,<quoted string> and +<quoted string> operations. The :<node string> operation is used to create and push a node object on the node stack. The !<number> operator builds a list of <number>+1 elements whose first element is popped off the node stack. The remaining <number> of list elements are popped from the parse stack, filling the list in reverse order. The list is then pushed on the parse stack. The transform rules below perform arithmetic expression transformation as explained.

expr = term $(('+':ADD|'-':SUB) term!2);
term = factor $(('*':MPY|'/':DIV) factor!2);
factor = '(' expr ')'|number|id;

The expr formula above may simply return success having recognized a single term, or, if a + or - is recognized following that term, a corresponding ADD or SUB node is created and pushed on the node stack. On recognizing a term following the + or - test, the !2 following that term test creates a tree from the ADD or SUB node on the node stack and the two terms on the *stack:

<node>[<term>,<term>]

On successive iterations the last created tree is combined with the newly recognized term.

<node>[<node>[<term>,<term>],<term>]

Parsing the expression:

(3•x-5)/(x+4) -> 

An equivalent functional tree is built on the parse stack:

DIV[SUB[MPY[3,x],5],ADD[x,4]]
           DIV
          /   \
       SUB     ADD
      /   \   /   \
   MPY     5 x     4
  /   \
 3     x

CWIC and SLIC recognize a list whose first element is a node as a tree. Trees displayed in functional form, ADD[x,4], are equivalent to the list [ADD,x,4].

List operators +[ and ]+ are used to create a list of parsed items. In CWIC < and > were used in the syntax language to create a list while +[ and ]+ were used in the generator language.

arg_list = +[argument $(',' argument)]+;

+[ and ]+ are grouping operators that when enclosing a parse sequence create a list of the parsed items.

Program Transformation[edit]

All these languages have transforming operations included in the parsing language. META II used .OUT and .LABEL to directly write intermediate virtual stack machine assembly code to a file; * is used to output a token string. In CWIC the * stack (or parse stack) contains parsed elements. I believe META II used a * stack to hold tokens that were removed by outputting them using * in a .LABEL or .OUT transform. Built-in token functions .ID, .NUM, and .STRING parse tokens.

In producing code, gensyms (generated symbols) are often needed. In META II generated symbols are of the form %g<number>, i.e. %g1, %g2. A META II formula is compiled into a function having special local variables *1 and *2. Referencing *1 or *2 outputs a generated symbol string. On the first reference the generated symbol is created and output; successive references within the same invocation of the formula output the same generated symbol string.
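
The idea can be sketched in C (invented names; META II actually ran as interpreted virtual machine code, so this only models the behavior, with printf standing in for .OUT):

#include <stdio.h>

static int gensym_counter = 0;          /* global %g numbering */

typedef struct { char label1[16]; } gensyms;   /* one per invocation */

/* *1 for this invocation, created on first reference. */
const char *star1(gensyms *g) {
    if (g->label1[0] == '\0')
        snprintf(g->label1, sizeof g->label1, "%%g%d", ++gensym_counter);
    return g->label1;
}

void some_formula(void) {
    gensyms g = { "" };                 /* fresh label slot per invocation   */
    printf("B %s\n", star1(&g));        /* first reference creates, say, %g1 */
    printf("%s:\n", star1(&g));         /* same invocation, same %g1         */
}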

In production grammars left recursion is commonly used for a repeating sequence. The semantics of these parser languages specify left-to-right evaluation; left recursion would result in a non-terminating loop, since no test advances the parse. These languages instead employ a loop operator. The $, zero-or-more operator, preceding a parsing term repeats the term indefinitely until a failure occurs, successfully terminating the loop.

Assignment statement parsing coded in META II:
ASIGN = .ID '=' EXPR ';' .OUT('STORE ' *);
EXPR = TERM $('+' TERM .OUT('ADD') / '-' TERM .OUT('SUB'));
TERM = FACTOR $('*' FACTOR .OUT('MPY') / '/' FACTOR .OUT('DIV')); 
FACTOR = .NUM .OUT('LDL ' *)/.ID .OUT('LD ' *)/'(' EXPR ')';

Above we have four syntax formula. Each formula is a subroutine returning success or failure. ASIGN first calls .ID, a builtin identifier recognizer function following ALGOL 60 identifier specifications. If successful, a symbol string is formed and * may be used, as in the .OUT('STORE ' *) transform in the ASIGN formula. .ID failure results in ASIGN failure. On success the '=' string test is attempted. On failure of the '=' test the ASIGN formula aborts the compile. On success EXPR is called. Writing the ASIGN formula it is assumed that EXPR will generate code calculating an arithmetic expression, leaving the result as the top entry of the stack machine's stack. .OUT('STORE ' *) generates a store instruction to the .ID recognized earlier in the same rule. Right recursion is allowed, but one must be aware of code generation differences between recursion and looping.

.OUT('LDL ' *) outputs a load literal, LDL, instruction when a number is recognized, as in the FACTOR formula. Likewise .OUT('LD ' *) outputs a LD instruction when .ID recognizes an identifier. * outputs the most recent token recognized by .ID, .NUM, or .STRING in the formula.

META II was simple, generating code for a stack machine:

t = a+3*(b-2);
processed by the above rules generates:
Label   Opcode   Operand   Stack
        LD       a         a
        LDL      3         3,a
        LD       b         b,3,a
        LDL      2         2,b,3,a
        SUB                b-2,3,a
        MPY                3×(b-2),a
        ADD                a+3×(b-2)
        STORE    t

NOTE: a, 3, (b-2), b, and 2 are FACTORs.

Being a recursive descent parser, the direct generation of code for stack machines is simple. However most computers are not stack machine architectures. In 1968 TREE-META made a notable change, adding code generation unparse rules and changing the syntax equations to transform input source code into abstract parse tree structures that are passed to unparse rules. :<node> and [<number of branches>] create a <node> tree having the specified number of branches. <node> is a node identifier string.

Assignment statement in TREE-META
ASIGN = .ID '=' EXPR:ASGN[2];
EXPR = TERM $( '+' TERM:ADD[2]
              /'-' TERM:SUB[2]
             );
TERM = FACTOR $('*'FACTOR:MPY[2]
               /'/'FACTOR:DIV[2]
               );
FACTOR = '(' EXPR ')'/.NUM/.ID;

The :ASGN[2] creating a tree of the form:

   ASGN
  /    \
ID      EXPR

The EXPR rule in turn creating an ADD[2] or SUB[2] tree:

     ADD            SUB
    /   \          /   \
TERM     TERM  TERM     TERM

Likewise TERM creates MPY[2] and DIV[2] trees:

       MPY                DIV
      /   \              /   \
FACTOR     FACTOR  FACTOR     FACTOR
Note that in EXPR and TERM the zero-or-more $ loop, when recognizing zero entries, does not produce a tree. The tree thus produced is more akin to an abstract syntax tree. It was called a parse tree in CWIC documents; here, since those terms are now used differently, it is called an abstract parse tree.

Given:

t = a+3*(b-2)

the TREE-META ASIGN program generates:

  ASGN
 /    \
t      ADD
      /   \
     a     MPY
          /   \
         3     SUB
              /   \
             b     2

At about the same time, with Schorre on the developer team, CWIC did the same, only with a far more advanced code generator language based on LISP 2. The CWIC syntax language gained token and character class rules, positive and negative look ahead, and program-controlled backtracking. !<number> is used to create trees similar to the [<number>] operation in TREE-META. However node creation and tree forming are separate stack operations. CWIC uses a parse stack and node stack. Nodes are created and pushed on the node stack. Tokens are recognized, created and pushed on the parse stack by token equations. !<number> pops the top node stack entry and <number> of parse stack entries, combining them into a list whose first element is the popped node stack entry. The list is pushed onto the parse stack. The ASIGN, EXPR and TERM rules below show the tree building.

Assignment statement in CWIC
ASIGN = ID '=' EXPR:ASGN!2;
EXPR=TERM $(('+':ADD|'-':SUB)TERM!2);
TERM=FACTOR $(('*':MPY|'/':DIV)FACTOR!2);
FACTOR = '(' EXPR ')'|NUM|ID;
DGT: '0'|'1'|'2'|'3'|'4'|'5'|'6'|'7'|'8'|'9';
UPR: 'A'|'B'|'C'|'D'|'E'|'F'|'G'|'H'|'I'|
     'J'|'K'|'L'|'M'|'N'|'O'|'P'|'Q'|'R'|
     'S'|'T'|'U'|'V'|'W'|'X'|'Y'|'Z' ;
LWR: 'a'|'b'|'c'|'d'|'e'|'f'|'g'|'h'|'i'|
     'j'|'k'|'l'|'m'|'n'|'o'|'p'|'q'|'r'|
     's'|'t'|'u'|'v'|'w'|'x'|'y'|'z';
LET: UPR|LWR;
ALPHANUM:  LET|DGT;
NUM .. DGT $DGT MAKENUM[];
ID  .. LET $(ALPHANUM|+'_' ALPHANUM);

The ".." token rules, ID and NUM, create token objects and push them on the parse stack. It should be clear that the tree transforming stack operators !<number> and :<node name>, which transform node and parse stack entries into trees, are programmed operators. CWIC has many object data types. Symbol objects created by token equations are cataloged in the dictionary. Other token objects can be created using interceding functions as the last operation of a token equation: integer and floating point number objects using the MAKENUM[] and FLOAT[] functions, and string objects using STRING[]. Lists contain objects; a list is itself an object. Dynamic memory allocation and deallocation is automatic. Because of the special nature of these languages dynamic memory management can be tuned to the application.

The string equation in CWIC is a prime example of CWIC's analytical grammar recognition power. CWIC used the ' mark to delineate strings; two ' marks within a string represent a single ' mark. The string token equation uses strings, conforming to itself:

string .. '''' $(-'''' .ANY/ '''''','''') '''' MAKESTR[];

The above string parsing equation is a programmed function that recognizes a string token. A .. token making equation has defined semantics. It skips leading skip_class characters looking for the first character match; once the first match is made skipping stops. Characters recognized by .ANY and character classes are copied to the token. .ANY recognizes any single character, only failing at end of file. Quoted strings in token equations are normally only recognized; they do not become part of the token. Operators preceding a string modify its function. A comma , preceding a string is not a test: the string is unconditionally copied onto the token character sequence. A - is a negative look ahead. -'''' tests for a single quote character, returning failure if matched and success when the string is not matched; the parse is not advanced in either case. ? is a look ahead evaluating to the test result but not advancing over any characters regardless of the test result. '''''','''' tests for two ' characters together and if successful appends a single ' onto the token: 'don''t' becomes the string don't. Two single quotes '' within a string are interpreted as one single quote. A string literal preceded by a +, if matched, is copied to the token. ,<string> appends the <string> onto the token. In '=' syntax equations + and , preceding a string operate similarly, only creating string tokens and pushing them on the parse stack.
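
As a concrete illustration of the doubled-quote convention, here is a hand-written C routine, a sketch only and not CWIC's generated code, that copies a ' delimited string while reducing each '' pair to a single ' character:

#include <stdbool.h>
#include <stddef.h>

/* Scan a 'quoted' string from src into out. 'don''t' yields don't. */
bool scan_string(const char *src, char *out, size_t outsize) {
    size_t o = 0;
    if (outsize == 0 || *src++ != '\'') return false;  /* opening quote */
    while (*src) {
        if (*src == '\'') {
            if (src[1] == '\'') {               /* '' stands for one '    */
                if (o + 1 >= outsize) return false;
                out[o++] = '\'';
                src += 2;
                continue;
            }
            out[o] = '\0';                      /* closing quote: done    */
            return true;
        }
        if (o + 1 >= outsize) return false;
        out[o++] = *src++;                      /* .ANY: copy a character */
    }
    return false;                               /* unterminated string    */
}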

Schorre's META II languages were said to be made up of syntax rules similar to the BNF used in the ALGOL 60 report. BNF may have been their inspiration; however, thinking of them as generative grammar production rules should be avoided. Keep in mind that these languages were all developed before the first Dragon Book. They should be understood and studied as programming languages. META III, METAFOUR, META 5, TREE-META, CWIC and other compiler writing languages were developed based on Schorre's work.

Scannerless Parsing - Lexemes and Tokens[edit]

META II and TREE-META used built-in token recognizers, later replaced by token and character class rules in CWIC. In these languages the terms lexeme and token are different than described in modern compiler text books.[2]

Here as related to parser programming languages:

All tokens are lexemes. Not all lexemes are tokens.

Tokens are recognized by token rules. Lexemes are any sequence of characters recognized as a unit. A quoted string in a syntax formula is a constant lexeme test used to recognize a specific sequence of characters in the immediate input stream; the recognized characters are not normally kept. A token rule is used to recognize a token lexeme: a variable character sequence that is kept. Constant lexemes may be significant to the input program's operation, but are usually handled by programmed transformations in the parser language.

Taking a mathematical expression formula:

expr = term $(('+'|'-') term);

specifying a sequence of terms separated by + or - operators. The quoted strings '+' and '-' are constant lexemes. A term, if it turns out to be a single identifier or number, is itself a lexeme. Adding CWIC transform operators:

expr = term $(('+':ADD|'-':SUB)term!2);

we have an ADD node created when the '+' constant string lexeme is recognized; likewise when the '-' string is recognized a SUB node is created. As terms are recognized, the !2 tree operator creates a two-branch tree using the most recently created node. Operator strings need not be kept, as their recognition triggers the creation of a node. CWIC used the term token to refer to entities recognized by token formula. Keeping the CWIC token terminology maintains this distinction between lexeme and token.

When these languages were introduced they were simply metaprogramming languages; separate lexer programs were not so commonly used. Having specialized lexing functions, calling them lexerless parsers is not right. Lexing formula are restricted to a subset of the syntax formula operations. White space is skipped when matching a lexeme. Token formula may not call other formula. Character class definitions appear to be formula, but are not compiled into functions. Token formula may only reference character classes and constant strings.

Backtracking and Long Failure[edit]

The syntax formula are analytical test functions, each defined by a (grammar analyzing) test expression. For example, the test equation:

A = b (c|d);

defines the test function A as the test expression b (c|d). Logically the boolean test expression is:

b AND (c OR d)

The symbols b, c and d are tests that may represent parsing test functions or literal string tests. Grouping symbols ( ... ) enclose an expression that constitutes a single test term to its surroundings. As tests are recognizer actions, they are ordered, attempted in a left-to-right progression.

The | alternative (or) operator separates alternatives that are tried in left-to-right order, succeeding on the first successful alternative. The / was used in early keypunch character-restricted languages. A failure of a lexeme other than the first in a sequence of tests results in a long (backtrack) failure. The \ backtrack alternative is used to catch a long failure. When a long failure occurs the parse state is reset to the start of the backtracking left alternative and control is transferred to the right alternative. Backtracking is not necessarily local: a backtracking long failure resets the call stack to the most recent backtrack stack frame and returns failure, skipping intermediate function calls. In early Schorre META languages, not having backtracking, a compile failure resulted. On success of a backtracking alternative the recognized input is advanced over, and the backtrack stack frame's parse state is posted to the previous backtrack stack frame and removed. Programmed backtracking limits its overhead cost, which is minimal if not invoked.

An arithmetic expression, made of a sequence of terms, takes the form:

expr = term $(('+'|'-')term);

The $ meta symbol is an indefinite loop operator. The expr equation is a test function. It first calls term, another test function; if term returns failure, expr returns failure. On success the indefinite loop $(('+'|'-') term) test is entered. The '+' and '-' are literal string lexeme tests returning success or failure. A literal string lexeme test compares the input character stream against the string. In syntax equations leading skip_class characters are skipped by lexeme tests. The inner grouped ('+'|'-') test of two alternatives, + or -, constitutes a single parsing term test within the grouped (('+'|'-') term) parsing term test. If either arithmetic operator test is successful, the grouped ('+'|'-') test succeeds and the following term test is called. If the first ('+'|'-') grouped test of an arithmetic term sequence fails, the indefinite loop terminates successfully, the expr rule returning success. In CWIC and SLIC, if after recognizing a + or - the term test then fails, a long backtracking failure results. If in the call chain preceding expr there is no backtrack alternative, the default backtrack stack frame catches it, resulting in an internal compiler backtrack error.
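
In compiled form the expr equation corresponds to a test function along these lines. This is a sketch: term, match and long_fail are assumed helper names, not actual CWIC or SLIC runtime calls.

#include <stdbool.h>

extern bool term(void);
extern bool match(const char *lexeme); /* literal test, skips skip_class */
extern void long_fail(void);           /* backtracking long failure      */

/* expr = term $(('+'|'-') term); */
bool expr(void) {
    if (!term()) return false;         /* first test may fail quietly    */
    for (;;) {
        if (!(match("+") || match("-")))
            return true;               /* loop terminates successfully   */
        if (!term())                   /* non-first test of the sequence */
            long_fail();               /* ... fails long                 */
    }
}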

The Schorre-based META "languages" are all programming languages that are compiled into some form of executable machine code.

Comparison to Parsing Expression grammars[edit]

Parser programming languages are similar to PEGs or TDPLs; however, META II preceded TDPLs by a decade. Ford's paper describes many features that are equivalent to those of Schorre parser programming languages, except that parser programming languages include programming constructs directing semantic transformations to assembly code or abstract syntax trees. Almost everything that distinguishes PEGs as top-down analytical grammars suited for defining programming languages is descriptive of the technology developed decades earlier by Dewey Val Schorre. The operational interpretation of PEG expressions:

Each nonterminal in a parsing expression grammar essentially represents a parsing function in a recursive descent parser, and the corresponding parsing expression represents the "code" comprising the function. Each parsing function conceptually takes an input string as its argument, and yields one of the following results:

  • success, in which the function may optionally move forward consuming one or more characters of the input.

is the same as META II:

The method of writing compilers which is given in detail in the paper may be explained briefly as follows. Each syntax equation is translated into a recursive subroutine which tests the input string for a particular phrase structure, and deletes it if found. - Dewey Val Schorre[1]

There are many features equivalent to those listed in the Parsing expression grammar topic:

Given any existing parsing expressions e, e1, and e2, a new parsing expression can be constructed using the following PEG or Schorre language operators:
PEG       META II       CWIC                 Operation
e1 e2     e1 e2         e1 e2                e1 followed by e2, an implied e1 AND e2 sequence
N.A.      e1/e2         e1|e2                Non-backtracking ordered choice
e1/e2     N.A.          e1\e2                Backtracking ordered choice
e*        $e            $e                   Zero-or-more sequence operator
e+        e $e          e $e                 One-or-more
e?        e/.EMPTY      e/.EMPTY             Optional e
&e        N.A.          ?e                   And-predicate (look ahead)
!e        N.A.          -e                   Not-predicate (look ahead)
( ... )   ( ... )       ( ... )              Grouping
N.A.      N.A.          $(e1 .BREAK / e2)    Zero-or-more with forced termination
N.A.      N.A.          .FAIL                Forced failure
N.A.      N.A.          .ANY                 Recognize any character
N.A.      N.A.          +[ e ]+              Make list
N.A.      N.A.          :<node>              Push node
N.A.      N.A.          !<number>            Form tree
N.A.      N.A.          g[<args>]            Call functions in other languages
N.A.      .OUT( ... )   < exp >              Instruction output
N.A.      .LABEL()      %G1:                 Label output

Backtracking is not a programmed feature of PEGs or a parser generator's input. Explicit programmed backtrack points and long failure provide far more efficient backtracking.

Comparing PEGs to Schorre parser programming languages one can see the similarities. CWIC extends the parsing features, adding token rules, a backtracking alternative operator, loop break, forced failure, and calling functions in other languages. Using SLIC, separately compiled modules are linked, allowing common language features to be shared across compilers and languages. How is it that these older languages are more advanced than later TDPLs or PEGs?

Programming paradigm[edit]


Schorre's meta "languages" on casual inspection, ignoring transform operations, appear to be made up of declarative rewrite rules:

<symbol> <rewrite expression>

However it is wrong to summarily class them as such. One must take the time to understand how they work together as a program.

Underlying their declarative rewrite rule appearance they are programmed analytical functions: semantically an imperative, procedural programming language. Each constituent formula is a uniquely named parser function:

<parser function> <constituent_name> = <conditional expression> ;

A reference to a <constituent_name> in the <conditional expression> is a call to a parsing function. Their formal semantics specify the order of execution to be left to right and down. A sequence of tests is performed in the order coded. Alternatives are also ordered left to right and down, the first successful alternative terminating the alternative sequence.

Using what Schorre called syntax equations, one programs a recursive descent parser based on a top-down functional decomposition of the object language. Instead of equation, which allows transposition, the term formula better fits their procedural, imperative nature. Each constituent formula, i.e. syntax equation, is compiled into a (test) function that is a parsing formula for the constituent:

<constituent_name> =

A constituent formula is referenced by its <constituent_name> (i.e. a symbol identifier, as might be defined by the id token formula):

id .. let $((+'_'|.EMPTY) alpha_num);

Constituent (test) functions are mutually recursive functions programmed to perform top-down syntactic analysis. They are called by referencing their name. They recognize (or parse) a constituent sequence:

expr = term $(('+':ADD|'-':SUB)term!2);

These conditional ("test") expressions are a bit different than those of procedural languages. Tests are most commonly string recognizing actions that may have additional associated actions. A test's associated action on success may simply have advanced the parse; other actions performed could have created dictionary entries or semantic transformations. On one hand they are grammar rules, and on the other boolean functions.

The term test expresses an action evaluating to a boolean status, most commonly a parsing action whose resulting status, success or failure, indicates the parsing result. On failure the input state is restored to its initial state (left unchanged). On success the parse is usually advanced over the characters matched. Other transform actions may be performed on success; for example, a token formula on success will have created a token. Non-parsing programmed actions may also be included in a formula.

Whether we call them rules, equations or formulae, they are analytical boolean functions. They may include transformation operators that direct the transformation of recognized entities into other forms such as machine code or intermediate abstract parse trees:

ADD[<term>,<term>]  SUB[<term>,<term>]

         ADD                 SUB
        /   \               /   \
   <term>   <term>     <term>   <term>

These languages are used to code top-down parsers, but as top-down parsers are now described as following parse trees, that description does not exactly fit. The term reductive grammar was used in describing their parser language; their top-down, goal-oriented imperative explains the label. The programmed parsing starts with the top or driving formula specifying a program module. Of course all program possibilities cannot be defined by a single formula. The top driving program formula references other formula that in turn reference other formula, breaking a program into parts which are in turn broken down into smaller parts, reducing a program by separating it into its most basic parts. Tokens are parts formed of characters. As tokens are recognized they are kept in a push-down list or stack. This stack, referred to as the * (star) stack or parse stack, is used by transform operations.

These are string processing functional programming languages. Being string processing languages they do not exactly follow the mathematical functional programming paradigm in regards to their input and output. Instead of formal parameters, the input to a constituent test function is the input character stream. The input stream is tested character by character to match the specified constituent formula. On success the input stream has advanced over matched characters. On failure the input stream state is unchanged, reset to its initial state. One may consider the input stream processing a side effect, but these are not mathematical languages. The functionality of constituent test functions is well defined: the advancing of the parse on success is the expected action of every constituent formula, and on failure the parse state is expected to be unchanged.

These languages appear to present a declarative specification of the grammar to be parsed. Underlying their declarative appearance they are semantically a procedural programming language. Each reference to a <constituent_name> in the <conditional expression> is a call to a parsing function. Their formal semantics specify the order of execution to be left to right and down. A sequence of tests is performed in the order coded. Alternatives are also ordered left to right and down, the first succeeding alternative terminating the alternative sequence in success.

Rule Manipulation[edit]

Arguments have been presented showing how a single syntax formula is equivalent to multiple production grammar rules, i.e.

x = a (b|c)

is equivalent to:

x = a b
x = a c

Showing that grouping and alternative operators are not necessary.

But that is not the case at all. The alternative c may not be attempted, depending on how b fails. This is shown by defining b:

x = a (b|c)
b = d e;

If d returns failure, b returns failure and the alternative c would be attempted. However if d is successful and e fails, a long failure results. Control is not returned to the alternative c in x; control is returned to a backtracking alternative. The program rule is an example of backtracking with error recovery:

program = $((declaration       // A program is a sequence of declarations
            | .EOF .STOP)      // terminated by End Of File.
           \ ERRORX["Error"]   // Backtrack and report error,
                               // flagging furthest parse point.
             $-(';'|.EOF));    // Error recovery: find a ; to continue.

A program is a sequence of declarations terminated by an end of file.

$((declaration | .EOF .STOP) ...

If declaration fails and .EOF (end of file) fails the outer backtracking alternative:

\ ERRORX["Error"]

reports the error, flagging the furthest parse point, and the

$-(';'|.EOF)

loop scans for a ';' or end of file to continue the parse.

The problem with this equivalency argument is that they are not production rules. They are programmed functions of a programming language. They define languages, so they are also a type of grammar. And since a programming language is defined by a formal grammar, these must be a formal grammar.

Terms[edit]

The original terms used in describing these languages may have a different meaning than commonly used today. An attempt is made to define the terms used here: some old, some new, and others borrowed. These terms do not necessarily apply to all parser languages; e.g. META II does not create tree structures, and TREE-META creates tree structures but they are not LISP 2 based objects.

Language codes M,T,C, and S indicate the systems to which a term applies.

M = META II
T = TREE-META
C = CWIC
S = SLIC

Other Schorre based languages are known to have existed but their documentation has not been researched.

atom

C,S

In the advanced metacompilers having LISP 2 based code generation languages there are two fundamental types of objects: atom and list. An atom is any object not a list. It is short for atomic object. Some atomic objects may have attributes. Dictionary entries for example are symbol strings that may have named attribute values associated.
Abstract Parse Tree

C,S

The Abstract Parse Tree is equivalent to an abstract syntax tree. The APT is constructed by the node :<name> and tree !<number> building operators of the parser language.
Abstract Parse Tree

T

The Abstract Parse Tree is equivalent to an abstract syntax tree. The APT is constructed by the node :<name> and tree [<number>] building operators of the parser language.
dictionary

C,S

Dictionary refers to the collection of symbols cataloged primarily during parsing of the object language. The dictionary provides for attributes to be attached to dictionary entries and assigned values. Attributes are named object containers associated with dictionary entries and nodes. This is a very powerful feature: an inline function could have an attribute of its name symbol assigned its abstract parse tree. A symbol attribute is an associated variable and may hold any object type.
input stream

C,S

The source program is processed as a character stream. The actual stream is implemented internally allowing the stream to be reset to a previous input state.
lexeme

M,T,C,S

A lexeme is a recognized lexical unit, a string of characters which forms a syntactic unit, of the target language. In these parsers not all lexemes are tokens. Token equations recognize token lexemes. String literals recognize string lexemes. Recognized literal strings are not normally kept.

Note: this use of lexeme and token is different than commonly used today.

list

C,S

A list is an object used to organize objects into structures. They are essentially an ordered sequence of objects. A list is written and displayed with its elements separated by commas, and surrounded by square brackets. For example:
[ADD,[MPY,2,foo],[MPY,5,[XPN,foo,3]]]
is a list structure containing atoms and other lists that represents the abstract parse tree:
         ADD
       /     \
   MPY        MPY
  /   \      /   \
 2    foo   5     XPN
                 /   \
              foo     3

A tree is a list whose first element is a node object. In the syntax language lists may be created using list operators. CWIC uses angle brackets < ... >. SLIC uses +[ ... ]+ which is consistent with the generator language constructor of both CWIC and SLIC. Lists representing syntax trees are created using node and tree constructor operators :<node> and !<number of branches>. TREE-META uses [<number of branches>].

object

C,S

Objects are dynamic, heap managed data. They are created and destroyed automatically. Objects carry their data class and may be tested as to their class.
object code

M,T,C,S

object code is the machine code output by a compiler.
object container

C,S

Object containers hold object pointers. Unlike C and C++ there is no special syntax for object pointers. All data is dynamically allocated and destroyed automatically. Heap management is a feature of the advanced metacompilers. Variables in the generator languages are object (pointer) containers.
object language

M,T,C,S

Object language is the language a META program operates on as data.
PSEUDO instructions

S

PSEUDO instructions provide an abstract layer of code generation. They are likened to assembly macros that produce assembly language instructions. They are used to define an abstract computer architecture instruction set, providing an idealized instruction set in which a compiler can generate machine independent code.

They are programmed procedures coded in a LISP 2 dialect that call MACHOP procedures to produce target machine code.

Pseudo instructions are planted (appended) to named section code lists.
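
A minimal sketch of the plant-and-flush idea in C, with all names invented (this is not SLIC source): a pseudo instruction is appended to a named section's code list and, when the section is flushed, each entry runs, calling MACHOPs to emit target code.

#include <stdlib.h>

typedef void (*machop)(int operand);  /* stand-in for a MACHOP procedure */

typedef struct pseudo {               /* one planted pseudo instruction  */
    machop op;
    int operand;
    struct pseudo *next;
} pseudo;

typedef struct {                      /* a named code section            */
    const char *name;
    pseudo *head, *tail;
} section;

/* Plant (append) a pseudo instruction onto a section's code list. */
void plant(section *s, machop op, int operand) {
    pseudo *p = malloc(sizeof *p);
    p->op = op; p->operand = operand; p->next = NULL;
    if (s->tail) s->tail->next = p; else s->head = p;
    s->tail = p;
}

/* Flushing a section executes its pseudo list in order. */
void flush(section *s) {
    for (pseudo *p = s->head; p; p = p->next)
        p->op(p->operand);
}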

Reductive Grammar

M,T,C,S

A reductive grammar is an analytical grammar that defines language structure from the top down, repeatedly breaking down structures into smaller and smaller sub-structure parts, down to the level at which words are formed.

It defines a language in a manner coincidentally analogous to recursive descent parser analysis methods.

source code

M,T,C,S

Source code is the input to a compiler. Source code not conforming to the language specification should be rejected and reported.
source language

M,T,C,S

Source language is the programming language in which the source code is written.
stack

C,S

Stack or push down stack is a structured access array. Entries are entered on a stack by a push operation and removed by a pop operation. The call stack holds local variables and return locations of functions. Backtrack blocks are on the call stack and also form a back-linked stack. Node and parse stacks are used in building abstract parse trees.

The dictionary is a symbol table stack allowing symbol table creation, pushing the current table down a level. Dictionary look up may proceed down the stack when locating a symbol. A symbol table is itself an object and may be assigned to a variable or a symbol's attribute.

target language

M,T,C,S

Target language is the language into which the source language is translated.
test

M,T,C,S

Test refers to an operation evaluating to success or failure, and to the resulting success or failure status. In a parser equation a quoted string is a test. A parenthesized group is a test. The parser equation itself is a test. Success or failure may be considered boolean values.
token

M,T,C,S

A token is a variable character string defined in the object language specification. Tokens are lexemes; not all lexemes are tokens. Tokens are recognized using token equations. Of note: a literal string in the metasyntax parser language is not a token object but a lexeme test that recognizes a string constant.
tree

T,C,S

A list whose first element is a node object is recognized as a tree. For example:
[ADD,[MPY,2,foo],[MPY,5,[XPN,foo,3]]]
is a list structure containing atoms and other lists that represents the abstract parse tree:
         ADD
       /     \
   MPY        MPY
  /   \      /   \
 2    foo   5     XPN
                 /   \
              foo     3

There are several versions of TREE-META representing trees internally differently than described here. Their function is similar nonetheless.

variable

C,S

A variable is an object container: a pointer to an object. A variable may contain any data class (data type is called class to distinguish it from a target language type); an integer at one point, a string at another. This is old-school object programming. Object classes are specified as part of the metacompiler's language specification. Every piece of data is an object. There are several numeric object classes; numeric objects may be used in arithmetic calculations. Node and symbol objects may have attributes that are objects. Lists are objects that are object containers, like an array of object pointers.

Reductive Grammar[edit]

Reductive grammar was used in TREE-META documents describing its parser language.

To understand the term reductive grammar used to describe Schorre parser languages, we look at how reductive methods are applied in mathematics and computer science, finding that reduction is a type of top-down analysis that breaks a problem into smaller or simpler parts.

In mathematics we have specific explanations of how reduction is used in solving math problems: Matrix row and column reduction. Reductions in solving integrals by parts. Differential equation and Laplace transform reduction methods.

In computer science:

In the Prolog family of languages, logic programs have a procedural interpretation as goal-reduction procedures.

A paper published in 2007 establishes reductive logic as a methodology of proof-search analysis in math, logic and computer science: 'DEDUCTIVE LOGIC, REDUCTIVE LOGIC, AND PROOF-SEARCH', Oxford Scholarship Online, September 2007.

Reduction is a scientific descriptive term used in mathematics and computer science. Mathematical reductions basically break down a problem into smaller and/or simpler parts that have known solutions or are simpler to solve. Similarly, in computer science and applied logic, reductive relates to top-down goal seeking methods of solution that break down a problem into smaller, simpler parts.

When we use a top-down goal seeking method of analysis that reduces a problem into sub-goals, each sub-goal a reduction in complexity, we are using a reductive method of analysis.

In TREE-META papers we find its parser language described as a reductive grammar, explained as a top-down goal seeking method of syntax analysis. The main goal is to match the complete program with a top driving syntax equation:

program = $((declaration    // A program is a sequence of declarations
            | .EOF .STOP)   // terminated by End Of File.
          \ ERRORX["Error"] // Backtrack and report error,
                            // flagging furthest parse point.
            $(-';') ';');   // Error recovery: find a ; to continue.

The program rule above defines a program to be a series of declaration(s) followed by an .EOF (End Of File), at which point it stops, terminated by .STOP. A backtracking alternative catches exceptions, reports them, and scans for a declaration terminating ";" character. The above top driving, goal seeking program rule is one you might find in a CWIC or SLIC program, excluding the c style commenting. TREE-META and other earlier parser languages do not have backtracking alternatives or programmed error reporting. The top-down method of analysis though is the same.

A top-down method of syntax analysis is used in the TREE-META system. The main goal is to match the complete program with the Syntax Rule whose name follows .META at the head of the program. - TREE-META

A reductive grammar defines higher level language structures than just sentences. We can define a book:

 book = table_of_contents
        chapter $chapter
        index;

They are, however, designed to define the structure of a valid program module. A COBOL program for example is made up of divisions in a specific order:

program = IDENTIFICATION_Division
          ENVIRONMENT_Division
          DATA_Division
          PROCEDURE_Division;

The reductive grammar allows parsing of context sensitive languages like COBOL. Each COBOL division has its own set of parsing equations, though some may be common and shared. The DATA_Division and PROCEDURE_Division have variable names in common. Other tokens such as integer constants are common to several divisions.

A C program is made up of declarations:

 program = $declaration;

The top-down goal reduction parsing equations describe, and are commonly compiled into, a recursive descent parser.

Trees, Stacks and Lists[edit]

The early Schorre languages use a call stack and * (star) stack. Syntax equations are compiled into recursive functions that push parsed tokens on the * (star) stack. All of these compiler writing languages of course have a call stack. Later parser languages use separate stacks to create an abstract parse tree. CWIC tree transforming operators are explained as using node and parse stacks. Having separate node and parse stacks removes constraints imposed by earlier parser languages.

The TREE-META tree-transform operators are similar to those of the later CWIC parser language. However TREE-META tree transformation operators are not explicitly described as using stacks.

CWIC formalized stack usage in its parser language, defining two stacks used to create Abstract Parse Trees. The * (star) or parse stack holds token objects and lists or tree structures. The node stack holds node objects created and pushed on it by the :<node> operator.

Trees are created by the !<number> operator. A list of <number>+1 elements is created. The first element is the popped top node stack entry. The tree branch elements are popped from the parse stack into the remaining list elements. Parse stack elements are popped filling the list in reverse order: top parse stack entry to last available free list location.
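
The !<number> operation can be sketched in C using an invented object representation (not CWIC's actual data structures):

#include <stdlib.h>

typedef struct object object;
struct object {
    enum { NODE, SYMBOL, NUMBER, LIST } class;
    object **elems;                   /* LIST: element array      */
    int length;                       /* LIST: element count      */
    /* ... atoms carry their own data ... */
};

extern object *pop_node(void);        /* node stack               */
extern object *pop_parse(void);       /* parse stack              */
extern void push_parse(object *);

/* !n : build a list of n+1 elements and push it on the parse stack. */
void form_tree(int n) {
    object *tree = malloc(sizeof *tree);
    tree->class = LIST;
    tree->length = n + 1;
    tree->elems = malloc((n + 1) * sizeof *tree->elems);
    tree->elems[0] = pop_node();      /* e.g. an ADD node          */
    for (int i = n; i >= 1; i--)      /* top of parse stack goes   */
        tree->elems[i] = pop_parse(); /* to the last free slot     */
    push_parse(tree);                 /* [ADD,x,y] back on *stack  */
}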

SLIC, developed from CWIC, uses node and parse stacks in building abstract parse trees. A tree node is a typed atom. A non-atomic tree branch is a list whose first element is usually a node identifier. In CWIC list display form:

[ADD,x,[MPY,2,Y]]

represents the tree:

   ADD
  /   \
 x     MPY
      /   \
     2     y

Or in tree text <node>[<branches>] form found in CWIC and SLIC documentation:

ADD[x,MPY[2,y]]

The abstract parse tree is a list structure equivalent to an abstract syntax tree. Lists may also be created using +[ and ]+. Parsed items between +[ and ]+ are gathered into a list that becomes the top parse stack item. The +[ and ]+ constructs are commonly used to gather function parameters:

function_call = id '(' +[arg $(',' arg)|.EMPTY]+ ')':CALL!2;

Such lists, not having a node identifier, are processed by context. The function_call equation would generate a CALL tree whose left branch is the function id and right branch is the list of arguments:

max(x,y) ->
  CALL[max,[x,y]]

Tree building operators use stacks that hold node and branch parameters. The node and tree operators are themselves parsed by rules like:

node_maker = ':' id :NODE!1;
tree_maker = '!' int :TREE!1;

These languages all have a call stack. TREE-META versions vary on how they retain nodes and trees. A tree is identified by its node id string.

Parsing the expression:

x + y

using the equations:

expr = term $(('+':ADD|'-':SUB) term!2);
term = factor $(('*':MPY|'/':DIV)factor!2);
factor = '(' expr ')'|int|id;

Parsing starts with expr calling term, which calls factor, recognizing x as an id and pushing the token object x on the parse stack. Control returns to term which, failing to recognize a '*' or '/', returns success to expr. The + is then recognized and :ADD creates an ADD node, pushing it on the node stack. Descending through term and factor again, y is recognized as a symbol by id and pushed on the parse stack. Returning to expr, !2 creates a 3-element list representation of the ADD tree:

[ADD,x,y]

The ADD node is popped into the first element, y into the last element, and x into the next-to-last element. The list is then pushed onto the parse stack. The expr syntax equation implements the tree forming operation described. Programming a parsing formula it is expected that referenced formula return with specific stack configurations: usually a token object or tree representation of the constituent formula called.

In programming the parser transforms one thinks top-down. Each equation is a parsing function in a recursive descent parser. In the expr formula above we expect term to return a single term on the parse stack. That term may be an atomic object or a tree object.

In the $ indefinite loop we expect the left tree branch to be on the parse stack when the '+' or '-' is recognized. In creating a tree from:

x + y - z

it has been explained how x + y is transformed into the list [ADD,x,y] and left on the parse stack. The $ loop continues, recognizing the - and pushing a SUB node on the node stack. The term function recognizes z, pushing the z symbol on the parse stack. !2 again creates a tree from the last node pushed on the node stack and the top two parse stack entries, creating the list:

[SUB,[ADD,x,y],z]

The loop creates a left handed tree.

       SUB
      /   \
   ADD     z
  /   \
 x     y

Right recursion is used to create a right handed tree:

expr = term (('+':ADD|'-':SUB) expr!2|.EMPTY);
x+y-z ->

ADD[x,SUB[y,z]]

   ADD
  /   \
 x     SUB
      /   \
     y     z

It looks like it makes little difference, right? Then consider:

x-y+z
  ->
   SUB
  /   \
 x     ADD
      /   \
     y     z

The left handed tree correctly represents left-to-right evaluation, (x-y)+z, while the right handed tree above would compute x-(y+z):

       ADD
      /   \
   SUB     z
  /   \
 x     y
 



Lexing Formula[edit]

These are lexerless parser programming languages in which a lexeme is simply a sequence of characters recognized as a unit. The term token is reserved, referring only to a lexeme recognized by a token formula (rule, equation, or function). A lexeme that is not considered a token may be recognized by a quoted string literal test. Such matches are not normally kept. A + may precede the string to indicate it is to be kept.

Programmed token and character class formula, and the string modifiers '+', '-', and ',', first appeared in CWIC. Early parser languages using builtin token functions do not provide these action operators. Many features discussed here only apply to CWIC, SLIC and later generations of these languages.

META II and TREE-META use built-in token recognizers:

.ID identifier
.NUMBER number
.STRING string

CWIC and SLIC use token formula identified by the token formula ".." operator:

id .. let $alphanum;

A quoted string literal is a constant lexeme test used to recognize a specific character or character sequence. The recognized character(s) are not normally kept. A string may be preceded by a plus '+' or comma ','. The sequence +<string> is a string test that, in a token equation, if successful, is copied or appended to the token. In a syntax equation the +<string> is a test that creates a string object if successful. A ,<string> is not a test: in a token equation it simply appends the string to the token; in a syntax equation it creates a string object and pushes it on the * stack. Token rules are used to recognize token lexemes, which are variable character sequences. A string token rule recognizes a string token:

string .. (''' .ANY ''' | '"' $(-'"' .ANY | """""",'"') '"') MAKSTR[];

The above rule recognizes a single character string delimited by ' characters, or a multi-character string bounded by " characters. A single character ' delimited string needs no special case for recognizing its bounding ' character. The multi-character string bounded by " characters special-cases including its bounding " character in the string, recognizing two consecutive "" characters as representing a single " character, inserted on recognizing """""" with ,'"'. A quoted string literal in the parser language recognizes a constant character sequence.

Token equations by default create dictionary entries.

id .. let $('_' alphanum | alphanum);

The .. token equation above defines an id. An id must start with a let. It may include any number of '_' or alphanum characters following the initial let character. An '_' must be followed by an alphanum, creating the restriction that an id may not end with an '_' or have two or more consecutively. The '_' is not significant, as it is not kept as part of the id. If it should be significant, kept as part of the id string, the equation is written:

id .. let $(+'_' alphanum | alphanum);

A + preceding a string literal signifies it is to be kept, appended to the token buffer.

id .. let $(+'_' | alphanum);

Written as above, id may end with '_' and have multiple '_'s together.

The id equation references the character class equations let and alphanum. Token equations use a subset of the full syntax language and may only reference character class rules, or call generator functions as their last operation.

White space and non-printing characters are handled by lexeme string and token tests. Initial, or leading, white space (skip_class) characters are skipped. CWIC and SLIC differ on skipping algorithms: CWIC simply skips skip_class characters and then begins the test; SLIC's algorithm allows leading skip_class testing specified in the token formula.

Character Class Rules[edit]

A character class, defined by the ":" defining operator following its name, is used to define or name a set of characters. They are not functions in the same sense that token and syntax equations are. They are named character group literals, coded using the same syntax for consistency. Used in a syntax equation a character class works the same as a literal match. Generally they are used to create a table indexed by a character's numeric code. Character classes are basically a list of characters, each character separated by the "|" alternative operator. Characters are specified by their quoted glyph or numeric code.

bin:         '0'|'1';  // quoted glyph form

A bin can be the character 0 or the character 1. The bin class may be defined using numeric code values for the ascii characters 0 and 1, expressing the identical set of characters:

bin:         0H30|0H31;  // ascii hex numeric codes

Using bin in a syntax or token rule will match any character in the bin character set. The name of a previously defined character class includes all characters in the named class, as illustrated by the following commonly used character classes:

bin: '0'|'1';

oct: bin|'2'|'3'|'4'|'5'|'6'|'7';

dgt: oct|'8'|'9';

hex: dgt|'A'|'B'|'C'|'D'|'E'|'F'|'a'|'b'|'c'|'d'|'e'|'f';

upr: 'A'|'B'|'C'|'D'|'E'|'F'|'G'|'H'|'I'|'J'|'K'|'L'|'M'
    |'N'|'O'|'P'|'Q'|'R'|'S'|'T'|'U'|'V'|'W'|'X'|'Y'|'Z';

lwr: 'a'|'b'|'c'|'d'|'e'|'f'|'g'|'h'|'i'|'j'|'k'|'l'|'m'
    |'n'|'o'|'p'|'q'|'r'|'s'|'t'|'u'|'v'|'w'|'x'|'y'|'z';

let: upr|lwr;

alphanum: let | dgt;

symchr: alphanum | '_';

skip_class: 0h08 | 0h09 | 0h0A | 0h0B | 0h0C|  0h0D | ' '; 

Above, many common character classes are defined; these are used in the following token rule examples. Character class equations generally do not generate executable code. In implementation they are not called but generate inline code. Their most common use is in token rules to match characters of a token; characters thus matched are appended to the token string. Used in syntax rules they will match a character belonging to the class; like a literal match, the character is not normally kept in syntax equations. The action modifiers +, - and ? may be applied to a character class in syntax equations. The built-in .ANY character match operator will match any character code, only failing if there is no character, i.e. at the end of input. The .ANY operator is used and performs like a character class, though its implementation is quite different, only checking file status and only failing at the end of input.

It may be worth noting that implementing the above character classes, using a class map bit table, only requires 8 bits. A bit is only required for a class that itself introduces characters. One made up wholly of other classes, alphanum and let for example, requires no flag bit to be allocated. Each class is a bit mask:

bin 0b00000001
oct 0b00000011
dgt 0b00000111
hex 0b00001111
upr 0b00010000
lwr 0b00100000
let 0b00110000
alphanum 0b00110111
symchr 0b01110111
skip_class 0b10000000
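
The class map can be sketched in C. This is a hypothetical illustration following the bit assignments above; the names classmap, IN_CLASS and the C_/M_ constants are invented for the sketch, not taken from CWIC or SLIC sources:

#include <stdio.h>

/* Per-character flag bits, one per class that introduces characters. */
enum {
    C_BIN  = 0x01,  /* '0'..'1'           */
    C_OCT  = 0x02,  /* '2'..'7'           */
    C_DGT  = 0x04,  /* '8'..'9'           */
    C_HEX  = 0x08,  /* 'A'..'F', 'a'..'f' */
    C_UPR  = 0x10,  /* 'A'..'Z'           */
    C_LWR  = 0x20,  /* 'a'..'z'           */
    C_UND  = 0x40,  /* '_'                */
    C_SKIP = 0x80   /* skip_class         */
};

/* Class test masks: classes made wholly of other classes need no bit. */
#define M_BIN      (C_BIN)
#define M_OCT      (C_BIN|C_OCT)
#define M_DGT      (M_OCT|C_DGT)
#define M_HEX      (M_DGT|C_HEX)
#define M_LET      (C_UPR|C_LWR)
#define M_ALPHANUM (M_LET|M_DGT)
#define M_SYMCHR   (M_ALPHANUM|C_UND)

static unsigned char classmap[256];

static void init_classmap(void) {
    int c;
    for (c = '0'; c <= '1'; c++) classmap[c] |= C_BIN;
    for (c = '2'; c <= '7'; c++) classmap[c] |= C_OCT;
    for (c = '8'; c <= '9'; c++) classmap[c] |= C_DGT;
    for (c = 'A'; c <= 'F'; c++) classmap[c] |= C_HEX;
    for (c = 'a'; c <= 'f'; c++) classmap[c] |= C_HEX;
    for (c = 'A'; c <= 'Z'; c++) classmap[c] |= C_UPR;
    for (c = 'a'; c <= 'z'; c++) classmap[c] |= C_LWR;
    classmap['_'] |= C_UND;
    classmap[' '] |= C_SKIP;
    for (c = 0x08; c <= 0x0D; c++) classmap[c] |= C_SKIP;
}

/* A character class reference compiles to a single AND test. */
#define IN_CLASS(ch, mask) ((classmap[(unsigned char)(ch)] & (mask)) != 0)

int main(void) {
    init_classmap();
    printf("%d %d\n", IN_CLASS('7', M_OCT), IN_CLASS('9', M_OCT)); /* 1 0 */
    return 0;
}

A class reference then costs a single AND against the character's table entry, which is why a class built wholly of other classes needs no bit of its own.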

Token and string literal lexeme tests, when used in syntax equations, have skip_class skipping in common: leading skip_class characters are skipped over. CWIC simply skips over leading skip_class characters. SLIC repeatedly attempts the first test, advancing over skip_class characters until the first character is matched or a character that is not skip_class fails to match.
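
The difference can be sketched in C. The demo input buffer and the helpers peek, advance, is_skip and first_test are invented for the illustration:

#include <ctype.h>
#include <stdio.h>

static const char *in = " \t  42;";                 /* demo input        */
static int  peek(void)      { return (unsigned char)*in; }
static void advance(void)   { if (*in) in++; }
static int  is_skip(int c)  { return c == ' ' || (c >= 0x08 && c <= 0x0D); }
static int  first_test(void){ return isdigit(peek()) != 0; } /* e.g. num's dgt */

/* CWIC: skip all skip_class characters, then begin the token test. */
static void cwic_skip(void) {
    while (is_skip(peek()))
        advance();
}

/* SLIC: repeatedly attempt the token's first test; after each failed
   attempt advance over one skip_class character, failing as soon as a
   non-skip_class character does not match the first test. */
static int slic_skip_and_first(void) {
    for (;;) {
        if (first_test()) return 1;   /* first character matched */
        if (!is_skip(peek())) return 0;
        advance();                    /* skip one and retry      */
    }
}

int main(void) {
    cwic_skip();                      /* CWIC style: skip, then test */
    printf("%d %d\n", first_test(), slic_skip_and_first());  /* 1 1 */
    return 0;
}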

Token formula[edit]

Many compilers have a separate lexical pass. That is not the case here: tokens are parsed on the fly. Token equations create the atomic objects, the words, strings and numbers, of the target language.

Token equations may only use static matching. That is, they may not call any function, except that a conversion function may be called as the last operation. (Language features implemented by functions are excepted, i.e. _CmpStr.) On success an object is created and pushed on the parse stack. The default is to create a symbol object and catalog it in the current symbol table. Conversion functions may instead be called to create numeric, string, or other typed objects.

Token equations reference character classes, which generate inline tests and are not considered function calls.

num .. dgt $dgt MAKEINT[];

The num equation above recognizes a sequence of digits as a token. Calling MAKEINT[] as its last operation converts the characters to a numeric object, bypassing symbol cataloging. Numeric objects may be used in numeric calculations in the generator, PSEUDO, and MACHOP languages.

Token rules do not use the common regexp form of lexeme analysis. CWIC is a programming language developed on an IBM 360 computer. Its objective was the development of efficient optimizing compilers that are easily maintained. Readability is important in programming. The token and character class equations, being of the same language form as the syntax equations, are simpler to read and understand. They are also more general. An integer token may be expressed in binary, octal, hexadecimal or base 10. The integer equation below defines an integer constant that may be expressed in those radices.

integer .. ("0b"|"0B") bin $bin           MAKEBIN[]  // binary integer
          |("0o"|"0O") oct $oct           MAKEOCT[]  // octal integer
          |("0x"|"0X"|"0h"|"0H") hex $hex MAKEHEX[]  // hex integer
          |dgt $dgt                       MAKEINT[]; // decimal integer

In the integer equation above, the leading radix designation strings are recognized but not kept as part of the integer. In token equations, once an initial character is matched, skip_class skipping stops. Token equations use backup, which saves only the input stream state on entry. Any of the outer first-level alternatives within a token equation may back up.

The interceding numeric object creating functions allow calculations involving numeric tokens. This allows decisions affecting code production, such as determining the use of an immediate mode instruction versus planting a constant addressed by an instruction. Compile time calculations reduce the instructions generated.
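
The conversion itself is ordinary radix arithmetic. In the C sketch below, make_radix stands in for the MAKEBIN/MAKEOCT/MAKEHEX/MAKEINT family; the real functions create library-managed numeric objects rather than plain longs:

#include <stdio.h>
#include <stdlib.h>

/* Convert the characters matched by a token test into a number.
   strtol performs the per-digit accumulation value = value*base + digit. */
static long make_radix(const char *token_text, int base) {
    return strtol(token_text, NULL, base);
}

int main(void) {
    printf("%ld %ld %ld\n",
           make_radix("100101", 2),   /* from 0b100101 -> 37 */
           make_radix("2A", 16),      /* from 0x2A     -> 42 */
           make_radix("42", 10));     /* plain decimal -> 42 */
    return 0;
}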

Some token equation examples, part of the metasyntax language:

integer ..  ("0b"|"0B")           bin $bin MAKEBIN[]  // binary integer
           |("0o"|"0O")           oct $oct MAKEOCT[]  // octal integer
           |("0x"|"0X"|"0h"|"0H") hex $hex MAKEHEX[]  // hex integer
           |                      dgt $dgt MAKEINT[]; // decimal integer


id .. let $symchr;

string .. ('"' $(-'"' .any|"""""","""") '"'
          |''' .any ''') makestr();

char .. ''' .any ''' makestr[];

Token operators are described in the table below.

With lexical processing done on the fly, the skip_class character class defines white space and non-displaying characters that are skipped when looking for a token. Skip_class skipping is automatic until a character is matched by the token equation. No white space is allowed between successive character matches unless specifically expressed in the token equation. Normally a successful token equation creates a symbol object, resulting in a dictionary token. Symbols are automatically cataloged in the dictionary and the dictionary object placed on the parse stack. Every token matched by a token equation is automatically pushed on the parse stack. Functions that intercede the normal symbol processing can be used to create other object types.

Below is the metacompiler's integer token equation. The leading radix code strings preceding the numeric string determine the valid characters, and the interceding conversion function creates an integer object. Alternatives parse the various radix strings. A binary integer starts with 0b or 0B followed by one or more bin characters; 0b100101 is a valid sequence defined by the first alternative. The character class bin was defined in the character class section above.

integer .. ("0b"|"0B") bin $bin           MAKEBIN[]  // binary integer
          |("0o"|"0O") oct $oct           MAKEOCT[]  // octal integer
          |("0x"|"0X"|"0h"|"0H") hex $hex MAKEHEX[]  // hex integer
          |dgt $dgt                       MAKEINT[]; // decimal integer

The integer equation above has alternatives separated by the '|' character. ("0x"|"0X"|"0h"|"0H") defines the strings identifying a hexadecimal integer: 0x, 0X, 0h or 0H precede a hex, base 16, number. MAKEBIN[], MAKEOCT[], MAKEHEX[], and MAKEINT[] intercede the default dictionary entry, converting the sequence of characters matched into an integer object and pushing the object onto the parse stack. A conversion function may only be used at the end, as the last entry, of an outermost alternative of a token equation. The integer equation illustrates how the programmer controls parsing and the placement of conversion functions. A decimal number having no radix prefix is the last alternative. Alternatives are tried in order, the first match satisfying the rule. The programmer is responsible for avoiding ambiguities. Strings matched in token equations, for example "0b", are not kept as part of the token. A +<string> may be used when it is desired that a matched string be kept as part of the token. A string may be inserted into a token using the insert ,<string> operator.

Interceding functions are generator functions that return success or failure. If it were required to have a radix specifier at the end of an integer it would be a simple matter to parse all possible numeric characters and have the conversion function check for valid radix characters.

Backup is built into token rules. The input state is saved on entry, restored on failure, and discarded on success.
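
A C sketch of that backup discipline, with an invented InputState holding only the input cursor (a real implementation would also track line and column):

#include <stdio.h>

typedef struct { const char *pos; } InputState;

static const char *ip = "abc";                    /* demo input cursor   */
static InputState save_input(void)      { InputState s = { ip }; return s; }
static void restore_input(InputState s) { ip = s.pos; }

static int match_token_body(void) {               /* stand-in token test */
    if (*ip == 'x') { ip++; return 1; }
    return 0;
}

static int token_rule(void) {
    InputState mark = save_input();
    if (match_token_body())
        return 1;             /* success: keep the advanced input  */
    restore_input(mark);      /* failure: back up the input stream */
    return 0;
}

int main(void) {
    printf("%d\n", token_rule());   /* prints 0, input restored */
    return 0;
}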

With a programmable parser we are not confined to token equation rules. A higher level grammar rule could be used:

binary  .. bin $bin ("b"|"B")         MAKEBIN[];  // binary integer
octal   .. oct $oct ("o"|"O")         MAKEOCT[];  // octal integer
hexnum  .. hex $hex ("x"|"X"|"h"|"H") MAKEHEX[];  // hex integer
decimal .. dgt $dgt                   MAKEINT[];  // decimal integer
integer = binary | octal | hexnum | decimal;

In the above we have used a syntax equation with token alternatives having the radix character appearing at the end of the numeric string. Token rules automatically back up to their start point; backup is a simple resetting of the input stream to a non-stacked saved point. A token equation does not long fail. The above set of rules is not significantly different than the first integer equation.

These lexerless parsing languages have proven simpler and faster than the common method of separating lexer and parser. Syntax directed lexing determines the lexical possibilities, reducing the lexical analysis required. Rather than tagging lexemes, dictionary and manipulatable objects are created that can then be directly used in generating code.

Token operators[edit]

There are two string literal forms: the single character string bounded by single quote marks, '<character>', and the multi-character string bounded by double quote marks, "<string>". Either may be used as a string. Where a single character is specifically required, the single character string is used. Dot word operators may be in upper or lower case; .any and .ANY are equivalent.

.ANY (operand: none): Matches any character, advancing the input stream. The matched character is appended to the token.

.LNBRK or .LB (operand: none): Matches the line break character sequence, skipping any leading skip_class characters. Used to match the start or end of a line. Used as the first test in a token equation it anchors the token to the first character of a line; used otherwise it matches skip_class characters up to a line break.

"<string>" or '<char>' (quoted lexeme test): The literal character string is matched against the input stream. If successful the input is advanced over the matched characters. Quoted characters, '?', and strings, "xyz", are literal character strings.

+ (operand: "<string>" or '<char>'): The string is matched against the input stream. If successful it is appended to the token and the input advanced over the matched characters.

- (operand: "<string>", '<char>', or <char class>): Fails if the string or character class matches the input stream. Input is not advanced in any case.

~ (operand: '<char>' or <char class>): Matches any character not its argument. The matched character is appended to the token.

? (operand: "<string>", '<char>', or <char class>): Succeeds if the operand matches the input stream, fails otherwise. Input is not advanced in any case.

, (operand: "<string>" or '<char>'): Appends the string to the token. ,'"' appends a " character to the token.

$ (zero-or-more loop operator): Loops the following basic construct until failure. See the basic rule in the token grammar.

-- (operand: none): Always successful. Equivalent to .EMPTY. The expression (a|--) describes a as optional.

.EMPTY (operand: none): Always successful. The last alternative of optional constructs.

Note: these operators may operate differently in syntax rules.

Backtracking, Backup, and Long Failure[edit]

program = $((declaration        // A program is a sequence of declarations
            | .EOF .STOP)       // terminated by End Of File.
           \ ERRORX["Error"]    // Backtrack and report the error,
                                // flagging the furthest parse point.
             $(-';' .ANY) ';'); // Error recovery: find a ; to continue.

Backup applies to lexeme recognition. During the matching of a lexeme, only the input stream is affected, advancing over input as characters are matched; a partial match need only reset the input stream state. A lexeme is recognized by a quoted string or token equation. On entry a lexeme test saves the input state. On failure the input state is restored to the state saved.

Backtracking is programmed using the backslash "\" alternative operator.

<left alternative>\<right alternative>

On a failure parsing the <left alternative>, the parse state is reset to its state upon starting the <left alternative> and control is transferred to the <right alternative>.

In the sequence of tests:

x y z | w;

if the first test x returns failure the alternative w is attempted. When x returns failure it is assumed the parse state is unchanged. When the first test is successful the parse continues to the next test, y above. It is then assumed the parse state has changed, and any failure must reset the parse state before an alternative can be attempted. A partial match of a sequence may have created objects and/or partially constructed a syntax tree; symbols may have been created and cataloged. Before attempting a different parse the partial parse operations must be undone. In the case of a partial match of a lexeme, the lexeme test has already restored the parse state. Note that the first test of a sequence may be an equation that can itself long fail, which requires a backtrack alternative: if a long failure occurred in x, or in any equation it called, the alternative w, being a non-backtracking alternative, would not catch it.

Failure of the first lexeme in a sequence does not require a backtrack. However, within a backtrack alternative any type of failure causes a backtrack that clears the backtrack stack frame.

Backtrack alternatives should only be used when absolutely necessary. Most programming languages do not require backtracking to recognize valid programs. Backtracking is still used in recovering from errors. The program rule above illustrates backtracking that reports the error using ERRORX and then searches for a semicolon, which should indicate the end of a declaration, as a point to continue.

Backtracking skips intermediate rules, returning directly to the rule and alternative that set the backtrack.

a = b \ c;
b = d e f | s h;
s = r t;

In the above equations, if e, f, h, or t return failure their respective equations long fail to the backtrack alternative c. On the other hand, if r returns failure to s, the equation b returns failure, as s is the first test of b's last alternative. The failure return of b to a's backtracked alternative clears the backtrack and transfers control to the alternative c. That is a normal failure return that has not advanced the parse, as s in turn, not having an alternative, returns failure. In a series of tests a failure of any test other than the first invokes a backtrack, returning control directly to the most recent backtrack alternative. I.e. when t fails, s does not fail, nor does b fail; control is returned directly to the backtrack alternative c of a, the parse state reset to the state before a's left alternative was called. In SLIC only the left alternative of a backtrack alternative was backtracked. It is unclear whether CWIC backtracked the right alternative. My thinking was that a backtrack alternative of .FAIL could be coded, for in a sequence of backtracking alternatives:

a\b\c\d

The left alternatives a, b, and c are backtracked. The last alternative d is not. But one could force d to be backtracked:

a\b\c\d\.FAIL

Backtrack and non-backtrack alternatives may not be mixed in a series. The invalid series below would be flagged in error at the \ alternative operator.

a/b/c\d

It is ambiguous as to what is to be backtracked. We need to specify exactly the backtracking. Grouping may be used:

a/b/(c\d)

or

(a/b/c)\d

This programmed control of backtracking is quite different than other parsers. My thinking on not backtracking the right alternative is that in most cases it is used for error reporting, and backtracking is an expensive time and resource operation. It was also simpler in reporting errors flagging the furthest parse point.

On a backtrack failure, tree transformations and pending dictionary changes are released. Code, once produced, cannot be backtracked and will cause a compiler error. Programmed control of backtracking, factoring, and first-match alternatives eliminate the exponential time consumption associated with backtracking parser algorithms. Backtracking is commonly only used in error recovery.

In the alternative sequence:

a \ b \ c

a and b are backtracked. If a fails b is always tried. Likewise if b fails c is always attempted. c is not backtracked: a long failure within it would fail to a previous backtrack point, or would result in a compiler backtrack error being issued. A backtrack alternative creates a backtrack stack frame before beginning its left alternative. The backtrack stack frame holds the parse state, enabling the parse state to be reset. A backtrack failure results in a compiler backtrack error being issued when no user-set backtrack alternative exists.

Backtracks may be nested, each stack frame linked to the previous backtrack stack frame. A backtrack failure loads the stack pointer with the top backtrack frame and returns failure. This is similar to a long jump in C. Backtracking resets the call stack, node stack and parse stack, releases created objects, and returns to the backtrack alternative. An initial system backtrack frame catches a long failure when no programmed backtrack is in effect, resulting in the compiler backtrack error described above.
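
The mechanism can be modeled with C's setjmp/longjmp, which the long jump comparison suggests. This is only an illustrative model; CWIC and SLIC built the frames directly on the hardware call stack:

#include <setjmp.h>
#include <stdio.h>

typedef struct Frame {
    jmp_buf       env;    /* where control resumes on long failure  */
    struct Frame *prev;   /* link to the enclosing backtrack frame  */
    /* the saved parse state (input position, node and parse stack
       depths, dictionary marks) would also live here               */
} Frame;

static Frame *top;        /* most recent backtrack frame */

static void long_fail(void) {
    if (top) longjmp(top->env, 1);                 /* unwind to top frame */
    fprintf(stderr, "compiler backtrack error\n"); /* system frame caught */
}

/* left \ right : run left under a new frame; on failure try right. */
static int backtrack_alt(int (*left)(void), int (*right)(void)) {
    Frame f;
    f.prev = top; top = &f;
    if (setjmp(f.env) == 0) {        /* normal path: try the left side    */
        if (left()) { top = f.prev; return 1; }
        top = f.prev;                /* plain failure: nothing to undo    */
        return right();
    }
    top = f.prev;                    /* long failure: restore parse state */
    return right();
}

static int fails_deep(void) { long_fail(); return 0; }  /* like e or f failing */
static int recover(void)    { return 1; }

int main(void) {
    printf("%d\n", backtrack_alt(fails_deep, recover));  /* prints 1 */
    return 0;
}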

Understanding long failure is key to understanding backtracking. In any sequence of tests, a failure of any test other than the first results in a long fail: the top backtrack's stack frame address is loaded to the call stack and failure returned. The backtrack code takes control, resetting the parse state to the state saved in the backtrack stack frame. A backtrack alternative allows its left alternative to be backtracked by catching any long failure occurring locally or in a called equation. The first test may return failure, which returns to the backtrack process; but since the parse state has not advanced, backtracking simply clears the backtrack stack frame and transfers control to the alternative. Backtrack implementation is not all that complicated to an assembly language programmer; one must understand the hardware call stack and how function stack frames are created.

Look Ahead[edit]

Look ahead is a very powerful feature that depends on backtracking. It can:

  • test ambiguous sequences and select the correct path.
  • be used to recover from errors, skipping over unrecognizable elements.
  • recognize bounded sequences such as string literals having embedded bounding characters.

Look ahead is programmed using the "-" negative and "?" positive operators.

$(-";" .ANY) ";"

The common error recovery expression above searches for a ";": matching zero or more ($) of any character that is not a semicolon, (-";" .ANY), and then matching the ; character.
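
In C the same scan might look like the sketch below, over a NUL-terminated demo buffer standing in for the input stream:

#include <stdio.h>

/* $(-';' .ANY) ';'  : consume characters up to and including the next ';'.
   .ANY fails only at end of input, represented here by the NUL byte. */
static int find_semicolon(const char **ip) {
    while (**ip != ';') {
        if (**ip == '\0') return 0;   /* end of input: .ANY fails    */
        (*ip)++;                      /* .ANY consumes one character */
    }
    (*ip)++;                          /* match the ';' itself        */
    return 1;
}

int main(void) {
    const char *p = "junk to skip; next";
    printf("%d '%s'\n", find_semicolon(&p), p);  /* prints: 1 ' next' */
    return 0;
}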

This programmed control of backtracking alone distinguishes these as programming languages.

Comparison to Parsing Expression grammars[edit]

Parser programming languages are similar to PEGs or TDPL. However META II preceded TDPLs by a decade. Ford's paper describes many features that are equivalent to those of Schorre parser programming languages, except that parser programming languages include programming constructs directing semantic transformations to assembly code or abstract syntax trees. Almost everything that distinguishes PEGs as top-down analytical grammars suited for defining programming languages is descriptive of the technology developed decades earlier by Dewey Val Schorre. The operational interpretation of PEG expressions:

Each nonterminal in a parsing expression grammar essentially represents a parsing function in a recursive descent parser, and the corresponding parsing expression represents the "code" comprising the function. Each parsing function conceptually takes an input string as its argument, and yields one of the following results:

  • success, in which the function may optionally move forward consuming one or more characters of the input.
  • failure, in which case no input is consumed.

is the same as META II:

The method of writing compilers which is given in detail in the paper may be explained briefly as follows. Each syntax equation is translated into a recursive subroutine which tests the input string for a particular phrase structure, and deletes it if found. -Dewey Val Schorre[1]

There are many features equivalent to those listed in the Parsing expression grammar topic:

Given any existing parsing expressions e, e1, and e2, a new parsing expression can be constructed using the following PEG or Schorre language operators:
PEG        META II        CWIC                Operation
e1 e2      e1 e2          e1 e2               Sequence: e1 followed by e2 (implied AND)
N.A.       e1/e2          e1|e2               Non-backtracking ordered choice
e1/e2      N.A.           e1\e2               Backtracking ordered choice
e*         $e             $e                  Zero-or-more sequence operator
e+         e $e           e $e                One-or-more
e?         e/.EMPTY       e/.EMPTY            Optional
&e         N.A.           ?e                  And-predicate (look ahead)
!e         N.A.           -e                  Not-predicate (look ahead)
( ... )    ( ... )        ( ... )             Grouping
N.A.       N.A.           $(e1 .BREAK / e2)   Zero-or-more with forced termination
N.A.       N.A.           .FAIL               Forced failure
N.A.       N.A.           .ANY                Recognize any character
N.A.       N.A.           +[ e ]+             Make list
N.A.       N.A.           :<node>             Push node
N.A.       N.A.           !<number>           Form tree
N.A.       N.A.           g[<args>]           Call functions in other languages
N.A.       .OUT( ... )    < exp >             Instruction output
N.A.       .LABEL()       %G1:                Label output

Backtracking is not a programmed feature of PEGs or of a parser generator's input. Explicitly programmed backtrack points and long failure provide far more efficient backtracking.

Comparing PEGs to Schorre parser programming languages one can see the similarities. CWIC extends the parsing features, adding token rules, a backtracking alternative operator, loop break, forced failure, and calling functions in other languages. Using SLIC, separately compiled modules are linked, allowing common language features to be shared across compilers and languages. What happened that these older languages are more advanced than later TDPLs or PEGs?

Phrase Structure/Constituency Grammar[edit]

In these languages, the parser is programmed using test functions resembling phrase structure/constituency grammar rules.

<constituent_name> = <test conditional expression>;

A rule assigns a unique name to a constituent (test) function defined using a conditional expression. That function may be called by referencing its name to recognize the constituent.

All these languages have transforming operations included in the parsing language. META II used .OUT and .LABEL to write intermediate virtual stack machine assembly code to a file, using * to output the most recent token recognized by the rule. Scannerless parsing is employed in all Schorre based compiler writing systems, initially using built-in token recognizing functions: .ID, .NUMBER, and .STRING. In META II, * references the most recent token recognized by the equation it is used in. *1 and *2 create and/or reference generated symbols local to the equation in which they are used. The most recently recognized .ID, .NUMBER or .STRING token in the rule may be referenced by using * as an argument of a .OUT or .LABEL production.

Assignment statement coded in META II:
ASIGN = .ID '=' EXPR ';' .OUT('STORE ' *);
EXPR = TERM $('+' TERM .OUT('ADD') / '-' TERM .OUT('SUB'));
TERM = FACTOR $('*' FACTOR .OUT('MPY') / '/' FACTOR .OUT('DIV'));
FACTOR = .NUM .OUT('LDL ' *)/.ID .OUT('LD ' *)/'(' EXPR ')';

In the ASIGN rule above, the * in the .OUT('STORE ' *) production references the recognized .ID in the ASIGN rule. * references the most recent locally recognized token. META II was simple, generating code for a stack machine:

t = a+3*b-2;
processed by the above rules generates:
Label  Operation  Operand   // Stack
       LD         a         // a
       LDL        3         // 3,a
       LD         b         // b,3,a
       MPY                  // 3×b,a
       ADD                  // a+3×b
       LDL        2         // 2,a+3×b
       SUB                  // a+3×b-2
       STORE      t

Being recursive descent parsers, the direct generation of code for stack machines is simple. However most computers are not stack machine architectures. In 1968 TREE-META made a notable change, adding code generation unparse rules and changing the syntax equations to transform input source code into abstract syntax tree structures that are passed to unparse rules. :<node>[<number of branches>] creates a <node> tree having the specified number of branches. <node> is a node identifier string.

Assignment statement in TREE-META
ASIGN = .ID '=' EXPR:ASGN[2];
EXPR = TERM $( '+' TERM:ADD[2]
              /'-' TERM:SUB[2]
             );
TERM = FACTOR $('*'FACTOR:MPY[2]
               /'/'FACTOR:DIV[2]
               );
FACTOR = '(' EXPR ')'/.NUM/.ID;

The :ASGN[2] creating a tree of the form:

   ASGN
  /    \
ID      EXPR

The EXPR rule in turn creates ADD[2] or SUB[2] trees:

     ADD            SUB
    /   \          /   \
TERM     TERM  TERM     TERM

Likewise TERM creates MPY[2] and DIV[2] trees.

       MPY                DIV
      /   \              /   \
FACTOR     FACTOR  FACTOR     FACTOR
t = a+3*b-2
generates:
  ASGN
 /    \
t      SUB
      /   \
   ADD     2
  /   \
 a     MPY
      /   \
     3     b

At about the same time, with Schorre on the development team, CWIC did the same, only with a far more advanced code generator language based on LISP 2. The CWIC syntax language gained token and character class rules, positive and negative look ahead, and programmed control of backtracking. !<number> is used to create trees similar to the [<number>] operation in TREE-META. However node creation and tree forming are separate stack operations. CWIC uses a parse stack and a node stack. Nodes are created and pushed on the node stack. Tokens are recognized, created and pushed on the parse stack by token equations. !<number> pops the top node stack entry and <number> of parse stack entries, combining them into a list whose first element is the popped node stack entry. The ASIGN, EXPR and TERM rules below show the tree building.
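
Before the CWIC rules below, the two-stack mechanics can be sketched in C. The object layout is invented for the illustration; the real objects are LISP 2 list cells:

#include <stdio.h>

typedef struct Obj { const char *tag; struct Obj *kids[3]; int n; } Obj;

static Obj *nodestk[32];  static int ntop;   /* node stack  */
static Obj *parsestk[32]; static int ptop;   /* parse stack */

static Obj *atom(const char *s) {
    static Obj pool[64]; static int used;
    Obj *o = &pool[used++]; o->tag = s; o->n = 0; return o;
}
static void push_node(const char *s) { nodestk[ntop++] = atom(s); }  /* :ADD */
static void push_parse(Obj *o)       { parsestk[ptop++] = o; }

static void form_tree(int n) {                 /* !n                   */
    Obj *t = nodestk[--ntop];                  /* pop the node         */
    t->n = n;
    for (int i = n - 1; i >= 0; i--)           /* pop n branches       */
        t->kids[i] = parsestk[--ptop];
    push_parse(t);                             /* push the tree        */
}

int main(void) {      /* builds ADD[a,b], as from  a ('+':ADD) b !2    */
    push_parse(atom("a"));
    push_node("ADD");
    push_parse(atom("b"));
    form_tree(2);
    printf("%s[%s,%s]\n", parsestk[0]->tag,
           parsestk[0]->kids[0]->tag, parsestk[0]->kids[1]->tag);
    return 0;
}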

Assignment statement in CWIC
ASIGN = ID '=' EXPR:ASGN!2;
EXPR=TERM $(('+':ADD/'-':SUB)TERM!2);
TERM=FACTOR $(('*':MPY/'/':DIV)FACTOR!2);
FACTOR = NUM/ID/'(' EXPR ')';
DGT: '0'/'1'/'2'/'3'/'4'/'5'/'6'/'7'/'8'/'9';
UPR: 'A'/'B'/'C'/'D'/'E'/'F'/'G'/'H'/'I'/
     'J'/'K'/'L'/'M'/'N'/'O'/'P'/'Q'/'R'/
     'S'/'T'/'U'/'V'/'W'/'X'/'Y'/'Z' ;
LWR: 'a'/'b'/'c'/'d'/'e'/'f'/'g'/'h'/'i'/
     'j'/'k'/'l'/'m'/'n'/'o'/'p'/'q'/'r'/
     's'/'t'/'u'/'v'/'w'/'x'/'y'/'z';
LET: UPR/LWR;
ALPHANUM:  LET/DGT;
NUM .. DGT $DGT MAKENUM[];
ID  .. LET $(ALPHANUM/+'_');

The ".." token rules, ID and NUM, create token object and push them on the parse stack. The string equation in CWIC is a prime example of CWIC's analytical grammer recognition power. CWIC used the ' mark to delineate strings. Two ' marks within a string are used to represent a single ' mark. The string token equation uses strings, conforming to itself:

string .. '''' $(-'''' .ANY/ '''''','''') '''' MAKESTR[];

The above string parsing equation recognizes a string token. A .. token making equation skips leading skip_class characters looking for the first character match; once the first match is made, skipping stops. Characters recognized by .ANY and character classes are copied to the token. Quoted strings in token equations are normally only recognized; they do not become part of the token. Operators preceding a string modify its function. A , preceding a string is not a test; the string is unconditionally inserted into the token. A - is a negative look ahead: -'''' tests for a single quote character, returning failure if matched and success when not matched; the parse is not advanced in either case. ? is a look ahead evaluating to the test result but not advancing over any characters regardless of the result. '''''','''' tests for two ' characters together and if successful inserts a single ' into the token: 'don''t' becomes the string don't. Two single quotes '' within a string are interpreted as one single quote. A string literal preceded by a +, if matched, is copied to the token. ,<string> inserts the <string> into the token. In '=' syntax equations, + and , preceding a string create string tokens, pushing them on the parse stack.
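
The doubled-quote convention the rule implements can be sketched in C; scan_string is a hypothetical stand-in for the compiled token test combined with MAKESTR[]:

#include <stdio.h>

/* Copies a string body from src (positioned just after the opening ')
   into out. Two ' characters stand for one '; a lone ' ends the string.
   Returns 1 on success, 0 if the closing ' is missing. */
static int scan_string(const char *src, char *out) {
    while (*src) {
        if (src[0] == '\'') {
            if (src[1] == '\'') { *out++ = '\''; src += 2; continue; } /* '' -> ' */
            *out = '\0';
            return 1;                       /* closing quote found */
        }
        *out++ = *src++;                    /* -'''' .ANY: keep the character */
    }
    return 0;                               /* ran off the end of the input */
}

int main(void) {
    char buf[32];
    if (scan_string("don''t'", buf))
        printf("%s\n", buf);                /* prints: don't */
    return 0;
}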

SLIC is based on CWIC, extending it with features enabling the description of machine instruction set architectures and separating target machine code generation into separate phases. The parsing language changes added single character string literals:

char .. ''' .ANY ''' MAKESTR[];
string .. (''' .ANY ''' | '"' $(-'"' .ANY/ """""",'"') '"') MAKESTR[];

The " character being used to delimit strngs and the ' used for single character strings. A char string was then used in character class equations. CWIC and SLIC have a built in SKIP_CLASS character class. The CWIC skip_class could be changed by redefining it in the runtime library. In SLIC the default SKIP_CLASS could be overridden by simply defining a SKIP_CLASS.

.ANY may be considered a character class, as its result is equivalent; the implementation is different. It simply advances the parse one character, failing only at end of file.

Development of these compiler writing systems stopped in the mid 1970s, or later work was not publicly available.

I first learned about the technology at an ACM SIGPLAN meeting in 1969, where Erwin Book gave a talk on the CWIC compiler writing system. There is a paper on CWIC available from the ACM digital library dated 1970. Erwin brought a few programming manuals to the meeting and I got a set. At that time I was employed at Cerritos College, where I later wrote SLIC.

Schorre META language compilers are classical examples of syntax-directed translation. A compiler's source code language specification and syntactic analysis are programmed using boolean-valued parsing equations that appear to be phrase structure grammar rules, each specifying a linguistic constituent test against the input source character stream. With their boolean algebraic base, ambiguous language constructs are combined and common ambiguous sub-expressions factored out, eliminating the need for backtracking in most cases. Backtracking was in the original design of CWIC. Early Schorre metalanguages, not having backtracking, relied completely on factoring to eliminate ambiguous constituents. All have backup, used in their lexical analysis functions. Backup only saves the input state, where backtracking saves the parse state. Backtracking is also limited to parsing: generating code within an active backtrack causes a compiler error. Note that generators may be called to perform tests.

In the old metacompiler documents, a metacompiler is defined as "a computer program whose input is a metalanguage program" or as "taking as input a computer program written in a metalanguage". An effort is made here to use modern terminology. Metasyntax, for example, is today a more precise term for the compiler-writing programming language. So today we would say a metacompiler takes as input source code written in a metasyntax programming language. The old metacompilers used $ as a prefix zero-or-more sequence operator. For example:

'(' arg $(',' arg) ')'

would express a parenthesized argument list. Today the Kleene star * notation is commonly used:

'(' arg (',' arg)* ')'

In order to understand these compiler writing systems one must first understand them as programming languages.

An important distinction is that many compiler compilers now take as input only the syntax of a language and output a parser, usually in the form of a table; an example is yacc, which generates a LALR parser. A grammar or syntax is a metalanguage, but not necessarily a programming language. Metacompilers are commonly self-hosting compilers, written in their own metasyntax dialect. That is, they can be expressed in their own metasyntax language. The metasyntax program constitutes a complete programming language specification, syntax and semantics. A programmer reading that programming metalanguage specification can discern exactly the function of any valid metasyntax language construct.

The attempt here is to explain the programming languages. An interesting terminology point, often overlooked, is that the syntax language in many of these metacompilers' documentation is explained as syntax equations, not as syntax rules. Syntax equation is the more descriptive term, as they are in part boolean equations. These need to be recognized for what they actually are: a top-down, analytical, goal-oriented programming language.


In the advanced CWIC and SLIC metacompilers the syntax, parser programming, language is designed to simplify the following aspects of compiler construction:

1. Scanning, parsing and lexical analysis.

2. Detection and pictorial reporting of errors.

3. The recovery from errors to continue compilation.

4. Hashing and construction of dictionary or symbol table entries.

5. The construction of an intermediate Abstract Parse Tree.

Their code generation, or sequencing, languages are powerful programming languages having features designed to simplify:

1. Recognizing and optimization of special case code sequences.

2. Global optimization over the directed graph of a program.

3. Evaluation of constant expressions at compile time.

4. Elimination of common sub expressions.

5. Mixed mode arithmetic.

6. Detection of duplicate and undefined labels.

7. The generation of data and instruction code sequences.

8. Features for writing and reading tree and dictionary structures facilitating the creation of multi-pass compilers

9. Stacked dictionaries that facilitate nested block structure languages.

10. Attaching attributes to symbols, facilitating the association of symbolic, address or numeric data with symbols in the dictionary.

11. Interrogation and manipulation of symbol attributes.

SLIC improved on the code generation features and strategy:

1. A high-level assembly defining language that simplified code generation and improved the readability of the code generated.

2. A PSEUDO language used to define PSEUDO, macro-like, instructions. The generator language produces PSEUDO instructions that, when executed, output machine code.

Parser program goal[edit]

Note that a formula is a programmed parser function in the parser language.

The main goal is to match a complete program module starting with a program defining formula. The starting formula drives the parser, describing a complete program module.

The name of a parser function appears at the start of a formula. The = symbol is most commonly used in defining a parsing formula; == is used in defining a backtracking formula. CWIC added token formulae using the .. operator. Token formulae may not call other formulae. Character classes referenced by token formulae generate in-line tests. The right-hand side of a formula indicates which entities, and in what order, are to be recognised to achieve its goal. The function achieves its goal by looking for these entities, from left to right, as individual subgoals. These subgoals are themselves literal lexeme tests or other formulae. The process of recognising a complete program becomes a matter of looking for progressively smaller entities, right down to the level at which basic lexemes are recognised. It is a reductive analysis starting with a top driving parser formula, reducing complex language structures into simpler and simpler structures, down to atomic elements.

Token recognition[edit]

Today we commonly use a separate lexer system. The parser programming languages instead use token recognizing functions to recognize character sequences and create tokens. In the Schorre META languages tokens are objects created by token functions.

META II has built-in token functions, .ID, .NUM and .STRING, that recognize tokens following ALGOL 60 specifications. CWIC and SLIC use ".." token formulae.

ID .. LET $(ALPHANUM|+'_');

An identifier is a sequence of characters. The starting character must be a LET. The rest may be a LET, DGT or underscore.

ALPHANUM: LET|DGT;

ALPHANUM is defined as a character class, designated by the :. To the right of the :, ALPHANUM is defined to be a LET or DGT. Defining a character class using other classes makes for a more efficient class test. The ID formula defined as:

ID .. LET $(LET|DGT|+'_');

compiles to three separate tests in the $ loop sequence: LET, DGT and +'_' are separate tests. We can instead define:

SYMCHR: LET|DGT|'_';

and define:

ID .. LET $SYMCHR;

SYMCHR is a single test. Literally, SYMCHR is a bit mask ANDed with a character's class map entry.

An ID (identifier) is a LET followed by a sequence of zero or more ($) symbol characters. The ID equation is compiled into a test function. Leading skip_class characters are skipped until a LET is matched. If a LET is not matched the input stream is reset to the function's entry state and failure returned. With a LET successfully matched, the zero or more loop continues while a LET, DGT, or '_' continues to be matched.

It is believed the parsing language was initially modelled after ALGOL 60's BNF metalanguage. Whatever the case, there needed to be a delineation between matching a token and matching a language structure containing tokens. Tokens are symbols, variables, numbers, string literals and such. CWIC and SLIC token equations use string literals and character classes to recognize character sequences. Recognized characters are gathered and used to create a token object. Character classes define a name to be a class of characters. The character class defining operator : signifies a character class definition. A .. defines a token formula. = and == define syntax equations.


The main goal is to match or parse the complete program with the starting syntax equation: i.e.

program = $decl;

This program equation's objective is reduced to recognizing, or parsing, a sequence of decl sub-goals, denoted by the zero or more operator $. The goal defining operator "=" indicates a language structure goal. A program is now defined as a sequence of decl. We need to define decl.

decl = id "==" syntax ";" :BCKTRK !2 code_gen(*1)\
       id "="  syntax ";" :SYNTAX !2 code_gen(*1)\
       id ".." token  ";" :TOKEN  !2 code_gen(*1)\
       id ":"  chrcls ";" :CLASS  !2 code_gen(*1)\
       id +[genr $genr]+ :GENR   !2 code_gen(*1);

The decl syntax equation again breaks the decl goal down into sub-goals, all of which start with id. Of note is the use of the \ alternative. The \ alternative is used to set a backtrack point. This is necessary as all alternatives start with id. In this equation it is not hard to eliminate backtracking: backtracking may be eliminated by factoring out id:

decl = id ("==" syntax ";" :BCKTRK !2 code_gen(*1)|
           "="  syntax ";" :SYNTAX !2 code_gen(*1)|
           ".." token  ";" :TOKEN  !2 code_gen(*1)|
           ":"  chrcls ";" :CLASS  !2 code_gen(*1)|
           +[genr $genr]+  :GENR   !2 code_gen(*1));

Factoring out id we have a sequence of two sub-goals: id followed by a goal of grouped alternatives. Further factoring out of common terms reduces generated code. It is a good exercise doing this. The compiler could do these same reductions on the APT. Personally I find the reduced form easier to follow; condensed in size, it is easier to take in the full equation.

decl = id ((("==":BCKTRK|"=":SYNTAX) syntax)
           |"..":TOKEN token
           |":":CLASS  chrcls
           ) ";"
          |+[genr $genr]+:GENR
          )!2 code_gen(*1);

It defines four types of syntax equations and a code generator declaration, all starting with the factored out id. The sequence *1 is used in passing a transformed tree or list object to code generators. A code generator function is a named set of unparse-action pairs:

genr = "(" +[unparse $("," unparse)]+ ")" "=>" 
           +[( action ";"
             |"{" $(action (";" | ?"}" .break)
                  vectordefs
              "}" 
            ]+ ;

Generator functions are called from syntax equations. The reduction continues, expressing each sub-goal in terms of sub-goals, down to tokens expressed as character sequences. The main or starting goal is defining a valid program.

In the decl equation there are five sub-goals, all starting with id as their first goal. The decl equation is an example of avoiding ambiguous grammar sequences by grouping and factoring. The leading id sub-goal is factored off, leaving a grouped sub-goal of alternatives that began with id, allowing a continuous left-to-right progression without backtracking.

A COBOL program driving equation defines the COBOL division order:

program = IDENTIFICATION_Division
          ENVIRONMENT_Division
          DATA_Division
          PROCEDURE_Division;

Each division equation would in turn define the division's specific syntax. Each division is a parser goal to be recognized in the order specified.

Syntax formulae are test functions written in the form of analytical grammar rules. They are also boolean equations that evaluate to the boolean states success (true) and failure (false). Each syntax equation is compiled into executable machine code instructions. Test status (success or failure) is commonly returned in the processor's zero status flag. Instructions set and clear status flags as a result of their operation. A compare instruction is essentially a subtract operation that only affects the status flags. The zero flag indicates the equality of two entities; beq and bne, branch equal and branch not equal, test the zero flag and transfer control depending on the flag's state. The equal condition indicates success; not equal is failure.

The recogniser achieves its main goal by looking for these subgoal entities, from left to right, as individual goals. These subgoals are themselves defined in terms of other entities, and so the process of parsing a complete program becomes a matter of looking for progressively smaller entities, right down to the level at which characters are recognised. Wow, a 5GL goal oriented language from the 60s.

Tests and backtracking[edit]

The number one thing a parser does is test the input characters to determine if they conform to the language specification. Another way of looking at a syntax equation is as a test. A test is like a boolean expression evaluating to success (true) or failure (false). Thus a syntax equation is a test; a quoted string match is a test; a parenthesized syntax expression is a test. Tests are performed in the order they are coded, from left to right and down. A sequence of tests implies an AND operation: all tests must be successful or failure results. On success the input is advanced; on failure the input is not advanced. The input may be thought of as a character stream.

There are two alternative (or) operator types and two corresponding forms of failure: backup and backtrack. Backup is simple: the input stream is reset to a previous state when a token or string test fails. A backtrack failure resets the parse state to the state saved in the most recent backtrack stack frame on the call stack. Backtracking resets the call stack to the backtrack stack frame and returns failure. A backtrack failure occurs on the failure of any test other than the first in a sequence of tests, whether or not it is in a backtrack alternative. A backtrack failure will terminate the compile if there is no backtrack alternative. A system backtrack frame precedes the call to the first driving syntax equation. The system backtrack frame does not restore a saved state; it catches compiler errors and reports them as such. Such a failure is a coding error of the compiler, not of the program being compiled.

It is better to parse ambiguous language constructs by factoring them out if possible.

Below the syntax equation specifies a series of tests:

X = b C k;

For X to be successful (true), the tests b, C and k must all be successful, and they must be performed in the order written. An or is programmed using a "|" or "\" alternative operator (the equivalent of a boolean or operation). Alternatives are attempted in the order they are programmed, left to right, taking the first success.
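
In C terms, a sequence of tests behaves like a short-circuit AND and an alternative list like a short-circuit OR. A sketch with stand-in tests:

#include <stdio.h>

/* Stand-in tests; in a real parser each would advance the input. */
static int b(void) { return 1; }
static int C(void) { return 1; }
static int k(void) { return 0; }

/* X = b C k;  a sequence is a short-circuit AND, left to right. */
static int X(void) { return b() && C() && k(); }

/* An alternative list  a | w  is a short-circuit OR:  a() || w(),
   ignoring the parse state reset discussed above. */

int main(void) {
    printf("%d\n", X());   /* prints 0: k failed, so the sequence fails */
    return 0;
}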

A = ((b ';' | ".end" .break)\errorx("error") (-';' .any)* ';')*;
b = c d | e ((f g) | k);

The '\' alternative is the backtracking alternative. Used in equation A above it resets the parse state to the state before the first alternative,

(b ';' | ".end" .break)

and proceed issuing an error

errorx("error")

and scanning for a ';'

(-';' .any)* ';'

Parser Programming Language[edit]

So far only the grammar recognition features of the parser language have been described. The advanced metacompilers TREE-META, CWIC and SLIC transform the recognized language structures and tokens into abstract parse trees. CWIC and SLIC use specialized token formulae programmed to recognize tokens. Earlier metacompilers have built-in token rules, .ID, .STRING and .NUM, that parse tokens conforming to ALGOL 60 rules. They all are scannerless parsers that match tokens on the fly. In today's terminology their syntax formulae are technically a grammar language.

CWIC and SLIC are object processing languages. Objects are the primary data. A variable may hold any type of object: it may be assigned an integer in one instance, a list in another, and a string or symbol the next. A program may have many types of data, and a generalized compiler writing system has to handle any data type, including numeric data of variable precision and different formats. Underlying the CWIC and SLIC systems is a full implementation of LISP 2. Lists are used to represent tree structures. Special object types are added to these languages to facilitate the programming of compilers: node, symbol, gen_sym (generated symbol), symbol table, and code_referencer. The syntax equations have object construction operators that create nodes, lists and trees. Token rules parse and create symbols, numbers and strings. Operators in syntax equations can create string objects from literals.

The Schorre line of parser programming languages are similar to PEGs or TDPL. However META II preceded TDPLs by a decade. After reading Ford's paper, I would say the Schorre parser programming languages have a lot in common with PEGs, except that all Schorre based parser programming languages include programming constructs directing semantic transformations to assembly code or abstract syntax trees. Almost everything that distinguishes PEGs as top-down analytical grammars suited for defining programming languages is descriptive of the technology I am talking about. The operational interpretation of PEG expressions is the same:

Each nonterminal in a parsing expression grammar essentially represents a parsing function in a recursive descent parser, and the corresponding parsing expression represents the "code" comprising the function. Each parsing function conceptually takes an input string as its argument, and yields one of the following results:

  • success, in which the function may optionally move forward consuming one or more characters of the input.
  • failure, in which case no input is consumed.

The feature equivalences are tabulated in the Comparison to Parsing Expression grammars section above.

Backtracking is not a programmed feature of PEGs or of a parser generator's input. Explicitly programmed backtrack points and long failure provide far more efficient backtracking. Backtracking here is also different. A failure recognizing a lexeme causes a backup affecting only the input stream. Backtracking applies when recognized elements have to be undone. Machine code generation cannot be undone, causing a compile failure. Backtracking is programmed using the backslash "\" alternative operator. On a failure parsing the left alternative the parse state is reset to its initial state; tree transformations and pending dictionary changes are released.

These are recursive descent parser programming languages. A backtrack alternative creates a backtrack stack frame. Backtracks may be nested, each stack frame linked to the previous backtrack stack frame. A backtrack failure loads the stack pointer with the top backtrack frame and returns failure, similar to a long jump in C. Backtracking resets the call stack, node stack and parse stack, releases created objects, and returns to the backtrack alternative. An initial system backtrack frame catches a long failure when no programmed backtrack is in effect, resulting in a compiler failure.

Look ahead is a very powerful feature. It can:

  • test ambiguous sequences and select the correct path.
  • be used to recover from errors, skipping over unrecognizable elements.
  • recognize bounded sequences such as string literals.

META II, and languages based on it, came out of early work on syntax directed compilers done by members of a Los Angeles ACM SIGPLAN special interest group on syntax directed compilers. They are compiler writing languages of which a part is a parser programming language. The transformation aspects need explaining, as they depend on the reductive syntactic analysis methodology.

These are lexerless parser programming languages in which the terms lexeme and token are a bit different than described in modern compiler text books.[3] A lexeme is recognized by a quoted string literal or a token rule. All tokens are lexemes, but not all lexemes are tokens. A quoted string literal is a constant lexeme test used to recognize a specific sequence of characters in the immediate input stream; the recognized characters are not normally kept. A token rule is used to recognize a token lexeme, a variable character sequence that is kept. Where it gets confusing is that a string token rule recognizes a string token lexeme. A token is a variable lexeme; a quoted string is a constant lexeme. Constant lexemes may be significant to the input program's operation, but are usually handled by programmed transformations in the parser language. A string recognized by a string token rule is a variable character sequence that is a constant of the input source program. A character sequence recognized by a quoted string literal is a constant of the parser language. (That's about as clear as mud; hopefully it will be made clearer by examples.) Early Schorre languages used the built-in token recognizers .ID, .NUM and .STRING. Later advanced parser programming languages have token rules.

In early Schorre META parser programming languages a token rule saved the last recognized character sequence, which could later be output as part of an assembly instruction. Later generations, having LISP based generator sub-languages, create typed objects. Symbols are cataloged in a symbol dictionary. Numeric and string objects are created by calling supplied library functions. Recognized tokens are pushed on a parse stack.

These parser languages are goal oriented. The syntactic analysis starts with a main goal, the top driving rule. That rule defines the constituent makeup of a valid program module. The recogniser achieves its main goal by looking for these constituent goals, from left to right and down, as individual subgoals. These subgoals are themselves defined in terms of other constituents, and so the process of recognising a complete program module becomes a matter of looking for progressively smaller entities, right down to the level at which lexemes are recognized. This top-down syntax analysis method, commonly known as recursive descent, has advantages and disadvantages. The advanced LISP enhanced language features of CWIC and SLIC overcome many if not all disadvantages associated with recursive descent parsers. A generator function may be called anywhere a syntax rule may be called. Generator functions return success or failure status, so they affect parsing the same as calling a syntax rule. The first input to a SLIC produced compiler is the command line initiating it. The compiler parses its initiating command line, calling generator functions to set up input, output, and compile options. Some options were built into the SLIC system; program listing and debug options are part of the compiler writing system. Such options, if used, had to be programmed into the command line analysis. Supplied library functions handled built-in option setting.


Though they appear to be declarative programming languages, programming in them is more like an imperative programming language. One writes a rule in imperative mode, knowing the rules are test functions that direct the syntactic analysis of an input character stream. A test is a boolean type, or an analysis action that returns a test (status) result of success or failure. A quoted string literal is a conditional construct used to recognize a specific sequence of characters in the immediate input stream. We write syntactic analysis equations that are conditional expressions specifying language constituent tests, which are actually test functions. We are in an analysis mode of thinking, writing a program to analyze textual input. We use constituent names and quoted string literals to express constituent sequences. For example, say we wish to parse a simple algebraic expression, expressing algebraic hierarchy using algebraic terminology: expression, term, and factor. In algebra an expression is a sequence of terms separated by the sum and difference operators + and -. We start with the expression rule:

expression = term (('+'|'-') term)*;

The parsing rule expression is a test function consisting of two test sub-expressions. The first test, term, references another rule. term may return failure, resulting in expression returning failure. If term is successful the second test, a grouped Kleene star (zero or more loop) (('+'|'-') term)*, is attempted. There is an implied and connective between sequential tests: all must be successful for the sequence to be successful (true). Separating alternative sequences, the alternative operators \, /, and | express the boolean or condition. Alternatives are attempted in the order they are coded. The Kleene star loop applies to a grouped sequence of two tests:

('+'|'-') and term.

The grouped loop sequence recognizes a + or - operator followed by a term. The grouped alternative tests ('+'|'-') are attempted in order, first attempting to recognize the string '+'. On success the grouped test succeeds; on failure the alternative '-' is attempted. On failure of '-' the grouped loop's first test fails and the Kleene star zero or more loop terminates successfully. If the test ('+'|'-') succeeds, the term test function is called; on success the loop repeats. The term test returning failure, not being the first test of its sequence, indicates the sequence failing. Recovery after having successfully completed one or more tests of a sequence requires backtracking. Languages not having backtracking aborted the compile. The advanced languages used operators setting backtrack points, allowing recovery from such a state. A backtrack or long failure loads the call stack with the top backtrack stack frame and returns failure.

Above I used the more familiar Kleene star for simplicity of discussion. These languages use a prefix $ zero or more loop operator:

expression = term $(('+'|'-') term);

I know when I write term and the quoted '+' and '-' strings in expression that they are test functions or actions that succeed or fail. I know a non-backtracking alternative will catch a failure recognizing a lexeme. On entry a lexeme test saves the input state; on failure the input state is restored before returning. I use the term backup as only the input stream state is involved.
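
As a model of what such an equation compiles to, here is a C sketch of expression = term $(('+'|'-') term);, with term reduced to matching a single digit and the input held in a global cursor:

#include <stdio.h>

static const char *ip;                        /* input cursor          */

static int lit(char c) {                      /* quoted string test    */
    if (*ip == c) { ip++; return 1; }
    return 0;
}
static int term(void) {                       /* stand-in: one digit   */
    if (*ip >= '0' && *ip <= '9') { ip++; return 1; }
    return 0;
}

static int expression(void) {                 /* expression = term $(('+'|'-') term); */
    if (!term()) return 0;                    /* first test may simply fail           */
    for (;;) {                                /* $ zero-or-more loop                  */
        if (!(lit('+') || lit('-')))          /* loop's first test fails:             */
            return 1;                         /* loop ends successfully               */
        if (!term()) return 0;                /* in CWIC this would long fail         */
    }
}

int main(void) {
    ip = "1+2-3";
    printf("%d\n", expression() && *ip == '\0');   /* prints 1 */
    return 0;
}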


In order to produce code we need to transform the recognized entities into a manipulatable or executable form. META II and other early Schorre metasyntax programming languages used the output constructs .OUT(...) and .LABEL(...) to write stack machine assembly code lines to an intermediate file.

-Algebraic expression coded in META II-

EXPR = TERM $('+' TERM .OUT('ADD') /'-' TERM .OUT('SUB'));
TERM = FACTOR $('*' FACTOR .OUT('MPY')/'/' FACTOR .OUT('DIV'));
FACTOR = '(' EXPR ')'/.ID .OUT('LD ' *)/.NUM .OUT('LDL ' *);

In the .OUT intermediate file writer operation, * references the most recently recognized token. .OUT(...) writes instructions beginning in the opcode field. In FACTOR, when .ID recognizes an identifier, .OUT('LD ' *) outputs a

LD <id string just recognized>

LDL, load literal, is used when a number is recognized. The code generated is for a stack machine: the load instructions push their operand onto the stack, and the arithmetic instructions operate on the top 2 stack entries. The code generated is a reverse Polish sequence.

a+3*b-2
generates:
Label  Operation  Operand      // Stack
       LD         a            // a
       LDL        3            // 3,a
       LD         b            // b,3,a
       MPY                     // 3×b,a
       ADD                     // a+3×b
       LDL        2            // 2,a+3×b
       SUB                     // a+3×b-2
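To make the stack machine behavior concrete, here is a minimal C++ sketch of an interpreter for the instructions in the trace above, assuming integer variables; it is illustrative only, not META II's actual target machine.

#include <cstdio>
#include <map>
#include <stack>
#include <string>

std::stack<int> s;                                   // the machine's stack
std::map<std::string, int> vars = {{"a", 7}, {"b", 2}};

void LD(const std::string& id) { s.push(vars[id]); } // push variable value
void LDL(int n)                { s.push(n); }        // push literal
int  pops(int& r) { r = s.top(); s.pop(); int l = s.top(); s.pop(); return l; }
void ADD() { int r, l = pops(r); s.push(l + r); }
void SUB() { int r, l = pops(r); s.push(l - r); }
void MPY() { int r, l = pops(r); s.push(l * r); }

int main() {                      // a+3*b-2 with a=7, b=2 evaluates to 11
    LD("a"); LDL(3); LD("b"); MPY(); ADD(); LDL(2); SUB();
    std::printf("%d\n", s.top());
    return 0;
}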

.LABEL is used to write labels beginning in column one. Two generated symbols may be created local to a rule's execution. On the first reference to *1 or *2 within a rule, a generated symbol of the form A01, A02 ... B01 ... etc. is created, each symbol being unique to the rule's invocation. Using *1 or *2 in a .OUT or .LABEL outputs the generated symbol string. .LABEL(*) outputs the most recently matched token.

It is important to note that these lexerless parsing languages used special rules and quoted strings to recognize lexemes.

TREE-META separated code generation from syntax analysis, creating an abstract syntax tree using :<node>[<number>].

-Algebraic expression coded in TREE-META-

EXPR = TERM $('+' TERM:ADD[2] /'-' TERM:SUB[2]);
TERM = FACTOR $('*' FACTOR:MPY[2] /'/' FACTOR:DIV[2]);
FACTOR = '(' EXPR ')' /.ID /.NUM;

The <node>[<number>] tree forming operation creates a tree having <number> branches.

  a+3*b-2

          SUB
         /   \
       ADD    2
      /   \
     a    MPY
         /   \
        3     b

I believe TREE-META had a gen-sym strategy similar to META II. Of note is that the tree building operations combine the node and branch specification. I do not know the details of how parsed entities, nodes, and trees are implemented in TREE-META. The intermediate tree allows a more general code generation strategy, but the combined tree building operation disallowed factoring the node and branch directives. There have been different TREE-META implementations. The oldest is described in a 1967 paper. A 1974 paper claimed to have both a TREE-META compiler and a parser generator for a reduced subset of the parser language, and describes backtracking implemented using <- preceding the first alternative of an alternative sequence.

CWIC and SLIC, whose generator functions are LISP 2 dialects, have dynamic memory objects. Objects are typed and may be tested as to their type. Push-down node and parse stacks are used to build syntax trees. A tree is a list whose first element is a node object. Node objects are created and pushed on the node stack using :<node name string>. !<number> creates a tree, popping the top node stack entry and the top <number> of parse stack entries; the tree is pushed on the parse stack. Tokens are pushed on the parse stack as they are recognized. A string object may be created and pushed on the parse stack using the ,<string literal> sequence. A +<string literal> sequence recognizes a string literal, creating a string object and pushing it on the parse stack if recognized. CWIC added token and character class rules using consistent parsing language syntax. Character class rules, being the simplest, are designated by the : define operator:

CLASS = ID ':' CHAR $('|' CHAR) ';';
CHAR = ''' DCHR ''' | integer;
DCHR: 'A'|'B'|'C'|'D'|'E'|'F'|
  ''all displayable characters''

Character classes define literal character recognizers. They can be thought of as a literal define matching any character listed. A class name used in a token rule recognizes any character in the class, which becomes part of the token. Used in a syntax rule they work like a quoted string literal test, the matched character not normally being kept. Token rules are designated by the '..' rule operator.

-Algebraic expression as might be coded in CWIC or SLIC-

EXPR = TERM $(('+':ADD|'-':SUB)TERM!2);
TERM = FACTOR $(('*':MPY|'/':DIV)FACTOR!2);
FACTOR = '(' EXPR ')' | ID | INT;

DGT: '0'|'1'|'2'|'3'|'4'|'5'|'6'|'7'|'8'|'9';
UPR: 'A'|'B'|'C'|'D'|'E'|'F'|'G'|'H'|'I'|'J'|'K'|'L'|'M'|
     'N'|'O'|'P'|'Q'|'R'|'S'|'T'|'U'|'V'|'W'|'X'|'Y'|'Z';
LWR: 'a'|'b'|'c'|'d'|'e'|'f'|'g'|'h'|'i'|'j'|'k'|'l'|'m'|
     'n'|'o'|'p'|'q'|'r'|'s'|'t'|'u'|'v'|'w'|'x'|'y'|'z';

LET: UPR|LWR;

ALPHANUM: LET|DGT;

ID .. LET $(ALPHANUM|+'_');
INT .. DGT $DGT MAKENUM[];

I know when I write the EXPR equation that TERM is a test function, and I expect it to leave a term on the parse stack. The parser is usually written top down. When syntax or token rules are referenced it is usually the case that a tree is left on the parse stack. It is expected that EXPR, TERM, and FACTOR fully recognize their constituents and transform them into a tree that is left on the parse stack.

The double dot .. token rules create token objects. Intercede functions like MAKENUM[] in the INT rule intercede the default operation of cataloging a symbol; MAKENUM creates an integer numeric object that is pushed on the parse stack. When no intercede function is used, as in the ID rule, the token is assumed to be a symbol. Symbols are cataloged, locating the symbol in the dictionary or creating a dictionary entry if it does not exist. In either case a symbol object is created and pushed on the parse stack.

LET, DGT, and ALPHANUM are character class rules. In SLIC character class rules create a character class table that is indexed by a character's numeric code. The class name is defined as a bit mask that is tested (a logical and operation) against the indexed location, a non-zero result indicating membership in the class. A pre-defined skip_class matches non-printing and white space character codes. In SLIC you may override the default by defining your own skip_class character class. White space is automatically handled: token rules skip leading skip_class characters. In SLIC a token rule tries repeatedly to make a first character match; failing the first character match it tests for a skip_class match, and on matching a skip_class character the input is advanced and the first match test repeated. Once a first match is made, skip_class skipping stops and following characters continue being matched against the rule. In CWIC skip_class characters are advanced over until a non-skip_class character terminates skipping; the token rule then proceeds matching characters, so CWIC tokens may not start with a skip_class character. The SLIC behavior was originally implemented so symbols could be constrained to start in column 1 of a line. A line_break test was later implemented.

Understanding the stack manipulation, specifying parsing order, etc., is imperative mode programming.

The CWIC compiler writing system has 3 separate languages: the SYNTAX language, the GENERATOR language, and MOL-360, each having a separate compiler. The MOL (Machine Oriented Language) compiler is a separate development supplied with CWIC for writing low level compiler support code.

Abstract Parse Tree[edit]

(3•x-5)/(x+4) ->

            DIV
           /   \
        SUB     ADD
       /   \   /   \
    MPY     5 x     4
   /   \
  3     x

The APT (Abstract Parse Tree) refers to the Abstract Syntax Tree structure created in the parser languages of CWIC and SLIC.

Using two operational stacks, parsed tokens, sub-trees, and nodes are transformed into an abstract syntax tree. The APT is constructed on the parse stack from parse-stack and node-stack entries. Tokens are placed on the parse stack by token rules. The :<node> operator creates and pushes a node object on the node stack. The !<number> operator builds a list of <number>+1 elements whose first element is popped off the node stack; the top <number> of parse stack entries are popped, filling out the list, and the list is pushed on the parse stack. The transform rules below perform the arithmetic expression transformation illustrated above.

expr = term $(('+':ADD|'-':SUB) term!2);
term = factor $(('*':MPY|'/':DIV) factor!2);
factor = '(' expr ')'|number|id;

Parsing the expression:

(3•x-5)/(x+4) ->

An equivalent functional tree is built on the parse stack:

           DIV
          /   \
       SUB     ADD
      /   \   /   \
   MPY     5 x     4
  /   \
 3     x
[DIV,[SUB,[MPY,3,x],5],[ADD,x,4]]

The !<number> transform operation pops the top node on the node stack and combines it with the top <number> of parse stack entries, popping those entries and pushing the created tree onto the parse stack.

List operators +[ and ]+ are used to create a list of parsed items.

arg_list = +[ argument $(',' argument) ]+;

+[ and ]+ are grouping operators that when enclosing a parse sequence create a list of the parsed items.

The tree stacks and objects[edit]

In CWIC and later metacompilers every data item is an object dynamically allocated from heap memory.

The abstract parse tree expresses the language in the same way as an abstract syntax tree. It however is created by tree and stack operators in the parser programming language. It is further processed in the generator language.

The grammar equations of the parser language direct the abstract parse tree's construction. Transformations are performed by creating objects and pushing them on stacks. Two stacks are used in creating the abstract parse tree. The parse stack is also referred to as the *stack (star stack) because *<int> is used to reference *stack entries: *1 is the top entry, *2 the next. *1 used in a generator argument list will pop the top parse stack entry. The *<number> notation originated in the early META languages for referencing recently parsed entities and was later used in TREE-META for referencing stack entries.

Token rules recognize tokens and create token objects; the created object is pushed on the *stack. Memory management is automatic: objects are dynamically created and destroyed as needed.

Syntax rules recognize language constructs and direct the creation of list representations of trees. A tree node is an object created by the :<node> operator; the node created is pushed on the node stack. The tree construction operator !<number> creates a list containing <number>+1 objects. The first object of the list is the top node on the node stack, which is popped off and placed as the first list element. Parse stack elements are then popped, filling the list from the last position toward the first; the number of entries popped is the <number> specified. The result is that the popped entries appear in the list in the order they were recognized.
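The two-stack construction can be sketched in C++ using containers in place of CWIC's managed objects. This is only a model of the :<node> and !<number> operators, with assumed names, not an actual implementation.

#include <memory>
#include <string>
#include <vector>

struct Obj {                                    // token, node, or tree (list)
    std::string name;                           // token text or node name
    std::vector<std::shared_ptr<Obj>> branches; // empty for atomic objects
};
using Ref = std::shared_ptr<Obj>;

std::vector<Ref> parseStack;                   // the *stack
std::vector<Ref> nodeStack;                    // the node stack

void node(const std::string& n) {              // models :<node>
    nodeStack.push_back(std::make_shared<Obj>(Obj{n, {}}));
}

void tree(int n) {                             // models !<number>
    Ref t = nodeStack.back();                  // pop node: first list element
    nodeStack.pop_back();
    t->branches.assign(parseStack.end() - n, parseStack.end());
    parseStack.resize(parseStack.size() - n);  // pop the n parse stack entries
    parseStack.push_back(t);                   // push the tree on the *stack
}

For a+5 the sequence is: push a, node("ADD"), push 5, tree(2), leaving ADD[a,5] on the parse stack with the branches in recognition order.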

The abstract parse tree should express the language structure. The expr rules with tree building operators illustrate how the tree is constructed.


expr = term $(('+':ADD|'-':SUB) term!2);
term = factor $(('*':MPY|'/':DIV)factor!2);
factor = id | number | '(' expr ')';

Assuming that term will leave a single object on the parse stack, we can program the construction of a binary tree. term is called and, recognizing a valid term language construct, leaves a term tree on the *stack. expr then tests the input stream for a + or - character. String matching automatically skips leading white space characters. On matching '+' the :ADD creates an ADD node and pushes it on the node stack; alternatively '-':SUB creates and pushes a SUB node. If neither is matched, expr terminates, leaving the term tree as the top parse stack entry. If a '+' or '-' was matched, the corresponding node is on the node stack and term is again called. On successful return from term the !2 creates a 3-element tree from the top node stack entry and the top 2 parse stack (term) entries, and the $ loop repeats. If, in the loop after matching a + or -, term fails, a backtrack failure occurs. Somewhere in the call stack there must be a programmed backtrack stack frame or the long fail will cause a compiler error.

The rules above, given the expression:

(3•x-5)/(x+4)

will transform it into the abstract parse tree below:


  DIV[SUB[MPY[3,x],5],ADD[x,4]]

             DIV
            /   \
         SUB     ADD
        /   \   /   \
     MPY     5 x     4
    /   \
   3     x

The two notations represent the same tree. The first is the display form used by the debug mode of CWIC; SLIC had a dynamic interactive debugger and used the same display. The second is the common pictorial form.

Brief history[edit]

The first metacompiler, META II, compiled to a stack machine. It had no intermediate parse tree or any separate intermediate structure representing the source; recognized tokens were simply returned on the stack. It translated directly to code using the output construct '.OUT' '(' $(.STRING / .ID / .NUM) ')'. META II was used to compile a subset of the ALGOL 60 language, and it also compiled its own source code. Generating postfix stack machine code requires a specific rule structure:

expr = term $('+' term .OUT('ADD') /'-' term .OUT('SUB'));

We assume that when the ADD or SUB instruction is output, the generated code for the terms has left their evaluations on the stack. To be clear, we're talking about the stack machine's stack.

TREE-META advanced the technology, building an intermediate tree representation of language structures. The phrase structure grammar is basically the same as META II's, with the embedded output within the grammar rules replaced by embedded tree building operators. This is the same as CWIC except for the syntax. CWIC uses 3 stacks: a call stack, a node stack, and a parse stack. :<node symbol> pushes the <node symbol> onto the node stack; !<number> builds a <number>-branch tree, combining the top node with <number> parse stack entries. The TREE-META expr rule might take the form:

expr = term $('+' term :ADD[2]|'-' term :SUB[2]);

There are several unspecified versions of TREE-META. One puts the tree transforms together as above; others separate the node and tree branch operations, allowing:

expr = term $(('+':ADD/'-':SUB) term[2]);

The CWIC compiler writing system, introduced in 1970, implemented two of the phases described by Warshall and Shapiro.[* 1] CWIC is actually three separate compilers. The syntax[* 2] language compiler is the Schorre component of CWIC, used in coding the grammar analysis (parser) of a programming language and its transformation into an abstract parse tree.[* 3] The generator language, used in transforming the abstract parse tree into sequential instructions, is a refinement of the unparse rules of TREE-META using LISP 2 procedural actions. MOL-360, an independent machine oriented language originally used in the development of the ADEPT time-sharing system, was included with CWIC to implement support libraries.


META I and META II[edit]

In 1964 Dewey Val Schorre wrote META II, the first compiler compiler to use "META" in its name. There are generations of Schorre metacompilers. All take as input computer programs written in programming metalanguages and translate them into machine instructions.[* 4][* 5][* 6][* 7]

There are good sources for details of the original Schorre META I and META II. Briefly, META I was a hand-compiled version, a subset of the META II language, that was then used to compile the full version of META II.

META II is a grammar parsing language with an additional .OUT construct used to output stack machine assembly instructions. The stack machine was emulated on an IBM 1401 computer.

SLIC[edit]

Between 1969 and 1974, while a student at Cerritos College, Norwalk, Calif. and later employed there as a computer operator/systems programmer/TOPS-10 systems administrator, I wrote SLIC (System of Languages for Implementing Compilers). SLIC was inspired by CWIC (Compiler for Writing and Implementing Compilers) and A GENERAL-PURPOSE TABLE-DRIVEN COMPILER by Stephen Warshall and Robert M. Shapiro.

I first learned of CWIC at an L.A. ACM SIGPLAN meeting in 1969 where Erwin Book gave a presentation on the CWIC compiler writing system. I received a set of CWIC SYNTAX and GENERATOR language manuals from Erwin. SDC released a short paper on CWIC in 1970. CWIC is the last publicly documented compiler in the Schorre line of metacompilers. CWIC was developed at System Development Corporation; Dewey Val Schorre was a member of the CWIC development team led by Erwin Book. The CWIC system included 3 compilers: the syntax (language) compiler based on Schorre's work, the generator (language) compiler based on LISP 2, and the separately developed MOL-360 (Machine Oriented Language) compiler.

The paper A GENERAL-PURPOSE TABLE-DRIVEN COMPILER describes five translation phases:

1. Syntactic analysis: converts input strings into tree representations.

2. Generator: transforms the tree into a sequence of macro-instructions.

3. ISO: an in-sequence optimizer which recognizes macro patterns.

4. Macro expansion: transforms macros into machine code.

5. Assembly: outputs the machine code to an object code file.

SLIC expands on the code generation of CWIC, having a five phase translation strategy similar to those described in the Warshall and Shapiro paper. SLIC and CWIC are not table driven; they compile to machine code.

The phase 4 PSEUDO macro-like instructions are executable functions. PSEUDO instructions called in generator functions are not executed immediately but appended to a list and later executed.

SLIC was developed and ran on a DEC-System-10 time-sharing system running TOPS-10. It does not include a machine oriented language. Like CWIC, SLIC has domain specific languages. The syntax and generator phases are programmed in the SYNTAX and GENERATOR languages. The code produced, however, is a linked list of PSEUDO instructions, which are programmed subroutines coded in the PSEUDO language. PSEUDOs call MACHOPs to output code to the object file. MACHOPs define the target machine instructions' assembly format and transformation to binary object code, transforming their parameters into instruction fields.

During the years 1975 - 76 SLIC was used to write a COBOL cross compiler and a TI-990 cross assembler.

TREE-META[edit]

TREE-META advanced the technology by introducing parser directed tree building operators and unparse rules, separating parsing and code generation. The unparse rules also provide some tree manipulation for optimizing generated code.

CWIC[edit]

Compiler for Writing and Implementing Compilers. Developed at SDC in the late 1960s and early 1970s.

CWIC was a major technological advance:

  1. added virtually unlimited program controlled backtracking to the syntax parser language, limited only by call stack allocation or memory size.
  2. automated symbol table dictionary with assignable / testable symbol attributes.
  3. generated symbols (usually branch points in generated code).
  4. added grammar rules for defining tokens, automating dictionary entry of token symbols or creation of non-cataloged token objects.
  5. dynamic automatic memory managed objects.
  6. unparse rules using object oriented LISP 2 procedural actions.
  7. numeric object types in the action language. Integer, fixed point, and floating point numeric objects were built-in object types. Object types could easily be extended as necessary, for example BCD (Binary Coded Decimal).
  8. named sets of unparse rules called generators, allowing large languages to be processed efficiently. Generator functions called from syntax or token rules directed code generation and other compiler operations.
  9. recursive unparse rules that pass list elements to generators and assign returned objects to local variables.
  10. binary code generation into named sections. Generation of data, code, and constants could be directed to different sections. Any number of sections could be declared.

A tree structure is simply a list whose first element is a node type. ADD[x,y] is the text display form of a two branch tree. The x and y may also be tree structures. In list display form the tree would display as [ADD,x,y]. The list elements are displayed in whatever form fits their type. For example:

 ADD[ADD[MPY[5,ADD[x,3]],y],5]

5*(x+3)+y+5

The examples I use here are based on CWIC's capabilities, using the full ASCII character set. Some operator character sequences are expressed with different characters allowed by a full ASCII set: the alternative operators are | and || instead of / and \. Some changes are those made for the transition to a single compiler that recognizes the different sub-languages by their syntax. I am only going as far as describing the generator language. Instructions generated are more easily expressed and understood as machine independent pseudo code. This is a liberty I am taking to avoid using IBM 360 instructions that I do not remember and that are not really significant here.

SLIC[edit]

System of Languages for Implementing Compilers.

SLIC separated binary machine code generation out of the tree crawling generator language and extended the code generation ability to target any computer instruction set architecture. Binary code emitters are written using assembly-like instructions defined by MACHOP procedures that output binary machine code for the target machine. SLIC generates sequential pseudo instructions, appending them to lists. Pseudo instructions are not machine hardware instructions; they are procedures written in the same LISP 2 dialect action language as generators. They call MACHOPs, machine operations, to produce binary machine code. A MACHOP is defined in the MACHOP language, a simple language that defines the instruction operands and their translation into bit fields.

Pseudo-instruction procedures have fixed argument lists. In actual usage they function as a very high level assembly macro. Assembly instructions defined in the MACHOP language are used in the pseudo-instructions to emit, or produce, target machine code.

The MACHOP function translates its operands, including opcodes, to a binary sequence of bit fields. The bit fields reside in a bit addressable memory space associated with a declared section. Addressing is relative within a section. Address arithmetic and instruction alignment are coded in the MACHOP sub-language: .MORG aligns the output address to any modulo boundary. MACHOPs were also used for assembly pseudo ops, and data MACHOPs were just as easily defined. If statements could be used in MACHOPs to select instruction formats, and conditional expressions could be used in defining field values. These two additional sub-languages allowed the abstraction of the target machine into a small set of pseudo instructions that could be coded for different target systems. Many pseudo instructions are language independent, reducing the time and effort of developing different language compilers for multiple hardware targets. This makes program portability a lot more consistent.

An ISO (In Sequence Optimization) sub-language was designed to operate on pseudo lists. ISO directives could also be put into the pseudo instruction lists to alter or refine sequential optimization. The ISO phase was optional. SLIC was never fully documented publicly, so it cannot be used in wiki topics. SLIC is a full implementation of CWIC except for code generation as described.

It is important to understand that the various languages are compiled into executable code; they are not separate programs. A syntax rule calls a generator, passing it a constructed tree. The generator(s) unparse the tree into sequential pseudo code. A generator at some point will flush a code section. Flushing a section optionally runs ISOs on its pseudo instruction list. The pseudo instruction list is then executed, calling each pseudo instruction in order and deleting it. The pseudo instructions are coded to produce machine code. The instruction list is then deleted. It is possible to program multi-pass compilers if desired. A minimal sketch of this deferred-execution mechanism appears below.
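The deferred execution can be modeled in C++ with a list of callable objects: generators append pseudo calls, and flushing the section runs and deletes them. The names here are assumptions for illustration, not SLIC's actual runtime.

#include <functional>
#include <list>

struct Section {
    std::list<std::function<void()>> pseudos;  // the pseudo instruction list

    // A generator "executing" a pseudo really appends the call to the list.
    void append(std::function<void()> p) { pseudos.push_back(std::move(p)); }

    // Flushing runs each pseudo in order (each one calls MACHOPs to emit
    // binary code) and then deletes the list.
    void flush() {
        for (auto& p : pseudos) p();
        pseudos.clear();
    }
};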

Some compilers have a front end, a middle end, and a back end. SLIC's front end is the grammar language that transforms the source into an intermediate tree structure. Type checking may be performed by the token rules. The middle end sequencer (generator) language transforms the intermediate tree structure into sequential code. Being a LISP 2 dialect, it can perform many tree transforms, pruning unreachable code and performing many optimizations on the tree before and during generation of sequential code. The In Sequence Optimizer can further optimize code and insert register allocations etc. The back end pseudo expansion can further optimize machine code, choosing optimal instructions.

cc[edit]

Compiler compiler.

cc is an updated SLIC using modern syntax, with changes to the MACHOP assembly syntax specification rules making possible assembly operand description in the syntax of newer assemblers. The descriptions here express concepts identical to CWIC's, as expressed in the cc syntax. For example, in CWIC the SYNTAX language included character class, token, and syntax equations; in cc the grammar sub-language is made up of character class, token, and syntax rules. This reflects the current computer science use of syntax as describing grammar above the word or token level. Syntax changes make the various sub-languages recognizable by their construct forms, removing the need for language section declarations. The program rule is the start rule, like main in c or c++. Comments are in a notation similar to c++. Comments may still be placed after the ; termination of a grammar rule.

Programmed parser[edit]

The parser is coded using boolean functions returning success or failure. In the first metacompilers, META I and META II, code production was embedded in the parsing equations. In TREE-META, CWIC, and those that followed, a tree representation of the language is created by the grammar sub-language; the tree construction is programmed using operators. They all have in common a programmable syntax parsing language made up of equations defining language constructs.

CWIC and SLIC are the more advanced languages.

Character Classes[edit]

A character class, defined by the ":" define operator following its name, is used to define or name a set of characters. Character classes are not functions in the same sense that token and syntax equations are: they are named character group literals, coded using the same syntax for consistency. Used in a syntax rule a character class works the same as a literal match. Generally they are used to create a table indexed by a character's numeric code. A character class is basically a list of characters, each separated by the "|" alternative operator. Characters are specified by their quoted glyph or numeric code.

bin:         '0'|'1';  // quoted glyph form

A bin can be the character 0 or the character 1. The bin class may instead be defined using numeric code values for the ASCII characters 0 and 1, expressing the identical set of characters:

bin:         0H30|0H31;  // hex numeric form  ascii codes

Using bin in a syntax or token rule will match any character in the bin character set. The name of a previously defined character class includes all characters in the named class, as illustrated by the following commonly used character classes:

bin: '0'|'1';

oct: bin|'2'|'3'|'4'|'5'|'6'|'7';

dgt: oct|'8'|'9';

hex: dgt|'A'|'B'|'C'|'D'|'E'|'F'
        |'a'|'b'|'c'|'d'|'e'|'f';

upr: 'A'|'B'|'C'|'D'|'E'|'F'|'G'
    |'H'|'I'|'J'|'K'|'L'|'M'|'N'
    |'O'|'P'|'Q'|'R'|'S'|'T'|'U'
    |'V'|'W'|'X'|'Y'|'Z';

lwr: 'a'|'b'|'c'|'d'|'e'|'f'|'g'
    |'h'|'i'|'j'|'k'|'l'|'m'|'n'
    |'o'|'p'|'q'|'r'|'s'|'t'|'u'
    |'v'|'w'|'x'|'y'|'z';

let: upr|lwr;

alphanum: let | dgt;

symchr: alphanum | '_';

skip_class: 0h08 | 0h09 | 0h0A | 0h0B | 0h0C|  0h0D | ' '; 

Above, many common character classes are defined; these will be used in the following token rule examples. Character class equations generally do not generate executable code: in implementation they are not called but generate inline code. They are used in token rules to match characters of a token; characters thus matched are appended to the token string. Used in syntax rules they match a character belonging to the class, and like a literal match the character is not normally kept. The action modifiers +, -, and ? may be applied to a character class in a syntax rule. The built-in .any character match operator will match any character code, only failing if there is no character, i.e. at the end of input. The .any operator is used like and performs like a character class, though its implementation is quite different, only checking file status and failing at the end of input.

It may be worth noting that implementing the above character classes using a class map bit table requires only 8 bits. A bit is only required for a class that introduces characters of its own; one made up wholly of other classes, like let and alphanum, requires no flag bit. Each class is a bit mask:

bin 0b00000001
oct 0b00000011
dgt 0b00000111
hex 0b00001111
upr 0b00010000
lwr 0b00100000
let 0b00110000
alphanum 0b00110111
symchr 0b01110111
skip_class 0b10000000
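A C++ sketch of such a class map follows, using the mask values above; the table build covers only the classes defined in this section and is illustrative, not the actual implementation.

#include <cstdint>

enum : std::uint8_t {                 // test masks matching the table above
    bin = 0x01, oct = 0x03, dgt = 0x07, hex = 0x0F,
    upr = 0x10, lwr = 0x20, let = 0x30,
    alphanum = 0x37, symchr = 0x77, skip_class = 0x80
};

std::uint8_t classmap[256];           // one byte of class bits per char code

void build_classmap() {               // each class sets only its own bit
    for (unsigned c = '0'; c <= '1'; ++c) classmap[c] |= 0x01;  // bin
    for (unsigned c = '2'; c <= '7'; ++c) classmap[c] |= 0x02;  // oct adds
    for (unsigned c = '8'; c <= '9'; ++c) classmap[c] |= 0x04;  // dgt adds
    for (const char* p = "ABCDEFabcdef"; *p; ++p)
        classmap[(unsigned char)*p] |= 0x08;                    // hex adds
    for (unsigned c = 'A'; c <= 'Z'; ++c) classmap[c] |= 0x10;  // upr
    for (unsigned c = 'a'; c <= 'z'; ++c) classmap[c] |= 0x20;  // lwr
    classmap['_'] |= 0x40;                                      // symchr adds
    for (const char* p = "\b\t\n\v\f\r "; *p; ++p)
        classmap[(unsigned char)*p] |= 0x80;                    // skip_class
}

bool in_class(unsigned char c, std::uint8_t mask) {
    return (classmap[c] & mask) != 0; // non-zero means class membership
}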

Token Equations[edit]

Many compilers have a separate lexical pass. That is not the case here: tokens are parsed on the fly. Token equations create the atomic objects, the words, strings, and numbers, of the target language. Token equations may only use static matching; they may not call any function, except that a conversion function may be called as the last operation before terminating (function implemented language features excepted). An object is created from the characters matched and pushed on the parse stack. The default is to create a symbol object and catalog it in the current symbol table. Conversion functions may instead be called to create numeric or string objects.

Token equations reference character class rules, which generate inline tests; a character class test is not considered a function call.

Some token equation examples, part of the metasyntax language:

integer ..  ("0b"|"0B")           bin $bin MAKEBIN[]  // binary integer
           |("0o"|"0O")           oct $oct MAKEOCT[]  // octal integer
           |("0x"|"0X"|"0h"|"0H") hex $hex MAKEHEX[]  // hex integer
           |                      dgt $dgt MAKEINT[]; // decimal integer


id .. let $symchr;

string .. ('"' $(-'"' .any | """""" ,"""") '"'
          |''' .any ''') makestr[];

char .. ''' .any ''' makestr[];

Token operators are described in the table below.

With lexical processing done on the fly, the skip_class character class defines white space and non-displaying characters that are skipped when looking for a token. skip_class skipping is automatic until a character is matched by the token equation; no white space is allowed between successive character matches unless specifically expressed in the token equation. Normally a successful token equation creates a symbol object, resulting in a dictionary token. Symbols are automatically cataloged in the dictionary and the dictionary object placed on the parse stack. Every token matched by a token equation is automatically pushed on the parse stack. Functions that intercede the normal symbol processing can be used to create other object types.

Below is the metacompiler's integer token equation. The leading radix code strings preceding the numeric string determine the valid characters, and the interceding conversion function creates an integer object. Alternatives parse the various radix strings. A binary integer starts with 0b or 0B followed by one or more bin characters; 0b100101 is a valid sequence defined by the first alternative. The character class bin was defined in the character class section above.

integer .. ("0b"|"0B") bin $bin           MAKEBIN[]  // binary integer
          |("0o"|"0O") oct $oct           MAKEOCT[]  // octal integer
          |("0x"|"0X"|"0h"|"0H") hex $hex MAKEHEX[]  // hex integer
          |dgt $dgt                       MAKEINT[]; // decimal integer

The IA86 assembly code for this rule is shown in a later section.

The integer equation above has alternatives separated by the '|' character. ("0x"|"0X"|"0h"|"0H") defines the strings identifying a hexadecimal integer: 0x or 0X or 0h or 0H precedes a hex, base 16, number. The use of MAKEBIN[], MAKEOCT[], MAKEHEX[], and MAKEINT[] intercedes default dictionary entry, converting the sequence of characters matched into an integer object and pushing the object onto the parse stack. A conversion function may only be used at the end, as the last entry, in an outermost alternative of a token equation. The integer equation illustrates how the programmer controls parsing and the placement of conversion functions. A decimal number having no radix prefix is the last alternative. Alternatives are tried in order, the first match satisfying the rule. The programmer is responsible for avoiding ambiguities. Strings matched in token equations, "0b" for example, are not kept as part of the token. A +<string> may be used when it is desired that a string be kept as part of the token, and a string may be inserted into the token using the insert ,<string> operator.

Interceding functions are generator functions that return success or failure. If it were required to have a radix specifier at the end of an integer it would be a simple matter to parse all possible numeric characters and have the conversion function check for valid radix characters.
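As a sketch, an interceding conversion function in the spirit of MAKEINT[] can be modeled in C++ as converting the matched token characters into an integer object and pushing it on the parse stack; tokenBuf and parseStack here are assumed, simplified stand-ins, not the actual runtime.

#include <cstdlib>
#include <vector>

char tokenBuf[64];                    // characters matched by the token rule
std::vector<long> parseStack;         // simplified *stack of integer objects

bool MAKEINT() {                      // interceding conversion function
    long v = std::strtol(tokenBuf, nullptr, 10);  // decimal conversion
    parseStack.push_back(v);          // push the integer object; the default
    return true;                      // symbol cataloging is bypassed
}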

Backup is built into token rules: the input state is saved on entry and restored on failure, or discarded on success.

With a programmable parser we are not confined to token equation rules. A higher level grammar rule could be used:

binary  .. bin $bin ("b"|"B")         MAKEBIN[];  // binary integer
octal   .. oct $oct ("o"|"O")         MAKEOCT[];  // octal integer
hexnum  .. hex $hex ("x"|"X"|"h"|"H") MAKEHEX[];  // hex integer
decimal .. dgt $dgt                   MAKEINT[];  // decimal integer
integer = binary | octal | hexnum | decimal;

In the above we have used a syntax equation with token alternatives, the radix character appearing at the end of the numeric string. Token rules automatically back up to their start point, a simple resetting of the input stream to a saved point that is not stacked. A token equation does not long fail, although the first-fail rule does apply in that no partial-success backup within the rule is allowed: the formula simply fails.

id .. let $(alphanum |+'_' alphanum);

If, in the alternative +'_' alphanum, the '_' test is successful and alphanum fails, the formula will reset the input stream to its state on entry and return failure. An id may not end with an underscore character.

Token operators[edit]

There are two string literal forms: the single character string bounded by single quote marks, '<character>', and the multi-character string bounded by double quote marks, "<string>". Either may be used as a string. Where a single character is specifically required the single character form is used. Dot word operators may be in upper or lower case: .any and .ANY are equivalent.

OPERATOR        OPERAND                FUNCTION

.any            none                   Matches any character, advancing the
                                       input stream. Appends the matched
                                       character to the token. Fails at end
                                       of input.

.LNBRK or .LB   none                   Matches the line break character
                                       sequence, skipping any leading
                                       skip_class characters. Used to match
                                       the start or end of a line. As the
                                       first test of a token equation it
                                       anchors the token to the first
                                       character of a line; used otherwise
                                       it matches skip_class characters up
                                       to a line break.

quoted lexeme   "<string>" '<char>'    The literal character string is
test                                   matched against the input stream. If
                                       successful the input is advanced
                                       over the matched characters. Quoted
                                       characters '?' and strings "xyz" are
                                       literal character strings.

+               "<string>" '<char>'    The string or char is matched
                                       against the input stream. If
                                       successful it is appended to the
                                       token and the input advanced over
                                       the matched characters.

-               "<string>" '<char>'    Fails if the string or character
                <char class>           class matches the input stream
                                       character. Input is not advanced in
                                       any case.

~               '<char>'               Matches any character that is not
                                       its argument. The matched character
                                       is appended to the token.

?               "<string>" '<char>'    Succeeds if the string matches the
                                       input stream, fails otherwise. Input
                                       is not advanced in any case.

,               "<string>" '<char>'    Comma appends the string to the
                                       token. ,'"' appends a " character to
                                       the token.

$               ...                    Zero or more loop operator. Loops
                                       over the following basic construct
                                       until failure. See the basic rule in
                                       the token grammar.

--              none                   Always successful. Equivalent to
                                       .EMPTY. The expression (a|--)
                                       describes a as optional.

.EMPTY          none                   Always successful. Used as the last
                                       alternative of optional constructs.

Note: these operators may operate differently in syntax rules.

Syntax formulae[edit]

Syntax formulae perform multiple functions:

  1. Syntax formulae specify the valid syntax of a language.
  2. Syntax formulae transform source code into an abstract parse tree.
  3. Syntax formulae control when generators are called to generate code.
  4. Syntax formulae detect source code errors.
  5. Syntax formulae control error recovery, skipping source code in error.

All of the above are programed in the syntax language.

Syntax rules are identified by an = or == following the rule's name.

x = a | b | c;

The above rule has three alternatives: a or b or c. The alternatives are possible language constructs that are tried in order until one is successful. However, there is a catch: with the non-backtracking alternative |, no token of a construct may have been successfully matched or the next alternative will not be tried. If a partial match occurs and then fails at some later point, a long fail occurs. A long fail is caught by a backtracking \ alternative operator or a backtrack == rule. Long failure bypasses all intervening stacked rules to the most recent backtrack alternative or rule. A partial token match is only a simple failure; it does not constitute a partial match of the construct. The same is true of a literal string match.

y = f \ g;

The above, using the backtrack \ alternative operator, will always try g. The == rule is a backtrack catching rule. The above allows f to long fail, but if g long fails it will fail to the previous backtrack point. If however we write y == f || g; and g fails, the failure will be caught by the == and y will return failure. Understanding backtracking is important to parsing efficiency and error recovery. The loop $ operator has been used in earlier examples; it also has a backtracking version. The % zero or more loop operator catches a long fail of an iteration. Unlike the $ operator, where only a failure of the first test successfully terminates the loop, within a backtracking zero or more loop any failure backtracks to the start of the failing iteration and the loop is successfully terminated. Backtracking should be used sparingly as there is overhead in saving the parser state.

Code production can not be undone; long failure over code production is a compiler error. There is an initial non-state-restoring long fail frame on the stack, before the driving program rule, that catches long fail programming errors. Backtrack operators create a long fail block on the call stack. The long fail blocks contain saved states that are restored on a failure: symbol table operations, attribute assignments, and the input stream, which must be buffered back to the oldest backtrack point. Backtracking should be used only when necessary and over short language constructs, statements for example. Error recovery for an error within a statement is a good use of backtracking; backtracking would then occur only when an error occurs in the statement. Backtracking overhead is minimal when failure does not cause backtracking of symbol table entries. A rough model of the mechanism is sketched below.
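The long fail mechanism can be loosely modeled in C++ with setjmp/longjmp standing in for the backtrack stack frames; the real system builds the frames directly on the call stack and restores far more state, so this is only an analogy.

#include <csetjmp>

static std::jmp_buf* backtrack;          // most recent backtrack frame

void long_fail() {                       // partial match failed later on
    std::longjmp(*backtrack, 1);         // unwind to the backtrack frame
}

bool alt(bool (*f)(), bool (*g)()) {     // models  y = f \ g;
    std::jmp_buf frame;
    std::jmp_buf* outer = backtrack;
    backtrack = &frame;                  // set a backtrack point
    if (setjmp(frame) == 0) {            // normal path: attempt f
        if (f()) { backtrack = outer; return true; }
    }
    // a long fail (or simple failure) of f lands here; the real system
    // also restores the input, symbol table, and attribute state
    backtrack = outer;
    return g();                          // then attempt the alternative
}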

Backtracking is an important feature of the advanced metacompilers.

An example of an expr syntax rule was used previously. It was simply the grammar part:

expr = term $(('+'|'-') term);

The abstract parse tree has been described, so it is time to combine the two parts of a syntax rule that transforms source code into tree structures:

expr = term $(('+':ADD|'-':SUB) term!2);

Now when term is recognized we know it will be transformed and left on the parse stack. When a + or - is recognized an ADD or SUB node will be created and pushed on the node stack and a term will be parsed. If successful, the !2 will create a 3-element list with the node and the terms matched by the rule:

a + 5 --> ADD[a,5]

The expression a+5 is transformed to the tree ADD[a,5].

The expr rule above creates a left-handed tree. That is, when a sequence of terms is recognized the previous tree becomes the left branch of the new tree:

a + b - c --> SUB[ADD[a,b],c]

This is a programming language, and the tree construction can be programmed to produce a right-handed tree:

expr = term (('+':ADD|'-':SUB) expr!2);

Using a right recursive rule builds a right-handed tree:

a + b - c --> ADD[a,SUB[b,c]]

Another alternative is to transform expr to a list:

expr = +[term $('+' term|'-' term:NEG!1)]+:SUM!1;

The make list operators +[ ]+ create a list of the parse stack entries created between them:

a + b - c --> SUM[a, b, NEG[c]]

Generator action code may be placed in a syntax rule using brace { } grouping, allowing examination of parse stack entries. Conditional expressions evaluate to success or failure, giving additional analytical power to the parser. They are usually used in testing symbol attributes and setting node attributes, and they add the power to recognize context sensitive languages. The classic context sensitive example is easily handled:

con_equal_a_b_c = +[$a]+ +[$b]+ +[$c]+ {length(*1)==length(*2) && length(*2)==length(*3)};

The rule will recognize any number of a followed by any number of b followed by any number of c. Each is collected into a list and pushed on the parse stack. The { action } compares the list lengths and fails if they are not equal. Even more complicated would be matching a pattern requiring an equal number of each but appearing in any mixed sequence, i.e. abcaabcbc, 3 of each. This ability wasn't meant for language recognition; it had uses in general text analysis. Tabulation of word appearances in a document might be one use. In a token rule it might test whether a symbol is defined.

Syntax rules can of course call other rules:

expr = term $(('+':ADD|'-':SUB) term!2);
term = factor $(('*':MPY|'/':DIV|'\':REM) factor!2);
factor = id | number | '(' expr ')';

The above recognizes an arithmetic expression having the normal algebraic hierarchy and transforms it into an abstract parse tree.

if_statm = "if" condition "then" statement:IF!2 ("else" statement:ELSE!2 | .EMPTY);
if_expr = "if" condition "then" expression:IF!2 "else" expression:ELSE!2;

The driving formula[edit]

Formulae are functions that perform tests on the input. A formula is itself a test that returns success or failure. The top formula, or program rule, drives the compilation. This sounds like a difficult task, but it is usually the simplest formula. We start at the top and write a formula defining what a program is in the language we intend to compile.

program = $declaration; // A program is a sequence of declarations.

Comments are delimited using //; text from the // to the end of the line is ignored.

The above driving formula simply declares a program to be a sequence of declarations. The $ operator specifies that zero or more declarations are to be parsed. If we needed to require at least one declaration the formula would be written:

program = declaration $declaration; // One or more declarations.

A more complicated driving formula for the COBOL language might look like:

program = IDENTIFICATION_Division
ENVIRONMENT_Division
DATA_Division
PROCEDURE_Division;

The formulae above are descriptions of an entire program; they directly generate code. The first program formula above is straightforward in assembly. A formula is a function that returns a condition of success or failure. The $, zero or more, operator is a programmed loop that applies to the operation following it. The code is generated within a loop construct: DO <something> WHILE <successful>. That means we need a label for the start of the loop, code for the <something>, and code to jump back to the labeled start of the loop on success. On failure of the something, the $ loop terminates, so at the end of the program rule we have code to return success.

C++ procedures are used in the examples here. Success or failure is returned in the processor status zero (equal) flag. The actual code is IA86 assembly language encapsulated in a "__declspec(naked) void" function. C++ is used as an MOL (Machine Oriented Language): some support functions can be written in C++ while others are entirely encapsulated _asm constructs. The parsing code example is shown below:
//   program = $declaration;
__declspec(naked) void program(){  _asm {
l1:     call	 declaration
	je	  l1
	cmp	 al,al
	ret
}}

Writing and implementing a metacompiler this way is simplified by using an existing development platform. The metacompiler can eventually produce linkable files using the same C++ support routines developed at this stage.

The implementation of the meta program formula in this case is very simple. The __declspec(naked) creates a C++ function having no preamble or exit code. The _asm { ... } declares the contained code to be assembly language; above, the entire function is assembly code. There is no additional code being generated by the C++ compiler.

When program is called it is entered at the first assembly code line, where the declaration rule is called:

l1:     call    declaration

call is an assembly instruction that calls a function. The l1: is a label used in addressing that line of code. A grammar rule returns success or failure in the processor status register zero flag. The __declspec(naked) allows programming outside the C++ context: we are not using C calling conventions, we are writing in assembly. The void program() declaration is only providing a label for our rule.

         je      l1

tests the processor flag, and if the equal condition is true it jumps to the l1 tag, calling declaration repeatedly as long as declaration is successful. The $declaration is zero or more declarations. If declaration fails, the status flag will indicate an NE condition and the je l1 will instead proceed to the next instruction

        cmp al,al

which sets the equal condition, as the $ zero or more operator has succeeded. The rule returns an EQ condition to its caller. This assumes declarations will have specific backtracking error recovery. The alternate rule, program = declaration $declaration;, requiring at least one declaration, is coded with comments as follows:

__declspec(naked) void program(){  _asm { // program is declared naked so there is only the assembly code.
        call    declaration     // call the declaration rule
        je    l1                // Test success
        ret                     // return failure
l1:	call	declaration     // calls naked void function declaration 
	je	l1              // declaration will be called repeatedly
                                // until ne condition indicates failure
	cmp	al,al           // comparing al to itself sets the
                                // EQ condition in the status register.
	ret                     // returns success
}}                              // rules have no declared local variables.

If we were coding the grammar of our metacompiler here, writing the compiler, translating the language, and coding it in assembly would be the start of the bootstrap process. The next step would be defining declaration. That is however not what this is about; we are explaining the parser programming. We have explained how the grammar rules are used as a top down parser. The COBOL driving rule is an example of a rule defining the required order of the COBOL divisions.

IA86 ASM integer rule[edit]

The integer token rule hand translated to IA86 assembly code, in a C++ naked function, commented for illustration. The CHARCLASS define is from an include file.

#define CHARCLASS  byte ptr [eax+classmap]

The current character is commonly in the al register with the higher bits of eax zeroed. The base of the character class table is classmap. Indexed by a character, the contents contain bit flags of class membership. Note that the first thing in a token rule is a call to _TokenEntry; of interest is the operation of _TokenEntry and its return, explained below.

__declspec(naked) void integer() {
   // ..  ("0b"|"0B")           bin $bin  MAKBIN[] // binary number
   //    |("0o"|"0O")           oct $oct  MAKOCT[]  // octal number
   //    |("0x"|"0X"|"0h"|"0H") hex $hex  MAKHEX[]  // hex number]
   //    |                      dgt $dgt  MAKINT[]; // decmal number

static char _str_0B[] = "0B";
static char _str_0b[] = "0b";
static char _str_0O[] = "0O";
static char _str_0o[] = "0o";
static char _str_0H[] = "0H";
static char _str_0h[] = "0h";
static char _str_0X[] = "0X";
static char _str_0x[] = "0x";

 _asm {
	call	_TokenEntry  // Init token buffer, save stream state.


//   ("0b"|"0B")           bin $bin MAKEBIN()

	push	offset _str_0B		// "0B"
	call	_CmpStr
	je	l0
	push	offset _str_0b		// "0b"
	call	_CmpStr
	jne	l2

l0:	test	CHARCLASS,bin		// ('0' | '1')
	jne	l1
	cmp	esp,0   // return failure (NE). 
	ret
l1:	call	_Toknadv
	test	CHARCLASS,bin		// ('0' | '1')
	jne	l1
	call	MAKEBIN
	ret


//  |("0o"|"0O")           oct $oct  MAKOCT[]

l2:	push	offset _str_0O             // "0O"
	call	_CmpStr
	je	l3
	push	offset _str_0o             // "0o"
	call	_CmpStr
	jne	l5

l3:	test	CHARCLASS,oct
	jne	l4
	cmp	esp,0
	ret
l4:	call	_Toknadv
	test	CHARCLASS,oct
	jne	l4
	jmp	MAKEOCT


//  |("0x"|"0X"|"0h"|"0H") hex $hex  MAKHEX[]

l5:	push	offset _str_0H             // "0H"
	call	_CmpStr
	je	l6
	push	offset _str_0h             // "0h"
	call	_CmpStr
	je	l6
	push	offset _str_0X             // "0X"
	call	_CmpStr
	je	l6
	push	offset _str_0x             // "0x"
	call	_CmpStr
	jne	l8

l6:	test	CHARCLASS,hex
	jne	l7
	cmp	esp,0
	ret
l7:	call	_Toknadv
	test	CHARCLASS,hex
	jne	l7
	jmp	MAKEHEX


//  |                      dgt $dgt  MAKEINT[]; // decimal integer

l8:	test	CHARCLASS,dgt             // Looking for a digit.
	jne	l9                        // If it is a dgt goto l9
	test	CHARCLASS,skipclass       // not a dgt, test if skip_class
	je	l10                       // jump if not skip_class: fail
	call	_Advance                  // advance over skip_class char
	je	l8                        // loop looking for first dgt
l10:	cmp	esp,0                     // failure: return NE status
	ret

l9:	call	_Toknadv                // digit matched copy to token
	test	CHARCLASS,dgt           // is it a dgt?
	jne	l9                      // loop until not a dgt
	jmp	MAKEINT                 // and then make a numeric object.

}}

Of note is that _TokenEntry does not return to its caller, integer in this case, but calls back its caller. The integer rule does not return to its caller, but to the _TokenEntry call-back return, where on failure the source stream is reset and failure returned to the token rule's caller. On a successful return to the callback return, the intercede flag is tested. If intercede is not set, the token is cataloged in the current top symbol table and the dictionary object pushed on the *stack. An interceding function will have created an object and pushed its address on the *stack. A successful return is then made to the integer formula's caller.

Rules do not have a stack frame or local variables; only the return address is on the stack. They simply return to the top entry on the stack. It is simple for _TokenEntry to pop its return address into a register and call indirectly through that address, leaving a return to itself on the stack.
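The trick can be sketched in the same naked function style as the surrounding examples; the state saving and cataloging steps are omitted, so this is illustrative rather than the actual _TokenEntry.

__declspec(naked) void _TokenEntry() {_asm {
	// ... init token buffer and save input stream state (omitted) ...
	pop	edx		// pop the return address into the token rule
	call	edx		// call the rule back; a return to the line
				// below is now the top stack entry
	jne	fail		// the rule's ret lands here; NE is failure
	// ... catalog token or accept the interceded object (omitted) ...
	ret			// success: return to the token rule's caller
fail:	// ... restore the saved input stream state (omitted) ...
	cmp	esp,0		// set NE status
	ret			// failure: return to the token rule's caller
}}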

Full backtracking in syntax rules is implemented in a similar fashion. Though a backtrack stack frame is created on the stack.

_CmpStr[edit]

String compare test function in IA86 assembly code. The __savePntrs function saves the input state and returns the skip_class skipping state via a callback; that is, the return address is called. _CmpStr returns to __savePntrs, which depending on the ZF processor flag may reset the input state on failure. See __savePntrs.

#include "common.h"
#include "CCclass.h"
#include "CCdata.h"

extern void __savePntrs();
extern void __rstrPntrs();
extern void _Advance();
extern BYTE TokenFlg;

/**************************************************************************
*
*  Match a "<string>"
* 
*  ecx -> "<string>"
*
*/
void __declspec(naked) _CmpStr(){_asm{ //_CmpStr = "..."
	call	__savePntrs            // returns ZF and ecx = char* "..."
	jne	l4	               // ZF = skipping skipclass state

// look for the first string character, which may be a skipclass,
// skipping skipclass characters that do not match

l1:	cmp	al,byte ptr [ecx]	// source char match string char?
	je	l3			// if found first match -> l3
	test	CHARCLASS,skipclass	// source stream a skipclass char?
	je	l5			// not skipclass. go null test l5
	call	_Advance		// advance over skip_class char
	je	l1			// advanced? loop skip testing
	jmp	l5			// _Advance failed. null==<eof>
//	   ******* matched first character rest of string must now match
l3:	cmp	al,0	 		// if (end of string)
	je	l7			//   then it's a match
	inc	ecx			// inc string pointer
	call	_Advance		// advance input stream next char
	jne	l5			// _Advance failed? -> l5
l4:	cmp	al,byte ptr [ecx]	// source char to string char
	je	l3			// jump if match
l5:	cmp	byte ptr[ecx],0		// if (null, end of string)
	je	l7			//   then it's a match
l6:	cmp	esp,0			//  else return ne failure  
	jmp	__rstrPntrs		// token backtrack return failure

l7:	test	TokenFlg,Token_Going_On	// In a token rule?
	je	l8			// if not, skip the flag update
	or	TokenFlg,TokenWhiteSkip	// else stop skipclass skipping
l8:	pop	ecx			//  restore user ecx
	cmp	al,al			// return ==
	ret				// return match eq condition
}}

_Toknadv[edit]

/****************************** _Toknadv *******************************\
*									*
*		       THE SIMPLEST FUNCTION HERE			*
*									*
*	Puts a matched character on the token in the token buffer	*
*	then advances the input stream, jumping to _Advance		*
*									*
\************************************************************************/

__declspec(naked) void _Toknadv() {_asm {
	xchg  ecx,dword ptr tokenptr       // save ecx get tokenptr
	mov   byte ptr [ecx],al            // put character in TokenBuf
	inc   ecx                          // Advance tokenptr
	mov   byte ptr [ecx],0             // clean display when debuging
	xchg  ecx,dword ptr tokenptr       // restore ecx save tokenptr
	or    TokenFlg,TokenWhiteSkip      // disable skip_class skiping
	jmp   _Advance // Advance input stream. At end of input returns null

//	_Advance is parse stream input function
}}

__savePntrs[edit]

/******************* Set up string matching function *******************\
*									*
*  Matching functions are called from token and grammar rules to match  *
*  strings in the input stream.                                         *
*									*
*      !! This is assembly code in a C++ wrapper !!!!			  *
*									*
*  __savePntrs:								*
*  									*
*  __savePntrs is called on entry by a string match function to save	*
*  the input stream state required to back up on a failed match.	*
*  Matching functions are not re-entrant. States need not be stacked	*
*  as objects are not put on the parse or node stack by them. Backup	*
*  is only done when they do not succeed, and only the input stream	*
*  state is saved, statically.						*
*  									*
*  On success a match function simply returns success. On failure it 	*
*  must restore the input stream state. To restore the input stream 	*
*  it simply jumps to __rstrPntrs. 					*
*  									*
*	jmp	__rstrPntrs						*
*									*
*  __rstrPntrs:								*
*  									*
*  __rstrPntrs is always jumped to on failure by a matching		*
*  function and only restores the input stream state			*
*									*
*  Token skipclass skipping is also affected by matching functions.	*
*  When called from a token rule a match will set the partial match	*
*  flag that prevents further skipclass skipping. The flag may only be	*
*  set when a token rule is active.					*
*									*
*  The input stream is a series of files. An FCB (file control block)	* 
*  is used to track files. FCBs are linked in order as given on the 	*
*  command line. An FCB basically manages file buffers. File buffers    *
*  are linked in file order. Input file buffers are read from files as  *
*  the parse rules progress through the input. The FCB status keeps the *
*  file state. Opened and end of file reached. Error flags generate a   *
*  compile error. The stream position is a set of pointers. The data    *
*  pointer points at the current input position in a file buffer. The   *
*  BCB pointer points at the current file buffer. The data count holds  *
*  the number bytes processed. It is needed so as not to re-output      *
*  characters to the print stream. 			              	*
*									*
*  The token and input stream saved state are as follows:		*
\************************************************************************/


 __CC_DBCH*	markbufr;      // buffer header pointer
 char*		markdataptr;   // data pointer
 short		markdatacnt;   // bytes remaining in buffer;
 char*		marktokenptr;  // token buffer position

BYTE   token_stats;            // saved TokenFlg state
int stream_inputcnt;           // saved input character count
char* strng_parm;              // string parameter to match

__declspec(naked) void __savePntrs(){_asm{  // crazy stack manipulation!!

/***********************************************************************\
*   __savePntrs saves input stream state for string match function. 	*
*									*
* A string compare is called as follows, first pushing the string ptr	*
*									*
*	push	offset <string address> 				      *
*	call	_match_function_					*
*    <return from _matching function_>				      *
*		•							*
*		•							*
*		•							*
*__declspec(naked) void _match_function_() {_asm{			*
*	call	__savePntrs		// First thing save state	*
*    <return from __savePntrs>					      *
*									*
* stack on entry to __savePntrs:					*
* [esp+8]	<ASCIIZ string address> 	 	 		      *
* [esp+4]	<return to calling rule>			 	      *
* [esp+0]	<return from __savePntrs> to _matching function_ 	      *
*									*
*     rearrange stack: Stack on entry as above				*
*									*
*    preserving the calling rule's ecx					*
\***********************************************************************/

	xchg	ecx,[esp+4]     // swap ecx for <return to calling rule>
	xchg	ecx,[esp+8]     // swap return for <ASCIIZ string address>

/***********************************************************************\
* [esp+8]	 <return to calling rule> 			*
* [esp+4]	callers saved ecx					*
* [esp+0]	<return from __savePntrs> to _matching function_ 	*
*   ecx		 <ASCIIZ string address>  			*
*									*
* Stack now as above with ecx = string parameter from [esp+8] on entry	*
*									*
\***********************************************************************/

	mov	strng_parm,ecx		// string parameter saved
	mov	ecx,InPutFCB		// point ecx at FCB

	mov	eax,FCB_bufr(ecx)	// (+2) save buffer pointer
	mov	markbufr,eax

	mov	ax,FCB_datacnt(ecx)	// save data count
	mov	markdatacnt,ax

	mov	eax,FCB_dataptr(ecx)	// save data pointer
	mov	markdataptr,eax

	mov	eax,__inputcnt
	mov	stream_inputcnt,eax

//	mov	al,FCB_status(ecx)	//  
//	mov	markstatus,al	// 

	mov	al,TokenFlg		// save current flags state
	mov	token_stats,al		// so can be restored on fail

//  Got TokenFlg skipclass state. Matching cannot skip skipclass once any
//  characters have been matched. That state is passed back to the
//  calling function in the zero status flag by the following.

	test	al,TokenWhiteSkip	// z flag = state of WhiteSkip

	mov	eax,tokenptr		// Save Token pointer
	mov	marktokenptr,eax	// after TokenWhiteSkip

	mov	eax,FCB_dataptr(ecx)	// point at current input
	movzx	eax,byte ptr [eax]	// get current BYTE

	mov	ecx,strng_parm		// point ecx at string parameter
	ret

/***********************************************************************\
*									*
*    On return to caller:						*
*									*
* [esp+4]      original <return from _matching function_>		*
* [esp+0]      caller's saved ecx, must be popped before return		*
*   ecx        points at string to match <string address>		*
*   NOTE:  __rstrPntrs will restore the caller's ecx. Do not pop ecx.	*
*									*
*   ecx need only be popped on success before a return:			*
*	pop	ecx		// needs restoring			*
*	ret			// before a successful return		*
*									*
* NO POP ECX before:							*
*	jmp	__rstrPntrs	// ecx will be restored by __rstrPntrs 	*
*									*
*    Input stream state saved.						*
\***********************************************************************/
}}


// **************************************************************
// *********  NEVER EVER CALL THIS FUNCTION EVER NEVER  *********
//
__declspec(naked) void __rstrPntrs(){_asm{ // Backtrack input stream.

//   !*!*!*!*!*!*!*!*!  COMMON FAIL EXIT CODE  !*!*!*!*!*!*!*!*!
//   !*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!
//   !*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!
//   !*!*!*!*!*!*!*!*!*!*!*!*!  NOTE!  !*!*!*!*!*!*!*!*!*!*!*!*!
//   !*!*!*!*!*!*!*!*!*!*!*!*!  NOTE!  !*!*!*!*!*!*!*!*!*!*!*!*!
//   !*!*!*!  __rstrPntrs is always jumped to with saved !*!*!*!
//   !*!*!*!  ecx left on the stack to restore on exit.  !*!*!*!
//   !*!*!*!       THIS FUNCTION IS ALWAYS JMPed to      !*!*!*!
//   !*!*!*!      NEVER EVER !! CALLED !!  NEVER EVER    !*!*!*!
//   !*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!
//   !*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!*!

// Note: the compare ZF processor status is unchanged here.
// Failure or success is unchanged while restoring (backing up) state.
// This is used by the ?"..." peek test.

	mov	eax,stream_inputcnt
	mov	__inputcnt,eax

	mov	eax,marktokenptr		// Restore Token pointer
	mov	tokenptr,eax

	mov   	al,token_stats
	mov   	TokenFlg,al                // Restore token flags

	mov	ecx,InPutFCB		// point ecx at FCB

	mov	eax,markbufr
	mov	FCB_bufr(ecx),eax	// restore buffer pointer

	mov	ax,markdatacnt
	mov	FCB_datacnt(ecx),ax	// restore data count

	mov	eax,markdataptr
	mov	FCB_dataptr(ecx),eax	// restore data pointer

	mov	eax,FCB_dataptr(ecx)	// reload restored input position
	movzx	eax,byte ptr [eax]	// get current BYTE (flags unchanged)
	pop	ecx
	ret
}}

Unparse Rule[edit]

An unparse rule consists of a pattern and an action. The pattern is tested against the passed-in tree. When matched, the unparse is performed, assigning local variables the pattern parts they occupy. A set of unparse rules is organized into a named generator group. A generator is in a way akin to an overloaded function, each unparse rule a function. Unlike a simple function having an argument list, unparse rules contain local variables within a pattern. A local variable in an unparse expression is recursively assigned the element at its position. A generator names a set of unparse rules, each rule consisting of a pattern and an action performed when the pattern is matched.

Token formulae[edit]

Token formulae are used to recognize tokens.

gener_id   .. alpha $symchr ?'['  {?TYPE:(*1):=GENR};

class_id   .. alpha $symchr {TYPE:(*1)==CLASS};

str_const  .. ''' .ANY ''' | '"' $(~'"' | """""" ,'"') '"'  MAKESTR();

chr_const  .. ''' .any ''' MAKESTR();

Character classes must be defined before they are used. Type checking distinguishes character class names. Token rules allow actions to be performed after a token is recognized, as the last thing in a token rule or an outermost alternative. A generator call may prevent the normal dictionary processing. Actions can test and/or set attribute values. The tests of token actions are performed before symbol creation is finalized. The test assignment

 {?TYPE:(*1):=GENR}

needs a bit of explanation. TYPE is a non-reassignable attribute. Within a token rule *1 refers to the token buffer string. For that test to succeed the token must not be in the symbol table, or must be in the symbol table with the assigned attribute value. The action will create a symbol with the assigned TYPE attribute value, or validate that the symbol exists with the assigned value. The simpler test

 {TYPE:(*1)==CLASS} 

succeeds only when the symbol's TYPE attribute is equal to CLASS. If the symbol is not in the dictionary it does not have a TYPE attribute, and the test fails.
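To make these attribute tests concrete, here is a minimal C sketch of how they might behave, assuming a simple linear symbol table; all names (sym lookup, GENR, CLASS) are illustrative, not SLIC's actual runtime:

 #include <stdio.h>
 #include <string.h>
 
 /* Hypothetical attribute values; GENR/CLASS mirror the examples above. */
 enum attr { NONE, GENR, CLASS };
 
 struct symbol { const char *name; enum attr type; };
 
 static struct symbol symtab[64];
 static int nsyms;
 
 /* {TYPE:(*1)==CLASS} -- fail unless the symbol exists with TYPE == CLASS. */
 static int type_equals(const char *name, enum attr want) {
     for (int i = 0; i < nsyms; i++)
         if (strcmp(symtab[i].name, name) == 0)
             return symtab[i].type == want;
     return 0;                       /* not in the dictionary: test fails */
 }
 
 /* {?TYPE:(*1):=GENR} -- succeed if absent (creating it) or already GENR. */
 static int type_test_assign(const char *name, enum attr want) {
     for (int i = 0; i < nsyms; i++)
         if (strcmp(symtab[i].name, name) == 0)
             return symtab[i].type == want;  /* fail on a conflicting TYPE */
     symtab[nsyms].name = name;              /* create with the attribute */
     symtab[nsyms++].type = want;
     return 1;
 }
 
 int main(void) {
     printf("%d\n", type_test_assign("emit", GENR));  /* 1: created */
     printf("%d\n", type_equals("emit", CLASS));      /* 0: TYPE is GENR */
     return 0;
 }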

Character classes must be defined before they are used. Generators may be forward referenced. A token rule may only reference character classes, or a generator by name. The type checking enforces that they are used correctly, and helps distinguish a generator call at the end of a rule.


token       = chr_pattern (gen_call:GEN!2| action:ACTION!2|--) comment
                   ('|' token:ALT!2|--) comment ';' $-(.break | "/*");

chr_pattern = comment +[basic $ basic]+;

basic       = ( '$' -'$' basic :seq!1 | '(' tokenxpr ')' | str_const 
              | char_match | insert) comment;

tokenxpr    = chr_pattern ('|' (+"--" | tokenxpr) :ALT!2 | -- );

char_match  = ('+':RETAIN|'-':NEG|'~':NOT) (str_const | chr_const)!1
              | char_test;

char_test   = ".any" ,".ANY" | +".ANY" | chr_const | class_id;

insert      = ',' (chr_const | str_const):INSERT!1;

gen_call    = gener_id '[' +[ (m_arg $(',' m_arg) |--) ]+ ']' :CALL!2;

m_arg       = str_const | integer | chr_const | id | '*' integer:STAR!1;

action      = '{'  +[ attribute_expr $attribute_exp ]+ '}';

Success or Failure[edit]

The terms success and failure refer to the result status of a test. A test may be considered a boolean type having a value of success or failure. In most cases the success or failure status is the state of a bit in the processor's status register, most commonly the zero status flag bit. A machine branch on condition instruction is used to test the returned status (success or failure) of a test.

     call   test_function  ; A test function's result is returned in the
     je     success_loc    ; processor status register zero flag bit.
     jne    failure_loc    ; Jump on condition, e or ne tests the result.

Assuming test_function returns to the instruction after the call instruction, the je instruction can be used to transfer control to success_loc, or a jne to failure_loc.

Note: the branch instructions illustrate the condition testing, not actual code. Normally only one or the other would be generated. On a failure return the parse input state is unchanged. A backtrack long-failure resets the parse state (including the call stack), bypassing any number of stacked test function returns.

The expr formula:

expr = term $(('+':ADD|'-':SUB) term !2);

compiled to x86 code would generate the following assembly code:

%S1:    ASCIIZ   "+"
%S2:    ASCIIZ   "-"
_ADD_N: ASCIIZ   "ADD"
_SUB_N: ASCIIZ   "SUB"
// Comments as below show string producing following instructions.
/*
     expr = term $(('+':ADD|'-':SUB) term !2);
*/
expr:
/*
     term $(('+':ADD|'-':SUB) term !2);
*/
        call term      // call term test
        jne  %g5       // branch on failure to %g5
/*
   $(('+':ADD|'-':SUB) term !2);
*/
%g1:
/*
   '+':ADD|'-':SUB) term !2);
*/
        lea  si,%S1    // "+"
        call _cmp_str  // string compare function
        jne  %g2       // branch on failure to %g2
/*
   :ADD|'-':SUB) term !2);
*/
        lea  si,_ADD_N // point si at ADD node string
        jmp  %g3       // unconditional jmp to %g3
/*
   '-':SUB) term !2);
*/
%g2:    lea  si,%S2    // "-"
        call _cmp_str  // call compare string function
        jne  %g4       // jump on failure to %g4
/*
   :SUB) term !2);
*/
        lea  si,_SUB_N // point si at SUB node string
%g3:    call _node     // create and push node object
/*
   term !2);
*/
        call term      // call term test function
        jne  %g6       // long failure if term fails.
/*
   !2);
*/
        mov  ecx,2     // set number of tree branches
        call _tree_    // create tree having ecx branches
/*
   );
*/
        jmp  %g1       // continue indefinite  $ loop
%g4:    cmp  al,al     // Set success
%g5:    ret            // return test status
%g6:    jmp  _backtrk_ // long fail, undo partial parse 

Gensym %<alpha><number>, generated symbol notation, used in the hand-coded example above is not normally recognized by assemblers. However the intent is illustrating generated code, and gensyms do appear in the embedded assembly compiler listing output. Comments are not part of the assembly output in compiler listings. Here they aid in understanding parser function translation.

This example illustrates the simplicity of syntax formula translation.

In early Schorre metacompilers the test status is a global flag in memory. In the above it is the processor's zero status flag.


goal-oriented programming[edit]

In these parser programming languages a top-down, goal-oriented, reductive method of syntax analysis is programmed. The main parser goal is to match a complete program module with the starting formula.

program = $((declaration      // A program is a sequence of declarations
            | .EOF .STOP)     // terminated by End Of File
           \ ERRORX["Error"]  // Otherwise backtrack and report error
                              // flagging the furthest parse point
            $(-';' (.ANY      // Error recovery: find a ';'.
                              // .ANY only fails on EOF
                   |.STOP))   // So stop compile. ELSE
               .ANY);         // Match the ; and continue.

This main "program" goal is accomplished by looking for sub-goal entities from left to right and top down as individual subgoals. The above, using the $ zero-or-more loop operator, specifies a program to be a sequence of declarations that terminates successfully on end of file ".EOF", where .STOP breaks out of the $ zero-or-more loop.

$((declaration | .EOF .STOP) \ ERRORX["Error"] $(-';' .ANY) ';')

On failing to parse a declaration it checks for .EOF (End Of File). When not at .EOF, or on a long fail from a partial match of a declaration, the \ backtrack alternative is taken: an error is assumed and reported using the ERRORX["Error"] function. After which recovery is attempted, scanning for a ';' character using $(-';' .ANY) .ANY, which scans using a negative -';' look ahead. A look ahead, negative or positive, does not advance the parse. .ANY inside the loop advances over non-';' characters, and the .ANY outside the loop matches the ';' tested by the -';'.
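A minimal C sketch of what that recovery scan does to the input stream, assuming a peek/advance interface over the input (the names here are hypothetical stand-ins for the compiled -';' and .ANY tests):

 #include <stdio.h>
 
 /* Hypothetical input-stream primitives standing in for -';' and .ANY. */
 static const char *in = "int x @ bad; int y;";
 static int peek(void)    { return *in; }          /* look ahead, no advance */
 static int advance(void) { return *in ? *in++ : 0; }
 
 /* $(-';' .ANY) .ANY -- skip up to the next ';', then match it too. */
 static int recover(void) {
     while (peek() != ';') {       /* -';' look-ahead: does not advance   */
         if (!advance())           /* .ANY fails only at end of input     */
             return 0;             /* would .STOP the compile             */
     }
     advance();                    /* final .ANY consumes the ';' itself  */
     return 1;
 }
 
 int main(void) {
     if (recover())
         printf("resuming at: \"%s\"\n", in);  /* " int y;" */
     return 0;
 }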

There are several types of declaration. Each a sub-goal:

declaration =	 "#" dirictive
              | comment
              | global	  	DECLAR[*1]
              | (id ( formula   PARSER[*1]
                    | production GENERATOR[*1]
                    | optimizer ISO[*1]
                    | pseudo_op PRODUCTION[*1] 
                    | emitor_op MACHOP[*1]
                    )
                \ ERRORX["!Grammer error"] garbol);

These subgoals, dirictive, comment, global, and id followed by the nested subgoals formula, production, optimizer, pseudo_op, and emitor_op, defined by their respective formulae in terms of other entities, are tested in order. And so the process of recognizing a complete program becomes a matter of looking for subgoals right down to the level at which characters are recognized.

id .. let $(+'_' alphanum | alphanum);

Note that an id token formula is required to start with a let and must end with an alphanum. An id may not start or end with an underscore '_' character. The +'_' makes the underscore included in the id string. Without the + operator underscores would not be significant: i_d would match id, treated as identical symbols. The above token rule does not allow multiple underscore characters to appear together, i.e. i__d is not a valid id.

Linking to NPV[edit]

Using the Wikipedia Android app, it shows the lead paragraph of a linked article when tapping the link. You have the option to go to the article. Following the parser article link illustrates the problem:

Parsing is the process of analysing a string of symbols ... conforming to the rules of a formal grammar.

Then following the formal grammar link you only see it described as a set of production rules:

In formal language theory, a grammar (when the context is not given, often called a formal grammar for clarity) is a set of production rules for strings in a formal language. The rules describe how to form strings from the language's alphabet that are valid according to the language's syntax.

Certainly not a neutral point of view. Made more confusing by production rules, strings, formal language, alphabet, and syntax all linking back to computer science topics.

That is the thing I am concerned with. Topics may have multiple valid, conflicting descriptions. I believe the lead paragraph should have a general NPV description. I am not arguing that production rules are not the norm in computer science. Reductive and analytical grammars do exist and have been used.

Mortoray[edit]

I think the art of compiler writing has devolved over the years. One of the most advanced compiler-compilers, CWIC (Compiler for Writing and Implementing Compilers), was developed in the late 1960s with a paper published in the 1970 Journal of the ACM. But with the release of yacc in the 70s, Chomsky's linguistic theories and the lex-yacc parser model have come to dominate the field. Compared to CWIC, yacc is less than half a compiler-compiler. As are most, maybe all, parser generators. CWIC is a full compiler-compiler system that outputs binary byte code.

I was given the CWIC SYNTAX and GENERATOR language manuals by Erwin Book at an ACM SigPlan meeting in 1969. Erwin was the project leader. Dewey Val Schorre was also on the CWIC development team. I began implementing SLIC (System of Languages for Implementing Compilers) shortly after receiving the manuals. The SYNTAX and GENERATOR languages are almost identical. CWIC was implemented on an IBM 360 and was designed to produce 8 bit byte code. I added languages that separate target machine code generation out of the tree-crawling generator language. I do not have hands-on experience with CWIC as it was an in-house system at SDC. I communicated with Erwin Book several times while developing SLIC and he was always helpful. A CWIC paper is available in the ACM archives. It is an overview of the system with an example interpreter written in CWIC. It is interesting that there are now documents describing yacc as using an analytical grammar when the original document explains how it used production rules to generate an LALR parser.

CWIC is a compiler writing system having 3 compilers:

SYNTAX language compiler, based on Dewey Val Schorre's transformational metalanguage work, using tree transformation operators similar to TREE-META.

GENERATOR language compiler. A specialized tree-crawling language having named sets of advanced TREE-META-like unparse action pairs, whose actions are programmed in a LISP 2 dialect. The terms plant, planting, planted etc. refer to code generation. Code is planted into a named section using angle-bracketed arithmetic expressions: < add+R1*16+R2 >. These are 8, 16, or 32 bit values targeting IBM 360 computer systems. The instructions are planted into a block of memory associated with a section.


MOL360. A machine oriented language compiler that was separately developed at SDC, supplied with CWIC and used in writing the underlying support package.

SLIC is a single compiler having five sub-languages:

SYNTAX. The SYNTAX sub-language is basically the same as in CWIC. The differences would probably make little difference compiling a CWIC SYNTAX language program.


GENERATOR sub-language. Changed the planting of numeric values to instead plant PSEUDO, macro-like, instructions. PSEUDOs are appended to a list associated with a named section. Target machine specifics moved to the PSEUDO and MACHOP languages, making the SYNTAX and GENERATOR languages mostly target machine independent.


ISO. In Sequence Optimizer language. The PSEUDO instruction lists could optionally be processed by optimizer transforms. Optimizers were planned to work something like search and replace operations. The ISO was never implemented. Optimizations would have been section specific.


PSEUDO. PSEUDOs are compiled functions that serve the function of assembly macros. They are also coded in a block-structured LISP 2 dialect much like the generator actions, even sharing much of their syntax, only now outputting actual code by calling MACHOP functions to emit binary machine code. An abstract machine may be defined specific to a computer architecture and/or language. In appearance MACHOP calls look like assembly instructions.


MACHOP language. A MACHOP is a function that transforms its parameters into binary output sequences. One could view them as defining an assembler, and they could be used to implement an assembler. A MACHOP may define a family of opcodes using a vectored entry. Machine operations have field arguments. A field might just be a register. A memory accessing field could have an index, offset and/or be indirect. Fields are separated by a comma. In #MOP %r,@i offset(%index) the %r designates r to be a register argument. @i sets i true if @ is present, otherwise i is false. offset is an address or offset value. (%index) specifies index to be a register. MACHOPs are programmed to output a sequence of variable length fields making up the instruction.

LISP 2 is a major component of the implementation. There is a library of support functions, and a specialized dynamic memory management system making allocation of common block lengths very fast. A dictionary is part of the system. The dictionary is a symbol table stack supporting nested variable scoping. Compiler-created gensym, generated symbol, objects may be created in a generator or pseudo function. Gensym objects are created by using %<alpha><number>, i.e. %g1 %L1 %x12. The created symbol reference name is local to the function creating it. However the gensym object can be passed around. Gensyms, like dictionary symbols, may have attributes. CWIC also has gensyms though there is no listing output of the generated code. My goal for SLIC was to provide a system that simplified debugging and could divide coding labor up into specialized areas.

Sorry for the preceding long descriptions. I thought it important to explain the overall operation of the compiler-compiler in some detail to understand the parser language.

The parser language creates the APT (Abstract Parse Tree) using dynamic list structures, and calls generator functions that transform the lists into sequential macro-like PSEUDO instructions. The PSEUDO instructions are optimized and called to produce machine code using MACHOPs.

The thing of interest here is the SYNTAX or parser languages that use parsing formulae. CWIC used the term syntax equation, probably inherited from Schorre's earlier work.

A top-down, recursive descent, method of analysis is used. A top driving formula defines the overall organization of a program:

program = $declaration;

In this simple version, $ is the sequence, zero or more, operator. The program formula states that a program is simply a sequence of declarations. The declaration formula is called repeatedly as long as it returns success. The above compiled to x86 assembly would look something like:

 __declspec(naked) void program() {
 _asm {
 
 %g1:
       call declaration   // test declaration
       je   %g1           // loop while it succeeds
       cmp  al,al         // $ zero-or-more always succeeds
       ret
 }}


 #define CHARCLASS  byte ptr [eax+classmap]
 
 __declspec(naked) void integer() {
   // ..  ("0b"|"0B")           bin $bin  MAKEBIN[]  // binary number
   //    |("0o"|"0O")           oct $oct  MAKEOCT[]  // octal number
   //    |("0x"|"0X"|"0h"|"0H") hex $hex  MAKEHEX[]  // hex number
   //    |                      dgt $dgt  MAKEINT[]; // decimal number
 
 char _str_0B[] = "0B";
 char _str_0b[] = "0b";
 char _str_0O[] = "0O";
 char _str_0o[] = "0o";
 char _str_0H[] = "0H";
 char _str_0h[] = "0h";
 char _str_0X[] = "0X";
 char _str_0x[] = "0x";
 
  _asm {
    	call	_TokenEntry  // Init token  buffer, save stream state.
     
 //   ("0b"|"0B")           bin $bin MAKEBIN()
    
    	push	offset _str_0B		// "0B"
    	call	_CmpStr
    	je	l0
    	push	offset _str_0b		// "0b"
    	call	_CmpStr
    	jne	l2
 
 l0:     test	CHARCLASS,bin		// ('0' | '1')
    	jne	l1
    	cmp	esp,0   // return failure (NE). 
         ret
 l1:	call	_Toknadv
    	test	CHARCLASS,bin		// ('0' | '1')
    	jne	l1
         call	MAKEBIN
         ret
     
     
 //  |("0o"|"0O")           oct $oct  MAKOCT[]
      
 l2:	push	offset _str_0O             // "0O"
    	call	_CmpStr
    	je	l3
    	push	offset _str_0o             // "0o"
    	call	_CmpStr
    	jne	l5
 
 l3:	test	CHARCLASS,oct
    	jne	l4
    	cmp	esp,0
    	ret
 l4:	call	_Toknadv
    	test	CHARCLASS,oct
    	jne	l4
    	jmp	MAKEOCT

 //  |("0x"|"0X"|"0h"|"0H") hex $hex  MAKHEX[]
      
 l5:	push	offset _str_0H             // "0H"
    	call	_CmpStr
    	je	l6
    	push	offset _str_0h             // "0h"
    	call	_CmpStr
    	je	l6
    	push	offset _str_0X             // "0X"
    	call	_CmpStr
    	je	l6
    	push	offset _str_0x             // "0x"
    	call	_CmpStr
    	jne	l8
 
 l6:	test	CHARCLASS,hex
    	jne	l7
    	cmp	esp,0
    	ret
 l7:	call	_Toknadv
    	test	CHARCLASS,hex
    	jne	l7
    	jmp	MAKEHEX
 
  
 //  |                      dgt $dgt  MAKEINT[]; // decimal number
 
 l8:	test	CHARCLASS,dgt             // Looking for a digit.
    	jne	l9                        // If it is a dgt goto l9
    	test	CHARCLASS,skipclass       // Not a dgt; is it skipclass?
    	je	l10                       // jump if not skipclass
    	call	_Advance                  // advance over skip class
    	je	l8                        // loop looking for first dgt
 l10:	cmp	esp,0                     // failure return NE status
    	ret
 
 l9:	call	_Toknadv                // digit matched copy to token
    	test	CHARCLASS,dgt           // is it a dgt?
    	jne	l9                      // loop until not a dgt
    	jmp	MAKEINT                 // and then make a numeric object.
 
 }}


Of note is that _TokenEntry does not return to its caller, integer in this case, but calls back its


That formula does not account for unexpected or proper termination. An error could cause declaration to fail. The program would simply terminate, i.e. abend. The following program formula defines a program to be a sequence of declarations; however the last declaration is expected to be followed by an end of file .EOF, where .STOP terminates the compile. If not at end of file an error is reported and recovery attempted, scanning for a ';' or .EOF.

program = $((declaration        // A program is a sequence of declarations
            | .EOF .STOP)       // terminated by End Of File
           \ ERRORX["Error"]    // Backtrack and report error
                                // flagging furthest parse point.
            $-(';'|.EOF .STOP)  // Error recovery: find a ;
            ';');               // match the ; and continue declaration loop

I should explain: the C-like commenting is an updated version I have been working on. CWIC and SLIC treated text following a properly semicolon-terminated formula, to the end of line, as a comment.

Schorre's parser languages have been described as analytical, reductive, constituent, phrase structure grammars. I prefer the terms parser programming language and formula. A formula conveys information, a way of doing a thing. The term test, used in talking about the parser formulae, may be considered a boolean type of success or failure. A test may also be an action resulting in success or failure. A parser formula is a test function. There is no separate lexer. Instead token formulae and string tests are used. The term lexeme was not used. In CWIC and SLIC a token is an object created by a token formula. If we were to use the lexeme term it would be defined as any sequence of characters recognized as a unit. A quoted string in a syntax formula would be a lexeme test; however no token object is created. The parser language has three formula types:

Character class:

A Character class formula defines a group of characters that are recognized by an assigned name:

bin: '0'|'1';

oct: bin|'2'|'3'|'4'|'5'|'6'|'7';

dgt: oct|'8'|'9';

hex: dgt|'A'|'B'|'C'|'D'|'E'|'F';

upr: 'A'|'B'|'C'|'D'|'E'|'F'|'G'|'H'|'I'|'J'|'K'|'L'|'M'| 'N'|'O'|'P'|'Q'|'R'|'S'|'T'|'U'|'V'|'W'|'X'|'Y'|'Z';

lwr: 'a'|'b'|'c'|'d'|'e'|'f'|'g'|'h'|'i'|'j'|'k'|'l'|'m'| 'n'|'o'|'p'|'q'|'r'|'s'|'t'|'u'|'v'|'w'|'x'|'y'|'z';

let: upr|lwr;

alphanum: let|dgt;

A character class, using the : define class operator, is a named set of characters written as a sequence of alternatives that may only be a character constant or previously defined class name. In SLIC a character's numeric ordinal or its quoted glyph may be used.

Character class formulae, though not considered functions, are similar to C inline functions. A class table is created and class tests are performed inline. The class table entry indexed by a character's ordinal contains that character's class membership mask. A non-zero result when ANDed with a class_mask indicates class membership. A class bit is only assigned to a class having a character literal. The let and alphanum classes above, containing only other classes (no character literals), would require no additional class bit be assigned. A skip_class is pre-defined to contain non-printing and white space characters. SLIC allows defining a skip_class that overrides the default. The built in operator .ANY is like a character class of all possible characters, testing and failing only at end of input. It may be thought of as a character class, but in operation only checks the input stream for remaining characters. (No class mask for .ANY.)
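A minimal C sketch of the class table mechanics, assuming hypothetical bit assignments; the classmap array plays the role the compiled parsers index inline:

 #include <stdio.h>
 
 /* Hypothetical class-bit assignments: one bit per class that introduces
    character literals, exactly as described above. */
 enum { BIN = 1, OCT = 2, DGT = 4, HEX = 8, SKIP = 16 };
 
 static unsigned char classmap[256];
 
 static void build_classmap(void) {
     for (int c = '0'; c <= '1'; c++) classmap[c] |= BIN;
     for (int c = '2'; c <= '7'; c++) classmap[c] |= OCT;
     for (int c = '8'; c <= '9'; c++) classmap[c] |= DGT;
     for (int c = 'A'; c <= 'F'; c++) classmap[c] |= HEX;
     classmap[' '] |= SKIP; classmap['\t'] |= SKIP; classmap['\n'] |= SKIP;
 }
 
 /* A class test is a single inline AND against the table entry. The mask
    for oct is the union of its alternatives: bin | '2'..'7'. */
 #define IS_OCT(c) (classmap[(unsigned char)(c)] & (BIN | OCT))
 
 int main(void) {
     build_classmap();
     printf("%d %d %d\n", !!IS_OCT('1'), !!IS_OCT('7'), !!IS_OCT('9')); /* 1 1 0 */
     return 0;
 }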

Character class tests used in token formulae function differently than when used in syntax formulae. Specifically, when used in token formulae the character matched is appended to the token. In syntax formulae they are not normally kept.

Token formula:

Token formulae are test functions that recognize a sequence of characters and create a token object. Token formulae skip leading skip_class characters not matched by the test. By default a token object is designated a symbol and cataloged into the dictionary.

The dictionary is a symbol table stack that provides nested scoping of symbols. Symbol tables are objects that may be assigned to a variable or a symbol's attribute. Token formulae may only reference character class formulae, not calling other functions, except that as their last action an intercede function may be called to create a typed object from the recognized sequence, interceding the cataloging operation. Some common tokens:

integer .. dgt $dgt MAKEINT[];

string .. (''' .ANY ''' | '"' $(-'"' .ANY | """""",'"') '"') MAKESTR[];

char .. .ANY MAKESTR[];

id .. let $(alphanum|'_');

MAKEINT and MAKESTR intercede the default cataloging of a token in the dictionary. MAKEINT converts the recognized string into an integer, numeric token object. MAKESTR creates a string token object. The $ operator is a zero-or-more loop operator, much like the Kleene star is used today. Note the negated matching in the string rule: (-'"' .ANY | """""",'"'). The sequence -'"' .ANY, using a negative look ahead, tests the input stream for a " character and proceeds to match .ANY character when the character is not a " character. If however it is a quote character, the alternative """""",'"' is attempted, which uses two " characters together to represent a single " character within the string. It tests for two "" characters using the doubled-quote string """""" and the , insert operator that inserts a single " character into the token buffer. Note that two forms of strings are defined: a single ' mark delimited single character string, and a " mark delimited multi-character string. A single-character string rule provides a simple method of restricting character classes to single character alternatives. Also note string tests that are part of the string formula are not kept as part of the string. Those recognized by .ANY are appended to the string. The , insert operator is used to append its argument to the string.
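A minimal C sketch of that quoted-string loop, assuming a simple character-pointer input; the names here are illustrative, with tok playing the role of the token buffer:

 #include <stdio.h>
 
 /* Sketch of '"' $(-'"' .ANY | """""" ,'"') '"'                          */
 static int match_string(const char **pp, char *tok) {
     const char *p = *pp;
     if (*p++ != '"') return 0;             /* opening quote, not kept    */
     for (;;) {
         if (*p == '\0') return 0;          /* unterminated: fail         */
         if (*p == '"') {                   /* -'"' failed: quote seen    */
             if (p[1] == '"') {             /* """""" : a doubled quote   */
                 *tok++ = '"';              /* ,'"' inserts one " in token*/
                 p += 2;
                 continue;
             }
             p++;                           /* closing quote, not kept    */
             break;
         }
         *tok++ = *p++;                     /* .ANY keeps the character   */
     }
     *tok = '\0';
     *pp = p;                               /* commit the advance         */
     return 1;
 }
 
 int main(void) {
     const char *src = "\"say \"\"hi\"\"\" rest";
     char tok[64];
     if (match_string(&src, tok))
         printf("token: [%s] rest: [%s]\n", tok, src); /* [say "hi"] [ rest] */
     return 0;
 }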

I will finish up on token formulae with a more inclusive integer formula:

integer .. ("0b"|"0B")           bin $bin MAKEBIN[]  // binary integer
          |("0o"|"0O")           oct $oct MAKEOCT[]  // octal integer
          |("0x"|"0X"|"0h"|"0H") hex $hex MAKEHEX[]  // hex integer
          |                      dgt $dgt MAKEINT[]; // decimal integer

This formula illustrates matching strings that are not kept.

The leading base string tests do not become part of the token string. Only characters matched by the character class tests are appended to the token buffer. A +<string> test will append a matched <string> to the token buffer. A -<string> test fails when the <string> is matched. The parse is not advanced in either case.

Note the MAKEBIN[], MAKEOCT[], MAKEHEX[], and MAKEINT[] interceding conversion functions are called only at the end of the outermost alternatives. The [] indicate them to be functions coded in a different language. In CWIC such functions would have been coded in the MOL360 language. Having no MOL language, they were coded in MACRO-10 assembly for SLIC. I was considering using BLISS but never got around to it.

The numeric objects created can be used in calculations in the other languages.

The CWIC manual explained that token equations skipped leading skip_class characters. In SLIC leading skip_class characters are not so simply skipped. A token formula may include leading skip_class tests.


Syntax formula:

Syntax formulae are test functions, resembling phrase structure/constituency grammar rules.

expr = term $(('+'|'-')term);

The above can be explained as a constituency phrase structure grammar rule. I would agree that it is a constituent rule. It names the constituent, specifying its formation, its formation being the pattern of the constituent parts. Its analytical nature is apparent when transform operations are added:

expr = term $(('+':ADD|'-':SUB)term!2);

The parser language transform operators use parse and node stacks to build the abstract parse tree. Above, :ADD and :SUB create node objects and push them onto the node stack. The tree making operator !2 combines the top node stack object and top 2 parse stack objects into a 3 element list. The first object is popped from the node stack. The last (3rd) element is popped from the parse stack, and then the second list element is again popped from the parse stack. The !<number> pops the top <number> of parse stack entries onto the list in reverse order, making their list order match the order in which they were recognized. The !<number> formed tree list object is then pushed on the parse stack.
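A minimal C sketch of the two stacks and the !2 operation; the object representation and names here are illustrative, not SLIC's actual runtime:

 #include <stdio.h>
 
 struct obj { const char *name; struct obj *branch[2]; };
 
 static struct obj pool[32]; static int used;
 static struct obj *parse_stk[16]; static int parse_sp;
 static struct obj *node_stk[16];  static int node_sp;
 
 static struct obj *atom(const char *s) {
     pool[used].name = s; return &pool[used++];
 }
 static void push_parse(struct obj *o) { parse_stk[parse_sp++] = o; }
 static void push_node(const char *s)  { node_stk[node_sp++] = atom(s); }
 
 /* !2: pop two parse stack entries and the top node, push the tree.      */
 /* Branches are filled right to left so list order matches parse order.  */
 static void tree2(void) {
     struct obj *t = node_stk[--node_sp];
     t->branch[1] = parse_stk[--parse_sp];   /* last recognized, 2nd branch */
     t->branch[0] = parse_stk[--parse_sp];   /* first recognized, 1st branch*/
     push_parse(t);
 }
 
 static void show(struct obj *o) {
     if (!o->branch[0]) { printf("%s", o->name); return; }
     printf("%s[", o->name); show(o->branch[0]);
     printf(","); show(o->branch[1]); printf("]");
 }
 
 int main(void) {  /* parsing a+b with expr = term $(('+':ADD|...) term !2) */
     push_parse(atom("a"));   /* term */
     push_node("ADD");        /* :ADD */
     push_parse(atom("b"));   /* term */
     tree2();                 /* !2   */
     show(parse_stk[0]); printf("\n");   /* ADD[a,b] */
     return 0;
 }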

expr = +[ term $('+' term|'-' term:SUB !1) ]+:SUM!1;

The above will create a summation list of terms: x+y+z-c becomes SUM[[x,y,z,SUB[c]]]

+[ ... ]+ creates a list of the terms to be added together. Those to be subtracted will be in a SUB sub-branch. +[ ]+ are commonly used to create a list of arguments. The textual tree notation <NODE>[<branch>,<branch> ... ] used above comes from CWIC. CWIC supplied debug functions to output trees. SLIC had an interactive debugger, and DEC's DDT could also be used with SLIC. Note that the SUM list contains one element, the list of items to be summed. I thought about adding a construct to create an indefinite branch tree but never got around to it. It could easily be done in the generator language using the unlist operators -[ ... ]-, i.e. making a single list of two lists contained in variables a and b:

z=+[-[a]-,-[b]-]+;

Unlists -[...]- may only be used within +[...]+ make-list operators.
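A minimal C sketch of that unlist splicing over simple linked lists, under the assumption that -[x]- appends x's elements in line rather than x itself (names are illustrative):

 #include <stdio.h>
 
 struct cell { int val; struct cell *next; };
 
 static struct cell pool[32]; static int used;
 
 static struct cell *cons(int v, struct cell *next) {
     pool[used].val = v; pool[used].next = next; return &pool[used++];
 }
 
 /* -[x]- inside +[ ]+ : append x's elements instead of x itself. */
 static struct cell *unlist_append(struct cell *dst_tail, struct cell *x,
                                   struct cell **head) {
     for (; x; x = x->next) {
         struct cell *c = cons(x->val, NULL);
         if (dst_tail) dst_tail->next = c; else *head = c;
         dst_tail = c;
     }
     return dst_tail;
 }
 
 int main(void) {   /* z=+[-[a]-,-[b]-]+; */
     struct cell *a = cons(1, cons(2, NULL));
     struct cell *b = cons(3, cons(4, NULL));
     struct cell *z = NULL, *tail = NULL;
     tail = unlist_append(tail, a, &z);   /* -[a]- */
     tail = unlist_append(tail, b, &z);   /* -[b]- */
     for (struct cell *c = z; c; c = c->next) printf("%d ", c->val);
     printf("\n");                        /* 1 2 3 4 : one flat list */
     return 0;
 }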

A recursive descent parser is programmed in the parser languages. Generator functions may be passed the top parse stack entry using *1 as an argument. These formulae were described as reductive grammar rules in some early compiler-compiler documents. Reductive described their top down analysis methodology.

expr    = term $(('+':ADD|'-':SUB)term!2);

term    = factor $(('*':MPL|'/':DIV)factor!2);

factor  = '-' primary:NEG!1 | primary;

primary = '(' expr ')'|id|number;

The top parse stack entry is popped and passed to a generator using *1 as an argument.

gen[*1]

Calls gen passing it the popped top parse stack entry. Generator functions are also test functions returning a success or failure status. Generators may also return objects on a successful return.

It may be that the SYNTAX formulae cannot be programmed to create the specific tree one desires. It may be that you wish to produce a summation list sorted in precision or size order. The tree crawling generator language can easily perform such tasks.

SLIC compiles to machine code that is far more efficient than parser generators using tables. Syntax formulae need no stack frame, having only a return address on the stack.

The generator language needs a bit of explanation. A generator can be described as a named set of unparse => action pairs.

generator = id +[unparspar $unparspar]+ :GEN!2;

unparspar = '(' +[unpars $(',' unpars)|.EMPTY]+ ')' "=>" action :UPA!2;

unpars    = decon|(gencall \ id);

decon     = nodi '[' +[unpars $(',' unpars)]+ ']' :DCON!2;

nodi      = '#' id:vector!1|id;

A simple generator example:

eval(ADD[eval(x),eval(y)]) => x+y;
    (SUB[eval(x),eval(y)]) => x-y;
    (MPY[eval(x),eval(y)]) => x*y;
    (DIV[eval(x),eval(y)]) => x/y;
    (number(x))            => x;

Here we have a generator with several unparse => action pairs: four tree unparse expressions, their branches containing recursive calls to eval. Within unparse rules a call to a function is passed the item in whose position the call resides. In the above two-branch unparse expressions, the branches contain calls to eval. eval is called passing it the branch at which it resides. The variable appearing as an argument is instead a local variable in which a returned object is stored. The above tree crawler eval computes numeric expressions in tree form. It can be compressed using vectors:

eval(#op[eval(x),eval(y)]) => #act;

#op  = ADD, SUB, MUL, DIV;
#act = x+y, x-y, x*y, x/y;

(number(x)) => x;
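To make the vectored dispatch concrete, here is a minimal C sketch of an eval over ADD/SUB/MUL/DIV trees; the tree representation and names are illustrative only:

 #include <stdio.h>
 #include <string.h>
 
 struct tree { const char *node; struct tree *l, *r; int num; };
 
 static int eval(struct tree *t) {
     if (!t->node) return t->num;               /* (number(x)) => x       */
     int x = eval(t->l), y = eval(t->r);        /* eval(x), eval(y)       */
     static const char *op[] = { "ADD", "SUB", "MUL", "DIV" };
     for (int i = 0; i < 4; i++)                /* #op vector match       */
         if (strcmp(t->node, op[i]) == 0)
             switch (i) {                       /* corresponding #act     */
             case 0: return x + y;
             case 1: return x - y;
             case 2: return x * y;
             case 3: return x / y;
             }
     return 0;                                  /* no unparse matched     */
 }
 
 int main(void) {   /* ADD[3,SUB[7,2]] */
     struct tree three = { 0, 0, 0, 3 }, seven = { 0, 0, 0, 7 };
     struct tree two = { 0, 0, 0, 2 };
     struct tree sub = { "SUB", &seven, &two, 0 };
     struct tree add = { "ADD", &three, &sub, 0 };
     printf("%d\n", eval(&add));   /* 8 */
     return 0;
 }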

In SLIC the initiating command line was the first input to be parsed. Supplied functions, taking files in tree form, are called to set up source, listing and output files.

These parser languages are not simplistic parser generators. They have powerful look ahead constructs. - is a negative look ahead taking a general test expression. ? is a positive look ahead operator.

The backtracking alternative is unique. Backtracking creates a backtrack stack frame.

Normally in a sequence of tests like:

a b c | e f

If a returns failure the alternative e f is tried. If b, c or f return failure a long failure results. A long failure sets the call stack pointer to the top backtrack stack frame and returns failure.

(a b c \ e f)

The backtrack alternative \ creates a backtrack stack frame before its left alternative (a b c). Now if a, b or c fail, the alternative e f is attempted. If a long failure occurs in any function called by a, b or c, the parse will be reset to the backtrack saved state and the alternative attempted. Backtrack stack frames are back-linked into a stack. As implemented, the parse state is held in the backtrack stack frame. On success the backtrack stack frame state is propagated to the previous backtrack stack frame and removed. A backtrack stack frame is initially created on startup. Returning failure to it is a compiler error and reported as such.
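The later Backtracking vs Backup section compares a long failure to a longjmp in C; here is a minimal sketch of (a b c \ e f) on those terms, with the parse state reduced to a single position index and all names illustrative:

 #include <setjmp.h>
 #include <stdio.h>
 
 static jmp_buf backtrack;      /* the top backtrack stack frame          */
 static int pos, saved_pos;     /* parse state: just a position here      */
 
 static void long_fail(void) { longjmp(backtrack, 1); }
 
 static int a(void) { pos++; return 1; }       /* first test matches      */
 static int b(void) { pos++; long_fail(); return 0; } /* then long-fails  */
 static int e(void) { pos++; return 1; }
 static int f(void) { pos++; return 1; }
 
 /* (a b c \ e f): save state, try the left side; a long failure lands
    back at setjmp, the parse state is reset, and the right side tried. */
 static int alt(void) {
     saved_pos = pos;
     if (setjmp(backtrack) == 0)
         return a() && b();        /* c omitted; b long-fails first       */
     pos = saved_pos;              /* reset parse state to the frame      */
     return e() && f();
 }
 
 int main(void) {
     printf("result %d, pos %d\n", alt(), pos);   /* result 1, pos 2 */
     return 0;
 }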

O-XIUIX-O[edit]

CWIC was a combination of Dewey Val Schorre's transformational metalanguage META II and the LISP 2 language developed at SDC. MOL360, a machine oriented language, was separately developed at SDC and included with the CWIC package. Schorre's transformational metalanguages use parsing formulae that were said to be reductive grammar rules. The starting formula starts at the top, defining a program, breaking it into parts, and through iteration defining those parts into simpler and simpler parts down to the point that individual characters are recognized. A reductive grammar is a specific type of analytical grammar in which a top down (recursive descent) parser is programmed.

In CWIC and SLIC the parser drives the compile, building an abstract parse tree representation of some part of the input source program. A generator is called from the parser and passed the tree. The term plant or planting is used in CWIC's documentation for outputting code. Angle brackets enclose numerical expressions that are planted into memory blocks associated with declared sections. A .FLUSH <section name> statement writes out the planted code to a file. CWIC planted 8, 16, and 32 bit numeric values.

GENERAL-PURPOSE TABLE-DRIVEN COMPILER[edit]

I had read the paper "A GENERAL-PURPOSE TABLE-DRIVEN COMPILER" (Stephen Warshall and Robert M. Shapiro), published in Saul Rosen's book "Programming Systems and Languages". I basically followed their five step transformation strategy.

1. Syntax analysis: Source -> tree representations

2. Generator: Tree -> n-address macros

3. ISO: In Sequence Optimizer.

4. Code selection. Macro expansion -> assembly

5. Assembly -> object file

SLIC uses a 5 phase code generation strategy similar to that described by Warshall and Shapiro. Note though that, as in CWIC, the parser drives the compilation.

Instead of n-address macros, SLIC plants PSEUDO instructions, appending them to a list associated with a section. PSEUDO instructions perform the function of assembly macros, only PSEUDOs are compiled functions whose body is written in a LISP 2 dialect. PSEUDO functions, taking simple input arguments, call MACHOPs to output machine code. Flushing a section calls each PSEUDO in the section list and deletes it. A MACHOP defines an assembler instruction or family of instructions. Compiled into a function, a MACHOP transforms its input parameters into a bit sequence that, aligned correctly, the target machine will recognize as intended.
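A minimal C sketch of planting PSEUDOs on a section list and flushing it, with MACHOP-level output reduced to printing; all names are illustrative, not SLIC's actual runtime:

 #include <stdio.h>
 
 typedef void (*pseudo_fn)(int, int);
 
 struct pseudo { pseudo_fn fn; int arg1, arg2; };
 
 static struct pseudo code_section[64];   /* one named section's list  */
 static int planted;
 
 static void plant(pseudo_fn fn, int a, int b) {
     code_section[planted].fn = fn;       /* append, do not execute yet */
     code_section[planted].arg1 = a;
     code_section[planted++].arg2 = b;
 }
 
 /* A PSEUDO body would call MACHOPs; here it just reports itself. */
 static void p_load(int r, int v)  { printf("load  r%d,%d\n", r, v); }
 static void p_add(int r1, int r2) { printf("add   r%d,r%d\n", r1, r2); }
 
 static void flush(void) {                /* .FLUSH code                */
     for (int i = 0; i < planted; i++)    /* call each PSEUDO in order  */
         code_section[i].fn(code_section[i].arg1, code_section[i].arg2);
     planted = 0;                         /* and delete the list        */
 }
 
 int main(void) {
     plant(p_load, 1, 5);
     plant(p_load, 2, 7);
     plant(p_add, 1, 2);
     flush();
     return 0;
 }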

CWIC has a vector facility that associates ordered entries with each other:

gen(#V1[gen(x),gen(y)]) => #V2;

#V1 = ADD, SUB, MUL, DIV;
#V2 = x+y, x-y, x*y, x/y;

The above generator function recognizes a tree whose node matches #V1 and performs the corresponding operation of #V2. The parameter to a generator may be an unparse rule. Within an unparse rule, function calls like gen(x) and gen(y) are passed the object in which they reside. #V1[gen(x),gen(y)] matches a two branch tree whose node is one of the #V1 listed possibilities. gen(x) is called on the left branch and returns x; likewise for gen(y). The action => #V2 then performs the corresponding operation on x and y. x and y are local to the action. Vectored entry is used in defining a family of target machine instructions.

LISP 2, being a list processing language, provides automatic dynamic memory management, memory management that may be optimized for commonly used block sizes. The generator language can process trees, performing various functions on them before generating code. An interpreter may be programmed in CWIC or SLIC. These languages may be used for many string processing functions, and within SDC they were being used for many text analysis applications.

Parsing vs lexing: there is no discussion of parsing vs lexing in CWIC's documentation. A point of interest is that the term token is only used in reference to an object defined and created by a token equation. I think the use of the term equation was inherited from Schorre's META II. Schorre was on the CWIC development team. I prefer the term formula. A formula conveys information, a way of doing a thing. The parser language uses three formula types: character class, token, and syntax (all having consistent features) in specifying tokens and higher language constructs.

Backtracking vs Backup[edit]

Backtrack: A programmed feature using a \ backtrack alternative (OR) operator. A backtrack stack frame is created ahead of the left alternative. A long failure will reset the call stack to the top backtrack stack frame and return failure, an operation similar to a longjmp in C. On a failure return the saved backtrack state is restored; the backtrack stack frame in effect is cleared and the next alternative attempted. A sequence of backtrack alternatives will leave the backtrack frame on the call stack until the last, right-most alternative. In effect (A\B\C\D\E) has the implied grouping (A\(B\(C\(D\E)))). The right-most alternative E is not backtracked. The nested backtrack illustration would have the same parsing effect; however each nested group would see its left alternative as the first backtrack, creating a backtrack stack frame for each group.

Long failure is an integral part of backtracking. In a sequence of tests, a test other than the first returning failure results in a long failure: the call stack is reset to the backtrack return and failure returned, skipping any number of stacked function returns. In a sequence of normal (non-backtracking) alternatives the next alternative is attempted when the parse has not been advanced by the left alternative.

Backup: Backup is built into string and token tests. Backup only affects the input stream.

TERMS[edit]

Lexeme[edit]

A lexeme is any sequence of characters recognized as a unit. Not all lexemes (matched sequences) are kept as tokens. A token being defined as an object conflicts with today's parser-lexer use of the token and lexeme terms. A token is an object that may be passed around. A symbol token may have attributes that can be accessed, set or changed.

Test[edit]

A test can be thought of as a boolean type of success or failure. A test can also refer to an action that returns success or failure. A parse formula is a test function defined by a test expression. A test expression is like a boolean expression except for strict left to right order of evaluation. A quoted string is a test, a test usually being a parse action. Generators use unparse rules that return success or failure. A generator function may explicitly return success or failure. As implemented in SLIC, success/failure status is returned in the processor status register.

Token[edit]

A sequence of characters recognized by a token formula, and the object created.

Character Class[edit]

A character class, using the : define class operator, is a named set of characters written as a sequence of alternatives that may only be a character constant or previously defined class name. In SLIC a character's numeric ordinal or its quoted glyph may be used.

Character classes are not compiled into functions. A class table is created and class tests are performed inline. The class table entry indexed by a character's ordinal contains that character's class membership mask. A non-zero result when ANDed with a class_mask indicates class membership. A class bit is only assigned to a class having a character literal. Classes containing only other classes (no character literals) require no additional class bit be assigned. Of course this could change: for a computer having a single bit test instruction it could be that assigning a bit for every class would make for a faster test and/or smaller instruction. A skip_class is pre-defined to contain non-printing and white space characters. Defining a skip_class overrides the default. The built in operator .ANY is like a character class of all possible characters, only failing on end of input. It may be thought of as a character class, but in operation only checks the input stream. (No class mask for .ANY.)

Inconsistency[edit]

An inconsistency is that the generator language uses ( ) grouped arguments, while [ ] are used in the syntax language to group arguments passed to functions coded in other languages.

Specifing rejection[edit]

There is a bit more to these than simple grammar rules. If for example an id is not allowed to end with '_', id could be written:

id .. let $(alphanum|'_' alphanum);

Now an '_' is required to be followed by an alphanum. Which brings up the first-fail rule: once the parse has matched the first test of a sequence, a subsequent failure results in a hard or long failure. In a token formula the input state is backed up to the formula's entry state and failure returned. The terms backtrack and backup describe different processes. Backup simply resets the input state. We can say, defining a lexeme as a string or a token, that backup applies to lexeme processing. A string being a quote demarked sequence of characters. Strings are not normally kept.


SYNTAX[edit]

The parser is programmed using parsing formulae, test functions, resembling phrase structure/constituency grammar rules.

expr = term $(('+'|'-')term);

The above can be explained as a constituency grammar rule. I would agree that it is a constituent rule. It names the constituent, specifying its formation, its formation being the pattern of its constituent parts. Its analytical nature is apparent when transform operations are added:

expr = term $(('+':ADD|'-':SUB)term!2);

Now the analytical nature of the expr syntax formula is apparent. The parser language is also a stack language having two manipulatable stacks. The parse and node stacks are used to build the abstract parse tree. Above, :ADD and :SUB create node objects ADD or SUB and push them onto the node stack. The tree making operator !2 combines the top node stack object and 2 parse stack objects into a 3 element list. The first object comes from the node stack. The last (3rd) object is popped from the parse stack, and then the second list element is again popped from the parse stack. The !<number> pops the top <number> of parse stack entries onto the list in reverse order. The tree transformation is programmed in the parser language.

expr = +[ term $('+' term|'-' term:SUB !1) ]+:SUM!1;

The above will create a summation list of terms: x+y+z-c becomes SUM[[x,y,z,SUB[c]]]

We are programming a recursive descent parser. These formulae were described as reductive grammar rules in some early compiler-compiler documents.

LOOK AHEAD[edit]

These parser programming languages have extensive look ahead and backtracking features.

 ?<parse_expr> is a positive look ahead test.

 -<parse_expr> is a negative look ahead test.

 ,<quoted string> in a syntax formula creates a string object and pushes it on the parse stack. In a token formula it appends the string to the token string.

 $ is a zero or more operator.

 {generator expression} allows generator expressions within a syntax formula. Used primarily to test and/or set symbol attributes. An extended feature in the successor to SLIC.

GENERATOR function[edit]

A generator is in a way like an overloaded function, naming a set of unparse action pairs.

generator = id $('(' unparse $(',' unparse) ')' "=>" action) ';';

A tree matching unparse pattern tests its node and branches to match the passed argument:

gen(ADD[x, y]) => ...
   (SUB[x, y]) => ...

Variables used in an unparse expression are assigned the item at which they reside. Above, in the tree tests, x gets the left branch and y the right. A generator function of course may have multiple unparse expression arguments. A function call within an unparse expression is passed the parameter at which it resides and returns objects to local variables listed as if arguments:

gen(ADD[gen(x),gen(y)]) => <action>
   (SUB[gen(x),gen(y)]) => <action>

Above, gen(x) in the unparse expressions is passed the left branch of the tree and returns an object in x. x is a local variable to the <action>. Likewise for gen(y). The unparse may simply be a variable. Actions are programmed in a LISP 2 dialect, a powerful list processing language that can perform tree-crawling optimization before producing code. A textual notation <NODE NAME>[<tree branch>{,<tree branch> ...}] of tree structures is used. A tree is implemented as a list whose first element is a <NODE NAME> object. Node objects are created and pushed on the node stack in the syntax language using :<node name>. The parse stack is used to construct abstract parse trees. !<number> creates a tree having <number> branches whose <node name> is popped from the node stack. Parse stack entries may be popped and passed to a generator in the syntax language. The sequence gen[*1] used in a syntax formula calls gen. <name>[..] is used in syntax formulae to call functions in other languages. Using *1 as an argument pops the top parse stack entry, passing it as an argument.

PSEUDO -> MACHOP[edit]

SLIC is a single compiler that has 5 sub-languages, with no MOL language. SLIC's SYNTAX and GENERATOR languages are basically the same as CWIC's. The difference is in the code production strategy. In CWIC the process of producing code is called planting. To divide code, data, and so forth, planting is to named sections. CWIC plants numerical values into memory blocks associated with a section. Implemented on an IBM 360, CWIC produced 8 bit byte code targeting IBM 360 processors. It is here that SLIC departs, using a more generalized symbolic approach. SLIC plants pseudo instructions, appending them to a list associated with a section. PSEUDO instructions are procedures, also coded in a LISP 2 dialect quite similar to the generator language's actions. Planted PSEUDO instructions are basically a structure containing the PSEUDO procedure entry point and arguments. PSEUDO instructions function as assembly-like macros. Code is written out using a .FLUSH <section name> statement issued in a generator function. In SLIC a .FLUSH <section name> statement calls the PSEUDO instructions in order and deletes them. Optionally ISOs (In Sequence Optimizers) could, on flushing, be run on the pseudo list before their execution. A PSEUDO instruction may plant PSEUDO instructions, appending them to a section list, or call a PSEUDO directly. To output machine code PSEUDO instructions call MACHOP procedures. MACHOPs define target machine instructions.

MACHOP Language[edit]

A MACHOP's parameter, for example, may be a register, a constant or a memory address. A memory address may be indexed and/or indirect. A mov instruction declaration having two fields:

.MACHOP mov %r,@i(%index)offset

The first field %r, using %, indicates that r must be a register. The second field is more general, allowing a general memory address that may be indexed and/or indirect (indicated by an @). The variables r, index, i, and offset are arguments. Code is output as bit sequences into a bit addressable memory space. Fields are specified with a length and value:

(3): i

The content of the variable i is output as a 3 bit field. A radix may be specified by preceding any field with a radix code:

B binary O octal D decimal H hex

The PSEUDOs and MACHOPs provide a macro assembler like coding environment and assembly formatting of their output to be included in the compile listing. Multiple plant fields may be combined in displaying the assembly output. MACHOP numeric formatting for assembly output:

O(9): mov; (4): r; (1): i; (4): index; O(18): offset

defines bit fields containing 4 bit register and index fields, and a 1 bit field containing the value of i. The fields are combined into 18 bit half-words and printed as octal values in the optional assembly output:

 400020 202625 000013 MOV r1,@11(r5)
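A minimal C sketch of that style of field output into a bit addressable buffer: each (len): value pair appends len bits, most significant bit first. The field widths follow the mov example above; the opcode and operand values are made up for illustration:

 #include <stdio.h>
 
 static unsigned char out[16];
 static int bitpos;
 
 /* Append a len-bit field, most significant bit first. */
 static void field(int len, unsigned value) {
     for (int i = len - 1; i >= 0; i--, bitpos++)
         if (value >> i & 1)
             out[bitpos >> 3] |= 0x80 >> (bitpos & 7);
 }
 
 int main(void) {
     /* O(9): mov; (4): r; (1): i; (4): index; O(18): offset */
     field(9, 0400);     /* hypothetical 9-bit opcode, octal 400  */
     field(4, 1);        /* r = r1                                */
     field(1, 1);        /* i = indirect                          */
     field(4, 5);        /* index = r5                            */
     field(18, 013);     /* offset, octal 13                      */
     for (int i = 0; i * 8 < bitpos; i++)
         printf("%02X ", out[i]);        /* the 36 bits, packed   */
     printf("(%d bits)\n", bitpos);
     return 0;
 }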

Undefined references in MACHOPs cause SLIC to generate polish fix-up blocks that are calculated as components are defined. Calculation may be deferred to the linker for external references. Undefined forward references are displayed as a number of *. Unevaluated fixup blocks, on leaving their creation scope, cause their undefined items to be reported. Gensym symbols may be created and used in generator or pseudo functions.

SLIC design goals[edit]

The goal in developing SLIC was to be able to generate code for any computer architecture of the period. It is still a viable goal. The MACHOP language needs a bit of work to handle scaled and multiple indexing of today's processors. At the time I was developing SLIC there were many computers having various sized addressable units: 48 data bit plus 2, 3 and 4 flag bit ALGOL machines; 6 bit variable length character addressable machines having one and two field demarcation or punctuation bits; 8 and 16 bit minicomputers. The DEC-10 and IBM 7xxx computers were 36 bit word addressable machines. There were many different computer architectures in the late 1960s and early 1970s. I actually started development of SLIC in 1968.

Syntax Directed Compiler[edit]

These are syntax directed compilers, with only an initial setup procedure being called before the driving syntax formula is called. In SLIC there is additional syntax directed setup: the first input is the initiating command line. The command line is parsed and supplied functions called to open files etc.

In the SYNTAX language you write a recursive descent parser using reductive, analytical, grammar formulae. Each formula is a test function that returns success or failure. Test can be thought of as a boolean type; however there is an action (usually parsing) performed, with success or failure returned in the processor's status register. Generators also return success or failure status. Failure occurs if no unparse matches the input, or by a forced failure in an action.

To most people the grammar formula appear no different than the declarative syntax rules that a parser generator takes as input. And it could be argued that the SYNTAX compiler is a parser generator. However, I prefer the term parser programming language to distinguish their differences. I do not know of a parser generator that accommodates programmed syntax directed compilation.

The difference is in the semantics of the language. The compilation is syntax directed, calling a top driving <program> formula:

program = $declaration;

The above defines a program to be a sequence of declarations. program repeatedly calls declaration until it returns failure. A more realistic version might require some setup before starting the compile.

program = setup $declaration cleanup;

Compilers produced by SLIC processed their own command line, opening files and so on. The setup formula would parse the initiating command, opening the source, listing, and object files. An interpreter is easily implemented, as initially the input is from stdin, or the user's TTY on the TOPS-10 system that SLIC ran on.

History[edit]

Historically, the first parser programming language was created in 1963 by Dewey Val Schorre and implemented in his META II metalanguage compiler. Schorre was also on the CWIC development team. META II documentation can be found online, and TREE-META documentation is also available online. TREE-META builds trees similar to CWIC and SLIC. It does not divide the unparsing amongst several generators; it simply has a set of unparse rules initiated by placing a * in a syntax formula to invoke code generation. Generators in CWIC or SLIC are called by the generator name and passed arguments. The same calling syntax is used to call MOL functions. TREE-META does not include list processing or a dictionary (symbol table). It did not have backtracking in the 1968 version; a 1974 TREE-META is documented to have backtracking, opening a grouped sequence of alternatives with (<-, an open parenthesis followed by a left arrow. CWIC and SLIC use a \ backtrack alternative operator. TREE-META also used a different error reporting and recovery method.

The syntax compilers of CWIC and SLIC have functionality providing backtracking alternative operators, token formula, and directed abstract syntax tree building operators. The syntax languages have character class, token, and syntax formula. Syntax and token formula may also include calls to procedures or functions written in other languages.


At the lowest level character classes are defined. Character classes are primarily used in token formula. Unlike regular expressions, you define a name as representing any one instance of a character belonging to a character class:

Character class defines:

bin: '0'|'1';
oct: bin|'2'|'3'|'4'|'5'|'6'|'7';
dec: oct|'8'|'9';
hex: dec|'A'|'B'|'C'|'D'|'E'|'F'|'a'|'b'|'c'|'d'|'e'|'f';

upr: 'A'|'B'|'C'|'D'|'E'|'F'|'G'|'H'|'I'|'J'|'K'|'L'|'M'|'N'|'O'|'P'|'Q'|'R'|'S'|'T'|'U'|'V'|'W'|'X'|'Y'|'Z';
lwr: 'a'|'b'|'c'|'d'|'e'|'f'|'g'|'h'|'i'|'j'|'k'|'l'|'m'|'n'|'o'|'p'|'q'|'r'|'s'|'t'|'u'|'v'|'w'|'x'|'y'|'z';

Character classes create a class bit mask and are combined to generate a class table. A class table entry, indexed by a character's ordinal and ANDed with the class bit mask, determines class membership.
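A minimal sketch of class table membership testing in C, using the classes defined above (the table layout is an assumption consistent with the description):

#include <stdio.h>

enum { BIN = 1 << 0, OCT = 1 << 1, DEC = 1 << 2, HEX = 1 << 3 };

static unsigned char class_table[256]; /* indexed by character ordinal */

static void init_classes(void) {
    for (int c = '0'; c <= '1'; c++) class_table[c] |= BIN;
    for (int c = '0'; c <= '7'; c++) class_table[c] |= OCT;
    for (int c = '0'; c <= '9'; c++) class_table[c] |= DEC | HEX;
    for (int c = 'A'; c <= 'F'; c++) class_table[c] |= HEX;
    for (int c = 'a'; c <= 'f'; c++) class_table[c] |= HEX;
}

/* membership: index by the character's ordinal, AND with the class mask */
static int is_class(int c, unsigned mask) {
    return (class_table[c & 0xFF] & mask) != 0;
}

int main(void) {
    init_classes();
    printf("'7': oct=%d hex=%d\n", is_class('7', OCT), is_class('7', HEX));
    printf("'9': oct=%d dec=%d\n", is_class('9', OCT), is_class('9', DEC));
    return 0;
}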

Lexemes are recognized by quoted strings and token formula. Token formula create token objects. CWIC did not use the term lexeme; the term token was only used in referring to token formula and the objects they create. These languages do not have lexing distinct from parsing. Syntax and token formula are compiled into machine code functions. In these parsers a lexeme is not necessarily a token. A token is an atomic object: a symbol, number, or string in the source language being compiled.

The token formula:

id .. let $(+'_' alphanum|alphanum);

defines an id to be a let followed by zero or more alphanum or _ characters. It must start with a let and end with an alphanum; underscores may be included within the sequence, but each underscore must be followed by an alphanum. By default, when no conversion function is called, the token formula creates a symbol token and catalogs it into the dictionary. Interceding conversion functions may be called as the last operation in a token formula. Interceding functions create typed objects that generally are not cataloged. Supplied conversion functions are MAKEINT, MAKEOCT, MAKEHEX, MAKEBIN, and MAKESTR. Quoted string tests are not normally kept. The +'_' copies the '_' to the token buffer if matched.

SLIC integer[edit]

An integer token formula in SLIC:

integer .. ("0b"|"0B") bin $bin MAKEBIN[] // binary integer |("0o"|"0O") oct $oct MAKEOCT[] // octal integer |("0x"|"0X"|"0h"|"0H") hex $hex MAKEHEX[] // hex integer | dgt $dgt MAKEINT[]; // decimal integer

The above integer token formula illustrates recognizing integers in several bases and calling interceding conversion functions that create integer objects. Token objects are pushed onto the parse stack. Note that quoted string tests do not skip leading white space (skip_class characters) within token formula; token formula do skip leading skip_class characters ahead of the token.

Token transform operations[edit]

Token formula commonly use the ',' insert operator, illustrated by the string formula.

string .. .any | '"' $(-"""" .any | """""",'"') '"' MAKESTR[];

It recognizes a single-quoted single-character string or an indefinite length string bounded by double quote (") characters. A " character may be placed in the string by writing two "" together. In the sequence """""",'"' the """""" recognizes two " characters and ,'"' inserts a single " character into the token string.
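A minimal sketch of the doubled-quote convention in C (scan_string is my own name; the real scanning is compiled from the formula above):

#include <stdio.h>

/* copy a double-quoted string from src into dst, treating "" as one ";
   returns the number of source characters consumed, or -1 on error */
static int scan_string(const char *src, char *dst) {
    const char *p = src;
    if (*p++ != '"') return -1;
    while (*p) {
        if (*p == '"') {
            if (p[1] == '"') { *dst++ = '"'; p += 2; }       /* "" -> " */
            else { *dst = '\0'; return (int)(p + 1 - src); } /* closing " */
        } else {
            *dst++ = *p++; /* ordinary character */
        }
    }
    return -1; /* unterminated string */
}

int main(void) {
    char buf[64];
    scan_string("\"say \"\"hi\"\"\"", buf);
    printf("%s\n", buf); /* say "hi" */
    return 0;
}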

Tree transform operations[edit]

Syntax formula direct the compilation. They parse the input and direct the building of an abstract parse tree.

!<number> creates a tree having <number> branches and pushes it onto the parse stack. A tree is a list whose first element is a node object popped from the node stack. The tree branches are the top <number> parse stack entries, each popped into the last free list position. The list is pushed onto the parse stack. The parse and node stacks are part of the language.

Node objects are created by the sequence :<node name> and pushed on the node stack. Token formula create typed token objects and push them onto the parse stack.
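A minimal sketch of the two stacks and the !2 operation in C; the object representation is an assumption for illustration only:

#include <stdio.h>
#include <stdlib.h>

typedef struct Obj {
    const char *name;      /* node name, or token text for atoms */
    struct Obj *branch[2]; /* tree branches; unused for atoms */
    int nbranch;
} Obj;

static Obj *parse_stack[64];       static int psp = 0;
static const char *node_stack[64]; static int nsp = 0;

static Obj *atom(const char *s) { Obj *o = calloc(1, sizeof *o); o->name = s; return o; }
static void push(Obj *o)        { parse_stack[psp++] = o; }
static void node(const char *n) { node_stack[nsp++] = n; } /* :ADD etc. */

/* !n : pop n parse stack entries and the top node, push the tree */
static void tree(int n) {
    Obj *t = calloc(1, sizeof *t);
    t->name = node_stack[--nsp];
    t->nbranch = n;
    for (int i = n - 1; i >= 0; i--)   /* top of stack becomes the last branch */
        t->branch[i] = parse_stack[--psp];
    push(t);
}

static void print_obj(Obj *o) {
    if (o->nbranch == 0) { printf("%s", o->name); return; }
    printf("%s[", o->name);
    for (int i = 0; i < o->nbranch; i++) { if (i) printf(","); print_obj(o->branch[i]); }
    printf("]");
}

int main(void) {
    /* parsing "a+b": a pushed, :ADD, b pushed, then !2 */
    push(atom("a")); node("ADD"); push(atom("b")); tree(2);
    print_obj(parse_stack[psp - 1]); printf("\n"); /* ADD[a,b] */
    return 0;
}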

arithmetic expression parser[edit]

An arithmetic expression parser coded top-down, starting with the expression defining formula:

expr = term $(('+':ADD|'-':SUB)term!2);

In writing the above formula, term is expected to leave a term on the parse stack, and expr is expected to leave only a single item on the parse stack. That item may be a term or an ADD or SUB tree. The $ is a zero-or-more loop operator similar to a Kleene star. Likewise:

term = factor $(('*':MPY|'/':DIV) factor!2);

is expected to leave a single item on the parse stack.

factor = id | number | '(' expr ')';

id and number are assumed to be token formula.
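As a worked illustration (this trace is reconstructed from the formula semantics, not taken from the original documentation), parsing a+b*c proceeds roughly as follows. term recognizes a, leaving it on the parse stack. The $ loop in expr matches '+' and :ADD pushes an ADD node on the node stack. term is called again: factor recognizes b, the loop in term matches '*', :MPY pushes a node, factor recognizes c, and !2 pops b and c with the MPY node, leaving MPY[b,c] on the parse stack. Back in expr, !2 pops a and MPY[b,c] with the ADD node, leaving the single tree ADD[a,MPY[b,c]]. Operator precedence falls out of the formula nesting; no precedence table is needed.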

One can also create a flat list:

expr = +[term $('+' term|'-' term:NEG!1)]+:SUM!1;

though +[ ... ]+ is more commonly used for function argument lists. The generator language, being LISP 2 based, can manipulate lists. Many optimizations can be performed on the tree in the generator language before producing sequential code.

The earliest META II parser language, using the built-in token recognizers .ID, .NUMBER, and .STRING, parses in primarily the same way, only directly outputting stack machine assembly code, and does not have backtracking. Backtracking, introduced in CWIC, applies only to the syntax formula.

Backtracking is specified using the backslash \ backtracking alternative operator. The left alternative is backtracked on failure and the right alternative attempted. Backtracking in SLIC creates a backtrack stack frame. The input state is saved on starting the left test. Basically only the input state, the buffer pointers, is saved for backup. Token and string tests save the input state, which is restored on failure. Backtracking restores the parse state: stack objects may be released and symbol table changes undone. Token formula by default catalog the recognized string in the dictionary. The dictionary is a symbol table stack. Symbol tables are objects and may be assigned to variables or to a symbol's attribute.
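A minimal sketch in C of what the \ operator does with the saved input state; the State record and names are assumptions (SLIC's backtrack frames also cover stacks and dictionary changes):

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

typedef struct { const char *pos; } State; /* input state: buffer pointer */

static const char *input;

static State save(void)      { State s = { input }; return s; }
static void restore(State s) { input = s.pos; }

static bool match(const char *lit) {
    size_t n = strlen(lit);
    if (strncmp(input, lit, n) == 0) { input += n; return true; }
    return false;
}

/* a \ b : try a; on failure restore the input state and try b */
static bool backtrack_alt(bool (*a)(void), bool (*b)(void)) {
    State s = save();
    if (a()) return true;
    restore(s); /* undo the left side's partial parse */
    return b();
}

static bool left(void)  { return match("for") && match("("); }
static bool right(void) { return match("format"); }

int main(void) {
    input = "format";
    /* left consumes "for" then fails on "("; restoring the input
       lets right match the whole word */
    printf("%d\n", backtrack_alt(left, right)); /* 1 */
    return 0;
}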

Debug features[edit]

SLIC included interactive debugging aids. Debugging provided formula breakpoints and trace points that printed the source line showing the parse point. The parse and node stacks could be displayed.

The pseudo instructions and assembly could be included in the compiler output listing.

Here's a start on a C or C++ compiler: program = $declaration;

cc[edit]

I started on updating SLIC. The new compiler does away with the sub-language declarations and adds C-style commenting. The MACHOPs need updating to handle the indexing modes of new processors.

program = $((declaration          // A program is a sequence of declarations
            | .EOF .STOP)         // terminated by End Of File
           \ ERRORX["Error"]      // Backtrack and report error, flagging furthest parse point
             $-(';'|.EOF .STOP) ';'); // Error recovery: find a ; to continue

declaration = "#" dirictive | comment | global DECLAR[*1] | (id ( grammar PARSER[*1] // Parser function | sequencer GENERATOR[*1] // Generator function | optimizer ISO[*1] // In Sequence Optimizer | pseudo_op PRODUCTION[*1] // Pseudo instruction | emitor_op MACHOP[*1] // Machine instruction ) // All the above start with an identifier \ ERRORX["Syntax error"] garbol); // Backtrack ERROR grammar = (':' class :CLASS // character class define |".." token :TOKEN // token rule |"==" syntax :BCKTRAK // backtrack grammar rule |'=' syntax :SYNTAX // grammar rule. )';'!2 // Combine node name and tree $(-.NL -"/*" .any); Comment to line seperator?

comment =                        // C style comments:
      "//" $(-.NL .ANY)
    | "/*" $(-"*/" .ANY) "*/";

syntax = seq ('|' alt1|'\' alt2 |--);

alt1 = seq:ALT!2 ('|' alt1|--);  // alternative sequence

alt2 = seq:BKTK!2 ('\' alt2|--); // backtrack alternative sequence

seq = +[oper $oper]+;

oper = test | action | '(' syntax ')';

test = string | id ('[' (arg_list|-- ,NILL) ']':GENR!2|--);

These few formulas illustrate the differences of these languages. That backtracking and non-backtracking alternatives are not mixed is enforced by the syntax rules. Backtracking is eliminated by factoring; the only use of backtracking is error reporting and recovery.

COBOL[edit]

In the 1970s COBOL could not be parsed by parser generators. Each COBOL division was different; one, being basically all comment, had no specific rules, and the tokens differed as well.

In the mid-1970s SLIC was used to write a COBOL cross compiler, completed in about four man-months: one programmer full time and me part time. The compiler ran on a DEC-System-10 producing code for a TI-990 minicomputer. It compiled faster, in lines per minute, than the DEC COBOL compiler, which was written in assembly. The TI-990 was an interesting problem, as instructions were word aligned, call and jump instructions had 15-bit addresses, and data references were 8-bit byte addressable.

GENRAD[edit]

In the 1980s at (GenRad and Kontron)/FutureData I worked in language development. FutureData was a small in-circuit emulator (ICE) company purchased by GenRad and later sold to Kontron. Our group developed the compilers and assemblers supplied with our in-circuit development systems. Parser generators were used, but working primarily on the code generation side I had little direct experience using them. Still, that real world experience showed up the same problems brought up here. It was always problematic trying to get the tree transformations to match our code generation requirements. It was frustrating. We had code generation for many microprocessors, but adding a front end parser for a new language was a pain, especially when it required a different parser.

J1[edit]

There were early compiler-compilers that solved all the problems with parser generators discussed here. And that is a confusing statement. Why? Because of the parser generator yacc calling itself a compiler-compiler. It should have been called hacc (half a__ compiler compiler). It discredits so many real compiler-compilers that came before.

variables[edit]

In CWIC a variable may hold any object type. Object types are defined as part of the language. There is only one general structured object type, a list, in which each element, like a variable, may be of any object type. To be clear, every object is structured at the implementation level. A dictionary is part of the system. The dictionary is a symbol table stack. Several of the object types are specific to the writing of compilers. A symbol is an object type that may be assigned attribute variables. Numeric objects may be used in arithmetic expressions.

META II[edit]

META II was first written in META I.[4] META I was a hand-compiled compiler that was then used to implement the improved META II. The history is unclear as to whether META I and META II are different languages.

In its documentation, META II is described as a language resembling Backus normal form (BNF). Most syntax directed compilers of the day were described as BNF-like or as using BNF. It is grossly misleading today to equate META II with BNF. Though META II and BNF are both metalanguages, they are quite different: META II is a programming META "language".


There are differences between META II and BNF; for example, in BNF, an arithmetic expression is defined as:

<expr> := <term> | <expr> <addop> <term>

The BNF rule does not tell us how we should parse an expression, while in META II, the order of testing is specified by the equation. The expr parsing equation written in META II is a conditional expression evaluated left to right:

expr = term ('+' expr / '-' expr / .EMPTY);

Above, the expr equation is defined by the expression to the right of the '='. Evaluating left to right from the '=', term is the first thing that must be tested. If term fails, expr fails. If a term is recognized we then look for a '+'; if that fails the alternate '-' is tried, and finally, if a '-' is not recognized, the alternate .EMPTY always succeeds and expr returns success having recognized a single term. If a '+' or '-' is matched then expr is recursively attempted. In META II an equation is a function returning success or failure. The expr equation can also be expressed using nested grouping as:

expr = term (('+' / '-') expr / .EMPTY);

The code production elements were left out to simplify the example. Due to the limited character set of early computers, the character / was used as the alternative (or) operator. The examples have used a recursive rule to more closely match the BNF rule. However, the evaluation order is then rightmost first: 5-2+1 would be evaluated as 5-(2+1). A loop is used to get the implied grouping (5-2)+1. The $ loop operator is used to match zero or more of something:

expr = term $(('+' / '-') term);

The above can be expressed in English: an expr is a term followed by zero or more plus or minus terms. Schorre describes this as an aid to efficiency, but unlike a naive recursive descent compiler it also ensures that the associativity of arithmetic operations is correct: in modern terms, the parsing expression defines a tail recursive parsing function.
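The difference is easy to demonstrate. A minimal C sketch (single digit terms only; the names are my own) contrasting the loop form with the right recursive form:

#include <stdio.h>

static const char *p;

static int term(void) { return *p++ - '0'; } /* single digits only */

/* expr = term $(('+' / '-') term);  -- loop form, left associative */
static int expr_loop(void) {
    int v = term();
    while (*p == '+' || *p == '-') {
        char op = *p++;
        int t = term();
        v = (op == '+') ? v + t : v - t;
    }
    return v;
}

/* expr = term (('+' / '-') expr / .EMPTY);  -- right recursive form */
static int expr_rec(void) {
    int v = term();
    if (*p == '+' || *p == '-') {
        char op = *p++;
        int r = expr_rec();
        v = (op == '+') ? v + r : v - r;
    }
    return v;
}

int main(void) {
    p = "5-2+1"; printf("loop form: %d\n", expr_loop()); /* (5-2)+1 = 4 */
    p = "5-2+1"; printf("recursive: %d\n", expr_rec());  /* 5-(2+1) = 2 */
    return 0;
}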

Contents[edit]

This contents table started as a markup language test. It reaches some subtitles, so it is left here for the time being.

Introduction.
Metacompiler terminology
Syntax directed translation
Brief history
META I, META II and TREE-META
CWIC
SLIC
Programmed parser
The tree stacks
Character classes
Token rules
Syntax rules
Programming with rules
The driving rule
Grammar rules of a metacompiler
Unparse rules

Useful links[edit]

Source code

Source language (translation)

constituent_name

functions

Formula

propositional logic

Compiler#Compiler construction

Production Quality Compiler-Compiler Project (PQCC)

single pass

multi-pass compiler

optimizations

Notes[edit]

  1. ^ CWIC, Compiler for Writing and Implementing Compilers, the Schorre metalanguage in which the author is most experienced. Developed at SDC, Systems Development Corporation, in the late 1960s.
  2. ^ In computer science syntax refers to grammar rules describing language constructs made up of words.
  3. ^ CWIC grammar rules: grammar includes the lexical specification of words, or what in CWIC are tokens. Tokens include entities such as quoted strings, identifier symbols, numbers, etc.
  4. ^ The list of Schorre metacompilers starts with the hand-coded META I used to write META II. META II was used to write a simplified ALGOL-like language. META II was a proof of concept and a base from which to start metacompiler development. TREE-META followed. Other META II based metacompilers were also developed and used. CWIC was developed at SDC with Schorre on the team. Erwin Book brought the LISP 2 technology that was the basis for CWIC's objects and generator language.
  5. ^ META II "META II is a compiler writing language which consists of syntax equations into which instructions to output assembly language are inserted. The method of writing compilers, given in detail in this paper, may be explained briefly as follows. Each syntax equation is translated into a recursive function which tests the input for a particular phrase structure, and if successful the input is advanced over the phrase. The first compiler written by hand was for a language called META I. META I was then used to implement the improved compiler META II." "META II" (PDF). Retrieved 5 January 2015.
  6. ^ Metacompiler "A metacompiler, in the most general sense of the term, is a program that reads a metalanguage program as input and translates that program into a set of instructions. If the input program is a complete specification of a programming language the translation is a compiler for that language. The metacompiler class of compiler-compiler may be characterized by a common top-down parsing algorithm and a common syntax. These compilers are expressible in their own META language, whence the prefix meta. Backtracking is avoided by extensive use of factoring in the syntax equations." "The TREE-META Compiler-Compiler System: A Meta Compiler System for the Univac 1108 and General Electric 645, University of Utah Technical Report RADC-TR-69-83. C. Stephen Carr, David A. Luther, Sherian Erdmann" (PDF). Retrieved 5 January 2015.
  7. ^ Metacompilers take as input a computer program written in a metaprogramming language and translate that program into a set of instructions. The programming metalanguage belongs to a specialized class of metalanguages that are on a higher level of abstraction describing the structure and composition of a metalanguage. This type of metalanguage is called a metasyntax. It is common to call a compiler by the name of the language it compiles. Like the FORTRAN compiler compiles a program written in the FORTRAN language or a COBOL compiler compiles a COBOL (language) program... A metacompiler is used to compile a program written in a metalanguage.

Cite error: A list-defined reference named "war" is not used in the content (see the help page).
Cite error: A list-defined reference named "APT" is not used in the content (see the help page).
Cite error: A list-defined reference named "GRAMMARFUN" is not used in the content (see the help page).
Cite error: A list-defined reference named "METAGRAMMAR" is not used in the content (see the help page).

Cite error: A list-defined reference named "WRONGLY" is not used in the content (see the help page).
  1. ^ a b c d e META II A SYNTAX-ORIENTED COMPILER WRITING LANGUAGE (Dewey Val Schorre, UCLA Computing Facility, 1964)
  2. ^ a b "A lexeme is a sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token." -- page 111, "Compilers Principles, Techniques, & Tools, 2nd Ed." (WorldCat) by Aho, Lam, Sethi and Ullman Cite error: The named reference "lexeme" was defined multiple times with different content (see the help page).
  3. ^ "A lexeme is a sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token." -- page 111, "Compilers Principles, Techniques, & Tools, 2nd Ed." (WorldCat) by Aho, Lam, Sethi and Ullman, as quoted in http://stackoverflow.com/questions/14954721/what-is-the-difference-between-token-and-lexeme
  4. ^ Dewey, Val Schorre (1963). META II: a syntax-oriented compiler writing language (PDF). UCLA: UCLA Computing Facility.