From e1270f81bdc81f5a575b34c2d2c294bdde3e6f4f Mon Sep 17 00:00:00 2001 From: Nathan Binkert Date: Sun, 16 Aug 2009 13:39:58 -0700 Subject: ply: update PLY to version 3.2 --- ext/ply/doc/internal.html | 874 +++++++++++++++++++++++++++++++++++ ext/ply/doc/makedoc.py | 10 +- ext/ply/doc/ply.html | 1115 +++++++++++++++++++++++++++++---------------- 3 files changed, 1612 insertions(+), 387 deletions(-) create mode 100644 ext/ply/doc/internal.html (limited to 'ext/ply/doc') diff --git a/ext/ply/doc/internal.html b/ext/ply/doc/internal.html new file mode 100644 index 000000000..3fabfe28c --- /dev/null +++ b/ext/ply/doc/internal.html @@ -0,0 +1,874 @@ + + +PLY Internals + + + +

PLY Internals

+ + +David M. Beazley
+dave@dabeaz.com
+
+ +

+PLY Version: 3.0 +

+ + +

+ +
+ + + +

1. Introduction

+ + +This document describes classes and functions that make up the internal +operation of PLY. Using this programming interface, it is possible to +manually build a parser using a different interface specification +than what PLY normally uses. For example, you could build a grammar +from information parsed in a completely different input format. Some of +these objects may be useful for building more advanced parsing engines +such as GLR. + +

+It should be stressed that using PLY at this level is not for the +faint of heart. Generally, it's assumed that you know a bit of +the underlying compiler theory and how an LR parser is put together. + +

2. Grammar Class

+ + +The file ply.yacc defines a class Grammar that +is used to hold and manipulate information about a grammar +specification. It encapsulates the same basic information +about a grammar that is put into a YACC file including +the list of tokens, precedence rules, and grammar rules. +Various operations are provided to perform different validations +on the grammar. In addition, there are operations to compute +the first and follow sets that are needed by the various table +generation algorithms. + +

+Grammar(terminals) + +

+Creates a new grammar object. terminals is a list of strings +specifying the terminals for the grammar. An instance g of +Grammar has the following methods: +
+ +

+g.set_precedence(term,assoc,level) +

+Sets the precedence level and associativity for a given terminal term. +assoc is one of 'right', +'left', or 'nonassoc' and level is a positive integer. The higher +the value of level, the higher the precedence. Here is an example of typical +precedence settings: + +
+g.set_precedence('PLUS',  'left',1)
+g.set_precedence('MINUS', 'left',1)
+g.set_precedence('TIMES', 'left',2)
+g.set_precedence('DIVIDE','left',2)
+g.set_precedence('UMINUS','left',3)
+
+ +This method must be called prior to adding any productions to the +grammar with g.add_production(). The precedence of individual grammar +rules is determined by the precedence of the right-most terminal. + +
+

+g.add_production(name,syms,func=None,file='',line=0) +

+Adds a new grammar rule. name is the name of the rule, +syms is a list of symbols making up the right hand +side of the rule, func is the function to call when +reducing the rule. file and line specify +the filename and line number of the rule and are used for +generating error messages. + +

+The list of symbols in syms may include character +literals and %prec specifiers. Here are some +examples: + +

+g.add_production('expr',['expr','PLUS','term'],func,file,line)
+g.add_production('expr',['expr','"+"','term'],func,file,line)
+g.add_production('expr',['MINUS','expr','%prec','UMINUS'],func,file,line)
+
+ +

+If any kind of error is detected, a GrammarError exception +is raised with a message indicating the reason for the failure. +

+ +

+g.set_start(start=None) +

+Sets the starting rule for the grammar. start is a string +specifying the name of the start rule. If start is omitted, +the first grammar rule added with add_production() is taken to be +the starting rule. This method must always be called after all +productions have been added. +
+ +

+g.find_unreachable() +

+Diagnostic function. Returns a list of all unreachable non-terminals +defined in the grammar. This is used to identify inactive parts of +the grammar specification. +
+ +
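The reachability check described above can be sketched without PLY at all. This is a minimal illustration, not PLY's implementation: the dictionary layout (non-terminal name mapped to a list of right-hand-side tuples) and the function name are assumptions made for the example.

```python
def find_unreachable(productions, start):
    """Return the non-terminals that can never be reached from `start`.

    `productions` maps each non-terminal to a list of right-hand sides
    (tuples of symbols). This layout is hypothetical, not PLY's own.
    """
    reachable = set()
    stack = [start]
    while stack:
        sym = stack.pop()
        if sym in reachable or sym not in productions:
            continue                  # already visited, or a terminal
        reachable.add(sym)
        for rhs in productions[sym]:
            stack.extend(rhs)         # visit every symbol this rule mentions
    return sorted(n for n in productions if n not in reachable)

grammar = {
    'statement': [('ID', '=', 'expr')],
    'expr':      [('expr', 'PLUS', 'term'), ('term',)],
    'term':      [('NUMBER',)],
    'orphan':    [('NUMBER', 'NUMBER')],   # never referenced by any rule
}
unreachable = find_unreachable(grammar, 'statement')
```

Here only 'orphan' is reported, since no chain of rules starting at 'statement' ever mentions it.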

+g.infinite_cycle() +

+Diagnostic function. Returns a list of all non-terminals in the +grammar that result in an infinite cycle. This condition occurs if +there is no way for a grammar rule to expand to a string containing +only terminal symbols. +
+ +
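One way to picture this check is as a fixed-point computation: a non-terminal "terminates" once at least one of its rules consists only of terminals and non-terminals already known to terminate. The sketch below assumes a plain dictionary grammar and a hypothetical function name; it is not taken from ply.yacc.

```python
def find_nonterminating(productions):
    """Return non-terminals that can never expand to a string of terminals.

    `productions` maps each non-terminal to a list of right-hand sides
    (tuples of symbols). Any symbol that is not a key is a terminal.
    """
    terminates = set()
    changed = True
    while changed:
        changed = False
        for name, alternatives in productions.items():
            if name in terminates:
                continue
            for rhs in alternatives:
                # A rule terminates if every symbol is a terminal or a
                # non-terminal already known to terminate.
                if all(s not in productions or s in terminates for s in rhs):
                    terminates.add(name)
                    changed = True
                    break
    return sorted(n for n in productions if n not in terminates)

grammar = {
    'expr': [('expr', 'PLUS', 'term'), ('term',)],
    'term': [('NUMBER',)],
    'loop': [('loop', 'COMMA', 'loop')],   # no rule ever bottoms out
}
cycles = find_nonterminating(grammar)
```

The 'loop' rule is flagged because its only alternative refers back to itself.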

+g.undefined_symbols() +

+Diagnostic function. Returns a list of tuples (name, prod) +corresponding to undefined symbols in the grammar. name is the +name of the undefined symbol and prod is an instance of +Production which has information about the production rule +where the undefined symbol was used. +
+ +

+g.unused_terminals() +

+Diagnostic function. Returns a list of terminals that were defined, +but never used in the grammar. +
+ +

+g.unused_rules() +

+Diagnostic function. Returns a list of Production instances +corresponding to production rules that were defined in the grammar, +but never used anywhere. This is slightly different +than find_unreachable(). +
+ +

+g.unused_precedence() +

+Diagnostic function. Returns a list of tuples (term, assoc) +corresponding to precedence rules that were set, but never used in the +grammar. term is the terminal name and assoc is the +precedence associativity (e.g., 'left', 'right', +or 'nonassoc'). +
+ +

+g.compute_first() +

+Compute all of the first sets for all symbols in the grammar. Returns a dictionary +mapping symbol names to a list of all first symbols. +
+ +
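For readers who want a concrete picture of a first-set computation, here is a small self-contained sketch. It is a textbook fixed-point algorithm, not the code in ply.yacc. The dictionary grammar layout, the use of sets rather than lists, and the '<empty>' marker for an empty expansion are all assumptions of this example.

```python
EMPTY = '<empty>'   # marker for "can derive the empty string" (assumed name)

def compute_first(productions):
    """Map each non-terminal to its first set.

    `productions` maps non-terminals to lists of right-hand sides;
    an empty tuple is an empty production.
    """
    first = {name: set() for name in productions}

    def first_of(symbols):
        result = set()
        for sym in symbols:
            sym_first = first[sym] if sym in productions else {sym}
            result |= sym_first - {EMPTY}
            if EMPTY not in sym_first:
                return result         # this symbol cannot vanish; stop here
        result.add(EMPTY)             # every symbol can vanish (or none given)
        return result

    changed = True
    while changed:                    # repeat until no set grows
        changed = False
        for name, alternatives in productions.items():
            for rhs in alternatives:
                new = first_of(rhs) - first[name]
                if new:
                    first[name] |= new
                    changed = True
    return first

grammar = {
    'expr': [('term', 'tail')],
    'tail': [('PLUS', 'term', 'tail'), ()],
    'term': [('NUMBER',), ('LPAREN', 'expr', 'RPAREN')],
}
first = compute_first(grammar)
```

In this grammar an 'expr' can only begin with NUMBER or LPAREN, while 'tail' can begin with PLUS or be empty.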

+g.compute_follow() +

+Compute all of the follow sets for all non-terminals in the grammar. +The follow set is the set of all possible symbols that might follow a +given non-terminal. Returns a dictionary mapping non-terminal names +to a list of symbols. +
+ +

+g.build_lritems() +

+Calculates all of the LR items for all productions in the grammar. This +step is required before using the grammar for any kind of table generation. +See the section on LR items below. +
+ +

+The following attributes are set by the above methods and may be useful +in code that works with the grammar. All of these attributes should be +assumed to be read-only. Changing their values directly will likely +break the grammar. + +

+g.Productions +

+A list of all productions added. The first entry is reserved for +a production representing the starting rule. The objects in this list +are instances of the Production class, described shortly. +
+ +

+g.Prodnames +

+A dictionary mapping the names of nonterminals to a list of all +productions of that nonterminal. +
+ +

+g.Terminals +

+A dictionary mapping the names of terminals to a list of the +production numbers where they are used. +
+ +

+g.Nonterminals +

+A dictionary mapping the names of nonterminals to a list of the +production numbers where they are used. +
+ +

+g.First +

+A dictionary representing the first sets for all grammar symbols. This is +computed and returned by the compute_first() method. +
+ +

+g.Follow +

+A dictionary representing the follow sets for all grammar rules. This is +computed and returned by the compute_follow() method. +
+ +

+g.Start +

+Starting symbol for the grammar. Set by the set_start() method. +
+ +For the purposes of debugging, a Grammar object supports the __len__() and +__getitem__() special methods. Accessing g[n] returns the nth production +from the grammar. + + +

3. Productions

+ + +Grammar objects store grammar rules as instances of a Production class. This +class has no public constructor--you should only create productions by calling Grammar.add_production(). +The following attributes are available on a Production instance p. + +

+p.name +

+The name of the production. For a grammar rule such as A : B C D, this is 'A'. +
+ +

+p.prod +

+A tuple of symbols making up the right-hand side of the production. For a grammar rule such as A : B C D, this is ('B','C','D'). +
+ +

+p.number +

+Production number. An integer containing the index of the production in the grammar's Productions list. +
+ +

+p.func +

+The name of the reduction function associated with the production. +This is the function that will execute when reducing the entire +grammar rule during parsing. +
+ +

+p.callable +

+The callable object associated with the name in p.func. This is None +unless the production has been bound using bind(). +
+ +

+p.file +

+Filename associated with the production. Typically this is the file where the production was defined. Used for error messages. +
+ +

+p.lineno +

+Line number associated with the production. Typically this is the line number in p.file where the production was defined. Used for error messages. +
+ +

+p.prec +

+Precedence and associativity associated with the production. This is a tuple (assoc,level) where +assoc is one of 'left','right', or 'nonassoc' and level is +an integer. This value is determined by the precedence of the right-most terminal symbol in the production +or by use of the %prec specifier when adding the production. +
+ +

+p.usyms +

+A list of all unique symbols found in the production. +
+ +

+p.lr_items +

+A list of all LR items for this production. This attribute only has a meaningful value if the +Grammar.build_lritems() method has been called. The items in this list are +instances of LRItem described below. +
+ +

+p.lr_next +

+The head of a linked-list representation of the LR items in p.lr_items. +This attribute only has a meaningful value if the Grammar.build_lritems() +method has been called. Each LRItem instance has a lr_next attribute +to move to the next item. The list is terminated by None. +
+ +

+p.bind(dict) +

+Binds the production function name in p.func to a callable object in +dict. This operation is typically carried out in the last step +prior to running the parsing engine and is needed since parsing tables are typically +read from files which only include the function names, not the functions themselves. +
+ +
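The idea behind binding is simple name resolution. The following sketch uses a hypothetical Rule class as a stand-in for a production. It is not PLY's Production class and only mirrors the func/callable split described above.

```python
class Rule(object):
    """Hypothetical stand-in for a production (not PLY's Production)."""
    def __init__(self, name, func):
        self.name = name
        self.func = func        # function *name*, as a table file would store it
        self.callable = None    # filled in by bind()

    def bind(self, pdict):
        # Look up the stored name in the supplied dictionary.
        if self.func:
            self.callable = pdict[self.func]

def p_expr(p):
    return 'reduced expr'

rule = Rule('expr', 'p_expr')
rule.bind({'p_expr': p_expr})   # typically a module or class dictionary
result = rule.callable(None)
```

Keeping only the name until the last moment is what allows tables to be written to, and read back from, a plain Python module.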

+Production objects support +the __len__(), __getitem__(), and __str__() +special methods. +len(p) returns the number of symbols in p.prod +and p[n] is the same as p.prod[n]. + +

4. LRItems

+ + +The construction of parsing tables in an LR-based parser generator is primarily +done over a set of "LR Items". An LR item represents a stage of parsing one +of the grammar rules. To compute the LR items, it is first necessary to +call Grammar.build_lritems(). Once this step is complete, all of the productions +in the grammar will have their LR items attached to them. + +

+Here is an interactive session that shows what LR items look like. +In this example, g is a Grammar +object. + +

+
+>>> g.build_lritems()
+>>> p = g[1]
+>>> p
+Production(statement -> ID = expr)
+>>>
+
+
+ +In the above code, p represents the first grammar rule. In +this case, a rule 'statement -> ID = expr'. + +

+Now, let's look at the LR items for p. + +

+
+>>> p.lr_items
+[LRItem(statement -> . ID = expr), 
+ LRItem(statement -> ID . = expr), 
+ LRItem(statement -> ID = . expr), 
+ LRItem(statement -> ID = expr .)]
+>>>
+
+
+ +In each LR item, the dot (.) represents a specific stage of parsing. In each successive LR item, the dot +is advanced by one symbol. It is only when the dot reaches the very end that a production +is successfully parsed. + +
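The way the dot moves through a production can be reproduced in a few lines of plain Python. This is only an illustration using tuples; the lr_items function name is hypothetical and it does not create real LRItem instances.

```python
def lr_items(name, prod):
    """Return (name, dotted_rhs) pairs, one for each position of the dot."""
    items = []
    for dot in range(len(prod) + 1):
        # Insert the '.' marker before the symbol at index `dot`.
        items.append((name, prod[:dot] + ('.',) + prod[dot:]))
    return items

items = lr_items('statement', ('ID', '=', 'expr'))
for lhs, rhs in items:
    print('%s -> %s' % (lhs, ' '.join(rhs)))
```

A production with three symbols yields four items, matching the listing shown earlier.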

+An instance lr of LRItem has the following +attributes that hold information related to that specific stage of +parsing. + +

+lr.name +

+The name of the grammar rule. For example, 'statement' in the above example. +
+ +

+lr.prod +

+A tuple of symbols representing the right-hand side of the production, including the +special '.' character. For example, ('ID','.','=','expr'). +
+ +

+lr.number +

+An integer representing the production number in the grammar. +
+ +

+lr.usyms +

+A set of unique symbols in the production. Inherited from the original Production instance. +
+ +

+lr.lr_index +

+An integer representing the position of the dot (.). You should never use lr.prod.index() +to search for it--the result will be wrong if the grammar happens to also use (.) as a character +literal. +
+ +

+lr.lr_after +

+A list of all productions that can legally appear immediately to the right of the +dot (.). This list contains Production instances. This attribute +represents all of the possible branches a parse can take from the current position. +For example, suppose that lr represents a stage immediately before +an expression like this: + +
+>>> lr
+LRItem(statement -> ID = . expr)
+>>>
+
+ +Then, the value of lr.lr_after might look like this, showing all productions that +can legally appear next: + +
+>>> lr.lr_after
+[Production(expr -> expr PLUS expr), 
+ Production(expr -> expr MINUS expr), 
+ Production(expr -> expr TIMES expr), 
+ Production(expr -> expr DIVIDE expr), 
+ Production(expr -> MINUS expr), 
+ Production(expr -> LPAREN expr RPAREN), 
+ Production(expr -> NUMBER), 
+ Production(expr -> ID)]
+>>>
+
+ +
+ +

+lr.lr_before +

+The grammar symbol that appears immediately before the dot (.) or None if +at the beginning of the parse. +
+ +

+lr.lr_next +

+A link to the next LR item, representing the next stage of the parse. None if lr +is the last LR item. +
+ +LRItem instances also support the __len__() and __getitem__() special methods. +len(lr) returns the number of items in lr.prod including the dot (.). lr[n] +returns lr.prod[n]. + +

+It goes without saying that all of the attributes associated with LR +items should be assumed to be read-only. Modifications will very +likely create a small black-hole that will consume you and your code. + +

5. LRTable

+ + +The LRTable class is used to represent LR parsing table data. This +minimally includes the production list, action table, and goto table. + +

+LRTable() +

+Create an empty LRTable object. This object contains only the information needed to +run an LR parser. +
+ +An instance lrtab of LRTable has the following methods: + +

+lrtab.read_table(module) +

+Populates the LR table with information from the module specified in module. +module is either a module object already loaded with import or +the name of a Python module. If it's a string containing a module name, it is +loaded and parsing data is extracted. Returns the signature value that was used +when initially writing the tables. Raises a VersionError exception if +the module was created using an incompatible version of PLY. +
+ +

+lrtab.bind_callables(dict) +

+This binds all of the function names used in productions to callable objects +found in the dictionary dict. During table generation and when reading +LR tables from files, PLY only uses the names of action functions such as 'p_expr', +'p_statement', etc. In order to actually run the parser, these names +have to be bound to callable objects. This method is always called prior to +running a parser. +
+ +After lrtab has been populated, the following attributes are defined. + +

+lrtab.lr_method +

+The LR parsing method used (e.g., 'LALR') +
+ + +

+lrtab.lr_productions +

+The production list. If the parsing tables have been newly +constructed, this will be a list of Production instances. If +the parsing tables have been read from a file, it's a list +of MiniProduction instances. This, together +with lr_action and lr_goto, contains all of the +information needed by the LR parsing engine. +
+ +

+lrtab.lr_action +

+The LR action dictionary that implements the underlying state machine. +The keys of this dictionary are the LR states. +
+ +

+lrtab.lr_goto +

+The LR goto table that contains information about grammar rule reductions. +
+ + +

6. LRGeneratedTable

+ + +The LRGeneratedTable class represents constructed LR parsing tables on a +grammar. It is a subclass of LRTable. + +

+LRGeneratedTable(grammar, method='LALR',log=None) +

+Create the LR parsing tables on a grammar. grammar is an instance of Grammar, +method is a string with the parsing method ('SLR' or 'LALR'), and +log is a logger object used to write debugging information. The debugging information +written to log is the same as what appears in the parser.out file created +by yacc. By supplying a custom logger with a different message format, it is possible to get +more information (e.g., the line number in yacc.py used for issuing each line of +output in the log). The result is an instance of LRGeneratedTable. +
+ +

+An instance lr of LRGeneratedTable has the following attributes. + +

+lr.grammar +

+A link to the Grammar object used to construct the parsing tables. +
+ +

+lr.lr_method +

+The LR parsing method used (e.g., 'LALR') +
+ + +

+lr.lr_productions +

+A reference to grammar.Productions. This, together with lr_action and lr_goto, +contains all of the information needed by the LR parsing engine. +
+ +

+lr.lr_action +

+The LR action dictionary that implements the underlying state machine. The keys of this dictionary are +the LR states. +
+ +

+lr.lr_goto +

+The LR goto table that contains information about grammar rule reductions. +
+ +

+lr.sr_conflicts +

+A list of tuples (state,token,resolution) identifying all shift/reduce conflicts. state is the LR state +number where the conflict occurred, token is the token causing the conflict, and resolution is +a string describing the resolution taken. resolution is either 'shift' or 'reduce'. +
+ +

+lr.rr_conflicts +

+A list of tuples (state,rule,rejected) identifying all reduce/reduce conflicts. state is the +LR state number where the conflict occurred, rule is the production rule that was selected +and rejected is the production rule that was rejected. Both rule and rejected are +instances of Production. They can be inspected to provide the user with more information. +
+ +

+There are two public methods of LRGeneratedTable. + +

+lr.write_table(modulename,outputdir="",signature="") +

+Writes the LR parsing table information to a Python module. modulename is a string +specifying the name of a module such as "parsetab". outputdir is the name of a +directory where the module should be created. signature is a string representing a +grammar signature that's written into the output file. This can be used to detect when +the data stored in a module file is out-of-sync with the grammar specification (and that +the tables need to be regenerated). If modulename is a string "parsetab", +this function creates a file called parsetab.py. If the module name represents a +package such as "foo.bar.parsetab", then only the last component, "parsetab" is +used. +
+ + +

7. LRParser

+ + +The LRParser class implements the low-level LR parsing engine. + + +

+LRParser(lrtab, error_func) +

+Create an LRParser. lrtab is an instance of LRTable +containing the LR production and state tables. error_func is the +error function to invoke in the event of a parsing error. +
+ +An instance p of LRParser has the following methods: + +

+p.parse(input=None,lexer=None,debug=0,tracking=0,tokenfunc=None) +

+Run the parser. input is a string, which if supplied is fed into the +lexer using its input() method. lexer is an instance of the +Lexer class to use for tokenizing. If not supplied, the last lexer +created with the lex module is used. debug is a boolean flag +that enables debugging. tracking is a boolean flag that tells the +parser to perform additional line number tracking. tokenfunc is a callable +function that returns the next token. If supplied, the parser will use it to get +all tokens. +
+ +

+p.restart() +

+Resets the parser state for a parse already in progress. +
+ +

8. ParserReflect

+ + +

+The ParserReflect class is used to collect parser specification data +from a Python module or object. This class is what collects all of the +p_rule() functions in a PLY file, performs basic error checking, +and collects all of the needed information to build a grammar. Most of the +high-level PLY interface as used by the yacc() function is actually +implemented by this class. + +

+ParserReflect(pdict, log=None) +

+Creates a ParserReflect instance. pdict is a dictionary +containing parser specification data. This dictionary typically corresponds +to the module or class dictionary of code that implements a PLY parser. +log is a logger instance that will be used to report error +messages. +
+ +An instance p of ParserReflect has the following methods: + +

+p.get_all() +

+Collect and store all required parsing information. +
+ +

+p.validate_all() +

+Validate all of the collected parsing information. This is a separate step +from p.get_all() as a performance optimization. In order to +reduce parser start-up time, a parser can elect to only validate the +parsing data when regenerating the parsing tables. The validation +step tries to collect as much information as possible rather than +raising an exception at the first sign of trouble. The attribute +p.error is set if there are any validation errors. The +value of this attribute is also returned. +
+ +

+p.signature() +

+Compute a signature representing the contents of the collected parsing +data. The signature value should change if anything in the parser +specification has changed in a way that would justify parser table +regeneration. This method can be called after p.get_all(), +but before p.validate_all(). +
+ +The following attributes are set in the process of collecting data: + +

+p.start +

+The grammar start symbol, if any. Taken from pdict['start']. +
+ +

+p.error_func +

+The error handling function or None. Taken from pdict['p_error']. +
+ +

+p.tokens +

+The token list. Taken from pdict['tokens']. +
+ +

+p.prec +

+The precedence specifier. Taken from pdict['precedence']. +
+ +

+p.preclist +

+A parsed version of the precedence specifier. A list of tuples of the form +(token,assoc,level) where token is the terminal symbol, +assoc is the associativity (e.g., 'left') and level +is a numeric precedence level. +
+ +
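To make the shape of this list concrete, the sketch below flattens a precedence table into (token,assoc,level) tuples. It assumes the usual PLY-style table in which each entry is (assoc, term, term, ...) and later entries bind more tightly. The build_preclist helper is hypothetical, and numbering levels from 1 is an assumption of this example.

```python
def build_preclist(precedence):
    """Flatten a precedence table into (token, assoc, level) tuples."""
    preclist = []
    for level, entry in enumerate(precedence, 1):   # levels start at 1 (assumed)
        assoc, terms = entry[0], entry[1:]
        for term in terms:
            preclist.append((term, assoc, level))
    return preclist

precedence = (
    ('left', 'PLUS', 'MINUS'),
    ('left', 'TIMES', 'DIVIDE'),
    ('right', 'UMINUS'),
)
preclist = build_preclist(precedence)
```

Each tuple in the result has the same argument order as g.set_precedence(term,assoc,level).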

+p.grammar +

+A list of tuples (name, rules) representing the grammar rules. name is the +name of a Python function or method in pdict that starts with "p_". +rules is a list of tuples (filename,line,prodname,syms) representing +the grammar rules found in the documentation string of that function. filename and line contain location +information that can be used for debugging. prodname is the name of the +production. syms is the right-hand side of the production. If you have a +function like this + +
+def p_expr(p):
+    '''expr : expr PLUS expr
+            | expr MINUS expr
+            | expr TIMES expr
+            | expr DIVIDE expr'''
+
+ +then the corresponding entry in p.grammar might look like this: + +
+('p_expr', [ ('calc.py',10,'expr', ['expr','PLUS','expr']),
+             ('calc.py',11,'expr', ['expr','MINUS','expr']),
+             ('calc.py',12,'expr', ['expr','TIMES','expr']),
+             ('calc.py',13,'expr', ['expr','DIVIDE','expr'])
+           ])
+
+
+ +
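The mapping from a documentation string to (prodname, syms) pairs can be sketched as follows. This simplified parser is an illustration only: the parse_rules name is hypothetical, it omits the filename and line number fields, and it performs none of the error checking a real implementation would need.

```python
def parse_rules(doc):
    """Split a grammar docstring into (prodname, syms) pairs."""
    rules = []
    lastname = None
    for line in doc.splitlines():
        parts = line.split()
        if not parts:
            continue                        # skip blank lines
        if parts[0] == '|':
            # Continuation: reuse the name of the previous rule.
            name, syms = lastname, parts[1:]
        else:
            # "name : sym sym ..." (parts[1] is the ':' separator)
            name, syms = parts[0], parts[2:]
            lastname = name
        rules.append((name, syms))
    return rules

doc = '''expr : expr PLUS expr
        | expr MINUS expr
        | expr TIMES expr
        | expr DIVIDE expr'''
rules = parse_rules(doc)
```

The four alternatives in the docstring become four separate rules, all named 'expr'.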

+p.pfuncs +

+A sorted list of tuples (line, file, name, doc) representing all of +the p_ functions found. line and file give location +information. name is the name of the function. doc is the +documentation string. This list is sorted in ascending order by line number. +
+ +

+p.files +

+A dictionary holding all of the source filenames that were encountered +while collecting parser information. Only the keys of this dictionary have +any meaning. +
+ +

+p.error +

+An attribute that indicates whether or not any critical errors +occurred in validation. If this is set, it means that some kind +of problem was detected and that no further processing should be +performed. +
+ + +

9. High-level operation

+ + +Using all of the above classes requires some attention to detail. The yacc() +function carries out a very specific sequence of operations to create a grammar. +This same sequence should be emulated if you build an alternative PLY interface. + +
    +
  1. A ParserReflect object is created and raw grammar specification data is +collected. +
  2. A Grammar object is created and populated with information +from the specification data. +
  3. An LRGeneratedTable object is created to run the LALR algorithm over +the Grammar object. +
  4. Productions in the LRGeneratedTable are bound to callables using the bind_callables() +method. +
  5. An LRParser object is created from the information in the +LRGeneratedTable object. +
+ + + + + + + + + + diff --git a/ext/ply/doc/makedoc.py b/ext/ply/doc/makedoc.py index 3eed9bd74..415a53aa0 100644 --- a/ext/ply/doc/makedoc.py +++ b/ext/ply/doc/makedoc.py @@ -93,7 +93,7 @@ for s in lines: result.append("") result.append("") skipspace = 0 - + m = h2.match(s) if m: prevheadingtext = m.group(2) @@ -115,7 +115,7 @@ for s in lines: subsection = 0 subsubsection = 0 subsubsubsection = 0 - skipspace = 1 + skipspace = 1 continue m = h3.match(s) if m: @@ -134,7 +134,7 @@ for s in lines: index += """
  • %s\n""" % (headingname,prevheadingtext) subsubsection = 0 - skipspace = 1 + skipspace = 1 continue m = h4.match(s) if m: @@ -151,7 +151,7 @@ for s in lines: index += " @@ -72,10 +79,26 @@ dave@dabeaz.com
    +

    1. Preface and Requirements

    +

    +This document provides an overview of lexing and parsing with PLY. +Given the intrinsic complexity of parsing, I would strongly advise +that you read (or at least skim) this entire document before jumping +into a big development project with PLY. +

    -

    1. Introduction

    +

    +PLY-3.0 is compatible with both Python 2 and Python 3. Be aware that +Python 3 support is new and has not been extensively tested (although +all of the examples and unit tests pass under Python 3.0). If you are +using Python 2, you should try to use Python 2.4 or newer. Although PLY +works with versions as far back as Python 2.2, some of its optional features +require more modern library modules. +

    + +

    2. Introduction

    PLY is a pure-Python implementation of the popular compiler @@ -95,7 +118,10 @@ include lexical analysis, parsing, type checking, type inference, nested scoping, and code generation for the SPARC processor. Approximately 30 different compiler implementations were completed in this course. Most of PLY's interface and operation has been influenced by common -usability problems encountered by students. +usability problems encountered by students. Since 2001, PLY has +continued to be improved as feedback has been received from users. +PLY-3.0 represents a major refactoring of the original implementation +with an eye towards future enhancements.

    Since PLY was primarily developed as an instructional tool, you will @@ -120,7 +146,7 @@ Techniques, and Tools", by Aho, Sethi, and Ullman. O'Reilly's "Lex and Yacc" by John Levine may also be handy. In fact, the O'Reilly book can be used as a reference for PLY as the concepts are virtually identical. -

    2. PLY Overview

    +

    3. PLY Overview

    PLY consists of two separate modules; lex.py and @@ -163,7 +189,7 @@ parsing tables is relatively expensive, PLY caches the results and saves them to a file. If no changes are detected in the input source, the tables are read from the cache. Otherwise, they are regenerated. -

    3. Lex

    +

    4. Lex

    lex.py is used to tokenize an input string. For example, suppose @@ -206,7 +232,7 @@ More specifically, the input is broken into pairs of token types and values. Fo The identification of tokens is typically done by writing a series of regular expression rules. The next section shows how this is done using lex.py. -

    3.1 Lex Example

    +

    4.1 Lex Example

    The following example shows how lex.py is used to write a simple tokenizer. @@ -243,11 +269,7 @@ t_RPAREN = r'\)' # A regular expression rule with some action code def t_NUMBER(t): r'\d+' - try: - t.value = int(t.value) - except ValueError: - print "Line %d: Number %s is too large!" % (t.lineno,t.value) - t.value = 0 + t.value = int(t.value) return t # Define a rule so we can track line numbers @@ -264,11 +286,14 @@ def t_error(t): t.lexer.skip(1) # Build the lexer -lex.lex() +lexer = lex.lex() -To use the lexer, you first need to feed it some input text using its input() method. After that, repeated calls to token() produce tokens. The following code shows how this works: +To use the lexer, you first need to feed it some input text using +its input() method. After that, repeated calls +to token() produce tokens. The following code shows how this +works:
    @@ -280,11 +305,11 @@ data = '''
     '''
     
     # Give the lexer some input
    -lex.input(data)
    +lexer.input(data)
     
     # Tokenize
    -while 1:
    -    tok = lex.token()
    +while True:
    +    tok = lexer.token()
         if not tok: break      # No more input
         print tok
     
    @@ -308,7 +333,16 @@ LexToken(NUMBER,2,3,21)
    -The tokens returned by lex.token() are instances +Lexers also support the iteration protocol. So, you can write the above loop as follows: + +
    +
    +for tok in lexer:
    +    print tok
    +
    +
    + +The tokens returned by lexer.token() are instances of LexToken. This object has attributes tok.type, tok.value, tok.lineno, and tok.lexpos. The following code shows an example of @@ -317,8 +351,8 @@ accessing these attributes:
     # Tokenize
    -while 1:
    -    tok = lex.token()
    +while True:
    +    tok = lexer.token()
         if not tok: break      # No more input
         print tok.type, tok.value, tok.lineno, tok.lexpos
     
    @@ -330,7 +364,7 @@ type and value of the token itself. the location of the token. tok.lexpos is the index of the token relative to the start of the input text. -

    3.2 The tokens list

    +

    4.2 The tokens list

    All lexers must provide a list tokens that defines all of the possible token @@ -355,7 +389,7 @@ tokens = (
    -

    3.3 Specification of tokens

    +

    4.3 Specification of tokens

    Each token is specified by writing a regular expression rule. Each of these rules are @@ -379,11 +413,7 @@ converts the string into a Python integer.
     def t_NUMBER(t):
         r'\d+'
    -    try:
    -         t.value = int(t.value)
    -    except ValueError:
    -         print "Number %s is too large!" % t.value
    -	 t.value = 0
    +    t.value = int(t.value)
         return t
     
    @@ -414,8 +444,8 @@ expressions in order of decreasing length, this problem is solved for rules defi the order can be explicitly controlled since rules appearing first are checked first.

    -To handle reserved words, it is usually easier to just match an identifier and do a special name lookup in a function -like this: +To handle reserved words, you should write a single rule to match an +identifier and do a special name lookup in a function like this:

    @@ -427,6 +457,8 @@ reserved = {
        ...
     }
     
    +tokens = ['LPAREN','RPAREN',...,'ID'] + list(reserved.values())
    +
     def t_ID(t):
         r'[a-zA-Z_][a-zA-Z_0-9]*'
         t.type = reserved.get(t.value,'ID')    # Check for reserved words
    @@ -449,7 +481,7 @@ t_PRINT = r'print'
     those rules will be triggered for identifiers that include those words as a prefix such as "forget" or "printed".  This is probably not
     what you want.
     
    -

    3.4 Token values

    +

    4.4 Token values

    When tokens are returned by lex, they have a value that is stored in the value attribute. Normally, the value is the text @@ -468,9 +500,10 @@ def t_ID(t):
    It is important to note that storing data in other attribute names is not recommended. The yacc.py module only exposes the -contents of the value attribute. Thus, accessing other attributes may be unnecessarily awkward. +contents of the value attribute. Thus, accessing other attributes may be unnecessarily awkward. If you +need to store multiple values on a token, assign a tuple, dictionary, or instance to value. -

    3.5 Discarded tokens

    +

    4.5 Discarded tokens

    To discard a token, such as a comment, simply define a token rule that returns no value. For example: @@ -496,7 +529,7 @@ Be advised that if you are ignoring many different kinds of text, you may still control over the order in which regular expressions are matched (i.e., functions are matched in order of specification whereas strings are sorted by regular expression length). -

    3.6 Line numbers and positional information

    +

    4.6 Line numbers and positional information

    By default, lex.py knows nothing about line numbers. This is because lex.py doesn't know anything @@ -525,11 +558,10 @@ column information as a separate step. For instance, just count backwards unti # input is the input text string # token is a token instance def find_column(input,token): - i = token.lexpos - while i > 0: - if input[i] == '\n': break - i -= 1 - column = (token.lexpos - i)+1 + last_cr = input.rfind('\n',0,token.lexpos) + if last_cr < 0: + last_cr = 0 + column = (token.lexpos - last_cr) + 1 return column @@ -537,7 +569,7 @@ def find_column(input,token): Since column information is often only useful in the context of error handling, calculating the column position can be performed when needed as opposed to doing it for each token. -
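A possible variant of the column computation, sketched under the assumption that columns should be 1-based on every line (including the first). The `Token` namedtuple is a hypothetical stand-in for a PLY token; only its `lexpos` attribute matters here.

```python
from collections import namedtuple

# Hypothetical stand-in for a PLY token; only lexpos is used.
Token = namedtuple('Token', ['type', 'value', 'lexpos'])

def find_column(text, token):
    # rfind returns -1 when no newline precedes the token, so adding 1
    # gives index 0 (start of input) for tokens on the first line.
    line_start = text.rfind('\n', 0, token.lexpos) + 1
    return (token.lexpos - line_start) + 1

data = "x = 3\ny = 42\n"
tok = Token('NUMBER', 42, data.index('42'))
column = find_column(data, tok)
```

Here the column is computed only when it is asked for, which matches the advice to defer the work until an error actually needs to be reported.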

    3.7 Ignored characters

    +

    4.7 Ignored characters

    @@ -549,7 +581,7 @@ similar to t_newline(), the use of t_ignore provides substanti lexing performance because it is handled as a special case and is checked in a much more efficient manner than the normal regular expression rules. -

    3.8 Literal characters

    +

    4.8 Literal characters

    @@ -575,7 +607,7 @@ take precedence.

    When a literal token is returned, both its type and value attributes are set to the character itself. For example, '+'. -

    3.9 Error handling

    +

    4.9 Error handling

    @@ -596,44 +628,42 @@ def t_error(t): In this case, we simply print the offending character and skip ahead one character by calling t.lexer.skip(1). -

    3.10 Building and using the lexer

    +

    4.10 Building and using the lexer

    To build the lexer, the function lex.lex() is used. This function uses Python reflection (or introspection) to read the regular expression rules -out of the calling context and build the lexer. Once the lexer has been built, two functions can +out of the calling context and build the lexer. Once the lexer has been built, two methods can be used to control the lexer.

    -If desired, the lexer can also be used as an object. The lex() returns a Lexer object that -can be used for this purpose. For example: +The preferred way to use PLY is to invoke the above methods directly on the lexer object returned by the +lex() function. The legacy interface to PLY involves module-level functions lex.input() and lex.token(). +For example:
    -lexer = lex.lex()
    -lexer.input(sometext)
    +lex.lex()
    +lex.input(sometext)
     while 1:
    -    tok = lexer.token()
    +    tok = lex.token()
         if not tok: break
         print tok
     

    -This latter technique should be used if you intend to use multiple lexers in your application. Simply define each -lexer in its own module and use the object returned by lex() as appropriate. +In this example, the module-level functions lex.input() and lex.token() are bound to the input() +and token() methods of the last lexer created by the lex module. This interface may go away at some point so +it's probably best not to use it. -

    -Note: The global functions lex.input() and lex.token() are bound to the input() -and token() methods of the last lexer created by the lex module. - -

    3.11 The @TOKEN decorator

    +

    4.11 The @TOKEN decorator

    In some applications, you may want to build tokens from a series of @@ -680,7 +710,7 @@ t_ID.__doc__ = identifier NOTE: Use of @TOKEN requires Python-2.4 or newer. If you're concerned about backwards compatibility with older versions of Python, use the alternative approach of setting the docstring directly. -

    3.12 Optimized mode

    +

    4.12 Optimized mode

    For improved performance, it may be desirable to use Python's @@ -717,7 +747,7 @@ lexer = lex.lex(optimize=1,lextab="footab") When running in optimized mode, it is important to note that lex disables most error checking. Thus, this is really only recommended if you're sure everything is working correctly and you're ready to start releasing production code. -

    3.13 Debugging

    +

    4.13 Debugging

    For the purpose of debugging, you can run lex() in a debugging mode as follows: @@ -728,12 +758,16 @@ lexer = lex.lex(debug=1) -This will result in a large amount of debugging information to be printed including all of the added rules and the master -regular expressions. +

    +This will produce various sorts of debugging information including all of the added rules, +the master regular expressions used by the lexer, and tokens generated during lexing. +

    +

    In addition, lex.py comes with a simple main function which will either tokenize input read from standard input or from a file specified on the command line. To use it, simply put this in your lexer: +

    @@ -742,7 +776,10 @@ if __name__ == '__main__':
     
    -

    3.14 Alternative specification of lexers

    +Please refer to the "Debugging" section near the end for some more advanced details +of debugging. + +

    4.14 Alternative specification of lexers

    As shown in the example, lexers are specified all within one Python module. If you want to @@ -780,11 +817,7 @@ t_RPAREN = r'\)' # A regular expression rule with some action code def t_NUMBER(t): r'\d+' - try: - t.value = int(t.value) - except ValueError: - print "Line %d: Number %s is too large!" % (t.lineno,t.value) - t.value = 0 + t.value = int(t.value) return t # Define a rule so we can track line numbers @@ -821,7 +854,7 @@ None -The object option can be used to define lexers as a class instead of a module. For example: +The module option can also be used to define lexers from instances of a class. For example:
    @@ -851,11 +884,7 @@ class MyLexer:
         # Note addition of self parameter since we're in a class
         def t_NUMBER(self,t):
             r'\d+'
    -        try:
    -             t.value = int(t.value)    
    -        except ValueError:
    -             print "Line %d: Number %s is too large!" % (t.lineno,t.value)
    -             t.value = 0
    +        t.value = int(t.value)    
             return t
     
         # Define a rule so we can track line numbers
    @@ -873,12 +902,12 @@ class MyLexer:
     
         # Build the lexer
         def build(self,**kwargs):
    -        self.lexer = lex.lex(object=self, **kwargs)
    +        self.lexer = lex.lex(module=self, **kwargs)
         
         # Test it output
         def test(self,data):
             self.lexer.input(data)
    -        while 1:
    +        while True:
                  tok = lexer.token()
                  if not tok: break
                  print tok
    @@ -890,14 +919,81 @@ m.test("3 + 4")     # Test it
     
    -For reasons that are subtle, you should NOT invoke lex.lex() inside the __init__() method of your class. If you -do, it may cause bizarre behavior if someone tries to duplicate a lexer object. Keep reading. -

    3.15 Maintaining state

    +When building a lexer from a class, you should construct the lexer from +an instance of the class, not the class object itself. This is because +PLY only works properly if the lexer actions are defined by bound methods. + +

    +When using the module option to lex(), PLY collects symbols +from the underlying object using the dir() function. There is no +direct access to the __dict__ attribute of the object supplied as a +module value. + +

    +Finally, if you want to keep things nicely encapsulated, but don't want to use a +full-fledged class definition, lexers can be defined using closures. For example: + +

    +
    +import ply.lex as lex
    +
    +# List of token names.   This is always required
    +tokens = (
    +  'NUMBER',
    +  'PLUS',
    +  'MINUS',
    +  'TIMES',
    +  'DIVIDE',
    +  'LPAREN',
    +  'RPAREN',
    +)
    +
    +def MyLexer():
    +    # Regular expression rules for simple tokens
    +    t_PLUS    = r'\+'
    +    t_MINUS   = r'-'
    +    t_TIMES   = r'\*'
    +    t_DIVIDE  = r'/'
    +    t_LPAREN  = r'\('
    +    t_RPAREN  = r'\)'
    +
    +    # A regular expression rule with some action code
    +    def t_NUMBER(t):
    +        r'\d+'
    +        t.value = int(t.value)    
    +        return t
    +
    +    # Define a rule so we can track line numbers
    +    def t_newline(t):
    +        r'\n+'
    +        t.lexer.lineno += len(t.value)
    +
    +    # A string containing ignored characters (spaces and tabs)
    +    t_ignore  = ' \t'
    +
    +    # Error handling rule
    +    def t_error(t):
    +        print "Illegal character '%s'" % t.value[0]
    +        t.lexer.skip(1)
    +
    +    # Build the lexer from my environment and return it    
    +    return lex.lex()
    +
    +
    + + +

    4.15 Maintaining state

    -In your lexer, you may want to maintain a variety of state information. This might include mode settings, symbol tables, and other details. There are a few -different ways to handle this situation. First, you could just keep some global variables: +In your lexer, you may want to maintain a variety of state +information. This might include mode settings, symbol tables, and +other details. As an example, suppose that you wanted to keep +track of how many NUMBER tokens had been encountered. + +

    +One way to do this is to keep a set of global variables in the module +where you created the lexer. For example:

    @@ -906,28 +1002,22 @@ def t_NUMBER(t):
         r'\d+'
         global num_count
         num_count += 1
    -    try:
    -         t.value = int(t.value)    
    -    except ValueError:
    -         print "Line %d: Number %s is too large!" % (t.lineno,t.value)
    -	 t.value = 0
    +    t.value = int(t.value)    
         return t
     
    -Alternatively, you can store this information inside the Lexer object created by lex(). To this, you can use the lexer attribute -of tokens passed to the various rules. For example: +If you don't like the use of a global variable, another place to store +information is inside the Lexer object created by lex(). +To do this, you can use the lexer attribute of tokens passed to +the various rules. For example:
     def t_NUMBER(t):
         r'\d+'
         t.lexer.num_count += 1     # Note use of lexer attribute
    -    try:
    -         t.value = int(t.value)    
    -    except ValueError:
    -         print "Line %d: Number %s is too large!" % (t.lineno,t.value)
    -	 t.value = 0
    +    t.value = int(t.value)    
         return t
     
     lexer = lex.lex()
    @@ -935,17 +1025,20 @@ lexer.num_count = 0            # Set the initial count
     
    -This latter approach has the advantage of storing information inside -the lexer itself---something that may be useful if multiple instances -of the same lexer have been created. However, it may also feel kind -of "hacky" to the purists. Just to put their mind at some ease, all +This latter approach has the advantage of being simple and working +correctly in applications where multiple instantiations of a given +lexer exist in the same application. However, this might also feel +like a gross violation of encapsulation to OO purists. +Just to put your mind at some ease, all internal attributes of the lexer (with the exception of lineno) have names that are prefixed by lex (e.g., lexdata, lexpos, etc.). Thus, -it should be perfectly safe to store attributes in the lexer that -don't have names starting with that prefix. +it is perfectly safe to store attributes in the lexer that +don't have names starting with that prefix or a name that conflicts with one of the +predefined methods (e.g., input(), token(), etc.).

    -A third approach is to define the lexer as a class as shown in the previous example: +If you don't like assigning values on the lexer object, you can define your lexer as a class as +shown in the previous section:

    @@ -954,11 +1047,7 @@ class MyLexer:
         def t_NUMBER(self,t):
             r'\d+'
             self.num_count += 1
    -        try:
    -             t.value = int(t.value)    
    -        except ValueError:
    -             print "Line %d: Number %s is too large!" % (t.lineno,t.value)
    -             t.value = 0
    +        t.value = int(t.value)    
             return t
     
         def build(self, **kwargs):
    @@ -966,23 +1055,36 @@ class MyLexer:
     
         def __init__(self):
             self.num_count = 0
    -
    -# Create a lexer 
    -m = MyLexer()
    -lexer = lex.lex(object=m)
     
    -The class approach may be the easiest to manage if your application is going to be creating multiple instances of the same lexer and -you need to manage a lot of state. +The class approach may be the easiest to manage if your application is +going to be creating multiple instances of the same lexer and you need +to manage a lot of state. -

    3.16 Duplicating lexers

    +

    +State can also be managed through closures. For example, in Python 3: +

    +
    +def MyLexer():
    +    num_count = 0
    +    ...
    +    def t_NUMBER(t):
    +        r'\d+'
    +        nonlocal num_count
    +        num_count += 1
    +        t.value = int(t.value)    
    +        return t
    +    ...
    +
    +
    + +

    4.16 Lexer cloning

    -NOTE: I am thinking about deprecating this feature. Post comments on ply-hack@googlegroups.com or send me a private email at dave@dabeaz.com.

    -If necessary, a lexer object can be quickly duplicated by invoking its clone() method. For example: +If necessary, a lexer object can be duplicated by invoking its clone() method. For example:

    @@ -992,23 +1094,25 @@ newlexer = lexer.clone()
     
    -When a lexer is cloned, the copy is identical to the original lexer, -including any input text. However, once created, different text can be -fed to the clone which can be used independently. This capability may -be useful in situations when you are writing a parser/compiler that +When a lexer is cloned, the copy is exactly identical to the original lexer +including any input text and internal state. However, the clone allows a +different set of input text to be supplied which may be processed separately. +This may be useful in situations when you are writing a parser/compiler that involves recursive or reentrant processing. For instance, if you needed to scan ahead in the input for some reason, you could create a -clone and use it to look ahead. +clone and use it to look ahead. Or, if you were implementing some kind of preprocessor, +cloned lexers could be used to handle different input files.

    -The advantage of using clone() instead of reinvoking lex() is -that it is significantly faster. Namely, it is not necessary to re-examine all of the -token rules, build a regular expression, and construct internal tables. All of this -information can simply be reused in the new lexer. +Creating a clone is different than calling lex.lex() in that +PLY doesn't regenerate any of the internal tables or regular expressions. So,

    -Special considerations need to be made when cloning a lexer that is defined as a class. Previous sections -showed an example of a class MyLexer. If you have the following code: +Special considerations need to be made when cloning lexers that also +maintain their own internal state using classes or closures. Namely, +you need to be aware that the newly created lexers will share all of +this state with the original lexer. For example, if you defined a +lexer as a class and did this:

    @@ -1020,43 +1124,12 @@ b = a.clone()              # Clone the lexer
     
    Then both a and b are going to be bound to the same -object m. If the object m contains internal state -related to lexing, this sharing may lead to quite a bit of confusion. To fix this, -the clone() method accepts an optional argument that can be used to supply a new object. This -can be used to clone the lexer and bind it to a new instance. For example: +object m and any changes to m will be reflected in both lexers. It's +important to emphasize that clone() is only meant to create a new lexer +that reuses the regular expressions and environment of another lexer. If you +need to make a totally new copy of a lexer, then call lex() again. -
    -
    -m = MyLexer()              # Create a lexer
    -a = lex.lex(object=m)
    -
    -# Create a clone 
    -n = MyLexer()              # New instance of MyLexer
    -b = a.clone(n)             # New lexer bound to n
    -
    -
    - -It may make sense to encapsulate all of this inside a method: - -
    -
    -class MyLexer:
    -     ...
    -     def clone(self):
    -         c = MyLexer()        # Create a new instance of myself
    -         # Copy attributes from self to c as appropriate
    -         ...
    -         # Clone the lexer
    -         c.lexer = self.lexer.clone(c)
    -         return c
    -
    -
    - -The fact that a new instance of MyLexer may be created while cloning a lexer is the reason why you should never -invoke lex.lex() inside __init__(). If you do, the lexer will be rebuilt from scratch and you lose -all of the performance benefits of using clone() in the first place. - -

    3.17 Internal lexer state

    +

    4.17 Internal lexer state

    A Lexer object lexer has a number of internal attributes that may be useful in certain @@ -1074,8 +1147,9 @@ matched at the new position.

    lexer.lineno

    -The current value of the line number attribute stored in the lexer. This can be modified as needed to -change the line number. +The current value of the line number attribute stored in the lexer. PLY only specifies that the attribute +exists---it never sets, updates, or performs any processing with it. If you want to track line numbers, +you will need to add code yourself (see the section on line numbers and positional information).

    @@ -1090,9 +1164,10 @@ would probably be a bad idea to modify this unless you really know what you're d

    This is the raw Match object returned by the Python re.match() function (used internally by PLY) for the current token. If you have written a regular expression that contains named groups, you can use this to retrieve those values. +Note: This attribute is only updated when tokens are defined and processed by functions.
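As a rough illustration of retrieving named groups, the sketch below uses the standard `re` module directly. It assumes that, inside a token function, `t.lexer.lexmatch` can be queried like the ordinary Match object `m` used here; the pattern and group names are hypothetical.

```python
import re

# Hypothetical token pattern containing two named groups.
pattern = re.compile(r'(?P<name>[a-zA-Z_]\w*)\s*=\s*(?P<number>\d+)')

m = pattern.match('count = 10')

# Named groups are looked up on the Match object by name.
name = m.group('name')
number = int(m.group('number'))
```

In a function-based rule, the same `group()` calls would be made on the match stored in the lexer rather than on a locally created one.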
    -

    3.18 Conditional lexing and start conditions

    +

    4.18 Conditional lexing and start conditions

    In advanced parsing applications, it may be useful to have different @@ -1291,7 +1366,7 @@ However, if the closing right brace is encountered, the rule t_ccode_rbrace< position), stores it, and returns a token 'CCODE' containing all of that text. When returning the token, the lexing state is restored back to its initial state. -

    3.19 Miscellaneous Issues

    +

    4.19 Miscellaneous Issues

    @@ -1331,7 +1406,7 @@ tokens are available.

  • The token() method must return an object tok that has type and value attributes. -

    4. Parsing basics

    +

    5. Parsing basics

    yacc.py is used to parse language syntax. Before showing an @@ -1357,9 +1432,10 @@ factor : NUMBER In the grammar, symbols such as NUMBER, +, -, *, and / are known -as terminals and correspond to raw input tokens. Identifiers such as term and factor refer to more -complex rules, typically comprised of a collection of tokens. These identifiers are known as non-terminals. +as terminals and correspond to raw input tokens. Identifiers such as term and factor refer to +grammar rules comprised of a collection of terminals and other rules. These identifiers are known as non-terminals.

    + The semantic behavior of a language is often specified using a technique known as syntax directed translation. In syntax directed translation, attributes are attached to each symbol in a given grammar @@ -1385,9 +1461,12 @@ factor : NUMBER factor.val = int(NUMBER.lexval) -A good way to think about syntax directed translation is to simply think of each symbol in the grammar as some -kind of object. The semantics of the language are then expressed as a collection of methods/operations on these -objects. +A good way to think about syntax directed translation is to +view each symbol in the grammar as a kind of object. Associated +with each symbol is a value representing its "state" (for example, the +val attribute above). Semantic +actions are then expressed as a collection of functions or methods +that operate on the symbols and associated values.
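The idea of attaching a value to each symbol and computing it with per-rule actions can be sketched in plain Python 3. This is not how yacc.py is implemented; the tuple encoding of the parse tree and the `actions` table are assumptions made purely for illustration.

```python
# Each grammar rule gets an action that computes the value (the "val"
# attribute) of the left-hand symbol from the values of its right-hand symbols.
actions = {
    'expr : expr + term':   lambda e, t: e + t,
    'expr : expr - term':   lambda e, t: e - t,
    'expr : term':          lambda t: t,
    'term : term * factor': lambda t, f: t * f,
    'term : factor':        lambda f: f,
    'factor : NUMBER':      lambda lexval: int(lexval),
}

def evaluate(node):
    """node is (rule, child, ...); leaves are raw token strings."""
    if isinstance(node, str):
        return node
    rule, *children = node
    # Evaluate the children first, then apply the action for this rule.
    return actions[rule](*(evaluate(c) for c in children))

# Hand-built parse tree for 3 + 5 * 2 (operator tokens are left out of the tree).
three = ('expr : term', ('term : factor', ('factor : NUMBER', '3')))
product = ('term : term * factor',
           ('term : factor', ('factor : NUMBER', '5')),
           ('factor : NUMBER', '2'))
tree = ('expr : expr + term', three, product)
result = evaluate(tree)
```

Each entry in `actions` plays the role of one semantic action: it sees only the values of the symbols on the right-hand side and produces the value of the symbol on the left.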

    Yacc uses a parsing technique known as LR-parsing or shift-reduce parsing. LR parsing is a @@ -1396,62 +1475,78 @@ Whenever a valid right-hand-side is found in the input, the appropriate action c grammar symbols are replaced by the grammar symbol on the left-hand-side.

    -LR parsing is commonly implemented by shifting grammar symbols onto a stack and looking at the stack and the next -input token for patterns. The details of the algorithm can be found in a compiler text, but the -following example illustrates the steps that are performed if you wanted to parse the expression -3 + 5 * (10 - 20) using the grammar defined above: +LR parsing is commonly implemented by shifting grammar symbols onto a +stack and looking at the stack and the next input token for patterns that +match one of the grammar rules. +The details of the algorithm can be found in a compiler textbook, but the +following example illustrates the steps that are performed if you +wanted to parse the expression +3 + 5 * (10 - 20) using the grammar defined above. In the example, +the special symbol $ represents the end of input. +

     Step Symbol Stack           Input Tokens            Action
     ---- ---------------------  ---------------------   -------------------------------
    -1    $                      3 + 5 * ( 10 - 20 )$    Shift 3
    -2    $ 3                      + 5 * ( 10 - 20 )$    Reduce factor : NUMBER
    -3    $ factor                 + 5 * ( 10 - 20 )$    Reduce term   : factor
    -4    $ term                   + 5 * ( 10 - 20 )$    Reduce expr : term
    -5    $ expr                   + 5 * ( 10 - 20 )$    Shift +
    -6    $ expr +                   5 * ( 10 - 20 )$    Shift 5
    -7    $ expr + 5                   * ( 10 - 20 )$    Reduce factor : NUMBER
    -8    $ expr + factor              * ( 10 - 20 )$    Reduce term   : factor
    -9    $ expr + term                * ( 10 - 20 )$    Shift *
    -10   $ expr + term *                ( 10 - 20 )$    Shift (
    -11   $ expr + term * (                10 - 20 )$    Shift 10
    -12   $ expr + term * ( 10                - 20 )$    Reduce factor : NUMBER
    -13   $ expr + term * ( factor            - 20 )$    Reduce term : factor
    -14   $ expr + term * ( term              - 20 )$    Reduce expr : term
    -15   $ expr + term * ( expr              - 20 )$    Shift -
    -16   $ expr + term * ( expr -              20 )$    Shift 20
    -17   $ expr + term * ( expr - 20              )$    Reduce factor : NUMBER
    -18   $ expr + term * ( expr - factor          )$    Reduce term : factor
    -19   $ expr + term * ( expr - term            )$    Reduce expr : expr - term
    -20   $ expr + term * ( expr                   )$    Shift )
    -21   $ expr + term * ( expr )                  $    Reduce factor : (expr)
    -22   $ expr + term * factor                    $    Reduce term : term * factor
    -23   $ expr + term                             $    Reduce expr : expr + term
    -24   $ expr                                    $    Reduce expr
    -25   $                                         $    Success!
    -
    -
    - -When parsing the expression, an underlying state machine and the current input token determine what to do next. -If the next token looks like part of a valid grammar rule (based on other items on the stack), it is generally shifted -onto the stack. If the top of the stack contains a valid right-hand-side of a grammar rule, it is -usually "reduced" and the symbols replaced with the symbol on the left-hand-side. When this reduction occurs, the -appropriate action is triggered (if defined). If the input token can't be shifted and the top of stack doesn't match -any grammar rules, a syntax error has occurred and the parser must take some kind of recovery step (or bail out). - -

    -It is important to note that the underlying implementation is built around a large finite-state machine that is encoded -in a collection of tables. The construction of these tables is quite complicated and beyond the scope of this discussion. -However, subtle details of this process explain why, in the example above, the parser chooses to shift a token -onto the stack in step 9 rather than reducing the rule expr : expr + term. - -

    5. Yacc reference

    - - -This section describes how to use write parsers in PLY. - -

    5.1 An example

    +1                           3 + 5 * ( 10 - 20 )$    Shift 3
    +2    3                        + 5 * ( 10 - 20 )$    Reduce factor : NUMBER
    +3    factor                   + 5 * ( 10 - 20 )$    Reduce term : factor
    +4    term                     + 5 * ( 10 - 20 )$    Reduce expr : term
    +5    expr                     + 5 * ( 10 - 20 )$    Shift +
    +6    expr +                     5 * ( 10 - 20 )$    Shift 5
    +7    expr + 5                     * ( 10 - 20 )$    Reduce factor : NUMBER
    +8    expr + factor                * ( 10 - 20 )$    Reduce term : factor
    +9    expr + term                  * ( 10 - 20 )$    Shift *
    +10   expr + term *                  ( 10 - 20 )$    Shift (
    +11   expr + term * (                  10 - 20 )$    Shift 10
    +12   expr + term * ( 10                  - 20 )$    Reduce factor : NUMBER
    +13   expr + term * ( factor              - 20 )$    Reduce term : factor
    +14   expr + term * ( term                - 20 )$    Reduce expr : term
    +15   expr + term * ( expr                - 20 )$    Shift -
    +16   expr + term * ( expr -                20 )$    Shift 20
    +17   expr + term * ( expr - 20                )$    Reduce factor : NUMBER
    +18   expr + term * ( expr - factor            )$    Reduce term : factor
    +19   expr + term * ( expr - term              )$    Reduce expr : expr - term
    +20   expr + term * ( expr                     )$    Shift )
    +21   expr + term * ( expr )                    $    Reduce factor : (expr)
    +22   expr + term * factor                      $    Reduce term : term * factor
    +23   expr + term                               $    Reduce expr : expr + term
    +24   expr                                      $    Reduce expr
    +25                                             $    Success!
    + + +
    +When parsing the expression, an underlying state machine and the +current input token determine what happens next. If the next token +looks like part of a valid grammar rule (based on other items on the +stack), it is generally shifted onto the stack. If the top of the +stack contains a valid right-hand-side of a grammar rule, it is +usually "reduced" and the symbols replaced with the symbol on the +left-hand-side. When this reduction occurs, the appropriate action is +triggered (if defined). If the input token can't be shifted and the +top of stack doesn't match any grammar rules, a syntax error has +occurred and the parser must take some kind of recovery step (or bail +out). A parse is only successful if the parser reaches a state where +the symbol stack is empty and there are no more input tokens. + +

    +It is important to note that the underlying implementation is built +around a large finite-state machine that is encoded in a collection of +tables. The construction of these tables is non-trivial and +beyond the scope of this discussion. However, subtle details of this +process explain why, in the example above, the parser chooses to shift +a token onto the stack in step 9 rather than reducing the +rule expr : expr + term. + +

    6. Yacc

    + + +The ply.yacc module implements the parsing component of PLY. +The name "yacc" stands for "Yet Another Compiler Compiler" and is +borrowed from the Unix tool of the same name. + +

    6.1 An example

    Suppose you wanted to make a grammar for simple arithmetic expressions as previously described. Here is @@ -1503,26 +1598,26 @@ def p_error(p): print "Syntax error in input!" # Build the parser -yacc.yacc() - -# Use this if you want to build the parser using SLR instead of LALR -# yacc.yacc(method="SLR") +parser = yacc.yacc() -while 1: +while True: try: s = raw_input('calc > ') except EOFError: break if not s: continue - result = yacc.parse(s) + result = parser.parse(s) print result -In this example, each grammar rule is defined by a Python function where the docstring to that function contains the -appropriate context-free grammar specification. Each function accepts a single -argument p that is a sequence containing the values of each grammar symbol in the corresponding rule. The values of -p[i] are mapped to grammar symbols as shown here: +In this example, each grammar rule is defined by a Python function +where the docstring to that function contains the appropriate +context-free grammar specification. The statements that make up the +function body implement the semantic actions of the rule. Each function +accepts a single argument p that is a sequence containing the +values of each grammar symbol in the corresponding rule. The values +of p[i] are mapped to grammar symbols as shown here:
    @@ -1535,42 +1630,49 @@ def p_expression_plus(p):
     
    -For tokens, the "value" of the corresponding p[i] is the -same as the p.value attribute assigned -in the lexer module. For non-terminals, the value is determined by -whatever is placed in p[0] when rules are reduced. This -value can be anything at all. However, it probably most common for -the value to be a simple Python type, a tuple, or an instance. In this example, we -are relying on the fact that the NUMBER token stores an integer value in its value -field. All of the other rules simply perform various types of integer operations and store -the result. - -

    -Note: The use of negative indices have a special meaning in yacc---specially p[-1] does -not have the same value as p[3] in this example. Please see the section on "Embedded Actions" for further -details. -

    -The first rule defined in the yacc specification determines the starting grammar -symbol (in this case, a rule for expression appears first). Whenever -the starting rule is reduced by the parser and no more input is available, parsing -stops and the final value is returned (this value will be whatever the top-most rule -placed in p[0]). Note: an alternative starting symbol can be specified using the start keyword argument to +For tokens, the "value" of the corresponding p[i] is the +same as the p.value attribute assigned in the lexer +module. For non-terminals, the value is determined by whatever is +placed in p[0] when rules are reduced. This value can be +anything at all. However, it is probably most common for the value to be +a simple Python type, a tuple, or an instance. In this example, we +are relying on the fact that the NUMBER token stores an +integer value in its value field. All of the other rules simply +perform various types of integer operations and propagate the result. +

    + +

    +Note: The use of negative indices has a special meaning in +yacc---specifically, p[-1] does not have the same value +as p[3] in this example. Please see the section on "Embedded +Actions" for further details. +

    + +

    +The first rule defined in the yacc specification determines the +starting grammar symbol (in this case, a rule for expression +appears first). Whenever the starting rule is reduced by the parser +and no more input is available, parsing stops and the final value is +returned (this value will be whatever the top-most rule placed +in p[0]). Note: an alternative starting symbol can be +specified using the start keyword argument to yacc(). -

    The p_error(p) rule is defined to catch syntax errors. See the error handling section -below for more detail. +

    The p_error(p) rule is defined to catch syntax errors. +See the error handling section below for more detail.

    -To build the parser, call the yacc.yacc() function. This function -looks at the module and attempts to construct all of the LR parsing tables for the grammar -you have specified. The first time yacc.yacc() is invoked, you will get a message -such as this: +To build the parser, call the yacc.yacc() function. This +function looks at the module and attempts to construct all of the LR +parsing tables for the grammar you have specified. The first +time yacc.yacc() is invoked, you will get a message such as +this:

     $ python calcparse.py
    -yacc: Generating LALR parsing table...  
    +Generating LALR tables
     calc > 
     
    @@ -1582,7 +1684,8 @@ debugging file called parser.out is created. On subsequent executions, yacc will reload the table from parsetab.py unless it has detected a change in the underlying grammar (in which case the tables and parsetab.py file are -regenerated). Note: The names of parser output files can be changed if necessary. See the notes that follow later. +regenerated). Note: The names of parser output files can be changed +if necessary. See the PLY Reference for details.

    If any errors are detected in your grammar specification, yacc.py will produce @@ -1597,9 +1700,18 @@ diagnostic messages and possibly raise an exception. Some of the errors that ca

  • Undefined rules and tokens -The next few sections now discuss a few finer points of grammar construction. +The next few sections discuss grammar specification in more detail. -

    5.2 Combining Grammar Rule Functions

    +

    +The final part of the example shows how to actually run the parser +created by +yacc(). To run the parser, you simply have to call +the parse() method with a string of input text. This will run all +of the grammar rules and return the result of the entire parse. The +result returned is the value assigned to p[0] in the starting +grammar rule. + +

    6.2 Combining Grammar Rule Functions

    When grammar rules are similar, they can be combined into a single function. @@ -1668,7 +1780,15 @@ def p_expressions(p): -

    5.3 Character Literals

    +If parsing performance is a concern, you should resist the urge to put +too much conditional processing into a single grammar rule as shown in +these examples. When you add checks to see which grammar rule is +being handled, you are actually duplicating the work that the parser +has already performed (i.e., the parser already knows exactly what rule it +matched). You can eliminate this overhead by using a +separate p_rule() function for each grammar rule. + +

    6.3 Character Literals

    If desired, a grammar may contain tokens defined as single character literals. For example: @@ -1704,7 +1824,7 @@ literals = ['+','-','*','/' ] Character literals are limited to a single character. Thus, it is not legal to specify literals such as '<=' or '=='. For this, use the normal lexing rules (e.g., define a rule such as t_EQ = r'=='). -

    5.4 Empty Productions

    +

    6.4 Empty Productions

    yacc.py can handle empty productions by defining a rule like this: @@ -1728,10 +1848,12 @@ def p_optitem(p): -Note: You can write empty rules anywhere by simply specifying an empty right hand side. However, I personally find that -writing an "empty" rule and using "empty" to denote an empty production is easier to read. +Note: You can write empty rules anywhere by simply specifying an empty +right hand side. However, I personally find that writing an "empty" +rule and using "empty" to denote an empty production is easier to read +and more clearly states your intentions. -

    5.5 Changing the starting symbol

    +

    6.5 Changing the starting symbol

    Normally, the first rule found in a yacc specification defines the starting grammar rule (top level rule). To change this, simply @@ -1751,8 +1873,10 @@ def p_foo(p): -The use of a start specifier may be useful during debugging since you can use it to have yacc build a subset of -a larger grammar. For this purpose, it is also possible to specify a starting symbol as an argument to yacc(). For example: +The use of a start specifier may be useful during debugging +since you can use it to have yacc build a subset of a larger grammar. +For this purpose, it is also possible to specify a starting symbol as +an argument to yacc(). For example:
    @@ -1760,12 +1884,14 @@ yacc.yacc(start='foo')
     
    -

    5.6 Dealing With Ambiguous Grammars

    +

    6.6 Dealing With Ambiguous Grammars

    -The expression grammar given in the earlier example has been written in a special format to eliminate ambiguity. -However, in many situations, it is extremely difficult or awkward to write grammars in this format. A -much more natural way to express the grammar is in a more compact form like this: +The expression grammar given in the earlier example has been written +in a special format to eliminate ambiguity. However, in many +situations, it is extremely difficult or awkward to write grammars in +this format. A much more natural way to express the grammar is in a +more compact form like this:
    @@ -1778,15 +1904,18 @@ expression : expression PLUS expression
     
    -Unfortunately, this grammar specification is ambiguous. For example, if you are parsing the string -"3 * 4 + 5", there is no way to tell how the operators are supposed to be grouped. -For example, does the expression mean "(3 * 4) + 5" or is it "3 * (4+5)"? +Unfortunately, this grammar specification is ambiguous. For example, +if you are parsing the string "3 * 4 + 5", there is no way to tell how +the operators are supposed to be grouped. For example, does the +expression mean "(3 * 4) + 5" or is it "3 * (4+5)"?

    -When an ambiguous grammar is given to yacc.py it will print messages about "shift/reduce conflicts" -or a "reduce/reduce conflicts". A shift/reduce conflict is caused when the parser generator can't decide -whether or not to reduce a rule or shift a symbol on the parsing stack. For example, consider -the string "3 * 4 + 5" and the internal parsing stack: +When an ambiguous grammar is given to yacc.py it will print +messages about "shift/reduce conflicts" or "reduce/reduce conflicts". +A shift/reduce conflict is caused when the parser generator can't +decide whether or not to reduce a rule or shift a symbol on the +parsing stack. For example, consider the string "3 * 4 + 5" and the +internal parsing stack:

    @@ -1801,20 +1930,25 @@ Step Symbol Stack           Input Tokens            Action
     
    -In this case, when the parser reaches step 6, it has two options. One is to reduce the -rule expr : expr * expr on the stack. The other option is to shift the -token + on the stack. Both options are perfectly legal from the rules -of the context-free-grammar. +In this case, when the parser reaches step 6, it has two options. One +is to reduce the rule expr : expr * expr on the stack. The +other option is to shift the token + on the stack. Both +options are perfectly legal from the rules of the +context-free-grammar.

    -By default, all shift/reduce conflicts are resolved in favor of shifting. Therefore, in the above -example, the parser will always shift the + instead of reducing. Although this -strategy works in many cases (including the ambiguous if-then-else), it is not enough for arithmetic -expressions. In fact, in the above example, the decision to shift + is completely wrong---we should have -reduced expr * expr since multiplication has higher mathematical precedence than addition. +By default, all shift/reduce conflicts are resolved in favor of +shifting. Therefore, in the above example, the parser will always +shift the + instead of reducing. Although this strategy +works in many cases (for example, the case of +"if-then" versus "if-then-else"), it is not enough for arithmetic expressions. In fact, +in the above example, the decision to shift + is completely +wrong---we should have reduced expr * expr since +multiplication has higher mathematical precedence than addition. -

    To resolve ambiguity, especially in expression grammars, yacc.py allows individual -tokens to be assigned a precedence level and associativity. This is done by adding a variable +

    To resolve ambiguity, especially in expression +grammars, yacc.py allows individual tokens to be assigned a +precedence level and associativity. This is done by adding a variable precedence to the grammar file like this:

    @@ -1826,17 +1960,19 @@ precedence = (
    -This declaration specifies that PLUS/MINUS have -the same precedence level and are left-associative and that -TIMES/DIVIDE have the same precedence and are left-associative. -Within the precedence declaration, tokens are ordered from lowest to highest precedence. Thus, -this declaration specifies that TIMES/DIVIDE have higher -precedence than PLUS/MINUS (since they appear later in the +This declaration specifies that PLUS/MINUS have the +same precedence level and are left-associative and that +TIMES/DIVIDE have the same precedence and are +left-associative. Within the precedence declaration, tokens +are ordered from lowest to highest precedence. Thus, this declaration +specifies that TIMES/DIVIDE have higher precedence +than PLUS/MINUS (since they appear later in the precedence specification).

    -The precedence specification works by associating a numerical precedence level value and associativity direction to -the listed tokens. For example, in the above example you get: +The precedence specification works by associating a numerical +precedence level value and associativity direction to the listed +tokens. For example, in the above example you get:

    @@ -1847,9 +1983,10 @@ DIVIDE    : level = 2,  assoc = 'left'
     
    -These values are then used to attach a numerical precedence value and associativity direction -to each grammar rule. This is always determined by looking at the precedence of the right-most terminal symbol. -For example: +These values are then used to attach a numerical precedence value and +associativity direction to each grammar rule. This is always +determined by looking at the precedence of the right-most terminal +symbol. For example:
    @@ -1867,7 +2004,7 @@ looking at the precedence rules and associativity specifiers.
     
     

      -
    1. If the current token has higher precedence, it is shifted. +
    2. If the current token has higher precedence than the rule on the stack, it is shifted.
    3. If the grammar rule on the stack has higher precedence, the rule is reduced.
    4. If the current token and the grammar rule have the same precedence, the rule is reduced for left associativity, whereas the token is shifted for right associativity. @@ -1875,21 +2012,28 @@ rule is reduced for left associativity, whereas the token is shifted for right a favor of shifting (the default).
    -For example, if "expression PLUS expression" has been parsed and the next token -is "TIMES", the action is going to be a shift because "TIMES" has a higher precedence level than "PLUS". On the other -hand, if "expression TIMES expression" has been parsed and the next token is "PLUS", the action -is going to be reduce because "PLUS" has a lower precedence than "TIMES." +For example, if "expression PLUS expression" has been parsed and the +next token is "TIMES", the action is going to be a shift because +"TIMES" has a higher precedence level than "PLUS". On the other hand, +if "expression TIMES expression" has been parsed and the next token is +"PLUS", the action is going to be reduce because "PLUS" has a lower +precedence than "TIMES."
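The resolution rules listed above can be sketched in a few lines of plain Python. This is a simplified model for illustration only, not PLY's table-generation code: the function name resolve_conflict and the dictionary layout are assumptions made here, and non-associative operators are left out.

```python
# Illustrative sketch only: a simplified model of the resolution rules,
# not PLY's actual table-generation code.

# Precedence table from the example: token -> (level, associativity).
PRECEDENCE = {
    "PLUS": (1, "left"),
    "MINUS": (1, "left"),
    "TIMES": (2, "left"),
    "DIVIDE": (2, "left"),
}


def resolve_conflict(rule_terminals, lookahead, table=PRECEDENCE):
    """Return 'shift' or 'reduce' for a shift/reduce conflict.

    rule_terminals: terminal symbols of the rule on the stack, left to right.
    lookahead: the current input token.
    """
    # The rule takes its precedence from its right-most terminal symbol.
    rule_prec = table.get(rule_terminals[-1]) if rule_terminals else None
    token_prec = table.get(lookahead)

    # Nothing known about precedence: fall back to the default (shift).
    if rule_prec is None or token_prec is None:
        return "shift"

    rule_level, rule_assoc = rule_prec
    token_level, _ = token_prec

    if token_level > rule_level:
        return "shift"      # current token binds tighter
    if rule_level > token_level:
        return "reduce"     # rule on the stack binds tighter
    # Same level: left associativity reduces, right associativity shifts.
    return "reduce" if rule_assoc == "left" else "shift"


print(resolve_conflict(["PLUS"], "TIMES"))   # expression PLUS expression, next TIMES
print(resolve_conflict(["TIMES"], "PLUS"))   # expression TIMES expression, next PLUS
```

The two calls correspond to the two cases described in the paragraph above: the first yields a shift and the second a reduce.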

    -When shift/reduce conflicts are resolved using the first three techniques (with the help of -precedence rules), yacc.py will report no errors or conflicts in the grammar. +When shift/reduce conflicts are resolved using the first three +techniques (with the help of precedence rules), yacc.py will +report no errors or conflicts in the grammar (although it will print +some information in the parser.out debugging file).

    -One problem with the precedence specifier technique is that it is sometimes necessary to -change the precedence of an operator in certain contents. For example, consider a unary-minus operator -in "3 + 4 * -5". Normally, unary minus has a very high precedence--being evaluated before the multiply. -However, in our precedence specifier, MINUS has a lower precedence than TIMES. To deal with this, -precedence rules can be given for fictitious tokens like this: +One problem with the precedence specifier technique is that it is +sometimes necessary to change the precedence of an operator in certain +contexts. For example, consider a unary-minus operator in "3 + 4 * +-5". Mathematically, the unary minus is normally given a very high +precedence--being evaluated before the multiply. However, in our +precedence specifier, MINUS has a lower precedence than TIMES. To +deal with this, precedence rules can be given for so-called "fictitious tokens" +like this:

    @@ -1978,11 +2122,27 @@ whether it's supposed to reduce the 5 as an expression and then reduce
     the rule assignment : ID EQUALS expression.
     
     

    -It should be noted that reduce/reduce conflicts are notoriously difficult to spot -simply looking at the input grammer. To locate these, it is usually easier to look at the -parser.out debugging file with an appropriately high level of caffeination. +It should be noted that reduce/reduce conflicts are notoriously +difficult to spot by simply looking at the input grammar. When a +reduce/reduce conflict occurs, yacc() will try to help by +printing a warning message such as this: + +

    +
    +WARNING: 1 reduce/reduce conflict
    +WARNING: reduce/reduce conflict in state 15 resolved using rule (assignment -> ID EQUALS NUMBER)
    +WARNING: rejected rule (expression -> NUMBER)
    +
    +
    + +This message identifies the two rules that are in conflict. However, +it may not tell you how the parser arrived at such a state. To try +and figure it out, you'll probably have to look at your grammar and +the contents of the +parser.out debugging file with an appropriately high level of +caffeination. -

    5.7 The parser.out file

    +

    6.7 The parser.out file

    Tracking down shift/reduce and reduce/reduce conflicts is one of the finer pleasures of using an LR @@ -2240,10 +2400,15 @@ state 13
    -In the file, each state of the grammar is described. Within each state the "." indicates the current -location of the parse within any applicable grammar rules. In addition, the actions for each valid -input token are listed. When a shift/reduce or reduce/reduce conflict arises, rules not selected -are prefixed with an !. For example: +The different states that appear in this file are a representation of +every possible sequence of valid input tokens allowed by the grammar. +When receiving input tokens, the parser is building up a stack and +looking for matching rules. Each state keeps track of the grammar +rules that might be in the process of being matched at that point. Within each +rule, the "." character indicates the current location of the parse +within that rule. In addition, the actions for each valid input token +are listed. When a shift/reduce or reduce/reduce conflict arises, +rules not selected are prefixed with an !. For example:
    @@ -2258,12 +2423,22 @@ By looking at these rules (and with a little practice), you can usually track do
     of most parsing conflicts.  It should also be stressed that not all shift-reduce conflicts are
     bad.  However, the only way to be sure that they are resolved correctly is to look at parser.out.
       
    -

    5.8 Syntax Error Handling

    +

    6.8 Syntax Error Handling

    -When a syntax error occurs during parsing, the error is immediately +If you are creating a parser for production use, the handling of +syntax errors is important. As a general rule, you don't want a +parser to simply throw up its hands and stop at the first sign of +trouble. Instead, you want it to report the error, recover if possible, and +continue parsing so that all of the errors in the input get reported +to the user at once. This is the standard behavior found in compilers +for languages such as C, C++, and Java. + +In PLY, when a syntax error occurs during parsing, the error is immediately detected (i.e., the parser does not read any more tokens beyond the -source of the error). Error recovery in LR parsers is a delicate +source of the error). However, at this point, the parser enters a +recovery mode that can be used to try and continue further parsing. +As a general rule, error recovery in LR parsers is a delicate topic that involves ancient rituals and black-magic. The recovery mechanism provided by yacc.py is comparable to Unix yacc so you may want to consult a book like O'Reilly's "Lex and Yacc" for some of the finer details. @@ -2273,7 +2448,9 @@ When a syntax error occurs, yacc.py performs the following steps:
    1. On the first occurrence of an error, the user-defined p_error() function -is called with the offending token as an argument. Afterwards, the parser enters +is called with the offending token as an argument. However, if the syntax error is due to +reaching the end-of-file, p_error() is called with an argument of None. +Afterwards, the parser enters an "error-recovery" mode in which it will not make future calls to p_error() until it has successfully shifted at least 3 tokens onto the parsing stack. @@ -2298,7 +2475,7 @@ shifted onto the parsing stack. parser can successfully shift a new symbol or reduce a rule involving error.
    -

    5.8.1 Recovery and resynchronization with error rules

    +

    6.8.1 Recovery and resynchronization with error rules

    The most well-behaved approach for handling syntax errors is to write grammar rules that include the error @@ -2350,7 +2527,7 @@ This is because the first bad token encountered will cause the rule to be reduced--which may make it difficult to recover if more bad tokens immediately follow. -

    5.8.2 Panic mode recovery

    +

    6.8.2 Panic mode recovery

    An alternative error recovery scheme is to enter a panic mode recovery in which tokens are @@ -2423,7 +2600,37 @@ def p_error(p):
    -

    5.8.3 General comments on error handling

    +

    6.8.3 Signaling an error from a production

    + + +If necessary, a production rule can manually force the parser to enter error recovery. This +is done by raising the SyntaxError exception like this: + +
    +
    +def p_production(p):
    +    'production : some production ...'
    +    raise SyntaxError
    +
    +
    + +The effect of raising SyntaxError is the same as if the last symbol shifted onto the +parsing stack was actually a syntax error. Thus, when you do this, the last symbol shifted is popped off +of the parsing stack and the current lookahead token is set to an error token. The parser +then enters error-recovery mode where it tries to reduce rules that can accept error tokens. +The steps that follow from this point are exactly the same as if a syntax error were detected and +p_error() were called. + +

    +One important aspect of manually setting an error is that the p_error() function will NOT be +called in this case. If you need to issue an error message, make sure you do it in the production that +raises SyntaxError. + +

    +Note: This feature of PLY is meant to mimic the behavior of the YYERROR macro in yacc. + + +

    6.8.4 General comments on error handling

    For normal types of languages, error recovery with error rules and resynchronization characters is probably the most reliable @@ -2431,10 +2638,12 @@ technique. This is because you can instrument the grammar to catch errors at sel to recover and continue parsing. Panic mode recovery is really only useful in certain specialized applications where you might want to discard huge portions of the input text to find a valid restart point. -

    5.9 Line Number and Position Tracking

    +

    6.9 Line Number and Position Tracking

    + -Position tracking is often a tricky problem when writing compilers. By default, PLY tracks the line number and position of -all tokens. This information is available using the following functions: +Position tracking is often a tricky problem when writing compilers. +By default, PLY tracks the line number and position of all tokens. +This information is available using the following functions:
    • p.lineno(num). Return the line number for symbol num @@ -2452,9 +2661,11 @@ def p_expression(p):
    -As an optional feature, yacc.py can automatically track line numbers and positions for all of the grammar symbols -as well. However, this -extra tracking requires extra processing and can significantly slow down parsing. Therefore, it must be enabled by passing the +As an optional feature, yacc.py can automatically track line +numbers and positions for all of the grammar symbols as well. +However, this extra tracking requires extra processing and can +significantly slow down parsing. Therefore, it must be enabled by +passing the tracking=True option to yacc.parse(). For example:
    @@ -2463,8 +2674,9 @@ yacc.parse(data,tracking=True)
    -Once enabled, the lineno() and lexpos() methods work for all grammar symbols. In addition, two -additional methods can be used: +Once enabled, the lineno() and lexpos() methods work +for all grammar symbols. In addition, two additional methods can be +used:
    • p.linespan(num). Return a tuple (startline,endline) with the starting and ending line number for symbol num. @@ -2506,29 +2718,59 @@ def p_bad_func(p):

      -Similarly, you may get better parsing performance if you only propagate line number -information where it's needed. For example: +Similarly, you may get better parsing performance if you only +selectively propagate line number information where it's needed using +the p.set_lineno() method. For example:

       def p_fname(p):
           'fname : ID'
      -    p[0] = (p[1],p.lineno(1))
      +    p[0] = p[1]
      +    p.set_lineno(0,p.lineno(1))
       
      -Finally, it should be noted that PLY does not store position information after a rule has been -processed. If it is important for you to retain this information in an abstract syntax tree, you -must make your own copy. +PLY doesn't retain line number information from rules that have already been +parsed. If you are building an abstract syntax tree and need to have line numbers, +you should make sure that the line numbers appear in the tree itself. -

      5.10 AST Construction

      +

      6.10 AST Construction

      -yacc.py provides no special functions for constructing an abstract syntax tree. However, such -construction is easy enough to do on your own. Simply create a data structure for abstract syntax tree nodes -and assign nodes to p[0] in each rule. +yacc.py provides no special functions for constructing an +abstract syntax tree. However, such construction is easy enough to do +on your own. -For example: +

      A minimal way to construct a tree is to simply create and +propagate a tuple or list in each grammar rule function. There +are many possible ways to do this, but one example would be something +like this: + +

      +
      +def p_expression_binop(p):
      +    '''expression : expression PLUS expression
      +                  | expression MINUS expression
      +                  | expression TIMES expression
      +                  | expression DIVIDE expression'''
      +
      +    p[0] = ('binary-expression',p[2],p[1],p[3])
      +
      +def p_expression_group(p):
      +    'expression : LPAREN expression RPAREN'
      +    p[0] = ('group-expression',p[2])
      +
      +def p_expression_number(p):
      +    'expression : NUMBER'
      +    p[0] = ('number-expression',p[1])
      +
      +
      + +
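To show how a tuple tree of this shape might be consumed afterwards, here is a small evaluator sketch. It is not part of PLY. It assumes the operator stored in p[2] is the literal character matched by the lexer ('+', '-', '*' or '/'), and the names OPS and evaluate are invented for this illustration.

```python
import operator

# Assumption: the lexer stores the literal operator character as the token
# value, so p[2] in the binary rule is one of these strings.
OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}


def evaluate(node):
    """Recursively evaluate a tuple-based expression tree."""
    kind = node[0]
    if kind == "number-expression":
        return node[1]
    if kind == "group-expression":
        return evaluate(node[1])
    if kind == "binary-expression":
        _, op, left, right = node
        return OPS[op](evaluate(left), evaluate(right))
    raise ValueError("unknown node type: %r" % (kind,))


# Tree that the rules above would build for "3 * (4 + 5)".
tree = ("binary-expression", "*",
        ("number-expression", 3),
        ("group-expression",
         ("binary-expression", "+",
          ("number-expression", 4),
          ("number-expression", 5))))

print(evaluate(tree))
```

Dispatching on the first tuple element keeps the grammar rules free of evaluation logic. The cost is that every consumer of the tree has to know the tuple layout.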

      +Another approach is to create a set of data structures for different +kinds of abstract syntax tree nodes and assign nodes to p[0] +in each rule. For example:

      @@ -2564,8 +2806,12 @@ def p_expression_number(p):
       
      -To simplify tree traversal, it may make sense to pick a very generic tree structure for your parse tree nodes. -For example: +The advantage to this approach is that it may make it easier to attach more complicated +semantics, type checking, code generation, and other features to the node classes. + +

      +To simplify tree traversal, it may make sense to pick a very generic +tree structure for your parse tree nodes. For example:

      @@ -2588,7 +2834,7 @@ def p_expression_binop(p):
       
      -

      5.11 Embedded Actions

      +

      6.11 Embedded Actions

      The parsing technique used by yacc only allows actions to be executed at the end of a rule. For example, @@ -2608,7 +2854,7 @@ symbols A, B, C, and D have been parsed. Sometimes, however, it is useful to execute small code fragments during intermediate stages of parsing. For example, suppose you wanted to perform some action immediately after A has -been parsed. To do this, you can write a empty rule like this: +been parsed. To do this, write an empty rule like this:
      @@ -2671,8 +2917,11 @@ def p_seen_AB(p):
       
      -an extra shift-reduce conflict will be introduced. This conflict is caused by the fact that the same symbol C appears next in -both the abcd and abcx rules. The parser can either shift the symbol (abcd rule) or reduce the empty rule seen_AB (abcx rule). +an extra shift-reduce conflict will be introduced. This conflict is +caused by the fact that the same symbol C appears next in +both the abcd and abcx rules. The parser can either +shift the symbol (abcd rule) or reduce the empty +rule seen_AB (abcx rule).

      A common use of embedded rules is to control other aspects of parsing @@ -2696,10 +2945,14 @@ def p_new_scope(p): -In this case, the embedded action new_scope executes immediately after a LBRACE ({) symbol is parsed. This might -adjust internal symbol tables and other aspects of the parser. Upon completion of the rule statements_block, code might undo the operations performed in the embedded action (e.g., pop_scope()). +In this case, the embedded action new_scope executes +immediately after a LBRACE ({) symbol is parsed. +This might adjust internal symbol tables and other aspects of the +parser. Upon completion of the rule statements_block, code +might undo the operations performed in the embedded action +(e.g., pop_scope()). -

      5.12 Yacc implementation notes

      +

      6.12 Miscellaneous Yacc Notes

        @@ -2770,16 +3023,7 @@ each time it runs (which may take awhile depending on how large your grammar is)
        -yacc.parse(debug=1)
        -
        -
        - -

        -

      • To redirect the debugging output to a filename of your choosing, use: - -
        -
        -yacc.parse(debug=1, debugfile="debugging.out")
        +yacc.parse(debug=1)     
         
        @@ -2812,17 +3056,17 @@ machine. Please be patient. size of the grammar. The biggest bottlenecks will be the lexer and the complexity of the code in your grammar rules.
      -

      6. Parser and Lexer State Management

      +

      7. Multiple Parsers and Lexers

      In advanced parsing applications, you may want to have multiple -parsers and lexers. Furthermore, the parser may want to control the -behavior of the lexer in some way. +parsers and lexers.

      -To do this, it is important to note that both the lexer and parser are -actually implemented as objects. These objects are returned by the -lex() and yacc() functions respectively. For example: +As a general rule, this isn't a problem. However, to make it work, +you need to carefully make sure everything gets hooked up correctly. +First, make sure you save the objects returned by lex() and +yacc(). For example:

      @@ -2831,7 +3075,8 @@ parser = yacc.yacc()     # Return parser object
       
      -To attach the lexer and parser together, make sure you use the lexer argumemnt to parse. For example: +Next, when parsing, make sure you give the parse() function a reference to the lexer it +should be using. For example:
      @@ -2839,8 +3084,13 @@ parser.parse(text,lexer=lexer)
       
      -Within lexer and parser rules, these objects are also available. In the lexer, -the "lexer" attribute of a token refers to the lexer object in use. For example: +If you forget to do this, the parser will use the last lexer +created--which is not always what you want. + +

      +Within lexer and parser rule functions, these objects are also +available. In the lexer, the "lexer" attribute of a token refers to +the lexer object that triggered the rule. For example:

      @@ -2868,7 +3118,7 @@ If necessary, arbitrary attributes can be attached to the lexer or parser object
       For example, if you wanted to have different parsing modes, you could attach a mode
       attribute to the parser object and look at it later.
       
      -

      7. Using Python's Optimized Mode

      +

      8. Using Python's Optimized Mode

      Because PLY uses information from doc-strings, parsing and lexing @@ -2891,9 +3141,110 @@ the tables without the need for doc strings.

      Beware: running PLY in optimized mode disables a lot of error checking. You should only do this when your project has stabilized -and you don't need to do any debugging. - -

      8. Where to go from here?

      +and you don't need to do any debugging. One of the purposes of +optimized mode is to substantially decrease the startup time of +your compiler (by assuming that everything is already properly +specified and works). + +

      9. Advanced Debugging

      + + +

      +Debugging a compiler is typically not an easy task. PLY provides some +advanced diagnostic capabilities through the use of Python's +logging module. The next two sections describe this: + +

      9.1 Debugging the lex() and yacc() commands

      + + +

      +Both the lex() and yacc() commands have a debugging +mode that can be enabled using the debug flag. For example: + +

      +
      +lex.lex(debug=True)
      +yacc.yacc(debug=True)
      +
      +
      + +Normally, the output produced by debugging is routed to either +standard error or, in the case of yacc(), to a file +parser.out. This output can be more carefully controlled +by supplying a logging object. Here is an example that adds +information about where different debugging messages are coming from: + +
      +
      +# Set up a logging object
      +import logging
      +logging.basicConfig(
      +    level = logging.DEBUG,
      +    filename = "parselog.txt",
      +    filemode = "w",
      +    format = "%(filename)10s:%(lineno)4d:%(message)s"
      +)
      +log = logging.getLogger()
      +
      +lex.lex(debug=True,debuglog=log)
      +yacc.yacc(debug=True,debuglog=log)
      +
      +
      + +If you supply a custom logger, the amount of debugging +information produced can be controlled by setting the logging level. +Typically, debugging messages are either issued at the DEBUG, +INFO, or WARNING levels. + +

      +PLY's error messages and warnings are also produced using the logging +interface. This can be controlled by passing a logging object +using the errorlog parameter. + +

      +
      +lex.lex(errorlog=log)
      +yacc.yacc(errorlog=log)
      +
      +
      + +If you want to completely silence warnings, you can either pass in a +logging object with an appropriate filter level or use the NullLogger +object defined in either lex or yacc. For example: + +
      +
      +yacc.yacc(errorlog=yacc.NullLogger())
      +
      +
      + +

      9.2 Run-time Debugging

      + + +

      +To enable run-time debugging of a parser, use the debug option to parse. This +option can either be an integer (which simply turns debugging on or off) or an instance +of a logger object. For example: + +

      +
      +log = logging.getLogger()
      +parser.parse(input,debug=log)
      +
      +
      + +If a logging object is passed, you can use its filtering level to control how much +output gets generated. The INFO level is used to produce information +about rule reductions. The DEBUG level will show information about the +parsing stack, token shifts, and other details. The ERROR level shows information +related to parsing errors. + +
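The level filtering described here is ordinary behavior of Python's logging module. The sketch below uses only the standard library and does not call PLY at all. The ListHandler class, the logger name, and the three messages are stand-ins invented for this illustration, chosen to mirror the kinds of output described above.

```python
import logging


class ListHandler(logging.Handler):
    """Collect (level name, message) pairs in memory for inspection."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append((record.levelname, record.getMessage()))


# Hypothetical logger name; any name works.
log = logging.getLogger("parse-debug-sketch")
log.setLevel(logging.INFO)      # INFO and above pass, DEBUG is dropped
log.propagate = False           # keep the root logger out of the picture
captured = ListHandler()
log.addHandler(captured)

# Stand-ins for the kinds of messages described above.
log.debug("stack and shift details")
log.info("rule reduction")
log.error("syntax error")

for level, message in captured.records:
    print(level, message)
```

Lowering the logger's level to logging.DEBUG would let the first message through as well. The same setLevel() call is what controls the volume of output when a logger object is passed in as described above.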

      +For very complicated problems, you should pass in a logging object that +redirects to a file where you can more easily inspect the output after +execution. + +

      10. Where to go from here?

      The examples directory of the PLY distribution contains several simple examples. Please consult a