author | Nathan Binkert <nate@binkert.org> | 2009-08-16 13:39:58 -0700 |
---|---|---|
committer | Nathan Binkert <nate@binkert.org> | 2009-08-16 13:39:58 -0700 |
commit | e1270f81bdc81f5a575b34c2d2c294bdde3e6f4f | |
tree | b54af3469a338609faf04e67603c5264e79d59a5 /ext/ply/doc/ply.html | |
parent | bcaf93d182f43bf72d52104bb909324945904120 | |
ply: update PLY to version 3.2
Diffstat (limited to 'ext/ply/doc/ply.html')
-rw-r--r-- | ext/ply/doc/ply.html | 1115 |
1 files changed, 733 insertions, 382 deletions
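Note on the `find_column()` hunk in this diff: the patch replaces the manual backwards scan in the documentation's column helper with `str.rfind()`. As written, the patched helper appears to count columns from 1 on the first line but from 2 on every later line, because `last_cr` then points at the newline itself rather than at the first character of the line. The sketch below is not part of the commit. It reproduces the patched helper next to one possible variant that is 1-based on every line. `Token` is a hypothetical stand-in for PLY's `LexToken` (only `lexpos` is used), and the code is written for Python 3.

```python
from collections import namedtuple

# Hypothetical stand-in for PLY's LexToken; only the lexpos attribute
# (offset of the token from the start of the input text) is needed here.
Token = namedtuple("Token", ["lexpos"])

def find_column_patched(text, token):
    # Mirrors the helper as it appears on the '+' side of the hunk.
    last_cr = text.rfind('\n', 0, token.lexpos)
    if last_cr < 0:
        last_cr = 0
    return (token.lexpos - last_cr) + 1

def find_column(text, token):
    # Assumed alternative: index of the first character of the token's line.
    # rfind() returns -1 when no newline precedes the token, so the first
    # line starts at index 0 without a special case.
    line_start = text.rfind('\n', 0, token.lexpos) + 1
    return (token.lexpos - line_start) + 1

text = "ab\ncd"
first = Token(lexpos=0)   # 'a', first character of line 1
later = Token(lexpos=3)   # 'c', first character of line 2

print(find_column_patched(text, first), find_column_patched(text, later))
print(find_column(text, first), find_column(text, later))
```

With this input the patched helper reports different columns for `a` and `c`, although both are the first character of their line, while the variant reports column 1 for both. Whether columns should be 0-based or 1-based is a convention choice; the point is only that it should be the same on every line.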
diff --git a/ext/ply/doc/ply.html b/ext/ply/doc/ply.html index dba0c6288..3345e7929 100644 --- a/ext/ply/doc/ply.html +++ b/ext/ply/doc/ply.html @@ -12,12 +12,13 @@ dave@dabeaz.com<br> </b> <p> -<b>PLY Version: 2.3</b> +<b>PLY Version: 3.0</b> <p> <!-- INDEX --> <div class="sectiontoc"> <ul> +<li><a href="#ply_nn1">Preface and Requirements</a> <li><a href="#ply_nn1">Introduction</a> <li><a href="#ply_nn2">PLY Overview</a> <li><a href="#ply_nn3">Lex</a> @@ -37,13 +38,13 @@ dave@dabeaz.com<br> <li><a href="#ply_nn16">Debugging</a> <li><a href="#ply_nn17">Alternative specification of lexers</a> <li><a href="#ply_nn18">Maintaining state</a> -<li><a href="#ply_nn19">Duplicating lexers</a> +<li><a href="#ply_nn19">Lexer cloning</a> <li><a href="#ply_nn20">Internal lexer state</a> <li><a href="#ply_nn21">Conditional lexing and start conditions</a> <li><a href="#ply_nn21">Miscellaneous Issues</a> </ul> <li><a href="#ply_nn22">Parsing basics</a> -<li><a href="#ply_nn23">Yacc reference</a> +<li><a href="#ply_nn23">Yacc</a> <ul> <li><a href="#ply_nn24">An example</a> <li><a href="#ply_nn25">Combining Grammar Rule Functions</a> @@ -56,15 +57,21 @@ dave@dabeaz.com<br> <ul> <li><a href="#ply_nn30">Recovery and resynchronization with error rules</a> <li><a href="#ply_nn31">Panic mode recovery</a> +<li><a href="#ply_nn35">Signaling an error from a production</a> <li><a href="#ply_nn32">General comments on error handling</a> </ul> <li><a href="#ply_nn33">Line Number and Position Tracking</a> <li><a href="#ply_nn34">AST Construction</a> <li><a href="#ply_nn35">Embedded Actions</a> -<li><a href="#ply_nn36">Yacc implementation notes</a> +<li><a href="#ply_nn36">Miscellaneous Yacc Notes</a> </ul> -<li><a href="#ply_nn37">Parser and Lexer State Management</a> +<li><a href="#ply_nn37">Multiple Parsers and Lexers</a> <li><a href="#ply_nn38">Using Python's Optimized Mode</a> +<li><a href="#ply_nn44">Advanced Debugging</a> +<ul> +<li><a href="#ply_nn45">Debugging the lex() and yacc() 
commands</a> +<li><a href="#ply_nn46">Run-time Debugging</a> +</ul> <li><a href="#ply_nn39">Where to go from here?</a> </ul> </div> @@ -72,10 +79,26 @@ dave@dabeaz.com<br> +<H2><a name="ply_nn1"></a>1. Preface and Requirements</H2> +<p> +This document provides an overview of lexing and parsing with PLY. +Given the intrinsic complexity of parsing, I would strongly advise +that you read (or at least skim) this entire document before jumping +into a big development project with PLY. +</p> -<H2><a name="ply_nn1"></a>1. Introduction</H2> +<p> +PLY-3.0 is compatible with both Python 2 and Python 3. Be aware that +Python 3 support is new and has not been extensively tested (although +all of the examples and unit tests pass under Python 3.0). If you are +using Python 2, you should try to use Python 2.4 or newer. Although PLY +works with versions as far back as Python 2.2, some of its optional features +require more modern library modules. +</p> + +<H2><a name="ply_nn1"></a>2. Introduction</H2> PLY is a pure-Python implementation of the popular compiler @@ -95,7 +118,10 @@ include lexical analysis, parsing, type checking, type inference, nested scoping, and code generation for the SPARC processor. Approximately 30 different compiler implementations were completed in this course. Most of PLY's interface and operation has been influenced by common -usability problems encountered by students. +usability problems encountered by students. Since 2001, PLY has +continued to be improved as feedback has been received from users. +PLY-3.0 represents a major refactoring of the original implementation +with an eye towards future enhancements. <p> Since PLY was primarily developed as an instructional tool, you will @@ -120,7 +146,7 @@ Techniques, and Tools", by Aho, Sethi, and Ullman. O'Reilly's "Lex and Yacc" by John Levine may also be handy. In fact, the O'Reilly book can be used as a reference for PLY as the concepts are virtually identical. -<H2><a name="ply_nn2"></a>2. 
PLY Overview</H2> +<H2><a name="ply_nn2"></a>3. PLY Overview</H2> PLY consists of two separate modules; <tt>lex.py</tt> and @@ -163,7 +189,7 @@ parsing tables is relatively expensive, PLY caches the results and saves them to a file. If no changes are detected in the input source, the tables are read from the cache. Otherwise, they are regenerated. -<H2><a name="ply_nn3"></a>3. Lex</H2> +<H2><a name="ply_nn3"></a>4. Lex</H2> <tt>lex.py</tt> is used to tokenize an input string. For example, suppose @@ -206,7 +232,7 @@ More specifically, the input is broken into pairs of token types and values. Fo The identification of tokens is typically done by writing a series of regular expression rules. The next section shows how this is done using <tt>lex.py</tt>. -<H3><a name="ply_nn4"></a>3.1 Lex Example</H3> +<H3><a name="ply_nn4"></a>4.1 Lex Example</H3> The following example shows how <tt>lex.py</tt> is used to write a simple tokenizer. @@ -243,11 +269,7 @@ t_RPAREN = r'\)' # A regular expression rule with some action code def t_NUMBER(t): r'\d+' - try: - t.value = int(t.value) - except ValueError: - print "Line %d: Number %s is too large!" % (t.lineno,t.value) - t.value = 0 + t.value = int(t.value) return t # Define a rule so we can track line numbers @@ -264,11 +286,14 @@ def t_error(t): t.lexer.skip(1) # Build the lexer -lex.lex() +lexer = lex.lex() </pre> </blockquote> -To use the lexer, you first need to feed it some input text using its <tt>input()</tt> method. After that, repeated calls to <tt>token()</tt> produce tokens. The following code shows how this works: +To use the lexer, you first need to feed it some input text using +its <tt>input()</tt> method. After that, repeated calls +to <tt>token()</tt> produce tokens. 
The following code shows how this +works: <blockquote> <pre> @@ -280,11 +305,11 @@ data = ''' ''' # Give the lexer some input -lex.input(data) +lexer.input(data) # Tokenize -while 1: - tok = lex.token() +while True: + tok = lexer.token() if not tok: break # No more input print tok </pre> @@ -308,7 +333,16 @@ LexToken(NUMBER,2,3,21) </pre> </blockquote> -The tokens returned by <tt>lex.token()</tt> are instances +Lexers also support the iteration protocol. So, you can write the above loop as follows: + +<blockquote> +<pre> +for tok in lexer: + print tok +</pre> +</blockquote> + +The tokens returned by <tt>lexer.token()</tt> are instances of <tt>LexToken</tt>. This object has attributes <tt>tok.type</tt>, <tt>tok.value</tt>, <tt>tok.lineno</tt>, and <tt>tok.lexpos</tt>. The following code shows an example of @@ -317,8 +351,8 @@ accessing these attributes: <blockquote> <pre> # Tokenize -while 1: - tok = lex.token() +while True: + tok = lexer.token() if not tok: break # No more input print tok.type, tok.value, tok.line, tok.lexpos </pre> @@ -330,7 +364,7 @@ type and value of the token itself. the location of the token. <tt>tok.lexpos</tt> is the index of the token relative to the start of the input text. -<H3><a name="ply_nn5"></a>3.2 The tokens list</H3> +<H3><a name="ply_nn5"></a>4.2 The tokens list</H3> All lexers must provide a list <tt>tokens</tt> that defines all of the possible token @@ -355,7 +389,7 @@ tokens = ( </pre> </blockquote> -<H3><a name="ply_nn6"></a>3.3 Specification of tokens</H3> +<H3><a name="ply_nn6"></a>4.3 Specification of tokens</H3> Each token is specified by writing a regular expression rule. Each of these rules are @@ -379,11 +413,7 @@ converts the string into a Python integer. <pre> def t_NUMBER(t): r'\d+' - try: - t.value = int(t.value) - except ValueError: - print "Number %s is too large!" 
% t.value - t.value = 0 + t.value = int(t.value) return t </pre> </blockquote> @@ -414,8 +444,8 @@ expressions in order of decreasing length, this problem is solved for rules defi the order can be explicitly controlled since rules appearing first are checked first. <p> -To handle reserved words, it is usually easier to just match an identifier and do a special name lookup in a function -like this: +To handle reserved words, you should write a single rule to match an +identifier and do a special name lookup in a function like this: <blockquote> <pre> @@ -427,6 +457,8 @@ reserved = { ... } +tokens = ['LPAREN','RPAREN',...,'ID'] + list(reserved.values()) + def t_ID(t): r'[a-zA-Z_][a-zA-Z_0-9]*' t.type = reserved.get(t.value,'ID') # Check for reserved words @@ -449,7 +481,7 @@ t_PRINT = r'print' those rules will be triggered for identifiers that include those words as a prefix such as "forget" or "printed". This is probably not what you want. -<H3><a name="ply_nn7"></a>3.4 Token values</H3> +<H3><a name="ply_nn7"></a>4.4 Token values</H3> When tokens are returned by lex, they have a value that is stored in the <tt>value</tt> attribute. Normally, the value is the text @@ -468,9 +500,10 @@ def t_ID(t): </blockquote> It is important to note that storing data in other attribute names is <em>not</em> recommended. The <tt>yacc.py</tt> module only exposes the -contents of the <tt>value</tt> attribute. Thus, accessing other attributes may be unnecessarily awkward. +contents of the <tt>value</tt> attribute. Thus, accessing other attributes may be unnecessarily awkward. If you +need to store multiple values on a token, assign a tuple, dictionary, or instance to <tt>value</tt>. -<H3><a name="ply_nn8"></a>3.5 Discarded tokens</H3> +<H3><a name="ply_nn8"></a>4.5 Discarded tokens</H3> To discard a token, such as a comment, simply define a token rule that returns no value. 
For example: @@ -496,7 +529,7 @@ Be advised that if you are ignoring many different kinds of text, you may still control over the order in which regular expressions are matched (i.e., functions are matched in order of specification whereas strings are sorted by regular expression length). -<H3><a name="ply_nn9"></a>3.6 Line numbers and positional information</H3> +<H3><a name="ply_nn9"></a>4.6 Line numbers and positional information</H3> <p>By default, <tt>lex.py</tt> knows nothing about line numbers. This is because <tt>lex.py</tt> doesn't know anything @@ -525,11 +558,10 @@ column information as a separate step. For instance, just count backwards unti # input is the input text string # token is a token instance def find_column(input,token): - i = token.lexpos - while i > 0: - if input[i] == '\n': break - i -= 1 - column = (token.lexpos - i)+1 + last_cr = input.rfind('\n',0,token.lexpos) + if last_cr < 0: + last_cr = 0 + column = (token.lexpos - last_cr) + 1 return column </pre> </blockquote> @@ -537,7 +569,7 @@ def find_column(input,token): Since column information is often only useful in the context of error handling, calculating the column position can be performed when needed as opposed to doing it for each token. -<H3><a name="ply_nn10"></a>3.7 Ignored characters</H3> +<H3><a name="ply_nn10"></a>4.7 Ignored characters</H3> <p> @@ -549,7 +581,7 @@ similar to <tt>t_newline()</tt>, the use of <tt>t_ignore</tt> provides substanti lexing performance because it is handled as a special case and is checked in a much more efficient manner than the normal regular expression rules. -<H3><a name="ply_nn11"></a>3.8 Literal characters</H3> +<H3><a name="ply_nn11"></a>4.8 Literal characters</H3> <p> @@ -575,7 +607,7 @@ take precedence. <p> When a literal token is returned, both its <tt>type</tt> and <tt>value</tt> attributes are set to the character itself. For example, <tt>'+'</tt>. 
-<H3><a name="ply_nn12"></a>3.9 Error handling</H3> +<H3><a name="ply_nn12"></a>4.9 Error handling</H3> <p> @@ -596,44 +628,42 @@ def t_error(t): In this case, we simply print the offending character and skip ahead one character by calling <tt>t.lexer.skip(1)</tt>. -<H3><a name="ply_nn13"></a>3.10 Building and using the lexer</H3> +<H3><a name="ply_nn13"></a>4.10 Building and using the lexer</H3> <p> To build the lexer, the function <tt>lex.lex()</tt> is used. This function uses Python reflection (or introspection) to read the the regular expression rules -out of the calling context and build the lexer. Once the lexer has been built, two functions can +out of the calling context and build the lexer. Once the lexer has been built, two methods can be used to control the lexer. <ul> -<li><tt>lex.input(data)</tt>. Reset the lexer and store a new input string. -<li><tt>lex.token()</tt>. Return the next token. Returns a special <tt>LexToken</tt> instance on success or +<li><tt>lexer.input(data)</tt>. Reset the lexer and store a new input string. +<li><tt>lexer.token()</tt>. Return the next token. Returns a special <tt>LexToken</tt> instance on success or None if the end of the input text has been reached. </ul> -If desired, the lexer can also be used as an object. The <tt>lex()</tt> returns a <tt>Lexer</tt> object that -can be used for this purpose. For example: +The preferred way to use PLY is to invoke the above methods directly on the lexer object returned by the +<tt>lex()</tt> function. The legacy interface to PLY involves module-level functions <tt>lex.input()</tt> and <tt>lex.token()</tt>. +For example: <blockquote> <pre> -lexer = lex.lex() -lexer.input(sometext) +lex.lex() +lex.input(sometext) while 1: - tok = lexer.token() + tok = lex.token() if not tok: break print tok </pre> </blockquote> <p> -This latter technique should be used if you intend to use multiple lexers in your application. 
Simply define each -lexer in its own module and use the object returned by <tt>lex()</tt> as appropriate. +In this example, the module-level functions <tt>lex.input()</tt> and <tt>lex.token()</tt> are bound to the <tt>input()</tt> +and <tt>token()</tt> methods of the last lexer created by the lex module. This interface may go away at some point so +it's probably best not to use it. -<p> -Note: The global functions <tt>lex.input()</tt> and <tt>lex.token()</tt> are bound to the <tt>input()</tt> -and <tt>token()</tt> methods of the last lexer created by the lex module. - -<H3><a name="ply_nn14"></a>3.11 The @TOKEN decorator</H3> +<H3><a name="ply_nn14"></a>4.11 The @TOKEN decorator</H3> In some applications, you may want to define build tokens from as a series of @@ -680,7 +710,7 @@ t_ID.__doc__ = identifier <b>NOTE:</b> Use of <tt>@TOKEN</tt> requires Python-2.4 or newer. If you're concerned about backwards compatibility with older versions of Python, use the alternative approach of setting the docstring directly. -<H3><a name="ply_nn15"></a>3.12 Optimized mode</H3> +<H3><a name="ply_nn15"></a>4.12 Optimized mode</H3> For improved performance, it may be desirable to use Python's @@ -717,7 +747,7 @@ lexer = lex.lex(optimize=1,lextab="footab") When running in optimized mode, it is important to note that lex disables most error checking. Thus, this is really only recommended if you're sure everything is working correctly and you're ready to start releasing production code. -<H3><a name="ply_nn16"></a>3.13 Debugging</H3> +<H3><a name="ply_nn16"></a>4.13 Debugging</H3> For the purpose of debugging, you can run <tt>lex()</tt> in a debugging mode as follows: @@ -728,12 +758,16 @@ lexer = lex.lex(debug=1) </pre> </blockquote> -This will result in a large amount of debugging information to be printed including all of the added rules and the master -regular expressions. 
+<p> +This will produce various sorts of debugging information including all of the added rules, +the master regular expressions used by the lexer, and tokens generating during lexing. +</p> +<p> In addition, <tt>lex.py</tt> comes with a simple main function which will either tokenize input read from standard input or from a file specified on the command line. To use it, simply put this in your lexer: +</p> <blockquote> <pre> @@ -742,7 +776,10 @@ if __name__ == '__main__': </pre> </blockquote> -<H3><a name="ply_nn17"></a>3.14 Alternative specification of lexers</H3> +Please refer to the "Debugging" section near the end for some more advanced details +of debugging. + +<H3><a name="ply_nn17"></a>4.14 Alternative specification of lexers</H3> As shown in the example, lexers are specified all within one Python module. If you want to @@ -780,11 +817,7 @@ t_RPAREN = r'\)' # A regular expression rule with some action code def t_NUMBER(t): r'\d+' - try: - t.value = int(t.value) - except ValueError: - print "Line %d: Number %s is too large!" % (t.lineno,t.value) - t.value = 0 + t.value = int(t.value) return t # Define a rule so we can track line numbers @@ -821,7 +854,7 @@ None </pre> </blockquote> -The <tt>object</tt> option can be used to define lexers as a class instead of a module. For example: +The <tt>module</tt> option can also be used to define lexers from instances of a class. For example: <blockquote> <pre> @@ -851,11 +884,7 @@ class MyLexer: # Note addition of self parameter since we're in a class def t_NUMBER(self,t): r'\d+' - try: - t.value = int(t.value) - except ValueError: - print "Line %d: Number %s is too large!" 
% (t.lineno,t.value) - t.value = 0 + t.value = int(t.value) return t # Define a rule so we can track line numbers @@ -873,12 +902,12 @@ class MyLexer: <b># Build the lexer def build(self,**kwargs): - self.lexer = lex.lex(object=self, **kwargs)</b> + self.lexer = lex.lex(module=self, **kwargs)</b> # Test it output def test(self,data): self.lexer.input(data) - while 1: + while True: tok = lexer.token() if not tok: break print tok @@ -890,14 +919,81 @@ m.test("3 + 4") # Test it </pre> </blockquote> -For reasons that are subtle, you should <em>NOT</em> invoke <tt>lex.lex()</tt> inside the <tt>__init__()</tt> method of your class. If you -do, it may cause bizarre behavior if someone tries to duplicate a lexer object. Keep reading. -<H3><a name="ply_nn18"></a>3.15 Maintaining state</H3> +When building a lexer from class, <em>you should construct the lexer from +an instance of the class</em>, not the class object itself. This is because +PLY only works properly if the lexer actions are defined by bound-methods. + +<p> +When using the <tt>module</tt> option to <tt>lex()</tt>, PLY collects symbols +from the underlying object using the <tt>dir()</tt> function. There is no +direct access to the <tt>__dict__</tt> attribute of the object supplied as a +module value. + +<P> +Finally, if you want to keep things nicely encapsulated, but don't want to use a +full-fledged class definition, lexers can be defined using closures. For example: + +<blockquote> +<pre> +import ply.lex as lex + +# List of token names. 
This is always required +tokens = ( + 'NUMBER', + 'PLUS', + 'MINUS', + 'TIMES', + 'DIVIDE', + 'LPAREN', + 'RPAREN', +) + +def MyLexer(): + # Regular expression rules for simple tokens + t_PLUS = r'\+' + t_MINUS = r'-' + t_TIMES = r'\*' + t_DIVIDE = r'/' + t_LPAREN = r'\(' + t_RPAREN = r'\)' + + # A regular expression rule with some action code + def t_NUMBER(t): + r'\d+' + t.value = int(t.value) + return t + + # Define a rule so we can track line numbers + def t_newline(t): + r'\n+' + t.lexer.lineno += len(t.value) + + # A string containing ignored characters (spaces and tabs) + t_ignore = ' \t' + + # Error handling rule + def t_error(t): + print "Illegal character '%s'" % t.value[0] + t.lexer.skip(1) + + # Build the lexer from my environment and return it + return lex.lex() +</pre> +</blockquote> + + +<H3><a name="ply_nn18"></a>4.15 Maintaining state</H3> -In your lexer, you may want to maintain a variety of state information. This might include mode settings, symbol tables, and other details. There are a few -different ways to handle this situation. First, you could just keep some global variables: +In your lexer, you may want to maintain a variety of state +information. This might include mode settings, symbol tables, and +other details. As an example, suppose that you wanted to keep +track of how many NUMBER tokens had been encountered. + +<p> +One way to do this is to keep a set of global variables in the module +where you created the lexer. For example: <blockquote> <pre> @@ -906,28 +1002,22 @@ def t_NUMBER(t): r'\d+' global num_count num_count += 1 - try: - t.value = int(t.value) - except ValueError: - print "Line %d: Number %s is too large!" % (t.lineno,t.value) - t.value = 0 + t.value = int(t.value) return t </pre> </blockquote> -Alternatively, you can store this information inside the Lexer object created by <tt>lex()</tt>. To this, you can use the <tt>lexer</tt> attribute -of tokens passed to the various rules. 
For example: +If you don't like the use of a global variable, another place to store +information is inside the Lexer object created by <tt>lex()</tt>. +To this, you can use the <tt>lexer</tt> attribute of tokens passed to +the various rules. For example: <blockquote> <pre> def t_NUMBER(t): r'\d+' t.lexer.num_count += 1 # Note use of lexer attribute - try: - t.value = int(t.value) - except ValueError: - print "Line %d: Number %s is too large!" % (t.lineno,t.value) - t.value = 0 + t.value = int(t.value) return t lexer = lex.lex() @@ -935,17 +1025,20 @@ lexer.num_count = 0 # Set the initial count </pre> </blockquote> -This latter approach has the advantage of storing information inside -the lexer itself---something that may be useful if multiple instances -of the same lexer have been created. However, it may also feel kind -of "hacky" to the purists. Just to put their mind at some ease, all +This latter approach has the advantage of being simple and working +correctly in applications where multiple instantiations of a given +lexer exist in the same application. However, this might also feel +like a gross violation of encapsulation to OO purists. +Just to put your mind at some ease, all internal attributes of the lexer (with the exception of <tt>lineno</tt>) have names that are prefixed by <tt>lex</tt> (e.g., <tt>lexdata</tt>,<tt>lexpos</tt>, etc.). Thus, -it should be perfectly safe to store attributes in the lexer that -don't have names starting with that prefix. +it is perfectly safe to store attributes in the lexer that +don't have names starting with that prefix or a name that conlicts with one of the +predefined methods (e.g., <tt>input()</tt>, <tt>token()</tt>, etc.). 
<p> -A third approach is to define the lexer as a class as shown in the previous example: +If you don't like assigning values on the lexer object, you can define your lexer as a class as +shown in the previous section: <blockquote> <pre> @@ -954,11 +1047,7 @@ class MyLexer: def t_NUMBER(self,t): r'\d+' self.num_count += 1 - try: - t.value = int(t.value) - except ValueError: - print "Line %d: Number %s is too large!" % (t.lineno,t.value) - t.value = 0 + t.value = int(t.value) return t def build(self, **kwargs): @@ -966,23 +1055,36 @@ class MyLexer: def __init__(self): self.num_count = 0 - -# Create a lexer -m = MyLexer() -lexer = lex.lex(object=m) </pre> </blockquote> -The class approach may be the easiest to manage if your application is going to be creating multiple instances of the same lexer and -you need to manage a lot of state. +The class approach may be the easiest to manage if your application is +going to be creating multiple instances of the same lexer and you need +to manage a lot of state. -<H3><a name="ply_nn19"></a>3.16 Duplicating lexers</H3> +<p> +State can also be managed through closures. For example, in Python 3: +<blockquote> +<pre> +def MyLexer(): + num_count = 0 + ... + def t_NUMBER(t): + r'\d+' + nonlocal num_count + num_count += 1 + t.value = int(t.value) + return t + ... +</pre> +</blockquote> + +<H3><a name="ply_nn19"></a>4.16 Lexer cloning</H3> -<b>NOTE: I am thinking about deprecating this feature. Post comments on <a href="http://groups.google.com/group/ply-hack">ply-hack@googlegroups.com</a> or send me a private email at dave@dabeaz.com.</b> <p> -If necessary, a lexer object can be quickly duplicated by invoking its <tt>clone()</tt> method. For example: +If necessary, a lexer object can be duplicated by invoking its <tt>clone()</tt> method. For example: <blockquote> <pre> @@ -992,23 +1094,25 @@ newlexer = lexer.clone() </pre> </blockquote> -When a lexer is cloned, the copy is identical to the original lexer, -including any input text. 
However, once created, different text can be -fed to the clone which can be used independently. This capability may -be useful in situations when you are writing a parser/compiler that +When a lexer is cloned, the copy is exactly identical to the original lexer +including any input text and internal state. However, the clone allows a +different set of input text to be supplied which may be processed separately. +This may be useful in situations when you are writing a parser/compiler that involves recursive or reentrant processing. For instance, if you needed to scan ahead in the input for some reason, you could create a -clone and use it to look ahead. +clone and use it to look ahead. Or, if you were implementing some kind of preprocessor, +cloned lexers could be used to handle different input files. <p> -The advantage of using <tt>clone()</tt> instead of reinvoking <tt>lex()</tt> is -that it is significantly faster. Namely, it is not necessary to re-examine all of the -token rules, build a regular expression, and construct internal tables. All of this -information can simply be reused in the new lexer. +Creating a clone is different than calling <tt>lex.lex()</tt> in that +PLY doesn't regenerate any of the internal tables or regular expressions. So, <p> -Special considerations need to be made when cloning a lexer that is defined as a class. Previous sections -showed an example of a class <tt>MyLexer</tt>. If you have the following code: +Special considerations need to be made when cloning lexers that also +maintain their own internal state using classes or closures. Namely, +you need to be aware that the newly created lexers will share all of +this state with the original lexer. For example, if you defined a +lexer as a class and did this: <blockquote> <pre> @@ -1020,43 +1124,12 @@ b = a.clone() # Clone the lexer </blockquote> Then both <tt>a</tt> and <tt>b</tt> are going to be bound to the same -object <tt>m</tt>. 
If the object <tt>m</tt> contains internal state -related to lexing, this sharing may lead to quite a bit of confusion. To fix this, -the <tt>clone()</tt> method accepts an optional argument that can be used to supply a new object. This -can be used to clone the lexer and bind it to a new instance. For example: +object <tt>m</tt> and any changes to <tt>m</tt> will be reflected in both lexers. It's +important to emphasize that <tt>clone()</tt> is only meant to create a new lexer +that reuses the regular expressions and environment of another lexer. If you +need to make a totally new copy of a lexer, then call <tt>lex()</tt> again. -<blockquote> -<pre> -m = MyLexer() # Create a lexer -a = lex.lex(object=m) - -# Create a clone -n = MyLexer() # New instance of MyLexer -b = a.clone(n) # New lexer bound to n -</pre> -</blockquote> - -It may make sense to encapsulate all of this inside a method: - -<blockquote> -<pre> -class MyLexer: - ... - def clone(self): - c = MyLexer() # Create a new instance of myself - # Copy attributes from self to c as appropriate - ... - # Clone the lexer - c.lexer = self.lexer.clone(c) - return c -</pre> -</blockquote> - -The fact that a new instance of <tt>MyLexer</tt> may be created while cloning a lexer is the reason why you should never -invoke <tt>lex.lex()</tt> inside <tt>__init__()</tt>. If you do, the lexer will be rebuilt from scratch and you lose -all of the performance benefits of using <tt>clone()</tt> in the first place. - -<H3><a name="ply_nn20"></a>3.17 Internal lexer state</H3> +<H3><a name="ply_nn20"></a>4.17 Internal lexer state</H3> A Lexer object <tt>lexer</tt> has a number of internal attributes that may be useful in certain @@ -1074,8 +1147,9 @@ matched at the new position. <p> <tt>lexer.lineno</tt> <blockquote> -The current value of the line number attribute stored in the lexer. This can be modified as needed to -change the line number. +The current value of the line number attribute stored in the lexer. 
PLY only specifies that the attribute +exists---it never sets, updates, or performs any processing with it. If you want to track line numbers, +you will need to add code yourself (see the section on line numbers and positional information). </blockquote> <p> @@ -1090,9 +1164,10 @@ would probably be a bad idea to modify this unless you really know what you're d <blockquote> This is the raw <tt>Match</tt> object returned by the Python <tt>re.match()</tt> function (used internally by PLY) for the current token. If you have written a regular expression that contains named groups, you can use this to retrieve those values. +Note: This attribute is only updated when tokens are defined and processed by functions. </blockquote> -<H3><a name="ply_nn21"></a>3.18 Conditional lexing and start conditions</H3> +<H3><a name="ply_nn21"></a>4.18 Conditional lexing and start conditions</H3> In advanced parsing applications, it may be useful to have different @@ -1291,7 +1366,7 @@ However, if the closing right brace is encountered, the rule <tt>t_ccode_rbrace< position), stores it, and returns a token 'CCODE' containing all of that text. When returning the token, the lexing state is restored back to its initial state. -<H3><a name="ply_nn21"></a>3.19 Miscellaneous Issues</H3> +<H3><a name="ply_nn21"></a>4.19 Miscellaneous Issues</H3> <P> @@ -1331,7 +1406,7 @@ tokens are available. <li>The <tt>token()</tt> method must return an object <tt>tok</tt> that has <tt>type</tt> and <tt>value</tt> attributes. </ul> -<H2><a name="ply_nn22"></a>4. Parsing basics</H2> +<H2><a name="ply_nn22"></a>5. Parsing basics</H2> <tt>yacc.py</tt> is used to parse language syntax. Before showing an @@ -1357,9 +1432,10 @@ factor : NUMBER </blockquote> In the grammar, symbols such as <tt>NUMBER</tt>, <tt>+</tt>, <tt>-</tt>, <tt>*</tt>, and <tt>/</tt> are known -as <em>terminals</em> and correspond to raw input tokens. 
Identifiers such as <tt>term</tt> and <tt>factor</tt> refer to more -complex rules, typically comprised of a collection of tokens. These identifiers are known as <em>non-terminals</em>. +as <em>terminals</em> and correspond to raw input tokens. Identifiers such as <tt>term</tt> and <tt>factor</tt> refer to +grammar rules comprised of a collection of terminals and other rules. These identifiers are known as <em>non-terminals</em>. <P> + The semantic behavior of a language is often specified using a technique known as syntax directed translation. In syntax directed translation, attributes are attached to each symbol in a given grammar @@ -1385,9 +1461,12 @@ factor : NUMBER factor.val = int(NUMBER.lexval) </pre> </blockquote> -A good way to think about syntax directed translation is to simply think of each symbol in the grammar as some -kind of object. The semantics of the language are then expressed as a collection of methods/operations on these -objects. +A good way to think about syntax directed translation is to +view each symbol in the grammar as a kind of object. Associated +with each symbol is a value representing its "state" (for example, the +<tt>val</tt> attribute above). Semantic +actions are then expressed as a collection of functions or methods +that operate on the symbols and associated values. <p> Yacc uses a parsing technique known as LR-parsing or shift-reduce parsing. LR parsing is a @@ -1396,62 +1475,78 @@ Whenever a valid right-hand-side is found in the input, the appropriate action c grammar symbols are replaced by the grammar symbol on the left-hand-side. <p> -LR parsing is commonly implemented by shifting grammar symbols onto a stack and looking at the stack and the next -input token for patterns. 
The details of the algorithm can be found in a compiler text, but the -following example illustrates the steps that are performed if you wanted to parse the expression -<tt>3 + 5 * (10 - 20)</tt> using the grammar defined above: +LR parsing is commonly implemented by shifting grammar symbols onto a +stack and looking at the stack and the next input token for patterns that +match one of the grammar rules. +The details of the algorithm can be found in a compiler textbook, but the +following example illustrates the steps that are performed if you +wanted to parse the expression +<tt>3 + 5 * (10 - 20)</tt> using the grammar defined above. In the example, +the special symbol <tt>$</tt> represents the end of input. + <blockquote> <pre> Step Symbol Stack Input Tokens Action ---- --------------------- --------------------- ------------------------------- -1 $ 3 + 5 * ( 10 - 20 )$ Shift 3 -2 $ 3 + 5 * ( 10 - 20 )$ Reduce factor : NUMBER -3 $ factor + 5 * ( 10 - 20 )$ Reduce term : factor -4 $ term + 5 * ( 10 - 20 )$ Reduce expr : term -5 $ expr + 5 * ( 10 - 20 )$ Shift + -6 $ expr + 5 * ( 10 - 20 )$ Shift 5 -7 $ expr + 5 * ( 10 - 20 )$ Reduce factor : NUMBER -8 $ expr + factor * ( 10 - 20 )$ Reduce term : factor -9 $ expr + term * ( 10 - 20 )$ Shift * -10 $ expr + term * ( 10 - 20 )$ Shift ( -11 $ expr + term * ( 10 - 20 )$ Shift 10 -12 $ expr + term * ( 10 - 20 )$ Reduce factor : NUMBER -13 $ expr + term * ( factor - 20 )$ Reduce term : factor -14 $ expr + term * ( term - 20 )$ Reduce expr : term -15 $ expr + term * ( expr - 20 )$ Shift - -16 $ expr + term * ( expr - 20 )$ Shift 20 -17 $ expr + term * ( expr - 20 )$ Reduce factor : NUMBER -18 $ expr + term * ( expr - factor )$ Reduce term : factor -19 $ expr + term * ( expr - term )$ Reduce expr : expr - term -20 $ expr + term * ( expr )$ Shift ) -21 $ expr + term * ( expr ) $ Reduce factor : (expr) -22 $ expr + term * factor $ Reduce term : term * factor -23 $ expr + term $ Reduce expr : expr + term -24 $ expr $ Reduce 
expr -25 $ $ Success! -</pre> -</blockquote> - -When parsing the expression, an underlying state machine and the current input token determine what to do next. -If the next token looks like part of a valid grammar rule (based on other items on the stack), it is generally shifted -onto the stack. If the top of the stack contains a valid right-hand-side of a grammar rule, it is -usually "reduced" and the symbols replaced with the symbol on the left-hand-side. When this reduction occurs, the -appropriate action is triggered (if defined). If the input token can't be shifted and the top of stack doesn't match -any grammar rules, a syntax error has occurred and the parser must take some kind of recovery step (or bail out). - -<p> -It is important to note that the underlying implementation is built around a large finite-state machine that is encoded -in a collection of tables. The construction of these tables is quite complicated and beyond the scope of this discussion. -However, subtle details of this process explain why, in the example above, the parser chooses to shift a token -onto the stack in step 9 rather than reducing the rule <tt>expr : expr + term</tt>. - -<H2><a name="ply_nn23"></a>5. Yacc reference</H2> - - -This section describes how to use write parsers in PLY. 
- -<H3><a name="ply_nn24"></a>5.1 An example</H3> +1 3 + 5 * ( 10 - 20 )$ Shift 3 +2 3 + 5 * ( 10 - 20 )$ Reduce factor : NUMBER +3 factor + 5 * ( 10 - 20 )$ Reduce term : factor +4 term + 5 * ( 10 - 20 )$ Reduce expr : term +5 expr + 5 * ( 10 - 20 )$ Shift + +6 expr + 5 * ( 10 - 20 )$ Shift 5 +7 expr + 5 * ( 10 - 20 )$ Reduce factor : NUMBER +8 expr + factor * ( 10 - 20 )$ Reduce term : factor +9 expr + term * ( 10 - 20 )$ Shift * +10 expr + term * ( 10 - 20 )$ Shift ( +11 expr + term * ( 10 - 20 )$ Shift 10 +12 expr + term * ( 10 - 20 )$ Reduce factor : NUMBER +13 expr + term * ( factor - 20 )$ Reduce term : factor +14 expr + term * ( term - 20 )$ Reduce expr : term +15 expr + term * ( expr - 20 )$ Shift - +16 expr + term * ( expr - 20 )$ Shift 20 +17 expr + term * ( expr - 20 )$ Reduce factor : NUMBER +18 expr + term * ( expr - factor )$ Reduce term : factor +19 expr + term * ( expr - term )$ Reduce expr : expr - term +20 expr + term * ( expr )$ Shift ) +21 expr + term * ( expr ) $ Reduce factor : (expr) +22 expr + term * factor $ Reduce term : term * factor +23 expr + term $ Reduce expr : expr + term +24 expr $ Reduce expr +25 $ Success! +</pre> +</blockquote> + +When parsing the expression, an underlying state machine and the +current input token determine what happens next. If the next token +looks like part of a valid grammar rule (based on other items on the +stack), it is generally shifted onto the stack. If the top of the +stack contains a valid right-hand-side of a grammar rule, it is +usually "reduced" and the symbols replaced with the symbol on the +left-hand-side. When this reduction occurs, the appropriate action is +triggered (if defined). If the input token can't be shifted and the +top of stack doesn't match any grammar rules, a syntax error has +occurred and the parser must take some kind of recovery step (or bail +out). A parse is only successful if the parser reaches a state where +the symbol stack is empty and there are no more input tokens. 
+ +<p> +It is important to note that the underlying implementation is built +around a large finite-state machine that is encoded in a collection of +tables. The construction of these tables is non-trivial and +beyond the scope of this discussion. However, subtle details of this +process explain why, in the example above, the parser chooses to shift +a token onto the stack in step 9 rather than reducing the +rule <tt>expr : expr + term</tt>. + +<H2><a name="ply_nn23"></a>6. Yacc</H2> + + +The <tt>ply.yacc</tt> module implements the parsing component of PLY. +The name "yacc" stands for "Yet Another Compiler Compiler" and is +borrowed from the Unix tool of the same name. + +<H3><a name="ply_nn24"></a>6.1 An example</H3> Suppose you wanted to make a grammar for simple arithmetic expressions as previously described. Here is @@ -1503,26 +1598,26 @@ def p_error(p): print "Syntax error in input!" # Build the parser -yacc.yacc() - -# Use this if you want to build the parser using SLR instead of LALR -# yacc.yacc(method="SLR") +parser = yacc.yacc() -while 1: +while True: try: s = raw_input('calc > ') except EOFError: break if not s: continue - result = yacc.parse(s) + result = parser.parse(s) print result </pre> </blockquote> -In this example, each grammar rule is defined by a Python function where the docstring to that function contains the -appropriate context-free grammar specification. Each function accepts a single -argument <tt>p</tt> that is a sequence containing the values of each grammar symbol in the corresponding rule. The values of -<tt>p[i]</tt> are mapped to grammar symbols as shown here: +In this example, each grammar rule is defined by a Python function +where the docstring to that function contains the appropriate +context-free grammar specification. The statements that make up the +function body implement the semantic actions of the rule. 
Each function +accepts a single argument <tt>p</tt> that is a sequence containing the +values of each grammar symbol in the corresponding rule. The values +of <tt>p[i]</tt> are mapped to grammar symbols as shown here: <blockquote> <pre> @@ -1535,42 +1630,49 @@ def p_expression_plus(p): </pre> </blockquote> -For tokens, the "value" of the corresponding <tt>p[i]</tt> is the -<em>same</em> as the <tt>p.value</tt> attribute assigned -in the lexer module. For non-terminals, the value is determined by -whatever is placed in <tt>p[0]</tt> when rules are reduced. This -value can be anything at all. However, it probably most common for -the value to be a simple Python type, a tuple, or an instance. In this example, we -are relying on the fact that the <tt>NUMBER</tt> token stores an integer value in its value -field. All of the other rules simply perform various types of integer operations and store -the result. - -<P> -Note: The use of negative indices have a special meaning in yacc---specially <tt>p[-1]</tt> does -not have the same value as <tt>p[3]</tt> in this example. Please see the section on "Embedded Actions" for further -details. - <p> -The first rule defined in the yacc specification determines the starting grammar -symbol (in this case, a rule for <tt>expression</tt> appears first). Whenever -the starting rule is reduced by the parser and no more input is available, parsing -stops and the final value is returned (this value will be whatever the top-most rule -placed in <tt>p[0]</tt>). Note: an alternative starting symbol can be specified using the <tt>start</tt> keyword argument to +For tokens, the "value" of the corresponding <tt>p[i]</tt> is the +<em>same</em> as the <tt>p.value</tt> attribute assigned in the lexer +module. For non-terminals, the value is determined by whatever is +placed in <tt>p[0]</tt> when rules are reduced. This value can be +anything at all. 
However, it is probably most common for the value to be +a simple Python type, a tuple, or an instance. In this example, we +are relying on the fact that the <tt>NUMBER</tt> token stores an +integer value in its value field. All of the other rules simply +perform various types of integer operations and propagate the result. +</p> + +<p> +Note: The use of negative indices has a special meaning in +yacc---specifically <tt>p[-1]</tt> does not have the same value +as <tt>p[3]</tt> in this example. Please see the section on "Embedded +Actions" for further details. +</p> + +<p> +The first rule defined in the yacc specification determines the +starting grammar symbol (in this case, a rule for <tt>expression</tt> +appears first). Whenever the starting rule is reduced by the parser +and no more input is available, parsing stops and the final value is +returned (this value will be whatever the top-most rule placed +in <tt>p[0]</tt>). Note: an alternative starting symbol can be +specified using the <tt>start</tt> keyword argument to <tt>yacc()</tt>. -<p>The <tt>p_error(p)</tt> rule is defined to catch syntax errors. See the error handling section -below for more detail. +<p>The <tt>p_error(p)</tt> rule is defined to catch syntax errors. +See the error handling section below for more detail. <p> -To build the parser, call the <tt>yacc.yacc()</tt> function. This function -looks at the module and attempts to construct all of the LR parsing tables for the grammar -you have specified. The first time <tt>yacc.yacc()</tt> is invoked, you will get a message -such as this: +To build the parser, call the <tt>yacc.yacc()</tt> function. This +function looks at the module and attempts to construct all of the LR +parsing tables for the grammar you have specified. The first +time <tt>yacc.yacc()</tt> is invoked, you will get a message such as +this: <blockquote> <pre> $ python calcparse.py -yacc: Generating LALR parsing table... 
+Generating LALR tables calc > </pre> </blockquote> @@ -1582,7 +1684,8 @@ debugging file called <tt>parser.out</tt> is created. On subsequent executions, <tt>yacc</tt> will reload the table from <tt>parsetab.py</tt> unless it has detected a change in the underlying grammar (in which case the tables and <tt>parsetab.py</tt> file are -regenerated). Note: The names of parser output files can be changed if necessary. See the notes that follow later. +regenerated). Note: The names of parser output files can be changed +if necessary. See the <a href="reference.html">PLY Reference</a> for details. <p> If any errors are detected in your grammar specification, <tt>yacc.py</tt> will produce @@ -1597,9 +1700,18 @@ diagnostic messages and possibly raise an exception. Some of the errors that ca <li>Undefined rules and tokens </ul> -The next few sections now discuss a few finer points of grammar construction. +The next few sections discuss grammar specification in more detail. -<H3><a name="ply_nn25"></a>5.2 Combining Grammar Rule Functions</H3> +<p> +The final part of the example shows how to actually run the parser +created by +<tt>yacc()</tt>. To run the parser, you simply have to call +the <tt>parse()</tt> method with a string of input text. This will run all +of the grammar rules and return the result of the entire parse. The +result returned is the value assigned to <tt>p[0]</tt> in the starting +grammar rule. + +<H3><a name="ply_nn25"></a>6.2 Combining Grammar Rule Functions</H3> When grammar rules are similar, they can be combined into a single function. @@ -1668,7 +1780,15 @@ def p_expressions(p): </pre> </blockquote> -<H3><a name="ply_nn26"></a>5.3 Character Literals</H3> +If parsing performance is a concern, you should resist the urge to put +too much conditional processing into a single grammar rule as shown in +these examples. 
When you add checks to see which grammar rule is +being handled, you are actually duplicating the work that the parser +has already performed (i.e., the parser already knows exactly what rule it +matched). You can eliminate this overhead by using a +separate <tt>p_rule()</tt> function for each grammar rule. + +<H3><a name="ply_nn26"></a>6.3 Character Literals</H3> If desired, a grammar may contain tokens defined as single character literals. For example: @@ -1704,7 +1824,7 @@ literals = ['+','-','*','/' ] <b>Character literals are limited to a single character</b>. Thus, it is not legal to specify literals such as <tt>'<='</tt> or <tt>'=='</tt>. For this, use the normal lexing rules (e.g., define a rule such as <tt>t_EQ = r'=='</tt>). -<H3><a name="ply_nn26"></a>5.4 Empty Productions</H3> +<H3><a name="ply_nn26"></a>6.4 Empty Productions</H3> <tt>yacc.py</tt> can handle empty productions by defining a rule like this: @@ -1728,10 +1848,12 @@ def p_optitem(p): </pre> </blockquote> -Note: You can write empty rules anywhere by simply specifying an empty right hand side. However, I personally find that -writing an "empty" rule and using "empty" to denote an empty production is easier to read. +Note: You can write empty rules anywhere by simply specifying an empty +right hand side. However, I personally find that writing an "empty" +rule and using "empty" to denote an empty production is easier to read +and more clearly states your intentions. -<H3><a name="ply_nn28"></a>5.5 Changing the starting symbol</H3> +<H3><a name="ply_nn28"></a>6.5 Changing the starting symbol</H3> Normally, the first rule found in a yacc specification defines the starting grammar rule (top level rule). To change this, simply @@ -1751,8 +1873,10 @@ def p_foo(p): </pre> </blockquote> -The use of a <tt>start</tt> specifier may be useful during debugging since you can use it to have yacc build a subset of -a larger grammar. 
For this purpose, it is also possible to specify a starting symbol as an argument to <tt>yacc()</tt>. For example: +The use of a <tt>start</tt> specifier may be useful during debugging +since you can use it to have yacc build a subset of a larger grammar. +For this purpose, it is also possible to specify a starting symbol as +an argument to <tt>yacc()</tt>. For example: <blockquote> <pre> @@ -1760,12 +1884,14 @@ yacc.yacc(start='foo') </pre> </blockquote> -<H3><a name="ply_nn27"></a>5.6 Dealing With Ambiguous Grammars</H3> +<H3><a name="ply_nn27"></a>6.6 Dealing With Ambiguous Grammars</H3> -The expression grammar given in the earlier example has been written in a special format to eliminate ambiguity. -However, in many situations, it is extremely difficult or awkward to write grammars in this format. A -much more natural way to express the grammar is in a more compact form like this: +The expression grammar given in the earlier example has been written +in a special format to eliminate ambiguity. However, in many +situations, it is extremely difficult or awkward to write grammars in +this format. A much more natural way to express the grammar is in a +more compact form like this: <blockquote> <pre> @@ -1778,15 +1904,18 @@ expression : expression PLUS expression </pre> </blockquote> -Unfortunately, this grammar specification is ambiguous. For example, if you are parsing the string -"3 * 4 + 5", there is no way to tell how the operators are supposed to be grouped. -For example, does the expression mean "(3 * 4) + 5" or is it "3 * (4+5)"? +Unfortunately, this grammar specification is ambiguous. For example, +if you are parsing the string "3 * 4 + 5", there is no way to tell how +the operators are supposed to be grouped. For example, does the +expression mean "(3 * 4) + 5" or is it "3 * (4+5)"? <p> -When an ambiguous grammar is given to <tt>yacc.py</tt> it will print messages about "shift/reduce conflicts" -or a "reduce/reduce conflicts". 
A shift/reduce conflict is caused when the parser generator can't decide -whether or not to reduce a rule or shift a symbol on the parsing stack. For example, consider -the string "3 * 4 + 5" and the internal parsing stack: +When an ambiguous grammar is given to <tt>yacc.py</tt> it will print +messages about "shift/reduce conflicts" or "reduce/reduce conflicts". +A shift/reduce conflict is caused when the parser generator can't +decide whether or not to reduce a rule or shift a symbol on the +parsing stack. For example, consider the string "3 * 4 + 5" and the +internal parsing stack: <blockquote> <pre> @@ -1801,20 +1930,25 @@ Step Symbol Stack Input Tokens Action </pre> </blockquote> -In this case, when the parser reaches step 6, it has two options. One is to reduce the -rule <tt>expr : expr * expr</tt> on the stack. The other option is to shift the -token <tt>+</tt> on the stack. Both options are perfectly legal from the rules -of the context-free-grammar. +In this case, when the parser reaches step 6, it has two options. One +is to reduce the rule <tt>expr : expr * expr</tt> on the stack. The +other option is to shift the token <tt>+</tt> on the stack. Both +options are perfectly legal from the rules of the +context-free-grammar. <p> -By default, all shift/reduce conflicts are resolved in favor of shifting. Therefore, in the above -example, the parser will always shift the <tt>+</tt> instead of reducing. Although this -strategy works in many cases (including the ambiguous if-then-else), it is not enough for arithmetic -expressions. In fact, in the above example, the decision to shift <tt>+</tt> is completely wrong---we should have -reduced <tt>expr * expr</tt> since multiplication has higher mathematical precedence than addition. +By default, all shift/reduce conflicts are resolved in favor of +shifting. Therefore, in the above example, the parser will always +shift the <tt>+</tt> instead of reducing. 
Although this strategy +works in many cases (for example, the case of +"if-then" versus "if-then-else"), it is not enough for arithmetic expressions. In fact, +in the above example, the decision to shift <tt>+</tt> is completely +wrong---we should have reduced <tt>expr * expr</tt> since +multiplication has higher mathematical precedence than addition. -<p>To resolve ambiguity, especially in expression grammars, <tt>yacc.py</tt> allows individual -tokens to be assigned a precedence level and associativity. This is done by adding a variable +<p>To resolve ambiguity, especially in expression +grammars, <tt>yacc.py</tt> allows individual tokens to be assigned a +precedence level and associativity. This is done by adding a variable <tt>precedence</tt> to the grammar file like this: <blockquote> @@ -1826,17 +1960,19 @@ precedence = ( </pre> </blockquote> -This declaration specifies that <tt>PLUS</tt>/<tt>MINUS</tt> have -the same precedence level and are left-associative and that -<tt>TIMES</tt>/<tt>DIVIDE</tt> have the same precedence and are left-associative. -Within the <tt>precedence</tt> declaration, tokens are ordered from lowest to highest precedence. Thus, -this declaration specifies that <tt>TIMES</tt>/<tt>DIVIDE</tt> have higher -precedence than <tt>PLUS</tt>/<tt>MINUS</tt> (since they appear later in the +This declaration specifies that <tt>PLUS</tt>/<tt>MINUS</tt> have the +same precedence level and are left-associative and that +<tt>TIMES</tt>/<tt>DIVIDE</tt> have the same precedence and are +left-associative. Within the <tt>precedence</tt> declaration, tokens +are ordered from lowest to highest precedence. Thus, this declaration +specifies that <tt>TIMES</tt>/<tt>DIVIDE</tt> have higher precedence +than <tt>PLUS</tt>/<tt>MINUS</tt> (since they appear later in the precedence specification). <p> -The precedence specification works by associating a numerical precedence level value and associativity direction to -the listed tokens. 
For example, in the above example you get: +The precedence specification works by associating a numerical +precedence level value and associativity direction to the listed +tokens. For example, in the above example you get: <blockquote> <pre> @@ -1847,9 +1983,10 @@ DIVIDE : level = 2, assoc = 'left' </pre> </blockquote> -These values are then used to attach a numerical precedence value and associativity direction -to each grammar rule. <em>This is always determined by looking at the precedence of the right-most terminal symbol.</em> -For example: +These values are then used to attach a numerical precedence value and +associativity direction to each grammar rule. <em>This is always +determined by looking at the precedence of the right-most terminal +symbol.</em> For example: <blockquote> <pre> @@ -1867,7 +2004,7 @@ looking at the precedence rules and associativity specifiers. <p> <ol> -<li>If the current token has higher precedence, it is shifted. +<li>If the current token has higher precedence than the rule on the stack, it is shifted. <li>If the grammar rule on the stack has higher precedence, the rule is reduced. <li>If the current token and the grammar rule have the same precedence, the rule is reduced for left associativity, whereas the token is shifted for right associativity. @@ -1875,21 +2012,28 @@ rule is reduced for left associativity, whereas the token is shifted for right a favor of shifting (the default). </ol> -For example, if "expression PLUS expression" has been parsed and the next token -is "TIMES", the action is going to be a shift because "TIMES" has a higher precedence level than "PLUS". On the other -hand, if "expression TIMES expression" has been parsed and the next token is "PLUS", the action -is going to be reduce because "PLUS" has a lower precedence than "TIMES." 
+For example, if "expression PLUS expression" has been parsed and the +next token is "TIMES", the action is going to be a shift because +"TIMES" has a higher precedence level than "PLUS". On the other hand, +if "expression TIMES expression" has been parsed and the next token is +"PLUS", the action is going to be reduce because "PLUS" has a lower +precedence than "TIMES." <p> -When shift/reduce conflicts are resolved using the first three techniques (with the help of -precedence rules), <tt>yacc.py</tt> will report no errors or conflicts in the grammar. +When shift/reduce conflicts are resolved using the first three +techniques (with the help of precedence rules), <tt>yacc.py</tt> will +report no errors or conflicts in the grammar (although it will print +some information in the <tt>parser.out</tt> debugging file). <p> -One problem with the precedence specifier technique is that it is sometimes necessary to -change the precedence of an operator in certain contents. For example, consider a unary-minus operator -in "3 + 4 * -5". Normally, unary minus has a very high precedence--being evaluated before the multiply. -However, in our precedence specifier, MINUS has a lower precedence than TIMES. To deal with this, -precedence rules can be given for fictitious tokens like this: +One problem with the precedence specifier technique is that it is +sometimes necessary to change the precedence of an operator in certain +contexts. For example, consider a unary-minus operator in "3 + 4 * +-5". Mathematically, the unary minus is normally given a very high +precedence--being evaluated before the multiply. However, in our +precedence specifier, MINUS has a lower precedence than TIMES. To +deal with this, precedence rules can be given for so-called "fictitious tokens" +like this: <blockquote> <pre> @@ -1978,11 +2122,27 @@ whether it's supposed to reduce the 5 as an expression and then reduce the rule <tt>assignment : ID EQUALS expression</tt>. 
<p> -It should be noted that reduce/reduce conflicts are notoriously difficult to spot -simply looking at the input grammer. To locate these, it is usually easier to look at the -<tt>parser.out</tt> debugging file with an appropriately high level of caffeination. +It should be noted that reduce/reduce conflicts are notoriously +difficult to spot simply by looking at the input grammar. When a +reduce/reduce conflict occurs, <tt>yacc()</tt> will try to help by +printing a warning message such as this: + +<blockquote> +<pre> +WARNING: 1 reduce/reduce conflict +WARNING: reduce/reduce conflict in state 15 resolved using rule (assignment -> ID EQUALS NUMBER) +WARNING: rejected rule (expression -> NUMBER) +</pre> +</blockquote> + +This message identifies the two rules that are in conflict. However, +it may not tell you how the parser arrived at such a state. To try +and figure it out, you'll probably have to look at your grammar and +the contents of the +<tt>parser.out</tt> debugging file with an appropriately high level of +caffeination. -<H3><a name="ply_nn28"></a>5.7 The parser.out file</H3> +<H3><a name="ply_nn28"></a>6.7 The parser.out file</H3> Tracking down shift/reduce and reduce/reduce conflicts is one of the finer pleasures of using an LR @@ -2240,10 +2400,15 @@ state 13 </pre> </blockquote> -In the file, each state of the grammar is described. Within each state the "." indicates the current -location of the parse within any applicable grammar rules. In addition, the actions for each valid -input token are listed. When a shift/reduce or reduce/reduce conflict arises, rules <em>not</em> selected -are prefixed with an !. For example: +The different states that appear in this file are a representation of +every possible sequence of valid input tokens allowed by the grammar. +When receiving input tokens, the parser is building up a stack and +looking for matching rules. 
Each state keeps track of the grammar +rules that might be in the process of being matched at that point. Within each +rule, the "." character indicates the current location of the parse +within that rule. In addition, the actions for each valid input token +are listed. When a shift/reduce or reduce/reduce conflict arises, +rules <em>not</em> selected are prefixed with an !. For example: <blockquote> <pre> @@ -2258,12 +2423,22 @@ By looking at these rules (and with a little practice), you can usually track do of most parsing conflicts. It should also be stressed that not all shift-reduce conflicts are bad. However, the only way to be sure that they are resolved correctly is to look at <tt>parser.out</tt>. -<H3><a name="ply_nn29"></a>5.8 Syntax Error Handling</H3> +<H3><a name="ply_nn29"></a>6.8 Syntax Error Handling</H3> -When a syntax error occurs during parsing, the error is immediately +If you are creating a parser for production use, the handling of +syntax errors is important. As a general rule, you don't want a +parser to simply throw up its hands and stop at the first sign of +trouble. Instead, you want it to report the error, recover if possible, and +continue parsing so that all of the errors in the input get reported +to the user at once. This is the standard behavior found in compilers +for languages such as C, C++, and Java. + +In PLY, when a syntax error occurs during parsing, the error is immediately detected (i.e., the parser does not read any more tokens beyond the -source of the error). Error recovery in LR parsers is a delicate +source of the error). However, at this point, the parser enters a +recovery mode that can be used to try and continue further parsing. +As a general rule, error recovery in LR parsers is a delicate topic that involves ancient rituals and black-magic. The recovery mechanism provided by <tt>yacc.py</tt> is comparable to Unix yacc so you may want to consult a book like O'Reilly's "Lex and Yacc" for some of the finer details. 
@@ -2273,7 +2448,9 @@ When a syntax error occurs, <tt>yacc.py</tt> performs the following steps: <ol> <li>On the first occurrence of an error, the user-defined <tt>p_error()</tt> function -is called with the offending token as an argument. Afterwards, the parser enters +is called with the offending token as an argument. However, if the syntax error is due to +reaching the end-of-file, <tt>p_error()</tt> is called with an argument of <tt>None</tt>. +Afterwards, the parser enters an "error-recovery" mode in which it will not make future calls to <tt>p_error()</tt> until it has successfully shifted at least 3 tokens onto the parsing stack. @@ -2298,7 +2475,7 @@ shifted onto the parsing stack. parser can successfully shift a new symbol or reduce a rule involving <tt>error</tt>. </ol> -<H4><a name="ply_nn30"></a>5.8.1 Recovery and resynchronization with error rules</H4> +<H4><a name="ply_nn30"></a>6.8.1 Recovery and resynchronization with error rules</H4> The most well-behaved approach for handling syntax errors is to write grammar rules that include the <tt>error</tt> @@ -2350,7 +2527,7 @@ This is because the first bad token encountered will cause the rule to be reduced--which may make it difficult to recover if more bad tokens immediately follow. -<H4><a name="ply_nn31"></a>5.8.2 Panic mode recovery</H4> +<H4><a name="ply_nn31"></a>6.8.2 Panic mode recovery</H4> An alternative error recovery scheme is to enter a panic mode recovery in which tokens are @@ -2423,7 +2600,37 @@ def p_error(p): </pre> </blockquote> -<H4><a name="ply_nn32"></a>5.8.3 General comments on error handling</H4> +<H4><a name="ply_nn35"></a>6.8.3 Signaling an error from a production</H4> + + +If necessary, a production rule can manually force the parser to enter error recovery. This +is done by raising the <tt>SyntaxError</tt> exception like this: + +<blockquote> +<pre> +def p_production(p): + 'production : some production ...' 
+ raise SyntaxError +</pre> +</blockquote> + +The effect of raising <tt>SyntaxError</tt> is the same as if the last symbol shifted onto the +parsing stack was actually a syntax error. Thus, when you do this, the last symbol shifted is popped off +of the parsing stack and the current lookahead token is set to an <tt>error</tt> token. The parser +then enters error-recovery mode where it tries to reduce rules that can accept <tt>error</tt> tokens. +The steps that follow from this point are exactly the same as if a syntax error were detected and +<tt>p_error()</tt> were called. + +<P> +One important aspect of manually setting an error is that the <tt>p_error()</tt> function will <b>NOT</b> be +called in this case. If you need to issue an error message, make sure you do it in the production that +raises <tt>SyntaxError</tt>. + +<P> +Note: This feature of PLY is meant to mimic the behavior of the YYERROR macro in yacc. + + +<H4><a name="ply_nn32"></a>6.8.4 General comments on error handling</H4> For normal types of languages, error recovery with error rules and resynchronization characters is probably the most reliable @@ -2431,10 +2638,12 @@ technique. This is because you can instrument the grammar to catch errors at sel to recover and continue parsing. Panic mode recovery is really only useful in certain specialized applications where you might want to discard huge portions of the input text to find a valid restart point. -<H3><a name="ply_nn33"></a>5.9 Line Number and Position Tracking</H3> +<H3><a name="ply_nn33"></a>6.9 Line Number and Position Tracking</H3> + -Position tracking is often a tricky problem when writing compilers. By default, PLY tracks the line number and position of -all tokens. This information is available using the following functions: +Position tracking is often a tricky problem when writing compilers. +By default, PLY tracks the line number and position of all tokens. 
+This information is available using the following functions: <ul> <li><tt>p.lineno(num)</tt>. Return the line number for symbol <em>num</em> @@ -2452,9 +2661,11 @@ def p_expression(p): </pre> </blockquote> -As an optional feature, <tt>yacc.py</tt> can automatically track line numbers and positions for all of the grammar symbols -as well. However, this -extra tracking requires extra processing and can significantly slow down parsing. Therefore, it must be enabled by passing the +As an optional feature, <tt>yacc.py</tt> can automatically track line +numbers and positions for all of the grammar symbols as well. +However, this extra tracking requires extra processing and can +significantly slow down parsing. Therefore, it must be enabled by +passing the <tt>tracking=True</tt> option to <tt>yacc.parse()</tt>. For example: <blockquote> @@ -2463,8 +2674,9 @@ yacc.parse(data,tracking=True) </pre> </blockquote> -Once enabled, the <tt>lineno()</tt> and <tt>lexpos()</tt> methods work for all grammar symbols. In addition, two -additional methods can be used: +Once enabled, the <tt>lineno()</tt> and <tt>lexpos()</tt> methods work +for all grammar symbols. In addition, two additional methods can be +used: <ul> <li><tt>p.linespan(num)</tt>. Return a tuple (startline,endline) with the starting and ending line number for symbol <em>num</em>. @@ -2506,29 +2718,59 @@ def p_bad_func(p): </blockquote> <p> -Similarly, you may get better parsing performance if you only propagate line number -information where it's needed. For example: +Similarly, you may get better parsing performance if you only +selectively propagate line number information where it's needed using +the <tt>p.set_lineno()</tt> method. For example: <blockquote> <pre> def p_fname(p): 'fname : ID' - p[0] = (p[1],p.lineno(1)) + p[0] = p[1] + p.set_lineno(0,p.lineno(1)) </pre> </blockquote> -Finally, it should be noted that PLY does not store position information after a rule has been -processed. 
If it is important for you to retain this information in an abstract syntax tree, you -must make your own copy. +PLY doesn't retain line number information from rules that have already been +parsed. If you are building an abstract syntax tree and need to have line numbers, +you should make sure that the line numbers appear in the tree itself. -<H3><a name="ply_nn34"></a>5.10 AST Construction</H3> +<H3><a name="ply_nn34"></a>6.10 AST Construction</H3> -<tt>yacc.py</tt> provides no special functions for constructing an abstract syntax tree. However, such -construction is easy enough to do on your own. Simply create a data structure for abstract syntax tree nodes -and assign nodes to <tt>p[0]</tt> in each rule. +<tt>yacc.py</tt> provides no special functions for constructing an +abstract syntax tree. However, such construction is easy enough to do +on your own. -For example: +<p>A minimal way to construct a tree is to simply create and +propagate a tuple or list in each grammar rule function. There +are many possible ways to do this, but one example would be something +like this: + +<blockquote> +<pre> +def p_expression_binop(p): + '''expression : expression PLUS expression + | expression MINUS expression + | expression TIMES expression + | expression DIVIDE expression''' + + p[0] = ('binary-expression',p[2],p[1],p[3]) + +def p_expression_group(p): + 'expression : LPAREN expression RPAREN' + p[0] = ('group-expression',p[2]) + +def p_expression_number(p): + 'expression : NUMBER' + p[0] = ('number-expression',p[1]) +</pre> +</blockquote> + +<p> +Another approach is to create a set of data structures for different +kinds of abstract syntax tree nodes and assign nodes to <tt>p[0]</tt> +in each rule. For example: <blockquote> <pre> @@ -2564,8 +2806,12 @@ def p_expression_number(p): </pre> </blockquote> -To simplify tree traversal, it may make sense to pick a very generic tree structure for your parse tree nodes. 
-For example: +The advantage to this approach is that it may make it easier to attach more complicated +semantics, type checking, code generation, and other features to the node classes. + +<p> +To simplify tree traversal, it may make sense to pick a very generic +tree structure for your parse tree nodes. For example: <blockquote> <pre> @@ -2588,7 +2834,7 @@ def p_expression_binop(p): </pre> </blockquote> -<H3><a name="ply_nn35"></a>5.11 Embedded Actions</H3> +<H3><a name="ply_nn35"></a>6.11 Embedded Actions</H3> The parsing technique used by yacc only allows actions to be executed at the end of a rule. For example, @@ -2608,7 +2854,7 @@ symbols <tt>A</tt>, <tt>B</tt>, <tt>C</tt>, and <tt>D</tt> have been parsed. Sometimes, however, it is useful to execute small code fragments during intermediate stages of parsing. For example, suppose you wanted to perform some action immediately after <tt>A</tt> has -been parsed. To do this, you can write a empty rule like this: +been parsed. To do this, write an empty rule like this: <blockquote> <pre> @@ -2671,8 +2917,11 @@ def p_seen_AB(p): </pre> </blockquote> -an extra shift-reduce conflict will be introduced. This conflict is caused by the fact that the same symbol <tt>C</tt> appears next in -both the <tt>abcd</tt> and <tt>abcx</tt> rules. The parser can either shift the symbol (<tt>abcd</tt> rule) or reduce the empty rule <tt>seen_AB</tt> (<tt>abcx</tt> rule). +an extra shift-reduce conflict will be introduced. This conflict is +caused by the fact that the same symbol <tt>C</tt> appears next in +both the <tt>abcd</tt> and <tt>abcx</tt> rules. The parser can either +shift the symbol (<tt>abcd</tt> rule) or reduce the empty +rule <tt>seen_AB</tt> (<tt>abcx</tt> rule). 
<p> A common use of embedded rules is to control other aspects of parsing @@ -2696,10 +2945,14 @@ def p_new_scope(p): </pre> </blockquote> -In this case, the embedded action <tt>new_scope</tt> executes immediately after a <tt>LBRACE</tt> (<tt>{</tt>) symbol is parsed. This might -adjust internal symbol tables and other aspects of the parser. Upon completion of the rule <tt>statements_block</tt>, code might undo the operations performed in the embedded action (e.g., <tt>pop_scope()</tt>). +In this case, the embedded action <tt>new_scope</tt> executes +immediately after a <tt>LBRACE</tt> (<tt>{</tt>) symbol is parsed. +This might adjust internal symbol tables and other aspects of the +parser. Upon completion of the rule <tt>statements_block</tt>, code +might undo the operations performed in the embedded action +(e.g., <tt>pop_scope()</tt>). -<H3><a name="ply_nn36"></a>5.12 Yacc implementation notes</H3> +<H3><a name="ply_nn36"></a>6.12 Miscellaneous Yacc Notes</H3> <ul> @@ -2770,16 +3023,7 @@ each time it runs (which may take awhile depending on how large your grammar is) <blockquote> <pre> -yacc.parse(debug=1) -</pre> -</blockquote> - -<p> -<li>To redirect the debugging output to a filename of your choosing, use: - -<blockquote> -<pre> -yacc.parse(debug=1, debugfile="debugging.out") +yacc.parse(debug=1) </pre> </blockquote> @@ -2812,17 +3056,17 @@ machine. Please be patient. size of the grammar. The biggest bottlenecks will be the lexer and the complexity of the code in your grammar rules. </ul> -<H2><a name="ply_nn37"></a>6. Parser and Lexer State Management</H2> +<H2><a name="ply_nn37"></a>7. Multiple Parsers and Lexers</H2> In advanced parsing applications, you may want to have multiple -parsers and lexers. Furthermore, the parser may want to control the -behavior of the lexer in some way. +parsers and lexers. <p> -To do this, it is important to note that both the lexer and parser are -actually implemented as objects. 
These objects are returned by the -<tt>lex()</tt> and <tt>yacc()</tt> functions respectively. For example: +As a general rule, this isn't a problem. However, to make it work, +you need to carefully make sure everything gets hooked up correctly. +First, make sure you save the objects returned by <tt>lex()</tt> and +<tt>yacc()</tt>. For example: <blockquote> <pre> @@ -2831,7 +3075,8 @@ parser = yacc.yacc() # Return parser object </pre> </blockquote> -To attach the lexer and parser together, make sure you use the <tt>lexer</tt> argumemnt to parse. For example: +Next, when parsing, make sure you give the <tt>parse()</tt> function a reference to the lexer it +should be using. For example: <blockquote> <pre> @@ -2839,8 +3084,13 @@ parser.parse(text,lexer=lexer) </pre> </blockquote> -Within lexer and parser rules, these objects are also available. In the lexer, -the "lexer" attribute of a token refers to the lexer object in use. For example: +If you forget to do this, the parser will use the last lexer +created--which is not always what you want. + +<p> +Within lexer and parser rule functions, these objects are also +available. In the lexer, the "lexer" attribute of a token refers to +the lexer object that triggered the rule. For example: <blockquote> <pre> @@ -2868,7 +3118,7 @@ If necessary, arbitrary attributes can be attached to the lexer or parser object For example, if you wanted to have different parsing modes, you could attach a mode attribute to the parser object and look at it later. -<H2><a name="ply_nn38"></a>7. Using Python's Optimized Mode</H2> +<H2><a name="ply_nn38"></a>8. Using Python's Optimized Mode</H2> Because PLY uses information from doc-strings, parsing and lexing @@ -2891,9 +3141,110 @@ the tables without the need for doc strings. <p> Beware: running PLY in optimized mode disables a lot of error checking. You should only do this when your project has stabilized -and you don't need to do any debugging. -<H2><a name="ply_nn39"></a>8. 
Where to go from here?</H2> +and you don't need to do any debugging. One of the purposes of +optimized mode is to substantially decrease the startup time of +your compiler (by assuming that everything is already properly +specified and works). + +<H2><a name="ply_nn44"></a>9. Advanced Debugging</H2> + + +<p> +Debugging a compiler is typically not an easy task. PLY provides some +advanced diagnostic capabilities through the use of Python's +<tt>logging</tt> module. The next two sections describe this: + +<H3><a name="ply_nn45"></a>9.1 Debugging the lex() and yacc() commands</H3> + + +<p> +Both the <tt>lex()</tt> and <tt>yacc()</tt> commands have a debugging +mode that can be enabled using the <tt>debug</tt> flag. For example: + +<blockquote> +<pre> +lex.lex(debug=True) +yacc.yacc(debug=True) +</pre> +</blockquote> + +Normally, the output produced by debugging is routed to either +standard error or, in the case of <tt>yacc()</tt>, to a file +<tt>parser.out</tt>. This output can be more carefully controlled +by supplying a logging object. Here is an example that adds +information about where different debugging messages are coming from: + +<blockquote> +<pre> +# Set up a logging object +import logging +logging.basicConfig( + level = logging.DEBUG, + filename = "parselog.txt", + filemode = "w", + format = "%(filename)10s:%(lineno)4d:%(message)s" +) +log = logging.getLogger() + +lex.lex(debug=True,debuglog=log) +yacc.yacc(debug=True,debuglog=log) +</pre> +</blockquote> + +If you supply a custom logger, the amount of debugging +information produced can be controlled by setting the logging level. +Typically, debugging messages are either issued at the <tt>DEBUG</tt>, +<tt>INFO</tt>, or <tt>WARNING</tt> levels. + +<p> +PLY's error messages and warnings are also produced using the logging +interface. This can be controlled by passing a logging object +using the <tt>errorlog</tt> parameter. 
+ +<blockquote> +<pre> +lex.lex(errorlog=log) +yacc.yacc(errorlog=log) +</pre> +</blockquote> + +If you want to completely silence warnings, you can either pass in a +logging object with an appropriate filter level or use the <tt>NullLogger</tt> +object defined in either <tt>lex</tt> or <tt>yacc</tt>. For example: + +<blockquote> +<pre> +yacc.yacc(errorlog=yacc.NullLogger()) +</pre> +</blockquote> + +<H3><a name="ply_nn46"></a>9.2 Run-time Debugging</H3> + + +<p> +To enable run-time debugging of a parser, use the <tt>debug</tt> option to parse. This +option can either be an integer (which simply turns debugging on or off) or an instance +of a logger object. For example: + +<blockquote> +<pre> +log = logging.getLogger() +parser.parse(input,debug=log) +</pre> +</blockquote> + +If a logging object is passed, you can use its filtering level to control how much +output gets generated. The <tt>INFO</tt> level is used to produce information +about rule reductions. The <tt>DEBUG</tt> level will show information about the +parsing stack, token shifts, and other details. The <tt>ERROR</tt> level shows information +related to parsing errors. + +<p> +For very complicated problems, you should pass in a logging object that +redirects to a file where you can more easily inspect the output after +execution. + +<H2><a name="ply_nn39"></a>10. Where to go from here?</H2> The <tt>examples</tt> directory of the PLY distribution contains several simple examples. Please consult a |