PLY – A patch for file parsing

PLY is a Python implementation of the lex and yacc parsing tools. Defining a parser in PLY is quite simple, which is a key feature for a high-level language such as Python. One shortcoming I noticed in PLY is that it cannot parse a file efficiently (maybe I am wrong; in that case, I apologize in advance). Specifically, the yacc.parse() method takes an input parameter, which the lexer uses to identify tokens. Tokens are identified by applying Python regular expressions, so the input must be a string. Thus, the only way to parse a file is to read it into a string and then invoke yacc.parse() on that string. This is inefficient for large files, since the whole content must be held in memory at once. Another approach is to constrain parsed objects to fit on a single line, but this is a good idea only for simple parsers.

The patch

The solution proposed here is to modify the parse methods so that they feed the lexer one line at a time. When the lexer returns a None token, the parser reads the next line and passes it to the lexer, until the end of the file is reached.
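
The idea can be illustrated outside PLY with a minimal, self-contained sketch. It is not the patch itself: the token pattern and the function name tokens_from_stream are hypothetical, chosen only to mimic a calc-like language. It shows a tokenizer that pulls one line at a time from any object with a readline() method, instead of requiring the whole input as a single string. One difference is deliberate: the patch, as written, appears to read just one further line when the lexer runs dry, so a line that yields no tokens (for example a blank line) would presumably be taken as the end of input, whereas this sketch keeps reading until readline() returns an empty string.

```python
import io
import re

# Hypothetical token pattern: optional whitespace, then an integer or an operator.
TOKEN_RE = re.compile(r"\s*(?:(\d+)|([-+*/()]))")


def tokens_from_stream(stream):
    """Yield (type, value) pairs, reading the stream one line at a time."""
    line = stream.readline()
    while line != "":                      # readline() returns "" only at end of input
        text = line.rstrip()               # drop the newline and trailing blanks
        pos = 0
        while pos < len(text):
            match = TOKEN_RE.match(text, pos)
            if match is None:
                raise ValueError("illegal character %r" % text[pos])
            number, operator = match.groups()
            if number is not None:
                yield ("NUMBER", int(number))
            else:
                yield ("OP", operator)
            pos = match.end()
        line = stream.readline()           # current line exhausted: fetch the next


# Only one line of the input is tokenized at any time.
demo = io.StringIO("1 +\n\n2 +\n3\n")
demo_tokens = list(tokens_from_stream(demo))
```

Any object exposing readline(), such as an open file or sys.stdin, could be passed in place of the io.StringIO buffer used here.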

Only a few lines of code in the file ply/yacc.py need to be changed. The patch is shown below.

--- ply-3.3-orig/ply/yacc.py	2009-09-02 16:27:23.000000000 +0200
+++ ply-3.3-patched/ply/yacc.py	2010-04-13 11:30:28.753330468 +0200
@@ -303,7 +303,7 @@
 
         # If input was supplied, pass to lexer
         if input is not None:
-            lexer.input(input)
+            lexer.input(input.readline())
 
         if tokenfunc is None:
            # Tokenize function
@@ -341,6 +341,11 @@
             if not lookahead:
                 if not lookaheadstack:
                     lookahead = get_token()     # Get the next token
+                    
+                    # If we finished with the current line, then process the next
+                    if lookahead is None and input is not None:
+                        lexer.input(input.readline());
+                        lookahead = get_token()     # Get the next token
                 else:
                     lookahead = lookaheadstack.pop()
                 if not lookahead:
@@ -614,7 +619,7 @@
 
         # If input was supplied, pass to lexer
         if input is not None:
-            lexer.input(input)
+            lexer.input(input.readline())
 
         if tokenfunc is None:
            # Tokenize function
@@ -647,6 +652,11 @@
             if not lookahead:
                 if not lookaheadstack:
                     lookahead = get_token()     # Get the next token
+
+                    # If we finished with the current line, then process the next
+                    if lookahead is None and input is not None:
+                        lexer.input(input.readline());
+                        lookahead = get_token()     # Get the next token 
                 else:
                     lookahead = lookaheadstack.pop()
                 if not lookahead:
@@ -886,7 +896,7 @@
 
         # If input was supplied, pass to lexer
         if input is not None:
-            lexer.input(input)
+            lexer.input(input.readline())
 
         if tokenfunc is None:
            # Tokenize function
@@ -919,6 +929,11 @@
             if not lookahead:
                 if not lookaheadstack:
                     lookahead = get_token()     # Get the next token
+
+                    # If we finished with the current line, then process the next
+                    if lookahead is None and input is not None:
+                        lexer.input(input.readline());
+                        lookahead = get_token()     # Get the next token
                 else:
                     lookahead = lookaheadstack.pop()
                 if not lookahead:

To apply the patch:

  1. Download and unpack PLY from the author's site (the patch was made against version 3.3).
  2. Change to the ply-3.3/ply directory.
  3. Save the patch above in a file named ply.patch in this directory.
  4. Run patch <ply.patch to apply the patch.

Usage examples

Let’s try the patched version with the ply-3.3/example/calc.py example. Replace the while loop with the following code.

f = open('data.txt')
yacc.parse(f)
f.close()

In the code above, data.txt is the file to be parsed. The file can contain any valid expression, possibly split across multiple lines (because the lexer defined in calc.py ignores newlines). For example, if the file is

1 +
2 +
3 +
4 +
5

the output of calc.py will be

$ python calc.py 
15

To parse the standard input, just replace the file with the sys.stdin object (after importing the sys module).

Parsing a string is still possible by using the StringIO module of Python, as shown below.

import StringIO
yacc.parse(StringIO.StringIO("1 + 2 + 3 + 4 + 5"))
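
The StringIO module used above exists only in Python 2. On Python 3 the equivalent class is io.StringIO. Assuming the patched parser needs nothing from its input beyond readline() (which is all the patch calls), the following sketch shows the behaviour it would rely on, in particular the empty string that marks the end of the buffer.

```python
import io

# io.StringIO is the Python 3 counterpart of Python 2's StringIO.StringIO.
source = io.StringIO("1 + 2 +\n3 + 4 + 5")

first = source.readline()    # first line, including its trailing newline
second = source.readline()   # last line, which has no trailing newline
at_end = source.readline()   # empty string: the buffer is exhausted
```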

Parsing each line independently is also easy to achieve.

import StringIO
f = open('data2.txt')
line = f.readline()
while line != "":
    yacc.parse(StringIO.StringIO(line))
    line = f.readline()
f.close()

If data2.txt contains

1 + 1
2 + 2
3 + 3 
4 + 4
5

the output will be

$ python calc.py 
2
4
6
8
5
