GSoC 2022 Report: Implementing a Very Fast Parser

ㅤㅤㅤ

About Me

Hey there! I'm Akshansh Bhatt, a pre-final year (at the time of writing) undergraduate student pursuing a dual major in Physics and Electronics Engineering at BITS Pilani. I was part of the Google Summer of Code 2022 program as a student developer at PSF under the LPython sub-org. For the past 13 weeks, I, along with my mentors Thirumalai Shaktivel and Ondřej Čertík, have been working towards building the LPython compiler. More specifically, my project involved building the LPython's parser.

You can find the details related to my project here. For information related to my weekly progress in the program, you can check out my blog posts. I am available for discussions/feedback on this email address. This is my second project (in a row) in Google Summer of Code. In 2021, I had a project under SymPy, details of which you can find on this link.

Project Synopsis

My project broadly proposed to accomplish the following goals-

Write a fully functional Python parser using the Bison Parser Generator that could generate Abstract Syntax Tree (AST) for every valid Python code.
Test the parser's robustness by comparing the AST generated by the new parser against the ones generated by the CPython's parser on large codebases.
Benchmark and profile the parser's performance on typically complex cases and report critical slowdowns (if any).

Work Accomplished

PRs Merged

PR #1078 : [Parser] Fix a bug with indentation of blocks
PR #1070 : [Parser] Fix issue with different indentation type within the same file
PR #1068 : [Parser] Fix a bug in tokenizer related to docstring
PR #1053 : [Parser] Recognise starred_id as a valid id_item
PR #1052 : [Parser] Parse @=
PR #1050 : [Parser] Fix a bug with not in token
PR #995 : [Parser] Recognise tuple subscript
PR #994 : [Parser] Recognise lists and tuples in GenExpr
PR #993 : [Parser] Recognise ListComprehension Subscript
PR #979 : [Parser] Refactor TK_TYPE_COMMENT
PR #976 : [Parser] Fix bug with comment in multiline statements
PR #951 : [Parser] Add support for unicode identifiers
PR #948 : [Parser] Fix bug with non-ASCII characters in the new parser
PR #905 : [Parser] Fix bug with parsing key:value in separate lines
PR #902 : [Parser] Parse lambda expressions
PR #899 : [Parser] Parse multiple statements in single line as body of multiline statements
PR #885 : [Parser] Add support for GeneratorExp nodes (new parser)
PR #850 : [Parser] Properly handle functions with * and / in the new parser
PR #848 : [Parser] Revert BoolOp node generation in the new parser
PR #847 : [Parser] Add support for parsing strings with all possible stringprefix
PR #845 : [Parser] Parse type_comment in function args
PR #830 : [Parser] Add support for parsing nested function calls
PR #810 : [Parser] Recognise default values (keywords) in class-def
PR #809 : [Parser] Add support for parsing tuples in for-loop
PR #808 : [Parser] Fix issue 803
PR #801 : [Parser] Add type-comment support in new parser
PR #751 : [Parser] Parse indexing operations on attributes
PR #734 : [Parser] Parse raw strings
PR #724 : [Parser] Parse type-comments in for-loops
PR #709 : [Parser] Parsing support for in (Membership Operator)
PR #686 : [Parser] Parse list comprehension
PR #682 : [Parser] Parse conditional expressions
PR #624 : [Parser] Add support for parsing prefixed strings
PR #619 : [Parser] Fix issues with comment and indentation
PR #578 : [Parser] Rules for not in
PR #557 : [Parser] Rules for expressions containing is and is not
PR #538 : [Parser] Parsing rules for ... and None
PR #406 : [Parser] Implement single line if statement
PR #400 : [Parser] Implement parsing rules for functions
PR #363 : Fix the dedent token generation in new tokenizer

Issues Closed

Issue #974 : Do not allow mixed (spaces & tab) indent
Issue #1023 : Tokenizer error in the string
Issue #1027 : Starred id is not recognized in the comprehension
Issue #1025 : @= is not handled in the expression
Issue #1024 : Not unexpected in the expression
Issue #970 : Tuple Subscript is not recognized
Issue #968 : List is not recognized in the GeneratorExp
Issue #969 : ListComprehension Subscript is not recognized
Issue #972 : Type Comment for the Function decorator is not recognized
Issue #973 : Comment after the Singleline Multi statement is not recognized
Issue #916 : Non-ascii name is not recognized by the old parser
Issue #886 : Singleline multi-statements are not recognized
Issue #513 : [New] Parse / and * as a parameter in function definition
Issue #791 : [New] Nested Call is not recognized
Issue #787 : [New] Default value (Argument assignment) is not recognized in Class statement
Issue #779 : [Parser] Tuples are not parsed in the for-loop (new parser)
Issue #803 : [Parser] Error in parsing function definition (new parser)
Issue #509 : [New] Parse Comments after and inside Multiline statements
Issue #612 : [Tokenizer] Segfault during tokenization
Issue #617 : [Tokenizer] Newline token not returned after comments
Issue #485 : TODO: Extract type hints from the comments
Issue #511 : [New] Parse IN and IS in the expression
Issue #508 : Parse Single line if and for statements
Issue #510 : [New] Parse Ellipsis and None
Issue #358 : New tokenizer fails when there are multiple indentations

Work in Progress

Reporting any new issue related to the parser that crops up while testing large codebases. Till now, we have reported all the issues while parsing all the Python files in SymPy and NumPy's codebases, respectively. Currently, we are testing all the Python files in CPython's codebase and we are planning to test even more popular repositories so that every corner case is reported and fixed ASAP.
Closing all the issues under the Parser tag.

Outcome

The new parser is significantly faster than CPython's native parser. To demostrate this, I parsed the entire codebase of SymPy using both the new and the old parser, and the results are phenomenal!

I wrote a simple script to parse all the Python files in SymPy's codebase. For simplicity, I have broken it down into 4 separate scripts that can be run independently. I have written instructions in form of comments. You can try it out yourself!

Script 1

''' Part 1 : List all Python file from the root of sympy '''
import os

with open("testall.txt", "r") as f:
    counter = 1
    for root, dirs, files in os.walk("."):
        for file in files:
            if file.endswith(".py"):
                f.write(os.path.join(root, file))
                f.write("\n")
                counter += 1
print("Total:", counter)

Script 2

''' Part 2 : Test and benchmark all the files with old parser, new parser and AST module. Segregate the results into files '''
import os

BINARY_PATH = "/Users/akshansh/Documents/GitHub/lpython/src/bin/lpython"  # Replace it with your path for LPython's binary.

with open("testall.txt", "r") as f:
    for line in f:
        # Generates the AST of the python file using the OLD PARSER (Comment out the two lines below if you want to benchmark the OLD PARSER)
        os.system(f'''if {BINARY_PATH} --show-ast --no-color {line[:-1]} &> temp ; then echo "{line[:-1]}" >> passed_old.txt; else echo "{line[:-1]}" >> failed_old.txt; fi''')
        # Generates the AST of the python file using the NEW PARSER (Comment out the other two lines if you want to benchmark the NEW PARSER)
        os.system(f'''if {BINARY_PATH} --show-ast --no-color --new-parser {line[:-1]} &> temp ; then echo "{line[:-1]}" >> passed_new.txt; else echo "{line[:-1]}" >> failed_new.txt; fi''')
        # Generates the AST of the python file using the PYTHON AST MODULE (Comment out the other two lines if you want to benchmark CPython's AST MODULE)
        os.system(f'''if python3.11 -m ast {line[:-1]} &> temp ; then echo "{line[:-1]}" >> passed_python.txt; else echo "{line[:-1]}" >> failed_python.txt; fi''')
    os.system('rm temp')

Script 3

''' Part 3 : Find all Common files that run fine on both the old and the new parsers '''
os.system("comm -12 <(sort passed_old.txt) <(sort passed_new.txt) > passed_both.txt")
os.system("comm -12 <(sort failed_old.txt) <(sort failed_new.txt) > failed_both.txt")
os.system("comm -23 <(sort failed_old.txt) <(sort failed_new.txt) > failed_only_old_not_new.txt")
os.system("comm -13 <(sort failed_old.txt) <(sort failed_new.txt) > failed_only_new_not_old.txt")

Script 4

''' Part 4 : Generate diffs and find files which generate same output with both the parsers '''
import os

with open("passed_both.txt", "r") as f:
    for line in f:
        # os.system(f'/Users/akshansh/Documents/GitHub/lpython/src/bin/lpython --show-ast --no-color --show-ast {line[:-1]} > temp')
        os.system(f'''lpdiff {line[:-1]} {line[:-1]
          .replace("/Users/akshansh/Documents/GitHub/sympy/","")
          .replace(".py", "")
          .replace("/", "-")}''')

After generating all the files using the first script, I benchmarked all the files using all the three methods available for AST generation in the second script (see the comments) and these were the results (Note that testall.py is Script 2) -

# For the new Parser
$ time python3.11 testall.py
python3.11 testall.py  24.40s user 3.32s system 95% cpu 29.128 total

# Old Parser  (skipping files such as 687, 698) [With timeout]
$ time python3.11 testall.py
python3.11 testall.py  222.49s user 47.07s system 52% cpu 8:33.97 total

# Python's AST Module
$ time python3.11 testall.py
python3.11 testall.py  39.69s user 8.03s system 95% cpu 50.186 total

The new parser parsed the entire codebase in just 29 seconds!. This is an incredible boost in performance. It is more than 17.6 times faster than the old parser and about 2 times faster than the state-of-the-art pegen parser of CPython! To prove that this is not a fluke and that I am not cherrypicking data, you can try this out yourself on any Python codbase and verify the validity.

Future Work

Porting the new parser (Bison) as the main parser for LPython is one of the top priorities right now. After ironing out the left out bugs a little bit more, we will be confident enough that almost every python code will be parsed as anticipated. We can then make the --show-ast command show the AST generated by the new parser and consequently retire the current makeshift parser.
Generating code from AST (unparsing operation).
Better and consistent benchmarking metrics and tools for the new parser.

Conclusion

I had an amazing experience working as a compiler developer this summer. Honestly, I was a bit worried initially because I have never completed any course related to compilers and generally had notions that the inner workings of any compiler are pretty perplexing. I was a bit skeptical about my potential to complete this project, but I am glad I could finish it as planned and learn the best industry practices along the way. I want to thank the developers at LPython for giving me a chance to work on such a pivotal project, especially my mentors - Thirumalai Shaktivel and Ondřej Čertík, for guiding me in every way possible and being available whenever I faced any difficulty.

All this was just the beginning. I plan to contribute even more to LPython in the months to come. LPython's development has been going on in full swing, and day by day, we are taking one step towards the goals set for this compiler. It feels surreal to manifest what we have achieved in such a short duration. I am excited about the future of this project!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

GSoC 2022 Report: Implementing a Very Fast Parser

About Me

Project Synopsis

Work Accomplished

PRs Merged

Issues Closed

Work in Progress

Outcome

Future Work

Conclusion

Files

README.md

Latest commit

History

README.md

File metadata and controls

GSoC 2022 Report: Implementing a Very Fast Parser

About Me

Project Synopsis

Work Accomplished

PRs Merged

Issues Closed

Work in Progress

Outcome

Future Work

Conclusion