Lexing & Parsing

In the final part of the project we will generate ASTs from textual representations of MiniJava programs.

Usage

java -jar mjavac.jar parse marshal inputProg.java out.xml

will write to out.xml a valid XML representation (per the schema) of the AST that corresponds to the input MiniJava program. (More about the XML representation of ASTs here.) Including line numbers in the resulting AST/XML file is completely optional.

In case of a parser/lexer error, no file should be created, an an error message starting with "Syntax error at line 11 of input." must be printed to stderr (not stdout!), where "11" is replaced by the line number where the error occurred. Use the supplied methods for error reporting as in the solution starter kit's parser example, including the code under scan with which captures exceptions thrown by the lexer. Make sure you report errors originating both from the lexer and the parser!

Examples

Here (see the readme). Our beloved AST examples also demonstrate successful parsing.

Starter Code

The solution starter kit includes a basic JFlex file, a basic CUP file (with methods for error handling), and an Ant build configuration that generates the lexical analyzer and parser from them before building. (Note that these files have been updated just before ex4 was announced; the modifications to the build file are slight and perform cleanup more properly.) You may also find the example from the recitation helpful.

MiniJava Grammar

The syntax of MiniJava is defined by the following BNF grammar. An <IDENTIFIER> is a sequence of letters, digits, and underscores, starting with a letter. Uppercase letters are distinguished from lowercase. An <INTEGER_LITERAL> is a sequence of decimal digits that denotes an integer value - it shouldn't start with 0 (which denotes an octal number, e.g. 012) or 0x (which denotes a hexadecimal number, e.g. 0xDEADBEAF).
For our project, /* comments */ are not nested, i.e., you need to implement /* */ comments, but the first */ terminates any open comment regardless of how many /* sequences have appeared before it. You are also required to support single-line // comments.

Which AST to Generate?

You must generate an AST that faithfully captures the syntax of the input program; printing the AST back to Java should be very similar to the original program, excepting indentation, spacing, and parentheses. For example, the order of classes in the AST's classdecls should be the same as the order of class declarations in the original program.

I am not currently aware of any scenario in which more than one AST corresponds to the program at the syntactic level (equivalence is not enough - in this phase we capture the syntax). Please let me know if you think of anything that might result in such a situation. That being said, we are planning to accommodate some minor variations if they represent the same program.

You do not need to include any line numbers in the ASTs you generate.

Parsing Semantically Invalid Programs

Your solution must successfully parse both valid programs and programs that would not pass semantic checking.

Some Things to Note

Anything that doesn't match the grammar is a syntax error that is captured at this stage. For example, defining a boolean[] variable is impossible according to the MiniJava, and your parser should identify this as a syntax error. (Indeed, there is no AST that can express this.)
You may need to massage the MiniJava grammar so that CUP can use it to produce a parser.
Take advantage of precedence and associativity declarations in the parser specification to keep the overall size of the parser grammar small. In particular, exp ::= exp op exp productions along with precedence and associativity declarations for various operators will shorten the specification considerably compared to a grammar that encodes that information in separate productions like expr, term, factor, etc. In any case, be sure to perform the same choices expected in standard Java concerning precedence and associativity.
Hint/warning: Be sure to test precedence and associativity carefully, and not just in the obvious cases involving operators like + and *. You should be sure that things like variables, array element references, parenthesized subexpressions, and method calls in expressions interact correctly with other operations and that the resulting ASTs have the right structure (your AST print visitor can be very helpful here). For example, a+b.f() must have a parse tree that corresponds to a+(b.f()), not (a+b).f().
Your grammar should not contain any reduce/reduce conflicts. Shift/reduce conflicts that generate warnings (but not errors) are allowed but discouraged.
Hint: Lists can be parsed by left recursion (cs ::= cs c | ) or right recursion (cs ::= c cs | ). Sometimes one would cause you conflicts that can disappear when you switch to the other in some non-terminal. Generally, a left recursion is preferable for efficiency reasons, but a right recursion more easily makes the grammar LR, so try right recursion when you encounter conflicts.
Note that example parser file supplied with the starter code is intended only to show how the tools work together, and methods for error reporting. The grammar given there is not a proper subset of the MiniJava grammar, and you should make appropriate changes as you create the full grammar.
In our lexical analysis, parsing and AST, several elements have somewhat different behavior than in Java. For example, main is not a regular function, and the word main can be a special token in your lexical analysis.
You can ask CUP to generate more debug information using dump flags; the demo in the recitation included these flags in build.xml.

Good luck!

E Pluribus Unum

Sit back and enjoy all your hard work coming together.
java -jar mjavac.jar parse compile inputProg.java out.ll
lli out.ll