In this part of the project we will translate a MiniJava AST into equivalent code in the intermediate representation used by the LLVM compiler project.
java -jar mjavac.jar unmarshal compile inputProg.xml out.ll
The LLVM language is documented in the LLVM Language Reference Manual, although you will use only a subset of the instructions, which is described below.
Interpreted directly using the lli
tool:
lli <prog>.ll
On Ubuntu, it can be installed by sudo apt install llvm-runtime
.
Make sure lli -version
prints version 6.0.0
or higher (otherwise the distribution you're using is too old).
Compiled to the native assembly and executed:
clang -o out code.ll
./out
On Ubuntu, it can be installed by sudo apt install clang
.
Make sure clang --version
prints version 6.0.0
or higher (otherwise the distribution you're using is too old).
There is also an online interpreter by TIO.
i1
- a single bit, used for booleans (practically takes up one byte)i8
- a single bytei8*
- similar to a char*
pointer in Ci32
- a single integeri32*
- a pointer to an integer, can be used to point to an integer array[20 x i8]
- a constant array of 20 charactersdeclare
is used for the declaration of external methods. Only a few specific methods (e.g., calloc
, printf
) need to be declared.
Example: declare i32 @puts(i8*)
define
is used for defining our own methods. The return and argument types need to be specified, and the method needs to end with a ret
instruction of the same type.
Example: define i32 @main(i32 %argc, i8** argv) {...}
ret
is the return instruction. It is used to return the control flow and a value to the caller of the current function. Example: ret i32 %rv
alloca
is used to allocate space on the stack of the current function for local variables. It returns a pointer to the given type. This space is freed when the method returns.
Example: %ptr = alloca i32
store
is used to store a value to a memory location. The parameters are the value to be stored and a pointer to the memory.
Example: store i32 %val, i32* %ptr
load
is used to load a value from a memory location. The parameters are the type of the value and a pointer to the memory.
Example: %val = load i32, i32* %ptr
call
is used to call a method. The result can be assigned to a register. (LLVM bitcode temporary variables are called "registers".) The return type and parameters (with their types) need to be specified.
Example: %result = call i8* @calloc(i32 1, i32 %val)
add, and, sub, mul, xor
are used for mathematical operations. The result is the same type as the operands.
Example: %sum = add i32 %a, %b
icmp
is used for comparing two operands. icmp slt
for instance does a signed comparison of the operands and will return i1 1
if the first operand is less than the second, otherwise i1 0
.
Example: %case = icmp slt i32 %a, %b
br
with a i1
operand and two labels will jump to the first label if the i1
is one, and to the second label otherwise.
Example: br i1 %case, label %if, label %else
br
with only a single label will jump to that label.
Example: br label %goto
label:
declares a label with the given name. The instruction before declaring a label needs to be a br
operation, even if that br
is simply a jump to the label.
Example: label123:
bitcast
is used to cast between different pointer types. It takes the value and type to be cast, and the type that it will be cast to.
Example: %ptr = bitcast i32* %ptr2 to i8**
getelementptr
is used to get the pointer to an element of an array from a pointer to that array and the index of the element. The result is also a pointer to the type that is passed as the first parameter (in the case below it's an i8*
). This example is like doing ptr_idx = &ptr[idx]
in C (you still need to do a load
to get the actual value at that position).
Example: %ptr_idx = getelementptr i8, i8* %ptr, i32 %idx
constant
is used to define a constant, such as a string. The size of the constant needs to be declared too. In the example below, the string is 12 bytes ([12 x i8]
). The result is a pointer to the given type (in the example below, @.str
is a [12 x i8]*
).
Example: @.str = constant [12 x i8] c"Hello world\00"
global
is used for declaring global variables - something you will need to do for creating v-tables. Just like constant
, the result is a pointer to the given type.
Example:
@.vtable = global [2 x i8*] [i8* bitcast (i32 ()* @func1 to i8*), i8* bitcast (i8* (i32, i32*)* @func2 to i8*)]
phi
is used for selecting a value from previous basic blocks, depending on which one was executed before the current block. Phi instructions must be the first in a basic block. It takes as arguments a list of pairs. Each pair contains the value to be selected and the predecessor block for that value. This is necessary in single-assignment languages, in places where multiple control-flow paths join, such as if-else statements, if one wants to select a value from the different paths. In the context of the exercise, you will need this for short-circuiting and (&&) expressions.
Example:
br i1 1, label %lb1, label %lb2
lb1:
%a = add i32 0, 100
br label %lb3
lb2:
%b = add i32 0, 200
br label %lb3
lb3:
%c = phi i32 [%a, %lb1], [%b, %lb2]
phi
instruction belongs (very peculiar things may happen if they aren't).
Assume that the given MiniJava program passes all the semantic checks for Java programs, and also satisfies all the semantic restrictions special to MiniJava. The full list of checks is listed in the next exercise, where we will enforce these restrictions; in this exercise we will simply assume that such programs have already been filtered out. The behavior of your compiler on MiniJava programs that do not pass all of these checks is undefined.
In contrast, when the input MiniProgram is valid, the behavior of compiler must be defined: it must produce an LLVM assembly/IR program that type checks (passes the semantic checks of LLVM), and has a defined behavior (as defined by the LLVM reference manual) that is is equivalent to the MiniJava program.
In particular, accessing an array out of bounds results in undefined behavior.
Thus, you will need to check each array access in order not to write or read beyond the limits of an array. If an illegal read/write is attempted, you will print the message "Out of bounds" and the program will exit (you may call the @throw_oob
provided below for that). Of course, you need to know the length of an array for that; instrument the code to perform these checks dynamically.
You will also need to check if an array is allocated with a negative length, and do the same process as above in that case.
We allow one exception to this rule: for simplicity, you do not need to implement NullPointerException
-like functionality. If a field in a class is an object and this field and it is null
(this is how fields are initialized), then method invocations on this field may crash the program.
@calloc
will leak since you're not implementing garbage collection. That's for another course :)Our target language, LLVM assembly/IR, is statically-typed: you will have to specify in the resulting code the type of each variable and function. This means that you will have to know the static type of each variable and method argument in the MiniJava program. Fortunately, MiniJava is also explicitly statically-typed, and this information is always declared. As above, you should assume the type safety of the program. This means that you can happily use the declared type (for example, without checking that the argument provided to a method indeed matches the type in the declaration) - if the input MiniJava program is not well-typed, it's perfectly OK for the resulting LLVM code to contain type errors (or crash).
One of the important distinctions in object-oriented languages is between the static and dynamic type of an object. For example, in A v = new B()
, the static type of v
is A
and the dynamic is B
. At runtime, method invocation is performed according the "actual" (dynamic) type - if B
overrides methods defined in A
, the methods defined in B
will be executed when invoked via v
. Since the dynamic type is not known at compile time, this is implemented using vtables (which we cover in class).
However, to correctly implement a method invocation you must know the static type of the object on which it is invoked; roughly, you need to know to which class hierarchy the object belongs in order access the vtable correctly. To this end, exploit the restriction that in MiniJava, method invocations can only be performed on an object that is the result of very simple expressions: in o.f(arg1, ...)
expressions, o
must be either a local variable, a field, a new A()
, or this
, but more complex expressions for o
are disallowed. In all these cases, the static type is clear (in the variable declaration or in the expression itself), so you can use it to generate code.
@calloc
, @print_int
and @throw_oob
.
declare i8* @calloc(i32, i32)
declare i32 @printf(i8*, ...)
declare void @exit(i32)
@_cint = constant [4 x i8] c"%d\0a\00"
@_cOOB = constant [15 x i8] c"Out of bounds\0a\00"
define void @print_int(i32 %i) {
%_str = bitcast [4 x i8]* @_cint to i8*
call i32 (i8*, ...) @printf(i8* %_str, i32 %i)
ret void
}
define void @throw_oob() {
%_str = bitcast [15 x i8]* @_cOOB to i8*
call i32 (i8*, ...) @printf(i8* %_str)
call void @exit(i32 1)
ret void
}
alloca
and keep the address in a register. You will use the load
and store
instructions to read and write to that local variable.%_1
, %_2
, etc.