CIL compilation hints and their effects

*Update 2008/02/20* I have just merged the CIL compiler branch into the trunk. The CompileToCil command is now officially part of Prexonite:
Prexonite source code (trunk)

In the last article, I presented the Prexonite CIL compiler and the huge performance improvements it comes with. Unfortunately, the compiled code has to be dynamically typed as the CIL compiler does not perform any data flow analysis and can therefore not possibly infer the correct types. It does not even create its own representation of the byte code program.

However, to say Prexonite Script (the language) is strictly dynamically typed is actually wrong, as the Prexonite compiler emits code for which the types and even the method overloads are known at compile time. It’s just that the virtual machine does not provide a way to take advantage of this knowledge.

One such example is the foreach loop, a construct that consists of

  • An expression (the list)
  • A block of statements
  • A left-hand value (the element)

and gets transformed into

var enum = $list$.GetEnumerator~IEnumerator;
while(enum.MoveNext())
{
    $element$ = enum.Current;
    $block$;
}

This pseudo code represents what is emitted by the Prexonite compiler for foreach AST nodes. It is clear that enum has to be at least of type IEnumerator in all cases. This information could enable the CIL compiler to statically type the variable enum, turning two late-bound calls (MoveNext and Current) into virtual calls.

CIL compilation hints

CIL compilation hints are basically a reverse mapping from byte code to AST nodes, reduced to the minimal amount of information required by the CIL compiler to emit optimized code. It is not that the whole AST is now encoded in the Meta tables of functions. Only nodes, for which the CIL compiler could generate better code, emit CIL hints.

One example is the foreach node, which emits the name of the enumerator variable and the addresses of the late-bound calls to be optimized. The CIL compiler decodes this information and performs the necessary steps. The enumerator variable for instance will be of type IEnumerator<PValue> and won’t be initialized ahead of time.

Impact on performance

The two main paradigms to interact with sequences in Prexonite are the combination of coroutines (sequence operators like where and map) and the use of foreach loops. While coroutines have the advantage of compose ability and deferred execution, foreach loops are usually faster.

Again, I used micro benchmarks to demonstrate the impact on performance. For practical reasons the number of iterations depends on the size of the set to iterate over in the inner loop. N = 200′000 makes the basis. With sets of 10 and 100 elements, N is reduced to 20′000 and 2′000 respectively.

Iterations over a set (Measurements)

What you are seeing here are performance improvements of 950 to 3′400%, but keep in mind that those are very specialized micro benchmarks and that unless your program exclusively consists of mindless foreach loops, you will not likely experience such speed-ups.

Nonetheless, iteration over lists is a very important aspect of many of the programs I have written in Prexonite Script.

Prexonite CIL Functions

Save the "What the f…" for later and just look at the two snippets below.

ldloc.1
ldc.i4.5
add
stloc.1

Listing 1: a = a + 5 in CIL assembler

ldloci  1
ldc.int 5
add
stloci  1

Listing 2: a = a + 5 in Prexonite assembler

On the left you see four CIL assembler op codes, while the other snippet represents the exact same program, just written in Prexonite byte code assembler. The fact that the two programs look so similar is no coincidence as the Prexonite virtual machine was actually modelled after the CIL’s execution model. This exact similarity can be exploited to make Prexonite a lot faster.

A Prexonite to CIL compiler

Now before you get too excited, Prexonite Script still is what they call a “Dynamic Language” and a lot of its features are implemented in the underlying Prexonite virtual machine instead of the language compiler. Also, Prexonite byte code is not statically typed, which makes a straight translation to CIL impossible without very sophisticated data flow analysis and complete type inference. As I am not familiar with either of these topics, I decided to keep the Prexonite functions untyped. This is where the PValue class comes into play. It encapsulates a dynamically typed piece of data and provides many methods to interact with the contained data via late binding.

In all cases, an implementation of a Prexonite function in CIL must show the exact same behaviour as the original, interpreted implementation. Functions that interact with Prexonite stack frames cannot be compiled to CIL as they are no longer executed on the virtual machine’s stack but the CLR’s instead. Therefore, CIL implementations must be able to exist alongside interpreted implementations and that as transparently as possible. Also, since the Prexonite virtual machine allows for code generation and manipulation at runtime, CIL implementations must be replaceable. This unfortunately also means that function calls inside CIL implementations cannot be statically linked as the target function might change the implementation strategy (interpreted, CIL) every moment.

How it’s done

Since the Prexonite to CIL compiler operates on Prexonite byte code, it would not make much sense to use the C# or VB CodeDOM and the corresponding compiler. Instead System.Reflection.Emit provides the necessary API. Since implementations must be replaceable, dynamic types are not an option and the so called lightweight functions are used.

The compiler is designed to operate at runtime, invoked by the running program itself. This is, because it analyses the whole application to identify functions that are not compatible with compilation to CIL. Such functions are marked with the Meta entry volatile.

The compilation process itself is actually quite straight forward. First the function is analysed in order to determine the number of temporary variables required, to build up a symbol table and to identify shared (via closures) and non-shared variables. Then the common function header is emitted including the creation of PVariable objects for shared variables and the initialisation of non-shared variables with PType.Null.

Then, the variables representing arguments are initialised with either PType.Null or the value supplied in the arguments array and finally the special variable args is set to a list of those same arguments if required by the function.

What follows is a huge loop that iterates over every instruction in the functions code and passes it into a giant switch statement, which translates every Prexonite byte code instruction into a series of CIL op codes.

Therefore, the CIL implementation of the program in Listing 2 will look like in the pseudo CIL in Listing 3.

As you can see, an untyped implementation of this simple program expands into quite some code. Notice that due to the absence of a rotation op code, the implementation requires temporary variables to insert the local stack context in the call to Addition.

ldloc var1
ldc.i4.5
box int32
call IntPType PType::get_Int()
newobj instance void PValue::.ctor(object, PType)
stloc temp1
ldloc sctx
ldloc temp1
call instance class PValue PValue::Addition(StackContext, PValue)
stloc var1

Listing 3: Actual CIL implementation of the program in Listing 2

Note: I have shortened the fully qualified type names for better readability.

Is it worth the effort?

As with all optimization techniques, we must ask ourselves whether the effort for implementing it is worth the gain in performance (be it memory or speed). At this point, let me just throw the results of an amateurish micro benchmark at you.

CIl_micro_benchmark

One can clearly see that CIL implementations are superior. They perform the same tasks in 60% (empty_loop) to 30% (rec_echo x 100) of the time required by the interpreted versions. Since the CIL compiler performs many of the Meta data lookups required for the creation of a stack frame at compile time, function calls to CIL implementations are much faster. Keep in mind though that only interpreted functions can take advantage of tail calls. To prevent an overflow of the managed stack, you should implement infinite recursive loops in interpreted functions.

Overall, you could say that compilation to CIL will result in a free performance improvement of over 65 percent in most cases.

function rec_echo(n) =
    if(n == 0)
        0
    else
        1 + rec_echo(n-1)
;
function rec_echo_direct(n,r) =
    if(n == 0)
        r
    else
        rec_echo(n-1,(r??0)+1)
;