Prexonite May 2008 Update

Update

There have not been any significant changes since the introduction of the CIL compiler into Prexonite, yet the current version comes with a number of performance optimizations regarding the generated CIL byte code.

The majority of the built-in commands and types now implement the ICilCompilerAware interface, which the CIL compiler uses to let commands and types emit highly customized code. Calling println with no arguments, for instance, results in a static call to void System::Console.WriteLine() directly in the compiled method.

Similarly, type expressions in CIL functions are no longer implemented via type expression parsing but by directly referencing the corresponding singleton PType objects.

But the most important improvement is the ability to statically link Prexonite function calls in CIL-compiled methods, which makes yet another hashtable lookup redundant at the cost of additional memory: a dynamically generated class has a static field for each and every function used by the compiled application. This can be a problem if you plan to re-compile your CIL implementations, as dynamic types, unlike dynamic functions, cannot be garbage collected by design. It is, however, possible to disable the generation of such a class by passing false to CompileToCil.
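The trade-off between a per-call hashtable lookup and static linking can be modelled in a few lines of Python. This is only a sketch of the idea, not Prexonite code: FunctionTable, call_dynamic and link_statically are hypothetical names invented for the illustration.

```python
# Illustrative model only: FunctionTable, call_dynamic and link_statically
# are hypothetical names and are not part of Prexonite.

class FunctionTable:
    """Maps function names to implementations, like a hashtable in the VM."""

    def __init__(self):
        self._functions = {}
        self.lookups = 0  # counts hashtable lookups for demonstration

    def define(self, name, impl):
        self._functions[name] = impl

    def resolve(self, name):
        self.lookups += 1
        return self._functions[name]


def call_dynamic(table, name, *args):
    """Dynamically linked call: one hashtable lookup per call."""
    return table.resolve(name)(*args)


def link_statically(table, names):
    """Resolve every name once and store the results in a generated class.

    The generated class stands in for the dynamically generated type with
    one static field per function. It costs memory for as long as it lives.
    """
    fields = {name: staticmethod(table.resolve(name)) for name in names}
    return type("LinkedFunctions", (), fields)


table = FunctionTable()
table.define("double", lambda x: 2 * x)

# Three dynamically linked calls cause three lookups.
for i in range(3):
    call_dynamic(table, "double", i)
dynamic_lookups = table.lookups

# Linking costs one lookup per function; the calls themselves cost none.
linked = link_statically(table, ["double"])
lookups_before = table.lookups
results = [linked.double(i) for i in range(3)]
static_lookups = table.lookups - lookups_before
```

In this model the lookup count stays flat once the functions are linked, while the generated class is the extra memory the paragraph above refers to.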

And on a side note: the frequently used library function struct has been implemented as a compiler hook for improved performance. Resolving the members at compile time not only saves run time but also removes the need for dynamic lookups, which in turn enables the use of CIL compilation for struct functions. This is especially helpful for immutable structs.

CIL compilation hints and their effects

*Update 2008/02/20* I have just merged the CIL compiler branch into the trunk. The CompileToCil command is now officially part of Prexonite:
Prexonite source code (trunk)

In the last article, I presented the Prexonite CIL compiler and the huge performance improvements it comes with. Unfortunately, the compiled code has to be dynamically typed as the CIL compiler does not perform any data flow analysis and can therefore not possibly infer the correct types. It does not even create its own representation of the byte code program.

However, to say Prexonite Script (the language) is strictly dynamically typed is actually wrong, as the Prexonite compiler emits code for which the types and even the method overloads are known at compile time. It’s just that the virtual machine does not provide a way to take advantage of this knowledge.

One such example is the foreach loop, a construct that consists of

  • An expression (the list)
  • A block of statements
  • A left-hand value (the element)

and gets transformed into

var enum = $list$.GetEnumerator~IEnumerator;
while(enum.MoveNext())
{
    $element$ = enum.Current;
    $block$;
}

This pseudo code represents what is emitted by the Prexonite compiler for foreach AST nodes. It is clear that enum has to be at least of type IEnumerator in all cases. This information could enable the CIL compiler to statically type the variable enum, turning two late-bound calls (MoveNext and Current) into virtual calls.
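The transformation above can be mirrored in Python to make the roles of MoveNext and Current explicit. This is a sketch, not the Prexonite implementation: ListEnumerator and foreach are hypothetical stand-ins for an IEnumerator and the code the compiler emits.

```python
class ListEnumerator:
    """Minimal stand-in for an IEnumerator over a Python list (illustrative)."""

    def __init__(self, items):
        self._items = items
        self._index = -1  # positioned before the first element

    def move_next(self):
        """Advance to the next element; return False when the list is exhausted."""
        self._index += 1
        return self._index < len(self._items)

    @property
    def current(self):
        """The element at the current position."""
        return self._items[self._index]


def foreach(items, block):
    """Desugared foreach: fetch an enumerator, then loop while it advances."""
    enumerator = ListEnumerator(items)   # var enum = $list$.GetEnumerator
    while enumerator.move_next():        # while(enum.MoveNext())
        element = enumerator.current     # $element$ = enum.Current
        block(element)                   # $block$


collected = []
foreach([1, 2, 3], collected.append)
```

Because the enumerator is always created by the same construct, its type is known wherever move_next and current are used, which is the knowledge a statically typed variable would capture.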

CIL compilation hints

CIL compilation hints are basically a reverse mapping from byte code to AST nodes, reduced to the minimal amount of information required by the CIL compiler to emit optimized code. This does not mean that the whole AST is now encoded in the Meta tables of functions. Only nodes for which the CIL compiler could generate better code emit CIL hints.

One example is the foreach node, which emits the name of the enumerator variable and the addresses of the late-bound calls to be optimized. The CIL compiler decodes this information and performs the necessary steps. The enumerator variable for instance will be of type IEnumerator<PValue> and won’t be initialized ahead of time.

Impact on performance

The two main paradigms for interacting with sequences in Prexonite are the combination of coroutines (sequence operators like where and map) and the use of foreach loops. While coroutines have the advantage of composability and deferred execution, foreach loops are usually faster.
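Python generators behave much like such coroutines, so they can illustrate what composability and deferred execution mean here. This is a sketch under that assumption: where, mapped, limit and traced are hypothetical helpers written for this example and are not the Prexonite commands.

```python
def where(predicate, source):
    """Yield only the items that satisfy the predicate."""
    for item in source:
        if predicate(item):
            yield item


def mapped(func, source):
    """Yield func(item) for every item of the source."""
    for item in source:
        yield func(item)


def limit(count, source):
    """Yield at most count items, then stop without pulling any further."""
    if count <= 0:
        return
    for position, item in enumerate(source, start=1):
        yield item
        if position >= count:
            return


log = []


def traced(n):
    """Yield 0..n-1 and record every item that is actually produced."""
    for i in range(n):
        log.append(i)
        yield i


# Composing the operators does not run anything yet (deferred execution).
pipeline = limit(2, mapped(lambda x: x * x,
                           where(lambda x: x % 2 == 1, traced(100))))
produced_before = list(log)

# Consuming the pipeline pulls only as many source items as needed.
squares = list(pipeline)

# The same computation as a plain loop over the whole source.
loop_squares = []
for x in range(100):
    if x % 2 == 1:
        loop_squares.append(x * x)
    if len(loop_squares) == 2:
        break
```

The pipeline is assembled from small reusable parts, while the loop fuses everything into one body. That fusion is what usually makes the loop faster and the pipeline easier to rearrange.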

Again, I used micro benchmarks to demonstrate the impact on performance. For practical reasons, the number of iterations N depends on the size of the set iterated over in the inner loop. N = 200′000 is the baseline; with sets of 10 and 100 elements, N is reduced to 20′000 and 2′000 respectively.

Iterations over a set (Measurements)

What you are seeing here are performance improvements of 950 to 3′400%, but keep in mind that these are very specialized micro benchmarks. Unless your program consists exclusively of mindless foreach loops, you are unlikely to experience such speed-ups.

Nonetheless, iteration over lists is a very important aspect of many of the programs I have written in Prexonite Script.

Prexonite CIL Functions

Save the "What the f…" for later and just look at the two snippets below.

ldloc.1
ldc.i4.5
add
stloc.1

Listing 1: a = a + 5 in CIL assembler

ldloci  1
ldc.int 5
add
stloci  1

Listing 2: a = a + 5 in Prexonite assembler

Listing 1 shows four CIL assembler op codes, while Listing 2 represents the exact same program, just written in Prexonite byte code assembler. The fact that the two programs look so similar is no coincidence, as the Prexonite virtual machine was actually modelled after the CIL’s execution model. This similarity can be exploited to make Prexonite a lot faster.
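Both listings describe the same stack machine behaviour, which a tiny interpreter makes concrete. The following Python sketch assumes a simplified instruction set of four op codes; run and its (opcode, operand) encoding are hypothetical and model neither the CLR nor the Prexonite VM in detail.

```python
def run(program, local_vars):
    """Execute a list of (opcode, operand) pairs on an evaluation stack."""
    stack = []
    for opcode, operand in program:
        if opcode == "ldloc":        # push the value of a local variable
            stack.append(local_vars[operand])
        elif opcode == "ldc":        # push a constant
            stack.append(operand)
        elif opcode == "add":        # pop two values, push their sum
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        elif opcode == "stloc":      # pop a value into a local variable
            local_vars[operand] = stack.pop()
        else:
            raise ValueError(f"unknown op code: {opcode}")
    return local_vars


# a = a + 5, with a stored in local variable 1 and initially set to 10
program = [("ldloc", 1), ("ldc", 5), ("add", None), ("stloc", 1)]
locals_after = run(program, [0, 10])
```

Each of the four instructions maps to one branch of the interpreter, regardless of which of the two assemblers the program was written in.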

A Prexonite to CIL compiler

Now before you get too excited, Prexonite Script still is what they call a “Dynamic Language” and a lot of its features are implemented in the underlying Prexonite virtual machine instead of the language compiler. Also, Prexonite byte code is not statically typed, which makes a straight translation to CIL impossible without very sophisticated data flow analysis and complete type inference. As I am not familiar with either of these topics, I decided to keep the Prexonite functions untyped. This is where the PValue class comes into play. It encapsulates a dynamically typed piece of data and provides many methods to interact with the contained data via late binding.

In all cases, an implementation of a Prexonite function in CIL must show the exact same behaviour as the original, interpreted implementation. Functions that interact with Prexonite stack frames cannot be compiled to CIL as they are no longer executed on the virtual machine’s stack but on the CLR’s instead. Therefore, CIL implementations must be able to exist alongside interpreted implementations, and do so as transparently as possible. Also, since the Prexonite virtual machine allows for code generation and manipulation at runtime, CIL implementations must be replaceable. This unfortunately also means that function calls inside CIL implementations cannot be statically linked, as the target function might change its implementation strategy (interpreted, CIL) at any moment.

How it’s done

Since the Prexonite to CIL compiler operates on Prexonite byte code, it would not make much sense to use the C# or VB CodeDOM and the corresponding compiler. Instead, System.Reflection.Emit provides the necessary API. Since implementations must be replaceable, dynamic types are not an option, and so-called lightweight functions are used.

The compiler is designed to operate at runtime, invoked by the running program itself. This is because it analyses the whole application to identify functions that are not compatible with compilation to CIL. Such functions are marked with the Meta entry volatile.

The compilation process itself is actually quite straightforward. First, the function is analysed in order to determine the number of temporary variables required, to build up a symbol table and to identify shared (via closures) and non-shared variables. Then the common function header is emitted, including the creation of PVariable objects for shared variables and the initialisation of non-shared variables with PType.Null.

Then, the variables representing arguments are initialised with either PType.Null or the value supplied in the arguments array and finally the special variable args is set to a list of those same arguments if required by the function.

What follows is a huge loop that iterates over every instruction in the function’s code and passes it into a giant switch statement, which translates every Prexonite byte code instruction into a series of CIL op codes.

The CIL implementation of the program in Listing 2 will therefore look like the pseudo CIL in Listing 3.

As you can see, an untyped implementation of this simple program expands into quite some code. Notice that due to the absence of a rotation op code, the implementation requires temporary variables to insert the local stack context in the call to Addition.

ldloc var1
ldc.i4.5
box int32
call IntPType PType::get_Int()
newobj instance void PValue::.ctor(object, PType)
stloc temp1
ldloc sctx
ldloc temp1
call instance class PValue PValue::Addition(StackContext, PValue)
stloc var1

Listing 3: Actual CIL implementation of the program in Listing 2

Note: I have shortened the fully qualified type names for better readability.
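The loop-and-switch translation can be sketched in Python for the four instructions of Listing 2. This is a simplified model that produces text rather than real CIL: translate and its instruction encoding are hypothetical, it allocates a fresh temporary for every addition instead of counting them up front, and it ignores every other Prexonite instruction.

```python
def translate(instructions, local_names):
    """Translate Prexonite-style instructions into lines of pseudo CIL."""
    cil = []
    temp_count = 0
    for opcode, operand in instructions:
        if opcode == "ldloci":
            cil.append(f"ldloc {local_names[operand]}")
        elif opcode == "ldc.int":
            # CIL has short forms for the constants 0 through 8.
            if 0 <= operand <= 8:
                cil.append(f"ldc.i4.{operand}")
            else:
                cil.append(f"ldc.i4 {operand}")
            # Wrap the raw integer in a dynamically typed PValue.
            cil.append("box int32")
            cil.append("call IntPType PType::get_Int()")
            cil.append("newobj instance void PValue::.ctor(object, PType)")
        elif opcode == "add":
            # Without a rotation op code, the right operand is parked in a
            # temporary so the stack context can be pushed underneath it.
            temp_count += 1
            temp = f"temp{temp_count}"
            cil.append(f"stloc {temp}")
            cil.append("ldloc sctx")
            cil.append(f"ldloc {temp}")
            cil.append("call instance class PValue "
                       "PValue::Addition(StackContext, PValue)")
        elif opcode == "stloci":
            cil.append(f"stloc {local_names[operand]}")
        else:
            raise NotImplementedError(opcode)
    return cil


listing_2 = [("ldloci", 1), ("ldc.int", 5), ("add", None), ("stloci", 1)]
listing_3 = translate(listing_2, {1: "var1"})
```

Joining listing_3 with line breaks gives the same ten lines as Listing 3, which shows how four untyped instructions expand once every value has to travel inside a PValue.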

Is it worth the effort?

As with all optimization techniques, we must ask ourselves whether the effort for implementing it is worth the gain in performance (be it memory or speed). At this point, let me just throw the results of an amateurish micro benchmark at you.

CIL micro benchmark (figure)

One can clearly see that CIL implementations are superior. They perform the same tasks in 60% (empty_loop) to 30% (rec_echo x 100) of the time required by the interpreted versions. Since the CIL compiler performs many of the Meta data lookups required for the creation of a stack frame at compile time, function calls to CIL implementations are much faster. Keep in mind though that only interpreted functions can take advantage of tail calls. To prevent an overflow of the managed stack, you should implement infinite recursive loops in interpreted functions.

Overall, you could say that compilation to CIL will result in a free performance improvement of over 65 percent in most cases.

function rec_echo(n) =
    if(n == 0)
        0
    else
        1 + rec_echo(n-1)
;
function rec_echo_direct(n,r) =
    if(n == 0)
        r
    else
        rec_echo_direct(n-1,(r??0)+1)
;

A functional touch

Over the last few days, I've been working on two things: the reorganization of built-in commands and the improvement of the "Functional Experience".

Why do commands need reordering? Because it gets difficult to find the right file among over 40 commands.
Why the sudden increase in numbers? I added proxies for System.Math methods for both easy and fast access to mathematical functions such as Sqrt and Sin, but also Pi.
Additionally, the most important coroutines from the Prexonite Standard Repository for list processing have been implemented in managed code, again for performance reasons. Map, Where, Limit, Skip and friends now inject managed coroutines into the stack.
The commands are now organized in the namespaces Core, List, Math and Text. The latter currently contains the fixed layout functions SetCenter, SetLeft and SetRight, which fill a given string with some character sequence until it has a certain length and is aligned correctly.

Now what the hell do you mean by "Functional Experience"?

I haven't told anyone but the Prexonite VM is absolutely terrible when it comes to recursion. Unfortunately, recursion happens to be one of the key elements in functional programming and, as you might have noticed, Prexonite Script comes with a lot of syntactic sugar that makes it look like a functional programming language.
OK, lambda expressions and closures are "true" functional features, but the lack of a sophisticated type system makes it almost impossible to reason about a program in the way functional compilers do. Nonetheless, I added two features with the last commit that make PXS a tiny little bit more functional.

First of all: Tail Call Optimization

Yes, the thing that helps with recursion.

Prexonite Script:
function fac(n,r) =
    if(n == 1)
        r
    else
        fac(n-1,n*r);

I benchmarked this function three times, with different tail call optimization strategies. The difference is huge. See for yourself (10'000 computations of 16!):

Comparison of different tail call optimization strategies.

Two strategies are employed. The first is an implementation of tail call optimization for directly recursive functions inside the compiler, which turns recursive calls into direct iterations (jumps to the beginning of the function with different arguments). The second, which I call "virtual machine optimization", is a special tail call instruction that removes the current stack frame after having called the function or closure.
Now apparently the virtual machine "optimization" is not particularly fast but uses far less memory than the normal invocation.
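The compiler strategy can be illustrated by writing both forms of fac by hand. The Python sketch below shows what the transformation amounts to, not what the Prexonite compiler emits; fac_recursive and fac_iterative are names chosen for the example, and Python itself performs no tail call optimization.

```python
def fac_recursive(n, r=1):
    """Tail-recursive factorial: the recursive call is the last action."""
    if n == 1:
        return r
    return fac_recursive(n - 1, n * r)


def fac_iterative(n, r=1):
    """The same function after turning the tail call into a jump.

    Instead of calling itself, the function rebinds its arguments and
    starts over, so the stack does not grow with n.
    """
    while True:
        if n == 1:
            return r
        n, r = n - 1, n * r  # "jump to the beginning with different arguments"


recursive_result = fac_recursive(16)
iterative_result = fac_iterative(16)
```

Both versions compute the same value, but only the recursive one needs a new stack frame per step.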

Prexonite will never be able to recognize indirect recursion due to the lack of control flow analysis. This, however, does not mean that return statements inside conditions or calls in tail position are not recognized. I'm not sure if Prexonite will ever handle simple recursive return expressions like the normal definition of the factorial:

Prexonite Script:
function fac n =
    if (n==1)
        1
    else
        n*fac(n-1);

Also in the repository is an experimental and partial implementation of the famous call-with-current-continuation from Scheme. In PXS it is known as call\cc.

I must admit that I don't really know much about call/cc and how it works, especially regarding the stack. Creating a callable object from the current state of a function invocation is no problem. I just don't understand some of the Scheme samples I've been looking at (terribly difficult to read...).

The following snippet stores a continuation of the function two in the global variable plusone. Invoking this continuation with, say, 6 returns 7 as the name suggests.

Prexonite Script:
var plusone;

function two =
    1 + call\cc(->one);

function one(continuation)
{
    plusone = continuation;
    return 1;
}
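Python has no call/cc, but the behaviour of this snippet can be modelled in continuation-passing style, where every function receives "the rest of the computation" as an explicit argument k. This is a sketch of the concept and not of the Prexonite implementation; call_cc and the extra k parameters are assumptions made for the illustration.

```python
plusone = None


def call_cc(func, k):
    """Hand the current continuation k to func, which also returns through k."""
    return func(k, k)


def one(continuation, k):
    """Store the continuation in a global variable, then return 1."""
    global plusone
    plusone = continuation
    return k(1)


def two(k):
    """Model of: 1 + call/cc(->one)

    The continuation of the call adds 1 to whatever value it receives
    and then returns from two.
    """
    return call_cc(one, lambda value: k(1 + value))


# Run two with the identity function as the top-level continuation.
result = two(lambda value: value)

# Re-entering the stored continuation with 6 finishes "1 + ..." once more.
later = plusone(6)
```

The stored continuation is just the pending addition followed by the return from two, which is why invoking it with 6 yields 7.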

The Philosophy Behind: The Prexonite Type System

This is the second article in the "Philosophy Behind"-series, picking up a specialty of one of my projects and explaining how it came to be made. Last time I wrote about the "auto dereferencing" concept in Prexonite Script.

In today's article I will explain the reasons behind the design of the Prexonite type system.
Prexonite faces the same problem as other implementations of late-bound languages on the .NET platform: how to map the CTS to the language's type system.

Prexonite type system (figure)

I think the basic types Int32, Double, Boolean and String are more suited for a statically typed environment, so my type system must allow me to provide wrappers around third-party classes/structs.
Wrapping and unwrapping objects must be as transparent as possible. Return values from base class library methods have to be wrapped in their Prexonite equivalent.

At the same time, it is not practical to write a custom wrapper for every possible C# or VB.NET library, so there must be some sort of universal wrapper for CLR objects. With users of Prexonite being able to write their own wrappers, it must be possible to have multiple wrappers for the same CTS type. Also, some wrappers might handle more than one type.

The solution for Prexonite is the abstract class PType and some concrete subclasses, including the universal ObjectPType, which does all the late binding. Since Prexonite Script performs type checks at runtime, type information has to be associated with every data object, which is just what the class PValue does.
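The division of labour between PType and PValue can be sketched in Python. The class names follow the article, but everything else is an assumption for illustration: dynamic_call, call, wrap and the "add" member are hypothetical, and Python's getattr stands in for the late binding that ObjectPType performs on CLR objects.

```python
from abc import ABC, abstractmethod


class PType(ABC):
    """A type in the scripting language; it decides how calls are bound."""

    @abstractmethod
    def dynamic_call(self, subject, member, args):
        """Invoke member on the wrapped value of subject and return a PValue."""


class ObjectPType(PType):
    """Universal wrapper: binds members by name at run time."""

    def dynamic_call(self, subject, member, args):
        raw_args = [arg.value for arg in args]            # unwrap arguments
        result = getattr(subject.value, member)(*raw_args)
        return wrap(result)                               # wrap the result


class IntPType(PType):
    """Specialised wrapper that knows its members without reflection."""

    def dynamic_call(self, subject, member, args):
        if member == "add":
            return PValue(subject.value + args[0].value, self)
        raise AttributeError(member)


class PValue:
    """Associates a piece of data with its run-time type information."""

    def __init__(self, value, ptype):
        self.value = value
        self.ptype = ptype

    def call(self, member, *args):
        return self.ptype.dynamic_call(self, member, args)


OBJECT = ObjectPType()
INT = IntPType()


def wrap(value):
    """Choose a wrapper for a raw host value (bool is deliberately excluded)."""
    if isinstance(value, int) and not isinstance(value, bool):
        return PValue(value, INT)
    return PValue(value, OBJECT)


text = PValue("prexonite", OBJECT)
upper = text.call("upper")                        # late-bound host method
count = text.call("count", PValue("e", OBJECT))   # result is wrapped as Int
total = count.call("add", PValue(3, INT))         # handled by IntPType
```

Since the type travels with the value, the same host object could be paired with a different PType, which is how several wrappers for one CTS type can coexist.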

What might surprise you is the fact that Null is considered a type. Every null reference automatically has type Null. Unlike the sturdy null references in C#, instances of Prexonite Null are completely functional objects. They react to operators, can be converted to basic values (Int, String, ...) and even provide a ToString method. However, Null does have a special position in the Prexonite type system: it is not possible to write and use your own null reference wrapper.