Making PISC faster: Exercises in de-stupiding code

Wed, Jan 3, 2018 in PISC, Performance using tags 2017, Retrospective

I’ve been reflecting on the various bits of side-project work I’ve done in 2017. One of the most memorable side-projects was a set of optimizations to make PISC not completely slow. It’s still not properly fast, since PISC is currently implemented as the stack-based equivalent to an AST-walking interpreter. However, it is now at a reasonable speed, rather than an incompetent one. The inspiration for my first pass at benchmarking PISC came from this talk about implementing a compiler for K-Lambda Scheme using Rust. There was a specific benchmark from the talk that stuck out to me:

(defun build (counter list)
    (cond
        ((= 0 counter) list)
        (true (build (- counter 1)
                     (cons counter list)))))
(build 100000 ())

In the talk, Siram noted that the K-Lambda-compiled-to-Rust version of that code took 7 minutes to run. I was very curious to see whether PISC would do any better, especially since it can lean on Go’s runtime rather than performing the rather large amount of copying done in the Rust runtime. I translated this into two PISC programs: one testing allocation/iteration speed, the other testing the current limits of tail recursion (something that PISC doesn’t currently optimize for).

: build ( counter list -- list )
    :list :counter
    $counter 0 = 
        [ $list ] 
        [ $counter 1 - { $counter $list } build ] 
    if ;


: build-loop ( counter list -- list )
    :list :counter
        [ $counter 0 = not ] 
        [ { $counter $list } :list $counter 1 - :counter ] 
    while ;

: cons-test ( -- ) 
        # "Code took" is prefixed by time ATM.
        ${ [ 100000 <vector> build-loop ] time " for loop version" } print
        ${ [ 100000 <vector> build drop ] time " for recursive version" } print
;

When I started out, this code was taking upwards of 1.5 minutes to run in the tail-recursive form, and more than 30 seconds for the looped version. While still faster than the 7 minutes for the K-Lambda version, this seemed painfully slow for a mere 100,000 list allocations, especially when both Lua and various Scheme implementations could do it in subsecond times. After a late night of benchmarking and looking for hot spots, the looped version now runs subsecond, and the tail-recursive version runs in 3-10 seconds. (Note 1)

The improvements came from three changes, each focused on reducing the repeated effort of resolving a string to a piece of executed code.

The first major overhead was in using regular expressions to differentiate integers and doubles from other tokens. This led to a lot of time being spent evaluating regexes: the way things were set up, a regex would get run over each token every time the relevant piece of code was evaluated. For 0 100 [ 1 + ] times, this meant both 1 and + had a regex run over them 100 times. Talk about interpretation overhead! Cleaning this up led to about a 40% runtime reduction for both versions of the code.
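PISC’s real parser is more involved, but the shape of the fix looks roughly like this Go sketch (tokenValue and parseToken are illustrative names, not PISC’s actual internals): classify each token once at parse time and cache the result, so re-executing a quotation never re-parses anything.

```go
package main

import (
	"fmt"
	"strconv"
)

// tokenValue caches a token's classification the first time it is
// parsed, so evaluation never re-runs a regex (or any parsing) on it.
type tokenValue struct {
	raw   string
	isInt bool
	num   int
}

// parseToken classifies a token once, up front. strconv.Atoi is far
// cheaper than a regex match, and it runs once per token at parse
// time instead of once per execution.
func parseToken(raw string) tokenValue {
	if n, err := strconv.Atoi(raw); err == nil {
		return tokenValue{raw: raw, isInt: true, num: n}
	}
	return tokenValue{raw: raw}
}

func main() {
	toks := []tokenValue{parseToken("1"), parseToken("+")}
	// Re-executing the quotation now just reads the cached flags.
	for i := 0; i < 3; i++ {
		fmt.Println(toks[0].num, toks[1].isInt)
	}
}
```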

The second overhead was by far the biggest, and still isn’t 100% removed: PISC has to perform name resolution on each token it parses. This process currently involves 14 string comparisons and 4 hashtable lookups, and before this optimization, it was repeated every time a token was executed. Caching the results of this lookup into function pointers doubled the speed of the PISC interpreter in the list-building benchmarks. I haven’t been able to get this 100% complete: some mutability (and lack-of-sleep) difficulties got in the way of caching the right data to avoid iterating over quotations and comments when a quotation is re-executed. That said, this has yet to come up as a hotspot in practice, though I could see it happening. (Note 2)
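As a rough illustration (with made-up names, not PISC’s real types), caching a resolved word into a function pointer looks something like this: the expensive resolution path runs on the first execution of a token, and every later execution goes straight through the cached pointer.

```go
package main

import "fmt"

type stack []int

// wordFunc is the resolved implementation of a word.
type wordFunc func(*stack)

// token caches its resolved word after the first lookup, so the
// string comparisons and table lookups happen once, not per call.
type token struct {
	name   string
	cached wordFunc
}

var dictionary = map[string]wordFunc{
	"+": func(s *stack) {
		n := len(*s)
		(*s)[n-2] += (*s)[n-1]
		*s = (*s)[:n-1]
	},
}

// resolve stands in for the expensive resolution path; after the
// first call, the cached function pointer is returned directly.
func (t *token) resolve() wordFunc {
	if t.cached == nil {
		t.cached = dictionary[t.name]
	}
	return t.cached
}

func main() {
	s := stack{0}
	plus := token{name: "+"}
	for i := 0; i < 100; i++ {
		s = append(s, 1)
		plus.resolve()(&s) // hits the cache after the first call
	}
	fmt.Println(s[0]) // 100
}
```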

The third overhead affected function calls and tail recursion a fair amount: earlier versions of PISC used defer to reset the state of the code quotation being executed after it finished. This happened every time a PISC definition/quotation was executed, meaning the cost of defer was added to every function call, something that tail-recursive code stresses vigorously. Using the classic trick of splitting a recursive function into an initial form and a recursive form let me eliminate the use of defer here, which also cut down greatly on the cost of function calls.
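The initial/recursive split can be sketched like this (hypothetical names; the real interpreter loop does far more work). The entry point saves and restores the quotation’s position exactly once, and the recursive form pays no defer cost per call:

```go
package main

import "fmt"

type codeQuotation struct {
	idx int
	// ...tokens elided
}

type machine struct{ calls int }

// execute is the initial form: it saves and restores the quotation's
// position exactly once, with no defer on the hot path.
func (m *machine) execute(q *codeQuotation) {
	saved := q.idx
	m.executeInner(q, 0)
	q.idx = saved
}

// executeInner is the recursive form: it can call itself (e.g. for a
// tail call) without adding a defer to every invocation.
func (m *machine) executeInner(q *codeQuotation, depth int) {
	m.calls++
	if depth < 3 {
		m.executeInner(q, depth+1)
	}
}

func main() {
	m := &machine{}
	q := &codeQuotation{idx: 7}
	m.execute(q)
	fmt.Println(m.calls, q.idx)
}
```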

Other small improvements included converting heavily used PISC words from PISC-heavy definitions to doing most of their work in Go. This included ++ and --, as well as the various dictionary utility words, such as -> and <-.
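As a toy illustration of the difference (builtins is a made-up name, not PISC’s real registry): a word backed by a Go function does all of its work in one native call, instead of having a multi-token PISC definition interpreted step by step.

```go
package main

import "fmt"

type stack []int

// builtins maps word names to native Go implementations. A word like
// ++ runs as a single Go call here, rather than interpreting a PISC
// definition such as ": ++ 1 + ;" one token at a time.
var builtins = map[string]func(*stack){
	"++": func(s *stack) { (*s)[len(*s)-1]++ },
	"--": func(s *stack) { (*s)[len(*s)-1]-- },
}

func main() {
	s := stack{41}
	builtins["++"](&s)
	fmt.Println(s[0]) // 42
}
```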

Overall, this puts PISC within an order of magnitude of CPython (though Python still handles recursive calls better, and is far more mature), and within spitting distance of Go-based Lua implementations, though PISC still doesn’t hold a candle to C-based Lua. As things currently stand, I’m happy with where the performance is for now; PISC needs improvement in other areas (like some kind of module system that isn’t just the PISC version of #include). At some point, I’ll probably start comparing it to Tcl, or get a burr under my saddle and sit down to finish the work I started on removing the evaluation of comments in the AST, as well as finishing out as much reduction in name resolution as I can get away with.

Notes

1) Recursion in Go

I discovered an interesting aspect of the Go runtime while working on the tail-recursive list builder. Apparently, on 64-bit systems, Go allows up to 1 GB of memory to be taken up by stack space. The limit is closer to 250 MB on 32-bit machines, but it’s an interesting demonstration of the unique characteristics of the Go runtime.
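As a quick demonstration (not from the PISC codebase): because goroutine stacks grow on demand up to that limit, Go happily runs recursion depths that would overflow a fixed-size C stack, which is what let the un-optimized tail-recursive benchmark finish at all.

```go
package main

import "fmt"

// sum recurses n levels deep with no tail-call optimization; Go grows
// the goroutine stack on demand (up to roughly 1 GB on 64-bit
// systems), so a million frames still succeed.
func sum(n, acc int) int {
	if n == 0 {
		return acc
	}
	return sum(n-1, acc+n)
}

func main() {
	fmt.Println(sum(1_000_000, 0)) // 500000500000
}
```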

2) Problems in handling quotations and comments

There is currently an aspect of how PISC handles quotation-like words and comments that I don’t care for, but haven’t found the time to fix yet. (Unless someone volunteers a patch, which is very unlikely though very welcome, it’s something I’ll be taking care of in 2018.)

20 [ ${12 43} /* Comment here. */ ] times 

This code will copy the ${, 12, 43, and } tokens into a new code quotation for each iteration of the loop body, as well as iterating over the tokens in the comment each time. To be fair, this code is already allocation-heavy, so it’s going to be a poor idea for a hot loop, but I’d love to clean this up.