<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Retrospective on Andy Rambles</title>
    <link>http://www.andy-rambles.com/tags/retrospective/index.xml</link>
    <description>Recent content in Retrospective on Andy Rambles</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en-us</language>
    <atom:link href="/tags/retrospective/index.xml" rel="self" type="application/rss+xml" />
    
      
        
          <item>
            <title>Making PISC faster: Exercises in de-stupiding code</title>
            <link>http://www.andy-rambles.com/post/PISC-Perf-2017/</link>
            <pubDate>Wed, 03 Jan 2018 00:00:00 +0000</pubDate>
            
            <guid>http://www.andy-rambles.com/post/PISC-Perf-2017/</guid>
            <description>

&lt;p&gt;I&amp;rsquo;ve been reflecting on the various bits of side-project work I&amp;rsquo;ve done in 2017. One of the most memorable side-projects was a set of optimizations to make PISC not &lt;em&gt;completely&lt;/em&gt; slow. It&amp;rsquo;s still not properly fast, since PISC is currently implemented as the stack-based equivalent to an AST-walking interpreter. However, it is now at a reasonable speed, rather than an incompetent one. The inspiration for my first pass at benchmarking PISC came from &lt;a href=&#34;https://youtu.be/U3upi-y2pCk?t=1594&#34;&gt;this talk&lt;/a&gt; about implementing a compiler for K-Lambda Scheme using Rust. There was a specific benchmark from the talk that stuck out to me:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;(defun build (counter list)
    (cond 
        ((= 0 counter) list)
        ( true (build ( - counter 1)
             (cons counter list)))))
(build 100000 ())
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In the talk, Siram noted that the Rust code compiled from the K-lambda version of that code took 7 minutes to run. I was very curious to see if PISC would do any better, especially since it can lean on Go&amp;rsquo;s runtime, rather than performing the rather large amount of copying done in the Rust runtime. I translated this into two PISC programs: one testing allocation/iteration speed, the other testing the current limits of tail recursion (something that PISC doesn&amp;rsquo;t currently optimize for).&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;: build ( counter list -- list )
    :list :counter
    $counter 0 = 
        [ $list ] 
        [ $counter 1 - { $counter $list } build ] 
    if ;


: build-loop ( counter list -- list )
    :list :counter
        [ $counter 0 = not ] 
        [ { $counter $list } :list $counter 1 - :counter ] 
    while ;

: cons-test ( -- ) 
        # &amp;quot;Code took&amp;quot; is prefixed by time ATM.
        ${ [ 100000 &amp;lt;vector&amp;gt; build-loop ] time &amp;quot; for loop version&amp;quot; } print
        ${ [ 100000 &amp;lt;vector&amp;gt; build drop ] time &amp;quot; for recursive version&amp;quot; } print
;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;When I started out, this code was taking upwards of 1.5 minutes to run in the tail-recursive form, and over 30 seconds in the looped version. While still faster than the 7 minutes for the K-lambda version, that seemed painfully slow for a mere 100,000 list allocations, especially when both Lua and various Scheme implementations could do it in subsecond times. After a late night of benchmarking and looking for hot spots, the looped version now runs subsecond, and the tail-recursive version runs in 3-10 seconds. &lt;a href=&#34;#tail-recursive-aside&#34;&gt;(Note 1)&lt;/a&gt; &lt;span id=&#34;tail-recursion-anchor&#34;&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;The improvements came from three changes, each focused on reducing the repeated effort of resolving a string to a piece of executable code.&lt;/p&gt;

&lt;p&gt;The first major overhead was the use of regular expressions to differentiate integers and doubles from other tokens. A lot of time was being spent evaluating regexes because, as things were set up, a regex would be run over each token every time the relevant piece of code was evaluated. For &lt;code&gt;0 100 [ 1 + ] times&lt;/code&gt;, this meant both &lt;code&gt;1&lt;/code&gt; and &lt;code&gt;+&lt;/code&gt; had a regex run over them 100 times. Talk about interpretation overhead! Cleaning this up led to about a 40% runtime reduction for both versions of the code.&lt;/p&gt;
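
&lt;p&gt;As a sketch of the idea (in Go, since that&amp;rsquo;s the language PISC is implemented in; the names here are illustrative, not PISC&amp;rsquo;s actual internals): classify each token with the regex once, at parse time, so re-executing the same code never touches the regex again.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// intPattern distinguishes integer literals from other tokens.
var intPattern = regexp.MustCompile(`^-?\d+$`)

// token carries its classification, decided once at parse time.
type token struct {
	text  string
	isInt bool
	value int
}

// classify runs the regex exactly once per token, instead of once
// per execution of the code that contains the token.
func classify(words []string) []token {
	out := make([]token, 0, len(words))
	for _, w := range words {
		t := token{text: w}
		if intPattern.MatchString(w) {
			t.isInt = true
			t.value, _ = strconv.Atoi(w)
		}
		out = append(out, t)
	}
	return out
}

func main() {
	for _, t := range classify([]string{"1", "+", "42"}) {
		fmt.Println(t.text, t.isInt)
	}
}
```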

&lt;p&gt;The second overhead was by far the biggest, and still isn&amp;rsquo;t 100% removed: PISC has to perform name resolution on each token it parses. This process currently involves 14 string comparisons and 4 hashtable lookups. Before this optimization, name resolution was repeated &lt;em&gt;every&lt;/em&gt; time a token was executed. Caching the results of this lookup into function pointers doubled the speed of the PISC interpreter in the list-building benchmarks. I haven&amp;rsquo;t been able to get this 100% completed: some mutability and lack-of-sleep related difficulties got in the way of caching the right data for avoiding iterating over quotations and comments when a quotation is re-executed. This has yet to come up as a hotspot in practice, though I could see it happening. &lt;a href=&#34;#second-overhead-aside&#34;&gt;(Note 2)&lt;/a&gt; &lt;span id=&#34;second-overhead-anchor&#34;&gt;&lt;/span&gt;&lt;/p&gt;
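
&lt;p&gt;A minimal sketch of the caching scheme (hypothetical Go, not PISC&amp;rsquo;s real types): the first execution of a token pays for the dictionary lookup, and the resolved function pointer is memoized on the token so later executions call it directly.&lt;/p&gt;

```go
package main

import "fmt"

// state is a toy interpreter state: a data stack plus a dictionary
// mapping word names to their implementations.
type state struct {
	stack []int
	dict  map[string]func(*state)
}

// word memoizes its resolved implementation after the first lookup.
type word struct {
	name     string
	resolved func(*state) // nil until first execution
}

func (w *word) execute(st *state) {
	if w.resolved == nil {
		w.resolved = st.dict[w.name] // slow path: runs once per token
	}
	w.resolved(st) // fast path: a direct call on every re-execution
}

func main() {
	st := new(state)
	st.dict = map[string]func(*state){
		"+": func(s *state) {
			n := len(s.stack)
			s.stack = append(s.stack[:n-2], s.stack[n-2]+s.stack[n-1])
		},
	}
	st.stack = []int{1, 2, 3}
	prog := []word{{name: "+"}, {name: "+"}}
	for i := range prog {
		prog[i].execute(st)
	}
	fmt.Println(st.stack) // prints [6]
}
```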

&lt;p&gt;The third overhead affected function calls and tail-recursion a fair amount: earlier versions of PISC used &lt;code&gt;defer&lt;/code&gt; to reset the state of the code quotation being executed after it was finished. This happened every time a PISC definition/quotation was executed. This meant that the cost of &lt;code&gt;defer&lt;/code&gt; was getting added to &lt;em&gt;every&lt;/em&gt; function call, which was vigorously stressed by tail-recursive code. Using the classic process of splitting a recursive call into recursive and initial forms allowed me to eliminate the use of defer here, which also cut down greatly on the costs of function calls.&lt;/p&gt;
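
&lt;p&gt;The split looks roughly like this (an illustrative Go sketch, not PISC&amp;rsquo;s actual code): the entry point carries the &lt;code&gt;defer&lt;/code&gt;-based cleanup exactly once, while the inner recursive form has none, so the hot path of each call stays cheap.&lt;/p&gt;

```go
package main

import "fmt"

// interp holds a bit of state that must be reset after a top-level run.
type interp struct {
	depth int
}

// run is the initial form: it pays the cost of defer exactly once.
func (it *interp) run(n int) int {
	defer func() { it.depth = 0 }() // cleanup, paid once per top-level call
	return it.runInner(n, 0)
}

// runInner is the recursive form: no defer on the hot path.
func (it *interp) runInner(n, acc int) int {
	it.depth++
	if n == 0 {
		return acc
	}
	return it.runInner(n-1, acc+n)
}

func main() {
	it := new(interp)
	fmt.Println(it.run(100)) // prints 5050, the sum 1..100
}
```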

&lt;p&gt;Other small improvements have included converting heavily used PISC words from having PISC-heavy definitions to doing most of their work in Go. This included &lt;code&gt;++&lt;/code&gt; and &lt;code&gt;--&lt;/code&gt;, as well as the various dictionary utility words, such as &lt;code&gt;-&amp;gt;&lt;/code&gt; and &lt;code&gt;&amp;lt;-&lt;/code&gt;.&lt;/p&gt;
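
&lt;p&gt;For illustration (hypothetical Go, not PISC&amp;rsquo;s actual builtin): a word like &lt;code&gt;++&lt;/code&gt; implemented in the host language can mutate the top of the stack directly, instead of being interpreted as a PISC-level definition along the lines of &lt;code&gt;: ++ 1 + ;&lt;/code&gt;.&lt;/p&gt;

```go
package main

import "fmt"

// builtinIncrement is a sketch of a host-language builtin: it bumps
// the top of the stack in place, with no interpretation overhead.
func builtinIncrement(stack []int) []int {
	n := len(stack)
	stack[n-1]++
	return stack
}

func main() {
	fmt.Println(builtinIncrement([]int{1, 2, 41})) // prints [1 2 42]
}
```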

&lt;p&gt;Overall, this puts PISC within an order of magnitude of CPython (though Python still handles recursive calls better, and is far more mature), and within spitting distance of Go-based Lua implementations, though PISC still doesn&amp;rsquo;t hold a candle to C-based Lua. As things currently stand, I&amp;rsquo;m happy with where the performance is for now; PISC needs improvement in other areas (like some kind of module system that isn&amp;rsquo;t just the PISC version of &lt;code&gt;#include&lt;/code&gt;). At some point, I&amp;rsquo;ll probably start comparing it to Tcl, or get a burr in my saddle and sit down to finish the work I started on removing the evaluation of comments from the AST, as well as finishing out as much reduction in name resolution as I can get away with.&lt;/p&gt;

&lt;h2 id=&#34;notes&#34;&gt;Notes&lt;/h2&gt;

&lt;p&gt;&lt;div id=&#34;tail-recursive-aside&#34;&gt;&lt;/div&gt;&lt;/p&gt;

&lt;h4 id=&#34;a-href-tail-recursion-anchor-1-recursion-in-go-a&#34;&gt;&lt;a href=&#34;#tail-recursion-anchor&#34;&gt;1) Recursion in Go&lt;/a&gt;&lt;/h4&gt;

&lt;p&gt;I discovered an interesting aspect of the Go runtime in the tail-recursive list builder. Apparently, on 64-bit systems, Go allows a goroutine&amp;rsquo;s stack to grow to up to 1GB of memory. This limit is closer to 250MB on 32-bit machines, but it makes for an interesting demonstration of the unique characteristics of the Go runtime.&lt;/p&gt;
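
&lt;p&gt;A quick way to see this in action (plain Go, unrelated to PISC itself): a recursive function a million calls deep runs fine on a 64-bit machine, because the goroutine stack grows on demand up to that limit.&lt;/p&gt;

```go
package main

import "fmt"

// depth recurses n levels and reports how deep it went. There is no
// tail-call optimization in Go, so each call really does use a frame.
func depth(n int) int {
	if n == 0 {
		return 0
	}
	return 1 + depth(n-1)
}

func main() {
	// A million stack frames, handled by Go's growable goroutine stacks.
	fmt.Println(depth(1000000)) // prints 1000000
}
```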

&lt;p&gt;&lt;div id=&#34;second-overhead-aside&#34;&gt;&lt;/div&gt;&lt;/p&gt;

&lt;h4 id=&#34;a-href-second-overhead-anchor-2-problems-in-handling-quotations-and-comments-a&#34;&gt;&lt;a href=&#34;#second-overhead-anchor&#34;&gt;2) Problems in handling quotations and comments&lt;/a&gt;&lt;/h4&gt;

&lt;p&gt;There is currently an aspect of how PISC handles quotation-like words and comments that I don&amp;rsquo;t care for, but haven&amp;rsquo;t found the time to fix yet. (Unless someone volunteers a patch, which is &lt;em&gt;very&lt;/em&gt; unlikely, though very welcome, it&amp;rsquo;s something I&amp;rsquo;ll be taking care of in 2018.)&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;20 [ ${12 43} /* Comment here. */ ] times 
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This code will copy the &lt;code&gt;${&lt;/code&gt;, &lt;code&gt;12&lt;/code&gt;, &lt;code&gt;43&lt;/code&gt; and &lt;code&gt;}&lt;/code&gt; tokens into a new code quotation for &lt;em&gt;each&lt;/em&gt; iteration of the loop body, as well as having to iterate over the tokens in the comment each time. To be fair, this code is already allocation heavy, so it&amp;rsquo;s going to be a poor idea for a hot loop, but I&amp;rsquo;d love to clean this up.&lt;/p&gt;
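
&lt;p&gt;One way to clean this up (an illustrative Go sketch, not PISC&amp;rsquo;s current code): parse the quotation once, strip the comment tokens at parse time, and share the parsed body by reference across loop iterations instead of copying it.&lt;/p&gt;

```go
package main

import "fmt"

// quotation holds a parsed body whose tokens are shared, not copied.
type quotation struct {
	tokens []string
}

// parseOnce builds the quotation a single time, dropping comment
// tokens so they are never iterated over again during execution.
func parseOnce(raw []string) *quotation {
	q := new(quotation)
	for _, t := range raw {
		if t != "/* Comment here. */" {
			q.tokens = append(q.tokens, t)
		}
	}
	return q
}

// times executes the quotation body n times; each iteration reuses
// the same slice header rather than copying the tokens.
func times(n int, q *quotation, exec func([]string)) {
	for i := 0; i != n; i++ {
		exec(q.tokens)
	}
}

func main() {
	q := parseOnce([]string{"${", "12", "43", "}", "/* Comment here. */"})
	count := 0
	times(20, q, func(toks []string) { count += len(toks) })
	fmt.Println(count) // prints 80: 20 iterations of 4 surviving tokens
}
```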
</description>
          </item>
        
      
    
  </channel>
</rss>
