Threaded Code (and Composite)

Let’s look at the threaded-code approach. It’s a less-well-known way to build an interpreter for a programming language. (The more common approach uses a central loop that decodes and simulates each instruction.) Threaded code has a high-level analog, and some extra tricks when done in assembly language.

Core Idea

A threaded-code interpreter assembles and runs pipelines of functions. It’s most often used for programming languages, but to see the idea, consider an image transformation tool.

We have primaries that are the basic image transforms: scale, rotate, change contrast, etc. We may also build secondaries out of primaries and other secondaries. For example, when I post a video on YouTube, it lets me add a thumbnail image. To create one, I take a screenshot, then send it through a script that scales, increases contrast, and reduces the JPEG quality. This saves me from opening GIMP and making those changes manually.

To use a threaded-code style approach, build a Composite-like structure where primaries are the leaves, and secondaries are compound objects listing other primaries and secondaries.

In Swift, I’d use a recursive enum; for Java, C#, etc., I’d use subclasses. It would look something like this:

indirect enum Transforms {
  case primary1(Int)    // its control parameters (placeholder type)
  case primary2
  // …
  case primaryN
  case secondary([Transforms])
}

We’d then create a method that runs primaries directly, and handles secondaries by walking through the list of Transforms and applying them in order. The result of the final transform becomes the result of the secondary.

Primaries convert an input image to an output image. Secondaries have a list of primaries and secondaries to apply, and also convert an input image to an output image.

Aside: Programming Languages

We don’t always think of programming this way, but a step in a program can be looked at as a transform:

A program construct converts a "before" state to an "after" state.

Because programming languages are so general, we should expect the tree-style approach to work in many situations, including images (2D and 3D), sound editing, text styles, text transformation, build pipelines, SQL queries, data preparation, search, and more.

Threaded Code: The Assembly Language Twist

[You’ll be forgiven if you skip this section:)]

I’ve been aware of the threaded-code technique for a while, but I’d never implemented it in assembly language. Lately, I’ve been doing just that: implementing a Forth-like programming language. (See the Bibliography.) Note: I haven’t yet applied the twist (an optimization) I describe below, but it’s in my plans:)

Setup

To treat primaries and secondaries uniformly, we’re going to give each a block with a one-word header, containing the address of the routine that should interpret it. The address of the header is called the word address (since this programming language defines “words”).

For primaries, the header points to the primary’s code:

Primary block: a header with a pointer to the code for this primary, and a separate block of the primary code itself.

All secondaries will point to a single routine I’ve called start2d. Start2d knows how to walk through the list of addresses in its block.

A secondary: a header pointing to the start2d routine, and a list of word addresses of other primaries and secondaries, ending with end2d.

Code

Let’s look at the code for DUP (duplicate the top of the data stack), start2d, and end2d.

DUP:

dup:
  DATA_TOP x0    // Load top of data stack to x0
  DATA_PUSH x0   // Push contents of x0 onto data stack
  ret            // Return

This code looks at the top of the stack, then pushes a copy and returns. Its header lives elsewhere.

Start2d:

start2d:
  str lr, [sp, #-16]!	// push LR and...
  str x20, [sp, #8]	// VPC to system stack

  add x20, x0, #8	// start VPC just past start2d call

L_interpreter_loop:
  ldr x0, [x20], #8	// Load word address, increment VPC
  ldr x1, [x0]		// Load address of code
  blr x1		// Call method
  b L_interpreter_loop // Repeat

The first two str (store) instructions save call information to the system stack. The add instruction sets X20, the VPC (virtual program counter), which keeps track of which row we’re on.

Then we hit the main loop. The first ldr (load register) instruction reads the current row: a word address, i.e., a pointer to a header. The second ldr reads the header itself, yielding the address of the code. Then blr (branch with link to register) calls the routine the register points to. When that routine returns, b (branch) takes us to the top of the loop.

End2d:

end2d:
  ldr x20, [sp, #8]	// Restore VPC and...
  ldr lr, [sp], #16	//   LR from system stack
  ret			// Return to start2d's caller

We exit the interpreter’s loop by running this routine. It first restores the information we pushed on the stack at the start. The key is that we now have the lr (link register) that was in effect when we first called start2d. When we execute ret, we return to the original caller.

The way we get here is a little convoluted. It works because we require every secondary to have an end2d at the end of its list.
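To make the mechanics concrete, here’s a hypothetical C analog of this layout. Function pointers stand in for the code addresses, and a NULL entry plays the role of end2d; real threaded code works on raw machine addresses, and end2d unwinds the return stack rather than terminating a C loop:

```c
#include <stdio.h>

/* Each word block begins with a "header": a pointer to the routine that
   interprets it. Secondaries also carry a list of word addresses. */
typedef struct Word Word;
struct Word {
    void (*code)(const Word *self);  /* header: the interpreting routine */
    const Word *const *body;         /* secondaries only: list of words */
};

static long stack_[32];
static int  sp_ = 0;
static void push(long v) { stack_[sp_++] = v; }
static long pop(void)    { return stack_[--sp_]; }

/* Primaries: the header points at code that acts directly. */
static void dup_code(const Word *self) { (void)self; long a = pop(); push(a); push(a); }
static void one_code(const Word *self) { (void)self; push(1); }
static void sub_code(const Word *self) { (void)self; long b = pop(); push(pop() - b); }

/* All secondaries share one routine: walk the body, dispatching each
   word through its header. NULL marks the end (end2d's role). */
static void start2d(const Word *self) {
    for (const Word *const *vpc = self->body; *vpc; vpc++)
        (*vpc)->code(*vpc);          /* load header, call its code */
}

static const Word DUP = { dup_code, NULL };
static const Word ONE = { one_code, NULL };
static const Word SUB = { sub_code, NULL };

/* A secondary for DUP 1 SUB, i.e. "push a-1". */
static const Word *const decr_body[] = { &DUP, &ONE, &SUB, NULL };
static const Word DECR = { start2d, decr_body };

int main(void) {
    push(10);
    DECR.code(&DECR);                /* dispatch through the header */
    long top = pop(), next = pop();
    printf("%ld %ld\n", next, top);  /* 10 9: a-1 pushed above a */
    return 0;
}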

A Trace

Consider the code sequence DUP 1 SUB. This duplicates the top item, pushes 1, and subtracts; thus it’s an idiom for “push a-1 on the stack”, where a is the value at the top.

If you trace what’s happening (key moments only), you’ll see this:

:
L_interpreter_loop:
:
blr   // call DUP
:
ret   // from DUP
b L_interpreter_loop
L_interpreter_loop:
:
blr   // call "1" routine
:
ret   // from "1"
b L_interpreter_loop
L_interpreter_loop:
:
blr   // call SUB
:
ret   // from SUB
b L_interpreter_loop
L_interpreter_loop:
:
blr   // call end2d
:
ret   // from end2d, back to start2d's caller

This dynamic view shows a recurring pattern:

ret
b L_interpreter_loop

This pattern isn’t as obvious statically.

The Twist

The unexpected twist is that we can eliminate procedure calls from our core loop.

The ret (return) at the end of a primary always returns to the same place: the “b L_interpreter_loop” instruction. So, let’s make the last instruction of each primary be “b L_interpreter_loop”. This cuts out the return instruction and an extra jump.

Since we no longer want primaries to return, we shouldn’t use blr in start2d, as it’s for calling a routine. Instead, we can use br (branch to register) to jump where the register says. This might be a little faster, and it’s certainly less work.

This dispatch is in the critical path of the start2d loop, so it saves time on every word that every secondary executes.

What Have We Lost?

I find this approach interesting because it takes us back to a world without routines in the usual sense. Most programming languages use them implicitly, whether they call them subroutines, procedures, functions, or methods.

Normal call-return is only one model; the threaded-code approach uses indirect branches instead.

Another alternative model is coroutines. Instead of starting at the start of the routine every time, a coroutine picks up where it last left off. C# and Kotlin both have such an ability, but I’ve only seen people use coroutines to build generators (returning the next value in a sequence according to some systematic enumeration). I haven’t yet seen them used as true mutual coroutines. I’d like to explore this further.

Other alternative models are possible: Lisp, Prolog, and statically typed functional languages clearly have somewhat different models in mind, but they’re usually shoehorned into the procedural mold.

Conclusion

We explored the idea of threaded code, whether implemented as a tree structure in a high-level language, or via indirect jumps in assembly language. We looked at an optimization twist for the assembly implementation, and briefly considered some unusual alternatives to the procedure model.

Next time you have a situation where you have transformations and groups of transformations, try the threaded-code approach and see how it fits!

Bibliography

“Assembler”, YouTube playlist. Retrieved 2024-03-03. These are videos from the sessions where we’re exploring some of these ideas.

“Threaded code”, Wikipedia. Retrieved 2024-03-03.