The Toy Programming Language

TOY is a concatenative imperative programming language inspired by FORTH and Postscript but with a few added twists. As such, global and local variables play a much lesser role than in most main-stream languages because functions directly and explicitly manage a stack of values. While local variables in most programming languages exist within the stack frame of the function where they are defined and thus are inextricably tied to the lifetime of this stack frame, values on the data stack in TOY can continue to exist after the function that created these values returns.

As an example, consider the three functions +, -, and *. Each of them expects two numbers on the stack, pops them off the stack, and pushes the result of adding, subtracting or multiplying these two numbers onto the stack. As we will see, number literals in our program text have the effect that the corresponding number gets pushed onto the stack. Thus, the arithmetic expression (2 + 3) * (4 - 2) can be evaluated using

 

This is of course exactly reverse Polish notation (not a coincidence). To understand this program, after the fragment 2 3, we have the numbers 2 and 3 on the stack. + pops them off the stack and pushes their sum, 5, onto the stack. The fragment 4 2 pushes 4 and 2 onto the stack, so the stack now contains 5 4 2 - takes 4 and 2 off the stack and pushes their difference, 2, onto the stack. The stack now contains 5 2 * takes these two values off the stack and pushes their product, 10, onto the stack.

Given that most function's behaviour is defined by the effect they have on the stack contents, we adopt the FORTH convention here to annotate every function with the stack content it expects before it is called and the content it leaves behind when it exits. We list the stack content with the topmost element last. As in C++, comments start with // and extend to the end of the line. So the type annotation of the + function would look like this:

 

Note that this is only an annotation for the human reader. The interpreter ignores these annotations because they're simply comments. Also note that this annotation says only that + expects two values on the top of the stack and replaces them with their sum. It says nothing about other elements that are on the stack below a and b. These elements are simply not important to + and are left untouched.

To get a feel for what a TOY program looks like, here's a function to compute the nth Fibonacci number:

 

Let's dissect this. We define a function fibonacci that expects an integer n on the stack and replaces it with the nth Fibonacci number. To do this, it pushes 0 and 1 (the (–1)st and the 0th Fibonacci number) on the stack, thereby creating the stack content n F_{-1} F_0. Now we invoke the function fib, which expects the stack contents n-i F_{i-1} F_i and replaces these three numbers on the top of the stack with F_n. So the current stack content is exactly what fib expects for i = 0. Thus, once fib returns, we have F_n on the stack, as desired.

How does fib work? This is best described using by writing the sequence of steps and the changes to the stack contents that occur:

 

So rotl rotates the top 3 stack elements one position to the left, dup duplicates the top stack element, 0 pushes itself on the stack, = compares the two elements on the top of the stack and pushes the result on the stack, and lambda expressions of the form [ ... ] push references to themselves on the stack. Now, ?? expects a Bool and two function references on the top of the stack and pops them. If the Bool is true, then the first function is executed, otherwise the second function is executed. So let's look at these two cases. In both cases, the stack looks like this immediately before [drop...] or [1...] is executed:

 

If n-i = 0, then i = n and thus F_i = F_n. In this case, the first lambda expression is executed:

 

So we are left with exactly the stack contents we want. After that, fib reaches its terminating . and returns control to fibonacci, which also reaches its terminating . and thus exits with F_n on the stack.

If n-i != 0, then the second lambda expression is executed:

 

So now we have again produced the input state expected by fib but now for i+1. Thus, we call fib (tail-)recursively to produce F_n from this state.

After this glimpse of how to program in TOY, let's define the language (semi-)formally.

Overall Structure of a TOY Program

A TOY program is made up of type definitions, variable definitions, function definitions, and comments.

Scoping Example

TOY supports statically scoped local definitions, that is, functions can define local types, functions, and variables; these are visible anywhere in the function, including inside inner functions, but not outside the function.

 

Comments

Comments begin with // and extend to the end of the line.

Function Definitions

A function definition is of the form

 

The function body is composed of functions to be called to implement the function, including recursive calls to the function itself, and of local type, function, and variable definitions.

Local type, function, and variable definitions are visible anywhere in the function, even before they are defined. They can be arbitrarily interleaved with the implementation of the function, but for readability, it is recommended to put the definitions at the beginning or end of the function.

Anonymous functions are defined as

 

This is TOY's notion of a lambda expression. The spaces are significant since all tokens in TOY need to be separated by spaces. The same goes for the final . in the definition of a named function or type. [ <function body> ] creates a function object and pushes a reference to this function object on the stack. This function can be invoked using call, which pops the function object from the stack and then executes its constituent instructions, or conditionally using ? or ?? (see below). We can also store this reference to a function object in a variable:

 

The function can then be invoked by writing <function name> anywhere where the variable is in scope. In fact, the above definition fun <function name> <function body> . is simply syntactic sugar for assigning an anonymous function to a variable.

Variable Definitions

A variable is defined using

 

This takes the topmost element from the stack and stores it in the variable with name <variable name>. There is no way to change the value of this variable. In this sense, variables are much like variables in functional languages. To create mutable variables, store references in these variables (see below).

For most values stored in variable, writing <variable name> as part of a function definition will result in pushing the content of the variable on the top of the stack. The exception to this rule is when the variable stores a reference to a function object (or to a coroutine object if you do the bonus assignment). In this case, writing <variable name> invokes the function. This is done for convenience because we invoke functions more often than manipulating references to them. If you want the reference to the function object to be pushed on the stack, for example to be passed as an argument to another function, use #<variable name>. For consistency, you can also write #<variable name> for variables that store values other than references to function objects, but the effect is exactly the same as if you had written <variable name> in this case.

Type Definitions

As discussed below, TOY has a fairly limited number of built-in types. User-defined types are comparable to structs in C. They are defined using

 

Writing <type name> as part of a function body then pushes a reference to an uninitialized object of type <type name> on the stack. The object can then be manipulated using field accessors. To access the value of a field <field>, write <field> as part of the function. This pops the object from the stack and pushes the value of the named field onto the stack. To assign to the field with name <field>, use <field>=. This expects an object and the value to be assigned to the field <field> on the stack and pushes a new object onto the stack whose field <field> now has the specified value. An example:

 

creates a point type with two fields x and y. To create a point with x-coordinate 1 and y-coordinate 2, we write

 

If this is followed immediately by x, then the value on the stack is 1 (and the object is no longer on the stack). If you want to access the x-coordinate while keeping the object around, you need to make a copy of it, using dup (see below).

Main Function

Every TOY program must have a main function. This function is the entry point of the program.

Built-in Data Types

TOY has 4 primitive built-in data types: Bool, Int, Float, and Char. In addition, it has an Array type and a String type implemented as an Array of Char. Bool, Int, Float, and Char are stored by value. Array (and thus also String) are stored by reference.

Functions are first-class objects, that is, references to functions can be stored on the stack. The same is true for coroutines if you do the bonus assignment.

A final data type that comes in handy sometimes is the reference type. This is essentially a pointer. You can create two types of references in TOY: to an arbitrary stack location or to a heap location created using new (see below).

Bools

The Bool type has two values, true and false. TOY offers two built-in functions, true and false that push a true or false Boolean onto the stack:

 

A number of built-in functions manipulate Bool values:

 

We can also compare Bools, where false < true. This allows us to create logical relationships such as XOR (not equal), equivalence (equal) or implication (less than or equal):

 

Comparisons

In fact, all built-in types support comparisons. For integers, floating point numbers, and characters, the comparison is defined in the obvious fashion. For arrays, we compare them element by element until we find the first position where the arrays differ. This position determines the result of the comparison. If there is no such position, the shorter array is less than the longer one. If the two arrays have the same length and are equal in all positions, the two arrays are equal.

References and user-defined types are compared based on their addresses in memory.

Integers

The usual way to create integers on the stack is by including integer literals in function definitions. Think of an integer literal as a constant function that pushes the value of this literal onto the stack.

In addition to the comparison functions defined for all data types, integers support the following arithmetic and bitwise logical orperations:

 

Floating Point Numbers

Float literals behave like integer literals; they push themselves on the stack. Floats support comparisons and the 4 basic arithmetic operations:

 

Characters

Character literals are of the form '<char>', where <char> can be any character other than ' and \. To create these two characters, write '\'' and '\\'. (For the sake of making our implementation easy, we will also assume that escaped characters such as '\n' have the same meaning as in Python (or C/C++ if you choose to use it as your implementation language). Characters themselves can only be compared.

Arrays

Arrays can store arbitrary values. Different array positions can store values of different types, just as in Python. To create an empty array, write array. This creates a new array object and pushes a reference to this object on the stack. TOY has a number of functions to manipulate arrays:

 

Strings

Strings are simply arrays of characters and thus support all the array operations. In addition, we can create string objects by writing string literals such as "This is a bit of text", which creates an array object storing these characters and pushes a reference to this object onto the stack. Again, the string must not contain " or \ except in escaped form: \" and \\.

References

References are TOY's way of pointing to other objects. An uninitialized reference is created using ref or new. ref creates a reference that points to the stack location currently on the top of the stack and pushes this reference onto the stack. This can be useful if we want to implement some deeply nested code that needs to hold on to a specific fixed location on the stack. new allocates a new location where values can be stored on the heap and pushes a reference to this location on the stack. These objects are garbage-collected. Since we implement TOY in Python or Java, the host language's garbage collector takes care of this for us, so we don't have to worry about implementing our own garbage collector.

There are really only two operations references support: @ pops a reference object from the stack and pushes its contents onto the stack. @= pops a reference object and a value from the stack and updates the referenced memory location with the given value. Note that these two operators are overloaded because they are also used to access and update array cells. In your implementation, you simply have to check the type of the topmost stack entry. If it's a reference, you need to treat it the way discussed here. If it's an integer and the object below it is an array, then treat it as an array dereference operation. Anything else is an error.

Symbols

A symbol is of the form :<identifier>. This pushes a representation of the <identifier> onto the stack. (In your implementation, you probably want to represent symbols as Python strings.) One use case is the symbols :read and :write used to indicate whether to open a file for reading or writing.

Built-in Functions

Type-specific functions were already discussed above. Here, we discuss functions that are type-agnostic or didn't really fit in the above discussion.

Coercion

While stack cells, references, and variables can store values of any type, operations are type-aware. Addition, for example, expects two integers or two floats and pushes a result of the same type. Any other combination of arguments should result in an error. This includes trying to add an Int and a Float.

If you want to add an Int and a Float, you have to choose which type you want the result to have and explicitly coerce one of the arguments to Int or Float. Coercion to a given type is supported only between the four basic types Bool, Int, Float, and Char. To replace the topmost stack entry with the result of coercing it to type t, use t as if it was a function. For example, if the Float 1.5 is on the top of the stack, then Int used as a function application replaces it with the value 1. The exact semantics of coercion are as follows:

To Bool: An integer or float that is 0 and the character '0' is converted to false. Any other value is converted to true.

To Int: Any floating point value is rounded down to the nearest integer. false is converted to 0, true is converted to 1. A character is converted to its ASCII code (we ignore the complexities of UTF-8 here).

To Float: Any integer is converted to a float with the same value. false becomes 0.0, true becomes 1.0. A character is converted to its ASCII code represented as a Float.

To Char: false becomes '0'; true becomes '1'. Any integer is read as an ASCII code and converted to the corresponding character. Any float is rounded down to the nearest integer and then converted to the character with the resulting ASCII code.

Printing and Parsing

TOY has two functions for converting between strings and other types and vice versa.

format pops the topmost value, which must be a Bool, Int or Float, from the stack and pushes its representation as a literal of its type onto the stack. Thus, 42 (an Int) becomes the string "42". 43.23 becomes the string "43.23". false and true are converted to the strings "false" and "true", respectively. Trying to format any value that is not a Bool, Int or Float is an error.

asInt, asBool, and asFloat perform the opposite conversion. For Int and Float, the semantics are obvious. For Bool, both "false" and "0" are converted to false and "true" and "1" are converted to true.

We also have asSymbol and fromString to convert between strings and symbols. :foo asSymbol produces the string "foo". "foo" fromString produces the symbol :foo.

I/O

TOY's built-in I/O functions are fairly limited, just enough to get us going.

getChar reads a character from stdin and pushes it onto the stack. putChar pops a character from the stack and writes it to stdout. nl writes a newline character to stdout.

getStr and putStr are similar to getChar and putChar but read/write a string. getStr reads until the next newline or end of file and pushes the read string excluding the read newline onto the stack. putStr writes whatever string it is given to stdout, without any newline at the end.

open opens a file and pushes a file object onto the stack. It takes two arguments from the stack, a file name and one of the symbols :read or :write for opening the file in read or write mode. close expects a file handle on the stack and closes the file. We now have getCharF, putCharF, getStrF, and putStrF, which expect, for example in the case of putCharF, a file object and a character on the top of the stack; the character is then written to the file.

There's a built-in eof function, which takes a file handle and replaces it with a Bool that indicates whether the file has been read all the way to the end. Applying eof to a file opened in write mode always produces false.

We will keep our I/O implementation simple by not implementing any means to catch errors that may happen or worry about distinguishing between appending to a file or overwriting to a file opened in write mode. TOY simply overwrites every file it opens in write mode.

As programs in any language, TOY programs need to have a way to access their command line arguments. getArgs pushes an array of strings containing these command line arguments onto the stack.

Stack Manipulation

Since TOY is based on manipulating a data stack explicitly, it helps to have a number of primitives to manipulate the stack contents explicitly.

dup makes a copy of the element on the top of the stack:

 

dupn expects an integer n on the top of the stack and makes copies of the next n elements below it:

 

drop drops the top element from the stack. dropn expects an integer n on the stack and drops it along with the next n elements on the stack.

 

rotl rotates the top 3 elements on the stack one position to the left. rotln expects an integer n on the stack, pops it, and then rotates the next n elements on the stack one position to the left.

 

rotr and rotrn do the opposite:

 

Finally, swap swaps the top two elements on the stack and swapn expects an integer n on the top of the stack and pops it; it then swaps the new topmost element on the stack with the nth element from the top of the stack, where 0 refers to the topmost element, so swap and 1 swapn do the same thing.

 

Formal Syntax

Lexical Structure

Here's a list of TOY's tokens along with the regular expressions that define what a valid token of this type is:

 

In addition to these complex tokens, [, ], [|, |], and . are tokens that I don't give a special name to. The constraint for ids is that the id must not start with // (the start of a comment) and must not match any of the keywords ., fun, cofun, type, and var. The constraint for references is that the part after the % must satisfy the constraint for ids.

Syntactic Structure

Here's the context-free grammar that describes the structure of a TOY program:

 

 

Semantic Structure and Scoping

The semantics of the different constructs were spelled out before, so I won't iterate them here. The scoping rules are static. A scope is delimited by fun/cofun and its matching . An identifier in a given scope refers to the variable with this name defined in the smallest enclosing scope where a variable with this name exists. Defining the same variable twice in the same scope is an error.

Extensions (Bonus Assignment)

Tail Recursion

We want to avoid building up a massive stack when a function calls itself recursively, as for example the fib function above does. This is particularly important because TOY has no built-in loop constructs. Loops need to be implemented using ?? and tail recursion. You can support tail recursion in two ways:

Goto

Since functions do not explicitly return any values but only manipulate a stack, the last function called from a function body can always be jumped to rather than making a recursive call the function has to return from. This is roughly how tail recursion is implemented for example in Haskell. However, given that even function calls earlier in the function body do not have any return values and only manipulate the stack, there's a more elegant alternative that avoids recursive calls altogether.

Direct Threaded Code

If you attempt the bonus assignment and aim to also implement coroutines, I recommend this second implementation because it makes it a bit easier to implement coroutines.

FORTH systems use a simple execution model that is faster than implementing function calls using a call stack and achieves this by exploiting the simple structure of a FORTH program. (Our interpreter will be slow because it's implemented in Python, but we can still use the same technique to implement the interpreter.) While TOY has a slightly richer structure, the same execution model can be used here.

The idea is to maintain the functions to be called on a stack, with the next function to be executed on the top. To avoid confusion with the data stack used to hold data items to be manipulated, we refer to this stack as the execution stack. Recall that every function is implemented as a function object and that the start of the program is the main function. The program execution starts by initializing an empty execution stack and jumping to the main function. The execution of a function depends on whether it is a built-in or a user-defined function:

Coroutines

TOY's implementation of coroutines presents itself to the programmer as follows: The following code defines a coroutine that generates the sequence of positive integers:

 

The following code then prints the first 10 integers:

fun printTen
  ints go
  fun go
    swap dup format putStr nl
    10 < [ resume go ] [ drop ] ??
  .
.

printTen calls ints. ints creates its own local data stack (and its local execution stack)—this is one of the main characteristics of a coroutine. It then pushes 0 on this local stack and calls its inner function go (this is just a regular function call, so it continues to operate on the data stack created by ints. go adds 1 to the value on the top of the stack, so the top of the stack is now 1. It duplicates this number, producing the stack content 1 1. Then << moves the element on the top of ints's data stack to the top of its caller's (printTen) data stack. yield suspends the execution of the coroutine and pushes a reference to the current execution state of the coroutine onto printTen's data stack. printTen now finds a 1 and a reference, let's call it intsState, on its data stack. It call its own local function go. go swaps intsState with 1 to put 1 on top of the stack and duplicates the 1. It turns the topmost 1 into a string using format and writes it to the screen using putStr nl. It then checks whether 1 < 10. Since this is not the case, it executes resume go. resume takes intsState from the top of the stack, suspends the execution of printTen and resumes the execution of ints's inner function go. go was suspended after yield, with 1 on the top of its stack. It calls itself (tail-)recursively, so it again adds 1 to the top element and duplicates it, giving the stack content 2 2. The topmost 2 gets moved to printTen's data stack by <<, and then go gets suspended again by yield. This passes control back to printTen, which resumes after resume. It thus calls its inner function go, which swaps intState and 2 on the top of its stack, duplicates the 2, prints it to the screen using format putStr nl, checks 2 < 10, and the whole dance continues until we finally reach the situation where the element on the top of printTen's stack is 10. In this case, the comparison 10 < 10 fails and the block [ drop ] is executed instead of [ resume go ], which drops the reference to ints's execution state from printTen's data stack. printTen reaches its end and exits. However, ints is still in suspended state. The semantics of TOY's coroutines specify that a coroutine terminates when it either reaches its end (its execution stack becomes empty) or no more references to its suspended state exist anywhere. (Again, this is easy to implement by taking advantage of Python's garbage collector.)

So, in general, a coroutine is defined like a function but using the keyword cofun instead of fun. The execution of a coroutine starts by creating a new empty data stack and a new empty execution stack. These two stacks become the state of the coroutine. In order to allow communication with the caller, a coroutine also has a reference to its "parent's" execution state. For now, let's assume that the resume/yield handoff always happens with the same parent that initially called the coroutine. We'll get rid of this assumption below.

There are now six built-in functions to communicate with the parent coroutine:

Now, coroutines are first-class values just like functions and thus can be passed around. We can thus pass a reference to a coroutine foo to some function by pushing it onto the stack using %foo. We can also create anonymous coroutines using the notation [| ... |]. The more interesting thing happens when we pass a reference to a suspended coroutine to another coroutine call, something like this:

fun createCoroutine
  co1 resume co2
  
  cofun co1
    yield yield .
    
  cofun co2
    >> resume .
.

In this case, createCoroutine calls the coroutine co1, which immediately yields back to createCoroutine. The resume in createCoroutine does what we expect: it passes control back to co1, which yields again. Now we have a reference to co1's execution state on createCoroutine's stack. We call co2, which copies this referenco to its own state and then resumes co1. When co1 returns (or if it did, when it yields again), control should now return to co2, not to createCoroutine. The implementation is simple: Whenever a reference to a coroutine is used to resume it, the execution state that was active at the time the coroutine was resumed gets stored as the parent execution state in the execution state of the coroutine.