You may want to inform yourself about human rights in China.

On Nix's Language: Mathematical Expressions, Brainf*ck

tags: computing nix
date: 2022-12-08
update: 2024-04-01

A great thing about Nix (the language) is that it’s tiny. Tiny enough that besides some builtins and a few NixOS/Nixpkgs centered features (e.g. paths), we’ve covered, for most intent and purposes, all of it.

The next article will be about writing a λ-calculus interpreter. In preparation, we’ll focus here on implementing three small projects:

  1. A prefix/polish expression evaluator (e.g. something that reads a string like + + 3 * 5 + 2 2 19, and interprets it as 3+5*(2+2)+19=42);
  2. A Brainf*ck evaluator (if you don’t know what it is, I guess an example wouldn’t help much);
  3. A mathematical expression evaluator (e.g. something that reads and interprets 3+5*(2+2)+19);

The code for this series is still available on github.


Grasses (Травки, 1892), oil on canvas

Grasses (Травки, 1892), oil on canvas by Ivan Ivanovich Shishkin (Ива́н Ива́нович Ши́шкин, 1832-1898) through gurneyjourney.blogspot.com wikiart.orgPublic domain

Testing

Our code is going to get slightly more complex than before. As a result, and because we’ll often work iteratively, from simple to complex, we’ll want to have ways to gain a firm hand on the stability of the code.

Here’s a simple, generic test module, a Nix implementation of the one described in this article:

with builtins;
rec {
	apply = f: l: foldl' (x: y: x y) f l;

	run1 = {descr ? "", fun, args, expected, indent ? ""} :
		let
			# This means, if we want only one argument that is a list,
			# we have to use a list with a single element: [theSingleListArg].
			got = if isList(args) then (apply fun args)
			      else                 (fun args);
			r = got == expected;
			ok = if r then "OK" else "failed!";
		in
			trace("${indent}${ok}\t:${descr}") (if r then r else
			trace("${indent}\tGot:")
			trace(deepSeq got got)
			trace("${indent}\tExpected:")
			trace(deepSeq expected expected)
			r)
	;
	run = foldl' (ok: t: ok && (run1 t)) true;
}
{ apply = <CODE>; run = <CODE>; run1 = <CODE>; }

Here’s how we could write tests for this test module itself:

#!/usr/bin/env -S nix-instantiate --eval
with builtins;
let
	ftests = (import ./ftests.nix);
	tests = [
		{
			descr     = "run1 [1 2 3] ≠ [1 2 3 4]";
			fun      = ftests.run1;
			args     = {
				descr     = "(should fail)";
				indent   = "// ";
				fun      = _ : [ 1 2 3];
				args     = [];
				expected = [1 2 3 4];
			};
			expected = false;
		}
	];
in ftests.run tests
trace: // failed!       :(should fail)
trace: //       Got:
trace: «lambda fun @ ./nix/ftests_test.nix:12:16»
trace: //       Expected:
trace: [ 1 2 3 4 ]
trace: OK       :run1 [1 2 3][1 2 3 4]
true

Exercise: There are other options for testing. Try to find how Nixpkgs’s strings.nix module is tested for instance.

Solution:[+]

Note: I’ll still be using ad-hoc tests for simple pieces of code. Also, while I’ll be giving you tests in one format, feel free to use any other you might prefer. The key point is to have an automated way to make sure your code behaves as expected, and keeps doing so despite changes.

Prefix expressions

Introduction

The polish notation, or prefix notation, is a way to write down mathematical expressions involving binary operators (a binary operator is one that eats two numbers): addition, subtraction, multiplication and division, in our case.

As the name suggests, the idea is to put operators in front of the numbers they operate on, e.g. + 4 5 instead of 4 + 5. A noticeable side-effect is that expressions became unambiguous to the point that parentheses become superfluous. For instance (3 + 4) * 5 can be written as * + 3 4 5: the * expects to eat two numbers, but what follows it is an operator, so we must first evaluates + 3 4 before we can process the *, and so forth.

Exercise: There’s an engineering practice called TDD, which as the name suggests, consists in first writing tests, then writing code matching those tests. We’re incidentally going to use it now, not out of ideology, but so that you can manually handle prefixed expressions before attempting to parse them, and to help you think about how to handle some special cases.

So, fill the “blanks” of the following tests:

		{
			descr     = ''exec ""'';
			fun      = prefix.exec
			args     = "";
			expected = a;
		}
		{
			descr     = ''exec "3"'';
			fun      = prefix.exec;
			args     = "3";
			expected = b;
		}
		{
			descr     = ''exec "+1 + 2 33"'';
			fun      = prefix.exec;
			args     = "+1 + 2 33";
			expected = c;
		}
		{
			descr     = ''exec "*2 +1 + 2hello33"'';
			fun      = prefix.exec;
			args     = "*2 +1 + 2hello33";
			expected = d;
		}
		{
			descr     = ''exec "*2.1 + 1 4.23"'';
			fun      = prefix.exec;
			args     = "*2.1 + 1 4.23";
			expected = e;
		}
		{
			descr     = ''exec "+ 1 2 3"'';
			fun      = prefix.exec;
			args     = "+ 1 2 3";
			expected = f;
		}
		{
			descr     = ''exec "+ 1"'';
			fun      = prefix.exec;
			args     = "+ 1";
			expected = g;
		}
Solution:[+]

Exercise: At this point, you may want to stop reading and spend some time writing a solution.


Implementation

Exercise: Here’s something to warm you up: write a function parseInt s which parses an integer from a string s. You’ll want to use the ascii.nix and strings.nix module from the previous article.

Solution:[+]

Exercise: Let’s make it slightly more difficult: write a parseNum s function, which can parse integers, but also floating numbers.

Tip:[+]

Tip:[+]

Solution:[+]

There are two main approaches to implement the prefix expression parser:

  1. Either, you pre-parse the input by tokenizing it: you build on list of tokens, for instance:
    [ "+" "*" "134" "aaa" ".24" ".25" ]
    Then you may call our previous parseNum on number-looking tokens to get actual numbers from strings;
  2. Or, you go through the input byte-per-byte, and as it was done for the regular expression matcher.

The second option is perhaps less “intuitive”, but actually leads to a much more compact and elegant code, so I’ll focus on this one.


Exercise: Try to find rewrite parseNum so that it returns not just the parsed number, but the position in s at which it stopped parsing.

Tip:[+]

Solution:[+]

Exercise: We’ve written about half the code; you just need to find a way to tie everything together.

Tip:[+]

Solution:[+]
In the middle of the valley (Среди долины ровныя…, 1883), oil on canvas, 136×203cm

In the middle of the valley (Среди долины ровныя…, 1883), oil on canvas, 136×203cm by Ivan Ivanovich Shishkin (Ива́н Ива́нович Ши́шкин, 1832-1898) through wikimedia.orgPublic domain

Brainf*ck

Introduction

Brainf*ck is an esoteric “programming language”. Despite it lack of usability, it remains turing-complete: it’s a great way to get a feel for what a Turing machine is and what it’s capable of.

The language itself consists of eight commands, operating on an infinite memory band, consisting of integer-valued cells, addressed by a single head/pointer. The “current cell” corresponds to the cell being pointed by the single head/pointer. Here’s an example, assuming we can store characters in the cells instead of just numbers:

-----------------------------------------------
 ... |b|1|1|b|#|#|#|#|o|,| |w|o|r|l|d|!|b| ...
----------^------------------------------------

As for the available commands:

Command Description
< move the head one step to the left
> move the head one step to the right
+ increment the current cell’s value by one
- decrement the current cell’s value by one
. output the current cell’s value
, read one byte; set the current cell’s value to it
[ if the current cell’s value is zero, jump to the command right after the corresponding ];
otherwise move one step to the right
] if the current cell’s value is non-zero, jump to the command right after the corresponding [;
otherwise, move one step to the right

Exercise: Here’s a sample program. Can you figure out what it does and how?

++++++++++[>+++++++>++++++++++>+++>+<<<<-]
>++.>+.+++++++..+++.>++.<<+++++++++++++++.
>.+++.------.--------.>+.>.

Tip:[+]

Solution:[+]

Exercise: Without looking further, try to write an implementation. You can test it to some reasonable degree with the previous program. You may then want to keep reading, and compare your solution to the one proposed.

Here’s a variant of the same program:

>+++++++++[<++++++++>-]<.>++++++[<+++++>-]
<-.+++++++..+++.>>+++++++[<++++++>-]<++.--
----------.<++++++++.--------.+++.------.--
------.>+.>++++++++++.

Memory tape: list zippers

Most of the code should be rather straightforward, especially considering the “mover” object developed at the end of the previous article. But we can be a little fancy regarding the memory tape.

We could define it as a finite, (boring) regular Nix list. We could even make it “cyclic”, for instance by going to the beginning once we reach the end via > or the other way around, with <. Or, we could make it cyclic by creating a “real” cyclic list, for instance with sets.

But let’s face it, wouldn’t it be much more fun if we could build an “infinite” band? After all, we’ve already saw that we could build infinite lists, for instance, infinite streams of zeroes. Surely, this shouldn’t be too hard.

Exercise: Can you figure out a way of moving left and right in an infinite stream of zeroes? Obviously, we’ll later want to be able to update our structure locally so as to contain other numbers than zeroes.

Tip:[+]

Tip:[+]

Solution:[+]

State

As already shown in a previous exercise, we can implement the state with a set. Then, “updating” the state is performed by having functions returning the updated set/state. Here’s an example of the state to represent a Brainf*ck interpreter:

	state = {
		# Let's use our fancy memory tape
		mem = tape zeroes 0 zeroes;

		# stdin/stdout ("in" is a reserved keyword)
		xin = "";
		out = "";

		# Note that there's only one error (jump failure)
		err = "";

		# Command and pointer to the current character
		# in cmd
		cmd  = "";
		pcmd = 0;
	};

Finishing it

We’re mostly done. A key element that remains is this one:

Exercise: Write two functions jmpnxt and jmpprv who respectively jump to a matching [ or ]. You want to take into account the fact that you can have nested loops. The code for both functions is rather similar: you should be able to factorize it.

Tip:[+]

Solution:[+]

Exercise: We’re mostly done, we just need to write a recursive exec s function, which takes a state as input and returns it updated.

Solution:[+]
View near Dusseldorf (Вид в окрестностях Дюссельдорфа, 1865), oil on canvas, 106×151cm

View near Dusseldorf (Вид в окрестностях Дюссельдорфа, 1865), oil on canvas, 106×151cm by Ivan Ivanovich Shishkin (Ива́н Ива́нович Ши́шкин, 1832-1898) through wikimedia.orgPublic domain

Mathematical expressions

Introduction

We’ve started this article with a prefixed expression parser: we’re going to make things more difficult now by trying to parse regular mathematical expressions, always involving the four elementary binary operators: addition, multiplication, subtraction and division. We’ll furthermore allow the use of the minus/plus to serve as unary operators, e.g. we should be able to write -3 to get minus three, instead of the more cumbersome 0-3.

Note that the interpreter is almost trivial: 95%, 98% of the work and the difficulties are in the parsing.

We’re going to use a technique commonly referred to as recursive descent parsing. The approach is simple enough to be introduced intuitively; parsing remains a subtle/tricky subject, so be on your guard. There’s an “old” book on compilers, informally called the “Dragon Book”, as its cover depicts a knight engaging a dragon: a metaphor for taming the inherent complexities of compilers.

I propose the following plan of action:

  1. I’ll start by giving you general advises on how to proceed, so that you can start working on your own: give it a serious try; it’s an important exercise: it captures most of the difficulties you can encounter when parsing a programming language, once you’ve got this under your belt, it’s not much harder to parse more advanced languages;
  2. In case you’ll need it, I’ll give you a bunch of tests;
  3. I’ll then sketch a solution, guiding you through the main difficulties;
  4. And finally, I’ll give you the code.

Exercise: Again, I encourage you to give it a try before diving further. Work incrementally: start by parsing numbers, numbers in parentheses. Then, try to make unary expressions work e.g -4 or --5 or +(-67.)). Move on to parsing additions and subtractions. Finish with multiplications/divisions. First tip below is about the data structures to be used, and some general guidance on the parsing; the second is a list of tests.

Tip:[+]

Tip:[+]

Sketching a solution

Let’s try to work this out, step-by-step. Recursive descent parsing may be an intimidating name, but the principles are rather intuitive. Let’s start by writing down a few sample expressions (I’ll be using integers here for simplicity, but you should aim at making it work with floats):

1
-3
+5
-(+42)
1 + 3
1 + 3 + 5
1 - 4 + 3
1 + 3 * 5
1 * 3
1 + 3 * (5 + 1)
1 / 3 + 5

As mentioned in a previous tip, our goal will be to create trees out of expressions. What kind of trees? Well let’s think this through:

  1. Isolated numbers will be the leaves of the tree. Each node of the tree can be represented either by a list or a set, but a set will be clearer. We could avoid the use of a type, but this will also contribute to improve readability. For instance, the number 3 could be represented by the following:
     { type = "Num"; val = 3; } 
  2. OK, let’s try to represent an unary operation? For instance -3:
     { type = "Num"; val = -3; } 
    That seems reasonable, but it would break on something like -(3+5), so we need to adjust things a little:
     { type = "Unary"; sgn = -1; expr = { ... }; } 
    On the other hand, we can simply ignore the + e.g. in +3 or +(4+5). But we could also use the previous structure with sgn to +1 instead of -1.
  3. What about an addition? Well, now things are getting more interesting. What would you say of the following option?
    { type = "Add"; left = 3; right = 4; }
    Hm, consider (3+4)+5: clearly, one of those addition involves a left part which is yet another addition, a tree. Perhaps it’d be better if left/right were trees/nodes:
    { type = "Add"; left = { ... }; right = { ... }; }
  4. From there, simply changing the type to "Sub"", "Mult" and so forth would give us all the other binary operators;
  5. We could try to go this way, but I’ll take a shortcut: it’s customary to group additions/subtractions together and call either a “term”. By comparison, multiplications/divisions can be grouped together and either one is called “factor”. That distinction allows us to express the fact that there’s a difference in priority: multiplications/divisions should be interpreted before additions/subtractions. So this would gives us something like:
    { type = "Term";   op = "+"; left = { ... }; right = { ... }; }
    { type = "Factor"; op = "/"; left = { ... }; right = { ... }; }

I’ve said that this is a shortcut because we could have made it work with the previous format, but it’s customary to represent things this way, and will help in formulating what follows more directly.

So essentially, a mathematical expression is either:

This is a neat observation: there are no other choices. This means, if we want to operate recursively on an expression, we just have to manage those four cases. It’s the same kind of idea that when we considered lists recursively as being either empty, or a head attached to a tail.

This also means that during the parsing, we’ll be parsing either one at a given moment. Which implies that we’ll have ~four “states”. Now if you recall the regular expression matcher from the previous article, we can “encode” each state by a function. Say:

	parseNum   = ...;
	parseUnary = ...;
	parseFact  = ...;
	parseTerm  = ...;

	# Here's the entry point
	parseExpr  = ...;

Now what should all those functions eat/return? Essentially, they will be eating the input string and a pointer to that string (the current position in the input string), as we did for the Brainf*ck interpreter.

As for the returned values, they should return the new input position and a tree, or an error. While not strictly speaking mandatory, it’s convenient, as we did for the Brainf*ck interpreter, to wrap all the inputs/outputs in a state variable: all those functions should then take both as input and output such a state:

	start = {
		err  = null;
		s    = "";
		p    = 0;
		expr = {};
	};

OK, so, how to make all those functions work with each other? If we want to parse a single number, say 4, and if parseExpr is our entry point, then this means that parseExpr should find a way to call parseNum, as parseNum is responsible for parsing a number. So let’s say that parseExpr directly calls parseNum.

Now consider a basic addition:

1 + 4

Clearly this doesn’t work anymore: we’d parse 1 and then return, ignoring everything else. How should we proceed? Well, we must find a way to call parseTerm between parseExpr and parseNum. The following pseudo-code stills allows us to parse bare integers

	parseNum   = ...;
	parseUnary = ...;
	parseFact  = ...;
	parseTerm  = s: parseNum s;

	# Here's the entry point
	parseExpr  = s: parseTerm s;

But we can now inject more code in parseTerm to detect and parse an addition:

	parseNum   = ...;
	parseUnary = ...;
	parseFact  = ...;
	parseTerm  = s: let t = parseNum s; in
		if haveAPlusSign s then
			let u = parseNum (skipPlus s); in
				u // { type = "Term"; op = "+"; left = t.expr; right = u.expr; }
		else s
	;

	# Here's the entry point
	parseExpr  = s: parseTerm s;

Now, what about:

1 + 3 + 6

Hm. Perhaps we can find a way to have some sort of recursive call to parseTerm instead of calling parseNum for the second term? For instance, parsing the previous expression as:

1 + (3 + 6)

Okay, but what if we had:

1 - 3 + 6

Then we would have parsed it as:

1 - (3 + 6)

Which is clearly not what we want. Instead of calling parseTerm directly, we could have an auxiliary function, allowing us to build a tree with the correct associativity, so as to parse that last expression as:

(1 - 3) + 6

How would this work in pseudo-code?

	parseNum   = ...;
	parseUnary = ...;
	parseFact  = ...;
	parseTerm  = s: let t = parseNum s; in
		let aux = t:
			if haveAPlusSign s then
				let u = parseNum (skipPlus s); in
				aux (u // { type = "Term"; op = "+"; left = t.expr; right = u.expr; })
			else
				t
		; in aux t
	;

	# Here's the entry point
	parseExpr  = s: parseTerm s;

Now, how about parsing multiplications?

1 * 3

Clearly, we’d need parseExpr to find a way to call parseFact, but should it be done directly, or for instance, via parseTerm? Consider for instance:

1 + 3 * 5

or

1 * 42 + 4

We’d want the multiplication to be parsed before we do the addition right? Which means that when we’re on an addition, we’d want to parse the left/right hand side of it as either term, number, or factor:

(factor|number) + (term|factor|number)

So far, we were only considering the case where we had a number on the left and either a term or a number on the right:

(number) + (term|number)

So, to make it work we simply need to call parseFactor to get us either a factor or a number:

	parseNum   = ...;
	parseUnary = ...;
	parseFact  = s: let t = parseNum s; in
		...
	;
	parseTerm  = s: let t = parseFact s; in
		let aux = t:
			if haveAPlusSign s then
				let u = parseFact (skipPlus s); in
				aux (u // { type = "Term"; op = "+"; left = t.expr; right = u.expr; })
			else
				t
		; in aux t
	;

	# Here's the entry point
	parseExpr  = s: parseTerm s;

I’ll stop here because all the major aspects have been developed: parsing unary expressions requires the same kind of thinking; parsing subtractions or divisions are almost trivial. You should also think about error handling, such as premature end of content, missing parenthesis, etc.

Just take it slow, one feature at a time, and write tests to validate what you’ve been doing so far. I would encourage you to write in that order:

  1. parseNum (you may want to revisit the floating number parsing function we wrote earlier);
  2. parseUnary;
  3. parseTerm;
  4. parseFact;

Finishing it

Exercise: So try to fill in the blanks on your own.

Solution:[+]

Conclusion

In the next article, we’ll make things more interesting by targeting a more refined language: the λ-calculus. We’ll see that recursive descent parsing, as we’ve used it so far, has some limitations, and how to overcome them.

And, most importantly, we’ll study a real interpreter, which is at the basis of functional programming languages such as Nix. While the parser will remain of comparable difficulty to the last one developed here, the interpreter on the other hand will be more refined (but still minute in comparison to real life languages, such as Nix).

Lumbering (Рубка лесаа, 1867), oil on canvas, 125×196cm

Lumbering (Рубка лесаа, 1867), oil on canvas, 125×196cm by Ivan Ivanovich Shishkin (Ива́н Ива́нович Ши́шкин, 1832-1898) through wikimedia.orgPublic domain


In the series:


Comments

By email, at mathieu.bivert chez:

email