A great thing about Nix (the language) is that it’s tiny. Tiny
enough that besides some builtins and a few NixOS/Nixpkgs centered
features (e.g. paths), we’ve covered, for most intent and purposes,
all of it.
The next article
will be about writing a λ-calculus interpreter.
In preparation, we’ll focus here on implementing three small projects:
A prefix/polish expression evaluator (e.g. something that reads
a string like + + 3 * 5 + 2 2 19, and interprets it as 3+5*(2+2)+19=42);
A Brainf*ck evaluator (if you don’t know what it is, I guess an
example wouldn’t help much);
A mathematical expression evaluator (e.g.
something that reads and interprets 3+5*(2+2)+19);
The code for this series is still available on github.
Testing
Our code is going to get slightly more complex than before. As a result, and
because we’ll often work iteratively, from simple to complex, we’ll
want to have ways to gain a firm hand on the stability of the code.
Here’s a simple, generic test module, a Nix implementation of the one
described in this article:
with builtins;
rec {
apply = f: l: foldl' (x: y: x y) f l;
run1 = {descr ?"", fun, args, expected, indent ?""} :
let# This means, if we want only one argument that is a list,# we have to use a list with a single element: [theSingleListArg]. got =if isList(args) then (apply fun args)
else (fun args);
r = got == expected;
ok =if r then"OK"else"failed!";
in trace("${indent}${ok}\t:${descr}") (if r then r else trace("${indent}\tGot:")
trace(deepSeq got got)
trace("${indent}\tExpected:")
trace(deepSeq expected expected)
r)
;
run = foldl' (ok: t: ok && (run1 t)) true;
}
{ apply = <CODE>; run = <CODE>; run1 = <CODE>; }
Here’s how we could write tests for this test module itself:
As of now (2022-11-19), the tests are located in
lib/tests/misc.nix. They use an even simpler
approach, where each test is described by a set containing
two entries: expr, the expression to evaluate, and
expected, the expected results of the evaluation. For
instance:
Note: I’ll still be using ad-hoc tests for simple pieces of
code. Also, while I’ll be giving you tests in one format,
feel free to use anyother
you might prefer. The key point is to have an automated way to make sure
your code behaves as expected, and keeps doing so despite changes.
Prefix expressions
Introduction
The polish notation, or prefix notation,
is a way to write down mathematical expressions involving binary
operators (a binary operator is one that eats two numbers):
addition, subtraction, multiplication and division, in our case.
As the name suggests, the idea is to put operators in front of
the numbers they operate on, e.g. + 4 5 instead of 4 + 5.
A noticeable side-effect is that expressions became unambiguous
to the point that parentheses are superfluous.
For instance (3 + 4) * 5 can be written as * + 3 4 5:
the * expects to eat two numbers, but what follows it is an
operator, so we must first evaluates + 3 4 before we can
process the *, and so forth.
Exercise: There’s an engineering practice called
TDD, which
as the name suggests, consists in first writing tests, then writing
code matching those tests. We’re incidentally going to use it now, not
out of ideology, but so that you can manually handle prefixed expressions
before attempting to parse them, and to help you think about how to handle
some special cases.
/*
* This is a possible *convention*. There are others.
*/ {
descr =''exec ""'';
fun = prefix.exec
args ="";
expected =0;
}
{
descr =''exec "3"'';
fun = prefix.exec;
args ="3";
expected =3;
}
{
descr =''exec "+1 + 2 33"'';
fun = prefix.exec;
args ="+1 + 2 33";
expected =36;
}
{
descr =''exec "*2 +1 + 2hello33"'';
fun = prefix.exec;
args ="*2 +1 + 2hello33";
expected =72;
}
{
descr =''exec "*2.1 + 1 4.23"'';
fun = prefix.exec;
args ="*2.1 + 1 4.23";
expected =10.983;
}
/*
* This one, and errors generally, can be handled
* in different ways. I'll let the interpreter crash,
* but you can choose another option.
*/ {
descr =''exec "+ 1 2 3"'';
fun = prefix.exec;
args ="+ 1 2 3";
expected = crash;
}
/*
* This one isn't necessarily an error...
*/ {
descr =''exec "+ 1"'';
fun = prefix.exec;
args ="+ 1";
expected = mystery;
}
Exercise: At this point, you may want to stop reading
and spend some time writing a solution.
Implementation
Exercise: Here’s something to warm you up: write a
function parseInt s which parses an integer from a string s.
You’ll want to use the ascii.nix and strings.nix module from
the previous article.
As you can see, I've chosen a rather peculiar way to handle
unwanted characters: this is a pragmatic choice to keep
the implementation simple: raising errors would make things
slightly more verbose, and wouldn't bring much more to the user.
I'll keep a similar convention for the final prefix interpreter.
#!/usr/bin/env -S nix-instantiate --evalwith builtins;
with (import../string/strings.nix);
let ascii = (import../string/ascii.nix);
parseInt = s: let n = stringLength s;
aux = acc: i: let c = charAt s i; inif i == n ||!ascii.isNum(c) then acc
else aux (acc*10+(ascii.toNum c)) (i +1)
; in aux 00;
in trace(parseInt "123")
trace(parseInt "0")
trace(parseInt "aaa")
trace(parseInt "456aaa12")
"ok"
trace: 123trace: 0trace: 0trace: 456"ok"
Exercise: Let’s make it slightly more difficult: write a parseNum s
function, which can parse integers, but also floating numbers.
#!/usr/bin/env -S nix-instantiate --evalwith builtins;
with (import../string/strings.nix);
let ascii = (import../string/ascii.nix);
toDecimal = n: if n <1then n else toDecimal (n / 10);
toFloat = n: n +0.1 - 0.1;
parseNum = s: let n = stringLength s;
aux = acc: i: x: let c = charAt s i; inif c =="."&&!x thenlet d = aux 0 (i+1) true; in acc + toDecimal(toFloat(d))
elseif i == n ||!ascii.isNum(c) then acc
else aux (acc*10+(ascii.toNum c)) (i +1) x
; in aux 00false;
in trace(parseNum "123")
trace(parseNum "123.0")
trace(parseNum "123.0aaa")
trace(parseNum "123.aa")
trace(parseNum "123.23")
trace(parseNum ".23ei")
trace(parseNum ".23.13")
"ok"
Parametrize the code with a function: on termination,
call that function with the all the values to be returned
as parameters.
#!/usr/bin/env -S nix-instantiate --evalwith builtins;
with (import../string/strings.nix);
let ascii = (import../string/ascii.nix);
toDecimal = n: if n <1then n else toDecimal (n / 10);
toFloat = n: n +0.1 - 0.1;
parseNum = s: f: let n = stringLength s;
aux = acc: i: x: f: let c = charAt s i; inif c =="."&&!x then aux 0 (i+1) true (d: i:
f (acc + toDecimal(toFloat(d))) i
)
elseif i == n ||!ascii.isNum(c) then f acc i
else aux (acc*10+(ascii.toNum c)) (i +1) x f
; in aux 00false f;
# Dumb function retrieving the two "returned" values f = n: i: "i=${toString i}; n=${toString n}";
in trace(parseNum "123" f)
trace(parseNum "123.0" f)
trace(parseNum "123.0aaa" f)
trace(parseNum "123.aa" f)
trace(parseNum "123.23" f)
trace(parseNum ".23ei" f)
trace(parseNum ".23.12" f)
"ok"
We're aiming at interpreting the expression as we read it. This
means, we must find a way to encode the state of a partially
applied expression: as we've done earlier with Eratosthenes' sieve
or with apply, the key is to use a function as
an accumulator.
Try to picture yourself going one byte at a time through the
input, and figure out how to progressively build/apply this
function.
with builtins;
with (import./string/strings.nix);
rec {
ascii = (import./string/ascii.nix);
ops = {
"+"= (x: y: x + y);
"-"= (x: y: x - y);
"*"= (x: y: x * y);
"/"= (x: y: x / y);
};
toDecimal = n: if n <1then n else toDecimal (n / 10);
toFloat = n: n +0.1 - 0.1;
exec = s: let e = stringLength s; aux = f: i:
if i == e then (f) elselet c = charAt s i; inif hasAttr c ops then aux (x: y: (f (ops."${c}" x y))) (i +1)
elseif ascii.isNum c thenlet parseNum = n: i: d: aux: let c = charAt s i; inif c =="."&&!d then parseNum 0 (i +1) true (m: i:
aux (n + toDecimal (toFloat m)) i
)
elseif i == e ||!ascii.isNum c then aux n i
else parseNum ((10* n) + ascii.toNum(c))(i +1) d aux
; in parseNum 0 i false (n: i: aux (f n) i)
# Ignore everything else (punctuation, whites, letters, etc.)else aux f (i +1)
; in aux (x: x) 0 ;
}
The code allows non-fully applied expressions, e.g.
+12 returns a function which takes an argument
and adds 12 to it. To ease the tests, we arbitrarily
apply those to dummy arguments.
trace: OK :exec ""trace: OK :exec "3"trace: OK :exec "+"trace: OK :exec "+12"trace: OK :exec "+ 1 2"trace: OK :exec "+1 + 2 33"trace: OK :exec "*2 +1 + 2hello33"trace: OK :exec "+ + 3 * 5 + 2 2 19"trace: OK :exec "*2.1 + 1 4.23"true
Brainf*ck
Introduction
Brainf*ck is an esoteric “programming language”.
Despite it lack of usability, it remains turing-complete:
it’s a great way to get a feel for what a Turing machine
is and what it’s capable of.
The language itself consists of eight commands, operating on
an infinite memory band, consisting of integer-valued cells,
addressed by a single head/pointer. The “current cell”
corresponds to the cell being pointed by the single head/pointer.
Here’s an example, assuming we can store characters in
the cells instead of just numbers:
The following is still executable: everything that isn't
a known character can be considered a comment / a no-op.
mem(0)=10 ++++++++++
while mem(0) is not 0[ add 7 to mem(1) >+++++++
add 10 to mem(2) >++++++++++
add 3 to mem(3) >+++
add 1 to mem(4) >+
go back to mem(0) and decrement it
until we've reached zero
<<<<-
]
At this point, we've ran the previous loop 10 times, and we're back on mem(0)
Memory now contains:
mem(0) = 0;
mem(1) = 70; mem(2) = 100;
mem(3) = 30; mem(4) = 10 ('\n')
We progressively build the correct
character codes from the values
previously created:
>++. 'H'
>+. 'e'
+++++++.. 'll'
+++. 'o'
>++. ''
<<+++++++++++++++. 'w'
>. 'o'
+++. 'r'
------. 'l'
--------. 'd'
>+. '!'
>. '\n'
Exercise: Without looking further, try to write
an implementation. You can test it to some reasonable degree with
the previous program. You may then want to keep reading,
and compare your solution to the one proposed.
Most of the code should be rather straightforward, especially
considering the “mover” object developed at the end of the
previous article.
But we can be a little fancy regarding the memory tape.
We could define it as a finite, (boring) regular Nix list. We could
even make it “cyclic”, for instance by going to the beginning
once we reach the end via > or the other way around, with <.
Or, we could make it cyclic by creating a “real” cyclic list,
for instance with sets.
But let’s face it, wouldn’t it be much more fun if we could build
an “infinite” band? After all, we’ve already saw that we could
build infinite lists, for instance, infinite streams of zeroes.
Surely, this shouldn’t be too hard.
Exercise: Can you figure out a way of moving
left and right in an infinite stream of zeroes? Obviously, we’ll
later want to be able to update our structure locally so as to
contain other numbers than zeroes.
The data structure to be found is called a list zipper:
it consists of two lists and an element: this singled-out element
will be the element pointed by the head of our Turing machine/Brainf*ck
interpreter.
To move left or right, because we're in a "pure" language, you
must create a new zipper corresponding to a newly focused element.
The idea is surprisingly simple: you start with two
infinite lazy lists of zeroes left and right, and with a zero
element. When you move right (the process is symmetric when
moving left):
add the current element on top of the left list;
take the head of the right list and use this as the current element;
and reduce the right list to its tail.
The infinite lists allows to fetch additional zeroes when hitting the
"borders" of the list. Note that we just need enough of a lazy list
implementation to build infinite lists (so no need for nil
or isEmpty).
#!/usr/bin/env -S nix-instantiate --evalwith builtins;
let cons = h: t: [h t];
car = xs: elemAt xs 0;
cdr = xs: elemAt xs 1;
zeroes = cons 0 (zeroes);
tape = b: e: a: [b e a];
prev = xs: elemAt xs 0;
elem = xs: elemAt xs 1;
next = xs: elemAt xs 2;
right = t: tape (cons (elem t) (prev t)) (car (next t)) (cdr (next t));
left = t: tape (cdr (prev t)) (car (prev t)) (cons (elem t) (next t));
# Obviously, there are other ways of altering the current cell,# this is good enough for us alter = t: n: tape (prev t) ((elem t) + n) (next t);
start = tape zeroes 0 zeroes;
mem = alter (left (left (alter (right (alter start 5)) 3))) 1;
in trace(deepSeq mem mem)
"ok"
As already shown in a previous exercise, we can implement the
state with a set. Then, “updating” the state is performed by having
functions returning the updated set/state. Here’s an example of the
state to represent a Brainf*ck interpreter:
state = {
# Let's use our fancy memory tape mem = tape zeroes 0 zeroes;
# stdin/stdout ("in" is a reserved keyword) xin ="";
out ="";
# Note that there's only one error (jump failure) err ="";
# Command and pointer to the current character# in cmd cmd ="";
pcmd =0;
};
Finishing it
We’re mostly done. A key element that remains is this one:
Exercise: Write two functions jmpnxt and jmpprv
who respectively jump to a matching [ or ]. You want to
take into account the fact that you can have nested loops. The
code for both functions is rather similar: you should be able
to factorize it.
jmp = d: a: b: e: c: n: let aux = st: c: i:
# syntax errorif (i == e) thennullelseif (charAt c i) == a then aux (st +1) c (i + d)
elseif (charAt c i) == b thenif st ==1then i
else aux (st - 1) c (i + d)
else aux st c (i + d)
; in aux 0 c n;
jmpnxt = c: jmp 1"[""]" (stringLength c) c;
jmpprv = jmp (-1) "]""["0;
Exercise: We’re mostly done, we just need to write
a recursive exec s function, which takes a state as input and
returns it updated.
Here's the final, complete implementation. We've got
a dbg entry in the state that can be used
for debugging/tracing.
#!/usr/bin/env -S nix-instantiate --evalwith builtins;
with (import./string/strings.nix);
let ascii = (import./string/ascii.nix);
# --- lazy lists (partial) implementation --- cons = h: t: [h t];
car = xs: elemAt xs 0;
cdr = xs: elemAt xs 1;
zeroes = cons 0 (zeroes);
# --- list zipper implementation ---# (aka, double-infinite stream) tape = b: e: a: [b e a];
prev = xs: elemAt xs 0;
elem = xs: elemAt xs 1;
next = xs: elemAt xs 2;
right = t: tape (cons (elem t) (prev t)) (car (next t)) (cdr (next t));
left = t: tape (cdr (prev t)) (car (prev t)) (cons (elem t) (next t));
alter = t: n: tape (prev t) ((elem t) + n) (next t);
# --- Starting state --- state = {
mem = tape zeroes 0 zeroes;
# stdin/stdout ("in" is a reserved keyword) xin ="";
out ="";
# Note that there's only one error (jump failure) err ="";
# Command and pointer to the current character# in cmd cmd ="";
pcmd =0;
# Use (_: y: y) to disable (expects# a builtins.trace-compatible function). dbg = trace;
};
# --- Looping ---# jump in the given direction# d is the direction (as an increment of i)# a is the type of quote we're jumping away from# b is the type of quote we're jumping away to# e is the value of i marking the end of our search# c is the string (command) we're jumping in# n is our starting position jmp = d: a: b: e: c: n: let aux = st: c: i:
# syntax errorif (i == e) thennullelseif (charAt c i) == a then aux (st +1) c (i + d)
elseif (charAt c i) == b thenif st ==1then i
else aux (st - 1) c (i + d)
else aux st c (i + d)
; in aux 0 c n;
jmpnxt = c: jmp 1"[""]" (stringLength c) c;
jmpprv = jmp (-1) "]""["0;
# --- Entry point --- exec = s:
if s.pcmd ==nullthen s // { err ="syntax error: cannot jump"; }
elseif s.pcmd == (stringLength s.cmd) then s
elselet p = s.dbg(charAt s.cmd s.pcmd) (charAt s.cmd s.pcmd);
# By default (unless eventually for [ and ]), jump# to next instruction at next stepinlet t = (s // { pcmd = s.pcmd +1; }) // (
if p ==">"then { mem = right s.mem; }
elseif p =="<"then { mem = left s.mem; }
elseif p =="+"then { mem = alter s.mem ( 1); }
elseif p =="-"then { mem = alter s.mem (-1); }
elseif p =="."then {
out = s.out + (ascii.toChar (toString (elem s.mem)));
} elseif p ==","then {
mem = tape (prev s.mem) (ascii.toInt (charAt s.xin 0)) (next s.mem);
xin = eat1 s.xin;
} elseif p =="["&& (elem s.mem) ==0then {
pcmd = jmpnxt s.cmd s.pcmd;
} elseif p =="]"&& (elem s.mem) !=0then {
pcmd = jmpprv s.cmd s.pcmd;
# Comments, no-op } else {}
); in s.dbg(deepSeq t t) (exec t)
;
s = exec(state // { dbg = (_: y: y); cmd =''
++++++++++[>+++++++>++++++++++>+++>+<<<<-]
>++.>+.+++++++..+++.>++.<<+++++++++++++++.
>.+++.------.--------.>+.>.
'';});
/*
s = exec(state // { dbg = (_: y: y); cmd = ''
>+++++++++[<++++++++>-]<.>++++++[<+++++>-]
<-.+++++++..+++.>>+++++++[<++++++>-]<++.--
----------.<++++++++.--------.+++.------.--
------.>+.>++++++++++.
'';});
*/in s.out
"Hello World!\n"
Mathematical expressions
Introduction
We’ve started this article with a prefixed expression parser: we’re going to make
things more difficult now by trying to parse regular mathematical
expressions, always involving the four elementary binary operators:
addition, multiplication, subtraction and division. We’ll furthermore
allow the use of the minus/plus to serve as unary operators, e.g. we
should be able to write -3 to get minus three, instead of the more
cumbersome 0-3.
Note that the interpreter is almost trivial: 95%, 98% of the work
and the difficulties are in the parsing.
We’re going to use a technique commonly referred to as
recursive descent parsing. The approach is simple enough
to be introduced intuitively; parsing remains a subtle/tricky subject, so
be on your guard. There’s an “old” book on compilers, informally
called the “Dragon Book”, as its cover
depicts a knight engaging a dragon: a metaphor for taming the
inherent complexities of compilers.
I propose the following plan of action:
I’ll start by giving you general advises on how to proceed,
so that you can start working on your own: give it a serious
try; it’s an important exercise: it captures most of the
difficulties you can encounter when parsing a
programming language, once you’ve got this under your belt,
it’s not much harder to parse more advanced languages;
In case you’ll need it, I’ll give you a bunch of tests;
I’ll then sketch a solution, guiding you through the main
difficulties;
And finally, I’ll give you the code.
Exercise: Again, I encourage you to give it a try
before diving further. Work incrementally: start by parsing numbers,
numbers in parentheses. Then, try to make unary expressions work
e.g -4 or --5 or +(-67.)). Move on to parsing additions
and subtractions. Finish with multiplications/divisions. First
tip below is about the data structures to be used, and some general
guidance on the parsing; the second is a list of tests.
OK so the main data structure to use is a
tree:
your parser should create a tree from an input string. You might want
write down a few expressions and try to understand what kind of
trees would best represent them.
As already mentioned, the interpreter is almost trivial:
writing it first might give you some insights on how to
design the trees.
Finally, regarding the parsing, I would recommend you to
keep going byte-per-byte rather than tokenizing the input:
that overhead isn't necessary. However if you really have
trouble doing so, maybe start by tokenizing, make sure
your tokenizer works, and then write a parser using tokens.
You can always trim the tokenizer away later anyway.
#!/usr/bin/env -S nix-instantiate --evalwith builtins;
let ftests = (import./ftests.nix);
exprs = (import./exprs.nix);
tests = [
{
descr =''parse ""'';
fun = exprs.parse;
args ="";
expected = {
err ="unexpected EOF";
s ="";
p =0;
expr = {};
};
}
{
descr =''parse "3"'';
fun = exprs.parse;
args ="3";
expected = {
err =null;
s ="3";
p =1;
expr = { type ="Num"; val =3; };
};
}
{
descr =''parse "3.42"'';
fun = exprs.parse;
args ="3.42";
expected = {
err =null;
s ="3.42";
p =4;
expr = { type ="Num"; val =3.42; };
};
}
{
descr =''parse "((3.42))"'';
fun = exprs.parse;
args ="((3.42))";
expected = {
err =null;
s ="((3.42))";
p =8;
expr = { type ="Num"; val =3.42; };
};
}
{
descr =''parse "((3.42"'';
fun = exprs.parse;
args ="((3.42";
expected = {
err ="expecting ')'";
s ="((3.42";
p =6;
expr = { type ="Num"; val =3.42; };
};
}
{
descr =''parse "( ( 3.42"'';
fun = exprs.parse;
args ="( ( 3.42";
expected = {
err ="expecting ')'";
s ="( ( 3.42";
p =10;
expr = { type ="Num"; val =3.42; };
};
}
{
descr =''parse "aa12"'';
fun = exprs.parse;
args ="aa12";
expected = {
err ="digit/dot expected, got 'a'";
s ="aa12";
p =0;
expr = {};
};
}
# Not so great I guess? {
descr =''parse "12aa"'';
fun = exprs.parse;
args ="12aa";
expected = {
err =null;
s ="12aa";
p =2;
expr = { type ="Num"; val =12; };
};
}
{
descr =''parse "-3"'';
fun = exprs.parse;
args ="-3";
expected = {
err =null;
s ="-3";
p =2;
expr = {
type ="Unary";
sgn =-1;
expr = { type ="Num"; val =3; };
};
};
}
{
descr =''parse "-(3)"'';
fun = exprs.parse;
args ="-(3)";
expected = {
err =null;
s ="-(3)";
p =4;
expr = { type ="Unary"; sgn =-1; expr = { type ="Num"; val =3; }; };
};
}
{
descr =''parse "-(-(3.42 ))"'';
fun = exprs.parse;
args ="-(-(3.42 ))";
expected = {
err =null;
s ="-(-(3.42 ))";
p =12;
expr = {
type ="Unary";
sgn =-1;
expr = {
type ="Unary";
sgn =-1;
expr = { type ="Num"; val =3.42; };
};
};
};
}
{
descr =''parse "- - 3.42"'';
fun = exprs.parse;
args ="- - 3.42";
expected = {
err =null;
s ="- - 3.42";
p =10;
expr = {
type ="Unary";
sgn =-1;
expr = {
type ="Unary";
sgn =-1;
expr = { type ="Num"; val =3.42; };
};
};
};
}
{
descr =''parse "+(+(3.42 ))"'';
fun = exprs.parse;
args ="+(+(3.42 ))";
expected = {
err =null;
s ="+(+(3.42 ))";
p =12;
expr = { type ="Num"; val =3.42; };
};
}
{
descr =''parse "+ +3.42"'';
fun = exprs.parse;
args ="+ +3.42";
expected = {
err =null;
s ="+ +3.42";
p =9;
expr = { type ="Num"; val =3.42; };
};
}
{
descr =''parse "1 + +3.42"'';
fun = exprs.parse;
args ="1 + +3.42";
expected = {
err =null;
s ="1 + +3.42";
p =11;
expr = {
type ="Term";
op ="+";
left = { type ="Num"; val =1; };
right = { type ="Num"; val =3.42; };
};
};
}
{
descr =''parse "+ +3.42 + -2.25"'';
fun = exprs.parse;
args ="+ +3.42 + -2.25";
expected = {
err =null;
s ="+ +3.42 + -2.25";
p =17;
expr = {
type ="Term";
op ="+";
left = { type ="Num"; val =3.42; };
right = {
type ="Unary";
sgn =-1;
expr = { type ="Num"; val =2.25; };
};
};
};
}
# We definitely don't want this to be parsed# as 1 - (42+12); meaning, + & - are left associative {
descr =''parse "1 - 42 + 12"'';
fun = exprs.parse;
args ="1 - 42 + 12";
expected = {
err =null;
s ="1 - 42 + 12";
p =11;
expr = {
type ="Term";
op ="+";
left = {
type ="Term";
op ="-";
left = { type ="Num"; val =1; };
right = { type ="Num"; val =42; };
};
right = { type ="Num"; val =12; };
};
};
}
{
descr =''parse "1 + 42 * 12"'';
fun = exprs.parse;
args ="1 + 42 * 12";
expected = {
err =null;
s ="1 + 42 * 12";
p =11;
expr = {
type ="Term";
op ="+";
left = { type ="Num"; val =1; };
right = {
type ="Factor";
op ="*";
left = { type ="Num"; val =42; };
right = { type ="Num"; val =12; };
};
};
};
}
{
descr =''parse "1- 3 * 5 + (1 + 34 )/ 3."'';
fun = exprs.parse;
args ="1- 3 * 5 + (1 + 34 )/ 3.";
expected = {
err =null;
s ="1- 3 * 5 + (1 + 34 )/ 3.";
p =25;
expr = (exprs.mkTerm
(exprs.mkTerm
(exprs.mkNum 1)
"-" (exprs.mkFact (exprs.mkNum 3) "*" (exprs.mkNum 5)))
"+" (exprs.mkFact
(exprs.mkTerm (exprs.mkNum 1) "+" (exprs.mkNum 34))
"/" (exprs.mkNum 3.)));
};
}
{
descr =''parse "0 / 78 * 12"'';
fun = exprs.parse;
args ="0 / 78 * 12";
expected = {
err =null;
s ="0 / 78 * 12";
p =11;
expr = {
type ="Factor";
op ="*";
left = {
type ="Factor";
op ="/";
left = { type ="Num"; val =0; };
right = { type ="Num"; val =78; };
};
right = { type ="Num"; val =12; };
};
};
}
{
descr =''parse "1 + 3 * 5 + ( 1 - 34 )"'';
fun = exprs.parse;
args ="1 + 3 * 5 + ( 1 - 34 )";
expected = {
err =null;
s ="1 + 3 * 5 + ( 1 - 34 )";
p =22;
expr = (exprs.mkTerm
(exprs.mkTerm
(exprs.mkNum 1)
"+" (exprs.mkFact (exprs.mkNum 3) "*" (exprs.mkNum 5)))
"+" (exprs.mkTerm (exprs.mkNum 1) "-" (exprs.mkNum 34)));
};
}
{
descr =''exec "pre-parsed 1+3*5+(1-34)"'';
fun = exprs.exec;
args = {
type ="Term";
op ="+";
left = { type ="Num"; val =1; };
right = {
type ="Term";
op ="+";
left = {
type ="Factor";
op ="*";
left = { type ="Num"; val =3; };
right = { type ="Num"; val =5; };
};
right = {
type ="Term";
op ="-";
left = { type ="Num"; val =1; };
right = { type ="Num"; val =34; };
};
};
};
expected =1+3*5+(1-34);
}
{
descr =''exec (parse "2*(3+(2+4*5)-78+-12)")'';
fun = exprs.exec;
args = (exprs.parse "2*(3+(2+4*5)-78+-12)").expr;
expected =2*(3+(2+4*5)-78+-12);
}
];
in ftests.run tests
trace: OK :parse ""trace: OK :parse "3"trace: OK :parse "3.42"trace: OK :parse "((3.42))"trace: OK :parse "((3.42"trace: OK :parse "( ( 3.42"trace: OK :parse "aa12"trace: OK :parse "12aa"trace: OK :parse "-3"trace: OK :parse "-(3)"trace: OK :parse "-(-(3.42 ))"trace: OK :parse "- - 3.42"trace: OK :parse "+(+(3.42 ))"trace: OK :parse "+ +3.42"trace: OK :parse "1 + +3.42"trace: OK :parse "+ +3.42 + -2.25"trace: OK :parse "1 - 42 + 12"trace: OK :parse "1 + 42 * 12"trace: OK :parse "1- 3 * 5 + (1 + 34 )/ 3."trace: OK :parse "0 / 78 * 12"trace: OK :parse "1 + 3 * 5 + ( 1 - 34 )"trace: OK :exec "pre-parsed 1+3*5+(1-34)"trace: OK :exec (parse "2*(3+(2+4*5)-78+-12)")true
Sketching a solution
Let’s try to work this out, step-by-step. Recursive descent parsing
may be an intimidating name, but the principles are rather intuitive. Let’s
start by writing down a few sample expressions (I’ll be using integers here for
simplicity, but you should aim at making it work with floats):
As mentioned in a previous tip, our goal will be to create trees out
of expressions. What kind of trees? Well let’s think this through:
Isolated numbers will be the leaves of the tree. Each node
of the tree can be represented either by a list or a set, but a
set will be clearer. We could avoid the use of a type, but
this will also contribute to improve readability. For instance,
the number 3 could be represented by the following:
{ type ="Num"; val =3; }
OK, let’s try to represent an unary operation? For instance -3:
{ type ="Num"; val =-3; }
That seems reasonable, but it would break on something
like -(3+5), so we need to adjust things a little:
{ type ="Unary"; sgn =-1; expr = { ... }; }
On the other hand, we can simply ignore the + e.g. in +3
or +(4+5). But we could also use the previous structure with sgn
to +1 instead of -1.
What about an addition? Well, now things are getting more
interesting. What would you say of the following option?
{ type ="Add"; left =3; right =4; }
Hm, consider (3+4)+5: clearly, one of those addition involves
a left part which is yet another addition, a tree. Perhaps it’d
be better if left/right were trees/nodes:
{ type ="Add"; left = { ... }; right = { ... }; }
From there, simply changing the type to "Sub"", "Mult" and
so forth would give us all the other binary operators;
We could try to go this way, but I’ll take a shortcut: it’s customary
to group additions/subtractions together and call either a “term”. By
comparison, multiplications/divisions can be grouped together and either
one is called “factor”. That distinction allows us to express the fact
that there’s a difference in priority: multiplications/divisions should
be interpreted before additions/subtractions. So this would
gives us something like:
{ type ="Term"; op ="+"; left = { ... }; right = { ... }; }
{ type ="Factor"; op ="/"; left = { ... }; right = { ... }; }
I’ve said that this is a shortcut because we could have made it work
with the previous format, but it’s customary to represent things this
way, and will help in formulating what follows more directly.
So essentially, a mathematical expression is either:
an isolated number;
an unary expression;
a factor;
a term.
This is a neat observation: there are no other choices. This means,
if we want to operate recursively on an expression, we just have
to manage those four cases. It’s the same kind of idea that when we
considered lists recursively as being either empty, or a head attached
to a tail.
This also means that during the parsing, we’ll be parsing either one
at a given moment. Which implies that we’ll have ~four “states”. Now
if you recall the regular expression matcher from the
previous article,
we can “encode” each state by a function. Say:
parseNum =...;
parseUnary =...;
parseFact =...;
parseTerm =...;
# Here's the entry point parseExpr =...;
Then, what should all those functions eat/return? Essentially, they
will be eating the input string and a pointer to that string (the current
position in the input string), as we did for the Brainf*ck interpreter.
As for the returned values, they should return the new input position and
a tree, or an error. While not strictly speaking mandatory, it’s convenient,
as we did for the Brainf*ck interpreter, to wrap all the inputs/outputs
in a state variable: all those functions should then take both as input
and output such a state:
start = {
err =null;
s ="";
p =0;
expr = {};
};
OK, so, how to make all those functions work with each other? If we
want to parse a single number, say 4, and if parseExpr is our
entry point, then this means that parseExpr should find a way
to call parseNum, as parseNum is responsible for parsing a
number. So let’s say that parseExpr directly calls parseNum.
Now consider a basic addition:
1 + 4
Clearly this doesn’t work
anymore: we’d parse 1 and then return, ignoring everything else.
How should we proceed? Well, we must find a way to call parseTerm
between parseExpr and parseNum. The following pseudo-code
stills allows us to parse bare integers
But we can now inject more code in parseTerm to detect
and parse an addition:
parseNum =...;
parseUnary =...;
parseFact =...;
parseTerm = s: let t = parseNum s; inif haveAPlusSign s thenlet u = parseNum (skipPlus s); in u // { type ="Term"; op ="+"; left = t.expr; right = u.expr; }
else s
;
# Here's the entry point parseExpr = s: parseTerm s;
Now, what about:
1 + 3 + 6
Hm. Perhaps we can find a way to
have some sort of recursive call to parseTerm instead of calling
parseNum for the second term? For instance, parsing the previous
expression as:
1 + (3 + 6)
Okay, but what if we had:
1 - 3 + 6
Then we would have parsed it as:
1 - (3 + 6)
Which is clearly not what we want. Instead of calling parseTerm
directly, we could have an auxiliary function, allowing us to
build a tree with the correct associativity, so as to parse
that last expression as:
(1 - 3) + 6
How would this work in pseudo-code?
parseNum =...;
parseUnary =...;
parseFact =...;
parseTerm = s: let t = parseNum s; inlet aux = t:
if haveAPlusSign s thenlet u = parseNum (skipPlus s); in aux (u // { type ="Term"; op ="+"; left = t.expr; right = u.expr; })
else t
; in aux t
;
# Here's the entry point parseExpr = s: parseTerm s;
Now, how about parsing multiplications?
1 * 3
Clearly, we’d need parseExpr to find a way to call parseFact, but
should it be done directly, or for instance, via parseTerm? Consider
for instance:
1 + 3 * 5
or
1 * 42 + 4
We’d want the multiplication to be parsed before we do the addition
right? Which means that when we’re on an addition, we’d want to parse
the left/right hand side of it as either term, number, or factor:
(factor|number) + (term|factor|number)
So far, we were only considering the case where we had a number on the
left and either a term or a number on the right:
(number) + (term|number)
So, to make it work we simply need to call parseFactor
to get us either a factor or a number:
parseNum =...;
parseUnary =...;
parseFact = s: let t = parseNum s; in... ;
parseTerm = s: let t = parseFact s; inlet aux = t:
if haveAPlusSign s thenlet u = parseFact (skipPlus s); in aux (u // { type ="Term"; op ="+"; left = t.expr; right = u.expr; })
else t
; in aux t
;
# Here's the entry point parseExpr = s: parseTerm s;
I’ll stop here because all the major aspects have been developed: parsing
unary expressions requires the same kind of thinking; parsing subtractions
or divisions are almost trivial. You should also think about error handling,
such as premature end of content, missing parenthesis, etc.
Just take it slow, one feature at a time, and write tests to validate
what you’ve been doing so far. I would encourage you to write in that
order:
parseNum (you may want to revisit the floating number parsing
function we wrote earlier);
parseUnary;
parseTerm;
parseFact;
Finishing it
Exercise: So try to fill in the blanks on your own.
#!/usr/bin/env -S nix-instantiate --evalwith builtins;
with (import./string/strings.nix);
rec {
ascii = (import./string/ascii.nix);
ops = {
"+"= (x: y: x + y);
"-"= (x: y: x - y);
"*"= (x: y: x * y);
"/"= (x: y: x / y);
};
mkNum = n: { type ="Num"; val = n; };
mkUnary = s: e: { type ="Unary"; sgn = s; expr = e; };
mkFact = m: o: n: { type ="Factor"; op = o; left = m; right = n; };
mkTerm = m: o: n: { type ="Term"; op = o; left = m; right = n; };
start = {
err =null;
s ="";
p =0;
expr = {};
};
toDecimal = n: if n <1then n else toDecimal (n / 10);
toFloat = n: n +0.1 - 0.1;
atEOF = s: s.p == (stringLength s.s);
hasErr = s: s.err !=null;
peek1 = s: charAt s.s (skipWhites s.s s.p);
next1 = s: s // { p = (skipWhites s.s s.p) +1; };
skipws = s: s // { p = (skipWhites s.s s.p); };
# slightly adjusted from others/parseNum2.nix parseFloat = s: f: let aux = acc: i: x: f: let c = charAt s.s i; inif c =="."&&!x then aux 0 (i+1) true (d: i:
f (acc + toDecimal(toFloat(d))) i
)
elseif atEOF s ||!ascii.isNum(c) then f acc i
else aux (acc*10+(ascii.toNum c)) (i +1) x f
; in aux 0 s.p false f;
parseNum = s:
if atEOF s then s // { err ="unexpected EOF"; }
elselet c = peek1 s; inif c =="("thenlet t = parseExpr (next1 s);
inif hasErr t then t
elseif (peek1 t) !=")"then t // { err ="expecting ')'"; }
else next1 t
elseif c !="."&&!(ascii.isNum c) then s // { err ="digit/dot expected, got '${c}'"; }
else parseFloat (skipws s) (n: p: s // {
p = p; expr = mkNum(n);
})
;
parseUnary = s: let c = peek1 s; inif c =="-"thenlet t = parseUnary (next1 s); inif hasErr t then t else t // { expr = mkUnary (-1) t.expr; }
elseif c =="+"then parseUnary (next1 s)
else parseNum s
;
parseFact = s: let t = parseUnary s; inif hasErr t then t
elselet aux = t:
let c = peek1 t; inif c =="*"|| c =="/"thenlet u = parseUnary (next1 t); inif hasErr u then u
else aux (u // { expr = mkFact t.expr c u.expr; })
else t
; in aux t
;
parseTerm = s: let t = parseFact s; inif hasErr t then t
elselet aux = t:
let c = peek1 t; inif c =="+"|| c =="-"thenlet u = parseFact (next1 t); inif hasErr u then u
else aux (u // { expr = mkTerm t.expr c u.expr; })
else t
; in aux t
;
parseExpr = parseTerm;
parse = s: parseExpr (start // { s = s; });
exec = s:
if s.type =="Num"then s.val
elseif s.type =="Unary"then s.sgn * (exec s.expr)
else (getAttr s.op ops) (exec s.left) (exec s.right)
;
}
Conclusion
In the next article,
we’ll make things more interesting by targeting
a more refined language: the λ-calculus. We’ll see that recursive
descent parsing, as we’ve used it so far, has some limitations,
and how to overcome them.
And, most importantly, we’ll study a real interpreter, which
is at the basis of functional programming languages such as Nix.
While the parser will remain of comparable difficulty to the last
one developed here, the interpreter on the other hand will be
more refined (but still minute in comparison to real life
languages).
Comments
By email, at mathieu.bivert chez: