While taking the MOOC Introduction to Functional Programming by Erik Meijer on edX the current lecture greatly increased my interest in functional parsers. Functional parsers are about the essence of (functional) programming because they are about the concept of composition which is one of the core concepts of programming whatsoever.
The content of the lecture is closely related to the chapter 8 Functional Parsers of the book Programming in Haskell by Graham Hutton. It starts out with the definition of a type for a parser and a few very basic parsers.
I'm totally amazed by the simplicity, the elegance and the compositional aspect of these examples. I find it impressive how primitive but yet powerful these simple parsers are because they can easily be combined to form more complex and very capable parsers.
As an exercise, out of curiosity, and because of old habits I implemented the examples from the book in C#.
At the end of this post there will be an ultimate uber-cool parser example of a parser for arithmetic expressions.
The complete code from this post can be found here on GitHub.
Here is what I got:
The parser type
The type is just a function that takes a string as input and returns a list of tuples of a value of any type and a string:
public delegate List<Tuple<T, String>> Parser<T>(String inp);
The first item of the tuple represents a parsed value and the second item is the remaining unconsumed part of the input string. The fact that the return type is a list indicates that the result can also be empty if the parser fails. Otherwise, if the parser succeeds, the result will always be a singleton list.
Basic parsers
Here are the three most basic parsers.
Return always succeeds and returns the value that it was instantiated with as well as the unconsumed input string.
Failure always fails and therefore returns an empty list.
Item returns the first character of the input string as well as the remaining unconsumed input.
public static Parser<T> Return<T>(T v)
{
return inp => new List<Tuple<T, string>> { Tuple.Create(v, inp) };
}
public static Parser<T> Failure<T>()
{
return inp => new List<Tuple<T, string>>();
}
public static readonly Parser<char> Item =
inp => String.IsNullOrEmpty(inp)
? Failure<char>()(inp)
: Return(inp[0])(inp.Substring(1));
Instead of calling the parsers directly we will use an application function Parse
to abstract from the representation of parsers:
public static List<Tuple<T, string>> Parse<T>(this string input, Parser<T> p)
{
return p(input);
}
Note: In C# we don't really need this. I think, in Haskell this is needed because the parsers will be instances of the Monad
type class.
Sequencing
Next we need a function to combine multiple parsers. This can be done with Bind
which takes a parser and a function that returns another parser based on the result of the first parser. If the first parser fails, the whole computation will fail. Otherwise the function will be applied to the result of the first parser and return a parser that then will be applied to the unconsumed remaining input string.
public static Parser<T2> Bind<T1, T2>(this Parser<T1> p, Func<T1, Parser<T2>> f)
{
return inp =>
{
var result = inp.Parse(p);
return !result.Any()
? Failure<T2>()(inp)
: result.First().Item2.Parse(f(result.First().Item1));
};
}
Hers is an example:
var p = Parsers.Item.Bind(
x => Parsers.Item.Bind(
_ => Parsers.Item.Bind(
z => Parsers.Return(new string(new[] { x, z })))));
var result = "abc".Parse(p);
Check.That(result).ContainsExactly(Tuple.Create("ac", String.Empty));
LINQ query syntax
Now I added two functions Select
and SelectMany
to get some LINQ syntax sugar.
By adding Select
the Parser<T>
type becomes a Functor because now we can map functions over it. The function will be applied to the inner value when the parser is applied.
public static Parser<TResult> Select<TSource, TResult>(this Parser<TSource> source, Func<TSource, TResult> selector)
{
return inp =>
{
var result = inp.Parse(source);
return !result.An
? Failure<TResult>()(inp)
: Return(selector(result.First().Item1))(result.First().Item2);
};
}
SelectMany
is needed so that we can use the LINQ query syntax instead of using Bind
which similar to the do
-Notation in Haskell.
public static Parser<TResult> SelectMany<TSource, TValue, TResult>(this Parser<TSource> source, Func<TSource, Parser<TValue>> valueSelector, Func<TSource, TValue, TResult> resultSelector)
{
return source.Bind(s => valueSelector(s).Select(v => resultSelector(s, v)));
}
Here is an example of the use of the LINQ query syntax:
var p =
from x in Parsers.Item
from _ in Parsers.Item
from z in Parsers.Item
select new string(new []{x,z});
var result = "abc".Parse(p);
Check.That(result).ContainsExactly(Tuple.Create("ac", String.Empty));
Choice
Another way of combining parsers is to apply one parsers and if it fails apply another one with the Choice
function:
public static Parser<T> Choice<T>(this Parser<T> p, Parser<T> q)
{
return inp =>
{
var result = inp.Parse(p);
return !result.Any()
? inp.Parse(q)
: result;
};
}
Derived primitives
We can now create other simple parsers by combining what we have so far.
First we define a parser that takes a predicate and parses the first character if the predicate holds:
public static Parser<char> Sat(Predicate<char> p)
{
return Item.Bind(c => p(c) ? Return(c) : Failure<char>());
}
Now we can combine Sat
to create single character parsers for numeric, upper and lower case, alphabetic, alphanumeric or specific characters:
public static readonly Parser<char> Digit = Sat(Char.IsDigit);
public static readonly Parser<char> Lower = Sat(Char.IsLower);
public static readonly Parser<char> Upper = Sat(Char.IsUpper);
public static readonly Parser<char> Letter = Sat(Char.IsLetter);
public static readonly Parser<char> AlphaNum = Sat(Char.IsLetterOrDigit);
public static Parser<char> CharP(char c)
{
return Sat(x => x == c);
}
The parser CharP
can now be composed to a parser that matches a string:
public static Parser<string> StringRec(string str)
{
return String.IsNullOrEmpty(str)
? Return(String.Empty)
: from a in CharP(str[0])
from b in StringP(str.Substring(1))
from result in Return(str)
select result;
}
Also two very useful parsers are Many
and Many1
that apply a parser as many times as possible, where Many
allows zero or more and Many1
requires at least one successful application.
public static Parser<List<T>> Many<T>(this Parser<T> p)
{
return p.Many1().Choice(Return(new List<T>()));
}
public static Parser<List<T>> Many1<T>(this Parser<T> p)
{
return p.Bind(v => p.Many().Bind(vs => Return(new List<T> { v }.Concat(vs).ToList())));
}
The problem with recursion
In C# we have a little problem here. StringRec
is defined as a recursive function and Many
and Many1
are defined as mutually recursive functions. That means that they are defined in terms of each other. This is a beautiful and elegant thing. But since the C# compiler is not very good at supporting recursive structures, stack overflow exceptions are likely to happen already with relatively small input strings (> a few thousand characters).
Therefore we have to define those functions as non-recursive.
Here is another non-recursive version of Many
:
public static Parser<List<T>> Many<T>(this Parser<T> p)
{
return inp =>
{
var state = Tuple.Create(new List<T>(), inp);
while (true)
{
var newState = p(state.Item2);
if (!newState.Any())
break;
state = Tuple.Create(state.Item1.Concat(new[] { newState.First().Item1 }).ToList(), newState.First().Item2);
}
return Return(state.Item1)(state.Item2);
};
}
This looks almost like an unfold operation over the input string. But I couldn't really figure out how to refactor out a yet underlying pattern here. I'm not sure if that is possible at all or makes sense. (Maybe the list concatenation can be abstracted out?)
The non-recursive string parser looks like this:
public static Parser<string> StringP(string str)
{
return inp => !inp.StartsWith(str)
? Failure<string>()(inp)
: Return(str)(inp.Substring(str.Length));
}
Identifiers, numbers and spaces
Now we can define parsers for identifiers, numbers and spaces:
public static readonly Parser<int> Nat = Digit.Many1().Bind(n => Return(Int32.Parse(String.Concat(n))));
public static readonly Parser<string> Ident =
from c in Lower
from cs in AlphaNum.Many()
from result in Return(String.Concat(new[] {c}.Concat(cs)))
select result;
public static readonly Parser<Unit> Space = Sat(Char.IsWhiteSpace).Many().Select(x => Unit.Value);
Handling spacing
To be able to ignore arbitrary number of spaces before or after tokens to be parsed we define a parser Token
:
public static Parser<T> Token<T>(this Parser<T> p)
{
return
from before in Space
from t in p
from after in Space
from result in Return(t)
select result;
}
Now we can combine Token
with other parsers:
public static readonly Parser<string> Identifier = Ident.Token();
public static readonly Parser<int> Natural = Nat.Token();
public static Parser<string> Symbol(string xs)
{
return inp => inp.Parse(StringP(xs).Token());
}
Example: Parsing a list of integers
Now the first example of a useful parser is a parsers for non-empty lists of integers (in Haskell syntax) that ignores spacing around tokens. A correct example of this syntax could be this:
[ 1, 2 , 3, 4 , 5 ]
We can combine the parser like this:
var p =
from ign1 in Parsers.Symbol("[")
from n in Parsers.Natural
from ns in (from ign2 in Parsers.Symbol(",") from n2 in Parsers.Natural select n2)
.Many()
from ign3 in Parsers.Symbol("]")
select (new[] { n }.Concat(ns));
var result = "[ 1, 2 , 3, 4 , 5 ]".Parse(p);
Check.That(result.First().Item1).ContainsExactly(1, 2, 3, 4, 5);
// this shoudld fail
var result2 = "[1,2,]".Parse(p);
Check.That(result2).IsEmpty();
Example 2: Parsing arithmetic expressions
The next example will be a parser for arithmetic expressions with addition and multiplication like this or 2 + 3 * 4
or this (2 + 3) * 4
.
In my opinion, it is the ultimate uber-cool parser example because it really shows the power of the composability of parsers and FP in general.
Grammar
The grammar for these kinds of expressions looks like this :
Implementation
Expr
, Term
and Factor
are defined in terms of each other. Therefore the implementation is recursive.
What is really remarkable is that this parser tuple of Expr, Term and Factor is only composed of other general purpose primitives!
public static Parser<int> Expr()
{
return
from t in Term()
from r in
(from _ in Parsers.Symbol("+")
from e in Expr()
from r in Parsers.Return(t+e)
select r)
.Choice(Parsers.Return(t))
select r;
}
public static Parser<int> Term()
{
return
from f in Factor()
from r in
(from _ in Parsers.Symbol("*")
from t in Term()
from r in Parsers.Return(f*t)
select r)
.Choice(Parsers.Return(f))
select r;
}
public static Parser<int> Factor()
{
return
(from ign1 in Parsers.Symbol("(")
from e in Expr()
from ign2 in Parsers.Symbol(")")
from r in Parsers.Return(e)
select r)
.Choice(Parsers.Natural);
}
Also, please note that there is an obvious, visible, direct mapping between the grammar and the implementation.
Here are a few tests:
[TestCase("42", 42)]
[TestCase("(((((42)))))", 42)]
[TestCase("1+1", 2)]
[TestCase("(1+1)", 2)]
[TestCase("1*1", 1)]
[TestCase("1*2", 2)]
[TestCase("(1*2)", 2)]
[TestCase("2*3+4", 10)]
[TestCase("2*(3+4)", 14)]
[TestCase("2 * 3 + 4", 10)]
[TestCase("2*( 3+ 4) ", 14)]
[TestCase("2*3-4", 6)] // leaves "-4" unconsumed
[TestCase("((1))*(2+(((3)))*(4+(((5))+6))*(((7*8)))+9)", 2531)]
public void ArithmeticExpressions_should_succeed(string inp, int expected)
{
var result = inp.Parse(ArithmeticExpressions.Expr());
Check.That(result.First().Item1).IsEqualTo(expected);
}
[TestCase("-1")]
[TestCase("()")]
[TestCase("(5")]
[TestCase("(1+2")]
[TestCase("(1+2()")]
public void ArithmeticExpressions_should_fail(string inp)
{
var result = inp.Parse(ArithmeticExpressions.Expr());
Check.That(result).IsEmpty();
}