Functional Monadic Parsers ported to C#

While taking the MOOC Introduction to Functional Programming by Erik Meijer on edX the current lecture greatly increased my interest in functional parsers. Functional parsers are about the essence of (functional) programming because they are about the concept of composition which is one of the core concepts of programming whatsoever.

The content of the lecture is closely related to the chapter 8 Functional Parsers of the book Programming in Haskell by Graham Hutton. It starts out with the definition of a type for a parser and a few very basic parsers.

I’m totally amazed by the simplicity, the elegance and the compositional aspect of these examples. I find it impressive how primitive but yet powerful these simple parsers are because they can easily be combined to form more complex and very capable parsers.

As an exercise, out of curiosity, and because of old habits I implemented the examples from the book in C#.

At the end of this post there will be an ultimate uber-cool parser example of a parser for arithmetic expressions.

The complete code from this post can be found here on GitHub.

Here is what I got:

The parser type

The type is just a function that takes a string as input and returns a list of tuples of a value of any type and a string:

public delegate List<Tuple<T, String>> Parser<T>(String inp);

The first item of the tuple represents a parsed value and the second item is the remaining unconsumed part of the input string. The fact that the return type is a list indicates that the result can also be empty if the parser fails. Otherwise, if the parser succeeds, the result will always be a singleton list.

Basic parsers

Here are the three most basic parsers.

Return always succeeds and returns the value that it was instantiated with as well as the unconsumed input string.

Failure always fails and therefore returns an empty list.

Item returns the first character of the input string as well as the remaining unconsumed input.

public static Parser<T> Return<T>(T v)
{
    return inp => new List<Tuple<T, string>> { Tuple.Create(v, inp) };
}

public static Parser<T> Failure<T>()
{
    return inp => new List<Tuple<T, string>>();
}

public static readonly Parser<char> Item = 
    inp => String.IsNullOrEmpty(inp)
        ? Failure<char>()(inp)
        : Return(inp[0])(inp.Substring(1));

Instead of calling the parsers directly we will use an application function Parse to abstract from the representation of parsers:

public static List<Tuple<T, string>> Parse<T>(this string input, Parser<T> p)
{
    return p(input);
}

Note: In C# we don’t really need this. I think, in Haskell this is needed because the parsers will be instances of the Monad type class.

Sequencing

Next we need a function to combine multiple parsers. This can be done with Bind which takes a parser and a function that returns another parser based on the result of the first parser. If the first parser fails, the whole computation will fail. Otherwise the function will be applied to the result of the first parser and return a parser that then will be applied to the unconsumed remaining input string.

public static Parser<T2> Bind<T1, T2>(this Parser<T1> p, Func<T1, Parser<T2>> f)
{
    return inp =>
    {
        var result = inp.Parse(p);
        return !result.Any() 
            ? Failure<T2>()(inp)
            : result.First().Item2.Parse(f(result.First().Item1));
    };
}

Hers is an example:

var p = Parsers.Item.Bind(
        x => Parsers.Item.Bind(
            _ => Parsers.Item.Bind(
                z => Parsers.Return(new string(new[] { x, z })))));

var result = "abc".Parse(p); 
Check.That(result).ContainsExactly(Tuple.Create("ac", String.Empty));                

LINQ query syntax

Now I added two functions Select and SelectMany to get some LINQ syntax sugar.

By adding Select the Parser<T> type becomes a Functor because now we can map functions over it. The function will be applied to the inner value when the parser is applied.

public static Parser<TResult> Select<TSource, TResult>(this Parser<TSource> source, Func<TSource, TResult> selector)
{
    return inp =>
    {
        var result = inp.Parse(source);
        return !result.An
            ? Failure<TResult>()(inp)
            : Return(selector(result.First().Item1))(result.First().Item2);
    };
}

SelectMany is needed so that we can use the LINQ query syntax instead of using Bind which similar to the do-Notation in Haskell.

public static Parser<TResult> SelectMany<TSource, TValue, TResult>(this Parser<TSource> source, Func<TSource, Parser<TValue>> valueSelector, Func<TSource, TValue, TResult> resultSelector)
{
    return source.Bind(s => valueSelector(s).Select(v => resultSelector(s, v)));
}

Here is an example of the use of the LINQ query syntax:

var p = 
    from x in Parsers.Item
    from _ in Parsers.Item
    from z in Parsers.Item
    select new string(new []{x,z});

var result = "abc".Parse(p); 
Check.That(result).ContainsExactly(Tuple.Create("ac", String.Empty));

Choice

Another way of combining parsers is to apply one parsers and if it fails apply another one with the Choice function:

public static Parser<T> Choice<T>(this Parser<T> p, Parser<T> q)
{
    return inp =>
    {
        var result = inp.Parse(p);
        return !result.Any()
            ? inp.Parse(q)
            : result;
    };
}

Derived primitives

We can now create other simple parsers by combining what we have so far.

First we define a parser that takes a predicate and parses the first character if the predicate holds:

 public static Parser<char> Sat(Predicate<char> p)
{
    return Item.Bind(c => p(c) ? Return(c) : Failure<char>());
}

Now we can combine Sat to create single character parsers for numeric, upper and lower case, alphabetic, alphanumeric or specific characters:

public static readonly Parser<char> Digit = Sat(Char.IsDigit);

public static readonly Parser<char> Lower = Sat(Char.IsLower);

public static readonly Parser<char> Upper = Sat(Char.IsUpper);

public static readonly Parser<char> Letter = Sat(Char.IsLetter);

public static readonly Parser<char> AlphaNum = Sat(Char.IsLetterOrDigit);

public static Parser<char> CharP(char c)
{
    return Sat(x => x == c);
}

The parser CharP can now be composed to a parser that matches a string:

public static Parser<string> StringRec(string str)
{
    return String.IsNullOrEmpty(str)
        ? Return(String.Empty)
        : from a in CharP(str[0])
          from b in StringP(str.Substring(1))
          from result in Return(str)
          select result;
}

Also two very useful parsers are Many and Many1 that apply a parser as many times as possible, where Many allows zero or more and Many1 requires at least one successful application.

public static Parser<List<T>> Many<T>(this Parser<T> p)
{
    return p.Many1().Choice(Return(new List<T>()));
}

public static Parser<List<T>> Many1<T>(this Parser<T> p)
{
    return p.Bind(v => p.Many().Bind(vs => Return(new List<T> { v }.Concat(vs).ToList())));
}

The problem with recursion

In C# we have a little problem here. StringRec is defined as a recursive function and Many and Many1 are defined as mutually recursive functions. That means that they are defined in terms of each other. This is a beautiful and elegant thing. But since the C# compiler is not very good at supporting recursive structures, stack overflow exceptions are likely to happen already with relatively small input strings (> a few thousand characters).

Therefore we have to define those functions as non-recursive.

Here is another non-recursive version of Many:

public static Parser<List<T>> Many<T>(this Parser<T> p)
{
    return inp =>
    {
        var state = Tuple.Create(new List<T>(), inp);

        while (true)
        {
            var newState = p(state.Item2);
            if (!newState.Any())
                break;
            state = Tuple.Create(state.Item1.Concat(new[] { newState.First().Item1 }).ToList(), newState.First().Item2);
        }

        return Return(state.Item1)(state.Item2);
    };
}

This looks almost like an unfold operation over the input string. But I couldn’t really figure out how to refactor out a yet underlying pattern here. I’m not sure if that is possible at all or makes sense. (Maybe the list concatenation can be abstracted out?)

The non-recursive string parser looks like this:

public static Parser<string> StringP(string str)
{
    return inp => !inp.StartsWith(str)
        ? Failure<string>()(inp)
        : Return(str)(inp.Substring(str.Length));
}

Identifiers, numbers and spaces

Now we can define parsers for identifiers, numbers and spaces:

public static readonly Parser<int> Nat = Digit.Many1().Bind(n => Return(Int32.Parse(String.Concat(n))));

public static readonly Parser<string> Ident =
    from c in Lower
    from cs in AlphaNum.Many()
    from result in Return(String.Concat(new[] {c}.Concat(cs)))
    select result;

public static readonly Parser<Unit> Space = Sat(Char.IsWhiteSpace).Many().Select(x => Unit.Value);

Handling spacing

To be able to ignore arbitrary number of spaces before or after tokens to be parsed we define a parser Token:

public static Parser<T> Token<T>(this Parser<T> p)
{
    return
        from before in Space
        from t in p
        from after in Space
        from result in Return(t)
        select result;
}

Now we can combine Token with other parsers:

public static readonly Parser<string> Identifier = Ident.Token();

public static readonly Parser<int> Natural = Nat.Token();

public static Parser<string> Symbol(string xs)
{
    return inp => inp.Parse(StringP(xs).Token());
}

Example: Parsing a list of integers

Now the first example of a useful parser is a parsers for non-empty lists of integers (in Haskell syntax) that ignores spacing around tokens. A correct example of this syntax could be this:

[ 1,  2    , 3,  4 ,   5   ]

We can combine the parser like this:

var p =
    from ign1 in Parsers.Symbol("[")
    from n in Parsers.Natural    
    from ns in (from ign2 in Parsers.Symbol(",") from n2 in Parsers.Natural select n2)
                        .Many()
    from ign3 in Parsers.Symbol("]")
    select (new[] { n }.Concat(ns));

var result = "[ 1,  2    , 3,  4 ,   5   ]".Parse(p);

Check.That(result.First().Item1).ContainsExactly(1, 2, 3, 4, 5);

// this shoudld fail
var result2 = "[1,2,]".Parse(p);
Check.That(result2).IsEmpty();

Example 2: Parsing arithmetic expressions

The next example will be a parser for arithmetic expressions with addition and multiplication like this or 2 + 3 * 4 or this (2 + 3) * 4.

In my opinion, it is the ultimate uber-cool parser example because it really shows the power of the composability of parsers and FP in general.

Grammar

The grammar for these kinds of expressions looks like this :

 expr ::= term (+ expr | \epsilon) \

 term ::= factor (* term | \epsilon) \

 factor ::= (expr) | nat \

 nat ::= 0 | 1 | 2 | \dots \

Implementation

Expr, Term and Factor are defined in terms of each other. Therefore the implementation is recursive.

What is really remarkable is that this parser tuple of Expr, Term and Factor is only composed of other general purpose primitives!

public static Parser<int> Expr()
{       
    return     
        from t in Term()
        from r in
           (from _ in Parsers.Symbol("+")
            from e in Expr()
            from r in Parsers.Return(t+e)
            select r)
                .Choice(Parsers.Return(t))
        select r;
}

public static Parser<int> Term()
{
    return
        from f in Factor()
        from r in
           (from _ in Parsers.Symbol("*")
            from t in Term()
            from r in Parsers.Return(f*t)
            select r)
                .Choice(Parsers.Return(f))
        select r;
}

public static Parser<int> Factor()
{
    return
       (from ign1 in Parsers.Symbol("(")
        from e in Expr()
        from ign2 in Parsers.Symbol(")")
        from r in Parsers.Return(e)
        select r)
            .Choice(Parsers.Natural);
}

Also, please note that there is an obvious, visible, direct mapping between the grammar and the implementation.

Here are a few tests:

[TestCase("42", 42)]
[TestCase("(((((42)))))", 42)]
[TestCase("1+1", 2)]
[TestCase("(1+1)", 2)]
[TestCase("1*1", 1)]
[TestCase("1*2", 2)]
[TestCase("(1*2)", 2)]
[TestCase("2*3+4", 10)]
[TestCase("2*(3+4)", 14)]
[TestCase("2 * 3 +  4", 10)]
[TestCase("2*(     3+ 4)  ", 14)]
[TestCase("2*3-4", 6)] // leaves "-4" unconsumed
[TestCase("((1))*(2+(((3)))*(4+(((5))+6))*(((7*8)))+9)", 2531)]
public void ArithmeticExpressions_should_succeed(string inp, int expected)
{
    var result = inp.Parse(ArithmeticExpressions.Expr());
    Check.That(result.First().Item1).IsEqualTo(expected);
}

[TestCase("-1")]
[TestCase("()")]
[TestCase("(5")]
[TestCase("(1+2")]
[TestCase("(1+2()")]
public void ArithmeticExpressions_should_fail(string inp)
{
    var result = inp.Parse(ArithmeticExpressions.Expr());
    Check.That(result).IsEmpty();
}