Parsing Roman Numerals in C# and Haskell

This is a follow-up from my last post Functional Monadic Parsers ported to C# where I showed the implementation of basic parsers from the book Programming in Haskell by Graham Hutton in C#.

When these primitives are used to compose a parser for Roman Numerals the result yet again demonstrates the amazing capabilities and elegance of functional programming. The problem of parsing Roman Numerals is not a very difficult one. But still, I find the simplicity of constructing a solution by combining primitive parsers fascinating.

Here is the implementation. It is super easy, I was able to write this in less than 15 minutes without tests first, worked the first time.

public readonly static Parser<int> RomanNumeral =
    Parsers.StringP("IV").Select(x => 4)
        .Choice(Parsers.StringP("IX").Select(x => 9))
        .Choice(Parsers.StringP("XL").Select(x => 40))
        .Choice(Parsers.StringP("XC").Select(x => 90))
        .Choice(Parsers.StringP("CD").Select(x => 400))
        .Choice(Parsers.StringP("CM").Select(x => 900))
        .Choice(Parsers.StringP("I").Select(x => 1))
        .Choice(Parsers.StringP("V").Select(x => 5))
        .Choice(Parsers.StringP("X").Select(x => 10))
        .Choice(Parsers.StringP("L").Select(x => 50))
        .Choice(Parsers.StringP("C").Select(x => 100))
        .Choice(Parsers.StringP("D").Select(x => 500))
        .Choice(Parsers.StringP("M").Select(x => 1000))
        .Select(rns => rns.Sum());

The code can be found here on GitHub.

Note that all syntactically correct Roman Numerals will be parsed correctly. However, there are syntactically incorrect combinations like e.g. "IIIII" that will be parsed as well.

Here are a few spot check tests:

[TestCase("I", 1)]
[TestCase("III", 3)]
[TestCase("IX", 9)]
[TestCase("MLXVI", 1066)]
[TestCase("MCMLXXXIX", 1989)]
public void RomanNumerals_tests(string rn, int an)
    var numnber = rn.Parse(RomanNumerals.RomanNumeral);


Checking for correct syntax

Checking for correct syntax is also not that hard. It can be done by combining primitive parsers as well.

The requirements are defined like this:

  • Every symbol may only appear once
  • Except I, X and C which may be repeated up to 3 times
  • And M may be repeated many times (Off course this is limited by stack, memory, ranges etc. but I won't take this into account right now.)
  • The values of the sequence must be strictly decreasing (e.g. XV is valid - VX is not valid)

To be able to check the constraints we can use the Sat function which takes a predicate and will succeed if the predicate holds. Otherwise it will fail:

public static Parser<T> Sat<T>(this Parser<T> parser, Predicate<T> predicate)
    return parser.Bind(c => predicate(c) ? Return(c) : Failure<T>());

Now we can easily check our constraints while parsing the input:

public static readonly Parser<int> RomanNumeral =
    Parsers.StringP("IV").Select(x => 4)
        .Choice(Parsers.StringP("IX").Select(x => 9))
        .Choice(Parsers.StringP("XL").Select(x => 40))
        .Choice(Parsers.StringP("XC").Select(x => 90))
        .Choice(Parsers.StringP("CD").Select(x => 400))
        .Choice(Parsers.StringP("CM").Select(x => 900))
        .Choice(Parsers.CharP('I').Select(x => 1)
            .Select(cs => cs.Sum())
            .Sat(sum => sum <= 3))
        .Choice(Parsers.CharP('X').Select(x => 10)
            .Select(cs => cs.Sum())
            .Sat(sum => sum <= 30))
        .Choice(Parsers.CharP('C').Select(x => 100)
            .Select(cs => cs.Sum())
            .Sat(sum => sum <= 300))
        .Choice(Parsers.CharP('M').Select(x => 1000)
            .Select(x => x.Sum()))
        .Choice(Parsers.CharP('V').Select(x => 5))
        .Choice(Parsers.CharP('L').Select(x => 50))
        .Choice(Parsers.CharP('D').Select(x => 500))
        .Sat(ns => ns.Zip(ns.Skip(1), (a, b) => a > b).All(b => b))
        .Select(ns => ns.Sum());

The code can be found here on GitHub.

Comparison to the imperative version

When googling for "Parsing Roman Numerals" one of the first results is this one from Black Belt Coder. This is a rather imperative implementation of a Roman Numerals parser. (Note that it does not check for correct syntax.)

Advantages of the functional approach

Under the covers both versions are pretty much doing the same thing. Internally the Many function is implemented imperatively using a while loop, similar to the other example, because it is more efficient in C# than recursion. In other languages like Haskell or F# tail recursive implementations are fine. But this is not so important because the internals of Many are encapsulated and don't matter to the caller. This is one of the main differences and a big advantages of declarative or functional programming. This leads to some more advantages:

  • The details of implementation of common tasks are abstracted away from the caller
  • The internals of such primitive functions are totally encapsulated
  • The primitives are well-tested or even proven to be correct
  • Primitives are ready for massive reuse
  • More concise code
  • Less code repetition and therefore less chances of making mistakes
  • Functions that are composed of smaller functions that are correct are more likely to be correct as well
  • Trivial things don't have to be frequently re-implemented and tested

Parsing Roman Numerals in Haskell

Now going back to Haskell and the parsing library Parsec, here is an implementation of the Roman Numeral Kata in Haskell (I'm not a Haskell expert so there might be room for improvement):

sat :: String -> (a -> Bool) -> ParsecT s u m a -> ParsecT s u m a
sat msg predicate parser = parser >>= (\x -> if predicate x then parserReturn x else parserFail msg)

strictDecr :: Ord a => ParsecT s u m [a] -> ParsecT s u m [a]
strictDecr =
  sat msg (\xs -> and (zipWith (>) xs (drop 1 xs)))
    where msg = "unexpected order of values\nexpected strictly decreasing values"

romPrimCombiVal :: ParsecT [Char] u Identity Integer
romPrimCombiVal = 
  choice [
    (\_ -> 4) <$> (try $ string "IV"),
    (\_ -> 9) <$> (try $ string "IX"),
    (\_ -> 40) <$> (try $ string "XL"),
    (\_ -> 90) <$> (try $ string "XC"),
    (\_ -> 400) <$> (try $ string "CD"),
    (\_ -> 900) <$> (try $ string "CM"),
    sat "unexpected repetitions of symbol `I`\nexpected symbol to appear 3 times at most" (<= 3) $ sum <$> many1 ((\_ -> 1) <$> (char 'I')),
    sat "unexpected repetitions of symbol `X`\nexpected symbol to appear 3 times at most" (<= 30) $ sum <$> many1 ((\_ -> 10) <$> (char 'X')),
    sat "unexpected repetitions of symbol `C`\nexpected symbol to appear 3 times at most" (<= 300) $ sum <$> many1 ((\_ -> 100) <$> (char 'C')),
    sum <$> many1 ((\_ -> 1000) <$> (char 'M')),
    (\_ -> 5) <$> (char 'V'),
    (\_ -> 50) <$> (char 'L'),
    (\_ -> 500) <$> (char 'D')]

romNum :: ParsecT [Char] u Identity Integer
romNum = do 
  ns <- strictDecr $ many1 romPrimCombiVal
  return $ sum ns

Here is the complete code on GitHub.