Parsers and Builders as Prisms 2019.12.19

Serialization and deserialization are tedious tasks, often riddled with boiler-plate and bug-prone code. There's a plethora of tools, such as parser combinator libraries, which aim to assist us in some of these tasks, and it seems that new tools never stop popping up. This hints that there are probably no well known satisfactory solutions to these problems, which probably means that most of these tools are ad-hoc rather than principled high-quality solutions.

Are there no good solutions out there? Actually, there are!

Back when I was programming in Python, I have used Construct to great satisfaction. But for a long time now I have been using Haskell, and have found no equivalent to it, which I've actually wanted for more than 10 years! So perhaps I should start looking into making one? Hence this post.

In this post we'll introduce Construct's declarative approach and then discuss how to implement an equivalent principled solution in Haskell, based on optics.

Construct: declarative parsers and builders

Construct is declarative.

What do I mean by that? A declarative implementation of a parser should describe the format, not how to parse it!

From this description, the computer will automatically figure out how to parse. Not only that, it will also know how to build, and ideally even generate extra artifacts like documentation for the format!

Here's how Construct defines the format of IPv4 packets:

ipv4 = Struct(
    "header" / BitStruct(
        "version" / Const(4, Nibble),
        "length" / ExprAdapter(Nibble,
            decoder = lambda obj, ctx: obj * 4,
            encoder = lambda obj, ctx: obj // 4,
        ),
    ),
    ...
    "total_length" / Int16ub,
    ...
    "options" / Bytes(this.header.length - 20),
    "body" / Bytes(this.total_length - this.header.length),
)

In this simple declaration we created a parser, builder, and the data type for IPv4 packets!

Also, I'd like to highlight the richness of the format. It isn't a simple "struct" with fields of known types, but rather a "dependently-typed" one, where the size of the options field depends on the value of header.length! Haskell parser combinators libraries which support such dependencies tend to identify as "monadic parser combinators" libraries, in constrast to "applicative parser combinators" which don't support them.

Parsers and builders as prisms

What is the essence of parsing and building?

data Codec myType encoded = Codec
    { build :: myType -> encoded
    , parse :: encoded -> Maybe myType
    }

(Caveat: one may desire meaningful parse errors using Either ParseError rather of Maybe. We'll ignore this issue for now)

Looking for a principled solution, one may notice that this type is equivalent to Prism from the popular lens package!

‘There are only two hard things in Computer Science: cache invalidation and naming things.’ - Phil Karlton

Naming things is hard, and we want principled approaches and code re-use, so we'll choose to use the existing Prism rather than make an equivalent new ad-hoc type. Hopefully this will also enable enjoying the fruits of the existing lens eco-system.

Let's demonstrate with a simplified IP-like protocol:

> simplifiedIp # ((2, 3), "Hello World!")
"\a\NUL\f\NUL\STX\NUL\ETXHello World!"

> "\a\NUL\f\NUL\STX\NUL\ETXHello World!" ^? simplifiedIp
Just ((2,3),"Hello World!")

simplifiedIp :: Prism' ByteString ((Word16, Word16), ByteString)
simplifiedIp = takeSimplifiedIp . secondOnly ""

takeSimplifiedIp ::
    Prism' ByteString (((Word16, Word16), ByteString), ByteString)
takeSimplifiedIp =
    _Cons . firstOnly 7 . -- Remove the constant byte 7
    takeWord16 .          -- (bodyLen, rest)
    aside (takeWord16 . aside takeWord16) .
                          -- (bodyLen, (origin, (dest, rest)))

    -- Reordering (this is somewhat painful):
    aside retuple .       -- (bodyLen, ((origin, dest), rest))
    retuple .             -- ((bodyLen, (origin, dest)), rest)
    asideFirst swapped .  -- (((origin, dest), bodyLen), rest)
    from retuple .        -- ((origin, dest), (bodyLen, rest))

    aside takeBytes .     -- ((origin, dest), (body, remainder))
    retuple               -- (((origin, dest), body), remainder)

This uses some combinators from Control.Lens and some extra combinators defined below:

Parse-build prism combinators

firstOnly :: Eq e => e -> Prism' (e, a) a
firstOnly x = asideFirst (only x) . iso snd ((,) ())

secondOnly :: Eq e => e -> Prism' (a, e) a
secondOnly x = swapped . firstOnly x

asideFirst :: APrism s t a b -> Prism (s, e) (t, e) (a, e) (b, e)
asideFirst l = swapped . aside l . swapped

-- Tuple shuffling Iso
retuple ::
    Iso
    (a0, (a1, a2)) (b0, (b1, b2))
    ((a0, a1), a2) ((b0, b1), b2)
retuple =
    iso
    (\(w0, (w1, r)) -> ((w0, w1), r))
    (\((w0, w1), r) -> (w0, (w1, r)))

takeWord16 :: Prism' ByteString (Word16, ByteString)
takeWord16 = _Cons . aside _Cons . retuple . asideFirst (from word16Bytes)

word16Bytes :: Iso' Word16 (Word8, Word8)
word16Bytes =
    iso
    ((both %~ fromIntegral) . (`divMod` 256))
    (\(w1, w0) -> fromIntegral w1 * 256 + fromIntegral w0)

takeBytes ::
    Integral a =>
    Prism' (a, ByteString) (ByteString, ByteString)
takeBytes =
    prism'
    ( \(x, y) ->
        (fromIntegral (ByteString.length x), x <> y))
    (\(count, x) ->
        ByteString.splitAt (fromIntegral count) x <$
        guard (fromIntegral count <= ByteString.length x))

Conclusion

We've succesfully crafted prisms to parse and build our structure!

Now let's review some drawbacks of the presented approach:

I believe that both of these problems can be solved, resulting in a powerful and ergonomic principled solution. In my next post I'll describe an approach to add error reporting to our prisms.

Notes:

Discussion: