A place to be (re)educated in Newspeak

Saturday, December 11, 2010

Reflecting on Functional Programming

In this post, I want to make a case for reflection in the context of pure functional programming. I don’t know that pure functional languages should be different from other languages in this regard, but in practice they are: they generally do not have reflection support.

To demonstrate the utility of reflection, I’m going to revisit one of my favorite examples, parser combinators. In particular, we’ll consider how to implement executable grammars. Executable grammars are a special flavor of a parser combinator library that allows semantic actions to be completely separated from the actual grammar. I introduced executable grammars as part of the Newspeak project.

Consider the following grammar:

statement -> ifStatement | returnStatement
ifStatement -> ‘if’ expression ‘then’ expression ‘else’ expression
returnStatement -> ‘return’ expression
expression -> identifier | number

In Newspeak, we’d write:

class G = ExecutableGrammar ( |
(* lexical rules for identifier, number, keywords elided *)
(* The actual syntactic grammar *)
statement = ifStatement | returnStatement.
ifStatement = if, expression, then, expression, else, expression.
returnStatement = returnSymbol, expression.
expression = identifier | number.
|)()

Now let’s define some semantic action, say, creating an AST. The Newspeak library lets me do this in a subclass, by overriding the code for the production thus:

class P = G ()(
ifStatement = (
super ifStatement wrap:[:if :e1 :then :e2 :else :e3 |
IfStatementAST if: e1 then: e2 else: e3
].
)
returnStatement = (
super returnStatement wrap:[:return :e | ReturnStatementAST return: e].
)
)

No prior parser combinator library allowed me to achieve a similar separation of grammar and semantic action. In particular, I don’t quite see how to accomplish this in a functional language.
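To make the shape of this separation concrete outside Newspeak, here is a minimal sketch in Python. It is not how the Newspeak library is implemented: productions are methods rather than slots, there is no reflection involved, and `Parser`, `token`, `seq` and `digits` are hypothetical helpers invented for this illustration. It only shows a grammar class that contains no semantic actions and a subclass that contains nothing else.

```python
class Parser:
    """A parser maps a token list to (value, remaining tokens)."""

    def __init__(self, run):
        self.run = run

    def wrap(self, action):
        # Apply a semantic action to the list of values this parser produced.
        def run(tokens):
            values, rest = self.run(tokens)
            return action(*values), rest
        return Parser(run)


def token(expected):
    # Match one specific token, such as a keyword.
    def run(tokens):
        if tokens and tokens[0] == expected:
            return tokens[0], tokens[1:]
        raise SyntaxError(f"expected {expected!r}")
    return Parser(run)


def digits():
    # Toy lexical rule: match one all-digit token.
    def run(tokens):
        if tokens and tokens[0].isdigit():
            return tokens[0], tokens[1:]
        raise SyntaxError("expected a number")
    return Parser(run)


def seq(*parsers):
    # Run the parsers in order and collect their values in a list.
    def run(tokens):
        values = []
        for p in parsers:
            value, tokens = p.run(tokens)
            values.append(value)
        return values, tokens
    return Parser(run)


class G:
    """Grammar only: no semantic actions."""

    def expression(self):
        return digits()

    def returnStatement(self):
        return seq(token("return"), self.expression())


class P(G):
    """Semantic actions only, layered on the inherited grammar."""

    def returnStatement(self):
        return super().returnStatement().wrap(
            lambda _return, e: ("ReturnStatementAST", e))
```

With these definitions, `G().returnStatement().run(["return", "42"])` yields the raw list of matched elements, while the same call on `P()` yields the tuple built by the semantic action.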

In the functional world, I would expect one function to define the actual grammar, and another to perform the semantic actions (in our example, build the AST). The latter function would transform the result of basic parsing as defined by the grammar, producing an AST as the result. We’d use pattern matching to define this function. I’d want to write something like:

makeAST =
fun ifStatement(ifKw, e1, thenKw, e2, elseKw, e3) =
IfStatementAST(makeAST(e1), makeAST(e2), makeAST(e3)) |
fun returnStatement(returnKw, e) = ReturnStatementAST(makeAST(e)) |
fun identifier(id) = IdentifierAST(id) |
fun number(n) = NumberAST(n)

where makeAST maps a concrete parse tree into an abstract one, which in this case looks pretty easy.

The question arises: where did the patterns ifStatement, returnStatement, number and identifier come from?

Presumably, our parser combinator library defined them based on our input grammar. The thing is, the library does not know the specifics of our grammar in advance. It cannot predefine data constructors for each conceivable production. Instead, it should create these data constructors dynamically each time it processes a specific grammar.

How does one create datatypes dynamically in a traditional functional language? I leave that as an exercise for the reader.

Ok, so while it is clear that creating datatypes on the fly would be very helpful here, it is also clear that it isn’t easy to do in the context of such languages. How would you describe the type of the library? The datatype it returns is created per grammar, and depends on the names of the grammar production functions. That is not easy to characterize via Hindley-Milner. And yet, once the library has created the datatype, we actually could utilize it in writing type safe clients.
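For contrast, in a dynamically typed language with run-time type creation, building one constructor per production takes only a few lines. The following Python sketch assumes a hypothetical helper, `make_constructors`, and illustrative field names; it sidesteps the hard part discussed above, since nothing here is checked statically.

```python
from collections import namedtuple


def make_constructors(productions):
    """Build one record type per production, named after that production.

    `productions` maps a production name to the names of its fields.
    This helper is hypothetical; it only illustrates run-time type creation.
    """
    return {
        name: namedtuple(name, fields)
        for name, fields in productions.items()
    }


# Field names are illustrative; keyword fields get a Kw suffix because
# `if`, `else` and `return` are reserved words in Python.
nodes = make_constructors({
    "ifStatement": ["ifKw", "e1", "thenKw", "e2", "elseKw", "e3"],
    "returnStatement": ["returnKw", "e"],
    "identifier": ["id"],
    "number": ["n"],
})

# Constructors are looked up by production name at run time.
tree = nodes["returnStatement"]("return", nodes["number"]("42"))
```

A client can then dispatch on `type(tree).__name__` or access `tree.e.n`, but a misspelled production or field name is only detected when the code runs.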

Instead, our library will probably generate values of some generic datatype for parse trees. A possible representation is a pair, consisting of a tag of type string representing the name of the production used to compute the tree, and a list consisting of the elements of the tree, including vital information such as where in the input stream a given token was found and what string exactly represented it. We cannot elide such lexical information, because some users of our library will need it (say, pretty printers). Then I can write:

makeAST =
fun parsetree(“if”, [ifKw, e1, thenKw, e2, elseKw, e3]) =
IfStatementAST(makeAST(e1), makeAST(e2), makeAST(e3)) |
fun parsetree(“return”, [returnKw, e]) = ReturnStatementAST(makeAST(e)) |
fun parsetree(“id”,[id]) = IdentifierAST(id) |
fun parsetree(“number”,[in]) = NumberAST(in)

Obviously, we’ve lost the type safety of the previous version. Ironically, the inability of the language to generate types dynamically forces code to be less statically type safe.
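The generic representation can be sketched as follows, in Python. The `Token` fields and the encoding of AST nodes as plain tuples are assumptions made for this illustration; the tags are the ones used in the pseudocode above.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Token:
    text: str      # the exact string that was matched
    position: int  # offset of the token in the input stream


@dataclass
class ParseTree:
    tag: str                                   # name of the production
    elements: List[Union["ParseTree", Token]]  # subtrees and tokens, in order


def make_ast(tree):
    """Map a concrete parse tree to an AST (encoded here as nested tuples)."""
    tag, elements = tree.tag, tree.elements
    if tag == "if":
        _if, e1, _then, e2, _else, e3 = elements
        return ("IfStatementAST", make_ast(e1), make_ast(e2), make_ast(e3))
    if tag == "return":
        _return, e = elements
        return ("ReturnStatementAST", make_ast(e))
    if tag == "id":
        (tok,) = elements
        return ("IdentifierAST", tok.text)
    if tag == "number":
        (tok,) = elements
        return ("NumberAST", tok.text)
    # A misspelled or unknown tag is only detected here, at run time.
    raise ValueError(f"unknown production tag: {tag!r}")


example = ParseTree("return", [Token("return", 0),
                               ParseTree("number", [Token("42", 7)])])
```

Calling `make_ast(example)` produces `("ReturnStatementAST", ("NumberAST", "42"))`. Nothing ties the strings in `make_ast` to the productions of the grammar, which is exactly the loss of safety noted above.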

Now ask yourself - how does our combinator library produce values of type parsetree with an appropriate tag? For each parsetree value p(tag, elements), the tag is a string corresponding to the name of the production that was used to compute p. How does our library know this tag? The tag is naturally specified via the name of the production function in the grammar. To get at it, one would need some introspection mechanism to get the name of a function at run time. Of course, no such mechanism exists in a standard functional language. It looks like you’d have to force the user to specify this information redundantly as a string, in addition to the function name (you still need the function name so that other productions can refer to it).
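To show what such an introspection mechanism buys, here is a minimal Python sketch in which the tag is derived from the name of the production function rather than written out a second time. The `production` decorator and the two toy rules are hypothetical; the only reflective facility used is the standard `__name__` attribute of functions.

```python
import functools


def production(fn):
    """Wrap a parsing function so its result is tagged with fn's own name."""
    @functools.wraps(fn)
    def tagged(tokens):
        elements, rest = fn(tokens)
        # Introspection: the tag is read from the function at run time.
        return (fn.__name__, elements), rest
    return tagged


@production
def number(tokens):
    # Toy rule: accept a single all-digit token.
    if tokens and tokens[0].isdigit():
        return [tokens[0]], tokens[1:]
    raise SyntaxError("expected a number")


@production
def returnStatement(tokens):
    # Toy rule: the keyword 'return' followed by a number.
    if not tokens or tokens[0] != "return":
        raise SyntaxError("expected 'return'")
    expr, rest = number(tokens[1:])
    return [tokens[0], expr], rest
```

Here `returnStatement(["return", "42"])` yields the tree `("returnStatement", ["return", ("number", ["42"])])` together with an empty remainder. The user still has to mark each production with the decorator, so this is only an approximation of the goal, but each production name appears exactly once.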

You might argue that we don’t really need the string tags - just return a concrete parse tree and distinguish the cases by pattern matching. However, it isn’t generally possible to tell the parse tree for a number from that for an identifier without re-parsing. Even when you can tell parse trees apart, the resulting code is ugly and inefficient, as it is repeating some of the parser’s work.

We could approach the problem via staged execution, writing a meta-program that statically transforms the grammar into a program that provides us with the nice datatype constructors I suggested in the beginning. If one goes that route, one might as well define an external DSL based on BNF or PEGs.

So, I assert that reflection is essential to this task, and dynamic type generation would be helpful as well, which would require dependent types and additional reflective functionality. However, maybe I’ve overlooked something and there is some other way to achieve the same goal. I’m sure someone will tell me - but remember, the library must not burden the user by requiring redundant information or work, it must operate independently of the specifics of a given grammar, and it must keep semantic actions entirely separate.

In any case, I think there is considerable value in adding at least a measure of introspection, and preferably full reflection, to traditional functional languages, and interesting work to be done fleshing it out.