A place to be (re)educated in Newspeak

Saturday, January 13, 2007

Representation Independent Code

In most object oriented languages, replacing a field with a method requires you to track down the uses of that field and changing them from field accesses to method invocations. The canonical example is a class of points. You decided to change your representation from cartesian to polar coordinates, and now all the places you wrote ‘x’ have to be rewritten as ‘x()’.

This example isn’t so bad, because the odds are you already had an x() method, and you probably had the sense to avoid making the x field public. But maybe you made it protected (perhaps your language is smart enough to disallow public fields, but simple-minded enough to force them to always make them protected, like Smalltalk). If x is protected, you’ll need to find all the subclasses. Maybe you don’t have access to all of them, and you can never get rid of the field x.

Or maybe you make a stupid mistake, and made x public, perhaps in the mad rush toward a release. Won’t happen to you? Take a look at Java’s System.out and ask yourself how it got to be there. Now go find all the uses of x and change them. Even if you can, it’s pretty tiresome.

The fact is, given the ability to publicize a field, programmers will do so. Once that’s happened, tracking down the uses may be impossible, and in any case is a huge amount of work.

It would be nice if you didn’t have to worry about this sort of thing. If everyone using your object went through a procedural interface, for example. Smalltalk makes all uses outside of an object do that - but uses within the object, in its class and subclasses, are exempt. As for mainstream languages like Java and C# - they allow you to declare public fields; it’s your funeral.

About 20 years ago, Dave Ungar and Randy Smith introduced Self, which fixed this problem. All communication is through method calls (or synchronous message sends, if you will) - even the object’s own code works exclusively by sending messages to itself and other objects. Fields (slots in Selfish) are defined declaratively, and automatically define access methods. The only way to get or set a field is by invoking a method. So if you get rid of the field and replace it with a method that computes the value instead, no source code anywhere can tell the difference. The code is representation independent. Self’s syntax makes it very easy and natural to send a message/call a method - there is no overhead compared to accessing a field in other languages.

In C# they have a thing called properties, which is similar. Except that C# also has fields, and so it requires careful attention by the programmer to ensure representation independence. In other words, it cannot be relied upon to happen. I don’t know why the designers of C# chose to support both fields and properties. I should ask my friends at Microsoft (yes, I have a few; I’m very non-judgmental). In complex languages, there are always all kinds of strange gotchas and constraints.

There are of course other ways that languages can undermine representation independence. In particular, the type system can support class types that make code dependent on which class you use, rather than on what interface is supported. I don’t want to dive into that right now.

The point of this post is to draw attention to the importance of representation independence. If you are using C# or something else with such a construct, I’d suggest you make the best of it and use properties or their equivalent religiously. And future languages should follow Self’s lead and ensure representation independence.

Saturday, January 06, 2007

Parser Combinators

Many excellent ideas from functional programming apply to object oriented programming (functions are objects too, you know). Parser combinators in particular are a technique with a long history in the functional programming community.

This is a blog post, so I’m not going to give a proper bibliography of the idea; suffice to say that it goes back decades, and that Phil Wadler, and Erik Meijer, among others, have done important work in this area. I myself was inspired to look into the topic by Martin Odersky’s
Scala tutorial.

I’ve built a simple parser combinator library in Smalltalk, and it seems to work out very nicely. I thought it would be good to write about it here, coming at the topic from an OO perspective.

So, what are parser combinators exactly? The basic idea is to view the operators of BNF (or regular expressions for that matter) as methods that operate on objects representing productions of a grammar. Each such object is a parser that accepts the language specified by a particular production. The results of the method invocations are also such parsers. The operations are called combinators for rather esoteric technical reasons (and to intimidate unwashed illiterates wherever they might lurk).

To make this concrete, lets look at a fairly standard rule for identifiers:

id -> letter (letter | digit)*

using my combinator library in Smalltalk, one defines a subclass of CombinatorialParser, and inside it one writes

id := self letter, [(self letter | [self digit]) star]

Here, letter is a a parser that accepts a single letter; digit is a parser that accepts a single digit. Both are obtained by invoking a method on self ( this, for those unfortunates who don’t program in Smalltalk). The subexpression self letter | [self digit] invokes the method | on the parser that accepts a letter, with an argument that accepts a digit (ignore the square brackets for a moment). The result will be a parser that accepts either a letter or a digit.

tangent: No Virginia, Smalltalk does not have operator overloading. It simply allows method names using non-alphanumeric characters. These are always infix and all have the same fixed precedence. How can it be so simple? It’s called minimalism, and it’s not for everyone. Like Mies van der Rohe vs. Rococo.

The only detail I’ve glossed over is the brackets. The brackets denote closures, so that [self digit] is a closure that when applied, will yield a parser that accepts a digit. Why do I do this? Because grammar rules are often mutually recursive. If a production A is used in a production B and vice versa, one of them needs to be defined first (say, A), at which point the other (B) is not yet defined and yet must be referenced. Wrapping the reference to the other production in a closure delays its evaluation and gets round this problem. In a lazy language like Haskell this is not an issue - which is one key reason Haskell is very good at defining DSLs. However, Smalltalk’s closure syntax is very lightweight (lighter than lambdas in most functional languages!) so this is not a big deal. And Smalltalk’s infix binary methods and postfix unary methods give a very pleasing result overall.

We then invoke the method star on the result

(self letter | [self digit]) star

which yields a parser that accepts zero or more occurrences of either a letter or a digit.

In a syntax more people understand, this would look something like:

(1) letter().or( new DelayedParser(){ public Parser value(){ return digit();} }).star()

If Java had closures, it might look like this:

(2) letter().or({=> digit()}).star()

This is better, but either way, the goal of writing an executable grammar tends to get lost in the noise. Nevertheless, it seems most people prefer (1), and the vast majority of the rest prefer (2) over the “bizarre” Smalltalk syntax. Who knows what darkness lies in the hearts of men.

We pass this parser as an argument to the method , which we invoke on letter. The “,” method is the sequencing combinator (which is implicit in BNF). It returns a parser that first accepts the language of the receiver (target, in Javanese) and then accepts the language of its argument. In this case, this means the result accepts a single letter, followed by zero or more occurrences of either a letter or a digit, just as we’d expect. Finally, we assign this result to id, which will now represent the production for identifiers. Other rules can use it by invoking its accessor (i.e., self id).

The example also shows that this approach to parsing covers both lexers and parsers.

The lack of distinction between lexing and parsing is a bit of a problem. Traditionally, we rely on a lexer to tokenize the input. As it does that, it does away with whitespace (ignoring languages that make whitespace significant) and comments. This is easily dealt with by defining a new operator, tokenFor:, that takes a parser p and returns a new parser that skips any leading whitespace and comments and then accepts whatever p accepts. This parser can also attach start and end source indices to the result, which is very handy when integrating a parser into an IDE. From the point of view of higher level grammar productions, its useful to refer to a production identifier that produces such tokenized results:

identifier := self tokenFor: self id.

We would naturally do this for all the tokens in our language, and then define the syntactical grammar without concern for whitespace or comments, just as we would in traditional BNF. As an example, here’s the rule for the return statement in Smalltalk

returnStatement := self hat, [self expression].

Ok, so you can define a parser pretty much by writing down the grammar. However, just accepting a language isn’t all that useful. Typically, you need to produce an AST as a result. To address this, we introduce a new operator, wrapper: . The result of this operator is a parser that accepts the same language as the receiver. However, the result it produces from parsing differs. Instead of returning the tokens parsed, it processes these tokens using a closure which it takes as its sole parameter. The closure accepts the output of the parse as input, and yields some result - typically an abstract syntax tree.

returnStatement := self hat,  [self expression]

     wrapper:[:r :e  | ReturnStatAST new expr:e; start: r start; end: e end].  

The grammar production is still clearly separated, with the AST generation on a separate line. However, I would prefer to leave the grammar pristine. That’s easy - put all the AST generation code in a subclass, where the grammar production accessors are overridden, so:


^super returnStatement

    wrapper:[:r :e  | ReturnStatAST new expr:e; start: r start; end: e end].    

This is nice, for example, if you want to parse the same language and feed it to different back ends that each accept their own AST; or if you need to use the parser for a different purpose, like syntax coloring, but want to share the grammar. Another nice thing about this approach is that one can factor out language extensions very cleanly (especially if you can use mixins). It's one of the benefits of embedding a DSL in a general purpose language - your DSL inherits all the features of the host language. In this case, it inherits inheritance.

So what’s not to like? Well, one could imagine more efficient approaches to parsing. In Smalltalk, one usually parses one method at a time, and methods tend to be short. Even though I'm using Squeak, which isn't all that fast, and parses a method on every keystroke to do syntax coloring, it's perfectly acceptable. For large methods, the delay can be noticeable though. However, there are ways to tune things to improve performance. We have our methods ...

Another problem is left recursion, as in:

expr -> expr + expr | expr * expr | id

In such cases one has to refactor the grammar. I don’t see this as a big deal, and in principle, the parser could refactor itself dynamically to solve the problem; this is one of things that one can do relatively easily in Smalltalk, that tends to be harder in other languages.

In summary, parser combinators are really cool. They work out beautifully in Smalltalk. I had a blast implementing them and using them. Most important, they are a great example of how object orientation and functional programming synergize. If you want to learn more, there’s a lot of literature out there, mainly in the Haskell world.