A place to be (re)educated in Newspeak

Sunday, August 31, 2008

Foreign functions, VM primitives and Mirrors

An issue that crops up in systems based on a virtual machine is: what are the primitives provided by the VM and how are they represented?

One answer would be that those are simply the instructions constituting the virtual machine language (often referred to as byte codes). However, one typically finds that there are some operations that do not fit this mold. An example would be the defineClass() method, whose job is to take a class definition in JVML (Java Virtual Machine Language) and install it into the running JVM. Another would be the getClass() method that every Java object supports.

These operations cannot be expressed directly by the high level programming languages running on the VM, and no machine instruction is provided for them either. Instead, the VM provides a procedural interface. So while the Java platform exposes getClass(), defineClass() and the like, behind the scenes these Java methods invoke a VM primitive to do their job.
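
One way to see this from within Java itself is to ask reflection whether a method is declared native. The sketch below assumes an OpenJDK-based JVM, where getClass() is declared native and toString() is ordinary Java code. Other implementations could differ. The class and helper names (NativeCheck, isNativeOnObject) are made up for illustration.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

// Hypothetical helper class; not part of any library.
final class NativeCheck {
    // Reports whether the public zero-argument method of java.lang.Object
    // with the given name carries the 'native' modifier.
    static boolean isNativeOnObject(String name) {
        try {
            Method m = Object.class.getMethod(name);
            return Modifier.isNative(m.getModifiers());
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException("no such method on Object: " + name, e);
        }
    }

    public static void main(String[] args) {
        // Assumption: OpenJDK-based JVM. getClass() is implemented by the VM,
        // while toString() is written in plain Java.
        System.out.println("getClass native? " + isNativeOnObject("getClass"));
        System.out.println("toString native? " + isNativeOnObject("toString"));
    }
}
```

The point is only that the Java-level method is a thin front for something the VM supplies. The reflection API cannot tell you whether a native method is a VM primitive or an ordinary foreign function.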

Why aren’t primitives supported by their own, dedicated virtual machine language instructions? One reason is that there are typically too many of them, and giving each an instruction might disrupt the instruction set architecture (because you might need too many bits for opcodes, for example). It’s also useful to have an open-ended set of primitives, rather than hardwiring them in the instruction set.

You won’t find much discussion of VM primitives in the Java world. Java provides no distinct mechanism for calling VM primitives. Instead, primitives are treated as native methods (aka foreign functions) and called using that mechanism. Indeed, in Java there is no distinction between a foreign function and a VM primitive: a VM primitive is a foreign function implemented by the VM.

On its face, this seems reasonable. The JVM is typically implemented in a foreign language (usually C or C++) and it can expose any desired primitives as C functions that can then be accessed as native methods. It is very tempting to use one common mechanism for both purposes.

One of the goals of this post is to explain why this is wrong, and why foreign functions and VM primitives differ and should be treated differently.

Curiously, while Smalltalk defines no standardized FFI (Foreign Function Interface), the original specification defines a standard set of VM primitives. Part of the reason is historical: Smalltalk was in a sense the native language on the systems where it originated. Hence there was no need for an FFI (just as no one ever talks about an FFI in C), and hence primitives could not be defined in terms of an FFI and had to be thought of distinctly.

However, the distinction is useful regardless. Calling a foreign function requires marshaling of data crossing the interface. This raises issues of different data formats, calling conventions, garbage collection etc. Calling a VM primitive is much simpler: the VM knows all there is to know about the management of data passed between it and the higher level language.

The set of primitives is moreover small and under the control of the VM implementor. The set of foreign functions is unbounded and needs to be extended routinely by application programmers. So the two have different usability requirements.

Finally, the primitives may not be written in a foreign language at all, but in the same language in a separate layer.

So, I’d argue that in general one needs both an FFI and a notion of VM primitives (as in, to take a random example, Strongtalk). Moreover, I would base an FFI on VM primitives rather than the other way around. That is, a foreign call is implemented by a particular primitive (call-foreign-function).

Consider that native methods in Java are implemented with VM support; the JVM’s method representation marks native methods specially, and the method invocation instructions handle native calls accordingly.

The Smalltalk blue book’s handling of primitives is similar; primitive methods are marked specially and handled as needed by the method invocation (send) instructions.

It might be good to have one instruction, invokeprimitive, dedicated to calling primitives. Each primitive would have an identifying code, and one assumes that the set of primitives would never exceed some predetermined size (8 bits?). That would keep the control of the VM entirely within the instruction set.
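
To make the idea concrete, here is a minimal sketch of a toy stack machine with a single invokeprimitive instruction whose one-byte operand selects the primitive. Everything in it is hypothetical: the opcode values, the primitive codes and the names (ToyVM, Primitive, PRIM_ADD, PRIM_NEGATE) are invented for illustration and do not correspond to any real VM.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// A primitive operates directly on the VM's operand stack.
interface Primitive {
    void run(Deque<Integer> stack);
}

final class ToyVM {
    // Hypothetical opcodes.
    static final int HALT = 0x00;            // pop and return the top of stack
    static final int PUSH = 0x01;            // followed by one signed operand byte
    static final int INVOKEPRIMITIVE = 0x02; // followed by one primitive-code byte

    // Hypothetical primitive codes.
    static final int PRIM_ADD = 0;
    static final int PRIM_NEGATE = 1;

    private final Deque<Integer> stack = new ArrayDeque<>();
    // An 8-bit primitive code means the table has at most 256 entries.
    private final Primitive[] primitives = new Primitive[256];

    ToyVM() {
        primitives[PRIM_ADD] = s -> s.push(s.pop() + s.pop());
        primitives[PRIM_NEGATE] = s -> s.push(-s.pop());
    }

    int run(byte[] code) {
        int pc = 0;
        while (true) {
            int op = code[pc++] & 0xFF;
            switch (op) {
                case PUSH:
                    stack.push((int) code[pc++]);
                    break;
                case INVOKEPRIMITIVE: {
                    int primCode = code[pc++] & 0xFF;
                    Primitive p = primitives[primCode];
                    if (p == null) {
                        throw new IllegalStateException("unknown primitive " + primCode);
                    }
                    p.run(stack);
                    break;
                }
                case HALT:
                    return stack.pop();
                default:
                    throw new IllegalStateException("unknown opcode " + op);
            }
        }
    }

    // Computes -(2 + 3) using only PUSH and INVOKEPRIMITIVE.
    static int demo() {
        byte[] program = {
            PUSH, 2, PUSH, 3,
            INVOKEPRIMITIVE, PRIM_ADD,
            INVOKEPRIMITIVE, PRIM_NEGATE,
            HALT
        };
        return new ToyVM().run(program);
    }
}
```

New primitives are added by filling in table slots, so the instruction set itself stays fixed while the set of primitives can grow up to the size allowed by the operand.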

It is good to have a standardized set of VM primitives, as Smalltalk-80 did. It makes the interface between the VM and built in libraries cleaner, so these libraries can be portable. We discussed doing this for the JVM about nine or ten years ago, but it never went anywhere.

If primitives aren’t just FFI calls, how does one invoke them at the language level? Smalltalk has a special syntax for them, but I believe this is a mistake. In Newspeak, we view a primitive call as a message send to the VM. So it is natural to reify the VM via a VM mirror that supports messages corresponding to all the primitives.

A nice thing about using a mirror in this way is that access to primitives is now controlled by a capability (the VM mirror), so the standard object-capability architecture handles access to primitives just like anything else.

To get this to really work reliably, the low level mirror system must prohibit installation of primitive methods by compilers etc.
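
The shape of this arrangement can be sketched in Java, with the caveat that Java is not an object-capability language, so the sketch shows only the structure and not the security property. The names (VMMirror, HostVMMirror, Inspector) and the two chosen primitives are assumptions made for illustration; this is not Newspeak's actual mirror API.

```java
// Hypothetical mirror interface: each primitive is an ordinary message.
interface VMMirror {
    Class<?> classOf(Object o);
    int identityHash(Object o);
}

// Mirror backed by the host JVM's own primitives.
final class HostVMMirror implements VMMirror {
    @Override
    public Class<?> classOf(Object o) {
        return o.getClass();
    }

    @Override
    public int identityHash(Object o) {
        return System.identityHashCode(o);
    }
}

// A client that can reach primitives only through the mirror it was handed.
final class Inspector {
    private final VMMirror vm; // the capability

    Inspector(VMMirror vm) {
        this.vm = vm;
    }

    String describe(Object o) {
        return vm.classOf(o).getSimpleName() + "@" + Integer.toHexString(vm.identityHash(o));
    }
}
```

Code that is never handed a VMMirror has no way to ask for a primitive in this design, which is why the underlying system must not offer a back door such as letting a compiler install primitive methods directly.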

Another desirable property of this scheme is that you can emulate the primitives in a regular object for purposes of testing, profiling or whatever. It's all a natural outgrowth of using objects and message passing throughout.
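
As a rough illustration of emulation, the Java sketch below puts a single clock primitive behind an interface and then substitutes an emulated clock for testing and a counting wrapper for profiling. All the names (VMPrimitives, HostPrimitives, FakePrimitives, CountingPrimitives, Stopwatch) are hypothetical, and the choice of a clock as the example primitive is an assumption made to keep the sketch small.

```java
// Hypothetical interface exposing one primitive as a message.
interface VMPrimitives {
    long currentTimeMillis();
}

// Backed by the real VM.
final class HostPrimitives implements VMPrimitives {
    @Override
    public long currentTimeMillis() {
        return System.currentTimeMillis();
    }
}

// Emulated in a regular object: time moves only when the test says so.
final class FakePrimitives implements VMPrimitives {
    private long now;

    FakePrimitives(long start) {
        this.now = start;
    }

    void advance(long millis) {
        now += millis;
    }

    @Override
    public long currentTimeMillis() {
        return now;
    }
}

// Profiling wrapper: counts primitive calls, then forwards them.
final class CountingPrimitives implements VMPrimitives {
    private final VMPrimitives delegate;
    private int calls;

    CountingPrimitives(VMPrimitives delegate) {
        this.delegate = delegate;
    }

    int calls() {
        return calls;
    }

    @Override
    public long currentTimeMillis() {
        calls++;
        return delegate.currentTimeMillis();
    }
}

// Client code cannot tell which implementation it has been given.
final class Stopwatch {
    private final VMPrimitives vm;
    private final long start;

    Stopwatch(VMPrimitives vm) {
        this.vm = vm;
        this.start = vm.currentTimeMillis();
    }

    long elapsedMillis() {
        return vm.currentTimeMillis() - start;
    }
}
```

A test might build a Stopwatch over a CountingPrimitives that wraps a FakePrimitives, advance the fake clock by a known amount, and then check both the elapsed time and the number of primitive calls.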