Tuesday, April 10, 2012

A musing on language design - transparency v. readability

People who spend a lot of time thinking about programming languages (and, I think, even just programming) inevitably end up thinking about language design. There is some truism/quote floating around (which I couldn't be bothered to find) about every language theorist having a pet language design secretly hidden in their deepest thoughts. Sometimes that pet language emerges, for better or worse. I admit to having such a pet, but I also realise that the language I imagine is nowhere near a real one, and that the amount of work required to make it real is immense. I also think real language design is difficult and not very exciting -- too many 'little' decisions about whether an int should convert into a float, or how to handle data transfer between big-endian and little-endian systems, which make me want to bleed from my eyes.

ANYWAY, my musing was on transparency v. readability, which is a much more interesting design decision for a language, at the philosophical/fundamental level. At some point a language designer must aim for either readability or 'write-ability'. Generally speaking, 'serious' programming languages (e.g., Java, C#) go for the former, while scripting languages go for the latter. Of course everyone wants both, but there are some fundamental trade-offs and you have to privilege one or the other.

One place that this decision manifests is in the 'transparency' (I'm sure there's a better word for this) of entities in your language, i.e., how closely user-defined things can look and behave like built-in ones. At one end of the spectrum is Java, where everything must be spelt out: arrays are very different from array-like objects (e.g., ArrayLists), primitives (e.g., int) are very different from primitive-like objects (e.g., BigDecimal), and there is no operator overloading. The upside is that there are no surprises. C# makes things a little easier with delegates and getters and setters, but basically subscribes to the Java philosophy.

C++ is an interesting case. Transparency is much more important here: operator overloading is very flexible and allows user-defined objects to be used (almost) exactly like built-in types, such as ints, arrays, and pointers. Unfortunately, there are a lot of rough edges, and it is rare that the programmer can actually use objects like built-in types without having to think about it. Also, being allowed to overload assignment etc. can lead to some really evil bugs.
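To make the 'almost' concrete, here is a minimal sketch of the same idea. It is in Python rather than C++ (Python cannot overload assignment, so that particular evil is out of reach here), and the Vec2 class is made up purely for illustration. It shows a user-defined type that reads like built-in arithmetic, plus one of the rough edges: the operator works in one order but not the other until you learn about yet another hook.

```python
class Vec2:
    """A tiny 2-D vector (hypothetical) that opts in to arithmetic operators."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __add__(self, other):
        # Called for `self + other`.
        if not isinstance(other, Vec2):
            return NotImplemented
        return Vec2(self.x + other.x, self.y + other.y)

    def __mul__(self, scalar):
        # Called for `self * scalar`.
        if not isinstance(scalar, (int, float)):
            return NotImplemented
        return Vec2(self.x * scalar, self.y * scalar)

    def __eq__(self, other):
        # Called for `self == other`.
        if not isinstance(other, Vec2):
            return NotImplemented
        return (self.x, self.y) == (other.x, other.y)

    def __repr__(self):
        return f"Vec2({self.x}, {self.y})"


a = Vec2(1, 2)
b = Vec2(3, 4)
total = a + b     # reads like built-in arithmetic
scaled = a * 2    # works: Vec2.__mul__ handles it

# Rough edge: `2 * a` asks int first, then looks for Vec2.__rmul__,
# which this class never defined, so the "same" expression fails.
try:
    2 * a
    reversed_ok = True
except TypeError:
    reversed_ok = False
```

Adding an `__rmul__` method would fix the reversed case, but that is exactly the kind of thing the programmer has to think about, which is the point.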

There is a tradition, starting with Smalltalk, and present in most OO languages to some extent, of "everything is an object". In practice this means there are no built-in types and instead one has object versions of things like arrays, coupled with some form of trickiness to make them easier to use, ideally as easy to use as built-in types should be. Such trickiness is often used to replace things like loops and other control structures with objects too. In theory it means the language can be transparently extended with new control structures. In practice, the trickiness is often difficult to understand, means there are multiple ways to get the same result, and makes the language more complex. Java uses a similar idea to support for-each loops: the loop works over any object that implements the Iterable interface.
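Python's for loop rests on the same kind of trick as Java's for-each, so here is a small sketch of it in Python. The Countdown class is invented for the example. Any object with an `__iter__` method can be looped over with exactly the syntax used for a list, and the second half spells out roughly what the loop does on your behalf.

```python
class Countdown:
    """Counts down from `start` to 1; usable directly in a for loop."""

    def __init__(self, start):
        self.start = start

    def __iter__(self):
        # The for statement calls this once to obtain an iterator.
        # Writing it as a generator is a compact way to build one.
        n = self.start
        while n > 0:
            yield n
            n -= 1


seen = []
for n in Countdown(3):    # same syntax as looping over a list
    seen.append(n)

# Roughly what the for statement does behind the scenes:
manual = []
it = iter(Countdown(3))
while True:
    try:
        manual.append(next(it))
    except StopIteration:
        break
```

Both lists end up holding 3, 2, 1. The loop looks like a built-in control structure, but it is really a conversation with an object, and there are now two ways to write the same thing.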

Scripting languages often take a similar approach; see Python and JavaScript. Here many things (e.g., arrays, dictionaries) are ordinary objects, but not quite - the trickiness rears its head so that arrays can be used like a programmer might expect, and you end up with layers of hidden methods and meta-methods, and so forth, which make the language easy to use for casual programmers and allow lots of tricksy programming for those willing to get their hands dirty. This might be a good approach, but this kind of transparency is hardly elegant (well, I guess it kind of is, or neat at least), and the languages are far from simple, or easy to understand in a holistic way.
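As a rough illustration of those layers in Python, take a made-up Squares class that defines only two hidden methods. Indexing and len() map onto them directly, but iteration and the `in` operator also work, through fallbacks that are easy to use and not at all obvious from reading the class.

```python
class Squares:
    """Read-only sequence (hypothetical) of the first `n` square numbers."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, i):
        if not 0 <= i < self.n:
            raise IndexError(i)
        return i * i


sq = Squares(4)

third = sq[2]       # calls Squares.__getitem__
size = len(sq)      # calls Squares.__len__
items = list(sq)    # no __iter__: falls back to __getitem__ with 0, 1, 2, ... until IndexError
found = 9 in sq     # no __contains__: falls back to iterating over sq

# The built-in list answers to the same hook.
builtin = [10, 20, 30].__getitem__(1)
```

Here items comes out as the list 0, 1, 4, 9 and found is true, even though the class never mentions iteration or membership.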

These scripting languages and those in the Smalltalk tradition (Self, Grace) kind of fill the other end of the spectrum. However, as you can see, it is not really a linear scale from readability to transparency; it is more a set of different approaches to the same problem, spread over several dimensions.

Is there an optimal solution? Probably not; after all, different languages are used for different purposes. As scripting languages are more and more used for large programs and not just small scripts, I think that a focus on transparency might turn out to have been a poor decision. But there are many other decisions in scripting language design which will turn out to be poor in this situation too.

Perhaps a good question to ask is, is there a problem? Do we really need elegant languages? Could we have a language which was not elegant, i.e., one with many special cases, lots of different classes of entities, little unification, and perhaps lacking certain kinds of transparency, but that was pleasant to write programs in and to read? (Is this, in fact, Java or C++?) Possibly not: having universal concepts makes a language fundamentally easier to understand. But it is also possible that we have gone too far (I'm not definitely saying we have). Are there corpus studies that have investigated how often programmers actually need to mimic built-in types? Is there some compromise level of operator overloading that makes things easier than any of the current positions?
