
Tuesday, November 28, 2006

References violate the principle of encapsulation

Every language which allows the use of direct references to objects makes encapsulation, and thus code reuse, much more unlikely. And because every OOP language has such references, that's the reason why encapsulation in OOP doesn't work the way its promoters advertise it.

Let's elaborate on this: Any reference is a direct connection to a certain object. That's quite convenient: If we have a deep hierarchy of objects, we can simply hand a reference to one of those 'deep' objects (let's call it 'B') to an outside object O, and now O is able to manipulate B directly without having to walk through the hierarchy for every operation.

But what if our object O changes B directly while some other object C needs to notice it, because it has to reflect changes to B somehow? Think of a Gui for example: 'B' may be a button and 'C' some container. If we change the text of 'B', 'C' has to be notified so it can lay out its children and give 'B' enough space to display its text.

But because 'O' can modify 'B' directly, 'C' won't notice it and our Gui won't work as expected. Sure, as seasoned programmers we already know a solution: the Observer pattern. Let 'B' send notifications for any interesting change to other objects, which can then do what they have to do.

But is this really a good solution? It works, but everybody who has done it at a certain scale knows how tricky it can get: We need notification events, receivers, registration/unregistration, cycle prevention, and lots of additional code in every method which has to send a notification. It's a real mess. They even invented aspect oriented programming because of this problem, but AOP is like trying to extinguish a fire with gasoline: Instead of attacking the root of the problem, it puts another layer of complexity on top of it.
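
To give an impression of how much ceremony this adds, here is a minimal Java sketch of the Observer pattern for the button example; all names (TextObserver, Button, Container) are made up for illustration:

import java.util.ArrayList;
import java.util.List;

// Hypothetical observer interface: anyone interested in text changes registers here.
interface TextObserver {
    void textChanged(Button source, String newText);
}

class Button {
    private final List<TextObserver> observers = new ArrayList<TextObserver>();
    private String text = "";

    // Registration/unregistration boilerplate that every observable class repeats.
    void addObserver(TextObserver o)    { observers.add(o); }
    void removeObserver(TextObserver o) { observers.remove(o); }

    // Every mutating method has to remember to fire the notification.
    void setText(String newText) {
        this.text = newText;
        for (TextObserver o : observers) {
            o.textChanged(this, newText);
        }
    }
}

class Container implements TextObserver {
    public void textChanged(Button source, String newText) {
        relayout(); // give the button enough space for its new text
    }
    private void relayout() { /* ... */ }
}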

So what's the root of the problem? I've outlined it above: It happens only because we have the ability to modify objects directly, without their container noticing it. And that's because of the existence of direct references.

Sure, we gain some convenience in the beginning because references allow us to talk to an object directly - but as a result we need huge amounts of additional code for notifications, observers, doc/view patterns etc. It also opens up lots of opportunities to make mistakes. Lots of them. And code reuse gets much more difficult, because we have many more dependencies on the internal structure of the program than really necessary. All of this because the use of references violates the principle of encapsulation.

Is there a solution to this problem? Sure: Getting rid of references altogether! They are simply too tempting.

It's like every low-level mechanism: It gives you more power than needed, and because of this it can and will be abused. Without references one always has to ask the container, via some key, for a certain contained object. And you will never get a reference to the object, you will only get the value of the object - which isn't mutable and thus can't be misused. Changing an object would always involve its container, so there would be no need for observers, notifications etc. It may be a little more difficult in the beginning, but in the long run the payoff would be huge. And with certain concepts built right into the programming language, most of the unpleasantness can be avoided. But to solve a problem, it's necessary to first notice that it even exists.
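
As a rough sketch of what 'asking the container via some key' could look like, here is a hypothetical Java-flavored example; Gui, getButtonText and setButtonText are invented names, and in a real language the value-only access would be enforced by the compiler rather than by convention:

import java.util.HashMap;
import java.util.Map;

// Hypothetical container: buttons are addressed by key and never handed out as references.
class Gui {
    private final Map<String, String> buttonTexts = new HashMap<String, String>();

    // Reading yields only an immutable value (a String), never the button object itself.
    String getButtonText(String buttonId) {
        return buttonTexts.get(buttonId);
    }

    // All changes go through the container, so it always knows when to re-layout.
    void setButtonText(String buttonId, String newText) {
        buttonTexts.put(buttonId, newText);
        relayout(); // no observers needed: the container sees every change itself
    }

    private void relayout() { /* ... */ }
}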

Some people will now say: I as the programmer want all the freedom, because I am able to use it in the right way. Even if this is a correct way of thinking, having references in the language will in fact limit your freedom nonetheless. Why? Because if there are references in the language, the compiler has to follow certain rules and can't do certain optimizations which are necessary to create performant code without using references! In other words: By giving you references, a language also forces you to use them, because the alternatives are too expensive. So your freedom is limited either way. And I for myself would rather be limited to a solid and more error-proof way of programming than to a risky and error-prone one.

(I wrote here about other problems with references and how to solve certain problems without them.)

Sunday, November 26, 2006

Why pointers and references are a bad thing

References and pointers are quite common in the world of programming languages. A reference is something like the address of some data on the heap. Pointers are quite like references, but with the additional ability to change the address (by incrementing or decrementing it, or even by creating it arbitrarily from some integer value). So a reference is a limited kind of pointer: one with a fixed value. Or in other words: A reference is an immutable pointer.

Pointers are especially bad because, by changing their address, they can point to invalid positions in memory and read or overwrite data which shouldn't be read or overwritten. That's quite ugly from a security point of view, and it's the reason why C and C++ programs often have security holes and allow exploits like buffer overflows etc. (yes, I know you can prevent this by solid programming, but most programmers simply don't do it, because it requires more work and would often also make a program slower).

So let's concentrate on references. First: What's the most fundamental concept in a programming language? It's the 'value'. A value can be a simple integer value, a float value, a string value or the value of a reference. A value has an identity of its own: The value is its identity. And a value can't change in itself, a '10' is always a '10'. Things can have a value, but a value is always a value of its own; you can't change a value itself, you can only change the value of a variable or a field of a data structure. References on the other hand are a special kind of value: the value of a reference is the address of some object on the heap. It's a less fundamental concept, and so we can ask: Is this concept really necessary, useful or even harmful? I suspect the latter: It does more harm than good if a language allows explicit references (or pointers, which are even worse).

But what's the bad thing about references? As long as you can't manipulate them, they always point to valid data - but only under an additional assumption: You have a language with automatic memory management - in other words, you have garbage collection (gc). Without gc, memory could be freed and then newly allocated even while there are still references pointing to it - and then you would have the same problem as with pointers: the possibility to incorrectly overwrite data. So references always require gc if you want their security advantages.

References can easily create cyclic structures. Just create two blocks of data, A and B, where each has a reference to the other. If you then naively iterate over this data structure you run into an infinite loop, because you're always jumping from A to B and then back from B to A. Because of this, every algorithm which follows reference chains can run into a loop, and this has to be detected or prevented in some way.
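
Here is a small Java illustration of the problem and the usual workaround: the traversal has to carry a 'visited' set along, otherwise it loops forever (the Node class and sumValues are made up for this example):

import java.util.HashSet;
import java.util.Set;

class Node {
    Node other;   // may point back at another node -> possible cycle
    int value;
    Node(int value) { this.value = value; }
}

class Traversal {
    // Without the 'visited' set this would loop forever on a <-> b.
    static int sumValues(Node start) {
        Set<Node> visited = new HashSet<Node>();
        int sum = 0;
        Node current = start;
        while (current != null && visited.add(current)) {
            sum += current.value;
            current = current.other;
        }
        return sum;
    }

    public static void main(String[] args) {
        Node a = new Node(1);
        Node b = new Node(2);
        a.other = b;
        b.other = a;                       // cycle: a -> b -> a
        System.out.println(sumValues(a));  // prints 3 instead of looping forever
    }
}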

For gc (which is one such algorithm) this means that the simplest way of doing it ('reference counting') won't work in the general case, because there may be cyclic references. It also means that it's hard to create some kind of 'local' gc. 'Local' means that the gc should only examine structures which are used in the current situation, instead of having to scan the whole memory. But by allowing cyclic references this becomes impossible.

And this becomes a big problem if we have huge amounts of data, especially if this data isn't in main memory but in some external storage (on the hard drive or somewhere on the net). But even with data in main memory there are problems: Today's memory architectures allow huge throughput (many GB per second), but the access latencies are poor. So doing random access in main memory is a comparatively much more expensive operation than it was ten years ago.

Of course there are methods to mitigate those problems to a certain degree ('generational gc'), but it remains difficult: Everybody who has worked with a large Java-based desktop application knows the pauses when the system has to do a full gc. And if we want to scale gc up to objects in a database, this problem gets much worse. The trend towards more and more concurrency also makes global, general gc much more difficult and time consuming.

So are references just bad because of gc? No. There are other problems:

(A small remark: I'm talking here about the view the programmer has on the problem through his programming language, not about the real implementation the compiler does behind the curtains. On the hardware level references are unavoidable, but that doesn't matter as long as the programmer doesn't see them, because they are hidden by the programming language he uses.)

A big problem with references is the additional identity they give every object. If you write the following (in Java):

Integer v1 = new Integer(10);
Integer v2 = new Integer(10);

and check whether v1 == v2 afterwards, you will get 'false'! Even though both v1 and v2 are Integer objects with the same value, a comparison with == returns false. To check whether they have the same value, you have to write v1.equals(v2). And even if your favorite language doesn't do the same with integers, it will probably do it if you create a class like

class Point {
int x, y;
}

and then do the same thing as above with two identically valued Points.

You could now say: "Of course it is this way. We're comparing the objects themselves and not the values of the objects". But why should there even be the possibility of a difference? Only because we have references. Because of references we can decouple an object from its value and compare the two separately. If we don't have a reference value, we can't compare references, and thus the only way to compare objects is by comparing their values.

There are many situations where this difference really doesn't make sense. The Integer example from above is one of those. But with other structures too you often don't want this kind of differentiation. Take the 'Point' class above: A point should be a position on a 2d grid (maybe on a canvas, to draw a pixel there); it simply doesn't make sense to have two different points which point to the same position. Just like it doesn't make sense to have two different Integers with identical values.
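
To get value semantics for such a Point in today's Java, you have to write them by hand; a typical hand-rolled version of the Point class from above (not taken from any particular library) looks roughly like this:

class Point {
    int x, y;
    Point(int x, int y) { this.x = x; this.y = y; }

    // Needed only because == compares references; without these two methods,
    // two Points with identical coordinates count as 'different'.
    @Override
    public boolean equals(Object o) {
        if (!(o instanceof Point)) return false;
        Point p = (Point) o;
        return x == p.x && y == p.y;
    }

    @Override
    public int hashCode() {
        return 31 * x + y;
    }
}

Forgetting equals or hashCode (or letting them get out of sync) is a classic source of subtle bugs - exactly the kind of accident that can't happen if values are the only option.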

The situation gets even worse if we allow mutation (i.e. changing). Then we could do something like this:

Integer v1 = new Integer(10);
v1.value = 20; // (not possible in Java)

At this point it becomes clear that an 'Integer' isn't a value but a container for a value. That's fine, but what if we don't want a container for a value but the value itself? In Java we only have a fixed number of those 'values themselves': the primitive types int, float, etc. Other languages are a bit more flexible and even allow the creation of new value types, but they also allow the creation of reference-based 'Objects'. And this leads to the aliasing problem:

If we write

Integer v1 = new Integer(10);
Integer v2 = v1;
v1.value = 20;

then v2.value is also 20 now. v2 is only an alias for v1, and because of this every change to v1 also changes v2 and vice versa. And this is bad, because it's non-local: Things can happen in other parts of the program without any direct relationship to each other. We explicitly have to prevent this:

Integer v1 = new Integer(10);
Integer v2 = v1.clone(); // now v2 isn't influenced by changes to v1 anymore
v1.value = 20;

In simple programs this isn't a problem, because it's easy to see what happens, but with increasing complexity it creates more and more sources of hard-to-find errors. And it makes it hard for the compiler to do certain kinds of optimizations, because it can't assume that data won't change under certain operations.

This behavior is possible in every language which allows mutation and has references. Only pure functional languages like Haskell are immune to this problem (Ocaml, as an example of an impure functional language, has the same problem). But Haskell does both things: disallowing mutation and having no references. While immutability has its advantages, it's not necessary for eliminating the aliasing problem; just removing references is enough. And this alone still gives several benefits:

  • There is no more need for a general gc

  • We only need one simple concept of equality, so no more 10 != 10

  • No more null-references (with the accompanying errors)


But can we really get rid of references? We can't write

p = new Point(2, 3)
p.x = 10

anymore, because values can't be changed (a '10' is a '10', it makes no sense to change a '10' to a '12' for example).

But that's in fact a good thing. We just have to define our data in another way: We can't modify values, but we can create new values based on the old ones:

p = new Point(2, 3)
p2 = p
p = p.x <- 10

The big difference is that p2 is still Point(2, 3) even if p is Point(10, 3) at the end. The aliasing is gone.

(The new syntax 'p = p.x <- 10' means that a new Point is created by copying all values of p except for 'x', using the new value '10' for x, and returning the new Point.)
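
In today's Java the same idea can be written out by hand as an immutable class with 'copy and replace' methods; the withX/withY names below are just an illustrative convention, not an existing API:

final class Point {
    final int x, y;
    Point(int x, int y) { this.x = x; this.y = y; }

    // 'p = p.x <- 10' from the text, spelled as a method: copy everything, replace x.
    Point withX(int newX) { return new Point(newX, y); }
    Point withY(int newY) { return new Point(x, newY); }
}

With this, p = p.withX(10) leaves an earlier p2 = p untouched, just like in the example above.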

If you think that this is much more expensive in terms of runtime efficiency, you're wrong: Most of those cases can be optimized internally by the compiler into the old-fashioned structure mutation, simply because the compiler no longer has to worry about aliasing. Getting rid of references in the language allows the compiler to create those references internally by itself. The aliasing problem, which is hard for the programmer to keep track of, is now a task for the compiler, which is much better at it and won't make mistakes.

Removing references from the language would thus open up lots of optimization opportunities for the compiler, while making the life of the programmer easier. But is it all really so easy? Aren't references simply required to tackle certain kinds of problems?

What about a list, for example? Can a list be implemented without references? The answer is yes. In fact it's quite similar to the well-known implementation, even though each node is a value now:

value Node {
int value;
Node tail;
}

Creating a list is simple: we just create a big value where each 'tail' is another value. But how do we insert a new element somewhere in between? We have to create a new list now and can't use the clever (but also often dangerous and hard-to-read) pointer-changing tricks known from doubly-linked-list insertion. We have to rely on the optimization abilities of the compiler instead - but we get much cleaner code, as the sketch below shows.
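
Here is a hedged Java sketch of such an insertion, with the list node from above written as an immutable Java class; insertAt rebuilds the prefix up to the insertion point and shares the unchanged tail:

final class Node {
    final int value;
    final Node tail;   // null marks the end of the list
    Node(int value, Node tail) { this.value = value; this.tail = tail; }
}

class Lists {
    // Returns a new list with newValue inserted at position index.
    // The original list is untouched; everything after the insertion point is shared.
    static Node insertAt(Node list, int index, int newValue) {
        if (index == 0) return new Node(newValue, list);
        if (list == null) throw new IndexOutOfBoundsException();
        return new Node(list.value, insertAt(list.tail, index - 1, newValue));
    }
}

The original list is never touched, so callers that still hold it keep seeing the old value.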

And the advantage is that it's impossible to accidentally mix up two lists in a way we don't want: Because the whole list is a single value, it's 'written in stone' now. Totally unchangeable. So we can be sure that no other part of the program is able to change our list, even if we feed it into some lengthy and hard-to-understand algorithm. This is again a huge advantage for a programmer who wants to understand a program he didn't write, and also for the compiler when doing optimizations (especially in the context of concurrent programming).

In fact most algorithms can easily be rewritten in such a way that they only work with values. And because we still allow the mutation of variables here, we can still implement imperative algorithms directly.

But there is one thing which seems to be impossible now: Creating cyclic structures.

How do we create a cyclic structure if we only have values? Such a structure would require an infinite amount of memory if we tried to implement it like the list node above: We would get infinite recursion.

In Haskell the way out is lazy evaluation. Because of it we can really create a 'virtually infinite value'; the compiler simply evaluates only the parts of the value which are currently required. But is it possible even without lazy evaluation (which has certain other disadvantages)?

The answer is yes. It's not as easy as in a language which supports references, but since cyclic structures aren't needed all the time, the extra effort isn't that big either. And don't forget the advantages mentioned before. The trick is to tag values with IDs:

value Node {
int id;
Value value;
int next;
}

value Graph {
Map lookup; // maps a node's id to the id of the linked node (an array would work too, here)
int act_id;
}

function (Graph, Id) createId(Graph g) {
new_id = g.act_id;
g = g.act_id <- g.act_id + 1;
return (g, new_id);
}

function (Graph, Node) createNode(Graph g, Value value) {
g, new_id = createId(g)
node = new Node(new_id, value, 0)
return (g, node);
}

function Graph addLink(Graph g, Node from, Node to) {
return g.lookup <- put(g.lookup, from.id, to.id)
}

function Graph removeLink(Graph g, Node from) {
return g.lookup <- remove(g.lookup, from.id)
}

(The above syntax is very explicit and again Java-like, to make it easier to understand for people who aren't accustomed to the more compact syntax of other languages. But those who are shouldn't have any problems with the syntax either - even if they consider it much too laborious.)

The above creates a Graph which can contain cycles. We don't use a 'next Node' anymore, but an 'id' to reference other nodes. Each Node also stores its own id, so it can be queried if necessary. And we need a map to store the links between the ids, so they can be looked up later. Let's create a simple cycle between two nodes with the values '100' and '110' now:

g = new Graph
g, n1 = createNode(g, 100)
g, n2 = createNode(g, 110)
g = addLink(g, n1, n2)
g = addLink(g, n2, n1)

Now we have a cyclic structure without using any references. The chain is simply followed by using the IDs of the nodes. And the IDs are created automatically by the call to createNode. The above looks a bit ugly because I explicitly had to chain the Graph g through all the calls, but this could easily be remedied by adding some syntactic sugar to the language. I didn't write the above to win a code-beauty contest; it's just to show that it's still possible to create all kinds of data structures without having references at all.

One problem remains: With the above structure, gc is back in the game. If you use this structure in a general way you also have to remove nodes from the graph. As long as you only remove single ones there is no problem, but if you have a 'start node' and want the graph to automatically 'forget' nodes which aren't reachable from this start node, then you need some kind of gc.

But: The gc problem is now a local thing. Gc in the above case is totally constrained to this Graph structure and has to be solved in the context of a certain Graph alone. We don't require a general solution anymore, which opens new windows of opportunity for better optimization. In this case we can write a simple gc function and call it on every call to our 'removeLink' function. But we could also use a background thread to do it. Or call it based on some heuristics etc. We could also put all this linking/gc stuff into some lib and mostly forget about it, maybe with some parameters in the constructor to choose an appropriate gc scheme. Or, best of all, add compiler support for this kind of problem, which would allow the compiler not only to create the gc function on its own but also to use this additional knowledge to do better optimizations. It's now possible to do whatever suits the problem, instead of having only a single general gc which has to work for all use cases with the same algorithm.
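
To make the 'local gc' idea concrete, here is a rough Java translation of the graph together with a simple reachability-based collect function; all names are invented and the representation (maps from node ids to values and link sets) only approximates the pseudo-code above:

import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.Map;
import java.util.Set;

class Graph {
    private final Map<Integer, Integer> values = new HashMap<Integer, Integer>();         // node id -> node value
    private final Map<Integer, Set<Integer>> links = new HashMap<Integer, Set<Integer>>(); // node id -> linked ids
    private int nextId = 1;

    int createNode(int value) {
        int id = nextId++;
        values.put(id, value);
        links.put(id, new HashSet<Integer>());
        return id;
    }

    void addLink(int from, int to)    { links.get(from).add(to); }
    void removeLink(int from, int to) { links.get(from).remove(to); }

    // 'Local' gc: forget every node that is no longer reachable from startId.
    // Only this Graph is examined; nothing else in the program has to be scanned.
    void collect(int startId) {
        Set<Integer> reachable = new HashSet<Integer>();
        LinkedList<Integer> todo = new LinkedList<Integer>();
        todo.add(startId);
        while (!todo.isEmpty()) {
            int id = todo.removeFirst();
            if (reachable.add(id)) {
                Set<Integer> out = links.get(id);
                if (out != null) todo.addAll(out);
            }
        }
        values.keySet().retainAll(reachable);
        links.keySet().retainAll(reachable);
    }
}

collect could be called inside removeLink, from a background thread or based on some heuristic - the point is that the whole decision stays local to the Graph.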

So I hope I was able to give you some food for thought about one of the most standard features in modern programming languages and why it could be a good thing to question its practicality. Of course there is still a lot to say about this topic, but I'll stop here for now.

Monday, November 20, 2006

Rich languages vs. minimal languages

A common pattern in the design of programming languages seems to be to design languages as 'slim' as possible. The idea is to build as few abstractions as possible directly into the language and then to implement all the useful things the programmer needs in libraries. While there are obviously some differences in how strictly this design pattern is applied, I think it has to be scrutinized. Let's look at some examples:

In Java the language itself only provides a rather simple collection type: arrays with a fixed length. So all usable collections have to be built in the language itself, which leads to lots of problems, inconsistencies and performance penalties. On the other hand, Java implements some things directly in the language that other languages tend to implement in libraries. A good example here is the 'synchronized' keyword, which makes a certain concurrency pattern much easier to apply.

In Common Lisp nearly all features of the language are implemented in the language itself, but many of those features are part of the standard, so they are available as a base to every programmer. The problem remains that those features are designed in a way that allows them to be implemented in the language itself (via macros), which leads to the well-known s-expr syntax that is hard to read for most programmers and one of the main reasons Lisp is so often criticized.

Languages like Ruby and Python on the other hand have a very rich feature set built directly into the language. I suspect that this richness is the main reason why those languages are considered more expressive by many, and not other features like object orientation or dynamic typing. In those languages you have good syntactic support for many common things like collection handling, regular expressions, big integers etc.

Then there is Haskell, which is based on a very restricted central idea (a pure, lazy functional language) but extends it quite cleverly with lots of syntactic sugar and freely definable operators. In Haskell you can not only overload existing operators, you can also create arbitrary new operators with custom binding priorities. But things like the imperative-looking syntax for monads or the compact list-comprehension syntax, which both greatly enhance the usability of the language, still had to be implemented as special syntax, because even with free operator creation it's not possible to express them as clearly in the base language directly.

I think we can see a pattern here: Restricting the base language too strongly can lead to limited expressiveness and to the need for lots of boilerplate code. Just compare a bit of Ruby:

a = [1, 2, 3, 4, 5, 6]
b = a[2..-2]

with the similar Java code

List a = new ArrayList();
a.add(1);
... // some adds omitted
a.add(6);
List b = a.subList(2, a.size() - 1);

The compactness and readability of the Ruby example results primarily from the rich syntax of the language concerning arrays, not from the dynamic typing. Even in a statically typed language you could create something similar, for example by using type inference. But even without it, someone could invent something like this:

a:[int] = [1, 2, 3, 4, 5, 6]
b:[int] = a[2..-2]

which is much more compact than the Java example, even with explicit type annotations.

There are lots of similar examples, so what could be the reason why language designers choose not to implement language support for something? Let's look at some possible reasons.


  • Learnability. If the language has a small core someone could be proficient in it much faster.

  • Clear, simple and maybe even 'provably correct' design. The language can be designed much faster and also much more easily. The language model may also allow abstract theoretical reasoning about important things.

  • Better extensibility. Because all higher-level features are implemented in the language itself, they can also be implemented by a third party, which allows the creation of new features the designer hasn't thought of, or allows existing features to be extended with user-defined ones.

  • Better compilation. The compiler for a simple language is itself simpler and can thus be made more sophisticated.


There are of course more reasons, but I hope I've covered the most commonly mentioned ones. So let's have a closer look at them:

  • Learnability

    Learning a language also means learning its libraries. If you know Java without knowing at least the base libs, you know nothing. So whether you put a feature in a lib or in the language itself doesn't really matter.

    You may suspect that libs have the advantage of being better searchable with some documentation facility, which can even be done automatically to some extent (think of code completion in today's IDEs). But you could also create a similar documentation facility which includes language features. For example, just type 'array' and the documentation would give a short info about the supported array syntax in the language.

    In fact there is not a single reason why learning a language to the degree necessary to do real-world work would differ if you put more features directly into the language. It may even be the opposite: Is it really easier to know everything necessary about Java collections than about Ruby collections? I doubt it.

  • Clear, simple and maybe even 'provably correct' design.

    While this is true, I ask: Are those properties of a language really necessary? To a certain degree and in certain domains: for sure. But in general? I doubt it. An 'industrial strength' language (i.e. a language created to do lots of real-world work) has to ease the life of the programmer. It simply doesn't matter whether the language has a nice, clear and simple design unless it really helps the programmer accomplish his or her tasks.

    Interesting theoretical properties of a language are certainly a valid goal for research languages, but that's only a very limited domain - even if it looks much bigger to language designers. For real-world programming other properties are much more important, so we really have to differentiate here.

  • Better extensibility.

    This seems to be the 'killer argument' for putting stuff in the libs instead of the language. But is it really true? Can you, for example, really extend Java's collections so easily? The answer is of course: No. You can of course write your own list class, but you can't use it as freely as you want, because you simply can't substitute your new list class into existing code without modifying the source. Sure, you can exchange the system classes with your own (as long as your classes use the same interfaces), but that's not really sensible in most cases.

    You may suspect that this is not a fault of the idea itself but only of a certain language (like Java). In Ruby, for example, you can substitute classes very easily at runtime. In Haskell it all depends on which module you import. And in Lisp you can always write a clever macro. So maybe it's possible if you just use the right language? Yes, to a certain degree it is, but it comes at a price: You can introduce hard-to-find bugs in already tested code, you have to design the language in a certain way to allow it (which makes it harder to use), and you need dynamic typing or a rather sophisticated type system (if you want static typing) to do it.

    In the end it makes the language semantics more difficult to understand - not for usability's sake, but only to allow somebody to create things in the language which are already part of the language (even if implemented in the libs). And do we really need this kind of extensibility? You may think that having something is better than not having it, but all things come at a price, and why should we pay this price for something we don't really need?

    All the above extensibility schemes have their limitations. Even with Lisp macros you can't create really transparent continuation support. Even with Haskell's clever type and module system they still had to create special syntax to support monads in the way they have it now. And in Ruby you pay with slow execution and the loss of compile-time validity checks, which can lead to hard-to-find errors.

    In fact most programmers never ever have to change the system libraries. Extend them? Maybe, to a certain degree. But it's always possible to add functionality to a language by writing libraries. Maybe you can't use those extensions in the same way you can use the built-in features, but is this such a problem? In fact it even has the advantage of making code more maintainable, because it's always easy to distinguish between 'safe' language base stuff and 'maybe faulty' self-written stuff.

    In the early days of computer science, the design of programming languages was new, new algorithms and data structures were invented 'every day', and in those times it really seemed risky and too prohibitive to put too many things directly into the language. So after experiencing the difficulties of integrating new concepts into rigid languages like Fortran, it's understandable that the trend went in the direction of creating more extensible languages. But times have changed. When did you last invent a new basic algorithm or data structure? All those basic use cases have been well known for some time now, and in most domains they simply don't change that often.

    So we simply don't need this much extensibility today if the language provides enough features of its own.

  • Better compilation.

    In fact it's the opposite: A program in a rich language can be compiled to an efficient native program much more easily, simply because the compiler has more knowledge about the program. Sure, the compiler gets bigger, but it doesn't have to guess what the programmer really wanted, and more optimizations are possible.

    Fortran is still in use because its array and matrix features, built into the language, allow easy vectorization and thus high-performance programs on otherwise hard-to-program-for computer architectures.

    Also lots of novel optimization techniques become possible. What about a collection class which isn't fixed to a certain implementation, but where the compiler decides which implementation to use depending on usage patterns, maybe even with the possibility to change this implementation after profiling at runtime? This would be impossible if the collection type were implemented in a lib, because lib code looks like any other code to the compiler, and it simply doesn't know about lots of optimization opportunities. But if the collection is part of the language, the compiler is able to choose how to implement it, depending on how the collection is used. Many indexed accesses? Use an array-based implementation. Many insertions and deletions? Use a linked-list one. (A rough library-level sketch of this idea follows below.)
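
To illustrate what usage-driven implementation choice means, here is a hypothetical and very crude library-level approximation in Java; a compiler that owns the collection abstraction could make the same decision up front, per usage site and without any runtime bookkeeping:

import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

// Hypothetical, crude library-level approximation of usage-driven implementation choice.
class AdaptiveList<E> {
    private List<E> impl = new ArrayList<E>();
    private int randomAccesses = 0;
    private int headInserts = 0;

    E get(int index) {
        randomAccesses++;
        return impl.get(index);
    }

    void add(E element) { impl.add(element); }

    void addFirst(E element) {
        headInserts++;
        maybeSwitch();
        impl.add(0, element);
    }

    // Switch to a linked representation once head insertions clearly dominate.
    private void maybeSwitch() {
        if (impl instanceof ArrayList && headInserts > 1000 && headInserts > 10 * randomAccesses) {
            impl = new LinkedList<E>(impl);
        }
    }
}

A library can only ever watch one object's own history like this; a compiler sees the whole program, which is exactly the advantage argued for above.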


These ideas could and should be extended to more higher-level structures. Collections are very basic, but they are also used very often and are thus crucial for efficient implementation and performance.

But other things like concurrency, component-based architectures, database access/persistence etc. can also really benefit from being part of the language. Instead of writing something like

result = db.query("select name, value from data");
name = result["name"];
value = result["value"];

which is very error prone and inefficient to compile, we could have something like this:

select name, value from db.data;
// name,value are stored directly into variables above

This would be easier to write, and the compiler could check for syntax errors at compile time, automatically cache queries, do optimizations etc.

I believe that the minimality principle applied in most of today's languages is in fact obsolete and should be questioned by everyone who starts to design a new programming language. Many languages did this to a certain degree, but why stop there? Why not try to find all the common patterns programmers are reimplementing each and every day and put them directly into the language? Instead of researching 'domain specific' languages, why not simply invent a language suited for the domain of every-day programming?