
Sunday, November 26, 2006

Why pointers and references are a bad thing

References and pointers are quite common in the world of programming languages. A reference is something like the address of some data on the heap. Pointers are quite like references, but with the additional operation of changing the address (by incrementing or decrementing it, or even by creating it arbitrarily from some integer value). So a reference is a limited kind of pointer: one with a fixed value. Or in other words: a reference is an immutable pointer.

Pointers are especially bad because, by changing their address, they can point to invalid positions in memory and read or overwrite data which shouldn't be read or overwritten. That's quite ugly from a security point of view, and it's the reason why C and C++ programs often have security holes and allow exploits like buffer overflows etc. (yes, I know you can prevent this by solid programming, but most programmers simply don't do it because it requires more work and would often make a program slower).

So let's concentrate on references. First: what's the most fundamental concept in a programming language? It's a 'value'. A value can be a simple integer value, a float value, a string value or the value of a reference. A value has an identity of its own: the value is the identity. And a value can't change in itself; a '10' is always a '10'. Things can have a value, but the value is always a value on its own: you can't change a value itself, you can only change the value of a variable or of a field of a data structure. References, on the other hand, are a special kind of value: the value of a reference is the address of some object on the heap. It's a less fundamental concept, and so we can ask: is this concept really necessary, useful or even harmful? I suspect the latter: it does more harm than good if a language allows explicit references (or pointers, which are even worse).

But what's the bad thing about them? As long as you can't manipulate them they always point to valid data - but only under an additional assumption: you have a language with automatic memory management, in other words garbage collection (gc). Without gc, memory could be freed and then allocated anew even when there are still references referencing it - and then you would have the same problem as with pointers: the possibility to incorrectly overwrite data. So references always require gc if you want to have their safety advantages.

References can easily create cyclic structures. Just create two blocks of data, A and B, where each has a reference to the other. If you then want to iterate over this data structure, you run into an infinite loop, because you're always jumping from A to B and then back from B to A. Because of this, every algorithm which follows reference chains can run into a loop, and this has to be detected or prevented in some way.
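
A minimal Java sketch of such a cycle (the Block class is made up for this illustration):

// Two blocks of data that reference each other, forming a cycle.
class Block {
    Block other;

    public static void main(String[] args) {
        Block a = new Block();
        Block b = new Block();
        a.other = b;
        b.other = a; // the cycle is closed: a -> b -> a -> ...

        // A naive traversal that blindly follows references never ends:
        // for (Block cur = a; cur != null; cur = cur.other) { ... }
    }
}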

For gc (which is such an algorithm) this means that the simplest way of doing gc ('reference counting') won't work in the general case, because of possibly cyclic references. This also means that it's hard to create some kind of 'local' gc. 'Local' means that the gc should only examine structures which are used in the current situation, instead of having to scan the whole memory. But by allowing cyclic references this becomes impossible.
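
To see why reference counting trips over such cycles, here is a rough sketch with the counts maintained by hand (RcBlock and its methods are invented for illustration; a real implementation would do this in the runtime):

// Hand-rolled reference counting: each object tracks how many references
// point at it and would be freed when the count drops to zero.
class RcBlock {
    int refCount = 1; // one count for the variable that created it
    RcBlock other;

    void retain()  { refCount++; }
    void release() {
        if (--refCount == 0 && other != null) {
            other.release(); // freeing us releases what we point to
        }
    }

    public static void main(String[] args) {
        RcBlock a = new RcBlock();
        RcBlock b = new RcBlock();
        a.other = b; b.retain(); // a -> b
        b.other = a; a.retain(); // b -> a: the cycle is closed

        // Drop both variables: each object still has refCount == 1
        // because of the cycle, so neither is ever freed - a leak.
        a.release();
        b.release();
    }
}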

And this becomes a big problem if we have huge amounts of data, especially if this data isn't in main memory but on some external storage (on the hard drive or somewhere on the net). But even with data in main memory there are problems: today's memory architectures allow huge throughput (many GB per second), but the access latencies are poor. So random access in main memory is a much more expensive operation than it was ten years ago.

Of course there are methods to prevent those problems to a certain degree ('generational gc'), but it remains difficult: everybody who has worked with a large Java-based desktop application knows the pauses when the system has to do a full gc. And if we want to scale gc up to objects in a database, this problem gets much worse. The trend to more and more concurrency also makes global, general gc much more difficult and time consuming.

So are references just bad because of gc? No. There are other problems:

(A small remark: I'm talking here about the view the programmer has on the problem via his programming language, not about the real implementation the compiler creates behind the curtains. On the hardware level references are unavoidable, but that doesn't matter as long as the programmer doesn't see them, because they're hidden by the programming language he uses.)

A big problem with references is the additional identity they give every object. If you write the following (in Java):

Integer v1 = new Integer(10);
Integer v2 = new Integer(10);

and check if v1 == v2 afterwards, you will get 'false'! Even if both v1 and v2 are Integer objects with the same value, a comparison with == returns false. To check if they have the same value, you have to write v1.equals(v2). And even if your favorite language doesn't do the same with integers, it will probably do it if you create a class like

class Point {
    int x, y;
}

and then do the same thing as above with two identically valued Points.
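
Spelled out in Java (adding a constructor to the Point class for brevity), the comparison behaves like this:

class Point {
    int x, y;
    Point(int x, int y) { this.x = x; this.y = y; }

    public static void main(String[] args) {
        Point p1 = new Point(2, 3);
        Point p2 = new Point(2, 3);

        System.out.println(p1 == p2);      // false: compares the references
        System.out.println(p1.equals(p2)); // also false: Object.equals falls
                                           // back to reference identity unless
                                           // we override it ourselves
    }
}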

You could now say: "Of course it is this way. We're comparing the objects themselves and not the values of the objects". But why should there even be the possibility of a difference? Only because we have references. Because of references we can decouple an object from its value and compare the two differently. If we don't have a reference value, the only way to compare objects is by comparing their values.

There are many situations where this difference really doesn't make sense. The Integer example from above is one of those. But with other structures too you often don't want this kind of differentiation. Take the 'Point' class above: a point should be a position on a 2d grid (maybe on a canvas, to draw a pixel there), and it simply doesn't make sense to have two different points which point to the same position. Just like it doesn't make sense to have two different Integers with identical values.

The situation gets even worse if we allow mutation (= changing). Then we could do something like this:

Integer v1 = new Integer(10);
v1.value = 20; // (not possible in Java)

At this point it becomes clear that an 'Integer' isn't a value, it's a container for a value. That's fine, but what if we don't want a container for a value but the value itself? In Java we only have a fixed number of those 'values themselves': the primitive types int, float, etc. Other languages are a bit more flexible and even allow the creation of new value types, but they also allow the creation of reference-based 'objects'. And this leads to the aliasing problem:

If we write

Integer v1 = new Integer(10);
Integer v2 = v1;
v1.value = 20;

then v2.value is also 20 now. v2 is only an alias for v1, and because of this every change to v1 also changes v2 and vice versa. And this is bad, because it's non-local: things can happen in other parts of the program without a direct relationship to each other. We explicitly have to prevent this:

Integer v1 = new Integer(10);
Integer v2 = v1.clone(); // now v2 isn't influenced by changes to v1 anymore
v1.value = 20;

In simple programs this isn't a problem, because it's easy to see what happens, but with increasing complexity it creates more and more sources of hard-to-find errors. And it makes it hard for the compiler to do certain kinds of optimizations, because it can't assume that some data won't change under certain operations.

This behavior is possible in every language which allows mutation and has references. Only pure functional languages like Haskell are immune to this problem (OCaml, as an example of an impure functional language, has the same problem too). But Haskell does both things: disallowing mutation and having no references. While immutability has its advantages, it's not necessary for eliminating the aliasing problem; just removing references is enough. And this alone still gives several benefits:

  • There is no more need for a general gc

  • We only need one simple concept of equality, so no more 10 != 10

  • No more null-references (with the accompanying errors)


But can we really get rid of references? We can't write

p = new Point(2, 3)
p.x = 10

anymore, because values can't be changed (a '10' is a '10', it makes no sense to change a '10' to a '12' for example).

But that's in fact a good thing. We just have to define our data in another way: we can't modify values, but we can create new values based on the old ones:

p = new Point(2, 3)
p2 = p
p = p.x <- 10

The big difference is that p2 is still Point(2, 3) even if p is Point(10, 3) at the end. The aliasing is gone.

(The new syntax 'p = p.x <- 10' means that a new Point is created by copying all values of p except for 'x', using the new value '10' for x, and returning the new Point.)
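
In Java terms this corresponds roughly to a copying 'with' method on an immutable class; a sketch, with the name withX invented for this example:

// An immutable Point: the fields can't change after construction,
// so 'updating' x means building a new Point.
final class Point {
    final int x, y;
    Point(int x, int y) { this.x = x; this.y = y; }

    // The equivalent of 'p.x <- 10': copy everything, replace x.
    Point withX(int newX) { return new Point(newX, y); }
}

// Usage:
//   Point p  = new Point(2, 3);
//   Point p2 = p;        // p2 names the same value
//   p = p.withX(10);     // p is Point(10, 3); p2 is still Point(2, 3)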

If you think that this is much more expensive in terms of runtime efficiency, you're wrong: most of those cases can be optimized internally by the compiler into the old-fashioned structure mutation, simply because the compiler no longer has to bother with possible aliasing. Getting rid of references in the language allows the compiler to create those references internally by itself. The aliasing problem, which is hard to oversee for the programmer, is now the task of the compiler, which is much better able to handle it and won't make any mistakes.

Removing references from the language would thus open lots of optimization opportunities for the compiler while making the life of the programmer easier. But is it all really so easy? Aren't references simply required to tackle certain kinds of problems?

What about a list, for example? Can a list be implemented without having references? The answer is yes. In fact it's quite similar to the well-known implementation, even if each node is a value now:

value Node {
    int value;
    Node tail;
}

Creating a list is simple: we just create a big value where each 'tail' is another value. But how do we insert a new element somewhere in between? We have to create a new list now and can't do clever (but also often dangerous and hard to read) pointer-change tricks like in the well-known doubly-linked-list insertion. We have to rely on the optimization abilities of the compiler instead - but we get much cleaner code.
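
Here is a sketch of such an insertion in Java, modelling the value semantics with final fields (the Node constructor and the insertAfter helper are assumptions of this sketch, not part of the hypothetical language):

// A list node treated as a value: once built, it never changes.
final class Node {
    final int value;
    final Node tail; // null marks the end of the list

    Node(int value, Node tail) { this.value = value; this.tail = tail; }

    // Insert newValue after position pos (0-based) by rebuilding the
    // prefix of the list. The suffix behind the insertion point is
    // shared with the old list, which stays valid and unchanged.
    static Node insertAfter(Node list, int pos, int newValue) {
        if (list == null) return null; // pos beyond the end: nothing to do
        if (pos == 0) return new Node(list.value, new Node(newValue, list.tail));
        return new Node(list.value, insertAfter(list.tail, pos - 1, newValue));
    }
}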

And the advantage is that it's impossible to accidentally mix up two lists in a way we don't want: because the whole list is a single value, it's 'written in stone' now. Totally unchangeable. So we can be sure that no other part of the program is able to change our list, even if we put it through some lengthy and hard-to-understand algorithm. This is again a huge advantage for the programmer who wants to understand a program he didn't write, and also for the compiler when doing optimizations (especially in the context of concurrent programming).

In fact most algorithms can easily be rewritten in such a way that they only work with values. And because we still allow the mutation of variables here, we can still implement imperative algorithms directly.
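
For example, summing such a list still reads like ordinary imperative code, because local variables may still be mutated; a sketch reusing the Node class from the previous example:

// Imperative loop over a value-list: the list itself is immutable,
// but the local variables 'cur' and 'sum' are ordinary mutable variables.
static int sum(Node list) {
    int sum = 0;
    for (Node cur = list; cur != null; cur = cur.tail) {
        sum += cur.value;
    }
    return sum;
}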

But there is one thing which seems to be impossible now: Creating cyclic structures.

How do you create a cyclic structure if you only have values? Such a structure would require an infinite amount of memory if we tried to implement it like the list node above: we would get infinite recursion.

In Haskell the way out is lazy evaluation. Because of it we can really create a 'virtually infinite value'; the compiler simply evaluates only those parts of the value which are currently required. But is it possible even without lazy evaluation (which has certain other disadvantages)?

The answer is yes. It's not as easy as in a language which supports references, but since cyclic structures are not necessary all the time, the time and effort isn't that big either. And don't forget the advantages mentioned before. The trick is to tag values with IDs:

value Node {
    int id;       // the node's own id
    Value value;
    int next;     // id of the next node (0 = none)
}

value Graph {
    Map lookup;   // an array would work too, here
    int act_id;
}

function (Graph, int) createId(Graph g) {
    new_id = g.act_id;
    g = g.act_id <- g.act_id + 1;
    return (g, new_id);
}

function (Graph, Node) createNode(Graph g, Value value) {
    g, new_id = createId(g);
    node = new Node(new_id, value, 0);
    return (g, node);
}

function Graph addLink(Graph g, Node from, Node to) {
    return g.lookup <- put(g.lookup, from.id, to.id);
}

function Graph removeLink(Graph g, Node from) {
    return g.lookup <- remove(g.lookup, from.id);
}

(The above syntax is very explicit and again Java-like, to make it easier to understand for people who aren't accustomed to the more compact syntax of other languages. Those who are shouldn't have any problems with the syntax either - even if they consider it much too laborious.)

The above creates a Graph which can contain cycles. We don't use a 'next Node' anymore, but an 'id' to refer to other nodes. Each Node also stores its own id so that it can be queried if necessary. And we need a map to store the links between ids and look them up later. Let's create a simple cycle between two nodes with the values '100' and '110' now:

g = new Graph
g, n1 = createNode(g, 100)
g, n2 = createNode(g, 110)
g = addLink(g, n1, n2)
g = addLink(g, n2, n1)

Now we have a cyclic structure without using any references. The chain is simply followed by using the IDs of the nodes, and the IDs are created automatically by the call to createNode. The above looks a bit ugly because I explicitly had to thread the Graph g through all the calls, but this could easily be simplified by adding some syntactic sugar to the language. I didn't write the above to win a code-beauty contest; it's just to show that it's still possible to create all kinds of data structures without having references at all.

One problem remains: with the above structure we have gc back in the game. If you use the structure in a general way, you also have to remove nodes from the graph. As long as you only remove single ones there is still no problem, but if you have a 'start node' and want the graph to automatically 'forget' nodes which aren't reachable from this start node, then you need some kind of gc.

But: the gc problem is now a local thing. Gc in the above case is totally constrained to this Graph structure and has to be solved in the context of a certain Graph alone. We don't require a general solution anymore, which opens new windows of opportunity for better optimization. In this case we can write a simple gc function and call it on every call to our 'removeLink' function. But we could also use a background thread to do it, or call it based on some heuristics, etc. We could also put all this linking/gc stuff into some lib and mostly forget about it, maybe with some parameters in the constructor to choose an appropriate gc scheme. Or, best, add some compiler support for this kind of problem, which would allow the compiler not only to create the gc function on its own but also to use this additional knowledge to do better optimizations. It's possible to do whatever suits the problem now, instead of having only a single general gc which has to work for all use cases with the same algorithm.
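
To make the locality concrete, here is a minimal Java sketch of such a graph-local gc, assuming the links live in a map from node id to node id as in the example above (the class and method names are my own):

import java.util.*;

class GraphGc {
    // Keep only the links reachable from startId; 'links' maps a node id
    // to the id of the node it points to (one outgoing link per node,
    // as in the Graph example above).
    static Map<Integer, Integer> collect(Map<Integer, Integer> links, int startId) {
        Set<Integer> reachable = new HashSet<Integer>();
        Integer cur = startId;
        // Follow the chain; stop at the end or when we revisit a node
        // (this is how cycles are handled).
        while (cur != null && reachable.add(cur)) {
            cur = links.get(cur);
        }
        Map<Integer, Integer> kept = new HashMap<Integer, Integer>();
        for (Map.Entry<Integer, Integer> e : links.entrySet()) {
            if (reachable.contains(e.getKey())) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept; // a new map: the old graph value stays untouched
    }
}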

So I hope I was able to give you some food for thought about one of the most standard features in modern programming languages and why it could be a good thing to question its practicality. Of course there's still lots to say about this topic, but I'll stop here for now.

5 comments:

Ricky Clarkson said...

Your approach is similar to the story of Haskell if you ignore monads. How do you send a message from one thread to another?

Your ID-based algorithm is actually an implementation of references, and therefore I don't think you've solved a problem, you've just taken it out of the hands of the compiler.

I assume that by removing references, your data structures look like plain structs in C, with no *s. E.g.:

[code]
struct PointHolder
{
    struct Point point;
};
[/code]

In C, the size of PointHolder needs to be defined at compile time. In C, if you define a recursive data structure, gcc complains about 'incomplete type'.

Therefore, PointHolder can only realistically deal with one implementation of Point, or the compiler needs to know the maximum size of a Point. I can't envisage another way of implementing this, other than pointers.

In Haskell, there is an Eq type class (similar to an interface). Any data structure that is in the Eq type class can be compared for equality using ==.

A function that takes an Eq polymorphically - hence, an instance of any type that 'implements' Eq - does that need to know the size in advance?

In other words, maybe you do still need references, and not as an ignorable implementation detail, but as an essential for polymorphism.

I hope I'm wrong, and that you can explain why, I like learning new stuff.

Karsten Wagner said...

I've said that there's more to tell. I just wanted to outline the idea of getting rid of references without requiring immutability. So lots of details were left open.

> How do you send a message from one thread to another?

If we have mutable global variables, we can simply store a message queue somewhere. If stores are atomic in the language, a thread can simply do

messages, msg = pop(messages)

This will pop one message from the queue and set the global which stores the queue to the new value.

> Your ID-based algorithm is actually an implementation of references

Kind of. Think more along the ways of foreign keys in a RDBMS.

> you've just taken it out of the hands of the compiler

Partly. And that is in fact good, because it makes certain kinds of behavior explicit and solves the gc-problem (and yes, gc IS a problem especially if you think about the coming computer architectures).

> your data structures look like plain structs in C, with no *s.

Nearly, I would also allow variants (like in the ML type-system). But there are really no "*s".

> Therefore, PointHolder can only realistically deal with one implementation of Point, or the compiler needs to know the maximum size of a Point.

That's not really a problem: the compiler/the VM can look 'into the future'. Also (as I wrote): don't mix up the internal implementation with the semantic model of the language. Internally the compiler will of course use references.

Haskell is a language which also only allows values, so it's along the same lines. But Haskell also requires immutability, which forces certain changes of programming style. I'm thinking about ways to find something in between (languages like OCaml and Erlang did that too, but I'm thinking about taking a slightly different route).

> not as an ignorable implementation detail, but as an essential for polymorphism.

There are other ways to allow polymorphism. This article was only about the basics of getting rid of references; later I will write about doing polymorphism without references (and even without variants - but the prototype implementation of my language has variants because they're quite convenient for certain problems).

My ideas are in fact going more into the declarative/relational direction (combined with non-pure functional programming). My next article will probably be about relational data structures and how it's possible to solve some (in OOP) hard problems quite easily with them.

Anonymous said...

Are you aware of the way the .NET garbage collector examines "roots" and prevents cyclical processing by keeping track of the objects it has processed? By using metadata tables that track which registers and stack locations refer to objects (on the heap), the GC isn't susceptible to the limitations you posted that seem to imply that GC is broken by definition.

By the way, I like thinking how a language should look, semantically. Your post reminds me of the following article which I recently read: http://blogs.tedneward.com/2006/06/26/The+Vietnam+Of+Computer+Science.aspx

Karsten Wagner said...

I know of lots of gc algorithms (what you describe is common to every gc algorithm), but they are always quite general, and generality always loses against more specific solutions.

To detect which objects can be freed, in principle the gc has to scan every object on the heap, and with a big heap this will take some time. With generational gc this is faster because the gc scans younger objects more frequently, but those algorithms have other disadvantages (they require a write barrier, for example, which can cost performance). And even a generational gc has to scan the whole heap from time to time (which will read every object from virtual memory if the heap is bigger than conventional memory). Also, frequent scans of the young generation thrash the cpu caches and reduce performance further.

(There is an interesting gc algorithm from IBM called Metronome which tries to prevent some of the mentioned problems, but it requires some programmer intervention and isn't as transparent as most gc algorithms.)

The article you mentioned is right and goes in the same direction as my thinking. And I try to look even a bit further.

Anonymous said...

Amazing! You are thinking exactly along the same lines as I've been thinking recently. Maybe it's a sign of the times. Do you know of any languages that embody these ideas already? I've been toying with making my own language but I haven't got anything substantial yet. Now that I see others feel similar too I'm more motivated to work on it more. If you are interested we could work on such a language together and see where it leads in practice.

My email is magnus#smartelectronix.com (replace the #).