Saturday, December 30, 2006

The problem of dealing with data-structures

It seems to be a 'solved problem': how to represent and use different kinds of data in programming languages. But in fact there's a lot of unnecessary complexity and plenty of artificial difficulty if you really think about it.

Let's look at an example. (It would look similar if I used the Hindley-Milner type system; this is about the basic principles, not about a certain implementation. I think that Java is common knowledge today, so I'll use a Java-like syntax.)

Let's start with the following basic structure:
   
class Customer {
    String first_name;
    String surname;

    Address[] addresses;
    Invoice[] invoices;
}

class Address {
    String postcode;
    String city;
    String street;
}

class Invoice {
    Date date;
    Date date_payed;
    Currency value;
}

This allows us to represent 'Customers' with 'Addresses' and 'Invoices' in our system. The above structure is very clean and easy to understand - but also totally useless. Why? Because we also need ways to query it. For example, we want to access all invoices in the system to create reminders, or all customers to do some mailing. So we have to create some containers:
   
class Database {
    Customer[] customers;
    Invoice[] invoices;
    Address[] addresses;
}

This 'Database' object stores the data and allows us to access it when we want. But it also creates unnecessary ('accidental') complexity, because it's a rather artificial additional data structure whose only purpose is to access data we have already declared. If we now do a

c = new Customer()

somewhere, we have to put c into 'Database.customers'. And if we destroy a Customer, we have to remove it from 'Database' again. This bookkeeping is very easy to handle in this simple example, but it can become the source of hard-to-find problems in big, more complex applications.
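To make the bookkeeping concrete, here is a minimal sketch of what such maintenance code might look like (the method names are mine, and I use Lists instead of the arrays above just to keep the sketch short):

import java.util.*;

class Database {
    List<Customer> customers = new ArrayList<Customer>();
    List<Invoice> invoices = new ArrayList<Invoice>();
    List<Address> addresses = new ArrayList<Address>();

    // every 'new Customer()' has to be mirrored here ...
    void addCustomer(Customer c) {
        customers.add(c);
        for (Invoice i : c.invoices) invoices.add(i);
        for (Address a : c.addresses) addresses.add(a);
    }

    // ... and every deletion too, or we leave stale entries behind
    void removeCustomer(Customer c) {
        customers.remove(c);
        for (Invoice i : c.invoices) invoices.remove(i);
        for (Address a : c.addresses) addresses.remove(a);
    }
}

And this is only the beginning: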

Next we want to find the Customer for a certain invoice. We could simply do a search in Database.customers, but that would be much too slow in a real application. So we add a back link to each 'Invoice':

class Invoice {
    Date date;
    Date date_payed;
    Currency value;

    Customer customer; // the back-link for quick access
}

This is another rather artificial addition to our structure. We need it only for performance reasons, and it leads to new potential problems because it has to be maintained too. But let's continue:

Next we want to find all customers in a certain city. We could search Database.addresses to find the matching addresses and then search Database.customers to find the customers for those addresses, but again, in most situations this is simply not fast enough. So we have to extend our data structure once more:

class Address {
    String postcode;
    String city;
    String street;

    // back links. We use an array because we want to allow
    // multiple Customers for each address
    Customer[] customers;
}

class Database {
    Customer[] customers;
    Invoice[] invoices;
    Address[] addresses;

    // a map to find all addresses for a given city
    Map city_lookup;
}

Those are only the declarations. You can easily imagine the additional code in your application to maintain those structures when you add or delete Customers, Addresses etc. Now scale this simple example up to a real-world application with hundreds of those classes and lots of different views and access paths to your data structures, and you see the problem. And I haven't even talked about making the above structures persistent.
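As a hedged sketch (the method names are mine, and again I use Lists and a typed Map for brevity), the maintenance code for just the city index might look like this:

import java.util.*;

class Database {
    List<Customer> customers = new ArrayList<Customer>();
    List<Address> addresses = new ArrayList<Address>();

    // city -> all addresses in that city
    Map<String, List<Address>> city_lookup = new HashMap<String, List<Address>>();

    void addAddress(Address a) {
        addresses.add(a);
        List<Address> inCity = city_lookup.get(a.city);
        if (inCity == null) {
            inCity = new ArrayList<Address>();
            city_lookup.put(a.city, inCity);
        }
        inCity.add(a); // keep the index in sync
        // ... and the back links in 'a.customers' have to be maintained as well
    }

    void removeAddress(Address a) {
        addresses.remove(a);
        List<Address> inCity = city_lookup.get(a.city);
        if (inCity != null) inCity.remove(a); // don't forget the index
        // ... plus cleanup of the back links and of 'Customer.addresses'
    }
}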

The funny thing is that everything we really needed to describe our data is in the topmost data structure. It gives a complete description of the problem, but for performance and organizational purposes we had to add lots of additional things which blur and complicate our once concise and self-explanatory structure. All those additions create 'accidental complexity': complexity which isn't inherent in the problem but only required for the specific implementation.

You may say that this is a contrived problem which would be handled by a SQL database anyway, instead of representing the structures the way I did here. But most data in programs is represented in a similar way. Be it the graphical elements of a GUI or a word processor, the data in a 3D modeler or the virtual world in a computer game: we always have data structures which are in principle quite simple but have lots of back links, indices, maps etc. to provide access paths to the data.

In object-oriented programming this kind of structuring of the data is part of the paradigm: objects have slots and those slots contain references to other data. All objects are intermingled in this way, and objects without an access path are 'dead' and will be removed by the garbage collector (or clutter the heap in a language without one).

This kind of structuring is so widely used that it's hard to even question it or think about a better way of representing data. But let's give it a try.

In fact we already know another way: the above-mentioned SQL database. It stores data not in structures with references but in relations. SQL is just one language to access and manage those relations, and it's in no way the only way to do this. A database stores data on external memory to be able to scale up to huge amounts of data, but we can also store local client data in relations. Let me convert our example into a relational one:

relation Customer {
    Id customer_id;
    String first_name;
    String surname;
}

relation Address {
    Id customer_id;
    String postcode;
    String city;
    String street;
}

relation Invoice {
    Id customer_id;
    Date date;
    Date date_payed;
    Currency value;
}

(The above could've been normalized a bit further to make it more flexible, but for the purposes of this example I'll leave it this way.)

The structure looks very similar, but it allows all the kinds of access we could do with our much more complex non-relational structure above - and lots more: we can simply create joins between those relations and query whatever we want to know, for example all customers at a given city/street/postcode, all customers with open invoices in a given city, or all invoices paid in a given date range. And we don't have to maintain additional back links, maps etc., because this can all be done by the programming language, which could use indices or other optimizations on all data which is queried often - automatically, if we use something like runtime profiling or static global optimization, or manually, by giving hints to the compiler.
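To illustrate with plain Java (a hedged sketch: the relation classes are the ones declared above, and I assume 'Id' has proper equals/hashCode), a join is just a query we write when we need it, not a structure we have to maintain:

import java.util.*;

class Queries {
    // "all customers in a given city": a join of Customer and Address
    // over customer_id, created at query time - no back links needed
    static List<Customer> customersInCity(List<Customer> customers,
                                          List<Address> addresses,
                                          String city) {
        Set<Id> ids = new HashSet<Id>();
        for (Address a : addresses)
            if (city.equals(a.city)) ids.add(a.customer_id);

        List<Customer> result = new ArrayList<Customer>();
        for (Customer c : customers)
            if (ids.contains(c.customer_id)) result.add(c);
        return result;
    }
}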

The reason for this is that the above relational structure doesn't contain fixed access paths. Instead, the access paths are created the moment we query the structure. This is obviously much more flexible than building the query right into the data structure: while a single data structure can only be organized in a single way, we can write arbitrary kinds of queries.

So by using a relational structure instead of a referential one we solve lots of problems. A big one is the infamous O/R-mapping problem: everybody who has tried to use a relational database from an object-oriented language knows about it. The idea of simply changing the way the database works and using an OODBMS hasn't worked out; RDBMSs are still state of the art, and for the reasons mentioned above I suspect they'll stay there (unless we find a new 'third way' to structure data). By changing the basic model of representing data to a relational one, we have also solved the O/R mismatch.

But is it really possible to represent ALL kinds of data in a relational way? I think it is. I've looked at many problems I've worked on in the past, and every one was at least equally well, and often much more easily, representable in a relational way. But one thing has to change: the query language. SQL simply isn't up to the task anymore. Maybe it never was, but while there was much development in the fields of object-oriented and functional programming, the relational side was a bit neglected in recent years. So SQL stayed at the top and became something like a synonym for querying relational data structures.

But it doesn't have to stay this way. SQL is only one programming language designed to solve this task, and in fact it misses lots of important things. Let's take a relation

parent-of(parent-id, child-id)

Now write a SQL query which decides if a given child-id is a child of a given parent-id, even if only indirectly over some 'hops'. This is called the transitive closure of a relation and isn't solvable in plain SQL (only a few RDBMSs seem to support the recursive queries needed to solve it). This kind of problem is rather common in general algorithms; another one is sorting elements in topological order based on a similar relation. (In fact most graph algorithms are useful on relations and should be easy to express in a useful query language.)
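For comparison, here is the same check written as ordinary code - a sketch in Java over a list of (parent, child) pairs (the class and method names are mine), using a simple breadth-first walk over the relation:

import java.util.*;

class ParentOf {
    final int parentId, childId;
    ParentOf(int parentId, int childId) {
        this.parentId = parentId;
        this.childId = childId;
    }
}

class TransitiveClosure {
    // is 'child' reachable from 'parent' over one or more hops?
    static boolean isDescendant(List<ParentOf> rel, int parent, int child) {
        Queue<Integer> todo = new LinkedList<Integer>();
        Set<Integer> seen = new HashSet<Integer>(); // guards against cycles
        todo.add(parent);
        while (!todo.isEmpty()) {
            int current = todo.remove();
            for (ParentOf p : rel) {
                if (p.parentId == current && seen.add(p.childId)) {
                    if (p.childId == child) return true;
                    todo.add(p.childId); // follow the next 'hop'
                }
            }
        }
        return false;
    }
}

A query language for relational data structures should be able to express this kind of reachability directly, instead of forcing us back into hand-written loops.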

So using relations instead of objects can give real advantages, but we need language support to make it possible not only to query an RDBMS more easily but also to represent all structured data in a fully relational way. This way we not only solve the O/R impedance mismatch and the problems with references (I've written about this here and here), we also simplify the way we program: data structures are the base of every algorithm, and with a more powerful way of representing them it should be possible to gain a lot of productivity.

Wednesday, December 27, 2006

My Haskell experiences

I've known Haskell for some time now. The first time I used it was with a system named Gofer, after I read a book about the implementation of a compiler for the Miranda language (there was no Miranda system available at the time because it was a commercial product). Gofer was quite complex for the time and also quite useless for practical applications (especially considering the performance of the PCs back then; I used it on an 8 MHz Atari ST). So I went back to C (and later C++) but tried out functional languages from time to time. The one I used most was Caml (later called OCaml), but I had certain problems with it and abandoned it after some frustrating experiences.

But now Haskell seems to be en vogue, and after playing around with it, doing some toy programming and trying things out, I decided that I had to do something 'real' in it. For some time now I have casually worked on a system for compiler creation, based on a customizable parser front end, a template engine for output, a rule-based language with forward inference for the semantic stuff, and a graphical browser to work with the system comfortably. Because I had reached a point where small incremental changes weren't enough anymore, I decided to do a bigger rewrite. It's written in Java, but I thought it would be a good idea to rewrite it in Haskell to learn more about the language.

So for some weeks I've done quite a bit of Haskell programming, produced about 2000 lines of working code and made some experiences with Haskell I'd like to share.

Where Haskell really shines is pattern matching: creating tree walkers or subexpression matchers is a piece of cake, and if you're writing a compiler, this is really useful. The easy and powerful list handling is quite an asset too. Things which would require tens of lines in Java are done in 2-3 lines of Haskell code. No more clumsy visitor patterns, no more boring List-object creation with complicated insertions etc. All this is really smooth walking, and where you can use it well, it gives a lot of productivity.

Lazy evaluation is the icing on the cake, because it allows you to write code in a very natural way which would lead to rather bad performance in other languages, where such code could easily compute lots of things which aren't actually required. To avoid this you often have to create quite artificial structures instead of simply reusing old code. In Haskell you just create an infinite or big finite list, even if maybe only the first two elements are needed. So while in a non-lazy language it's always necessary to have some criterion to abort the evaluation, in Haskell you write it the way that is most natural and then let the function which requests the data choose how much of it it wants. Caching inclusive.

Also, using a fast REPL (read-eval-print loop) to try out new code can be quite nice (I used Hugs, because of its good integration into the Windows environment). That's also a big plus for languages like Lisp, and with Haskell it's much the same. One bothersome thing about Hugs is that it doesn't allow you to test code as long as it finds a single error in one of your modules, even if your test doesn't touch that part of the code. That's a result of the compilation model and type inference and partly understandable. But in contrast to using the REPL with Lisp it's still a bit annoying.

In contrast with OCaml, the Haskell type system is much more powerful and allows easy overloading of functions. In OCaml I always had to invent lots of names or use qualified names for functions because no overloading was possible (though I don't know for sure if this is still true, because the last time I used OCaml was in 2002 and maybe they've made some advances since then).

Haskell has type inference, and while this can really make code more compact because of the lack of required type annotations, it also has its downside: a new line somewhere in your code can easily create compiler errors somewhere else. Type inference works by creating lots of constraints which the inferencer then tries to solve. So if an error occurs, the compiler flags it somewhere in the dependency chain, often at places where everything is totally correct while the real error is somewhere else. If you develop code incrementally this is not so bad, because you know what you have written since the last run and can look at the right spot to figure out the error. But if you write longer stretches of code without testing the program in between, it can take you quite a bit of time to find out where the error really is.

Haskell has a solution to this: type annotations. Simply declare the type you want a function to have and things work much more smoothly. So if you want to do something bigger than 20 lines in Haskell: use type annotations, at least for all top-level functions. Luckily it's easy to put annotations into your code later, because each time your code compiles you can query the inferred type of a function and simply paste it into your program (though sometimes you should rewrite it a bit to allow a more general typing). The code isn't as small as without annotations, but that's better than doing lots of searching around for the source of compilation errors.

The real problems with Haskell start to show up the moment you need some kind of state. And yes, you will need it: many things can be done in a pure functional style, but I've often encountered situations where a single function deep down in the call hierarchy suddenly needed some information from the outside. And in some cases it even needed to modify that information. Then things start to get ugly.

In OCaml you can simply declare global, mutable variables, so the problem is a non-issue. But in Haskell this is impossible because it doesn't allow mutation. Even if our small block of code only needs to read some state from the outside, that isn't directly possible in Haskell. In my project I often found a nice functional solution to a problem, but later on, after I had put the (tested and working) code into action, I discovered that some function 'deep down' suddenly required access to outside data, and I had to do lots of rewrites because of it.

Haskell has a solution to this 'outside context problem': the extensive use of monads. While a bit strange in the beginning, after using monads for a while you get accustomed to them quite quickly. But in more complex situations, the usage patterns also get more and more complicated.

To use outside state I often used the StateT monad. It allows you to create your own state data and modify it like in an imperative language, and it can also wrap another monad (for example the IO monad). So if you use it you can access your state at will, even modify it, and also have access to IO functionality to write out information while your program is executing.

(This is another big issue I had with Haskell: simply creating some debugging output. In most languages you can just write to some logging stream to see what your program has done. This is often even more useful than a debugger, because you can examine the history of the program's execution and see what led to a certain error. But in Haskell you need access to the IO monad to write data out, and this requires you to change the way you write your programs.)

The conciseness of Haskell also stops the moment you want to access or change state. Even Java code is more concise than Haskell code which has to change things. I know this is probably intended, to force programmers to write in a functional style, but compared to the otherwise clear and compact syntax of Haskell it's really a pain.

Monads can encapsulate state in some kind of data. In fact all monads are simply data constructors with a few required functions, and most monads work by returning themselves from functions. It requires a bit of backward thinking: the state monad, for example, creates a chain of closures which are executed later by evaluating the value of the monad. So to use state you don't have to pass this state into a function; the monad shows up only in the result type. While this is quite clever (and also the reason why things like backtracking and continuations can be implemented as monads), from a practical point of view it means you have to rewrite your code:

Let's take two simple, small functions:

giveElements :: (a -> Bool) -> [[a]] -> [[a]]
giveElements condition = filter (detectElement condition)

detectElement :: (a -> Bool) -> [a] -> Bool
detectElement = any

The above may seem a bit useless, but I've extracted it from its surrounding code to show the problem. The 'giveElements' function extracts certain elements from a list of lists and returns the filtered list. The 'detectElement' function does the job of deciding which elements are in and which are not, by checking whether any element of a list fulfills a certain condition.

So far, so nice. But after some work on your program you discover that your 'condition' function requires access to outside state to do its job. Maybe the function has to make a small modification to some data or should simply write a short message into a log. To allow this, you have to rewrite the above functions to use a monad. For this example let's simply use the IO monad. Now the functions become:

giveElements :: (a -> IO Bool) -> [[a]] -> IO [[a]]
giveElements condition = filterM (detectElement condition)

detectElement :: (a -> IO Bool) -> [a] -> IO Bool
detectElement c els = do
      r <- mapM c els
      return $ or r

The first function still looks quite similar; only the 'filter' function has been replaced by 'filterM' now. 'filterM' is the monadic version of filter and passes the monad up from the condition. Without filterM the state wouldn't propagate up out of the detectElement function (and of course there would be a type error).

detectElement looks a bit more different. The reason is that there is no monadic version of 'any', so I use a monadic map (the mapM function) to create a list of Bools by applying the condition to each element and then check if at least one of the results is true. Because of the monadic sequencing this version actually runs the condition on every element, where 'any' could have stopped at the first hit - and it required a complete rewrite of my code. It's still readable, but not as much as before.

And there are other awkward details: the '$' after the 'return' is necessary because if you simply write "return or r", the compiler thinks or and r are both parameters to return (since 'return' is not part of the Haskell syntax but an ordinary function), so we need the '$' to change that. I could also write "return (or r)", but that's a bit longer, and with more complex expressions it can easily look like you're programming in Lisp. And you can't simply inline the expression bound to 'r' into the line below, because '<-' is not an assignment but special syntax which creates a closure with r as its parameter.

All this has to be considered when you start to use monads. Only because I wanted to access the IO state from my condition function, I had to rewrite my little program substantially. And if there were no filterM it would get even more difficult. For example, I had to use similar code with sets instead of lists - but there is no filterM for sets, only a normal filter. And you can't 'lift' (Haskell slang for 'transform a function by using a combinator') a filter function into a monadic one. Map or fold functions can be lifted to monadic forms because they take a function which returns an arbitrary value, which can be used to carry the monad. But a filter takes a function which simply returns a Bool, so there is no way to return the monad from within it. So I had to convert the set to a list, filter that with filterM and then build a set again - which has worse runtime characteristics than filtering it directly.

And with more complex functions, the above translation from a non-monadic into a monadic one is much more difficult, and IMO it also destroys the clarity of the code. State monads also have bad runtime characteristics because of their 'inside-out' evaluation, which can fill up the stack quickly. You can use IORefs instead - but then, for example, you can't simply transform your program to use continuations anymore. And if you put an IO monad into a state monad, you have to lift all IO-accessing functions via liftIO.

And what's the point of having a pure functional language, going through lots of trouble to maintain the purity, only to end up with things like IORef which you tried to get rid of in the first place? If you design a program to use IO monads everywhere from the beginning, you have fewer problems maintaining and expanding it - but then you're using Haskell like a normal imperative language. What's the point of Haskell then?

Another bad point about Haskell is documentation: while you find lots of introductions all over the net, it gets quite difficult to find information once you're beyond the basics and start to do real work. The 'official' docs are simply horrible: no examples, no real descriptions of how things work or should be used. And finding tutorials for the advanced stuff is quite difficult. You may get lucky and find a solution to a problem buried in some page of the Haskell wiki, a newsgroup or a research paper - but it takes time to find it, and sometimes there is no such solution. And this is only about 'language stuff', not about solving very specific situations in some obscure framework.

In retrospect: Haskell is a nice language, but when you need state handling it gets ugly. That kills the language for me. What's the use of concise and compact code if you have to rewrite it later into a much uglier form once you discover that you need state? This is no problem for small programs, which is why I never ran into it with my earlier experiments. But with a bigger program things changed. So my program still isn't implemented completely, and I will probably abandon the Haskell version and do a rewrite in Java again.

It's not a problem of functional programming in itself, it's a Haskell thing: while monads are quite powerful in principle, IMO they are a pain to use in practice. Because monads show up in the return type of functions, the easy composition of functions which is the prime reason for the power of functional programming only works as long as all those functions use (compatible) monads - and sadly, many functions simply don't exist in this form.

With OCaml, for example, you won't get this kind of problem, but there you have unlimited mutation, which has its own problems. There are other pure concepts, though, like uniqueness typing (Clean) or transparently chaining state parameters through calls (Mercury). And maybe with more time to get accustomed to the use of monads I would see things differently. But since I don't have unlimited time and need to get things done in a programming language, I consider my Haskell experiment a failure - but only a partial one, because it was nonetheless fun and brought lots of insights.

So what to do with Haskell? Maybe it would be an idea to put monads automatically in every return type. If I write something like:

func :: a -> b

then the compiler would internally create

func :: Monad m => a -> m b

and also rewrite all bodies into do/return syntax. I don't know if this would really always work and solve all the problems, but I suspect that by hiding the use of monads better, and letting the compiler decide where it can safely drop the monad constructors, you would get your 'clear and normal'-looking programs back even if they use monads everywhere. I'm far from a Haskell expert, so take this with a grain of salt. But since Haskell already provides some syntactic support for monads, why not go one step further? If it weren't for the problem with state (and documentation - but that's only a question of effort), Haskell could be a very nice and maybe even the 'next big' thing.

Tuesday, November 28, 2006

References violate the principle of encapsulation

Every language which allows the use of direct references to objects also makes encapsulation, and thus code reuse, much more unlikely. And because every OOP language has those references, that's the reason why encapsulation in OOP doesn't work the way its promoters advertise.

Let's elaborate on this: any reference is a direct connection to a certain object. That's quite convenient: if we have a deep hierarchy of objects, we can simply retrieve a reference to one of those 'deep' objects (let's call it 'B') and hand it to an outside object O, and now O is able to manipulate B directly without having to walk through the hierarchy for every operation.

But what if our object O changes B directly, but it's required that some other object C notices, because it needs to reflect changes to B somehow? Think of a GUI for example: 'B' may be a button and 'C' some container. If we change the text of 'B', 'C' has to be notified so it can lay out its children to give 'B' enough space to display its text.

But because 'O' can modify 'B' directly, 'C' won't notice, and our GUI won't work as expected. Sure, as seasoned programmers we already know a solution: the Observer pattern. Let 'B' send notifications of any interesting change to other objects, which can then do what they have to do.

But is this really a good solution? It works, but everybody who has done it at a certain scale knows how tricky it can get: we need notification events, receivers, registration/unregistration, cycle prevention, and lots of additional code for every method which has to send a notification. It's a real mess. They even invented aspect-oriented programming because of this problem, but AOP is like trying to extinguish a fire with gasoline: instead of attacking the roots of the problem, it puts another layer of complexity on top of it.
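To make the machinery visible, here is a minimal sketch (all names are mine) of what even a single observable property drags in:

import java.util.*;

interface TextListener {
    void textChanged(Button source, String newText);
}

class Button {
    private String text = "";
    private final List<TextListener> listeners = new ArrayList<TextListener>();

    void addTextListener(TextListener l)    { listeners.add(l); }
    void removeTextListener(TextListener l) { listeners.remove(l); } // forget this and you leak

    void setText(String newText) {
        text = newText;
        for (TextListener l : listeners) // every mutator has to remember to notify
            l.textChanged(this, newText);
    }
}

class Container implements TextListener {
    public void textChanged(Button source, String newText) {
        relayout(); // careful: must not trigger another notification, or we cycle
    }
    void relayout() { /* give the button enough space */ }
}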

So what's the root of the problem? I've outlined it above: it happens only because we have the ability to modify objects directly without their containers noticing. And that's because of the existence of direct references.

Sure, we gain some convenience in the beginning because references allow us to talk to an object directly - but as a result we need huge amounts of additional code for notifications, observers, doc/view patterns etc. It also opens up lots of opportunities to make mistakes. Lots of them. And code reuse gets much more difficult, because we have many more dependencies on the internal structure of the problem than really necessary. All this because the use of references violates the principle of encapsulation.

Is there a solution to this problem? Sure: getting rid of references altogether! They are simply too tempting.

It's like every low-level mechanism: it gives you more power than needed, and because of this it can and will be abused. Without references, one always has to ask the container for a certain contained object via some key. And you will never get a reference to the object, you will only get the value of the object - which isn't mutable and thus can't be misused. Changing an object would always involve its container, so there would be no need for observers, notifications etc. It may be a little more difficult in the beginning, but in the long run the profit would be huge. And with certain concepts built right into the programming language, most unpleasantness can be avoided. But to solve a problem, it's necessary to first notice that it even exists.
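As a sketch of what container-mediated access might look like (the design and all names are mine, purely to illustrate the idea):

import java.util.*;

// an immutable button value - 'changing' it means building a new value
final class Button {
    final String text;
    Button(String text) { this.text = text; }
    Button withText(String newText) { return new Button(newText); }
}

class Gui {
    private final Map<String, Button> buttons = new HashMap<String, Button>();

    Button get(String key) { return buttons.get(key); } // hands out a value, not a mutable object

    // all changes go through the container, so it always notices them
    void setButtonText(String key, String newText) {
        Button b = buttons.get(key);
        buttons.put(key, b.withText(newText));
        relayout(); // no observers needed - we ARE the container
    }

    private void relayout() { /* ... */ }
}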

Some people will now say: I as the programmer want all the freedom, because I am able to use it in the right way. Even if that's true, having references in the language will in fact limit your freedom nonetheless. Why? Because if there are references in the language, the compiler has to follow certain rules and can't do certain optimizations - optimizations which are necessary to create performant code without using references! In other words: by giving you references, a language also forces you to use them, because the alternatives are too expensive. So your freedom is limited either way. And I for my part would rather be limited to a solid and more error-proof way of programming than to a risky and error-prone one.

(I've written about other problems with references, and how to solve certain problems without them, here.)

Sunday, November 26, 2006

Why pointers and references are a bad thing

References and pointers are quite common in the world of programming languages. References are something like the address of some data on the heap. Pointers are quite like references, but with additional operations of changing the address (by incrementing or decrementing it or even by arbitrarily creating it from some integer value). So a reference is a limited kind of pointer: One with a fixed value. Or in other words: A reference is an immutable pointer.

Pointers are especially bad because, by changing their address, they can point to invalid positions in memory and read or overwrite data which shouldn't be read or overwritten. That's quite ugly from a security point of view, and it's the reason why C and C++ programs often have security holes and allow exploits like buffer overflows (yes, I know you can prevent this with solid programming, but most programmers simply don't, because it requires more work and would often make a program slower).

So let's concentrate on references. First: what's the most fundamental concept in a programming language? It's the 'value'. A value can be a simple integer, a float, a string, or the value of a reference. A value has an identity of its own: the value is the identity. And a value can't be changed in itself; a '10' is always a '10'. Things can have a value, but the value is always a value on its own: you can't change a value itself, you can only change the value of a variable or a field of a data structure. References, on the other hand, are a special kind of value; the value of a reference is the address of some object on the heap. It's a less fundamental concept, and so we can ask: is this concept really necessary, useful, or even harmful? I suspect the latter: it does more harm than good if a language allows explicit references (or pointers, which are even worse).

But what's the bad thing about them? As long as you can't manipulate them, they always point to valid data - but only under an additional assumption: you have a language with automatic memory management, in other words garbage collection (gc). Without gc, memory could be freed and then allocated anew even while there are still references to it - and then you would have the same problem as with pointers: the possibility of incorrectly overwriting data. So references always require gc if you want their safety advantages.

References can easily create cyclic structures. Just create two blocks of data, A and B, where each has a reference to the other. If you then want to iterate over this data structure, you run into an infinite loop, always jumping from A to B and back from B to A. Because of this, every algorithm which follows reference chains can run into a loop, and this has to be detected or prevented in some way.

For gc (which is such an algorithm) this means that the simplest way of doing it ('reference counting') won't work in the general case, because of possibly cyclic references. This in turn means that it's hard to create some kind of 'local' gc. 'Local' means that the gc should only examine structures which are used in the current situation instead of having to scan the whole memory. But allowing cyclic references makes this impossible.

And this becomes a big problem if we have huge amounts of data, especially if the data isn't in main memory but in some external storage (on the hard drive or somewhere on the net). But even with data in main memory there are problems: today's memory architectures allow huge throughput (many GB per second) but the access latencies are poor. So random access in main memory is a comparatively much more expensive operation than it was ten years ago.

Of course there are methods to mitigate those problems to a certain degree ('generational gc'), but it remains difficult: everybody who has worked with a large Java-based desktop application knows the pauses when the system has to do a full gc. And if we want to scale gc up to objects in a database, this problem gets much worse. The trend towards more and more concurrency also makes global, general gc much more difficult and time-consuming.

So are references just bad because of gc? No. There are other problems:

(A small remark: I'm talking here about the view the programmer has of the problem through his programming language, not about the real implementation the compiler creates behind the curtains. On the hardware level references are unavoidable, but that doesn't matter as long as the programmer doesn't see them, because they're hidden by the programming language he uses.)

A big problem with references is the additional identity they give every object. If you write the following (in Java):

Integer v1 = new Integer(10);
Integer v2 = new Integer(10);

and check if v1 == v2 afterwards, you will get a 'false'! Even though both v1 and v2 are Integer objects with the same value, a comparison with == returns false. To check if they have the same value, you have to write v1.equals(v2). And even if your favorite language doesn't do the same with integers, it will probably behave this way if you create a class like

class Point {
    int x, y;
}

and then do the same thing as above with two identically valued Points.

You could now say: "Of course it is this way. We're comparing the objects themselves and not the values of the objects." But why should there even be the possibility of a difference? Only because we have references. Because of references we can decouple an object from its value and compare the two separately. Without reference values that distinction can't exist, and thus the only way to compare objects is by comparing their values.

There are many situations where this difference really doesn't make sense. The Integer example above is one of them. With many other structures you don't want this kind of differentiation either. Take the 'Point' class above: a point should be a position on a 2D grid (maybe on a canvas, to draw a pixel there); it simply doesn't make sense to have two different points which point to the same position - just like it doesn't make sense to have two different Integers with identical values.

The situation gets even worse if we allow mutation (i.e. changing). Then we could do something like this:

Integer v1 = new Integer(10);
v1.value = 20; // (not possible in Java)

At this point it becomes clear that an 'Integer' isn't a value, it's a container for a value. That's fine, but what if we don't want a container for a value but the value itself? In Java we only have a fixed number of those 'values themselves': the primitive types int, float, etc. Other languages are a bit more flexible and even allow the creation of new value types, but they also allow the creation of reference-based 'objects'. And this leads to the aliasing problem:

If we write

Integer v1 = new Integer(10);
Integer v2 = v1;
v1.value = 20;

then v2.value is also 20 now. v2 is only an alias for v1, and because of this every change to v1 also changes v2 and vice versa. This is bad because it's non-local: things can happen in other parts of the program without any direct relationship to each other. We explicitly have to prevent this:

Integer v1 = new Integer(10);
Integer v2 = v1.clone(); // now v2 isn't influenced by changes to v1 anymore
v1.value = 20;

In simple programs this isn't a problem because it's easy to see what happens, but with increasing complexity it creates more and more sources of hard-to-find errors. It also makes certain compiler optimizations hard, because the compiler can't assume that some data won't change under certain operations.

This behavior is possible in every language which allows mutation and has references. Only pure functional languages like Haskell are immune to this problem (OCaml, as an example of an impure functional language, has it too). But Haskell does both things: disallowing mutation and having no references. While immutability has its advantages, it's not necessary for eliminating the aliasing problem; removing references alone is enough. And this alone already gives several benefits:

  • There is no more need for a general gc

  • We only need one simple concept of equality, so no more 10 != 10

  • No more null-references (with the accompanying errors)


But can we really get rid of references? We can't write

p = new Point(2, 3)
p.x = 10

anymore, because values can't be changed (a '10' is a '10', it makes no sense to change a '10' to a '12' for example).

But that's in fact a good thing. We just have to define our data in another way: we can't modify values, but we can create new values based on old ones:

p = new Point(2, 3)
p2 = p
p = p.x <- 10

The big difference is that p2 is still Point(2, 3) even if p is Point(10, 3) at the end. The aliasing is gone.

(The new syntax 'p = p.x <- 10' means that a new Point is created by copying all values of p except for 'x' and using the new value '10' for x, returning the new Point)
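In today's Java, the closest approximation to this idiom is an immutable class with 'with'-style copy methods - a sketch (class and method names are mine):

final class Point {
    final int x, y;
    Point(int x, int y) { this.x = x; this.y = y; }

    // the equivalent of 'p.x <- 10': copy everything, replace x
    Point withX(int newX) { return new Point(newX, y); }
    Point withY(int newY) { return new Point(x, newY); }
}

class Demo {
    public static void main(String[] args) {
        Point p = new Point(2, 3);
        Point p2 = p;        // harmless: p can't be mutated anyway
        p = p.withX(10);
        // p2 is still Point(2, 3), p is Point(10, 3) - no aliasing
    }
}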

If you think that this is much more expensive in terms of runtime efficiency, you're wrong: most of those cases could be optimized internally by the compiler into the old-fashioned structure mutation. The reason is simply that the compiler no longer has to worry about possible aliasing. Getting rid of references in the language allows the compiler to create those references internally by itself. The aliasing problem, which is hard for the programmer to keep track of, is now a task for the compiler - which is much better equipped for it and won't make mistakes.

Removing references from the language would thus open up lots of optimization opportunities for the compiler while making the programmer's life easier. But is it all really that easy? Aren't references simply required to tackle certain kinds of problems?

What about a list, for example? Can a list be implemented without references? The answer is yes. In fact it's quite similar to the well-known implementation, even though each node is a value now:

value Node {
    int value;
    Node tail;
}

Creating a list is simple: we just create a big value where each 'tail' is another value. But how do we insert a new element somewhere in the middle? We have to create a new list now and can't do clever (but also often dangerous and hard-to-read) pointer-changing tricks like in the well-known doubly-linked-list insertion. We have to rely on the optimization abilities of the compiler instead - but we get much cleaner code.
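A sketch of such an insertion in Java (the immutable node class and method names are mine): the nodes up to the insertion point are rebuilt, while the tail is shared with the original list:

final class Node {
    final int value;
    final Node tail; // null marks the end of the list
    Node(int value, Node tail) { this.value = value; this.tail = tail; }
}

class Lists {
    // returns a NEW list with 'v' inserted at 'pos'; the original list
    // is untouched (assumes 0 <= pos <= length of the list)
    static Node insert(Node list, int pos, int v) {
        if (pos == 0) return new Node(v, list);   // share the whole rest
        return new Node(list.value, insert(list.tail, pos - 1, v));
    }
}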

And the advantage is that it's impossible to accidentally mix two lists in a way we don't want: because the whole list is a single value, it's 'written in stone' now. Totally unchangeable. So we can be sure that no other part of the program is able to change our list, even if we feed it to some lengthy and hard-to-understand algorithm. This is again a huge advantage for the programmer who wants to understand a program he didn't write, and also for the compiler when doing optimizations (especially in the context of concurrent programming).

In fact most algorithms can easily be rewritten to work only with values. And because we still allow the mutation of variables here, we can still implement imperative algorithms directly.

But there is one thing which seems to be impossible now: Creating cyclic structures.

How to create a cyclic structure if we only have values? Such a structure would require an infinite amount of memory if we try to implement it like the list-node above: We would get infinite recursion.

In Haskell the way out is lazy evaluation. With it we really can create a 'virtually infinite value'; the compiler simply evaluates only those parts of the value which are currently required. But is it possible even without lazy evaluation (which has certain other disadvantages)?

The answer is yes. It's not as easy as in a language which supports references, but since cyclic structures are not needed all that often, the extra time and effort isn't that big either. And don't forget the advantages mentioned before. The trick is to tag values with IDs:

value Node {
    int id;
    Value value;
    int next;
}

value Graph {
    Map lookup; // an array would work too, here
    int act_id;
}

function (Graph, int) createId(Graph g) {
    new_id = g.act_id;
    g = g.act_id <- g.act_id + 1;
    return (g, new_id);
}

function (Graph, Node) createNode(Graph g, Value value) {
    g, new_id = createId(g)
    node = new Node(new_id, value, 0)
    return (g, node);
}

function Graph addLink(Graph g, Node from, Node to) {
    return g.lookup <- put(g.lookup, from.id, to.id)
}

function Graph removeLink(Graph g, Node from) {
    return g.lookup <- remove(g.lookup, from.id)
}

(The above syntax is very explicit and again Java-like, to make it more understandable for people who aren't accustomed to the more compact syntax of other languages. Those who are shouldn't have any problems with it either - even if they consider it much too laborious.)

The above creates a Graph which can contain cycles. We don't use a 'next Node' anymore but an 'id' to reference other nodes. Each Node also stores its own id so it can be queried if necessary. And we need a map to store the Node for each id and look it up later. Let's create a simple cycle between two nodes with the values '100' and '110' now:

g = new Graph
g, n1 = createNode(g, 100)
g, n2 = createNode(g, 110)
g = addLink(g, n1, n2)
g = addLink(g, n2, n1)

Now we have a cyclic structure without using any references. The chain is simply followed by using the IDs of the nodes, and the IDs are created automatically by the call to createNode. The above looks a bit ugly because I explicitly had to chain the Graph g through all the calls, but this could easily be simplified by adding some syntactic sugar to the language. I didn't write the above to win a code-beauty contest; it's just to show that it's still possible to create all kinds of data structures without having references at all.

One problem remains: with the above structure we have gc back in the game. If you use the structure in a general way, you also have to remove Nodes from the graph. As long as you only remove single ones there is still no problem, but if you have a 'start node' and want the graph to automatically 'forget' nodes which aren't reachable from this start node, then you need some kind of gc.

But: the gc problem is now a local thing. Gc in the above case is totally constrained to this Graph structure and has to be solved in the context of a certain Graph alone. We don't require a general solution anymore, which opens new windows of opportunity for better optimization. In this case we can write a simple gc function and call it on every call to our 'removeLink' function. But we could also use a background thread, or call it based on some heuristics, etc. We could also put all this linking/gc stuff into some lib and mostly forget about it, maybe with some parameters in the constructor to choose an appropriate gc scheme. Or, best of all, add some compiler support for this kind of problem, which would allow the compiler not only to create the gc function on its own but also to use this additional knowledge for better optimizations. It's now possible to do whatever suits the problem, instead of having only a single general gc which has to work for all use cases with the same algorithm.
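To show how small such a local gc can be, here is a sketch in Java (transcribing the Graph above into a little class; all names are mine): a mark phase following the link chain from the start node, then dropping everything unmarked:

import java.util.*;

class Graph {
    // node id -> id of the node it links to (the 'lookup' relation above)
    final Map<Integer, Integer> lookup = new HashMap<Integer, Integer>();
    // node id -> the node's value
    final Map<Integer, Integer> values = new HashMap<Integer, Integer>();

    // local gc: keep only the nodes reachable from startId
    void collectGarbage(int startId) {
        Set<Integer> reachable = new HashSet<Integer>();
        Integer current = startId;
        while (current != null && reachable.add(current)) // 'add' is false on a cycle
            current = lookup.get(current);                // follow the link chain
        values.keySet().retainAll(reachable);
        lookup.keySet().retainAll(reachable);
    }
}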

So I hope I was able to give you some food for thought about one of the most standard features of modern programming languages, and why it could be a good thing to question its practicality. Of course there's still lots to say about this topic, but I'll stop here for now.

Monday, November 20, 2006

Rich languages vs. minimal languages

A common pattern in the design of programming languages seems to be to design languages as 'slim' as possible. The idea is to create as few abstractions as possible directly in the language and then implement all the useful things the programmer needs in libraries. While there are obviously some differences in how strictly this design pattern is applied, I think it has to be scrutinized. Let's look at some examples:

In Java the language itself implements only rather simple collection types: arrays with a fixed length. All usable collections have to be created in the language itself, which leads to lots of problems, inconsistencies and performance penalties. On the other hand, Java implements some things directly in the language which other languages tend to implement in libraries. A good example is the 'synchronized' directive, which makes a certain concurrency pattern much easier to apply.

In Common Lisp nearly all features of the language are implemented in the language itself, but many of those features are part of the standard, so they are available as a base to every programmer. The catch is that those features are designed so that they can be implemented in the language itself (via macros), which leads to the well-known s-expr syntax - hard to read for most programmers and one of the main reasons Lisp is often criticized.

Languages like Ruby and Python, on the other hand, have a very rich feature set built directly into the language. I suspect that this richness is the main reason why those languages are considered more expressive by many, rather than other features like object orientation or dynamic typing. In those languages you have good syntactic support for many common things like collection handling, regular expressions, big integers etc.

Then there is Haskell, which is based on a very restricted central idea (a pure, lazy functional language) but extends it quite cleverly with lots of syntactic sugar and freely definable operators. In Haskell you can not only overload existing operators, you can also create arbitrary new operators with custom binding priorities. But things like the imperative-looking syntax for monads or the compact list-comprehension syntax, which both greatly enhance the usability of the language, still had to be implemented as special syntax, because even with free operator creation it's not possible to express them as clearly in the base language directly.

I think we can see a pattern here: restricting the base language too strongly can lead to limited expressiveness and to the need for lots of boilerplate code. Just compare a bit of Ruby:

a = [1, 2, 3, 4, 5, 6]
b = a[2..-2]

with the similar Java code

List a = new ArrayList();
a.add(1);
... // some adds omitted
a.add(6);
List b = a.subList(2, a.size() - 1);

The compactness and readability of the Ruby example result primarily from the rich syntax of the language concerning array use, not from the dynamic typing. Even in a statically typed language you could create something similar, for example by using type inference. But even without it, someone could invent something like this:

a:[int] = [1, 2, 3, 4, 5, 6]
b:int = a[2..-2]

which is much more compact than the Java example, even with explicit type annotations.

There are lots of similar examples, so what could be the reason why language designers choose not to implement language support for something? Let's look at some possible reasons:


  • Learnability. If the language has a small core, someone can become proficient in it much faster.

  • Clear, simple and maybe even 'provably correct' design. The language could be designed much faster and also much more easily. The language model may also allow abstract theoretical reasoning about important things.

  • Better extensibility. Because all higher-level features are implemented in the language itself, they could also be implemented by a third party, which allows the creation of new features the designer hasn't thought of, or the extension of existing features with user-defined ones.

  • Better compilation. The compiler for a simple language is simpler and could thus be more sophisticated.


There are of course more reasons, but I hope I've captured the most commonly mentioned ones. So let's have a closer look at them:

  • Learnability

    Learning a language also means learning its libraries. If you know Java without knowing at least the base libs, you know nothing. So whether a feature is in a lib or in the language itself doesn't really matter.

    You may suspect that libs have the advantage of being better searchable with some documentation facility, which can even be done automatically to some extent (think of code completion in today's IDEs). But you could also create a similar documentation facility which includes language features. For example, just type 'array' and the documentation would give a short overview of the supported array syntax in the language.

    In fact there is no single reason why learning a language, to the degree necessary for real-world work, would be harder if you put more features directly into the language. It may even be the opposite: is it really easier to learn everything necessary about Java collections than about Ruby collections? I doubt it.

  • Clear, simple and maybe even 'provably correct' design.

    While this is true, I ask: are those properties of a language really necessary? To a certain degree and in certain domains: for sure. But in general? I doubt it. An 'industrial strength' language (i.e. a language created to do lots of real-world work) has to ease the life of the programmer. It simply doesn't matter whether the language has a nice, clear and simple design unless that really helps the programmer accomplish his or her tasks.

    Interesting theoretical properties of a language are certainly a valid concern for research languages, but that's only a very limited domain - even if it looks much bigger to language designers. For real-world programming, other properties are much more important, so we really have to differentiate here.

  • Better extensibility.

    This seems to be the 'killer argument' for putting stuff in the libs instead of the language. But is it really true? Can you, for example, really extend Java's collections so easily? The answer is of course: no. You can write your own list class, but you can't use it as freely as you want, because you simply can't substitute your new list class into existing code without modifying the source. Sure, you can exchange the system classes for your own (as long as they use the same interfaces), but that's not really sensible in most cases.

    You may suspect that that's not a fault of the idea itself but only of a certain language (like Java). In Ruby, for example, you can substitute classes very easily at runtime. In Haskell it all depends on which module you import. And in Lisp you can always write a clever macro. So maybe it's possible if you just use the right language? Yes, to a certain degree it is, but it comes at a price: you can introduce difficult-to-find bugs into already tested code, you have to design the language in a certain way (which makes it harder to use), and you need dynamic typing, or a rather sophisticated type system if you want static typing, to do it.

    In the end it makes the language semantics more difficult to understand - not for usability reasons, but only to allow somebody to re-create things which are already part of the language (even if implemented in the libs). And do we really need this kind of extensibility? You may think that having something is better than not having it, but everything comes at a price, and why would we want to pay this price for something we don't really need?

    All the above extensibility schemes have their limitations. Even with Lisp macros you can't create truly transparent continuation support. Even with Haskell's clever type and module system, they still had to create special syntax to support monads the way they have it now. And in Ruby you pay with slow execution and the loss of compile-time validity checks, which can lead to hard-to-find errors.

    In fact most programmers never ever have to change the system libraries. Extend them? Maybe, to a certain degree. But it's always possible to add functionality to a language by writing libraries. Maybe you can't use those extensions the same way you can use the built-in features, but is that such a problem? It even has the advantage of making code more maintainable, because it's always easy to distinguish between 'safe' language base stuff and 'maybe faulty' self-written stuff.

    In the early days of computer science, the design of programming languages was new, new algorithms and data structures were invented 'every day', and in those times it really seemed risky and prohibitive to put too many things directly into the language. So after experiencing the difficulties of integrating new concepts into rigid languages like Fortran, it's understandable that the trend went in the direction of more extensible languages. But times have changed. When did you last invent a new basic algorithm or data structure? All the basic use cases have been well known for some time now, and in most domains they simply don't change that often.

    So we simply don't need that much extensibility today if the language provides enough features on its own.

  • Better compilation.

    In fact it's the opposite: a program in a rich language can be compiled to an efficient native program much more easily, simply because the compiler has more knowledge about the program. Sure, the compiler gets bigger, but it doesn't have to guess what the programmer really wanted, and more optimizations become possible.

    Fortran is still in use because its array and matrix features, which are built into the language, allow easy vectorization and thus high-performance programs on otherwise hard-to-build-for computer architectures.

    Lots of novel optimization techniques also become possible. What about a collection class which isn't fixed to a certain implementation, but where the compiler decides which implementation to use depending on usage patterns, maybe even with the possibility of changing this implementation after profiling at runtime? This would be impossible if the collection type were implemented in a lib, because lib code looks like any other code to the compiler, and it simply doesn't know about lots of optimization opportunities. But if the collection is part of the language, the compiler is able to choose how to implement it, depending on how the collection is used. Many indexed accesses? Use an array-based implementation. Many insertions and deletions? Use a linked-list one. A sketch of this idea follows below.
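Even as a library (so with far fewer opportunities than a compiler would have), the idea can be sketched - here a hypothetical self-profiling list that counts its usage pattern and switches its backing implementation (all names and the heuristic are mine):

import java.util.*;

class AdaptiveList<T> {
    private List<T> impl = new ArrayList<T>();
    private long randomAccesses, middleInsertions;

    T get(int index) {
        randomAccesses++;
        maybeSwitch();
        return impl.get(index);
    }

    void add(int index, T element) {
        if (index < impl.size()) middleInsertions++;
        maybeSwitch();
        impl.add(index, element);
    }

    private void maybeSwitch() {
        // a crude heuristic, just to show the principle
        if (middleInsertions > 2 * randomAccesses && impl instanceof ArrayList)
            impl = new LinkedList<T>(impl);
        else if (randomAccesses > 2 * middleInsertions && impl instanceof LinkedList)
            impl = new ArrayList<T>(impl);
    }
}

A compiler with language-level knowledge of collections could do the same without the wrapper overhead, and with much better information about actual usage.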


These ideas could and should be extended to higher-level structures. Collections are very basic but also very often used, and thus crucial for efficient implementation and performance.

But things like concurrency, component-based architectures, database access/persistence etc. can also really benefit from being part of the language. Instead of writing something like

ResultSet result = db.query("select name, value from data");
String name = result.getString("name");
int value = result.getInt("value");

which is very error-prone and hard to compile efficiently, we could have something like this:

select name, value from db.data;
// name,value are stored directly into variables above

This would be easier to write, and the compiler could check for syntax errors at compilation time, automatically cache queries, do optimizations etc.

I believe that the minimality principle which is applied in most of today's languages is in fact obsolete and should be questioned by everyone who starts to design a new programming language. Many languages did this to a certain degree, but why stop there? Why not try to find all the common patterns programmers are implementing and re-implementing each and every day, and put them directly into the language? Instead of researching 'domain-specific' languages, why not simply invent a language suited for the domain of every-day programming?

Wednesday, September 13, 2006

Disadvantages of user-extensible languages

A lot of languages are going the way of giving as much user extensibility as possible. Lisp is the obvious forerunner: With its flexible macro facility, it's possible to create nearly every imaginable extension to the language. Look for example at the loop facility in Common Lisp: It's a 'language inside a language' (sometimes called a 'DSL' - domain-specific language). In Lisp the 'price' for this extensibility is 's-exprs', a very simple syntax which is in fact like entering a program as a syntax tree, with lots of parentheses.

Newer languages have tried to offer comparable extensibility together with a rich syntax (for example Nemerle); others tried ways that avoid macros. There are lots of ways to provide some extensibility: Smalltalk used 'blocks' (a kind of closure with an additional non-local return) to create all the well-known control constructs (if-then-else, while-loops etc.), others use extensive operator definition and overloading (like Haskell) or use reflection to provide some degree of user extensibility (like Java).

At first sight, user extensibility seems like a good idea. But like everything, it's a double-edged sword, and I'll try to show some of the disadvantages here.

Of course there are also advantages, and it's always a question of the domain you're using a language for: If you use a language in a big team you have different requirements than if you're working alone or in a small group (up to 3 people). If you write some small 'one-time-use' tool, you also have different requirements than when writing an application which has to be used and maintained for years. And if you're doing prototyping or experimenting with new language features and abstractions, you have different requirements than if you're working on yet another business application.

So what are the big disadvantages?

  • Language design is hard, so writing extensions to an existing language is hard too.

    Many people think that the main difficulty of designing languages is building a syntax. In fact that's the easiest part. The main problem is to get a set of language features which aren't conflicting, which all solve different problems (no overlap) and which are easy to use and understand. With language extension, creating the syntax is easy, but integrating the new language features into a common framework is as difficult as ever. So don't expect that an extensible language makes language design easier. But of course it makes it easy to create bad or useless extensions. In a 'from scratch' language where you have to build a whole compiler first, the effort selects for programmers who are able to do it. But with an easily extensible language, this kind of selection won't happen, and thus many extensions will be poorly thought out, useless or even detrimental to the language.


  • Language extensions can (and will) conflict if you have to use 3rd party code

    If a language is designed by a communicating team of programmers (or even a single programmer), conflicts won't happen easily, because the designers have a complete overview of the language. But with user extension this overview is missing, because extensions will be created independently. This can easily lead to conflicts if you want to use 3rd party software which uses similar but slightly different language extensions than you do. Even if the language provides a certain 'isolation' feature to prevent conflicts, you still have the problem that two parts of the code use similar-looking but semantically different language extensions.


  • Extensible languages provide the language designers with an excuse to 'under-design' them

    As an application programmer you want to have a complete language, not a 'language construction kit'. But with extensible languages it's tempting for the language designer to leave the language 'simple on purpose', to let the users create the extensions they want. But creating those extensions is still hard, especially if they are created by independent programmers. So an incomplete language is of no use to an application developer, and it's unlikely that such a language will ever really be of a piece.

  • Language extension creates another 'thinking level': A programmer always has to consider writing a language extension to solve a certain problem

    If a language supports language extensions, it's always very tempting to write a new abstraction to solve a problem. While this is (in theory) a good idea, in practice it often leads to increased development times and less maintainable code. Why? Because creating new abstractions is much harder than just solving a certain special problem. Many programmers underestimate the difficulties and try it nonetheless - often with really bad results. If your abstraction isn't well enough thought out, it's not reusable, and it's harder to understand than a single, direct solution to the problem. Sure, sometimes it really works fine, but often you only know in the end, when it's too late. So while this is something somebody with lots of time can try, it can be a death blow to a project with time constraints.

  • Language extensions have to be learned like built-in features of the language. So having many extensions makes a language hard to learn. And language extensions have to be remembered or learned if you have to maintain code.

    Sure, you also have to learn normal libs and frameworks. And we all know that learning a framework can be quite hard. But when language extensions come into play, it gets even harder, because real language extensions can change the semantics of otherwise well-known constructs. This can obfuscate the real meaning of an operation and leads to harder-to-read code. And if you read code, you always have to consider the active language extensions and 'parse' them in your head too. Also, to use an extension you have to learn it first. Just take a look at Common Lisp's loop facility: While it's relatively easy to read and understand, it's quite hard to learn all its possibilities and features. If, on the other hand, you have some code which does the same as a language extension, this code is often directly readable, even if it's maybe a bit longer.

    With built-in language features this isn't a problem, because those features are limited and have to be learned once (or maybe a few more times later, if the language gets some official extensions). But with a user-extensible language the number of extensions is unbounded and can increase quite rapidly if multiple teams use their own extensions.

If you're a 'lone wolf' programmer, all those disadvantages don't apply to you. But if you have to work in (maybe big) teams or have to use lots of 3rd party code, you will run into some of those problems if you use an extensible language. I suspect (but cannot prove) that the rigidity of the Java language is the prime reason why there is so much 3rd party code for it right now - sure, the code sometimes looks ugly, but on the other hand Java really prescribes one way to do things, which leads to better-fitting software pieces. With more extensibility this effect would fade, because then there would be more ways to solve a certain problem, and all those solutions wouldn't fit together.

And what's the way out? Do we really have to live with those boring, inflexible languages forever (at least for 'production languages' - languages for prototyping, academic purposes etc. are a different breed)?

I think the way is to provide languages with as much as is needed to solve the problems the language is designed for. Make the language 'as rich as necessary'. Ada is an example which shows the advantages of this approach. Java also has parts where this approach is clearly visible - and working quite well. Of course, a language designed with this principle in mind has to be updated more often to satisfy the needs of the programmers. The difference is that those updates and extensions are well thought out and can be created with the whole language in mind, by people who know their job. This more probably leads to sensible extensions, even if it depends on the language designer or the community which way they want to go.

Friday, August 25, 2006

OOP is dead (part 3)

Before I take an in-depth look at another severe problem with object-oriented programming, I want to talk about why I believe that OOP is really dead. Hey, Java is alive and kicking, C# (.NET/mono) also, and there's still lots of C++ programming. So how could OOP be dead?

Sure, I used exaggeration to make my point more obvious. But there is more: Have you looked at the actual developments in recent programming language design? Some well-known Java guys propose to add closures. C# gets closures too, and there is this LINQ thing which is also totally un-OOP. And C++? Look at the recent trends in template meta-programming (boost++) which are heading directly away from OOP too. And we have those 'new' languages, like Ruby and Python. Sure, they are OOP to a big degree, but they also incorporate lots of functional aspects into their design. The same goes for really new languages like Scala, which is only OOP to a certain degree anymore.

Another trend is meta-programming. While I think that meta-programming is a bad idea, it shows that there are limitations in a language which one can only overcome with meta-programming. In Java, for example, we have annotations and reflection, which allow changing the semantics of the language. Other languages have other means to accomplish the same. But while meta-programming is quite natural in a language like Lisp, it's hard and cumbersome in a language like Java. So if people really go this way, there have to be some obvious problems which aren't solvable within the language itself.

So why would people make those decisions if OOP were really alive? It's not that people abandon languages like C++ or Java, but they start to use them in an increasingly non-OO way. And this means that OOP is losing importance.

And on the other hand we see an ever-increasing number of frameworks for Java which only try to solve problems that only exist because of the limitations of OOP - frameworks of sometimes absurd size and complexity considering what problems they try to solve. OK, maybe that's not a problem of OOP but only of Java. But I really doubt it. I've designed several OOP languages over the last 20 years, and each and every time I wasn't able to solve certain severe problems. Sure, you can make things easier than in Java, but it's always only a gradual improvement, not a really big one.

What's the most important point in programming? Creating compact code? Creating easily maintainable programs? Having a very expressive language which allows elegant solutions to common problems? Sure, all those are nice-to-haves, but the paramount point is simply one thing: Code reuse. If you can reuse existing code, you don't have to write it, debug it, or maintain it. It's the optimal productivity boost, and always faster than re-implementing something, even in the most elegant, compact and easy way.

So what's the problem with code reuse? Why is it so difficult? And why especially in OOP? Isn't one of the core ideas of OOP to make code reuse much easier? I thought so myself, but for some time now I have doubted the validity of this claim. It simply hasn't worked. But why?

First: Dependencies.
In OOP you have to create lots of unnecessary dependencies just for performance reasons. Think of the cache example from the last article: Adding simple caching created lots of new potential problems and pitfalls. And in fact it's even totally irrelevant to the function of the program; it's only needed to improve performance. But because of the cache, you need more dependencies to notify the cache of changes in the data structures on which the cached value depends. And each dependency destroys opportunities for code reuse, because you don't only have to capture the 'code itself' but also all the dependencies if you want to make a reusable piece of code. That's the reason most frameworks grow so fast: They simply have to provide all those dependencies.

Second: Extensibility.
Isn't that something in which OOP should really shine? With inheritance, which allows the creation of slightly extended classes from existing ones, OOP should be able to provide lots of extensibility. That's the theory. In practice there are several reasons why it often won't work:

  • The function-composition-problem I mentioned in part one.

  • In OOP the type of an object is determined at creation time. So if you have an extended version of some class, you only get the advantages if you also have control over the creation of objects. If object creation isn't within your reach, you will only get objects of the 'old, uninteresting' base class instead of your 'shiny new' one with enhanced functionality. To overcome that problem the factory pattern has been invented (see the sketch after this list), but it creates additional complexity and isn't of any use if the object you want to extend isn't created by a factory. There's also the idea of making object types changeable at runtime, but this approach has drawbacks which easily outweigh the advantages.

  • The limitations of inheritance. Inheritance is a very simple thing: it can only describe 'is-a' relations. While this is perhaps the most important relation, it's still only one. And if we want to have multiple 'is-a' relations ('multiple inheritance'), we run into lots of difficult specification problems, which is the reason why some languages simply avoided multiple inheritance or limited it to interface inheritance. Inheritance also imposes a simple hierarchical structure on a program, where you can only add leaves to the tree. And you get lots of additional dependencies.

  • The limitations of overriding: Overriding only works as well as the design of the class allows. To allow flexibility, a class has to consist of very small methods which call each other in a way that lets you add as much functionality as possible by overriding them. But in practice that's much more difficult than in theory. If you create many small methods, the code gets more complicated, even if the added flexibility is never needed because nobody will ever extend the class in those ways. And if you are too conservative in splitting up the class into small methods, you can block extensibility later and have to refactor the base class. While refactoring is always possible, it has a big disadvantage: It can break contracts and thus existing code. It's also against the idea of code reuse, where you simply use existing code without having to modify it.

  • Explicit data paths: This is a topic I will look at in detail in a later article. In short: In OOP you have to specify all data access paths explicitly. If you have some data structure, you always have to specify a concrete layout of how the data is structured. If this structure changes, lots of code which accesses the data often has to be changed too. And a concrete structure which is usable for problem A isn't necessarily usable for problem B. So if you want to reuse the solution of problem A for problem B, that won't work, because the data structures have changed, even if A and B otherwise have much in common.
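
To make the object-creation point from the list above concrete, here is a minimal sketch (all class names are made up for illustration): code that calls 'new' on the base class directly can never hand you the extended version, and a factory only helps if the creating code was written against one in the first place.

class Connection {
    void open() { /* ... */ }
}

class TracingConnection extends Connection {
    void open() {
        System.out.println("opening...");
        super.open();
    }
}

class Pool {
    // The library creates the object itself - your subclass can never show up here:
    Connection acquire() { return new Connection(); }
}

interface ConnectionFactory {
    Connection create();
}

class PoolWithFactory {
    private final ConnectionFactory factory;
    PoolWithFactory(ConnectionFactory factory) { this.factory = factory; }
    // Now the caller decides which class gets instantiated:
    Connection acquire() { return factory.create(); }
}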

For all those points there exist solutions. But in practice they often simply don't work. It's hard to tell why without looking at concrete examples, but I think the prime reason is over-specification: In OOP you have to write much more explicitly than is really useful, often only for performance reasons. And over-specification leads to additional constraints and complexities which make reuse much more improbable, because to fit into a new area, existing code doesn't only have to match the problem but also all those optimizations and tweaks which are necessary in OOP.

I think one of the reasons for that stems from the origins of OOP: Simulation. Each object there is some independent 'living thing' which has to be specified on its own. Because of this, OOP is very local, very 'to-the-metal', and it's hard to look at the overall structure of a program. Thus we have to code lots of things explicitly - which then leads to the described problems. So instead of OOP we need more abstract ways of programming, because abstraction allows the compiler to do automatically many things the programmer now has to do by himself. And by this the code is freed from unnecessary complexity, which would not only make programming easier but would also create more opportunities for code reuse.

Wednesday, August 23, 2006

Java enhancements I would like to see

After some critique of the new closure proposal (because of its redundancy) I want to talk about some Java extensions I would like to see. Those extensions complement Java in useful ways to make some often-used idioms a bit easier to use, without promoting a totally different programming style.

Counter for the for-each-loop:


I often write something like this:

int i = 0;
for (Something s : data) {
    if (i++ > 0) out.print(", ");
    out.print(s);
}

While this works, it would be easier if a for-each loop created its own counter on demand:

loop: for (Something s : data) {
    if (loop.count > 0) out.print(", ");
    out.print(s);
}

We can also extend this idea by supplying 'isLast' and 'isFirst' variables to make the above code more readable:

loop: for (Something s : data) {
    if (!loop.isFirst) out.print(", ");
    out.print(s);
}

An 'isLast' is used much more seldom but sometimes has its benefits. While it's not so important, it's a bit more difficult to implement, because we have to delay the evaluation of the loop body by one element to know whether the current element is really the last one. But when it's required it's quite useful and could make loops much more readable.

Parallel iteration


Sometimes you have to compare two collections, add them element-wise etc. Then the new for-each loop is useless and you have to use the old explicit one. This could easily be changed with an additional modification of the for-each loop:

for(Something s1: data1; Something s2: data2) {
...
}

This loop runs until one of the iterators has no elements left. It's a really straightforward extension and should be quite easy to implement.
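
For comparison, this is roughly what you have to write today in plain Java to iterate over two collections in lockstep (using 'Something', 'data1' and 'data2' from the example above):

Iterator<Something> it1 = data1.iterator();
Iterator<Something> it2 = data2.iterator();
while (it1.hasNext() && it2.hasNext()) {
    Something s1 = it1.next();
    Something s2 = it2.next();
    // ...
}

The proposed syntax would simply generate this kind of code behind the scenes.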

'on-break' extension for loops


This one is to solve a constant annoyance: Figuring out whether a loop terminated normally or via a break.

Imagine you have something like

boolean has_error = false;
for (Something v : data) {
    if (v.containsError()) {
        has_error = true;
        break;
    }
}
if (has_error) {
    ...
}

This is a very common idiom for handling breaks: Simply setting a flag to know whether a loop broke. But you have to do it by hand and thus it's error-prone (imagine accidentally setting has_error = false). Sometimes it's possible to move the code from if (has_error) { ... } into the loop body, but there are lots of cases where this won't work. It also won't work if the break is created by the compiler, like above in parallel iteration.

So I propose the following syntax:

for (Something v : data) {
    if (v.containsError()) break;
}
catch() {
    ... // executed on break only
}

This is really straightforward and doesn't use any new keyword.

Allow the for-each loop to iterate when supplied with an Iterator and not only an Iterable


Sometimes it is useful to provide different kinds of iterators for a collection. Maybe:

Iterator forwardIterator();
Iterator backwardIterator();
Iterator depthFirstIterator();
Iterator filterIterator(Filter filter);

At the moment we have to make all those Iterators themselves Iterable and add a

Iterator iterator() { return this; }

method to the iterator. It would be better if you could use iterators directly in a for-each loop and write for example:

for(Element e: tree.depthFirstIterator()) {
...
}

even if depthFirstIterator() only returns a plain Iterator.
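
Until then, a tiny adapter is the usual workaround (standard Java; the helper name 'in' is mine, and I assume depthFirstIterator() is declared as Iterator<Element>):

static <T> Iterable<T> in(final Iterator<T> it) {
    return new Iterable<T>() {
        public Iterator<T> iterator() { return it; } // one-shot: hands out the iterator once
    };
}

// usage:
for (Element e : in(tree.depthFirstIterator())) {
    // ...
}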

Make iterators 'Closeable' and close them in the for-each loop automatically.


This is very important if you use iterators, for example, to iterate over a result set from an SQL query. At the moment this often prevents the usage of the for-each loop, because you have no access to the iterator object there. The problem: You have to create a new version of the iterator interface to avoid breaking existing code. So we have

interface Iterator2 extends Iterator, Closeable {
    void close();
}

and a matching Iterable2 definition. Also, the for-each loop has to be aware of this and generate a call to close() for Iterator2 iterators. The Iterator2 workaround could be avoided if the 'default implementations for interfaces' extension below were implemented.

A 'using' statement like in C#


We all know how annoying correct exception handling and proper closing is in some cases. To make this easier I propose using 'try' for a 'using-like' construct:

try [type] [var] = [expr] { [body] }

catch blocks can optionally be added to dispatch on exceptions as usual, but the 'try' block will close [var] properly without requiring all that nested try-catch-finally mumbo-jumbo.
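
For illustration, a before/after sketch ('use' is a placeholder; imports from java.io omitted). The first half is today's standard Java, the second half the proposed, hypothetical syntax:

// today: manual close in a finally block
Reader in = new FileReader("data.txt");
try {
    use(in);
} finally {
    in.close();
}

// with the proposed construct: 'in' is closed automatically, even on exceptions
try Reader in = new FileReader("data.txt") {
    use(in);
}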

A 'instanceof' with autocasting


While it's often bad style to write something like

if (x instanceof SomeClass) {
    SomeClass x1 = (SomeClass)x;
    ...
}

it's often unavoidable. This idiom has two problems: First you have to invent a new name for 'x', and second the compiler won't check whether the cast is correct. Also, it's bloat. So I propose this syntax:

for (x instanceof SomeClass) {
    ... // x is properly cast to 'SomeClass' here
}

This is very easy to implement, doesn't require new keywords and solves the above problems.


And after all that (nice and easy) 'syntactic sugar', now on to the more interesting, but also less easy ones:

Default implementations for interfaces


This is a bit controversial, but I think the benefits are bigger than the risks. Interfaces remain field-less, but the methods can contain default implementations. If you create a class C which implements an interface I, all default method implementations in I will be copied into the body of C, unless C implements those methods itself.

With this feature interfaces could be richer without requiring every class which implements the interface to implement each and every method. This would lead to better code reuse, because interfaces could be made more customizable. You also get fewer legacy issues: it's no longer a problem to extend an interface, without the need to update all classes which implement it.
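
A small sketch of how that could look - the syntax is hypothetical, since today's Java interfaces can't carry method bodies:

interface Logger {
    void log(String msg);

    // hypothetical: a default body, copied into every implementing class
    // that doesn't define logError() itself
    void logError(String msg) {
        log("ERROR: " + msg);
    }
}

class ConsoleLogger implements Logger {
    public void log(String msg) { System.out.println(msg); }
    // logError() is taken from the interface's default implementation
}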

Consumer/provider reversal ('yielding')


If you have to write a more complex iterator which iterates over recursive data structures, you know the problem: You have to do by hand many things the compiler normally does itself. You need to provide your own stack, store the state of the iteration etc. This is very cumbersome and error-prone.
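
To see the pain, here is what such a hand-written (pre-order) iterator looks like in standard Java, assuming a Tree<E> class with 'value', 'left' and 'right' fields like the one in the example below - the explicit stack replaces the call stack the compiler would otherwise manage for us:

import java.util.Iterator;
import java.util.Stack;

class TreeIterator<E> implements Iterator<E> {
    private final Stack<Tree<E>> stack = new Stack<Tree<E>>();

    TreeIterator(Tree<E> root) {
        if (root != null) stack.push(root);
    }

    public boolean hasNext() { return !stack.isEmpty(); }

    public E next() {
        Tree<E> node = stack.pop();
        // push right first so that left is visited first
        if (node.right != null) stack.push(node.right);
        if (node.left != null) stack.push(node.left);
        return node.value;
    }

    public void remove() { throw new UnsupportedOperationException(); }
}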

This problem is most often avoided by using closures for iteration. With these, the state of the iteration is handled as usual and the code becomes much clearer. Let's look at a simple example using closures:

// needs: import java.io.PrintWriter;
class Tree<E> {
    E value;
    Tree<E> left, right;

    // Nested interfaces are implicitly static, so it needs its own type parameter.
    interface ForEach<E> {
        void eval(E value);
    }

    void forEach(ForEach<E> func) {
        func.eval(value);
        if (left != null) left.forEach(func);
        if (right != null) right.forEach(func);
    }

    void forEachDepthFirst(ForEach<E> func) {
        // note: must recurse with forEachDepthFirst, not forEach
        if (left != null) left.forEachDepthFirst(func);
        if (right != null) right.forEachDepthFirst(func);
        func.eval(value);
    }

    void print(final PrintWriter out) {
        final boolean[] is_first = new boolean[] { true };

        forEach(new ForEach<E>() { public void eval(E value) {
            if (is_first[0]) is_first[0] = false;
            else out.print(", ");
            out.print(value);
        }});
    }
}


With this implementation we can easily iterate over the tree in two different ways (pre-order and post-order). But it shows two problems:
- First, we have to maintain an 'is_first' state to place the ", " correctly.
- Far worse: it won't work with parallel iteration.

If we had iterator implementations of 'forEach' and 'forEachDepthFirst', we could for example write

boolean isEqual(Tree<E> t1, Tree<E> t2) {
    // uses the proposed parallel iteration plus the 'catch'-on-break extension
    for (E v1 : t1; E v2 : t2) {
        if (!v1.equals(v2)) break; // elements differ: trees are not equal
    }
    catch {
        return false;
    }
    // (a complete implementation would also check that both iterators are exhausted)
    return true;
}

The advantage of this approach is that it would work for every Iterable. But how would you solve this problem using closures and 'forEach'? You would have to provide additional implementations for two trees, three trees etc. And what if you want to compare the contents of a List and a Tree element-wise? Here Iterators are much more powerful. And because they have been a part of Java for some time now, I propose to make their implementation easier instead of switching to a different and even less powerful abstraction.

The idea is well known as 'yielding'. It simply allows suspending the execution while saving the state for a later resume. While this can be implemented with threading, that's a very expensive way to do it. So it's better to do it by code rewriting directly in the compiler.

How would an iterator with yielding look? Consider the method below as part of the class above:

Iterator<E> iterator() {
    yield value;
    if (left != null) for (E v : left) yield v;
    if (right != null) for (E v : right) yield v;
}

That's it. As easy as it gets. 'yield' will suspend the execution and provide the result to the consumer via the 'next()' method of the Iterator. The problem here is the use of a new keyword. Thus I propose using 'throw return' instead. While this isn't totally nice, it somehow captures the idea of yielding and is short enough. To mark a method as yielding, the compiler could infer this itself, or we could use the 'throw' keyword to mark those methods explicitly:

public throw Iterator<E> depthFirst() {
    if (left != null) for (E v : left) throw return v;
    if (right != null) for (E v : right) throw return v;
    throw return value;
}

With this extension, iterators would be quite easy to implement. I know this extension is a bit difficult to implement in the compiler, but it could provide a huge gain in productivity, and more people would provide useful implementations of iterators.

Including inheritance


This is a bit more complex but could bring lots of benefits in code reuse and maintenance. It's a bit similar to traits, and has quite simple semantics.

Let's look at an example:

class MyListModel implements ListModel {
    public int getSize() { ... }
    public Object getElementAt(int index) { ... }

    public void addListDataListener(ListDataListener l) { ... }
    public void removeListDataListener(ListDataListener l) { ... }
}

The problem is the two methods at the bottom. Often you want to use a standard implementation and only want to implement the getSize and getElementAt methods. To prevent this re-implementation there is an AbstractListModel class in the libs which has a standard implementation of addListDataListener and removeListDataListener. While this often works, it's problematic if MyListModel should also extend another class, or if you want a different standard implementation.

So I propose to simply include standard implementations for interfaces:

class DataListenerImpl implements ListModel {
    public void addListDataListener(ListDataListener l) {
        listenerList.add(ListDataListener.class, l);
    }
    public void removeListDataListener(ListDataListener l) {
        listenerList.remove(ListDataListener.class, l);
    }

    private EventListenerList listenerList = new EventListenerList();
}

This class won't work on its own, but with the new extension you could include it in another one:

class MyListModel implements ListModel {
    import DataListenerImpl;

    public int getSize() { ... }
    public Object getElementAt(int index) { ... }
}

That's it: MyListModel now has the methods and fields from DataListenerImpl which satisfy the ListModel interface. The fun part of this extension is that it's possible to split up implementations into several topics and then include them where you want them. So maybe you write a class

class MapModelImpl implements ListModel {
    public int getSize() { return map.size(); }
    public Object getElementAt(int index) { ... }

    private Map map;
}

you can then simply synthesize a new ListModel by writing

class MyListModel implements ListModel {
    import DataListenerImpl;
    import MapModelImpl;
}

This would work for much more complex cases than the one described above, and the example is far from complete, but the idea should be clear. I won't elaborate on the details here (constructor handling, collisions) - maybe in a later post.

OOP is dead (part 2)

A second and even more severe reason why OOP has passed its peak and more and more people are looking for alternatives is mutable state (I'll write only 'state' from now on). Most programmers don't even think much about state; they simply take it as a normal part of programming, but in fact it's the source of most bugs and difficulties in programming. So what's 'state'? It's the current value of all data which changes while the program executes. The code of the program and constants don't change and are thus not part of the state. But in OOP each object has local data which is often mutated by calling a method, so each object has its own state. All objects together, plus the values of local variables, form the state of the program.

What's the problem with state? Simple: It can change. And if it changes and the change isn't reflected accordingly in other parts of the state, we have a bug. Let's look at a simple example:

int calcSum(List<Integer> data) {
    int n = data.size();
    int sum = 0;
    for (int i = 0; i < n; i++) {
        sum += data.get(i);
    }
    return sum;
}

This code looks perfectly valid, but in fact contains a hard-to-find bug. Imagine that while the above method executes, another thread removes one element from 'data'. The result will be a runtime error, because n is now bigger than data.size(), and for i = n - 1 we get an IndexOutOfBoundsException.

Of course every Java programmer knows how to prevent this kind of problem (putting the method body into a synchronized(data) block), but it shows how easy it is to make mistakes which nearly never happen - and then sometimes in a critical moment. The reason is mutable state: The above method depends on the state of 'data', and if this object is changed somewhere else while the method is executing, we have a potential problem.
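
For completeness, a sketch of that fix in standard Java - note it only helps if every code path that modifies 'data' synchronizes on it as well:

int calcSum(List<Integer> data) {
    synchronized (data) { // no other thread may modify 'data' while we hold the lock
        int sum = 0;
        for (int i = 0; i < data.size(); i++) {
            sum += data.get(i);
        }
        return sum;
    }
}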

But that's not all. Let's look at another example: Caching, which is often required to get better runtime performance.

class CustomerList {
    private List<Customer> customers = new ArrayList<Customer>();
    private float average_age = -1; // cached value; -1 means 'not computed yet'

    public void notifyChanged() {
        average_age = -1;
    }

    public float getAverageAge() {
        if (average_age < 0) average_age = calcAverageAge();
        return average_age;
    }

    private float calcAverageAge() {
        if (customers.size() > 0) {
            float sum = 0;
            for (Customer c : customers) sum += c.getAge();
            return sum / customers.size();
        }
        return 0;
    }

    public void addCustomer(Customer c) {
        customers.add(c);
        notifyChanged();
    }
}

The calculation of the customers' average age is cached here in a field. If you add a customer, the cache is cleared and will be recalculated the next time it's required. This idiom is quite useful and common, but it has its dangers. The danger is that the state of the list isn't automatically reflected in the state of the 'average_age' field. If someone changes the age of a customer externally, the whole caching will fail, because this class has no idea that a change happened. There are several ways to avoid this problem; one is to simply add a reference to the containing CustomerList object to every Customer object and then add 'notifyChanged()' calls to each method which changes relevant data of the Customer object.
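
A sketch of the wiring that last sentence describes (field and method names are illustrative): every mutator of Customer now has to know its container and remember to notify it.

class Customer {
    private int age;
    private CustomerList owner; // back-link, set when the customer is added

    void setOwner(CustomerList owner) { this.owner = owner; }

    void setAge(int age) {
        this.age = age;
        if (owner != null) owner.notifyChanged(); // invalidate the cached average
    }

    int getAge() { return age; }
}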

But you see how this seemingly simple optimization increases the complexity of the program. First we only have the idea of the cache, then we discover that we need notifications, then we have to add some registration mechanism so that every Customer object knows its container. And if a Customer can be part of multiple CustomerList objects, we need the full-fledged observer pattern, have to identify each and every position in the code which can change a cached value etc. Imagine you have to maintain a complex program which uses several of those mechanisms, plus optimizations with multiple notification messages (so that the average_age cache isn't cleared if only the address of a customer changes). You see how complex our once simple program is getting.

All this trouble has only one reason: Mutable state. If you change one part of the state (the age of a customer), all dependent parts of the state (here average_age) have to be updated too. In complex programs we can even get circular dependencies and cross-linking. Caching is only one part of the problem; the problem occurs with each and every dependency in the program - for example, consider a GUI view which has to be updated each time the underlying data changes. And if the whole thing has to be multithreaded, you additionally have to sprinkle lots of 'synchronized' blocks all over the program.

But I think most Java programmers have known that for a long time. And so have programmers using Python, Ruby, Smalltalk etc. It's not a problem of a certain language, it's the programming paradigm: object orientation. Object orientation means that you have to do much 'by hand'. Each object, each dependency has to be maintained explicitly by the programmer. Sure, with clever libraries and frameworks all this is solvable, but you still have to do much by hand. Is this complexity really necessary?

The answer is no. But we need a different paradigm. Most programmers already know one: relational database management systems. Those systems do lots of caching and index updating totally automatically in the background. You only have to give an SQL query and the system computes the answer. It's using caches, updating indices, optimizing the order of table accesses and lots more. But the programmer simply has to provide a simple query. Sure, even the best RDBMSs have certain limitations, but they show a way to go.

But there are other ways to reduce the dangers of mutable state. One is to simply have none. But is that possible? Can we program without having mutable state? Yes, and many already know how: I'm talking about referential transparency (most often found in functional programming). This simply means that there are no 'side effects' in a program, and thus there can't be any mutation of data. How does it work? We require that each and every call to a function always gives the same result if we provide the same parameters. A language with referential transparency solves most of the problems mentioned here. Since data isn't mutable, there can't be concurrent modifications, and multithreading is as simple as it gets. And without mutation the compiler can do caching on its own: it simply has to look at the input parameters of a function, and if those are the same as before, it can also return the same value as before.
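
Memoization is the hand-written version of the caching such a compiler could do automatically. A minimal sketch in Java (the class and the 'doCalc' stand-in are mine, just for illustration) - note that it is only safe because the function is pure:

import java.util.HashMap;
import java.util.Map;

class MemoizedCalc {
    private final Map<Integer, Integer> cache = new HashMap<Integer, Integer>();

    int calc(int value) {
        Integer hit = cache.get(value);
        if (hit != null) return hit; // same argument, same result - guaranteed by purity
        int result = doCalc(value);
        cache.put(value, result);
        return result;
    }

    private int doCalc(int value) { return value * value; } // stand-in computation
}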

Of course referential transparency has its disadvantages: It's a bit hard to program with, because you can't change data. But how do you do any interesting things then? Imagine a simple interactive application which reads some data from the user, calculates something and writes the result to the screen. How could we write this in a language with referential transparency? (I use a Java-like syntax)

function (int, IOState) readData(IOState io);
function IOState writeData(IOState io, int data);
function int calc(int value);

function void littleApp(IOState io) {
    (int val, IOState new_io) = readData(io);
    if (val != 0) {
        int res = calc(val);
        IOState new_io2 = writeData(new_io, res);
        littleApp(new_io2);
    }
}

The first three functions above are defined somewhere else. The interesting part here is the 'IOState' parameter. It holds the state of the IO subsystem, which is used to read and write data from and to the terminal. Without this IOState, the 'readData' function would always return the same value, because with the same input each function in a referentially transparent language has to return the same result. The same goes for the 'writeData' function, which uses the IOState to chain the writes in the right order. The last call, 'littleApp(new_io2)', loops the application until someone enters a '0'. Recursion is necessary because we can't change a 'variable' (this would be mutation and could lead to side-effects, which are prohibited) - but it's cheap here, because the call is tail recursive and can be turned into a loop by a good compiler. Each value is only assigned once, so in fact we can only copy something, never change it.

But you see that it's really possible to write interactive programs without a single mutation of data. Everything happens by creating a (sometimes modified) copy of existing data. And if the user enters the same value twice, the compiler could safely return the earlier result, because a call to 'calc' only depends on a single parameter.

The above is the real appeal of functional programming - not the existence of closures or continuations. With referential transparency it's not only possible to remove lots of dangerous constructs and concurrency issues and get automatic caching; we also get better ways of testing (if each function only depends on its parameters, testing is much easier, because we don't need to set up a complex background state), and the compiler can do lots of optimizations and even make the program multithreaded by itself.

All those advantages are paid for with a different way of writing programs, and with some new difficulties which have their own new means of solving them (google for 'Monads Haskell' if you want to know more). But it shows that the intrinsic problems of object-oriented programming are in fact solvable if we use a different paradigm. Relational databases and functional programming are only two examples; there are more. But before I delve deeper into this, I will try to list some more reasons why OOP is dead.


Also have a look at Part 1 and Part 3 of this series.