Saturday, December 30, 2006

The problem of dealing with data-structures

It seems to be a 'solved problem': how to represent and use different kinds of data in programming languages. But in fact there's a lot of unnecessary complexity and many artificial difficulties if you really think about it.

Let's just look at an example. (It would look similar if I used the Hindley-Milner type system, but this is about the basic principles, not about a certain implementation. I think that Java is common knowledge today, so I'll use a Java-like syntax.)

Let's start with the following basic structure:
   
class Customer {
    String first_name;
    String surname;

    Address[] addresses;
    Invoice[] invoices;
}

class Address {
    String postcode;
    String city;
    String street;
}

class Invoice {
    Date date;
    Date date_payed;
    Currency value;
}

This allows us to represent 'Customers' with 'Addresses' and 'Invoices' in our system. The above structure is very clean and easy to understand - but also totally useless. Why? Because we also need ways to query it. For example, we want to access all invoices in the system to create reminders, or all customers to do some mailing. So we have to create some containers:
   
class Database {
    Customer[] customers;
    Invoice[] invoices;
    Address[] addresses;
}

This 'Database' object stores the data and allows us to access it when we want. But it also creates unnecessary ('accidental') complexity. Why? Because we now have a rather artificial additional data structure whose only purpose is to give access to data we have already declared. If we now do a

c = new Customer()

somewhere, we have to put c into 'Database.customers'. And if we destroy a Customer, we have to remove it from 'Database'. This creates complexity which is very easy to handle in this simple example, but which can be the source of much harder problems in big and complex applications. And this is only the beginning:

Next we want to find the Customer for a certain invoice. We could simply do a search in Database.customers, but that would be much too slow in a real application. So we add a back link to each 'Invoice':

class Invoice {
    Date date;
    Date date_payed;
    Currency value;

    Customer customer; // the back-link for quick access
}

This is another rather artificial addition to our structure. We need it only for performance reasons, and it leads to new potential problems because it has to be maintained too. But let's continue:

Next we want to find all customers in a certain city. We could search Database.addresses to find the matching addresses and then Database.customers to find the customers for those addresses, but again, in most situations this is simply not fast enough. So we have to extend our data structure once more:

class Address {
    String postcode;
    String city;
    String street;

    // back links. We use an array because we want to allow
    // multiple Customers for each address
    Customer[] customers;
}

class Database {
    Customer[] customers;
    Invoice[] invoices;
    Address[] addresses;

    // a map to find all addresses for a given city
    Map city_lookup;
}

Those are only the declarations. You can easily imagine the additional code in your application to maintain those structures when you add or delete Customers, Addresses, etc. Now scale this simple example up to a real-world application with hundreds of those classes and lots of different views of and access paths to your data structures, and you see the problem. And I haven't even talked about making the above structures persistent.
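To make that concrete, here is a rough sketch of the bookkeeping a single 'add' already requires. It assumes the arrays above are replaced by growable lists and that city_lookup maps a city name to the addresses in it; the helper itself is hypothetical, not code from a real application:

void addCustomer(Database db, Customer c) {
    db.customers.add(c);                  // register in the container
    for (Address a : c.addresses) {
        db.addresses.add(a);              // register the address
        a.customers.add(c);               // maintain the back link
        List<Address> inCity = db.city_lookup.get(a.city);
        if (inCity == null) {             // maintain the city index
            inCity = new ArrayList<Address>();
            db.city_lookup.put(a.city, inCity);
        }
        inCity.add(a);
    }
}

Deleting a customer means undoing every one of these steps, and forgetting a single one leaves the database inconsistent.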

The funny thing is that all we really need to describe our data is in the topmost data structure. It gives a complete description of the problem, but for performance and organization purposes we had to add lots of additional things which blur and complicate our once concise and self-explanatory structure. All those additions create 'accidental complexity': complexity which isn't inherent in the problem but only required for the specific implementation.

You may say that this is a contrived problem which would be handled by a SQL database anyway, instead of representing the structures the way I did here. But most data in programs is represented in a similar way. Whether it's the graphical elements of a GUI or a word processor, the data in a 3D modeler or the virtual world in a computer game: we always have data structures which are in principle quite simple but carry lots of back links, indices, maps, etc. to provide access paths to the data.

In object-oriented programming this way of structuring data is part of the paradigm: objects have slots, and those slots contain references to other data. All objects are intermingled in this way, and objects without an access path are 'dead' and will be removed by the garbage collector (or clutter the heap in a language without one).

This kind of structuring is so widely used that it's hard to even question it or think about a better way of representing data. But let's give it a try.

In fact we already know another way: the above-mentioned SQL database. It stores data not in structures with references but in relations. SQL is just one language to access and manage those relations; it's in no way the only way to do this. A database stores data on external memory to be able to scale up to huge amounts of data, but we can also store local client data in relations. Let me convert our example into a relation-based one:

relation Customer {
    Id customer_id;
    String first_name;
    String surname;
}

relation Address {
    Id customer_id;
    String postcode;
    String city;
    String street;
}

relation Invoice {
    Id customer_id;
    Date date;
    Date date_payed;
    Currency value;
}

(The above could have been normalized a bit further to make it more flexible, but for the purposes of this example I'll leave it this way.)

The structure looks very similar, but it allows all the kinds of access we could do with our much more complex non-relational structure above - and lots more: we can simply create joins between those relations and query everything we want to know, for example all customers in a given city/street/postcode, all customers with open invoices in a given city, or all invoices paid in a given date range. And we don't have to maintain additional back links, maps, etc., because this is all done by the programming language, which could use indices or other optimizations on all data which is queried often - automatically, if we use something like runtime profiling or static global optimization, or manually, by giving hints to the compiler.
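Spelled out by hand in the Java-like syntax from above, one of those queries is just a nested-loop join; a relational language would derive something like this (plus suitable indices) from the query itself. The method name and the list-based rendering of the relations are my assumptions:

// all customers with an address in the given city
List<Customer> customersInCity(List<Customer> customers,
                               List<Address> addresses,
                               String city) {
    List<Customer> result = new ArrayList<Customer>();
    for (Customer c : customers) {
        for (Address a : addresses) {
            if (a.customer_id.equals(c.customer_id)
                    && a.city.equals(city)) {
                result.add(c);
                break;   // one matching address is enough
            }
        }
    }
    return result;
}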

The reason for this is that the above relational structure doesn't contain fixed access paths. Instead, the access paths are created the moment we query the structure. This is obviously much more flexible than building the query right into the data structure. And while a single data structure can only be organized in one way, we can write arbitrary kinds of queries.

So by using a relational structure instead of a referential one we solve lots of problems. A big one is the infamous O/R mapper problem: everybody who has tried to use a relational database with an object-oriented language knows about it. The idea of simply changing the way the database works and using an OODBMS hasn't worked out; RDBMSs are still state of the art, and for the reasons mentioned above I suspect they will stay there (unless we find a new 'third way' to structure data). By changing the basic model of representing data to a relational one, we have also solved the O/R mismatch.

But is it really possible to represent ALL kinds of data in a relational way? I think it is. I have looked at many problems I've worked on in the past, and every one was at least equally well and often much more easily representable in a relational way. But one thing has to change: the query language. SQL simply isn't up to the task anymore. Maybe it never was, but while there has been much development in the field of object-oriented and functional programming, the relational world has been a bit neglected in recent years. So SQL stayed on top and became something like a synonym for querying relational data structures.

But it doesn't have to be this way. SQL is only one programming language designed to solve this task, and in fact it misses lots of important things. Let's take a relation

parent-of(parent-id, child-id)

Now write a SQL query which decides whether a given child-id is a child of a given parent-id, even if only indirectly over several 'hops'. This is called the transitive closure of a relation, and it isn't solvable in standard SQL (only a few RDBMSs seem to support the recursive queries needed to solve the problem). This kind of problem is rather common in general algorithms; another one is sorting elements in topological order based on a similar relation. (In fact, most graph algorithms are useful on relations and should be easy to express in a useful query language.)
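For illustration, here is what a query language would have to provide built in, written out by hand as a breadth-first search. The Map-of-lists rendering of the parent-of relation, the use of plain int ids and the method name are my own assumptions:

// does childId appear below parentId, over any number of hops?
boolean isDescendant(Map<Integer, List<Integer>> parentOf,
                     int parentId, int childId) {
    Deque<Integer> todo = new ArrayDeque<Integer>();
    Set<Integer> seen = new HashSet<Integer>();
    todo.add(parentId);
    while (!todo.isEmpty()) {
        int current = todo.remove();
        if (!seen.add(current)) continue;   // already visited
        List<Integer> children = parentOf.get(current);
        if (children == null) continue;     // no children recorded
        for (int child : children) {
            if (child == childId) return true;
            todo.add(child);
        }
    }
    return false;
}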

So using relations instead of objects can give real advantages, but we need language support to make it possible not only to query an RDBMS more easily but also to represent all structured data in a fully relational way. This way we not only solve the O/R impedance mismatch and the problems with references (I've written about this here and here), we also simplify the way we program: data structures are the basis of every algorithm, and with a more powerful way of representing them it should be possible to gain a lot of productivity.

Wednesday, December 27, 2006

My Haskell experiences

I've known Haskell for some time now. The first time I used it was with a system named Gofer, after I had read a book about the implementation of a compiler for the Miranda language (no Miranda system was available to me at the time because it was a commercial product). Gofer was quite complex for its time and also quite useless for practical applications (especially considering the performance of the PCs of those days; I used it on an 8 MHz Atari ST). So I went back to C (and later C++) but tried out functional languages again from time to time. The one I used most was Caml (later called OCaml), but I had certain problems with it and abandoned it after some frustrating experiences.

But now Haskell seems to be in vogue, and after playing around with it, doing some toy programming and trying things out, I decided that I had to do something 'real' in it. For some time now I have been casually working on a system for compiler creation, based on a customizable parser front end, a template engine for output, a rule-based language with forward inference for the semantic stuff, and a graphical browser to work with the system comfortably. Because I had reached a point where small incremental changes weren't enough anymore, I decided to do a bigger rewrite. It's written in Java, but I thought it would be a good idea to rewrite it in Haskell to learn more about the language.

So for some weeks I've done quite a bit of Haskell programming, produced about 2000 lines of working code, and had some experiences with Haskell I'd like to share.

Where Haskell really shines is pattern matching: creating tree walkers or subexpression matchers is a piece of cake, and if you're writing a compiler, this is really useful. The easy and powerful list handling is quite an asset too. Things which would require tens of lines in Java are done in 2-3 lines of Haskell. No more clumsy visitor patterns, no more boring List-object creation with complicated insertions, etc. All this is really smooth sailing, and where you can put it to good use, it gives a lot of productivity.
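As a small illustration (my own toy example, not code from the project): a tree walker which folds away neutral elements in an expression tree is just a list of patterns:

data Expr = Num Int | Add Expr Expr | Mul Expr Expr

-- rewrite x+0, 0+x, x*1 and 1*x away, recursing into subtrees
simplify :: Expr -> Expr
simplify (Add (Num 0) e) = simplify e
simplify (Add e (Num 0)) = simplify e
simplify (Mul (Num 1) e) = simplify e
simplify (Mul e (Num 1)) = simplify e
simplify (Add a b)       = Add (simplify a) (simplify b)
simplify (Mul a b)       = Mul (simplify a) (simplify b)
simplify e               = e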

Lazy evaluation is the icing on the cake, because it allows you to write code in a very natural way which would lead to rather bad performance in other languages, since it could easily compute lots of things which aren't actually required. To avoid this you often have to create quite artificial structures instead of simply being able to reuse old code. In Haskell you just create an infinite or big finite list, even if you may only need the first two elements. In a non-lazy language it's always necessary to have some criterion to abort the evaluation, while in Haskell you simply write whatever is most natural and let the function which requests the data choose how much of it it wants. Caching inclusive.
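A standard example of this (again not from the project): the full, infinite list of Fibonacci numbers can be defined directly, and each caller just takes as much as it needs:

-- an infinite list; nothing is computed until someone demands it
fibs :: [Integer]
fibs = 0 : 1 : zipWith (+) fibs (tail fibs)

firstTwo :: [Integer]
firstTwo = take 2 fibs   -- evaluates only the first two elements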

Using a fast REPL (read-eval-print loop) to try out new code can also be quite nice (I used Hugs because of its good integration into the Windows environment). That's a big plus for languages like Lisp, and with Haskell it's much the same. One bothersome thing about Hugs is that it doesn't allow you to test any code as long as it finds a single error in one of your modules, even in a part of the code your test doesn't touch. That's a result of the compilation model and type inference, and it's partly understandable. But compared to using the REPL in Lisp it's still a bit annoying.

Also, in contrast with OCaml, the Haskell type system is much more powerful and allows easy overloading of functions. In OCaml I always had to invent lots of names or use qualified names for functions because no overloading was possible (but I don't know for sure whether this is still true, because the last time I used OCaml was in 2002 and maybe they have made some advances since then).

Haskell has type inference, and while this can really make code more compact because of the lack of required type annotations, it also has its downside: a new line somewhere in your code can easily create compiler errors somewhere else. Type inference works by creating lots of constraints which the inferencer then tries to solve. So when an error occurs, the compiler flags it somewhere in the dependency chain, often at a place where everything is totally correct, while the real error is somewhere else. If you develop code incrementally this is not so bad, because you know what you have written since the last run and can look at the right spot to figure out the error. But if you write longer stretches of code without testing the program in between, it can take you quite a bit of time to find out where the error really is.

Haskell has a solution for this: type annotations. Simply declare the type you want a function to have and things work much more smoothly. So if you want to do something bigger than 20 lines in Haskell: use type annotations, at least for all top-level functions. Luckily it's easy to put annotations into your code later, because you can query the inferred type of a function whenever your code compiles and simply paste it into your program (though sometimes you should rewrite it a bit to allow more general typing). The code isn't as small as it would be without annotations, but that's better than lots of searching around for the source of compilation errors.
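For example (a made-up two-liner): the REPL reports the inferred type (":t countWords" in Hugs or GHCi), and pasting it back in pins the function down and keeps the error messages local:

countWords :: String -> Int
countWords = length . words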

The real problems with Haskell start to show up the moment you need some kind of state. And yes, you will need it: many things can be done in a pure functional style, but I often encountered situations where a single function deep down in the call hierarchy suddenly needed some information from the outside. And in some cases it even needed to modify that information. Then things start to get ugly.

In OCaml you can simply declare global, mutable variables, so the problem is a non-issue. But in Haskell that's impossible because it doesn't allow mutation. Even if our small block of code only needs to read some state from the outside, that's no longer directly possible in Haskell. In my project I often found a nice functional solution to a problem, but later, after I had put the (tested and well-working) code into action, I discovered that some function 'deep down' suddenly required access to outside data, and I had to do lots of rewrites because of this.

Haskell has a solution to this 'outside context problem': excessive use of monads. While a bit strange in the beginning, after using monads for a while you get accustomed to them quite quickly. But in more complex situations, the usage patterns also get more and more complicated.

To use outside state I often used the StateT monad transformer. It allows you to create your own state data and modify it like in an imperative language, and it can also wrap another monad (for example the IO monad). So if you use it you can access your state at will, modify it, and also have access to IO functionality to write out information while your program is executing.
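A minimal sketch of this pattern (my example, not code from the project): a counter threaded through a computation with StateT, wrapped around IO so the code can log as it goes:

import Control.Monad.State

type Counter a = StateT Int IO a

step :: String -> Counter ()
step msg = do
    n <- get                                    -- read the state
    liftIO (putStrLn (show n ++ ": " ++ msg))   -- lifted IO action
    put (n + 1)                                 -- write the state back

main :: IO ()
main = evalStateT (step "first" >> step "second") 0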

(This is another big issue I had with Haskell: simply creating some debugging output. In most languages you can just write to some logging stream to see what your program has done. This is often even more useful than a debugger, because you can examine the history of the program's execution and see what led to a certain error. But in Haskell you need access to the IO monad to write data out, and this requires you to change the way you write your programs.)
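A tiny illustration of that point (again my own example): the moment a pure function wants to emit one log line, its type changes, and every caller has to change with it:

-- the pure function...
area :: Double -> Double
area r = pi * r * r

-- ...and the 'same' function once it wants to log its argument
areaLogged :: Double -> IO Double
areaLogged r = do
    putStrLn ("area called with r = " ++ show r)
    return (pi * r * r)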

Also, the conciseness of Haskell stops the moment you want to access or change state. Even Java code is more concise than Haskell code which has to change things. I know this is probably intended to force programmers to write in a functional style, but compared to the otherwise clear and compact syntax of Haskell it's really a pain.

Monads encapsulate state in some kind of data. In fact, all monads are simply data constructors with a few required functions, so most monads work by returning themselves from functions. It requires a bit of backward thinking, because the state monad, for example, creates a chain of closures which are executed later by evaluating the value of the monad. So to use state you don't pass it to a function; the monad shows up only in the result type. While this is quite clever (and is also the reason why things like backtracking and continuations can be implemented as monads), from a practical point of view it means you have to rewrite your code:

Let's take two simple, small functions:

giveElements :: (a -> Bool) -> [[a]] -> [[a]]
giveElements condition = filter (detectElement condition)

detectElement :: (a -> Bool) -> [a] -> Bool
detectElement = any

The above seems a bit useless, but I've extracted it from its surrounding code to show the problem. The 'giveElements' function extracts certain element lists from a list of lists and returns the filtered list. The 'detectElement' function does the job of deciding which elements are in and which are out. It does this by checking whether any element of a list fulfills a certain condition.

So far, so nice. But after some work on your program you discover that your 'condition' function requires access to outside state to do its job. Maybe the function has to make a small modification to some data, or should simply write a short message into a log. To allow this, you have to rewrite the above functions to use a monad. For this example let's simply use the IO monad. Now the functions become:

import Control.Monad (filterM)   -- filterM lives in Control.Monad, not the Prelude

giveElements :: (a -> IO Bool) -> [[a]] -> IO [[a]]
giveElements condition = filterM (detectElement condition)

detectElement :: (a -> IO Bool) -> [a] -> IO Bool
detectElement c els = do
    r <- mapM c els
    return $ or r

The first function still looks quite similar; only 'filter' has been replaced by 'filterM'. 'filterM' is the monadic version of filter and passes the monad up from the condition. Without filterM the state wouldn't get up and out of the detectElement function (and of course there would be a type error).

detectElement looks a bit more different. The reason is that there is no monadic version of 'any', so I use a monadic map (the mapM function) to create a list of Bools by applying the condition to each element, and then check whether at least one of the results is true. (In the IO monad this can actually do more work than the first version: mapM runs the condition on every element, while 'any' stops at the first match.) And it required a complete rewrite of my code. It's still readable, but not as much as before.

And there are some other dreadfulnesses: the '$' after 'return' is necessary because if you simply wrote "return or r", the compiler would think or and r are both parameters to return (since 'return' is not part of Haskell's syntax but an ordinary function), so we need the '$' to change that. I could also write "return (or r)", but that's a bit longer, and with more complex expressions it can easily look as if you're programming in Lisp. And you can't simply inline the 'r' into the expression below, because '<-' is not an assignment but special syntax which creates a closure with r as its parameter.

All this has to be considered if you start to use monads. Just because I wanted to access the IO state from my condition function, I had to rewrite my little program substantially. And if there were no filterM, it would get even more difficult. For example, I had to use similar code with sets instead of lists - but there is no filterM for sets, only a normal filter. And you can't 'lift' (Haskell slang for 'transform a function by using a combinator') a filter function into a monadic one. Map or fold functions can be lifted to monadic forms because they take a function which returns an arbitrary value, and the monad can go in there. But a filter simply returns a Bool, and then there is no way to return the monad from within. So I had to convert the set to a list, filter that with filterM and then build a set again - which has worse runtime characteristics than filtering it directly.
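That detour looks roughly like this (my reconstruction of the workaround, with the invented name filterSetM):

import Control.Monad (filterM)
import qualified Data.Set as Set

-- no filterM exists for sets, so go through a list and back
filterSetM :: (Ord a, Monad m) => (a -> m Bool) -> Set.Set a -> m (Set.Set a)
filterSetM p s = do
    kept <- filterM p (Set.toList s)
    return (Set.fromList kept)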

And with more complex functions, the above translation from a non-monadic form into a monadic one is much more difficult, and IMO it also destroys the clarity of the code. State monads also have bad runtime characteristics because of their 'inside-out' evaluation, which can fill up the stack quickly. You can use IORefs instead - but then you can't simply transform your program to use continuations anymore, for example. And if you put the IO monad inside a state monad, you have to lift all IO-accessing functions via liftIO.

And what's the point of having a pure functional language, going through lots of trouble to maintain purity, only to end up with things like IORef, which you tried to get rid of in the first place? If you design a program to use the IO monad everywhere from the beginning, you have fewer problems maintaining and expanding it - but then you're using Haskell like a normal imperative language. What's the point of Haskell then?

Another bad point with Haskell is documentation: while you can find lots of introductions all over the net, it gets quite difficult to find information once you're beyond the basics and start to do real work. The 'official' docs are simply horrible: no examples, no real descriptions of how things work or should be used. And finding tutorials for the advanced stuff is quite difficult. You can get lucky and find a solution to a problem buried in some page of the Haskell wiki, a newsgroup or a research paper - but it takes time to find it, and sometimes there is no such solution. And this is only about 'language stuff', not about how to solve very specific situations in some obscure framework.

In retrospect: Haskell is a nice language, but if you need state handling it gets ugly. That just about kills the language for me. What's the use of concise and compact code if you have to rewrite it later into a much uglier form once you discover that you need state? This is no problem for small programs, which is why I never ran into it in my earlier experiments. But with a bigger program things changed. So my program still isn't completely implemented, and I will probably abandon the Haskell version and do the rewrite in Java again.

It's not a problem of functional programming in itself, it's a Haskell thing: while monads are quite powerful in principle, IMO they are a pain to use in practice. Because monads show up in the return type of a function, the easy composition of functions which is the prime reason for the power of functional programming only works as long as all those functions use (compatible) monads - and sadly, many functions simply don't exist in this form.

With OCaml, for example, you won't get this kind of problem, but there you have unlimited mutation, which has its own problems. There are, however, other pure concepts, for example uniqueness typing (Clean) or transparently chaining state parameters through calls (Mercury). And maybe with more time to get accustomed to the use of monads I would see things differently. But since I don't have unlimited time and need to get things done in a programming language, I consider my Haskell experiment a failure - though only a partial one, because it was nonetheless fun and brought lots of insights.

So what to do with Haskell? Maybe it would be an idea to put monads into every return type automatically. If I write something like:

func :: a -> b

then the compiler would internally create

func :: Monad m => a -> m b

and also rewrite all bodies into do/return syntax. I don't know whether this would always work or really solve all the problems, but I suspect that by hiding the use of monads better, and letting the compiler decide where it can safely drop the monad constructors, you would get your 'clear and normal'-looking programs back even if they use monads everywhere. I'm far from a Haskell expert, so take this with a grain of salt. But since Haskell already provides some syntactic support for monads, why not go one step further? If it weren't for the problem with state (and documentation - but that's only a question of effort), Haskell could be a very nice and maybe even the 'next big' thing.
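A rough illustration of why something in this direction could work (my sketch, not a worked-out proposal): a function written once against an arbitrary monad already serves both worlds today. With the Identity monad it behaves like the pure original; with IO it can log:

import Control.Monad.Identity (runIdentity)

-- the earlier detectElement, generalized over the monad
detectElement :: Monad m => (a -> m Bool) -> [a] -> m Bool
detectElement c els = do
    r <- mapM c els
    return (or r)

pureUse :: Bool
pureUse = runIdentity (detectElement (return . even) [1, 2, 3 :: Int])

ioUse :: IO Bool
ioUse = detectElement (\x -> print x >> return (even x)) [1, 2, 3 :: Int]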