Pages

Saturday, December 30, 2006

The problem of dealing with data-structures

It seems to be a 'solved problem': How to represent and use different kinds of data in programming languages. But in fact there's lot of unnecessary complexity and lots of artificial difficulties if you really think about it.

Lets just look at an example (It would look similar if I use the Hindley-Milner type system but it's just about the basic principles not about a certain implementation. I think that Java is common knowledge today and so I'll use a Java-like syntax).

Let's start with the following basic structure:
   
class Customer {
String first_name;
String surname;

Address[] addresses;
Invoice[] invoices;
}

class Address {
String postcode;
String city;
String street;
}

class Invoice {
Date date;
Date date_payed;
Currency value;
}

This allows us to represent 'Customers' with 'Addresses' and 'Invoices' in our system. The above structure is very clean and easy to understand - but also totally useless. Why? Because we also need ways to query the structure above. For example we want to access all invoices in the system to create reminders. Or we want to access all customers to do some mailing. So we have to create some containers:
   
class Database {
Customer[] customers;
Invoice[] invoices;
Address[] addresses;
}

This 'Database'-object stores the data and allows us to access them if we want. But it also creates unnecessary ('accidental') complexity. Why? Because we have a rather artificial additional data structure only to access those data we already have declared. If we now do a

c = new Customer()

somewhere we have to put c into 'Database.customers'. And if we destroy a Customer, we have to remove it from 'Database'. This creates some complexity which is very easy to handle in this simple example, but could be the source of more difficult to handle problems in big and more complex applications. But this is it's only the beginning:

Next we want to find a Customer for a certain invoice. We could simply do a search in Database.customers but that would be much to slow in a real application. So we add a back-link to each 'Invoice':

class Invoice {
Date date;
Date date_payed;
Currency value;

Customer customer; // the back-link for quick access
}

This is an additional rather artificial addition to our structure. We need it only for performance reason, and it leads to new potential problems, because it has to be maintained to. But lets continue further:

Next we want to find all customers in a certain city. We could search Database.addresses to find the matching address and then Database.customers to find the customer for those addresses, but again in most situations this is simply not fast enough. So we have to extend our data structure once more:

class Address {
String postcode;
String city;
String street;

// back links. We use an array because we want to allow
// multiple Customers for each address
Customer[] customers;
}

class Database {
Customer[] customers;
Invoice[] invoices;
Address[] addresses;

// a map to find all addresses for a given city
Map city_lookup;
}

Those are only the declarations. You can easily imagine the additional code in your application to maintain those structures if you add or delete Customers, Addresses etc. Now scale this simple example to a real-world-application with hundreds of those classes, lots of different views and access-paths to you data structures and you see the problem. And I've not even talked about making the above structures persistent.

The funny thing is that all we really needed to describe our data is in the topmost data structure. It gives a complete description of the problem, but for performance and organization purposes we had to add lots of additional things which blur and complicate our once concise and self explaining structure. So all those additions create 'accidental complexity': Complexity which isn't inherent in the problem but only required for the specific implementation.

You may say that this is a contrived problem and would be handled by a SQL database anyway, instead of representing the structures in the way I did it here. But most data in programs is represented in a similar way. Let it be graphical elements of a GUI or a word-processor, the data in a 3D-modeler or the virtual world in a computer game: We always have data structures which are in principle quite simple but have lots of back links, indices, maps etc to provide access paths to the data.

In object oriented programming this kind of structuring the data is part of the paradigm: Objects have slots and those slots contain references to other data. All objects are intermingled in this way and objects without an access path are 'dead' and will be removed by the garbage collector (or spam the heap in a language without one).

This kind of structuring is so widely used, that its hard to even question it or think about a better way to representing those data. But lets give it a try.

In fact we already know another way: The above mentioned SQL database. It stores data not in structures with references but in relations. SQL is just a certain language to access and manage those relations, but it's in no way the only way to do this. A database stores data on external memory to be able to scale up to huge amounts of data, but we can also store local client data in relations. Just let me convert our example into a relational based one:

relation Customer {
Id customer_id;
String first_name;
String surname;
}

relation Address {
Id customer_id;
String postcode;
String city;
String street;
}

relation Invoice {
Id customer_id;
Date date;
Date date_payed;
Currency value;
}

(The above could've been normalized a bit further to make it more flexible, but for example purposes I let it this way).

The structure looks very similar, but it allow all kinds of accesses we could do with our much more complex non-relational structure above - and lots more: We could simple create joins between those relations and query all things we want to know: For example all customers at a given city/street/postcode, all customers with open invoices in a given city, all invoices payed in a given date range. And we don't have to maintain additional back links, maps etc. because this is all done by the programming language which could use indices or other optimizations on all data which is queried often. Automatically if we use something like runtime profiling or static global optimization or by manually by giving hits to the compiler.

The reason for this is, that the above relational structure doesn't contains fixed access paths. Instead of that the access paths are created in the moment we query the structure. This is obviously much more flexible than building the query right into the data structure. And while a single data structure can only be organized in a single way, we can write arbitrary kinds of queries.

So by using a relational structure instead of a referential ones we solved lots of problems. A 'big one' is the infamous O/R-mapper problem: Everybody who tried to use a relational database with a object oriented language knows about it. The idea to simply change the way the database works and use a OODBMS hasn't worked out, RDBMS are still state of the art. And because of the reasons mentioned above I suspect that they stay there (unless we find a new 'third way' to structure data). By changing the basic model of representing data to a relational one, we have also solved the O/R-mismatch.

But is it really possible to represent ALL kinds of data in a relational way? I think it is. I looked at many problems I've worked on in the past and every problem was at least equally and often much easier representable in a relational way. But one thing have to change: The query language. SQL simply isn't up to the task anymore. It maybe never was but while there was much development in the field of object-oriented and functional programming, relational stuff was a bit neglected in the last years. So SQL stayed at the top and became something like a synonym for querying relational data structures.

But this hasn't to be this way. SQL is only a programing language designed to solve this task. And in fact it misses lots of important things. Lets have a relation

parent-of(parent-id, child-id)

Now write a SQL query which decides if a given child-id is a child of a given parent-id, even if only indirect over some 'hops'. This is called the transitive closure of a relation and isn't solvable in general SQL (only a few RDBMS seem to support recursive queries which are needed to solve the problem). This kind of problem is rather common in general algorithms, another one is to sort elements in a topological order based on a similar relation (In fact most graph algorithms are useful on relations and should be easy to express in a useful query language).

So using relations instead of objects can give real advantages but we need language support to make is possible to not only query a RDBMS more easily but also represent all structured data in a full relational way. This way we no only solve the O/R impedance mismatch and the problems with references (I've written about this here and here), we also simplify the way we are programming: Data structures are the base of every algorithm and with a more powerful way of representing data structures it should be possible to gain lots of productivity.

11 comments:

Ricky Clarkson said...

I don't know whether this will affect the thoughts you expressed here, but it might be worth reading Chapter 3 of Practical Common Lisp, which discusses ways of writing data structures that can be queried. Perhaps not entirely applicable to Java, but worth understanding.

In general though, I don't think it's often worth duplicating data to gain performance - the normalised object model is the best. If normalisation works for databases, then it works for other programs.

Although it's obviously a daring step for code modelling the real world, if you make each of these objects immutable, including the Database, i.e., so that changes require a new object, then you will be forced into updating all the data at the same time.

In other words, should you modify an Invoice's datePaid, you'll need to create a new Invoice to hold that, create a new Customer, and create a new Database. This way the problems of duplicated references can be minimised without normalising (a dubious requirement).

Performance implications of this vary - it's more important to make a (correct) intuitive model than to go for the best performance.

Karsten Wagner said...

I know the book, but the example addresses a slightly different topic. And it's 'toy-code'.

Creating immutable Databases is something I thought about, but I don't think it's really practical. The reason is that databases are inherently mutable. You can model them in a pure functional way, but in the end you only simulate mutability with typical functional methods without gaining something and instead even blur the problem. Using transactions or atomic operations instead is much more appropriate for the problem.

And I have to disagree about the relative unimportance of performance. I certain applications you really can allow yourself to not care about it, but thats rare. Most often performance counts and than you have to find a appropriate solution which is fast enough. This is not about a constant factor 2 or even 3. Its about a O(n^2) algorithm compared to a O(n) one. And in many cases a transformation from an algorithm with mutation to one without can lead to a higher order of complexity.

But even if it's only a constant factor, it can be to much if the factor is 50 or even higher. Buying two database server instead of one may be affordable if you can reduce development and maintenance costs severely. But buying 50 instead of one is almost always impossible. And if we look in the field of shrink wrapped software it's even worse because most user simply won't by a faster computer to use your program. Java still have problems to make a breakthrough on the desktop because of this and typical Java programs are only slightly slower than a C++ equivalent.

Thomas Schilling said...

I guess you want an object store and (possibly) a built-in query language. Specifying relations might be put into a separate embedded language, though I don't think it's a bad idea to put the relations into the object itself.
Take a look at Franz Inc.'s Allegro Cache (http://www.franz.com/products/allegrocache/) and LINQ or MS' Power Shell. These are two embedded SQL-like query languages. Writing embedded query languages should be possible in most languages with some form of macro or meta-programming system. (I.e. Scheme, CL, Haskell, Ruby, Smalltalk, ...)

Unknown said...

There's always keeping things in an XML DOM. If you fancy using .NET, you can use DataSets - which you can query with a mini SQL-like language - and soon with LINQ - SQL-like syntax built into C# and VB.NET. This allows you to query both DataSets and XML DOM (and perhaps other stuff).

Karsten Wagner said...

@Thomas Schilling:
It's not about storing data on disk, its about representing them. In memory, on disk, where ever.

Writing query languages in the language itself (via macros or similar meta-programming facilities) can't really solve the mentioned problems, because without full program optimization the result would have rather poor performance. And how do you want to represent data in a path independent way?

@rikkus:
XML DOM is a simple hierachical structure. Many problems can't be put into a hierachy or need different hierachies depending on the algorithm you want to execute on the data. Hierachical structures aren't the solution, thats why relational databases wiped away the earlier hierachical databases long ago.

But for in-memory-data we still use this method of structuring instead of using the more powerful (but harder to efficiently implement) relational method.

Tony Morris said...

The reason is that databases are inherently mutable.
Here is your mistake.

Is there something 'inherently mutable' about your preferred revision control system?

Think about it some more.

Anonymous said...

Actually, it is not SQL missing the transitive closure feature. It is the relational algebra, the very foundation of RDBMs which does not permit this operation with its basic operators. That is actually the reason, why sql is not turing complete.

As you may see, SQL is not the reason why you cannot do this, the mathematical foundation is.

Of course, several RDBMs (Oracle, Postgres I think...) have left that behind and expanded their syntax so they can handle graphs and the like better.

There is a good reason why SQL does not include such functionality; it is not that it is "old" or sth.

I cannot see anything else that is wrong with SQL. It's one of the best thought through languages computer science has seen, imho. Due to it's beauty to express mathematical terms directly into your practical problem, it makes design on a very abstract level not only possible but also effective and even rewarding.

Anonymous said...

SQLite is an actively developed and maintained public-domain C library implementing a SQL-queried relational database. It's easy to incorporate into programs in most languages, and it supports in-memory databases (which can be trivially copied to and from disk databases in whole or in part). Extremely handy for desktop applications and any web application where you would otherwise use MySQL on a single machine.

Karsten Wagner said...

@dibblego:
I am continuing to think about it. Of course mutation can be seen as a series of transformations on some empty data structure. The question is, which way of looking at it makes programming more easy. While the mutation approach is more natural, the transformation approach has other advantages like atomic updates without the need for some additional 'artificial' facility.

@Justin:
I think it's time to move forward. SQL is a query language but I can't imagine how to do general programming in SQL, even if extended a bit. To make progress in this area we have to play with new ideas and languages instead of staying with the old ones.

I don't see that relational algebra forbids the transitive closure. Relational algebra also don't contain joins because they can easily defined on top of the basic operations. Likewise the transitive closure can be easily defined with a recursive definition on those basic operations. If your reasoning is correct then we shouldn't have a 'join' operator in SQL.

I don't know a good reason why they don't included it into SQL, so maybe you can explain it.

Anonymous said...

In the way of curious circumstances, LtUs current top post describes (and neatly solves, at least for java) the gripe you have.

I think the main reason, that this subject is a non-issue for most project is that either no full relationships are needed, i.e. the relationship is only used in one direction, or it gets too complicated, so that a database is used anyhow.

Anonymous said...

@Karsten

As I already said, I am far off rom calling SQL old or not enough, and I do not see any necessity to improve SQL apart from a standard transitive closure syntax.

Why is the transititve closure not in the relational algebra? Well, the relational algebra is a construct on which the computer scientists back in the 70s/80s build query languages. The chose the relational algebra, because it is a pure mathematical construct, on which you can work in a very formal way and can even proof about anything.

Three of its operations are cross product, projection and filter. Out of these three you can create the JOIN operation, thus the JOIN operation is part of the relational algebra.

But you cannot express the transitive closure as a single term in the relational algebra; pick up your favorite database book (I recommend 'Datenbanksysteme', by Alfons Kemper) and read up the part of the relational calculus.