Lets just look at an example (It would look similar if I use the Hindley-Milner type system but it's just about the basic principles not about a certain implementation. I think that Java is common knowledge today and so I'll use a Java-like syntax).
Let's start with the following basic structure:
class Customer {
String first_name;
String surname;
Address[] addresses;
Invoice[] invoices;
}
class Address {
String postcode;
String city;
String street;
}
class Invoice {
Date date;
Date date_payed;
Currency value;
}
This allows us to represent 'Customers' with 'Addresses' and 'Invoices' in our system. The above structure is very clean and easy to understand - but also totally useless. Why? Because we also need ways to query the structure above. For example we want to access all invoices in the system to create reminders. Or we want to access all customers to do some mailing. So we have to create some containers:
class Database {
Customer[] customers;
Invoice[] invoices;
Address[] addresses;
}
This 'Database'-object stores the data and allows us to access them if we want. But it also creates unnecessary ('accidental') complexity. Why? Because we have a rather artificial additional data structure only to access those data we already have declared. If we now do a
c = new Customer()
somewhere we have to put c into 'Database.customers'. And if we destroy a Customer, we have to remove it from 'Database'. This creates some complexity which is very easy to handle in this simple example, but could be the source of more difficult to handle problems in big and more complex applications. But this is it's only the beginning:
Next we want to find a Customer for a certain invoice. We could simply do a search in Database.customers but that would be much to slow in a real application. So we add a back-link to each 'Invoice':
class Invoice {
Date date;
Date date_payed;
Currency value;
Customer customer; // the back-link for quick access
}
This is an additional rather artificial addition to our structure. We need it only for performance reason, and it leads to new potential problems, because it has to be maintained to. But lets continue further:
Next we want to find all customers in a certain city. We could search Database.addresses to find the matching address and then Database.customers to find the customer for those addresses, but again in most situations this is simply not fast enough. So we have to extend our data structure once more:
class Address {
String postcode;
String city;
String street;
// back links. We use an array because we want to allow
// multiple Customers for each address
Customer[] customers;
}
class Database {
Customer[] customers;
Invoice[] invoices;
Address[] addresses;
// a map to find all addresses for a given city
Map city_lookup;
}
Those are only the declarations. You can easily imagine the additional code in your application to maintain those structures if you add or delete Customers, Addresses etc. Now scale this simple example to a real-world-application with hundreds of those classes, lots of different views and access-paths to you data structures and you see the problem. And I've not even talked about making the above structures persistent.
The funny thing is that all we really needed to describe our data is in the topmost data structure. It gives a complete description of the problem, but for performance and organization purposes we had to add lots of additional things which blur and complicate our once concise and self explaining structure. So all those additions create 'accidental complexity': Complexity which isn't inherent in the problem but only required for the specific implementation.
You may say that this is a contrived problem and would be handled by a SQL database anyway, instead of representing the structures in the way I did it here. But most data in programs is represented in a similar way. Let it be graphical elements of a GUI or a word-processor, the data in a 3D-modeler or the virtual world in a computer game: We always have data structures which are in principle quite simple but have lots of back links, indices, maps etc to provide access paths to the data.
In object oriented programming this kind of structuring the data is part of the paradigm: Objects have slots and those slots contain references to other data. All objects are intermingled in this way and objects without an access path are 'dead' and will be removed by the garbage collector (or spam the heap in a language without one).
This kind of structuring is so widely used, that its hard to even question it or think about a better way to representing those data. But lets give it a try.
In fact we already know another way: The above mentioned SQL database. It stores data not in structures with references but in relations. SQL is just a certain language to access and manage those relations, but it's in no way the only way to do this. A database stores data on external memory to be able to scale up to huge amounts of data, but we can also store local client data in relations. Just let me convert our example into a relational based one:
relation Customer {
Id customer_id;
String first_name;
String surname;
}
relation Address {
Id customer_id;
String postcode;
String city;
String street;
}
relation Invoice {
Id customer_id;
Date date;
Date date_payed;
Currency value;
}
(The above could've been normalized a bit further to make it more flexible, but for example purposes I let it this way).
The structure looks very similar, but it allow all kinds of accesses we could do with our much more complex non-relational structure above - and lots more: We could simple create joins between those relations and query all things we want to know: For example all customers at a given city/street/postcode, all customers with open invoices in a given city, all invoices payed in a given date range. And we don't have to maintain additional back links, maps etc. because this is all done by the programming language which could use indices or other optimizations on all data which is queried often. Automatically if we use something like runtime profiling or static global optimization or by manually by giving hits to the compiler.
The reason for this is, that the above relational structure doesn't contains fixed access paths. Instead of that the access paths are created in the moment we query the structure. This is obviously much more flexible than building the query right into the data structure. And while a single data structure can only be organized in a single way, we can write arbitrary kinds of queries.
So by using a relational structure instead of a referential ones we solved lots of problems. A 'big one' is the infamous O/R-mapper problem: Everybody who tried to use a relational database with a object oriented language knows about it. The idea to simply change the way the database works and use a OODBMS hasn't worked out, RDBMS are still state of the art. And because of the reasons mentioned above I suspect that they stay there (unless we find a new 'third way' to structure data). By changing the basic model of representing data to a relational one, we have also solved the O/R-mismatch.
But is it really possible to represent ALL kinds of data in a relational way? I think it is. I looked at many problems I've worked on in the past and every problem was at least equally and often much easier representable in a relational way. But one thing have to change: The query language. SQL simply isn't up to the task anymore. It maybe never was but while there was much development in the field of object-oriented and functional programming, relational stuff was a bit neglected in the last years. So SQL stayed at the top and became something like a synonym for querying relational data structures.
But this hasn't to be this way. SQL is only a programing language designed to solve this task. And in fact it misses lots of important things. Lets have a relation
parent-of(parent-id, child-id)
Now write a SQL query which decides if a given child-id is a child of a given parent-id, even if only indirect over some 'hops'. This is called the transitive closure of a relation and isn't solvable in general SQL (only a few RDBMS seem to support recursive queries which are needed to solve the problem). This kind of problem is rather common in general algorithms, another one is to sort elements in a topological order based on a similar relation (In fact most graph algorithms are useful on relations and should be easy to express in a useful query language).
So using relations instead of objects can give real advantages but we need language support to make is possible to not only query a RDBMS more easily but also represent all structured data in a full relational way. This way we no only solve the O/R impedance mismatch and the problems with references (I've written about this here and here), we also simplify the way we are programming: Data structures are the base of every algorithm and with a more powerful way of representing data structures it should be possible to gain lots of productivity.