Lateral Programming

Coding out of the box

Archive for the ‘databases’ Category

Designing databases for flexibility (II): Relationships

Posted by eutrilla on May 5, 2008

As mentioned in my previous post, the objective of this series of articles is to share some database design patterns that have helped me in the past to achieve, using a relational database as a backend, a reusable API for data persistence with a higher level of abstraction and flexibility, while still taking advantage of the tooling and previous knowledge of relational databases. In this second post I’ll try to cover a basic concept: the way entities are related to each other.

Relational vs. navigational databases

A common criticism about relational databases is that their structure is quite rigid and difficult to change. In normalized databases, every time an new relationship is defined between two entities, either one of them has to be modified to hold the PK of the other one (1..n and n..1 relationships) or a new table has to be created (n..n relationships). That, in itself, is not a lot of work. The problem is that these changes in the table structure may break many existing SQL queries, or require that new ones are created. The same comment applies for PL/SQLs and the application code that uses the data.

Also, conceptually, the relational approach has a very low level of abstration. The semantics of SQL statements are not directly related to the relationships between objects, but with the way these data can be stored.

Other database paradigms have this higher level of abstraction. These include Object Oriented Databases (OODBs) and Navigational Databases. The databases of this last group are created around the concepts of nodes and links between nodes, creating a graph. The main structural difference with relational databases is that, while an PK/FK can be established as any subset of columns of a table as long as they form a unique n-tuple, the relationships in navigational databases are stored in a known, fixed way so it is easy to get to a node from another.

Storing hierachical data in a relational structure

A common task that can be hard to implement in a relational database is the storage of hierarchical data, such as those of a tree. Most of the projects I have worked on required this functionality. It’s difficult to translate a tree into a ResultSet. A lot of data redundancy would be required to fit the tree into a table-like structure. In most cases, when the complete model has a lot of objects this preloading is just not a valid option, since it would take too long. We usually use lazy loading for the children of each node, which allows a shorter startup time and a better support to the update of each node.

To perform this lazy loading when using a typical relational schema, the GUI data model has to perform a different query for each object type, since it has to access a different set of tables. If new object types are added, or new relationships between them are defined, the code of this model also has to be modified.

A more flexible approach

This can be avoided if we define relationships between entities always in the same way, so the queries for all object types all look similar – or are indeed the same. This is achieved if all entity definitions and relationship declarations are stored in a single set of tables, no matter the type of each entity. For 1..n and n..1 relationships, this could be implemented using a single table, but it is advisable to provide support for the general n..n case, therefore using two different tables: one for entities (nodes) and one for relationships between them. In this way, the query used to get the children of an object is always the same, and filling up the tree becomes a simple task.

These two tables are, in fact, defining the vertices (nodes) and edges (relationships) of the object graph. The most important consequence is that the possible relationships between objects are not defined by the structure of the tables themselves, but by the data of the objects stored in them. If we have an object representing, say, a car, and we want to introduce a relationship with a driver, we can do it without modifying the structure of either one since we have provided a generic storage for any relationships. In a traditional model, we would either need to add a column to one of the tables storing the cars and drivers data, or create a new specific table for them. With these two tables describing entities and relationships, we only have to add rows to the relationship table. And since both tables are already connected by means of foreign keys, no special administration needs to be performed to ensure referential integrity: it is achieved by construction.

In this way, the flexibility of the database design has been improved by removing the logical object definition from the physical table structure, and moving it into user data. The tables don’t define the structure of the objects, but the purpose of each of its parts (in this case, the relationships).

Well, this last statement is only partially true, since no restrictions have been imposed on the data model yet. For instance, it would be possible to associate directly a car with a driver’s pet as long as both of them are stored in the database, no matter whether it makes sense or not. This is the  just the most simple case, the most dynamic and less restrictive, where relationships are defined by the own user on runtime.

Additional control and semantical checks can be enforced by a trigger or as part of a stored procedure used to insert the data in the physical tables. In order to keep it flexible, the valid relationships shouldn’t be hardcoded in the trigger itself, but stored in a separate metadata table. In this way it is possible to define new valid relationships just by adding rows into this metadata table, instead of by modifying the table structure itself. Again, purpose (how restrictions of relationships between entities are performed, the table structure) is separated from the data itself (which relationships are actually legal, the contents of the table).

It is perfectly possible to fill in this table during the database design phase and allow only read access to the database users, so the logical structure is fixed to the users but easily modifiable by an administrator with privileged access. This topic will be discussed in greater depth at a later post.

Performance issues

Not a long time ago I found this article and this other one by Hank Williams, where he seems to be thinking in the same line about the advantages of graph data models. In his opinion, though, it would be required an approach other than SQL to implement graph structures due to:

  • self-joins (tables linked to themselves) can not be optimised by the DB engine
  • SQL doesn’t provide a good abstraction

I do have my doubts with the first point. The only distinction between the ‘stardard’ model and the ‘graph’ model is that in the former, the DB engine has to look just in a subset of the entities in the complete model, whereas in the latter it has to look in the whole set. But it is in this field where relational databases excel. The use of indexes and stored procedures can heavily optimise the time on this type of queries. Nevertheless, some reduction on performance compared to a ‘traditional’ data model may be unavoidable. It is up to the database designer to  estimate the overhead time in the used RDBMS and weight flexibility against performance to decide whether this pattern is useful for the task at hand.

Regarding the second point, I’d say that it doesn’t matter. SQL is not required to provide an abstraction, as it is just the means to implement it. The trick is to make the database structure independent of the object structure, so that the code is reusable between different entity types. Then, even if low abstraction code has to be dealt with to read and write the data into the relational tables, it only has to be developed once. Later, other users can use the generic high abstraction layer created instead of accessing directly the DB using SQL queries (although they can, if required). The fact that the tables are always the same not only simplifies the code of this DAO layer, it also allows the use of stored procedures, which can improve performance noticeably.

Navigating a graph is an inherently recursive task. If the structure relation is fixed it can be coded in a normal SQL query, but in many cases it is needed to load a node to get the reference of a second, load it and get the reference of a third, and so on. Some RDBMSs, such as Oracle and SQL Server, have special constructs to simplify the SQL code, but they can’t be really optimised due to their procedural nature. Nevertheless, if needed there are some techniques that trade storage space and a bit more processing while writing data for a higher read speed, such as storing in a table the inferred (indirect) relationships between nodes, as depicted in this article.


Posted in databases | 3 Comments »

Designing databases for flexibility

Posted by eutrilla on April 27, 2008

In the last weeks I’ve noticed several posts about alternative database paradigms: CouchDb, SimpleDb, BigTable… All of them promise infinite scalability. This isn’t really important for me, since the projects I’ve worked for don’t require a big concurrent workload. Or at least, not that big that a  RDBMS can’t handle. But they have one more thing in common: an easier way to store and retrieve data, away from plain SQL.

I do agree that storing objects feels much more natural and easy to understand for the developers than adding rows to one or several tables. I’ve always felt unconfortable about the simplistic SQL statements. I had to spend a lot of time thinking about how tables were connected, instead of how the domain objects were related, so I tried to figure out how I could isolate the application code from the database structure. ORM frameworks simplified things, but not completely: they replace specific SQL queries by specific mappings.

Object-Oriented databases are an interesting concept. Forget about database schemas and ORM mappings, and design your own domain model as you want. Then simply store it all, just like if it was kept in memory, but being able to do queries on it. Tempting, isn’t it? Of course it has a price: they are really slow compared to RDBs.

The cloud databases follow the same lines, but instead of storing full-fledged OOD objects, they handle maps of key-value pairs, containing the named parameters of the object. I guess that this is the trick to improve performance: by limiting the format of the objects, it is possible to optimise the way the data is queried. Also, t the objects only store “simple” values, such as numbers and strings, but not other objects – each entry is limited in size. They may contain the identifier of another object, so you can retrieve it later, but a read operation doesn’t return all the objects stored in a tree, for instance. That’s a strong point, and at the same time a weakness, of OODBs: getting the whole tree is easy, but getting just the root node is comparatively slow, since you are retrieving the whole tree anyway.

The problem about these cloud-based DBs it that their field of application is restricted in general to Web services. In many other cases, we are stuck with relational databases, at least for a while. But the way they store data doesn’t require a distributed database. It can very well implemented over a relational database structure, using a code DAO layer to manage the SQL details. Depending on the features required it would require more or less work, but once finished it can be used for any type of object composed of key-value pairs. During the last years, I’ve implemented and used in my day job several of such database designs, with different functionality sets. They allowed us to speed up development and to make changes easily to the data model, including the creation of new entity types or the addition of properties to existing ones. And of course, it was still a relational database, so we could keep on doing specific SQL queries for critical or complex searches, and use all our usual SQL development tools.

Some say that relational databases are dead. I’d say that they are pretty alive and will be for some time at least, but that the way we use them may be not the best one in all cases, and that new approaches can be tried. The characteristics of RDBMSs may not be the best in terms of abstraction, but make them a good backend over which higher abstraction level databases can be built. In the next posts, I’ll try to share some of the solutions that I’ve found useful in the past.

Posted in databases | Tagged: | 2 Comments »