Designing databases for flexibility (II): Relationships
Posted by eutrilla on May 5, 2008
As mentioned in my previous post, the objective of this series of articles is to share some database design patterns that have helped me in the past to achieve, using a relational database as a backend, a reusable API for data persistence with a higher level of abstraction and flexibility, while still taking advantage of the tooling and previous knowledge of relational databases. In this second post I’ll try to cover a basic concept: the way entities are related to each other.
Relational vs. navigational databases
A common criticism about relational databases is that their structure is quite rigid and difficult to change. In normalized databases, every time an new relationship is defined between two entities, either one of them has to be modified to hold the PK of the other one (1..n and n..1 relationships) or a new table has to be created (n..n relationships). That, in itself, is not a lot of work. The problem is that these changes in the table structure may break many existing SQL queries, or require that new ones are created. The same comment applies for PL/SQLs and the application code that uses the data.
Also, conceptually, the relational approach has a very low level of abstration. The semantics of SQL statements are not directly related to the relationships between objects, but with the way these data can be stored.
Other database paradigms have this higher level of abstraction. These include Object Oriented Databases (OODBs) and Navigational Databases. The databases of this last group are created around the concepts of nodes and links between nodes, creating a graph. The main structural difference with relational databases is that, while an PK/FK can be established as any subset of columns of a table as long as they form a unique n-tuple, the relationships in navigational databases are stored in a known, fixed way so it is easy to get to a node from another.
Storing hierachical data in a relational structure
A common task that can be hard to implement in a relational database is the storage of hierarchical data, such as those of a tree. Most of the projects I have worked on required this functionality. It’s difficult to translate a tree into a ResultSet. A lot of data redundancy would be required to fit the tree into a table-like structure. In most cases, when the complete model has a lot of objects this preloading is just not a valid option, since it would take too long. We usually use lazy loading for the children of each node, which allows a shorter startup time and a better support to the update of each node.
To perform this lazy loading when using a typical relational schema, the GUI data model has to perform a different query for each object type, since it has to access a different set of tables. If new object types are added, or new relationships between them are defined, the code of this model also has to be modified.
A more flexible approach
This can be avoided if we define relationships between entities always in the same way, so the queries for all object types all look similar – or are indeed the same. This is achieved if all entity definitions and relationship declarations are stored in a single set of tables, no matter the type of each entity. For 1..n and n..1 relationships, this could be implemented using a single table, but it is advisable to provide support for the general n..n case, therefore using two different tables: one for entities (nodes) and one for relationships between them. In this way, the query used to get the children of an object is always the same, and filling up the tree becomes a simple task.
These two tables are, in fact, defining the vertices (nodes) and edges (relationships) of the object graph. The most important consequence is that the possible relationships between objects are not defined by the structure of the tables themselves, but by the data of the objects stored in them. If we have an object representing, say, a car, and we want to introduce a relationship with a driver, we can do it without modifying the structure of either one since we have provided a generic storage for any relationships. In a traditional model, we would either need to add a column to one of the tables storing the cars and drivers data, or create a new specific table for them. With these two tables describing entities and relationships, we only have to add rows to the relationship table. And since both tables are already connected by means of foreign keys, no special administration needs to be performed to ensure referential integrity: it is achieved by construction.
In this way, the flexibility of the database design has been improved by removing the logical object definition from the physical table structure, and moving it into user data. The tables don’t define the structure of the objects, but the purpose of each of its parts (in this case, the relationships).
Well, this last statement is only partially true, since no restrictions have been imposed on the data model yet. For instance, it would be possible to associate directly a car with a driver’s pet as long as both of them are stored in the database, no matter whether it makes sense or not. This is the just the most simple case, the most dynamic and less restrictive, where relationships are defined by the own user on runtime.
Additional control and semantical checks can be enforced by a trigger or as part of a stored procedure used to insert the data in the physical tables. In order to keep it flexible, the valid relationships shouldn’t be hardcoded in the trigger itself, but stored in a separate metadata table. In this way it is possible to define new valid relationships just by adding rows into this metadata table, instead of by modifying the table structure itself. Again, purpose (how restrictions of relationships between entities are performed, the table structure) is separated from the data itself (which relationships are actually legal, the contents of the table).
It is perfectly possible to fill in this table during the database design phase and allow only read access to the database users, so the logical structure is fixed to the users but easily modifiable by an administrator with privileged access. This topic will be discussed in greater depth at a later post.
Not a long time ago I found this article and this other one by Hank Williams, where he seems to be thinking in the same line about the advantages of graph data models. In his opinion, though, it would be required an approach other than SQL to implement graph structures due to:
- self-joins (tables linked to themselves) can not be optimised by the DB engine
- SQL doesn’t provide a good abstraction
I do have my doubts with the first point. The only distinction between the ‘stardard’ model and the ‘graph’ model is that in the former, the DB engine has to look just in a subset of the entities in the complete model, whereas in the latter it has to look in the whole set. But it is in this field where relational databases excel. The use of indexes and stored procedures can heavily optimise the time on this type of queries. Nevertheless, some reduction on performance compared to a ‘traditional’ data model may be unavoidable. It is up to the database designer to estimate the overhead time in the used RDBMS and weight flexibility against performance to decide whether this pattern is useful for the task at hand.
Regarding the second point, I’d say that it doesn’t matter. SQL is not required to provide an abstraction, as it is just the means to implement it. The trick is to make the database structure independent of the object structure, so that the code is reusable between different entity types. Then, even if low abstraction code has to be dealt with to read and write the data into the relational tables, it only has to be developed once. Later, other users can use the generic high abstraction layer created instead of accessing directly the DB using SQL queries (although they can, if required). The fact that the tables are always the same not only simplifies the code of this DAO layer, it also allows the use of stored procedures, which can improve performance noticeably.
Navigating a graph is an inherently recursive task. If the structure relation is fixed it can be coded in a normal SQL query, but in many cases it is needed to load a node to get the reference of a second, load it and get the reference of a third, and so on. Some RDBMSs, such as Oracle and SQL Server, have special constructs to simplify the SQL code, but they can’t be really optimised due to their procedural nature. Nevertheless, if needed there are some techniques that trade storage space and a bit more processing while writing data for a higher read speed, such as storing in a table the inferred (indirect) relationships between nodes, as depicted in this article.