Lateral Programming

Coding out of the box

Designing databases for flexibility (II): Relationships

Posted by eutrilla on May 5, 2008

As mentioned in my previous post, the objective of this series of articles is to share some database design patterns that have helped me in the past to build, on top of a relational backend, a reusable API for data persistence with a higher level of abstraction and flexibility, while still taking advantage of the tooling and existing knowledge of relational databases. In this second post I’ll try to cover a basic concept: the way entities are related to each other.

Relational vs. navigational databases

A common criticism of relational databases is that their structure is quite rigid and difficult to change. In normalized databases, every time a new relationship is defined between two entities, either one of them has to be modified to hold the PK of the other one (1..n and n..1 relationships) or a new table has to be created (n..n relationships). That, in itself, is not a lot of work. The problem is that these changes in the table structure may break many existing SQL queries, or require that new ones be created. The same applies to stored procedures (PL/SQL) and to the application code that uses the data.

Also, conceptually, the relational approach has a very low level of abstraction. The semantics of SQL statements are not directly related to the relationships between objects, but to the way the data is stored.

Other database paradigms offer this higher level of abstraction. These include Object Oriented Databases (OODBs) and Navigational Databases. The databases of this last group are built around the concepts of nodes and links between nodes, forming a graph. The main structural difference with relational databases is that, while a PK/FK can be established over any subset of columns of a table as long as they form a unique n-tuple, the relationships in navigational databases are stored in a known, fixed way, so it is easy to get from one node to another.

Storing hierarchical data in a relational structure

A common task that can be hard to implement in a relational database is the storage of hierarchical data, such as a tree. Most of the projects I have worked on required this functionality. It’s difficult to translate a tree into a ResultSet: a lot of data redundancy would be required to fit the tree into a table-like structure, and preloading the complete tree is simply not a valid option when the model has many objects, since it would take too long. We usually use lazy loading for the children of each node, which allows a shorter startup time and better support for updating each node.

To perform this lazy loading with a typical relational schema, the GUI data model has to perform a different query for each object type, since each type is stored in a different set of tables. If new object types are added, or new relationships between them are defined, the code of this model also has to be modified.

A more flexible approach

This can be avoided if we always define relationships between entities in the same way, so the queries for all object types look similar – or are indeed the same. This is achieved if all entity definitions and relationship declarations are stored in a single set of tables, no matter the type of each entity. For 1..n and n..1 relationships, this could be implemented using a single table, but it is advisable to support the general n..n case, and therefore use two different tables: one for entities (nodes) and one for relationships between them. In this way, the query used to get the children of an object is always the same, and filling up the tree becomes a simple task.
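To make this more concrete, here is a minimal sketch of such a generic DAO in Java/JDBC. All table and column names (ENTITY, RELATIONSHIP, ID, TYPE, NAME, PARENT_ID, CHILD_ID) are hypothetical placeholders, not taken from any concrete project; the point is only that the same children query serves every entity type.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

/**
 * Minimal sketch of a generic DAO over two tables (hypothetical names):
 *   ENTITY(ID, TYPE, NAME)             -- the nodes
 *   RELATIONSHIP(PARENT_ID, CHILD_ID)  -- the edges, both columns FKs to ENTITY(ID)
 */
public class GenericNodeDao {

    // The same query serves every entity type: only the parent id changes.
    private static final String CHILDREN_QUERY =
        "SELECT e.ID, e.TYPE, e.NAME "
      + "FROM ENTITY e JOIN RELATIONSHIP r ON e.ID = r.CHILD_ID "
      + "WHERE r.PARENT_ID = ?";

    private final Connection connection;

    public GenericNodeDao(Connection connection) {
        this.connection = connection;
    }

    /** Lazily loads the direct children of a node, whatever its type. */
    public List<String[]> getChildren(long parentId) throws SQLException {
        List<String[]> children = new ArrayList<String[]>();
        PreparedStatement ps = connection.prepareStatement(CHILDREN_QUERY);
        try {
            ps.setLong(1, parentId);
            ResultSet rs = ps.executeQuery();
            while (rs.next()) {
                children.add(new String[] {
                        rs.getString("ID"), rs.getString("TYPE"), rs.getString("NAME") });
            }
        } finally {
            ps.close();
        }
        return children;
    }
}
```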

These two tables are, in fact, defining the vertices (nodes) and edges (relationships) of the object graph. The most important consequence is that the possible relationships between objects are no longer defined by the structure of the tables themselves, but by the data stored in them. If we have an object representing, say, a car, and we want to introduce a relationship with a driver, we can do it without modifying the structure of either one, since we have provided generic storage for any relationship. In a traditional model, we would either need to add a column to one of the tables storing the car and driver data, or create a new specific table for them. With these two tables describing entities and relationships, we only have to add rows to the relationship table. And since both tables are already connected by means of foreign keys, no special administration needs to be performed to ensure referential integrity: it is achieved by construction.
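Continuing the same hypothetical schema, relating an existing car to an existing driver is just one new row in the relationship table; a possible sketch:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

/** Sketch: relating two existing entities is a single new row, never a schema change. */
public class RelationshipDao {

    private final Connection connection;

    public RelationshipDao(Connection connection) {
        this.connection = connection;
    }

    /** E.g. relate(carId, driverId): both ids must already exist in ENTITY. */
    public void relate(long parentId, long childId) throws SQLException {
        PreparedStatement ps = connection.prepareStatement(
            "INSERT INTO RELATIONSHIP (PARENT_ID, CHILD_ID) VALUES (?, ?)");
        try {
            ps.setLong(1, parentId);
            ps.setLong(2, childId);
            ps.executeUpdate();  // the FKs to ENTITY(ID) enforce referential integrity
        } finally {
            ps.close();
        }
    }
}
```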

In this way, the flexibility of the database design has been improved by decoupling the logical object definition from the physical table structure and moving it into user data. The tables don’t define the structure of the objects, but the purpose of each of their parts (in this case, the relationships).

Well, this last statement is only partially true, since no restrictions have been imposed on the data model yet. For instance, it would be possible to directly associate a car with a driver’s pet as long as both of them are stored in the database, whether it makes sense or not. This is just the simplest case, the most dynamic and least restrictive one, where relationships are defined by the user at runtime.

Additional control and semantic checks can be enforced by a trigger or as part of a stored procedure used to insert the data into the physical tables. To keep things flexible, the valid relationships shouldn’t be hardcoded in the trigger itself, but stored in a separate metadata table. In this way it is possible to define new valid relationships just by adding rows to this metadata table, instead of modifying the table structure itself. Again, purpose (how restrictions on relationships between entities are enforced: the table structure) is separated from the data itself (which relationships are actually legal: the contents of the metadata table).
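As an illustration (not the only way to do it), the same check could be performed in the DAO layer instead of a trigger, against a hypothetical RELATIONSHIP_RULE(PARENT_TYPE, CHILD_TYPE) metadata table:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

/**
 * Sketch: validating a relationship against a metadata table before inserting it.
 * The same check could live in a trigger or stored procedure; it is shown here as
 * DAO code. RELATIONSHIP_RULE(PARENT_TYPE, CHILD_TYPE) is a hypothetical table.
 */
public class ValidatingRelationshipDao {

    private final Connection connection;

    public ValidatingRelationshipDao(Connection connection) {
        this.connection = connection;
    }

    public void relate(long parentId, long childId) throws SQLException {
        // A relationship is legal only if the two entity types appear together in
        // RELATIONSHIP_RULE. New legal combinations are added as rows, not as schema changes.
        PreparedStatement check = connection.prepareStatement(
            "SELECT COUNT(*) FROM RELATIONSHIP_RULE rr "
          + "JOIN ENTITY p ON p.TYPE = rr.PARENT_TYPE "
          + "JOIN ENTITY c ON c.TYPE = rr.CHILD_TYPE "
          + "WHERE p.ID = ? AND c.ID = ?");
        try {
            check.setLong(1, parentId);
            check.setLong(2, childId);
            ResultSet rs = check.executeQuery();
            rs.next();
            if (rs.getInt(1) == 0) {
                throw new SQLException("Relationship not allowed by the metadata rules");
            }
        } finally {
            check.close();
        }

        PreparedStatement insert = connection.prepareStatement(
            "INSERT INTO RELATIONSHIP (PARENT_ID, CHILD_ID) VALUES (?, ?)");
        try {
            insert.setLong(1, parentId);
            insert.setLong(2, childId);
            insert.executeUpdate();
        } finally {
            insert.close();
        }
    }
}
```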

It is perfectly possible to fill in this table during the database design phase and allow only read access to regular database users, so the logical structure is fixed for users but easily modifiable by an administrator with privileged access. This topic will be discussed in greater depth in a later post.

Performance issues

Not long ago I found this article and this other one by Hank Williams, where he seems to be thinking along the same lines about the advantages of graph data models. In his opinion, though, an approach other than SQL would be required to implement graph structures, because:

  • self-joins (tables linked to themselves) cannot be optimised by the DB engine
  • SQL doesn’t provide a good abstraction

I do have my doubts about the first point. The only distinction between the ‘standard’ model and the ‘graph’ model is that in the former the DB engine has to look at just a subset of the entities in the complete model, whereas in the latter it has to look at the whole set. But this is precisely the field where relational databases excel. The use of indexes and stored procedures can heavily optimise this type of query. Nevertheless, some reduction in performance compared to a ‘traditional’ data model may be unavoidable. It is up to the database designer to estimate the overhead in the chosen RDBMS and weigh flexibility against performance to decide whether this pattern is useful for the task at hand.

Regarding the second point, I’d say that it doesn’t matter. SQL is not required to provide an abstraction, as it is just the means to implement it. The trick is to make the database structure independent of the object structure, so that the code is reusable across different entity types. Then, even if low-abstraction code has to be written to read and write the data in the relational tables, it only has to be developed once. Later, other users can work with the generic, higher-abstraction layer instead of accessing the DB directly with SQL queries (although they still can, if required). The fact that the tables are always the same not only simplifies the code of this DAO layer, it also allows the use of stored procedures, which can improve performance noticeably.

Navigating a graph is an inherently recursive task. If the structure is fixed it can be coded as a normal SQL query, but in many cases it is necessary to load a node to get the reference to a second, load that one to get the reference to a third, and so on. Some RDBMSs, such as Oracle and SQL Server, have special constructs to simplify the SQL code, but these can’t really be optimised due to their procedural nature. Nevertheless, if needed, there are techniques that trade storage space and a bit more processing when writing data for higher read speed, such as storing the inferred (indirect) relationships between nodes in a table, as depicted in this article.
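For illustration, a portable (if chatty) alternative is simply to repeat the children query level by level from application code; the sketch below assumes the same hypothetical tables as before and guards against cycles:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

/**
 * Sketch: walking the graph with repeated "get children" queries (portable, but one
 * round trip per node). Oracle's CONNECT BY or recursive CTEs can push the recursion
 * into the server instead. Table and column names are the hypothetical ones used above.
 */
public class GraphWalker {

    private static final String CHILD_IDS_QUERY =
        "SELECT r.CHILD_ID FROM RELATIONSHIP r WHERE r.PARENT_ID = ?";

    private final Connection connection;

    public GraphWalker(Connection connection) {
        this.connection = connection;
    }

    /** Collects the ids of all direct and indirect descendants of a node. */
    public Set<Long> descendantsOf(long rootId) throws SQLException {
        Set<Long> descendants = new LinkedHashSet<Long>();
        Set<Long> visited = new HashSet<Long>();
        Deque<Long> pending = new ArrayDeque<Long>();
        visited.add(rootId);
        pending.add(rootId);
        while (!pending.isEmpty()) {
            long current = pending.poll();
            PreparedStatement ps = connection.prepareStatement(CHILD_IDS_QUERY);
            try {
                ps.setLong(1, current);
                ResultSet rs = ps.executeQuery();
                while (rs.next()) {
                    long childId = rs.getLong(1);
                    if (visited.add(childId)) {   // guard against cycles in the graph
                        descendants.add(childId);
                        pending.add(childId);
                    }
                }
            } finally {
                ps.close();
            }
        }
        return descendants;
    }
}
```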

Posted in databases | 3 Comments »

Designing databases for flexibility

Posted by eutrilla on April 27, 2008

In the last few weeks I’ve noticed several posts about alternative database paradigms: CouchDb, SimpleDb, BigTable… All of them promise infinite scalability. This isn’t really important for me, since the projects I’ve worked on don’t require a big concurrent workload. Or at least, not so big that an RDBMS can’t handle it. But they have one more thing in common: an easier way to store and retrieve data, away from plain SQL.

I do agree that storing objects feels much more natural and easier to understand for developers than adding rows to one or several tables. I’ve always felt uncomfortable with the simplistic SQL statements: I had to spend a lot of time thinking about how tables were connected, instead of how the domain objects were related, so I tried to figure out how I could isolate the application code from the database structure. ORM frameworks simplified things, but not completely: they replace specific SQL queries with specific mappings.

Object-Oriented databases are an interesting concept. Forget about database schemas and ORM mappings, and design your domain model as you want. Then simply store it all, just as if it were kept in memory, but with the ability to run queries on it. Tempting, isn’t it? Of course it has a price: they are really slow compared to RDBs.

The cloud databases follow the same lines, but instead of storing full-fledged OOD objects, they handle maps of key-value pairs containing the named parameters of the object. I guess that this is the trick to improve performance: by limiting the format of the objects, it is possible to optimise the way the data is queried. Also, the objects only store “simple” values, such as numbers and strings, but not other objects – each entry is limited in size. They may contain the identifier of another object, so you can retrieve it later, but a read operation doesn’t return all the objects stored in a tree, for instance. That’s a strong point, and at the same time a weakness, of OODBs: getting the whole tree is easy, but getting just the root node is comparatively slow, since you are retrieving the whole tree anyway.

The problem with these cloud-based DBs is that their field of application is generally restricted to Web services. In many other cases, we are stuck with relational databases, at least for a while. But the way they store data doesn’t require a distributed database. It can very well be implemented over a relational database structure, using a DAO layer in code to manage the SQL details. Depending on the features required it would take more or less work, but once finished it can be used for any type of object composed of key-value pairs. Over the last few years, I’ve implemented and used in my day job several such database designs, with different functionality sets. They allowed us to speed up development and to make changes to the data model easily, including the creation of new entity types or the addition of properties to existing ones. And of course, it was still a relational database, so we could keep on doing specific SQL queries for critical or complex searches, and use all our usual SQL development tools.
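As a rough sketch of what such a DAO layer might look like (the PROPERTY table and its columns are hypothetical names, not from any of those projects), each object’s key-value pairs can be stored as one row per property and read back into a plain map:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Sketch of key-value storage on a relational backend: one row per property.
 * PROPERTY(ENTITY_ID, PROP_NAME, PROP_VALUE) is a hypothetical table; adding a new
 * property to an entity type means inserting rows, not altering the schema.
 */
public class PropertyDao {

    private final Connection connection;

    public PropertyDao(Connection connection) {
        this.connection = connection;
    }

    /** Loads all named properties of one entity as a simple key-value map. */
    public Map<String, String> getProperties(long entityId) throws SQLException {
        Map<String, String> properties = new LinkedHashMap<String, String>();
        PreparedStatement ps = connection.prepareStatement(
            "SELECT PROP_NAME, PROP_VALUE FROM PROPERTY WHERE ENTITY_ID = ?");
        try {
            ps.setLong(1, entityId);
            ResultSet rs = ps.executeQuery();
            while (rs.next()) {
                properties.put(rs.getString("PROP_NAME"), rs.getString("PROP_VALUE"));
            }
        } finally {
            ps.close();
        }
        return properties;
    }
}
```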

Some say that relational databases are dead. I’d say that they are pretty alive and will be for some time at least, but that the way we use them may not be the best one in all cases, and that new approaches can be tried. The characteristics of RDBMSs may not be the best in terms of abstraction, but they make them a good backend over which databases with a higher level of abstraction can be built. In the next posts, I’ll try to share some of the solutions that I’ve found useful in the past.

Posted in databases | Tagged: | 2 Comments »

Why use Spring… if you have a HashMap at hand?

Posted by eutrilla on April 7, 2008

Ok, let’s start with a disclaimer. There must be a thousand reasons to use Spring. It’s a terrific piece of work. And nobody can possibly complain about its price.

But it also has its drawbacks. Probably the first one that anybody could think of would be XML config files. Another one is, oddly enough, one of its strong points: it integrates an awful lot of functionality. Although this may be very good indeed if you’re using it all, for simple/medium apps Spring is simply too bloated.

Guice is a step in the right direction, I think. Just a DI container, no more. Google has also chosen to use annotations instead of XML config files, which helps to keep it manageable. But I still have the annoying thought that it shouldn’t be that hard to keep our software nice and clean, without the aid of a whole framework.

Why DI is not always good enough

Maybe the problem with both Spring and Guice is a flaw in the Dependency Injection pattern itself. It looks nice in the simple UML diagrams used to describe the concept, but it gets harder when the number of classes and components increases.

Let’s say that we’re developing a component, and we expect to use it in a DI container. We don’t want to expose too many implementation details, so we decide that all calls from the outside should be made through a façade. Now one of our classes needs to access a log service, and we would like it to be configurable. So we need a setter. Well, actually, we need at least two: one in the calling class and one in the façade, since the former shouldn’t be directly available as part of the public API. There could even be more if the façade doesn’t hold a direct reference to the class in need of logging. The more complex the components are, the more likely this is to happen. In the end, we need some extra code to connect services to the classes where they are used.

But now, we have our shiny loosely-coupled component ready for use. So we insert it in our app, together with some tens of other modules. It turns out that all modules require access to the log service. So we have to repeat a very similar setLogger() call tens of times. No wonder DI containers require a LOT of configuration.

Isn’t there a simpler solution?

On the other hand, let’s suppose we forget for a while about Dependency Injection and try another pattern that targets exactly the same use case as DI: the Service Locator. If any of you don’t know what it looks like, I’ll let Martin Fowler introduce it. Maybe it’s not as trendy as DI at the moment, but it’s not anything really new. After all, Service Locator is the very same pattern behind J2EE.

Well, it is quite clear why DI is better, isn’t it? The Service Locator introduces a new dependency on the calling component, so it is not so loosely-coupled anymore. That’s a big problem, right? Well, maybe not.

As Martin Fowler points out in the article linked above, a dynamic Service Locator can be implemented as just a simple HashMap with some accessors (a minimal sketch follows the list below). And it has quite a few interesting advantages:

  • It is generic, so it is not much of a burden dependency-wise.
  • If a class inside your component requires access to a new service, you don’t need to change the other classes in the component to propagate a reference from the outside, as long as the service is registered in a static instance of the Service Locator.
  • Even if you can’t use a static instance, or you have different contexts that require different sets of services, you only need to propagate one object to access all of them: the locator itself.
  • Configuration is reduced to creating the different components and registering them in the locator during startup.
  • It is easy to hot-swap implementations of the services.
  • No extra framework is needed. Just a HashMap!
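Here is the minimal sketch promised above, in the spirit of Fowler’s dynamic version; the Logger and ConsoleLogger names in the usage comment are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

/**
 * A minimal dynamic Service Locator in the spirit of Martin Fowler's article:
 * essentially a HashMap with a couple of accessors (a sketch, not production code).
 */
public class ServiceLocator {

    private static final Map<String, Object> services = new HashMap<String, Object>();

    /** Called once per service implementation during application startup. */
    public static void register(String key, Object service) {
        services.put(key, service);
    }

    /** Called by any class, in any component, that needs the service. */
    public static Object lookup(String key) {
        return services.get(key);
    }
}

// Usage sketch (Logger and ConsoleLogger are hypothetical types):
//   ServiceLocator.register("logger", new ConsoleLogger());      // at startup
//   Logger logger = (Logger) ServiceLocator.lookup("logger");    // anywhere else
```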

Ok, nothing is that nice or that easy

Of course I’m not trying to compare a plain HashMap with the whole Spring Framework, just with the DI-container part of it. And even so, it sure has some drawbacks:

  • Configuration (i.e. creation and registration of service instances) is done in code. Some may argue that XML-based deployment is more flexible. Personally, I don’t find it much harder to edit and compile an initialization manager class, and for complex projects code beats XML.
  • The dependency on the locator itself is spread through a good part of the classes that make up the application. Since it is so general, it’s not a big deal. If the Locator were part of the standard Java library, it would be almost like having a dependency on java.lang.String.
  • All components should use the same Service Locator class. Otherwise services would have to be registered several times, and the design becomes messy. But since locators can’t be that different from each other, it should be easy to adapt one class to another using an adapter. Again, if such a simple class were part of the standard Java library, it wouldn’t be a problem at all.
  • All service references require a cast. Annoying, but not that bad.
  • If components are not well documented, it’s hard to say which dependencies they have, whereas DI-adapted components have the list of dependencies right there in their public API.

But would it work well enough in *real* projects?

I can say that I’ve used this approach in several projects and found it really easy to implement and use. We even made some improvements over Martin Fowler’s initial implementation: the keys used to store and retrieve the service instances are no longer Strings, but the Class object of the desired interface itself. In that way, we avoid typing errors in the identifier, and we can place an assert in the service registration method to be sure that the registered instance is indeed an implementation of the interface.
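A sketch of that class-keyed variant follows; as before, Logger and ConsoleLogger in the usage comment are hypothetical names:

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Sketch of the variant described above: the key is the Class object of the service
 * interface, so identifier typos are impossible and the registration method can
 * verify the implementation type.
 */
public class TypedServiceLocator {

    private static final Map<Class<?>, Object> services = new HashMap<Class<?>, Object>();

    public static <T> void register(Class<T> serviceInterface, T implementation) {
        // Generics already enforce the type at compile time; the assert mirrors the
        // runtime check mentioned in the post for the registration method.
        assert serviceInterface.isInstance(implementation);
        services.put(serviceInterface, implementation);
    }

    public static <T> T lookup(Class<T> serviceInterface) {
        return serviceInterface.cast(services.get(serviceInterface));
    }
}

// Usage sketch (Logger and ConsoleLogger are hypothetical types):
//   TypedServiceLocator.register(Logger.class, new ConsoleLogger());
//   Logger logger = TypedServiceLocator.lookup(Logger.class);   // no cast needed
```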

Quoting Martin Fowler’s article:

The choice between Service Locator and Dependency Injection is less important than the principle of separating service configuration from the use of services within an application.

To finish, I’d like to mention NetBeans’ Lookup class and Java 6’s ServiceLoader class. A very interesting article by John O’Conner (Creating Extensible Applications with the Java Platform) explains the differences between the two and how to use them. Again, this is Service Locator in action, but these classes try to find the services in the classpath instead of being configured during startup. So they are very convenient for plugin-style development, but provide less control over which (and how many) implementations of the services are used, and therefore target a slightly different use case. But it seems to confirm that there are other, simpler options out there to get rid of spaghetti code.

Posted in architecture, java | Tagged: , | 14 Comments »

 