Perspectives on LedgerSMB: Building SOLID Databases: Open/Closed Principle

Introduction

Like the Single Responsibility Principle, the Open/Closed Principle is fairly easy to apply to object-relational design in PostgreSQL, very much unlike the Liskov Substitution Principle, which will be the subject of next week's post. However, the Open/Closed Principle applies only in a weaker way to relational design, and hence to object-relational design, for much the same reason that the Liskov Substitution Principle applies in completely different ways.

This instalment thus marks the beginning of what is likely to be a full rotation, going from similarity to difference and back to similarity again.

As is always the case, relations can be thought of as specialized "fact classes" which are then manipulated and used to create usable information. Because they model facts, and not behavior, design concerns often apply differently to them.

Additionally, this is the first real taste of the complexity in how the object and relational paradigms combine in an object-relational setup.

The Open/Closed Principle in Application Design

In application design and development the Open/Closed Principle facilitates code stability. The basic principle is that one should be able to extend a module without modifying its source code. In the LedgerSMB 1.4 codebase, for example, there are places where we prepare a screen for entering report criteria but wrap the functions which do so in other functions which set up report-specific input information. The function is then open to extension without modification, and so its complexity is limited to the most common aspects of generating report filter screens. Similarly, customizations may simply inherit existing code interfaces and add new routines. This allows modified routines to exist where needed without disrupting other callers.

In addition to code stability (meaning a low rate of code change in response to changing requirements), there are two other frequently overlooked benefits to respecting the Open/Closed Principle.

The first is that software tends to be quite complex and state management is a major source of bugs in virtually all software programs. When a frequently used subroutine is changed, it may introduce unexpected changes which are still compliant with the interface specification and test cases, but nonetheless introduce or uncover bugs elsewhere in the application. In essence the consequences of changing code are not always easily foreseen. The major problem here is that state often changes when interfaces are called, and consequently changing code behind a currently used interface means that there may be subtle state changes that are not adequately thought through. As the complexity of an interface grows, so too this problem grows.

If instead interfaces are built for flexibility, either by accepting additional data which can be passed on to the next stage or via inheritance, then state is more easily assured, bugs are less likely to surface, and the application can more quickly be adapted to changing business rules.

Problems Defining "Extension" and "Modification" in a db schema

Inside the database, state changes are well defined and follow very specific rules. Consequently it is not sufficient to define "modification" as a "change in source code." If it were, an "alter table add column... " would always qualify. But is this the case? Or is it merely an extension? In general, merely adding an optional field cannot pose the state problems seen in application environments, but it can pose some problems of its own.

Defining modification and extension is thus a problem. In general some sorts of modification are more dangerous than others. For this reason we may look at these both in a strict view, where ideally tables are left alone and we add new joining tables to extend, and a more relaxed view where extension includes adding columns which do not change internal data constraints (i.e. where all previous insert and update statements would remain valid, and no constraints are relaxed).
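To make the two views concrete, here is a minimal sketch. The customer table, the loyalty_code column, and the customer_loyalty joining table are all hypothetical names made up for illustration; they are not part of LedgerSMB.

```sql
-- Hypothetical base table, standing in for one owned by the base package.
CREATE TABLE customer (
    id serial PRIMARY KEY,
    name text NOT NULL
);

-- Relaxed view: add an optional column. The column is nullable and
-- carries no constraint, so every INSERT or UPDATE written against the
-- old definition remains valid.
ALTER TABLE customer ADD COLUMN loyalty_code text;

-- Strict view: leave the base table alone and extend it through a
-- joining table keyed on the base table's primary key.
CREATE TABLE customer_loyalty (
    customer_id int PRIMARY KEY REFERENCES customer (id),
    loyalty_code text NOT NULL
);

-- Under the strict view the extended relation is recovered with a join.
SELECT c.id, c.name, l.loyalty_code
  FROM customer c
  LEFT JOIN customer_loyalty l ON l.customer_id = c.id;
```

Note that even the relaxed form changes what "SELECT * FROM customer" returns, which is one reason the strict view may still be preferred where other code depends on the exact shape of the table.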

In general, what makes a database nice in terms of relational design is that one can prove or disprove whether a schema change is backwards compatible using math. If it is not backwards compatible, then it is always modification. If it is backwards compatible, it is always extension when using the relaxed view.

Another point that must be borne in mind is that relational math is quite capable of creating new synthetic relations, because relations are structurally transparent and encapsulation is typically weak, while state management issues are handled through a very well-developed framework of transaction management, locking, and, frequently, snapshot views. When one combines these techniques with declarative constraints on values, one has a very robust state engine which is based on a very different approach from that of object-oriented applications managing application behavior.

Problems with modifying table schemas

In a case where software may be managed via overlapping deployment cycles, certain problems can occur when extending tables by adding columns. This is because the knowledge of the deployment cycle typically only goes one way: the extension team has knowledge of the base team's past cycles, while the base team typically has no knowledge of the extending team's work. This is typical in cases where software is deployed and then customized. Adding fields to tables makes extension easy, but the cost is that major version upgrades of the base package may overwrite or clobber extensions, or may fail. In essence a relaxed standard takes on the risk that upgrades of the base package may not go so smoothly.
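A small illustration of how such an upgrade can fail, using a hypothetical invoice table and a hypothetical region column (neither is taken from a real package):

```sql
-- Hypothetical table shipped by the base package.
CREATE TABLE invoice (
    id serial PRIMARY KEY,
    amount numeric NOT NULL
);

-- Site customization, applied by the extending team after deployment.
ALTER TABLE invoice ADD COLUMN region text;

-- Later upgrade script from the base team, written without knowledge of
-- the customization. On the customized database this statement fails,
-- because a column named region already exists (and with another type).
ALTER TABLE invoice ADD COLUMN region int;
```

Had the customization lived in its own joining table, the base upgrade would have run cleanly, and the two meanings of "region" would at least have been visible side by side.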

On the other hand, if the software is deployed via a single deployment cycle, as is typical of purely in-house applications and commercial software, these problems are avoided, and extension by adding fields does not break anything.

The obvious problem here is that these categories are not mutually exclusive, however much they appear to be. The commercial software may be extended by a second team on-site, and therefore a single deployment cycle on one entity does not guarantee a single deployment cycle.

Object/Relational Interfaces and the OCP

Because object-relational interfaces can encapsulate data, and often can be made to run in certain security contexts (which can be made to cascade), the open/closed principle has a number of applications in ensuring testable database interfaces where security barriers are involved.

For example, we might have an automation system where computers connect to the database in various roles to pull job information. Rather than having separate accounts for each computer, we can assign them the same login role, but filter out the data they can see based on the client IP address. This would increase the level of containment in the event of a system compromise.

So we might create an interface of something like my_client() which instantiates the client information from the client IP address, and use that in various functions to filter. Consequently we might just run:

select * from get_jobs();
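As a rough sketch of what my_client() might look like, assuming a hypothetical clients table that records each machine's address (the table and its columns are assumptions made for illustration), one could lean on PostgreSQL's built-in inet_client_addr():

```sql
-- Hypothetical lookup table; names and columns are assumptions.
CREATE TABLE IF NOT EXISTS clients (
    id int PRIMARY KEY,
    name text NOT NULL,
    ip_addr inet NOT NULL UNIQUE
);

-- Instantiate the client record from the address of the current
-- connection. inet_client_addr() returns NULL for Unix-socket
-- connections, in which case no row matches and the result is NULL.
CREATE OR REPLACE FUNCTION my_client() RETURNS clients
LANGUAGE SQL STABLE AS $$
    SELECT * FROM clients WHERE ip_addr = inet_client_addr();
$$;

-- Example: which client does the database think we are?
SELECT (my_client()).name;
```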

The problem with such a system, of course, is testability. We can't readily test the output because it depends on the client IP address. So we might instead create a more flexible interface, available only to superusers, which accepts a client object which we can instantiate by name, integer id, or the like. In that case we may have a query that can be called by DBAs and test suites like this, where 123 is the internal id of the client:

SELECT * FROM get_jobs(client(123));

The get_jobs() function for the production clients would now look like this:

CREATE OR REPLACE FUNCTION get_jobs() RETURNS SETOF jobs
LANGUAGE SQL SECURITY DEFINER AS $$

SELECT * FROM get_jobs(my_client());

$$;

We have essentially built an API which is open to extension for security controls but closed to modification. This means we can run test cases against the underlying database interface even on production (since these can be in transactions that roll back), and push tests of the my_client() interface to the clients themselves, to verify their proper setup.
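The restricted side of this interface, and a test that rolls back, might be sketched as follows. The clients and jobs tables, their columns, and the test id 9001 are assumptions for illustration only. Positional parameters ($1) are used so the functions do not depend on named-parameter support in SQL functions.

```sql
-- Hypothetical schema; table and column names are assumptions.
CREATE TABLE IF NOT EXISTS clients (
    id int PRIMARY KEY,
    name text NOT NULL,
    ip_addr inet NOT NULL UNIQUE
);
CREATE TABLE IF NOT EXISTS jobs (
    id serial PRIMARY KEY,
    client_id int NOT NULL REFERENCES clients (id),
    description text NOT NULL
);

-- Instantiate a client object by its internal id.
CREATE OR REPLACE FUNCTION client(int) RETURNS clients
LANGUAGE SQL STABLE AS $$
    SELECT * FROM clients WHERE id = $1;
$$;

-- The flexible interface: jobs for whichever client object is passed in.
CREATE OR REPLACE FUNCTION get_jobs(clients) RETURNS SETOF jobs
LANGUAGE SQL STABLE AS $$
    SELECT * FROM jobs WHERE client_id = ($1).id;
$$;

-- Functions are executable by PUBLIC by default, so revoke that. The
-- owner and superusers can still call them, as can the SECURITY DEFINER
-- wrapper get_jobs() when it is owned by the same role.
REVOKE EXECUTE ON FUNCTION client(int) FROM PUBLIC;
REVOKE EXECUTE ON FUNCTION get_jobs(clients) FROM PUBLIC;

-- A test that leaves no trace: everything is undone by the ROLLBACK.
BEGIN;
INSERT INTO clients (id, name, ip_addr) VALUES (9001, 'ocp-test', '192.0.2.10');
INSERT INTO jobs (client_id, description) VALUES (9001, 'test job');
SELECT description FROM get_jobs(client(9001));  -- the single test job
ROLLBACK;
```

The design choice here is that the filtering logic lives in exactly one place, get_jobs(clients), so the test exercises the same code path that production clients reach through the wrapper.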

A Functional Example: Arbitrary data type support in pg_message_queue

There are a few cases, however, where the Open/Closed Principle has more direct applicability.

In pg_message_queue 0.1, only text, bytea, and xml queues were supported. Since virtually anything can be put in a text field, this was deemed to be sufficient at the time. However, in 0.2, I wanted to be able to support JSON queues but in a way that would not preclude the extension from running on PostgreSQL 9.1. The solution was to return to the Open/Closed Principle and build a system which could be extended easily for arbitrary types. The result was much more powerful than initially hoped for (and in fact I am now using queues with integer and IP address payloads).

In 0.1, the code looked like this:

CREATE TABLE pg_mq_base (
    msg_id bigserial not null,
    sent_at timestamp not null default now(),
    sent_by name not null default session_user,
    delivered_at timestamp
);

CREATE TABLE pg_mq_xml (
    payload xml not null,
    primary key (msg_id)
) inherits (pg_mq_base);

CREATE TABLE pg_mq_text (
    payload text not null,
    primary key (msg_id)
) inherits (pg_mq_base);

CREATE TABLE pg_mq_bytea (
    payload bytea not null,
    primary key (msg_id)
) inherits (pg_mq_base);

The approach here was to use table inheritance so that new queue types could be easily added. When queues are added a table is created like one of the other tables, including all indexes etc. The relevant portion of the pg_mq_create_queue function is:

EXECUTE 'CREATE TABLE ' || quote_ident(t_table_name) || '(
    like ' || quote_ident('pg_mq_' || in_payload_type ) || ' INCLUDING ALL
)';

The problem here is that while it was possible to extend this, one couldn't do so very easily without modifying the source code of the functions. In 0.2, we reduced the latter part to:

-- these are types for return values only. they are not for storage.
-- using tables because types don't inherit

CREATE TABLE pg_mq_text (payload text) inherits (pg_mq_base);
CREATE TABLE pg_mq_bin (payload bytea) inherits (pg_mq_base);

But the real change is in how the payload type for the queue is specified. The table creation portion of pg_mq_create_queue is now:

EXECUTE 'CREATE TABLE ' || quote_ident(t_table_name) || '(
    like pg_mq_base INCLUDING ALL,
    payload ' || in_payload_type || ' NOT NULL
)';

This has the advantage of allowing payloads of any type known to PostgreSQL. We can have queues for mac addresses, internet addresses, GIS data, and even complex types if we want to do more object-relational processing on output.
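One thing to keep in mind with this pattern is that the type name is concatenated into the statement as-is. A possible hardening, shown here only as a sketch and not as what pg_message_queue actually does, is to accept the type as regtype, so that PostgreSQL validates the name before any SQL is built. The function name demo_create_queue and the queue names below are hypothetical, and the sketch assumes the pg_mq_base table from the 0.1 listing above already exists.

```sql
-- Hypothetical helper, not part of pg_message_queue.
CREATE OR REPLACE FUNCTION demo_create_queue(
    in_table_name text,
    in_payload_type regtype
) RETURNS void LANGUAGE plpgsql AS $$
BEGIN
    -- Casting regtype back to text yields a validated, properly quoted
    -- type name; an unknown type is rejected when the argument is
    -- converted, before this statement is ever assembled.
    EXECUTE 'CREATE TABLE ' || quote_ident(in_table_name) || ' (
        like pg_mq_base INCLUDING ALL,
        payload ' || in_payload_type::text || ' NOT NULL
    )';
END;
$$;

-- Queues for internet addresses and integers, with no change to the
-- function's source code.
SELECT demo_create_queue('mq_inet_demo', 'inet');
SELECT demo_create_queue('mq_int_demo', 'integer');
```

The function remains closed to modification while staying open to any type the server knows about, including types added later by other extensions.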

This approach will become more important after 0.3 is out and we begin working on object-relational interfaces on pg_message_queue.

Conclusions

The Open/Closed Principle is where we start to see a mismatch between object-oriented application programming and object-relational database design. It's not that the basic principle doesn't apply, but rather that it often does so in strange and counter-intuitive ways.


Wednesday, January 30, 2013
