Wednesday, January 30, 2013

Building SOLID Databases: Open/Closed Principle

Introduction


Like the Single Responsibility Principle, the Open/Closed Principle is pretty easy to apply to object-relational design in PostgreSQL, very much unlike the Liskov Substitution Principle, which will be the subject of next week's post.  However, the Open/Closed Principle applies only in a weaker way to relational design, and hence to object-relational design, for much the same reason that the Liskov Substitution Principle applies in completely different ways.

This instalment thus marks the beginning of what is likely to be a full rotation, going from similarity to difference and back to similarity again.

As is always the case, relations can be thought of as specialized "fact classes" which are then manipulated and used to create usable information.  Because they model facts, and not behavior, design concerns often apply differently to them.

Additionally, this is the first real taste of the complexity in how object and relational paradigms combine in an object-relational setup.

The Open/Closed Principle in Application Design


In application design and development, the Open/Closed Principle facilitates code stability.  The basic principle is that one should be able to extend a module without modifying its source code.  In the LedgerSMB 1.4 codebase, for example, there are places where we prepare a screen for entering report criteria but wrap the functions which do so in other functions which set up report-specific input information.  The wrapped function is thus open to extension without modification, and so its complexity is limited to the most common aspects of generating report filter screens.  Similarly, customizations may simply inherit existing code interfaces and add new routines.  This allows modified routines to exist where needed without disturbing other callers.

In addition to code stability (meaning a low rate of change to code in the face of changing requirements), there are two other frequently overlooked benefits to respecting the Open/Closed Principle.

The first is that software tends to be quite complex, and state management is a major source of bugs in virtually all software programs.  When a frequently used subroutine is changed, it may introduce unexpected changes which are still compliant with the interface specification and test cases, but which nonetheless introduce or uncover bugs elsewhere in the application.  In essence, the consequences of changing code are not always easily foreseen.  The major problem here is that state often changes when interfaces are called, and consequently changing code behind a currently used interface means there may be subtle state changes that have not been adequately thought through.  As the complexity of an interface grows, so does this problem.

Instead, if interfaces are built for flexibility, either by accepting additional data which can be passed on to the next stage or via inheritance, then correct state is more easily assured, bugs are less likely to surface, and the application can more quickly be adapted to changing business rules.

Problems Defining "Extension" and "Modification" in a db schema


Inside the database, state changes are well defined and follow very specific rules.  Consequently it is not sufficient to define "modification" as a "change in source code."  If it were, an "alter table add column ..." would always qualify.  But is this the case?  Or is it merely an extension?  In general, merely adding an optional column cannot pose the state problems seen in application environments, but it can pose some problems of its own.

Defining modification and extension is thus a problem.  In general some sorts of modification are more dangerous than others.  For this reason we may look at these both in a strict view, where ideally tables are left alone and we add new joining tables to extend, and a more relaxed view where extension includes adding columns which do not change internal data constraints (i.e. where all previous insert and update statements would remain valid, and no constraints are relaxed).
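To make the distinction concrete, here is a minimal sketch against a hypothetical customers table with an integer id primary key (all the names are invented for illustration):

-- relaxed view: extension.  Every previously valid INSERT or UPDATE
-- still succeeds, because the new column is optional.
ALTER TABLE customers ADD COLUMN twitter_handle text;

-- modification: on a populated table this fails outright without a
-- DEFAULT, and either way previously valid INSERT statements are no
-- longer valid because a value is now required.
ALTER TABLE customers ADD COLUMN vat_number text NOT NULL;

-- strict view: extend only by adding a joining table, leaving
-- customers itself untouched.
CREATE TABLE customer_twitter (
    customer_id int PRIMARY KEY REFERENCES customers (id),
    twitter_handle text NOT NULL
);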

In general, what makes a database nice in terms of relational design is that one can prove or disprove whether a schema change is backwards-compatible using math.  If it is not backwards-compatible, then it is always modification.  If it is backwards-compatible, it is always extension under the relaxed view.

Another point that must be borne in mind is that relational math is quite capable of creating new synthetic relations, based on the fact that relations are structurally transparent and encapsulation is typically weak, while state management issues are handled through a very well-developed framework of transaction management, locking, and, frequently, snapshot views.  When you combine these techniques with declarative constraints on values, you have a very robust state engine, based on a very different approach than the one object-oriented applications use to manage application behavior.

Problems with modifying table schemas


In a case where software may be managed via overlapping deployment cycles, certain problems can occur when extending tables by adding columns.  This is because knowledge of the deployment cycle typically only goes one way: the extension team has knowledge of the base team's past cycles, while the base team typically has no knowledge of the extending team's work.  This is typical in cases where software is deployed and then customized.  Typically, adding fields to tables makes extension easy, but the cost is that major version upgrades of the base package may overwrite or clobber extensions, or may fail outright.  In essence, a relaxed standard takes on the risk that upgrades of the base package may not go so smoothly.

On the other hand, if the software is deployed via a single deployment cycle, as is typical of purely inhouse applications and commercial software, these problems are avoided, and extension by adding fields does not break anything.

The obvious problem here is that these categories are not as mutually exclusive as they appear.  Commercial software may be extended by a second team on-site, and therefore a single deployment cycle at one entity does not guarantee a single deployment cycle overall.

Object/Relational Interfaces and the OCP


Because object-relational interfaces can encapsulate data, and often can be made to run in certain security contexts (which can be made to cascade), the open/closed principle has a number of applications in ensuring testable database interfaces where security barriers are involved.

For example, we might have an automation system where computers connect to the database in various roles to pull job information.  Rather than having separate accounts for each computer, we can assign them the same login role, but filter out the data they can see based on the client IP address.  This would increase the level of containment in the event of a system compromise.

So we might create an interface of something like my_client() which instantiates the client information from the client IP address, and use that in various functions to filter.   Consequently we might just run:

select * from get_jobs();
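A minimal sketch of what my_client() might look like follows; the client table and its client_ip column are hypothetical, while inet_client_addr() is the stock PostgreSQL function returning the address of the connecting client (note that it returns NULL over Unix-domain sockets):

CREATE OR REPLACE FUNCTION my_client() RETURNS client
LANGUAGE SQL STABLE AS $$

SELECT c.* FROM client c WHERE c.client_ip = inet_client_addr();

$$;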

The problem, of course, with such a system is testability.  We can't readily test the output because it depends on the client IP address.  So we might instead create a more flexible interface, available only to superusers, which accepts a client object that we can instantiate by name, integer id, or the like.  In that case we may have a query that can be called by DBAs and test suites like this, where 123 is the internal id of the client:

SELECT * FROM get_jobs(client(123));
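The by-id constructor can be just as small.  Again a sketch, assuming the same hypothetical client table with an integer id primary key:

CREATE OR REPLACE FUNCTION client(int) RETURNS client
LANGUAGE SQL STABLE AS $$

SELECT c.* FROM client c WHERE c.id = $1;

$$;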

The get_jobs() function for the production clients would now look like this:

CREATE OR REPLACE FUNCTION get_jobs() RETURNS SETOF jobs 
LANGUAGE SQL SECURITY DEFINER AS $$

SELECT * FROM get_jobs(my_client());

$$;


We have essentially built an API which is open to extension for security controls but closed to modification.  This means we can run test cases against the underlying functions even on production systems (since these can run in transactions that roll back), and push tests of the my_client() interface to the clients themselves, to verify their proper setup.
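A test run might then be as simple as the following, which leaves no trace even on a live system:

BEGIN;
-- exercise the flexible interface against a known client
SELECT * FROM get_jobs(client(123));
-- check the results, then throw the transaction away
ROLLBACK;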

A Functional Example:  Arbitrary data type support in pg_message_queue


There are a few cases, however, where the open-closed principle has more direct applicability.

In pg_message_queue 0.1, only text, bytea, and xml queues were supported.  Since virtually anything can be put in a text field, this was deemed sufficient at the time.  However, in 0.2, I wanted to be able to support JSON queues in a way that would not preclude the extension from running on PostgreSQL 9.1.  The solution was to return to the open/closed principle and build a system which could be extended easily for arbitrary types.  The result was much more powerful than initially hoped for (in fact, I am now using queues with integer and IP address payloads).

In 0.1, the code looked like this:

CREATE TABLE pg_mq_base (
    msg_id bigserial not null,
    sent_at timestamp not null default now(),
    sent_by name not null default session_user,
    delivered_at timestamp
);

CREATE TABLE pg_mq_xml (
    payload xml not null,
    primary key (msg_id)
) inherits (pg_mq_base);

CREATE TABLE pg_mq_text (
    payload text not null,
    primary key (msg_id)
) inherits (pg_mq_base);

CREATE TABLE pg_mq_bytea (
    payload bytea not null,
    primary key (msg_id)
) inherits (pg_mq_base);


The approach here was to use table inheritance so that new queue types could be easily added.  When a queue is added, a table is created like one of the other tables, including all indexes, etc.  The relevant portion of the pg_mq_create_queue function is:

EXECUTE 'CREATE TABLE ' || quote_ident(t_table_name) || '(
    like ' ||  quote_ident('pg_mq_' || in_payload_type ) || ' INCLUDING ALL
  )';



The problem here is that while it was possible to extend this, one couldn't do so very easily without modifying the source code of the functions.  In 0.2, we reduced the latter part to:

-- these are types for return values only.  they are not for storage.
-- using tables because types don't inherit

CREATE TABLE pg_mq_text (payload text) inherits (pg_mq_base);
CREATE TABLE pg_mq_bin (payload bytea) inherits (pg_mq_base);


But the real change is that the payload type is now specified per queue at creation time.  The table creation portion of pg_mq_create_queue is now:

EXECUTE 'CREATE TABLE ' || quote_ident(t_table_name) || '(
    like pg_mq_base INCLUDING ALL,
    payload ' || in_payload_type || ' NOT NULL
  )';


This has the advantage of allowing payloads of any type known to PostgreSQL.  We can have queues for mac addresses, internet addresses, GIS data, and even complex types if we want to do more object-relational processing on output.
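Hedging on the exact signature (the snippets above show only the in_payload_type parameter; I am assuming here that the first argument is the queue name, so consult the extension's documentation for the actual argument list), creating such queues might look like:

SELECT pg_mq_create_queue('mac_jobs', 'macaddr');
SELECT pg_mq_create_queue('ip_jobs', 'inet');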

This approach will become more important after 0.3 is out and we begin working on object-relational interfaces on pg_message_queue.

Conclusions


The Open/Closed principle is where we start to see a mismatch between object-oriented application programming and object-relational database design.  It's not that the basic principle doesn't apply, but just that it does so in often strange and counter-intuitive ways.

Wednesday, January 23, 2013

Building SOLID Databases: Single Responsibility and Normalization

Introduction


This instalment will cover the single responsibility principle in object-relational design, and its relationship both to data normalization and object-oriented application programming.   While single responsibility is a fairly easy object-oriented principle to apply here, I think it is critical to explore in depth because it helps provide a clearer framework to address object-relational design.

As in later instalments I will be using snippets of code developed elsewhere for other areas.  These will not be full versions of what was written, but versions sufficient to show the basics of data structure and interface.

Relations and Classes:  Similarities


Relations and classes, on the surface, look deceptively similar, to the point where one can look at relations as classes, and in fact this equivalence is the basis of object-relational database design.

Objects are data structures used to store state, which have identity and are tightly bound to interface.  Relations are data structures which store state, and if they meet second normal form, have identity in the form of a primary key.  Object-relational databases then provide interface and thus in an object-relational database, relations are classes, and contain sets of objects of a certain class.

Relations and Classes:  Differences


So similar are relations and classes in structure that a very common mistake is to simply use a relational database management system as a simple object store.  This tends to result in a brittle database, leading to a relatively brittle application.  Consequently many of us see this approach as something of an anti-pattern.

The reason why this doesn't work terribly well is not that the basic equivalence is bad but that relations and classes are used in very different ways.  On the application layer, classes are used to model (and control) behavior, while in the database, relations and tuples are used to model information.  Thus tying database structures to application classes in this way essentially overloads the data structures, turning the structures into reporting objects as well as behavior objects.

Relations thus need to be seen not only as classes but as specialized classes used for persistent storage and reporting.  They therefore have fundamentally different requirements from the behavior classes in the application, and different reasons to change.  An application class typically changes when there is a need for a change in behavior, while a relation should change only when there is a change in data retention or reporting.

Relations have traditionally tended to be divorced from interface, and this provides a great deal of power.  While classes tend to be fairly opaque, relations tend to be very transparent.  The reason is that while both represent state information, whether application state or other facts, objects traditionally encapsulate behavior (and thus act as building blocks of behavior), while relations always encapsulate information and are building blocks of information.  Thus the data structures of relations must be transparent, while object-oriented design tends to push for less transparency and more abstraction.

It is worth noting, then, that because these systems are designed to do different things, there are many DBAs who suggest encapsulating the entire database behind an API defined by stored procedures.  The typical problem with this approach is that loose coupling of the application to that interface is difficult (but see one of the first posts on this blog for a solution).  When the db interface is tightly coupled to the application, one typically ends up with problems on several levels, and it tends to sacrifice good application design for good relational database design.

Single Responsibility Principle in Application Programming


The single responsibility principle holds that every class should have a single responsibility which it should entirely encapsulate.  A responsibility is defined as a reason to change.  The canonical example of a violation of this principle is a class which might format and print a report.  Because both data changes and cosmetic changes may require a change to the class, this would be a violation of the principle at issue.   In an ideal world, we'd separate out the printing and the formatting so that cosmetic changes do not require changes when data changes are made and vice versa.

The problem of course with the canonical example is that it is not self-contained.  If you change the data in the report, it will almost certainly require cosmetic changes.  You can try to automate those changes but only within limits, and you can abstract interfaces (dependency inversion) but in the end if you change the data in the report enough, cosmetic changes will become necessary.

Additionally a "reason to change" is epistemologically problematic.  Reasons foreseen are rarely if ever atomic, and so there is a real question as far as how far one pushes this.  In terms of formatting a report, do we want to break out the class that adjusts to paper size so that if we want to move from US Letter to A4 we no longer have to change the rest of the cosmetic layout?

Perfect separation of responsibilities in that example is thus impossible, as it probably always is --- you can only change business rules to a certain point before interfaces must change, and when that happens the cascading flow of necessary changes can be quite significant.

The database is, however, quite different in that the responsibility of database-level code (including DDL and DML) is limited to the proposition that we should construct answers from known facts.  This makes a huge difference in terms of single responsibility, and it is possible to develop mathematical definitions for single responsibility.

Not only is this possible but it has been done.  All of the normal forms from third on up address single responsibility.

The Definition of Third Normal Form

Quoting Wikipedia,

Codd's definition states that a table is in 3NF if and only if both of the following conditions hold:
  • The relation R (table) is in second normal form (2NF)
  • Every non-prime attribute of R is non-transitively dependent (i.e. directly dependent) on every superkey of R.
A non-prime attribute is an attribute not part of a superkey.  In essence what third normal form states is that every relation must contain a superkey and values functionally and directly dependent on that superkey.
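As a quick illustration with invented tables: in the first relation below, dept_name depends on the key emp_id only transitively, through dept_id, so the relation is not in 3NF; the decomposition restores it:

-- not 3NF: dept_name depends on dept_id, not directly on emp_id
CREATE TABLE employee_denormalized (
    emp_id int PRIMARY KEY,
    emp_name text NOT NULL,
    dept_id int NOT NULL,
    dept_name text NOT NULL
);

-- 3NF: every non-prime attribute depends directly on its superkey
CREATE TABLE department (
    dept_id int PRIMARY KEY,
    dept_name text NOT NULL
);

CREATE TABLE employee (
    emp_id int PRIMARY KEY,
    emp_name text NOT NULL,
    dept_id int NOT NULL REFERENCES department (dept_id)
);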

This will become more important as we look at how data anomalies dovetail with single responsibility.

Normalization and Single Responsibility


The process of database normalization is an attempt to create relational databases in which data anomalies do not exist.  Data anomalies occur where modifying data either requires modifying other data to maintain accuracy (where no independent fact changes are recorded), or where existing data may project current or historical facts not actually in existence (join anomalies).

This process occurs by breaking down keys and superkeys, and their dependencies, such that data is tracked in smaller, self-contained units.  Beginning at third normal form, one can see relations forming single responsibilities: managing data directly dependent on their superkeys.  From this point forward, a relation's structure would change (assuming no decision to further decompose it into a higher normal form) if and only if a change is made to what data is tracked that is directly dependent on its superkey.

The responsibility of the database layer is the storage of facts and the synthesis of answers.  Since the storage relations themselves handle the first, normalization is a prerequisite to good object-relational design.

The one major caveat here, however, is that first normal form's atomicity requirement must be interpreted slightly differently in object-relational setups, because more complex data structures can be atomic than in a purely relational design.  In a purely relational database, the data types that can be used are relatively limited, and therefore facts must be maximally decomposed.  For example, we might store an IPv4 address plus network mask as four ints for the address and an int for the network mask, or we might store the address as a single 32-bit int plus another int for the network mask, but the latter poses problems of display that the former does not.  In an object-relational database, however, we might store the address as an array of four ints or, if we need better performance, build a custom type.  If storage is not a problem but ease of maintenance is, we might even define relations, domains, and such to hold IP addresses, and then store the tuple in a column with appropriate functional interfaces.

None of these approaches necessarily violate first normal form, as long as the data type involved properly and completely encapsulates the required behavior.  Where such encapsulation is problematic, however, they would violate 1NF because they can no longer be treated as atomic values.  In all cases, the specific value has a 1:1 correspondence to an IP address.
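PostgreSQL's built-in inet type already behaves this way: the value has internal structure, but the schema can treat it as atomic because the type fully encapsulates that structure:

CREATE TABLE host_interface (
    host_id bigint NOT NULL,
    addr inet NOT NULL,  -- internally structured, but atomic to the schema
    PRIMARY KEY (host_id, addr)
);

-- functional interfaces operate on the encapsulated structure
SELECT addr, masklen(addr), network(addr) FROM host_interface;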

Additionally where the needs are different, storage, application interface, and reporting classes should be different (this can be handled with updateable views, object-relational interfaces, and the like).

Object-Relational Interfaces and Single Responsibility


For purely relational databases, normalization is sufficient to address single responsibility.  Object-relational designs bring some additional complexity because some behavior may be encapsulated in the object interfaces.  There are two fundamental cases where this may make a difference, namely in terms of compositional patterns and in terms of encapsulated data within columns.

A compositional pattern in PostgreSQL typically occurs when we use table inheritance to manage commonly co-occurring fields which are functionally dependent on many other fields in a database.  For example, we might have an abstract notes table, and then have various tables which inherit it, possibly as parts of other larger tables.  A common case where composition makes a big difference is in managing notes.  People may want to attach notes to all kinds of other data in a database, and so one cannot say that the text or subject of a note is dependent on any one key elsewhere in the database.

A typical purely relational approach is to either have many independently managed notes tables or have a single global note table which stores notes for everything, and then have multiple join tables to add join dependencies.  The problem with this is that the note data is usually dependent logically, if not mathematically, on the join dependency, and so there is no reasonable way of expressing this without a great deal of complexity in the database design.

An object-relational approach might be to have multiple notes tables, but have them inherit the table structure of a common notes table.  This table can then be expanded and interfaces added as needed, and it should fulfill the single responsibility principle even though we might not be able to say that there is a natural functional dependency within the table itself.
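A minimal sketch of this pattern, with invented table and column names:

-- abstract parent: defines the common structure, stores no rows itself
CREATE TABLE note (
    subject text NOT NULL,
    body text NOT NULL,
    created_at timestamp NOT NULL DEFAULT now(),
    created_by name NOT NULL DEFAULT session_user
);

-- concrete children: each carries its own join dependency
CREATE TABLE customer_note (
    customer_id int NOT NULL,  -- would reference a customer table
    PRIMARY KEY (customer_id, created_at)
) INHERITS (note);

CREATE TABLE invoice_note (
    invoice_id int NOT NULL,   -- would reference an invoice table
    PRIMARY KEY (invoice_id, created_at)
) INHERITS (note);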

The second case has to do with storing complex information in columns.  Here stability and robustness of code is especially important, and traditional approaches of the single responsibility principle apply directly to the contained data type.

Example:  Machine Configuration Database and SMTP configuration


One of my current projects is building a network configuration database for a LedgerSMB hosting business I am helping to found (more on this soon).  For competitive reasons I cannot display the whole codebase here.  However, what I would like to do is show a very abbreviated version of the approach I used to solve a very specific issue.

One of the basic challenges in a network configuration database is that the direct functional dependencies for a given machine may become quite complex when we assume that a given piece of network software is not likely to be running more than once on a given machine.  Additionally we often want to ensure that certain sorts of software are set to be configured for certain types of machines, and so constraints can exist that force wider tables.

The width and complexity of some configuration tables can possibly pose a management problem over time for the reason that they may not be obviously broken into manageable chunks of columns.

One possible solution is to decompose the storage class into smaller mix-ins, each of which expresses a set of functional dependencies on a specific key, fully encapsulating a single responsibility.  The overall storage class then exists to manage cross-mixin constraints and handle the actual physical storage.  The data can then be presented as a unified table, or as multiple joined tables (and this works even where views would add significant complexity).  In this way the smaller sub-tables can be given the responsibility of managing the configuration of specific pieces of software.

We might therefore have tables like:

-- abstract table, contains no data
CREATE TABLE mi_smtp_config (
    mi_id bigint,
    smtp_hostname text,
    smtp_forward_to text
);

CREATE TABLE machine_instance (
   mi_id bigserial,
   mi_name text not null,
   inservice_date date not null,
   retire_date date,
   ....
) INHERITS (mi_smtp_config, ...);

The major advantage to this approach is that we can easily check and add which fields are set up to configure which software, without looking through a much larger, wider table.   This also provides additional interfaces for related data, and the like.

For example, "select * from mi_smtp_config" is directly equivalent of "select (mi::mi_smtp_config).* from machine_instance mi;

Conclusions


When we think of relations as specialized "fact classes" as opposed to "behavior classes" in the application world, the idea of the single responsibility principle works quite well with relational databases, particularly when paired with other encapsulation processes like stored procedures and views.

In object-relational designs, the principle can be used as a general guide for further decomposing relations into mix-in classes, or creating intelligent data types for attributes, and it becomes possible to solve a number of problems in this regard without breaking normalization rules.

Friday, January 18, 2013

Building SOLID Databases: Introduction

The SOLID design approach is a set of principles developed in object-oriented programming.  This series will explore the applicability of these principles to Object-Relational databases, and contrast the way in which they manifest there with the way in which they manifest in the application layer.

You can't program a database like you'd program an application.  The database essentially serves twin functions: a persistence store, and a model of information derivable from that store.  The ACID consistency model therefore limits what sorts of programming you can reasonably do in the database, because it is a bad idea, outside of logging, to mix transactional and non-transactional side effects.

As an information model, then, the way these principles are applied is rather different.  One way to think of it is that if object-oriented design might be seen as a triangle, applying it in the db requires turning that triangle upside down, so you have a flat base to build your application on.

This series will look at applying SOLID principles to object-relational database design, comparing and contrasting the way in which the principles get applied to that of general software development.  Many aspects of relational design in fact play into these topics and so one ends up with a very different mix than one might have with pure object-oriented design.

In general this series continues to look at relations as sets of objects rather than sets of tuples.  This means that the question is how we define data structures and interfaces so that we can bridge object and relational worlds in the database.  The SOLID principles are good starting points but the sort of logic done is fundamentally different and so they are applied in different ways.

Welcome to the series.  I hope you enjoy it.

Thursday, January 10, 2013

An awesome one-liner for PostgreSQL

Recently I was looking at options for exploring CIDR blocks in PostgreSQL.  In particular, I was wondering about checking a CIDR block for unallocated IP addresses in another table.

I had been aware of the network address types in PostgreSQL for some time but had not been aware of how powerful they actually were.  I decided to write a function to expand a CIDR block into a list of IP addresses.  While my initial version wasn't perfect (it includes network and broadcast addresses in the block), changing that will not be hard.

The first version was:

CREATE OR REPLACE FUNCTION all_ips(cidr)
RETURNS SETOF inet LANGUAGE SQL IMMUTABLE AS
$$
 select $1 + s from generate_series(0,  broadcast($1) - network($1)) s;
$$;

That's it.
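As a quick sanity check on a small block (note that the output may display with the /30 suffix, since inet values carry the mask along):

select all_ips('192.168.1.0/30');
-- this first version returns four rows: 192.168.1.0 through
-- 192.168.1.3, including the network and broadcast addresses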

To exclude network and broadcast addresses, two simple modifications are required:

CREATE OR REPLACE FUNCTION all_ips(cidr)
RETURNS SETOF inet LANGUAGE SQL IMMUTABLE AS
$$
 select $1 + s from generate_series(1,  broadcast($1) - (network($1)) - 1) s;
$$;

And there you have it.  It is fast, or at least as fast as can be expected.

mq_test=# explain analyze
mq_test-# select all_ips('192.168.1.0/24');
                                      QUERY PLAN
---------------------------------------------------------------------------------------
 Result  (cost=0.00..0.26 rows=1 width=0) (actual time=0.213..0.511 rows=254 loops=1)
 Total runtime: 0.580 ms
(2 rows)

Ok, not so much for 10.0.0.0/8.....

                                            QUERY PLAN
---------------------------------------------------------------------------------------------------
 Result  (cost=0.00..0.26 rows=1 width=0) (actual time=5977.386..32370.877 rows=16777214 loops=1)
 Total runtime: 37185.476 ms
(2 rows)

But what do you expect for generating almost 17 million rows?

Friday, January 4, 2013

Object-Relational Algebra 3: The Series Join function

I am back from a break spent getting LedgerSMB 1.4 closer to release (beta 2 has been released and we are moving towards beta 3).  Anyway, it is time to finish up the series on object-relational algebra.


The second important addition I would make to relational algebra to give it object-relational capabilities is what I call a "series join" operation.  A series join only makes sense in object-relational approaches because the output can be minimal and yet certain functional dependencies on that output can be readily defined.

I use the capital sigma Σ to refer to a series join, acknowledging that the lower case sigma refers to the select operation and so there is a partial conflict.

A series join takes a theta condition much like any other join, but this theta condition operates on the input relation to the series join.  The input set is joined to itself in the first iteration and then to the output set of the previous iteration in subsequent iterations.  This is repeated until the output set does not change with subsequent iterations.  In a finite data set, the mappings will also be finite.  An optional subscript provides a starting set of values in the input relation and an optional superscript provides a maximum iteration depth.

The output is a set of tuples (o1, o2), where o1 is the initial object's key or reference and o2 is the linked-to object's key or reference.  From here the following functional dependencies arise: path (how we get from one record to another) and depth (how many iterations we must pass through to reach the destination record) are two obvious ones.

A series join provides a useful way to express transitive operations.  This allows, for example, binary transitive closure to be expressed and tested because one can generate all possible paths from one node on a graph up until the point where they loop back on themselves.

Series join operations in the SQL world are roughly approximated by the CONNECT BY and WITH RECURSIVE constructs, both of which are typically used to construct a finite series of self-joins.  However there are key differences too.  In my model we are less worried about the tuple membership than what we can clearly derive from a series join.

Please pardon my layout since I am not quite sure how to make mathematical equations display perfectly in blogger.

Suppose we have a relation employee.

We might have reports = employee Σ₁³ with a theta condition of id θ manager, and this would provide a list of the direct reports to this employee, their direct reports, and their direct reports in turn.  We can also express certain functional dependencies, such as depth(reports), which is between 0 and 3, and path(reports), which will show the chain of command between the report and the employee.  If reports = employee Σ₁ and employee 1 is the CEO, then we get the entire organizational chart of the business.
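In SQL terms, a rough approximation of reports = employee Σ₁³ can be written with WITH RECURSIVE; an employee table with id and manager columns is assumed here:

WITH RECURSIVE reports (o1, o2, path, depth) AS (
    -- start from employee 1 (the subscript) at depth 0
    SELECT e.id, e.id, ARRAY[e.id], 0
      FROM employee e
     WHERE e.id = 1
    UNION ALL
    -- theta condition id θ manager; stop at depth 3 (the superscript)
    SELECT r.o1, e.id, r.path || e.id, r.depth + 1
      FROM reports r
      JOIN employee e ON r.o2 = e.manager
     WHERE r.depth < 3
)
SELECT o1, o2, path, depth FROM reports;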

Series joins allow us to do graph traversal, hierarchical searches, and more in an OR database, and approximations of this have been implemented in the relational model.  They are mathematically clear and clean, avoid magic operations, and solve a great number of problems.