Thursday, January 23, 2014

PGObject on CPAN: NoSQL Ideas for PostgreSQL Applications

One of the legitimate points Martin Fowler has argued in favor of NoSQL databases is that expecting application to directly manipulate relational data is far less clean from an application design perspective than having a database encapsulated behind a loosely coupled interface (like a web service).  I would actually go further and point out that such an approach invariably leads to bad database design too because the information layout becomes the contracted software API and thus one either has to spend a lot of time and effort separating logical from physical storage layouts or one ends up having an ossified physical layout that can never change.

This problem has been well understood in the relational database community for a long time.  The real problem has, however, been tooling.  There are effectively two traditional tools for addressing this issue:

1.  Updateable views.  These then form a relational API that allows the database to store information in a way separate from how the application sees it.  If you are using an ORM, this is a really valuable tool.

2.  Stored procedures.  These provide a procedural API, but traditionally a relatively brittle one based on the same approach used by libraries.  Namely you typically have an ordered series of arguments, and all users of the API are expected to agree on the ordering and number of arguments.  While this may work passably for a single system (and even there lead to "dependency hell"), it poses significant issues in a large heterogeneous environment because the number of applications which must be coordinated in terms of updates becomes very high.  Oracle solves this using revision based editions, so you can have side-by-side versioning of stored procedures, and allows applications to specify which edition they are working on.  This is similar to side-by-side versioning of C libraries typical for Linux, or side-by-side versioning of assemblies in .Net.

On the application side, ORMs have become popular, but they still lead to a relational API being contractual, so are really best used with updateable views.

In part because of these shortcomings, we started writing ways around them for LedgerSMB starting with 1.3.  The implementations are PostgreSQL-specific.  More recently I wrote some Perl modules, now on CPAN, to implement these concepts.  These create the general PGObject framework, which given an application access to PostgreSQL stored procedures in a loosely coupled way.  It is hoped that other implementations of the same ideas will be written and other applications will use this framework.

The basic premise is that a procedural interface that is discoverable allows for easier management of software contracts than one which is non-discoverable.  The discoverability criteria then become the software contract.

PGObject allows what I call "API Paradigms" to be built around stored procedures.  An API paradigm is a consistent specification of how to write discoverable stored procedures and then re-use them in the application.  Most namespaces under PGObject represent such "paradigms."  The exceptions currently are the Type, Util, Test, and Debug second-tier namespaces.  Currently PGObject::Simple is the only available paradigm.

What follows is a general writeup of the currently usable PGObject::Simple approach and what each module does:

PGObject


PGObject is the bottom half module.  It is designed to service multiple top-half paradigms (the Simple paradigm is described below, but also working on a CompositeType paradigm which probably won't be ready initially yet).  PGObject has effectively one responsibility:  coordinate between application components and the database.  This is split into two sub-responsibilities:

  1. Locate and run stored procedures
  2. Encode/decode data for running in #1 above.

Specifically outside the responsibility of PGObject is anything to do with managing database connections, so every call to a database-facing routine (locating or running a stored procedure) requires a database handle to be passed to it.

The reason for this is that the database handles should be managed by the application not our CPAN modules and this needs to be flexible enough to handle the possibility that more than one database connection may be needed by an application.  This is not a problem because developers will probably not call these functions unless they are writing their own top-half paradigms (in which case the number of places in their code where they issue calls to these functions will be very limited).

A hook is available to retrieve only functions with a specified first argument type.  If more than one function is found that matches, an exception is thrown.

The Simple top-half paradigm (below) has a total of two such calls, and that's probably typical.

The encoding/decoding system is handled by a few simple rules.

On delivery to the database, any parameter that can('to_db') runs that method and inserts the return value in place of the parameter in the stored procedure.  This allows one to have objects which specify how they serialize.  Bigfloats can serialize as numbers, Datetime subclasses can serialize as date or timestamp strings, and more complex types could serialize however is deemed appropriate (to JSON, a native type string form, a composite type string form, etc).

On retrieval from the database, the type of each column is checked against a type registry (sub-registries may be used for multiple application support, and can be specified at call time as well).  If the type is registered, the return value is passed to the $class->from_db method and the output returned in place of the original value.  This allows for any database type to be mapped back to a handler class.

Currently PGObject::Type is a reserved namespace for dealing with released type handler classes.  We have a type handler for DateTime and one for BigFloat written already and working on one for JSON database types.

PGObject::Simple


The second-level modules outside of a few reserved namespaces designate top-half paradigms for interacting with stored procedures.  Currently only Simple is supported.

This must be subclassed to be used by an application and a method provided to retrieve or generate the appropriate database connection.  This allows application-specific wrappers which can interface with other db connection management logic.

All options for PGObject->call_procedure supported including running aggregates, order by, etc.  This means more options available for things like gl reports database-side than the current LedgerSMB code allows.

$object->call_dbmethod uses the args argument by using a hashref for typing the name to the value.  If I want to have a ->save_as_new method, I can add args => {id => undef} to ensure that undef will be used in place of $self->{id}.

Both call_procedure (for enumerated arguments) and call_dbmethod (for named arguments) are supported both from the package and object.  So you can MyClass->call_dbmethod(...) and $myobj->call_dbmethod.  Naturally if the procedure takes args, you will need to specify them or it will just submit nulls.

PGObject::Simple::Role


This is a Moo/Moose role handler for PGObject::Simple.

One of the main features it has is the ability to declaratively define db methods.  So instead of:

sub int {
    my $self = @_;
    return $self->call_dbmethod(funcname => 'foo_to_int');
}

You can just

dbmethod( int => (funcname => 'foo_to_int'));

We will probably move dbmethod off into another package so that it can be imported early and used elsewhere as well.  This would allow it to be called without the outermost parentheses.

The overall benefits of this framework is that it allows for discoverable interfaces, and the ability to specify what an application needs to know on the database.  This allows for many of the benefits of both relational and NoSQL databases at the same time including development flexibility, discoverable interfaces, encapsulation, and more.