Thursday, October 11, 2012

NoSQL-like Approaches to PostgreSQL, a reply to Martin Fowler

The major point of many of Martin Fowler's blog posts, repeated in his book, is that NoSQL represents a movement away from integrating on the database and towards encapsulating databases in applications and using services to integrate data.  However, that this isn't the way higher end RDBMS's are always used speaks more to the market than the RDBMS technology itself.  For the last five years I have been building relational databases along a similar approach, using web services as an inspiration for interfaces.  This post discusses a few different approaches to doing this on PostgreSQL.  Note that other RDBMS's may be different in many of these regards.

The Promise of the Relational Database Model


Relational databases are designed to store information in application neutral formats and then present it in various ways to various applications.  The goal is, as Fowler puts it, the "integration database."  In order to facilitate this goal in a maintainable way, most RDBMS's come with solid sets of features designed to encapsulate physical storage behind application-specific presentation of the data stored.  These include views, functions, and stored procedures.  These features allow a database engineer to build applications which encapsulate the physical storage, so that the database management system essentially serves out an information model through a myriad of client-specific interfaces.  This achieves what Fowler is advocating, namely encapsulating the database in the application, but it allows for declarative access to that API.   At least in theory, there is no difference between what Fowler advocates for NoSQL and what many DBA's do with an RDBMS.  This also suggests that Fowler is on to something important, but that ditching the RDBMS may in many cases not be the answer.

Why the RDBMS Rarely Delivers on That Promise


For all the theory though, most RDBMS's are rarely used in this way.  I think there are several factors that typically weigh against such encapsulation:

  • Investment (expertise and license costs) in a specific RDBMS by IT departments means that application developers want to write portable code, and thus write lowest common denominator code
  • Application developers often want to ensure that all access goes through their databases and thus do not want to centralize application logic in the database, and
  • Many of the standards ensure that tools are not really optimal for this task.  The best tools are those which go beyond the standards, thus with greater implications regarding portability.
The first market factor has to do with the fact that relational database management systems are expensive systems in several senses.  In addition to the (often hefty) licensing costs, you have the necessity to hire people with experience and/or train them.   Companies usually address this cost by consolidating their expertise and servers on a single RDBMS.  Therefore you see Oracle shops, SQL Server shops, DB2 shops, etc.  This leaves lock-in as a major cost for businesses, and it reduces the market for database-specific applications.  Application developers therefore, quite rationally, more often choose to write SQL code that runs on any different RDBMS.  This naturally requires eschewing advanced features, like Oracle Objects or pl/pgsql functions for vanilla SQL.

 The problem of course is that it is hard to encapsulate data when you are limited to the most basic feature set.  While views may work, making them updatable may be different, and have different consequences on every RDBMS.  Stored procedures or functions are far worse in this regard.  Consequently the need to write portable code requires essentially dispensing  with the very tools used to encapsulate data.  For this reason I don't think you can both write a database for use by multiple applications (i.e where the internal data structures are encapsulated) and also write a database that runs on multiple RDBMS's.  You have to choose.  Developers cannot be blamed for choosing the option which gives their applications the most market and lock-in.

The competing desires for application lock-in is another factor.  RDBMS vendors typically want to restrict access to a certain number of client access licenses or seat licenses, and if connections can be pooled in middleware this can help circumvent some of this limitation (it may cause contractual issues depending on the nature of the EULA, but the technical controls are easily worked around in this manner and at some point you run into problems with definitions, particularly when systems are loosely coupled).  Application developers want to sell their own licenses and this can only be done if the connections are checked in the application layer.  Therefore it is against the interests of many application developers to ship with encapsulated database schemas.  Instead, the RDBMS is used largely like a private data store with some additional reporting capabilities. 

Some RDBMS vendors actually optimize their systems for the above needs.  One feature SQLAnywhere offers is that developers like Intuit can lock a database to a specific application, banning all third party access, and a large part of MySQL's current popularity may be tied to the fact that it is well-optimized for moving unencapsulated databases that run on other DB's to it.  In particular the sql_mode setting on one hand facilitates porting code to MySQL and on the other makes it relatively unsafe for this sort of encapsulation.

However, market factors aren't the only ones pushing developers away from building databases in this manner.  The perpetual debate over stored procedurs illustrates a mismatch between at least one commonly used tool and the typical use case of that tool.

 Stored procedures, as the name implies are essentially imperative constructs which take specific arguments and return records.  The idea is that they provide a basic imperative interface language to a database.  Instead of SELECT .... you issue CALL some_proc(arg1, ....);

 There are several problems however with stored procedures as they are typically used.  The first is that they still have significant impedence mismatch with object-oriented programming.  The data structures they return tend to be fairly rigid so adding a new column tends to require multiple changes in code, often at least one in each layer of the program.  You can't just define your object once and then re-use over and over in other layers of your code.

A second significant problem is that stored procedures are at their best when they are a few well-connected and modularized queries, and at their worst when they are many different queries tied together in complex ways.  This leads to limited utility in actually encapsulating the data, and in all cases the abstraction layer is not really a good match for what is typically done with it.  For these reasons stored procedures as typically used make the most sense when working outside the object-oriented approach.

 Stored procedures have been replaced by object-relation mapping tools (ORMs) which attempt to provide a mapping between a relational interface and an object-oriented development environment.  ORMs automate basic database operations for things like insert, select, update, and delete operations, but they don't really provide an abstraction regarding the actual data mapping between the behavioral app layer and the information model layer.  This can currently only be done in the information model itself, so ORM's are best paired with updatable views, but this comes at the cost of portable SQL code.

Aside from these approaches, or moving to NoSQL, there are a variety of methods to encapsulate the data store inside a relational application.  These approaches require understanding both the desire for encapsulatin and interfaces, and the desire to integrate with applications as a service rather than as a relatively simple persistence layer manipulated mathematically.

Service Oriented Database Architecture

For most of the last five years, I have been building LedgerSMB using an approach I call "Service Oriented Database Architecture," or SODA, which is inspired in part by RESTful web services and SOAP.  From SOAP I took the emphasis on discoverability, and from REST, I took, to the extent possible, the goal of re-using everything in the database that can be re-used in order to define an API.  This approach thus uses the database semantics the way REST re-uses HTTP semantics, and while there are some differences forced by the way PostgreSQL does things (every function called by a SELECT statement), this is not the end of the world.  The goal, of course is to build database interfaces suitable for loosely coupled application/database combinations.

 The SODA approach is based on a number of principles, namely that:

  • Access to the database is through functions, not relations,
  • Functions, to the extent possible, always return a useful result, usually in a data structure corresponding to an object,
  • Function names (within the domain of this architecture) are unique, and
  • Function argument names correspond to the properties expected.
  • The database is responsible for its own security.
 If these are followed then the functions can be mapped, discovered, and run at run-time.  Here is a PHP class that implements such a run-time mapping:

 class DBObject
{
    protected $schema = 'public';
   
   /* function __call($procname, $order = null)
    * Maps in object properties into an arg array and calls call_procedure
    *
    * db procedures are checked for argument names and these are stripped of 
    * the "in_" prefix.  After this is complete, a property is matched and
    * mapped in.
    */
    public function __call($procname, $order = null){
        # variable initializations
        $procargs = array();

        # proc data lookup
        $procquery = "
            SELECT proname, pronargs, proargnames, proargtypes FROM pg_proc 
             WHERE proname = $1
               AND pronamespace = 
                   coalesce((SELECT oid FROM pg_namespace WHERE nspname = $2), 
                        pronamespace)";
         $db = DB::getObject();
         $sth = pg_query_params($db->dbhandle, $procquery, 
                               array($procname, $this->schema));
         $procdata = pg_fetch_assoc($sth);

         if (0 == pg_num_rows($sth)){
            throw new \exception('Function not found');
         }
         # building argument list
         preg_match('/^{(.*)}$/', $procdata['proargnames'], $matches);
         $procargnames = $phpArr = str_getcsv($matches[1]);
         foreach ($procargnames as $argname){
              $argname = preg_replace('/^in_/', '', $argname);
              array_push($procargs, $this->$argname);
         }

         # calling call_procedure
         return $this->call_procedure($procname, $procargs, $order);
    }
    /* function call_procedure($procname, $args = array(), $order = null)
     *
     * generates a query in the form of:
     * select * from $procname($1, $2, etc) ORDER BY colname
     * and runs it.  It returns an array of associative arrays representing the
     * output.
     */
    public function call_procedure($procname, $args = array(), $order = null){
         $results = array();
         # query generation
         $query = "select * from "
                       . pg_escape_identifier($this->schema) . "." 
                       . pg_escape_identifier($procname) . "(";
         $count = 1;
         $first = 1;
         foreach ($args as $arg){
             if (!$first){
                 $query .= ", ";
             }
             $query .= '$' . $count;
             $first = 0;
             ++ $count;
         }
         $query .= ')';
         if ($order){
            $query .= " ORDER BY " . pg_escape_identifier($order);
         }


         # execution and returning results
         $db = DB::getObject();
         $sth = pg_query_params($db->dbhandle, $query, $args);
         if (!$sth){
             return null;
         }
         for ($i = 0; $i < pg_num_rows($sth); $i++){
              print "retrieving row $i \n";
              array_push($results, pg_fetch_assoc($sth, $i));
         }
         return $results;
    }
    /* function merge(array $data)
     * merges data into the current object from an associative array
     * 
     * null or undef values are not set
     */
    public function merge($data){
        foreach ($this as $prop => $value){
             if (array_key_exists($prop, $data) and null != $data[$prop]){
                 $this->$prop = $data[$prop];
             }
        }
    }
    /* function is_allowed_role($role)
     * returns true if the user is allowed the role for the specific db 
     * i.e. $role should not include the prefix.  Otherwise it returns false
     */
    public function is_allowed_role($role){
        $db = DB::getObject();
        return $db->is_allowed_role($role);
    }
}  


The above code seems long but what it allows essentially is inheriting objects to simply declare that methods are mapped to stored procedures, and these mappings are automatically adjusted at the time that stored procedure is actually called.  Additionally this centralizes essentially all db access in a single file where it can be audited for SQL injection issues and the like, and you can go on programming as if you are hitting an object-oriented database.  Of course there are times when you need to  make modifications on many layers, such as when a new attribute needs to be added and stored, and it isn't in the table yet, but generally these are relatively rare.

In PHP, I can have a class which checks the version and selects the appropriate stored procedure easily even if they expect different object properties as arguments:




public function save(){
     $procname = 'company__save';
     if ('1.3' == \LedgerSMB\Config\DBVERSION){
         $procname = 'company_save';
     }
     $data = array_pop($this->$procname());
     $this->merge($data);
}



 What might a stored procedure look like?  Here is one:


CREATE OR REPLACE FUNCTION asset_dep_straight_line_yr_d
(in_asset_ids int[],  in_report_date date, in_report_id int)
RETURNS bool AS
$$
     INSERT INTO asset_report_line (asset_id, report_id, amount, department_id,
                                   warehouse_id)
     SELECT ai.id, $3,
            asset_dep__straight_line_base(
                  ai.usable_life, -- years
                  ai.usable_life -
                  get_fractional_year(coalesce(max(report_date),
                                         start_depreciation,
                                         purchase_date),
                                       coalesce(start_depreciation,
                                         purchase_date)),
                  get_fractional_year(coalesce(max(report_date),
                                         start_depreciation,
                                         purchase_date),
                                $2),
                  purchase_value - salvage_value,
                  coalesce(sum(l.amount), 0)),
            ai.department_id, ai.location_id
       FROM asset_item ai
  LEFT JOIN asset_report_line l ON (l.asset_id = ai.id)
  LEFT JOIN asset_report r ON (l.report_id = r.id)
      WHERE ai.id = ANY($1)
   GROUP BY ai.id, ai.start_depreciation, ai.purchase_date, ai.purchase_value,
            ai.salvage_value, ai.department_id, ai.location_id, ai.usable_life;

    UPDATE asset_report SET report_class = 1 WHERE id = $3;

    select true;
$$ language sql; 

Unfortunately the above has to return true because the nature of the operation does not really provide another effective approach though if we find one, it will be adjusted in the following major version upgrade.

This approach is generally nice because  it is light-weight and conforms relatively well to more rapidly changing environments.  However, the lack of structure imposed may be a problem in some environments also.  Where more engineering is required, the other approaches may work better.  This works relatively well, however, if you build your API to assume a relatively loose coupling between your database and the application hitting this sort of API.

Object-Oriented Database-level interface


Where tighter coupling is required, an object-oriented interface may be better.  In some ways this is worth avoiding because it leads to very ugly SQL queries, for example:


SELECT (save).*
FROM save(row(null, '12345', 'My Company, LTD', 232, '33-55432334')::company);


The overall issue here is that you have the possibility of multiple levels of discoverability involved.  It works very well for code generators, but not so wellf or the human masters.  Note the above could be rewritten, assuming no additional arguments as:

 SELECT (save).*
   FROM (row(null, '12345', 'My Company, LTD', 232, '33-55432334')::company).save;


the advantage to this approach is that your objects form classes whose structure is discoverable, and overloading becomes possible.    Code generators thus can work well, as the database contains all information needed to create all boilerplate code.  The database code itself is simplified as well.  On the other hand, troubleshooting can be a bit of a pain.   It also has the property of essentially requiring the use of code generators in order to create libraries for interoperability.  This closely ties the generated libraries to the interface created.

In-database ORM and JSON (In Javascript!)


 One fascinating approach I came across recently, but have very little experience with, is xTuple's in-database ORM which is largely written in pl/v8js stored procedures. Yes, you got that right, the stored procedures are written in Javascript.  I would invite people to check out the code and see what they think.  This is a fascinating approach and not one I have played around with yet but it definitely shows how far the encapsulation can be made to happen within PostgreSQL.

Conclusions


Encapsulating an application inside the database is not something which one must go to NoSQL to do.  RDBMS's which are strongly programmable are capable of doing this now, although perhaps few if any rival the flexibility in this area of PostgreSQL.  The RDBMS can then be an 'information model server' which serves out information models as requested, each of which encapsulates further data within it.  The data model can then be consumed and expanded in the Model of an MVC framework, but the impedance mismatch issues can largely be eliminated by understanding and utilizing separation of concerns to one's advantage.

 Of course none of this is to disparage NoSQL.  These products have been successfully used quite often as adjuncts to traditional RDBMS's, either preprocessing or post-processing data for later use or re-use.  Some users of polyglot storage models have found rapid increases in development pace when these are used together, with data often being piped from something like Mongo or Neo4j into PostgreSQL after essentially using these engines to transform the data, or using it to transform the data on output.   This is particularly true when processing large amounts of data which is to be processed in relatively non-relational ways, such as with graph databases, array-native databases (in the sciences) etc.  The combination between these approaches becomes very rich.

 But it does suggest that the days of utilizing the RDBMS  as a dumb data store for a given application are numbered in many cases.  I don't see anything challenging the primacy of the RDBMS environment, but at the same time, that doesn't mean no role for the other ones as well.  If you want a dumb persistence store, you can go with a NoSQL solution and be happier.  However, the overall approach of avoiding the mismatch by encapsulating data storage inside of an application is equally applicable to the RDBMS environment as any other.

6 comments:

  1. Can I ask you to split your posts into head (what's visible on index page, in rss feed) and rest (what you see on blog post page)?

    It would make rss management easier, and you wouldn't have to download whole blog post in rss feed.

    ReplyDelete
    Replies
    1. I will have to look into how to do this in Blogger. Thanks though. I will try.

      Delete
  2. Thank you for the insightful and informative article.

    I think a major area in which the approach you advocate is applicable is internal business applications (where I've been applying a less well articulated version of what you describe here for years). In these cases--which represent a massive percentage of all software development--many of the downsides to the approach you describe are irrelevant:

    * Internal applications do not need to be widely portable and distributable by definition.

    * Core business data must be accessed by various unrelated functions in often dramatically different ways. Putting as much data model logic close to the data as possible means not having to duplicate data model logic in every view that touches core data.

    In essence, where application developers focus on the data store as an aspect of their application, businesses view applications as tools to access and manipulate their business data. Enhancing the functionality and capability of the data store has much more benefit from the latter perspective.

    ReplyDelete
  3. "Application developers often want to ensure that all access goes through their databases and thus do not want to centralize application logic in the database"

    The wording and your past remarks makes me believe you meant to say "...all access goes through their application..." instead of "database".

    ReplyDelete
  4. Chris, can you send me your email address? Can't seem to find it anywhere. joe@tanga.com

    Love these articles, btw.

    Thanks,
    Joe

    ReplyDelete
  5. Would like to add a note here that Chris' interest in our project has prompted us to work towards getting the xTuple ORM system up on pgxn. The first step to achieve that was to separate our ORM components from our specific application logic. You can now find the ORM-only repository here:
    https://github.com/xtuple/orm

    The licensing has also been changed to be compatible with pgxn.

    More to come...

    John Rogelstad
    Dir. of Product Development xTuple

    ReplyDelete