Monday, October 15, 2012

Object/Relational Algebra 2: Function Properties

Previously I introduced the idea that functional notation allows relational algebra to play well with other mathematical disciplines, and therefore addresses, at least in theory, the limits of relational algebra's expressiveness.  In this second part I will discuss the basic properties of functions as I see them in this approach; the next, and final, posting in this series will cover the special series join function, which provides a way to address transitive connections between tuples in a relvar.

Definition and Types of Functions


The general approach of this object-relational algebra involves applying algebra to manipulate operations on functions of sets of tuples.  A set of tuples is a relvar, or relational variable.  The set of tuples in the relation is the domain of the function, and as with all functions, there is exactly one result per domain element, so for every tuple, a single value will result.  That value can of course be a set (including the empty set), a tuple, or even a relvar.  However, a function must return the same basic type of result for every tuple.  We can't return a simple value for one, a relation for another, and a set of values for a third.

Each element in a function can be seen as being relational in the sense of within the expressive boundaries of relational algebra, or non-relational in the sense of being outside those boundaries.  Functions themselves can then be divided into three categories:
  1. Relational functions contain only relational elements and can be expressed solely in relational algebra.
  2. Non-relational functions contain only non-relational elements and cannot be further simplified in relational algebra in any circumstance, and
  3. Semi-relational functions contain both relational and non-relational elements, and these can be partly further simplified in relational operations.
All types of functions have some properties in common.  If a function is a function of a relation, then it is necessarily also a function of every candidate key in that relation, and for any well-formed data that could be added, it will continue to return a single value of consistent type.

A second level of categorization concerns whether the function's domain is fully mathematically self-contained or not.

For example, suppose we have a relvar of employee data, E, and E contains an attribute dob, representing the employee's date of birth.   Therefore, for every tuple in E, there is a dob element.   We can then have two functions:

age(E)  represents the current age of the employee at the time it is evaluated.

age2000(E) represents the age of the employee on New Year's Eve, 2000.

age2000(E) has a fully self-contained domain.  The values depend on the values stored in the relvar E, and nowhere else.  age(E), however, does not have a fully self-contained domain.  For any given relational operation, age(E) will behave like a function and we can use it as such, but the results will change over time.  Oracle calls these deterministic and non-deterministic functions respectively.  PostgreSQL divides these up into additional categories for planning purposes: in addition to immutable functions, whose output never changes for a given input, you have stable and volatile functions, the latter not really being functions of their input at all.
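To make the distinction concrete, here is a minimal PostgreSQL sketch, assuming an employee table with a dob column of type date (the table and function bodies are only plausible illustrations; the volatility labels are the point):

CREATE TABLE employee (id int PRIMARY KEY, name text, dob date);

-- fully self-contained domain: the result depends only on stored values
CREATE FUNCTION age2000(e employee) RETURNS interval
LANGUAGE sql IMMUTABLE AS
$$ SELECT age('2000-12-31'::timestamp, $1.dob::timestamp) $$;

-- not self-contained: the result drifts as the clock moves, so at most STABLE
CREATE FUNCTION age(e employee) RETURNS interval
LANGUAGE sql STABLE AS
$$ SELECT age($1.dob::timestamp) $$;  -- one-argument age() measures from current_date

A function that consulted something like random() or a sequence would fall into PostgreSQL's third category, volatile.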

Properties of Relational Functions


I didn't really start getting excited about this until I started working with relational functions.  Once I started though, there was no going back.  Relational functions can be expressed in relational algebra and therefore map roughly to subqueries in SQL.  Of course this is not exact, but there are cases where it is helpful to look at them from this perspective.  This is particularly important when looking at optimizing simple functions written in SQL when called in where clauses.

Relational functions can be expanded algebraically in other relational operations.  For example:

Suppose we have a relvar L which represents landmarks, and has an attribute country_id which is a foreign key to country, and suppose further we have a relvar C which represents countries and has an attribute called name which represents the country's name.  We can then define the following functions:

let country(L) = σid=L.country_id(C)

let frenchl(L) = σ(country(L)).name='France'(L)

country(L) can be in-lined into frenchl(L), and this can be transformed into a subselect, and eventually (in a set- rather than bag-based approach at least) a left join.
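In SQL terms the same expansion looks roughly like the following sketch, assuming landmark and country tables with the obvious columns:

-- frenchl(L) with country(L) left opaque: a correlated subselect
SELECT l.*
  FROM landmark l
 WHERE (SELECT c.name FROM country c WHERE c.id = l.country_id) = 'France';

-- after in-lining country(L), the same query as a join
SELECT l.*
  FROM landmark l
  LEFT JOIN country c ON c.id = l.country_id
 WHERE c.name = 'France';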

The set of single-value-returning relational functions is a proper superset of the set of functional dependencies in the database reachable through appropriate join dependencies.  Without grouping and aggregation, these sets are identical.  With grouping and aggregation, however, relational functions express a larger array of possible values.

 

Properties of Non-Relational Functions


Non-relational functions cannot be expressed in relational algebra and therefore must be solved outside the stream of relational operations themselves.  Non-relational functions must be treated according to their return types.  A set-returning function can be treated as a set, a single-value-returning function can be treated like a relvar's attribute, and a relvar-returning function can be treated like a relvar itself.

This field moves relational algebra from the area of solving relational queries into the area of symbolic simplification.

Next:  The special "Join Series" function.

Thursday, October 11, 2012

Object-Relational Algebra 1: Definitions and Assumptions

 Introduction

This post begins a series to try to establish a mathematical basis for object-relational database systems.  The idea here is to extend relational math in order to cover object-relational database operations rather than to model current databases mathematically.  I therefore want to start with my assumptions and definitions.

For simplicity's sake I will follow Codd's assumptions, namely that each row is unique, that row ordering is not significant, and that within each row the columns are significant.  This establishes a definition of a relation as a set of tuples.  Everything here is then an extension of Codd's math, rather than a replacement for it.  In the programming world, boolean operations may mean something slightly different, or bags may be used instead of sets, but the goal here is not to model databases but to model data, and Codd's model is more elegant in this regard.  It is the real world that gets messy.

To Codd's assumptions I will add my own, namely that the value of any attribute is opaque to relational operations but may act as the domain for a function.  This becomes important when I get to the special Σ function and the functional dependencies on its output.  The special function in my system, Σ(R) (join series), with a θ condition gives a way to represent self-joins of arbitrary depth in relational math, and combined with functional dependencies of its output it gives a way to express queries of transitive binary closure, but that is not its most interesting use, at least to me.

A second important point here is that I will diverge from some previous efforts in that I will not throw in "magic operators" designed to solve corner cases.  The goal here is to build a more expressive system that can solve more problems.  Therefore an operator that determines transitive binary closure is ruled out, and if we can't solve a problem without a magic operator, then this will be left for others.  Additionally this effort is intended to be minimalist and allow many different behaviors to be encapsulated through different approaches.  For this reason I will build directly on the work of Codd and more or less ignore other approaches that various researchers have taken.

Finally this is somewhat to be distinguished from the way SQL does things.  SQL itself is a messy (from a math standpoint) application of both relational algebra and predicate calculus, using bags instead of sets.

One important problem is choosing a name for this extension.  The term object in computer science tends to be a bit abstract and not really a good description of what is going on here.  A better description might be functional-relational algebra.  I call it object-relational algebra only because it fits in with object-relational databases.

In the posts that follow, relation and function will be given their ordinary mathematical meanings.  However, the relationship between the two is not, to my knowledge, well defined.

What is an Object Anyway?

In computer science terms, an object is a data structure (here represented as a tuple) which has imperative behavior attached to it.  Object-oriented programming is thus essentially an extension of structured programming where behavior follows structure.  Classes define both structure and behavior, and objects instantiate and apply these rules.  Applying a model such as this to an algebra of data is somewhat problematic, because algebra is about derivation of values, not about behavior.

What we are doing here is similar, and yet different.  I haven't fleshed out a theory of inheritance vs composition (PostgreSQL supports multiple inheritance which can be used to implement either inheritance or composition, strictly speaking), and collection tables can be used to implement composition in various databases including PostgreSQL.  A theory of composition and inheritance is left to others.  From the standpoint of my model these are equivalent.

Instead of behavior, I use the same basic approach of tying structure to function in order to expand the reach of the relational model.  In computer science terms, a class is then roughly similar to a relation, and an object roughly similar to a tuple.  However, instead of behavior, the typical focus is on derived information.  Thus functions become important in a way in which they have not typically been used in relational math.

Functions of a Relation

So here, f(R) is a function of relvar R if, and only if, for every possible tuple in R, f(R) returns one distinct result (that result could however be a tuple or even a set).    Functional dependencies stored in the database can then be modeled as functions, but so can functional dependencies which are not stored.

The following, then, is not a function: f(R) = x^0.5, because it resolves to two distinct values for every nonzero value of attribute x in relation R (one positive and one negative).  However, the following is a function: f(R) = (abs(x^0.5), -1 * abs(x^0.5)), because it resolves to a tuple containing both possible answers from the previous, at least if x is always positive or imaginary numbers are allowed.  It could return a set instead and that would be acceptable; however, in this case the structure of the result is also set by the function, and the above function is distinct from g(R) = { abs(x^0.5), -1 * abs(x^0.5) } because the structure of the result is different.

In standard relational algebra, a tuple is finitely expressive: one can only express a finite set of values from a single tuple.  However, for any given tuple, an infinite number of functions are possible, and thus when combined with functions, a tuple becomes infinitely expressive.  Not only can all functional dependencies of the tuple's superkey be expressed as a function of the tuple, but any transformation of the tuple's values, or of the values of its functional dependencies, can be expressed as such as well.

A functional approach also allows us to dispense with the rename operation in relational algebra, since renamed relations can be seen as relation-returning functions.

Kinds of Functions

In my model, functions can be divided up into the following categories:
  1. Relational Functions can be expressed solely in relational algebra.
  2. Non-relational functions possess no non-trivial relational algebra reduction.  x^2 is non-relational.
  3. Complex functions have non-trivial relational reductions, but non-relational components too.
Assuming sufficient join dependencies in a database, every functional dependency can be expressed through relational functions.  Moreover trivial relational dependencies can always be expressed by relational functions, and indeed by mere projection operations.   We can then define a trivial relational function as one which reduces solely to project operations off information stored in the tuple.
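As a hedged illustration in PostgreSQL terms (the table and function names are invented), a trivial relational function is nothing more than a projection over the tuple, while the x^2 example above is non-relational:

CREATE TABLE t (x int PRIMARY KEY, y int);

-- trivial relational function: reduces to a bare projection of a stored attribute
CREATE FUNCTION get_x(r t) RETURNS int
LANGUAGE sql IMMUTABLE AS $$ SELECT $1.x $$;

-- non-relational function: processes the stored value itself
CREATE FUNCTION x_squared(r t) RETURNS int
LANGUAGE sql IMMUTABLE AS $$ SELECT $1.x * $1.x $$;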

Result

The resulting system essentially creates something like an object-oriented database but one which is fundamentally relational, and in which objects behave differently than they do with mere object-oriented persistence.   While each tuple is infinitely expressive, this is possible only because of a distinction between primary (Codd's Algebra) stored data and secondary (calculated) answers.  This extension however allows any kind of mathematics (or other logic) to be tied into relational algebra.   This allows relational algebra to be used along with many other kinds of disciplines to build extremely sophisticated data models.

Forthcoming:
1: Exploring properties of relational, non-relational and complex functions
2: The special function  Σ(R) representing the output of a simple series join.

NoSQL-like Approaches to PostgreSQL, a reply to Martin Fowler

The major point of many of Martin Fowler's blog posts, repeated in his book, is that NoSQL represents a movement away from integrating on the database and towards encapsulating databases in applications and using services to integrate data.  However, the fact that this isn't how higher-end RDBMS's are always used speaks more to the market than to the RDBMS technology itself.  For the last five years I have been building relational databases along a similar approach, using web services as an inspiration for interfaces.  This post discusses a few different approaches to doing this on PostgreSQL.  Note that other RDBMS's may differ in many of these regards.

The Promise of the Relational Database Model


Relational databases are designed to store information in application neutral formats and then present it in various ways to various applications.  The goal is, as Fowler puts it, the "integration database."  In order to facilitate this goal in a maintainable way, most RDBMS's come with solid sets of features designed to encapsulate physical storage behind application-specific presentation of the data stored.  These include views, functions, and stored procedures.  These features allow a database engineer to build applications which encapsulate the physical storage, so that the database management system essentially serves out an information model through a myriad of client-specific interfaces.  This achieves what Fowler is advocating, namely encapsulating the database in the application, but it allows for declarative access to that API.   At least in theory, there is no difference between what Fowler advocates for NoSQL and what many DBA's do with an RDBMS.  This also suggests that Fowler is on to something important, but that ditching the RDBMS may in many cases not be the answer.

Why the RDBMS Rarely Delivers on That Promise


For all the theory though, most RDBMS's are rarely used in this way.  I think there are several factors that typically weigh against such encapsulation:

  • Investment (expertise and license costs) in a specific RDBMS by IT departments means that application developers want to write portable code, and thus write lowest common denominator code
  • Application developers often want to ensure that all access goes through their databases and thus do not want to centralize application logic in the database, and
  • Many of the standards ensure that tools are not really optimal for this task.  The best tools are those which go beyond the standards, and thus carry greater implications for portability.
The first market factor has to do with the fact that relational database management systems are expensive systems in several senses.  In addition to the (often hefty) licensing costs, you have the necessity to hire people with experience and/or train them.  Companies usually address this cost by consolidating their expertise and servers on a single RDBMS.  Therefore you see Oracle shops, SQL Server shops, DB2 shops, etc.  This leaves lock-in as a major cost for businesses, and it reduces the market for database-specific applications.  Application developers therefore, quite rationally, more often choose to write SQL code that runs on any RDBMS.  This naturally requires eschewing advanced features, like Oracle Objects or pl/pgsql functions, in favor of vanilla SQL.

The problem of course is that it is hard to encapsulate data when you are limited to the most basic feature set.  While views may work, making them updatable works differently, and has different consequences, on every RDBMS.  Stored procedures or functions are far worse in this regard.  Consequently the need to write portable code requires essentially dispensing with the very tools used to encapsulate data.  For this reason I don't think you can both write a database for use by multiple applications (i.e. where the internal data structures are encapsulated) and also write a database that runs on multiple RDBMS's.  You have to choose.  Developers cannot be blamed for choosing the option which gives their applications the most market and lock-in.

The competing desire for application lock-in is another factor.  RDBMS vendors typically want to restrict access to a certain number of client access licenses or seat licenses, and if connections can be pooled in middleware this can help circumvent some of this limitation (it may cause contractual issues depending on the nature of the EULA, but the technical controls are easily worked around in this manner, and at some point you run into problems with definitions, particularly when systems are loosely coupled).  Application developers want to sell their own licenses, and this can only be done if the connections are checked in the application layer.  Therefore it is against the interests of many application developers to ship with encapsulated database schemas.  Instead, the RDBMS is used largely as a private data store with some additional reporting capabilities.

Some RDBMS vendors actually optimize their systems for the above needs.  One feature SQLAnywhere offers is that developers like Intuit can lock a database to a specific application, banning all third party access, and a large part of MySQL's current popularity may be tied to the fact that it is well-optimized for moving unencapsulated databases that run on other DB's to it.  In particular the sql_mode setting on one hand facilitates porting code to MySQL and on the other makes it relatively unsafe for this sort of encapsulation.

However, market factors aren't the only ones pushing developers away from building databases in this manner.  The perpetual debate over stored procedures illustrates a mismatch between at least one commonly used tool and the typical use case of that tool.

Stored procedures, as the name implies, are essentially imperative constructs which take specific arguments and return records.  The idea is that they provide a basic imperative interface language to a database.  Instead of SELECT ... you issue CALL some_proc(arg1, ...);

There are several problems, however, with stored procedures as they are typically used.  The first is that they still have a significant impedance mismatch with object-oriented programming.  The data structures they return tend to be fairly rigid, so adding a new column tends to require multiple changes in code, often at least one in each layer of the program.  You can't just define your object once and then re-use it over and over in other layers of your code.

A second significant problem is that stored procedures are at their best when they are a few well-connected and modularized queries, and at their worst when they are many different queries tied together in complex ways.  This leads to limited utility in actually encapsulating the data, and in all cases the abstraction layer is not really a good match for what is typically done with it.  For these reasons stored procedures as typically used make the most sense when working outside the object-oriented approach.

Stored procedures have been replaced by object-relational mapping tools (ORMs), which attempt to provide a mapping between a relational interface and an object-oriented development environment.  ORMs automate basic database operations for things like insert, select, update, and delete operations, but they don't really provide an abstraction regarding the actual data mapping between the behavioral app layer and the information model layer.  This can currently only be done in the information model itself, so ORMs are best paired with updatable views, but this comes at the cost of portable SQL code.
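A sketch of that pairing, with invented names: the ORM maps against a view, and the view (made writable here with an INSTEAD OF trigger, as PostgreSQL of this era requires) encapsulates the storage underneath.

CREATE TABLE person (id serial PRIMARY KEY, given_name text, family_name text);

-- the presentation the ORM maps against; storage is free to change underneath
CREATE VIEW app_person AS
SELECT id, given_name, family_name,
       given_name || ' ' || family_name AS display_name
  FROM person;

-- writes against the view are routed to the base table
CREATE FUNCTION app_person_ins() RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
    INSERT INTO person (given_name, family_name)
    VALUES (NEW.given_name, NEW.family_name);
    RETURN NEW;
END $$;

CREATE TRIGGER app_person_ins INSTEAD OF INSERT ON app_person
FOR EACH ROW EXECUTE PROCEDURE app_person_ins();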

Aside from these approaches, or moving to NoSQL, there are a variety of methods to encapsulate the data store inside a relational application.  These approaches require understanding both the desire for encapsulation and interfaces, and the desire to integrate with applications as a service rather than as a relatively simple persistence layer manipulated mathematically.

Service Oriented Database Architecture

For most of the last five years, I have been building LedgerSMB using an approach I call "Service Oriented Database Architecture," or SODA, which is inspired in part by RESTful web services and SOAP.  From SOAP I took the emphasis on discoverability, and from REST I took, to the extent possible, the goal of re-using everything in the database that can be re-used in order to define an API.  This approach thus uses database semantics the way REST re-uses HTTP semantics, and while there are some differences forced by the way PostgreSQL does things (every function is called by a SELECT statement), this is not the end of the world.  The goal, of course, is to build database interfaces suitable for loosely coupled application/database combinations.

 The SODA approach is based on a number of principles, namely that:

  • Access to the database is through functions, not relations,
  • Functions, to the extent possible, always return a useful result, usually in a data structure corresponding to an object,
  • Function names (within the domain of this architecture) are unique,
  • Function argument names correspond to the properties expected, and
  • The database is responsible for its own security.
 If these are followed then the functions can be mapped, discovered, and run at run-time.  Here is a PHP class that implements such a run-time mapping:

 class DBObject
{
    protected $schema = 'public';
   
   /* function __call($procname, $order = null)
    * Maps in object properties into an arg array and calls call_procedure
    *
    * db procedures are checked for argument names and these are stripped of 
    * the "in_" prefix.  After this is complete, a property is matched and
    * mapped in.
    */
    public function __call($procname, $order = null){
        # variable initializations
        $procargs = array();

        # proc data lookup
        $procquery = "
            SELECT proname, pronargs, proargnames, proargtypes FROM pg_proc 
             WHERE proname = $1
               AND pronamespace = 
                   coalesce((SELECT oid FROM pg_namespace WHERE nspname = $2), 
                        pronamespace)";
         $db = DB::getObject();
         $sth = pg_query_params($db->dbhandle, $procquery, 
                               array($procname, $this->schema));
         $procdata = pg_fetch_assoc($sth);

         if (0 == pg_num_rows($sth)){
            throw new \exception('Function not found');
         }
         # building argument list
         preg_match('/^{(.*)}$/', $procdata['proargnames'], $matches);
         $procargnames = $phpArr = str_getcsv($matches[1]);
         foreach ($procargnames as $argname){
              $argname = preg_replace('/^in_/', '', $argname);
              array_push($procargs, $this->$argname);
         }

         # calling call_procedure
         return $this->call_procedure($procname, $procargs, $order);
    }
    /* function call_procedure($procname, $args = array(), $order = null)
     *
     * generates a query in the form of:
     * select * from $procname($1, $2, etc) ORDER BY colname
     * and runs it.  It returns an array of associative arrays representing the
     * output.
     */
    public function call_procedure($procname, $args = array(), $order = null){
         $results = array();
         # query generation
         $query = "select * from "
                       . pg_escape_identifier($this->schema) . "." 
                       . pg_escape_identifier($procname) . "(";
         $count = 1;
         $first = 1;
         foreach ($args as $arg){
             if (!$first){
                 $query .= ", ";
             }
             $query .= '$' . $count;
             $first = 0;
             ++ $count;
         }
         $query .= ')';
         if ($order){
            $query .= " ORDER BY " . pg_escape_identifier($order);
         }


         # execution and returning results
         $db = DB::getObject();
         $sth = pg_query_params($db->dbhandle, $query, $args);
         if (!$sth){
             return null;
         }
         for ($i = 0; $i < pg_num_rows($sth); $i++){
              print "retrieving row $i \n";
              array_push($results, pg_fetch_assoc($sth, $i));
         }
         return $results;
    }
    /* function merge(array $data)
     * merges data into the current object from an associative array
     * 
     * null or undef values are not set
     */
    public function merge($data){
        foreach ($this as $prop => $value){
             if (array_key_exists($prop, $data) and null != $data[$prop]){
                 $this->$prop = $data[$prop];
             }
        }
    }
    /* function is_allowed_role($role)
     * returns true if the user is allowed the role for the specific db 
     * i.e. $role should not include the prefix.  Otherwise it returns false
     */
    public function is_allowed_role($role){
        $db = DB::getObject();
        return $db->is_allowed_role($role);
    }
}  


The above code seems long, but what it essentially allows is for inheriting objects simply to declare that methods are mapped to stored procedures, with these mappings adjusted automatically at the time the stored procedure is actually called.  Additionally this centralizes essentially all db access in a single file where it can be audited for SQL injection issues and the like, and you can go on programming as if you are hitting an object-oriented database.  Of course there are times when you need to make modifications on many layers, such as when a new attribute needs to be added and stored and it isn't in the table yet, but generally these are relatively rare.

In PHP, I can have a class which checks the version and selects the appropriate stored procedure easily even if they expect different object properties as arguments:




public function save(){
     $procname = 'company__save';
     if ('1.3' == \LedgerSMB\Config\DBVERSION){
         $procname = 'company_save';
     }
     $data = array_pop($this->$procname());
     $this->merge($data);
}



 What might a stored procedure look like?  Here is one:


CREATE OR REPLACE FUNCTION asset_dep_straight_line_yr_d
(in_asset_ids int[],  in_report_date date, in_report_id int)
RETURNS bool AS
$$
     INSERT INTO asset_report_line (asset_id, report_id, amount, department_id,
                                   warehouse_id)
     SELECT ai.id, $3,
            asset_dep__straight_line_base(
                  ai.usable_life, -- years
                  ai.usable_life -
                  get_fractional_year(coalesce(max(report_date),
                                         start_depreciation,
                                         purchase_date),
                                       coalesce(start_depreciation,
                                         purchase_date)),
                  get_fractional_year(coalesce(max(report_date),
                                         start_depreciation,
                                         purchase_date),
                                $2),
                  purchase_value - salvage_value,
                  coalesce(sum(l.amount), 0)),
            ai.department_id, ai.location_id
       FROM asset_item ai
  LEFT JOIN asset_report_line l ON (l.asset_id = ai.id)
  LEFT JOIN asset_report r ON (l.report_id = r.id)
      WHERE ai.id = ANY($1)
   GROUP BY ai.id, ai.start_depreciation, ai.purchase_date, ai.purchase_value,
            ai.salvage_value, ai.department_id, ai.location_id, ai.usable_life;

    UPDATE asset_report SET report_class = 1 WHERE id = $3;

    select true;
$$ language sql; 

Unfortunately the above has to return true because the nature of the operation does not really provide another effective approach; if we find one, it will be adjusted in the following major version upgrade.

This approach is generally nice because it is light-weight and adapts relatively well to rapidly changing environments.  However, the lack of imposed structure may be a problem in some environments as well.  Where more engineering is required, the other approaches may work better.  This works relatively well, however, if you build your API to assume a relatively loose coupling between your database and the application hitting this sort of API.

Object-Oriented Database-level interface


Where tighter coupling is required, an object-oriented interface may be better.  In some ways this is worth avoiding because it leads to very ugly SQL queries, for example:


SELECT (save).*
FROM save(row(null, '12345', 'My Company, LTD', 232, '33-55432334')::company);


The overall issue here is that you have the possibility of multiple levels of discoverability involved.  It works very well for code generators, but not so well for the human masters.  Note the above could be rewritten, assuming no additional arguments, as:

 SELECT (save).*
   FROM (row(null, '12345', 'My Company, LTD', 232, '33-55432334')::company).save;


The advantage to this approach is that your objects form classes whose structure is discoverable, and overloading becomes possible.  Code generators thus can work well, as the database contains all information needed to create all boilerplate code.  The database code itself is simplified as well.  On the other hand, troubleshooting can be a bit of a pain.  It also has the property of essentially requiring the use of code generators in order to create libraries for interoperability.  This closely ties the generated libraries to the interface created.

In-database ORM and JSON (In Javascript!)


One fascinating approach I came across recently, but have very little experience with, is xTuple's in-database ORM, which is largely written in pl/v8js stored procedures.  Yes, you got that right: the stored procedures are written in Javascript.  I would invite people to check out the code and see what they think.  This is a fascinating approach and not one I have played around with yet, but it definitely shows how far the encapsulation can be made to happen within PostgreSQL.

Conclusions


Encapsulating an application inside the database is not something which one must go to NoSQL to do.  RDBMS's which are strongly programmable are capable of doing this now, although perhaps few if any rival the flexibility in this area of PostgreSQL.  The RDBMS can then be an 'information model server' which serves out information models as requested, each of which encapsulates further data within it.  The data model can then be consumed and expanded in the Model of an MVC framework, but the impedance mismatch issues can largely be eliminated by understanding and utilizing separation of concerns to one's advantage.

 Of course none of this is to disparage NoSQL.  These products have been successfully used quite often as adjuncts to traditional RDBMS's, either preprocessing or post-processing data for later use or re-use.  Some users of polyglot storage models have found rapid increases in development pace when these are used together, with data often being piped from something like Mongo or Neo4j into PostgreSQL after essentially using these engines to transform the data, or using it to transform the data on output.   This is particularly true when processing large amounts of data which is to be processed in relatively non-relational ways, such as with graph databases, array-native databases (in the sciences) etc.  The combination between these approaches becomes very rich.

But it does suggest that the days of utilizing the RDBMS as a dumb data store for a given application are numbered in many cases.  I don't see anything challenging the primacy of the RDBMS environment, but at the same time, that doesn't mean there is no role for the other approaches as well.  If you want a dumb persistence store, you can go with a NoSQL solution and be happier.  However, the overall approach of avoiding the mismatch by encapsulating data storage inside of an application is equally applicable to the RDBMS environment as to any other.

Monday, October 8, 2012

Three Approaches to Object-Relational Databases: PostgreSQL, Oracle, and Informix

PostgreSQL vs Informix

Probably the closest database object-relationally to Postgres is Informix.  Informix in fact got its object-relational approach with the purchase of Illustra, a Postgres fork.  Illustra however split from Postgres before the latter adopted SQL, and so the SQL implementations are entirely independent.

Informix has perhaps had the most influence of any database software on how the industry sees object-relational databases, and so because of the shared heritage and Informix's place in the industry, it is the first comparison to make.

Table Methods

I was not able to find information on class.method notation in the Informix documentation.  As far as I can tell, Informix requires methods to be called using function(object) syntax.  This is along the lines of Stonebraker's image processing example.  In this way the connection between structure and function feels somewhat more bolted on than it does in PostgreSQL.  However, it would be a mistake to see this as the entirety of Informix's object-relational capabilities.

Inheritance

Informix supports single inheritance for both types and tables using the UNDER supertype syntax.  UNDER, similar to INHERITS in PostgreSQL, establishes an "is-a" relationship between the supertype and the subtype.  Unlike in PostgreSQL, indexes and referential integrity are inherited in Informix, meaning that foreign keys work properly in both directions.  Single inheritance is thus quite a bit more mature on Informix than on PostgreSQL, but the lack of multiple inheritance prevents composition of tables by in-lining type attributes (which is possible on PostgreSQL but not on Informix).

This shows that Informix has a different approach to table inheritance, namely that there is a simple use case which it supports very well and has polished, while more complex use cases are beyond it.  In PostgreSQL, on the other hand, declarative referential integrity doesn't work against inheritance trees, and thus referential integrity requires writing your own constraint triggers.
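A short PostgreSQL sketch of that limitation (table names invented): the child inherits columns, but a foreign key pointing at the parent only sees rows physically stored in the parent.

CREATE TABLE person   (id int PRIMARY KEY, name text NOT NULL);
CREATE TABLE employee (salary numeric) INHERITS (person);

CREATE TABLE badge (
    id int PRIMARY KEY,
    person_id int REFERENCES person (id)  -- checked against person only, not its children
);

INSERT INTO employee VALUES (1, 'A. Smith', 50000);
INSERT INTO badge VALUES (1, 1);  -- fails: the row lives in employee, so the key is not found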

Return Results

A select * from parent_table in Informix returns all attributes of all rows from the parent table and all descendant tables.  Where child tables add additional columns, this can lead to a situation where the result set is "jagged" (i.e. where the rows have different numbers of elements).  In this case, it is necessary to check the row definition when receiving each row.

One way to look at it is that both PostgreSQL and Informix return a set of objects, but PostgreSQL coerces them into the types that are asked for, while Informix returns them as they are.  Thus in Informix, if you select * from person, and employee inherits person, then you automatically get all the employee information as well.

This feature is as far as I know, unique to Informix.  I know of no other ORDBMS that allows a result set to be jagged in this way, but it shows one important problem that happens when one starts merging object-oriented and relational approaches.  What does select * mean in this case?  It isn't self-evident, and therefore Informix and PostgreSQL take different approaches.
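To make the PostgreSQL side of that difference concrete, here is a small sketch reusing the person and employee tables from the inheritance example above: child rows come back coerced to the parent's column list, though tableoid can still show where each row is stored.

INSERT INTO person VALUES (2, 'B. Jones');

SELECT * FROM person;                                 -- both rows, but only id and name
SELECT *, tableoid::regclass AS source FROM person;   -- reveals the originating table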

Alternative Approaches

In the absence of a system of composing tables by in-lining types, the way multiple inheritance works in PostgreSQL, table composition requires using row types as columns in Informix.  The use of complex types in this way is much more mature in Informix than it is in PostgreSQL (which only began to support it very recently).

Consequently composition is done primarily through member types, but this evokes an important tradeoff between ease of querying in a relational sense and rich modelling.

I personally find the SQL required to make elegant use of columns as types in this way somewhat ugly, but I recognize this is a matter of personal preference.  Still, consider the following:

SELECT print(e.name), mailing_label(e.address) from employees e;

If address and name are both complex types then whatever we are going to do is going to be a bit ugly.  There isn't much we can do about that.

PostgreSQL and Informix vs Oracle

Oracle in some ways shows some influence from Informix but it takes the approach in a somewhat different direction.  Oracle makes no bones about being a relational database first and foremost and has adopted an approach which avoids, rather than answers, questions of proper behavior.  Oracle Objects in some ways resemble Informix's approach and in some ways PostgreSQL's, but they are in general more different than the same.

Tables and Types 

Oracle Objects approaches the question of object-to-relation equivalence by allowing types to be inherited, while tables cannot be.  Tables can, however, copy type structures for their schemas.  Thus one generally has to create an object model first and then create tables to store those objects.  This approach sets up very different patterns and antipatterns than one would get in Informix and PostgreSQL, where tables themselves can inherit.  On one hand this separates (forcibly) data-holding tables and their abstract parents.  On the other, it makes the object model harder to work with, except where complex types are being used in columns of a table.

It is worth noting that types support methods in Oracle and this simplifies things greatly.  However, I am left wondering why one would use Oracle objects instead of virtual columns and just treat Oracle as a form of relational database management system with little use of object extensions.

Object Composition Approaches

In Oracle the only approach that works is to use columns to store complex types.  Fortunately those types can have methods, so the SQL is not as ugly as it would be on Informix.  You can:

select e.name.print(), e.address.mailing_label() from employees e;

This strikes me as a bit more readable.

It seems to me that Oracle Objects have two main uses.  The first is in the logical model of the database although this role can be taken over by ORMs to some extent.  The second and more attractive approach is to use Oracle Objects not for their advertised use of modelling of things like customers or invoices but rather to create intelligent datatypes for columns.

For example, if I want to store images and dynamically determine what type of image we are looking for, I could still do something like:

SELECT id FROM slides s
 WHERE s.picture.is_of('sunset');

This is far cleaner SQL-wise than the equivalent syntax in Informix or PostgreSQL:

SELECT id FROM slides s
 WHERE is_of(s.picture, 'sunset'); 

This allows the same sorts of problems to be addressed as Stonebraker talks about, and it allows structure and logic to be clearly tied together, at the expense of substitutability as is found in table inheritance in both Informix and PostgreSQL.

The complexity costs are likely to be worth it in Oracle in a very different set of cases than would be the case in PostgreSQL or Informix.  These cases intersect at queries which must return data based on very complex criteria which cannot be reduced to relational algebra and predicate calculus.

Final Thoughts

The above discusses three different approaches to the question of how to encapsulate complex data into a database which may need to query based on arbitrary criteria not readily reducible to relational algebra and predicate calculus.

PostgreSQL and Informix both tightly integrate object handling far more deeply than Oracle.  Informix seems more polished in the areas it supports, but PostgreSQL has a broader support for the overall idea.

Oracle's approach is essentially to move object handling capabilities to the column level for the most part.  Yes, you can create tables of objects, but you cannot inherit them directly and you cannot compose your object model using multiple parents without moving everything to the column level.  This makes object behavior mostly useful in the creation of "smart column types."

Each of these three ORDBMS's takes its own approach.  All three allow the development of very intelligent database models.  All three pose different complexity issues.

Wednesday, October 3, 2012

Faq: Why is LedgerSMB PostgreSQL-only?

We get asked a lot, why is LedgerSMB PostgreSQL-only?  Why not run on MySQL?  After all, since 5.0, MySQL has offered relatively decent type constraints, and offers a database that works sufficiently well to build ACID-compliant applications, and so forth.  After all, this line of reasoning goes, we'd get more market share by being able to run on a larger number of hosting providers.

This post discusses why we made the decision to go with Postgres and decline to support any other databases.  I still believe it has been the correct decision for us but it might not be the right decision for other projects.  I hope though that writing this up will help other projects weigh their options and choose appropriately.

The short version is that we decided to favor an application with many avenues for integration over the idea of an application that would run on any platform.  We chose the path of moving towards giving our users as rich an experience as possible, and as rich opportunities as possible in integrating this software with their business over the idea of having as many customers as possible.  If the software works wonders, people will make room for it.

Our Goal

In our case, our goal has generally been to be a major part of the business infrastructure of our users.  We want to create a flexible framework which supports extension and repurposing, allowing businesses to find new uses for what we are doing.

Additionally early on our decision was shaped by the desire to hook LedgerSMB into reporting tools written in other languages, and we generalized this to tools generally.  This has been a relatively safe bet, and with additional compatibility classes auto-detection of DB-level API's is possible.

Finally we started as a fork of a PostgreSQL-only application and so the choice was whether to deepen our commitment to PostgreSQL only or whether to move towards portability.

Considerations

Our overall approach  was based on the idea that financial and ERP systems are best when they are open to integration, and that ERP frameworks need to be open to extension as well.  After all, ERP software typically runs a wide variety of functions in a business, and these are often connected to yet more workflows where custom apps may be developed.  If those apps can be developed against LedgerSMB this delivers greater value to the customer.

In other words, we see ERP as much about development platform and tools as it is about application functionality and flexibility in adding features is key.

Our Decision

Our decision was to solidify our usage of PostgreSQL, use it to add additional security and tune performance.  We quickly made the decision to move much of the logic into stored procedures in order to support tools written in other languages as well.

One important benefit which helped push us this way was the fact that a stored procedure interface could be centrally controlled and checked against SQL injection while ad hoc query interfaces required careful auditing.

Our Approach and Its Challenges

Our current approach is to name arguments such that they can be mapped in.  This works reasonably well with a single application; it is simple and requires relatively straight-forward queries.  However it is currently limited in two ways: namespaces, and the fact that the interface is very heavily procedural, which makes maintenance a bit of a pain.  It also means that class attributes must be manually maintained across integration classes, which is sub-optimal.

We are currently exploring the possibility of moving to a more object-oriented interface in the db which would solve the above issues.  The interface would be dynamically discovered by querying system catalogs or views we would create, and results cached by the application.  This would allow dynamic API discovery and classes to export the db-level API to other systems in a purely db-centric way.  This would hence be expected to cut our maintenance effort significantly.
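The discovery step itself need not be complicated; a sketch of the kind of catalog query involved (the schema name and the naming convention are assumptions here):

SELECT p.proname,
       pg_get_function_arguments(p.oid) AS arguments
  FROM pg_proc p
  JOIN pg_namespace n ON n.oid = p.pronamespace
 WHERE n.nspname = 'public'
   AND p.proname LIKE 'company%';   -- e.g. every exposed method of a hypothetical company class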

A major challenge here however has to do with quality of stored procedures.  The current quality of stored procedures varies quite a bit.   In general, one of the problems one faces is that application developers don't always make good stored procedure developers because the sorts of thinking are rather different.  However, over time we expect to be able to fix this issue.

Another key part of our strategy is that of layered API's.  Not only are we building discoverable db-level API's but also RESTful web services.

Final Words

LedgerSMB chose a db-centric model because we want to support a wide range of applications written in a wide range of languages, running in a wide range of environments.  This is very specific to our approach, and it is not without challenges.   However it has already had some benefits in our PHP compatibility classes and these are expected to multiply in the future.  In the end we choose a db for many apps over an app for many db's.

Monday, October 1, 2012

Towards a simple explanation of Object-Relational Database Management Systems

A relational database management system is a piece of software which takes organized information and processes it in order to answer questions presented by other software.

In a typical relational database management system, data is arranged into tables, each of which is similar to a worksheet in an Excel workbook.  Information in these tables is matched, processed, and returned to the user.  A skilled database architect can arrange the data so that the structure (called the schema) is easily maintained and extended, so new types of information can be stored as the need arises.

Object-relational database management systems take this system and expand it, allowing for more complex data to be stored in each column and for a wider range of calculations to be attached to the table.   The combination of more complex data and complex calculations allows one to build advanced databases that can do more than one could before.

These systems are called object-relational because they are an attempt to expand what relational database management systems can do by borrowing some ideas from object-oriented programming.  The two disciplines are however very far apart, because object-oriented programming aims to model behavior encapsulated behind interfaces, while object-relational database design seeks to extend relational math to model derived information as well as what is minimally known.

Thursday, September 27, 2012

O/R Series Epilogue 2: Towards a not-so-simple explanation of Object Relational Database Management Systems

Object relational databases build on the relational model but focus on answering questions regarding information that is derived from other information using a variety of general purpose tools.  Modelling of the information then shifts not only to what is known but what can be easily (or even not so easily) extrapolated from what is known.  Existing ORDBMS-type systems appear to include Informix (of course, given the PostgreSQL legacy), Oracle, and DB2, but PostgreSQL and Informix have the longest lineage in this area.

The key features of an ORDBMS are:
  1. The ability to model easily extrapolated information within the data model itself, through use of user defined functions
  2. Extensibility with regard to types and functions, written in general purpose programming languages
These features work together in important ways, and it is possible, as we have seen, to build a great deal of intelligence into a data model by only moving slightly beyond the standard relational approach.  Although every step further brings complexity costs with it, these steps are very often useful and allow problems to be solved close to the database which cannot easily be solved otherwise.

Towards an Object-Relational Algebra

One way to think of object-relational modelling is in the language of relational algebra.  I haven't been able to find accepted notation for functions on individual tuples, so here I use f(relation) notation, to show that the function operates over the set of tuples in the relation.

Relational algebra as defined by Dr Codd is extremely important but it cannot solve certain classes of important queries and so SQL has gone beyond it.  Important blind-spots include:
  1. Transitive operations at arbitrary depth.  Transitive closure and "n next highest values" are both examples of this.
  2. Calculated data problems
Most problems fall into one of these two categories if they are outside the relational model.

We already know that elements of a tuple are functionally dependent on others if, and only if, for each value of the dependency, there is only one functionally dependent value.  So if a is functionally dependent on b, for every b there is exactly one valid value of a.  Functional dependency, as in algebra, is value-based, not expression-based.

I am choosing an algebraic-inspired notation for this, where f(R) is a function of relation R if, and only if, for every tuple in relation R,  there is only one f(R).

f(R) is trivial if for every tuple (a, b, c) in relation R, f(R) returns a value or a tuple that is a subset of the tuple processed.  So if for every tuple (a, b, c), if f(R) = a, or if f(R) = (a, c), then the function is trivial.   Every trivial function also represents a trivial functional dependency within the tuple.

A function is relational if it can be expressed solely through relational algebra.  All trivial functions can be expressed relationally (using π operations) and therefore are also relational.  A relational function thus always specifies a functional data dependency in or between relations.  Relational functions have the property of always denoting global functional dependencies.

A function is non-relational if it cannot be expressed solely through relational algebra, for example if it involves processing of the actual value of one or more of the tuple elements or their functional dependencies.  If we have a relation (Emp) containing an attribute salary_per_wk, and annual_salary(Emp) = πsalary_per_wk * 52(Emp), then annual_salary is non-relational because it involves actual processing of data inside the tuple.  Relational functions can often be expanded in relational operations, but as far as relational operations are concerned, non-relational functions are black boxes and function very much like attributes of a relation.

For example id(R) = πid(R) and c(R) = πname(σid=id(R)(C)) are both relational functions, but only id(R) is trivial.  c(R) represents essentially a join and subselect.
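In SQL terms, a sketch of c(R), assuming a relation r carrying an id attribute and a catalog table c with id and name columns: it behaves exactly like a correlated subselect.

CREATE TABLE c (id int PRIMARY KEY, name text);
CREATE TABLE r (id int REFERENCES c (id));

-- c(R): for each tuple of r, the single name in c whose id matches
SELECT r.id,
       (SELECT c.name FROM c WHERE c.id = r.id) AS c_of_r
  FROM r;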

An example of a general operation in functional notation might be:

πage(R)(R).  Similarly we can write πname(R)(σage(R)=41(R)).

Of course, we must be careful.  Since age(R) is only locally functionally dependent, indexes are out of the question and we must be careful about specifying constraints.  However, defining a relation such that age(R) < 65 might prove problematic unless we are re-checking every day.

This would be similar to the following statement in PostgreSQL:

SELECT r.name
  FROM employee r
 WHERE r.age = 41;

where name and age are table methods.  This allows us to store less information in the database (and hence with less duplication and chances for error) and extrapolate important information out of the data that is stored.  It also allows us to store data in ways which are less traditional (nested structures etc) for the sole purpose of writing functions against it in that specific format and thus modelling constraints which cannot be modelled using more traditional structures (though as we have seen that poses significant gotchas and complexity costs).
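A sketch of what a "table method" looks like in practice, assuming an employee table with a dob column (the function body is only illustrative):

CREATE TABLE employee (id int PRIMARY KEY, name text, dob date);

-- a function over the row type doubles as a virtual attribute
CREATE FUNCTION age(r employee) RETURNS int
LANGUAGE sql STABLE AS
$$ SELECT date_part('year', age($1.dob::timestamp))::int $$;

SELECT r.name
  FROM employee r
 WHERE r.age = 41;   -- r.age is resolved as age(r)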

Similarly recursive queries require recursive operators to work.  I place a capital Greek Sigma (Σ) to signify recursion above the join operator.  This is borrowed because it is the series designator elsewhere in mathematics.  An optional maximum depth is specified as a subscript to the Σ.  So Σ5 would indicate that the expression or join should be subject to no more than 5 iterations.  In a recursive join, the join is repeated until the θ condition is no longer satisfied.  The functions of path(R) and depth(R) are functionally dependent on the output of a recursive join, so Σ5 is identical to (σdepth(r)=5(Σ(...) r)).  The choice of the Σ is also helpful because while Σ always returns an extended superset, σ always returns a subset.

Since path is functionally dependent on the output of a recursive join, we can prove transitive closure over a finite set using recursive self-joins, functions, and boolean operators.  We can also express next highest N tuples results.  Alternatively the Σ can be followed by parentheses to show that an entire expression should be repeated until it brings no further results into the set.  In a Σ expression the set is divided into two subsets:  previous and new.  New results are those returned by the last iteration, and are the only ones processed for join conditions.  On each iteration, the "new" tuples are moved into the previous set and the tuples which satisfied the join condition are moved into the "new" set.
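SQL reaches the same ground with recursive common table expressions; a sketch of a bounded recursive traversal over an assumed part/parent_id table, where the depth test plays the role of the subscript on Σ:

CREATE TABLE part (id int PRIMARY KEY, parent_id int REFERENCES part (id));

WITH RECURSIVE reachable AS (
    SELECT id, parent_id, 1 AS depth
      FROM part
     WHERE parent_id IS NULL          -- the initial "new" set: the roots
    UNION ALL
    SELECT p.id, p.parent_id, r.depth + 1
      FROM part p
      JOIN reachable r ON p.parent_id = r.id   -- joins only against the newest tuples
     WHERE r.depth < 5                         -- corresponds to Σ with subscript 5
)
SELECT * FROM reachable;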

I also use three other symbols for specifying order-dependent information.  ω (omega) denotes a "window" order and partition in which aggregates can be applied to tuples in order, tuples can be removed from the beginning (α with a subscript for number of tuples to "skip") or truncated from the end (τ with a subscript for the number of tuples after which the set is to be truncated).  These allow me to approach the problems which SQL can address but relational algebra cannot.  An interesting property of ω is that the window order only is valid for some specific operations and is lost on any join or select operations.  These operations have interesting properties as well but they are somewhat outside the scope of this posting.  I will however note that it is usually cleaner to solve next N result issues with window ordering, tuple omission, and truncation than it is with recursion and aggregates.
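In SQL these map naturally onto window functions and OFFSET/LIMIT; a sketch of a "next three highest salaries" query, assuming an employee table with name and salary columns:

SELECT name, salary,
       rank() OVER (ORDER BY salary DESC) AS salary_rank   -- ω: the window order
  FROM employee
 ORDER BY salary DESC
OFFSET 1    -- α: skip the single highest value
 LIMIT 3;   -- τ: truncate after the next three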

Next:  Towards a simple explanation of object-relational database systems.

Sunday, September 23, 2012

O/R Series Epilogue 1: 4 Things I Wish PostgreSQL did

In this area there are things I wish PostgreSQL did.  These are listed in order of priority.  Each item also has workarounds listed.

Parameters in Object Methods

In Oracle you can define an object method such that you can do something like:

SELECT li.length(30) from line_segment li;

In PostgreSQL there is no way to pass parameters to an object method called with object.method syntax.  The equivalent in PostgreSQL would be:

SELECT length(li, 30) from line_segment li;

This isn't a huge deal but it would make it a lot clearer that length was closely tied to the line_segment type structure if it was possible to do things that way.

Inheritable foreign key constraints

A very useful extension to what we have would be to allow foreign keys to be inherited on the referencing side.  This would save one from having to define foreign keys over and over for each child table, and it would make key classes even more useful.

Ideally I would like to be able to do

CREATE TABLE country_ref (
     country_id int references country(id)
);

and have the foreign key properly enforced on all inheriting tables.  This would just involve copying the foreign key constraint to each child, and so would probably not require the kind of deep structural changes that the ability to reference child and parent tables together might.
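
Today the workaround is to restate the constraint on every child.  A minimal sketch, assuming the country table used later in this series and a hypothetical customer child table:

CREATE TABLE customer (
    id serial primary key,
    name text not null
) INHERITS (country_ref);

-- the foreign key on country_ref is not inherited, so it must be repeated
ALTER TABLE customer
    ADD FOREIGN KEY (country_id) REFERENCES country(id);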

Inheritable unique constraints 

Related to this, I think inheritable unique constraints would smooth things over and might ultimately lead to more functionality down the road.  For now, I think the first step would be a SCOPE predicate for unique constraints and indexes.  This could be something like:

CREATE TABLE country_ref_u (
    country_id int,
    UNIQUE (country_id) SCOPE PER TABLE,
    FOREIGN KEY (country_id) REFERENCES country(id)
);

Such a unique constraint could then specify that this is unique on each table where it is inherited.  Initially PER TABLE and ONLY HERE (i.e. not inherited) might be supported with an effort to eventually support SCOPE FULL (and DRI against an inheritance tree).

One of the fundamental problems that one runs into with unique constraints across tables is that extending the index format to include tableoid might lead to significantly larger indexes and very likely slower lookups where inheritance is not used.  One option might be to have an additional index page with special pointers to other indexes, so per-table indexes might be useful jumping-off points for full-tree constraints.

Consistent assumptions in composite types used for columns

Composite types in columns are one of those areas where the behavior is remarkably inconsistent and the edge cases poorly documented.  This isn't surprising given that this is relatively new functionality at the edges of relational database usage, and different people have different ideas of how it should work.  However, it seems very difficult to get everyone on the same set of assumptions, and it is impossible to fix the current mish-mash while maintaining backwards compatibility.  Right now there is no consistency, and this makes it a difficult area to use going forward.

Wednesday, September 19, 2012

Object/Relational modelling Part 7: General Design Considerations

The previous posts have hopefully opened up the way we look at modelling relational data, in a move from thinking about relations and tuples to thinking about sets of objects coupled with derived values and catalogs.  These approaches are all viable as long as the actual table storage layout still meets all criteria of a well-designed relational database.

Unfortunately with new techniques and approaches, additional antipatterns are possible.  Every feature can be badly used, and every feature can be badly combined with other features.  For this reason, the more one adds advanced features which complicate the data model, the more it becomes possible to spectacularly fail.  Some of the tools discussed so far can be liberally used, but many must be used sparingly and only as necessary.

I have personally been working with features like table inheritance for some years now.   The features can be of immense help if used properly, but much of the current documentation sets users up to be spectacularly disappointed.

I am now at a point where I can summarize the differences between relational thinking and object-relational thinking.  The difference is similar to the difference between algebra and calculus.  Relationally we want our information arranged so we can select, join, filter, and aggregate to provide useful answers to questions.    This is not so different from using algebra to factor and simplify expressions or to solve problems for specific unknowns.

When a student is learning calculus however, the same tools are used but applied in a very different way.  The approach to thinking is very different.  Calculus applies many of the same techniques to very small numbers and to trends between numbers.  For example, if you have a curve and want to find out what the slope is at a certain point, you look at the limit of the slope between two points as those points converge.  Similarly an integral is the sum of an infinite series of infinitesimals.   Calculus can thus be seen as a level of abstraction above algebra useful primarily for those problems that algebra cannot solve directly, but it builds upon algebraic techniques, and anyone doing calculus will apply algebra wherever possible first.

Here we have a similar process.  We are aggregating or decomposing relational information such that we can derive interesting values from them. However, for a whole host of reasons we do not want to lose sight of the fact that the database --- at least in the sections we are running day to day SELECT queries from --- is relationally well-formed.

If the above comparison has merit then it is worth heeding the words of my college Calculus teacher, Mike Lavender.  Many of these features are power tools and should be used both sparingly and with care.

A second important point is that object-relational thinking and object-oriented thinking are quite different.  Object-relational thinking ties data structures to their functional dependencies, which may not be stored in the database, and tends to be best when approaching data structures from the perspective of what answers can be derived.  Thus, a square is a rectangle with an additional constraint, and this is perfectly consistent with the Liskov Substitution Principle applied to an information model.  A square is mathematically substitutable for a rectangle in terms of what we can derive from it.

On the other hand, object-oriented programming is about modelling behavior and what can be done to objects without transforming them across categories.  Thus in object-oriented programming, you cannot say that a square is a rectangle because there are things we can do to a rectangle that we cannot do to a square without transforming them across class boundaries.  As I will attempt to show in the future, while object-oriented design in application code is very much an art, object-relational design can be expressed in terms of mathematics, namely relational algebra with some extremely modest extensions.  This difference in role and approach means that while many of the object-oriented design principles can be found to apply, they apply to object-relational designs in ways very different than they would in the app layer.

I am planning a short (three post) series on my attempt at an object-relational algebra, and a longer series on applying SOLID principles to Object-Relational design.


The Problem:  "Relational Decay"

 

The basic problem is that when the design does not allow constraints to be mapped cleanly, the usefulness of relational data constraints degrades silently.  This forces either ever-increasing complexity of data management or degraded data constraints.  Both are dangerous and can eventually lead to data integrity and accuracy issues.

Anti-Pattern: Mixing Stored and Subclass Data in Same Virtual Table

Having data written to both parent and child tables complicates data modelling in ways that result in functionally ambiguous foreign keys.  Functionally ambiguous foreign keys are generally bad and to be avoided; foreign keys should be clear, unambiguous, and reference specific tables.  Data inheritance, mixing stored data and data of subclasses together, is a recipe for problems: if a single query pulls data like this together, relational integrity becomes problematic.  Don't do it.

Anti-Pattern:  Multi-way sync'ing.

One of the solutions for virtually any key issue is to use materialized views.  This is particularly helpful when the representation of data must be transformed from a form where derived values can be constrained to one where the data can reasonably be queried relationally (nested table structures create this problem, among others).  It's tempting to try to synchronize data both ways, but this is unmanageably complex.  Data flow needs to go only one way.

Anti-Pattern: Commonly Retrieving Data Directly from Nested Storage

As we have shown, nested tables present massive problems regarding storage and indexing because the logical storage and the physical storage are not really of the same structure.  Nested storage makes a great deal of sense for some problems, as we have seen, but used indiscriminately, the result is invariably a mess.

Anti-Pattern:  Casting complex type to text

When prototyping types, perhaps before writing a type in C, it may be tempting to write casts of tuple types to text.  If you do this, however, things are likely to break because there are built-in casts to text that are used internally.  Instead use an as_text method.
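
A minimal sketch of such a method, assuming the country table (id, name, short_name) used elsewhere in this series; the formatting is of course whatever the application finds useful:

CREATE FUNCTION as_text(country) RETURNS text
LANGUAGE SQL IMMUTABLE AS $$
  SELECT ($1).name || ' (' || ($1).short_name || ')'
$$;

-- called with attribute-style syntax, no cast required
SELECT c.as_text FROM country c;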

 

Solutions:  Relational first, Object second


The solutions below have one thing in common.  They allow relational integrity to be preserved while putting these sorts of  power tools to use.  These however are not necessarily to be used willy-nilly either.  Each one has something of a complexity cost and this must be weighed against the reductions in complexity that each one provides for a project.

Interface inheritance

Interfaces should be inherited, not data; the inheritance design follows from the data.  This first pattern is to build inheritance trees assuming that the only tables being actively queried will be those at the bottom of the tree, where data is actually stored.  In this case inheritance is used not to provide additional query capabilities but rather to provide consistent interfaces across a domain, as sketched below.
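
A minimal sketch with hypothetical note tables: the parent defines the common column set (and any methods written against it), and data is only ever written to and queried from the leaves:

CREATE TABLE note (
    subject text not null,
    body text not null
);

CREATE TABLE customer_note (
    customer_id int not null
) INHERITS (note);

CREATE TABLE vendor_note (
    vendor_id int not null
) INHERITS (note);

-- queries go against customer_note and vendor_note; note itself stays empty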

Stock and Leaf Inheritance Tree Design

Stock and leaf is a name I gave to the idea that you can build an inheritance tree which separates tables out into functional units from which no further inheritance occurs (leaves) and infrastructure tables which provide consistent query capability for a subset of table information.  These stock tables never have data stored in them but provide general query capabilities.

A way to think of the stock and leaf design might be a table partitioned in a complex way.  Maybe we have an initial partition based on some ranges of one primary key column, but one of the first-level tables (not all of them) is partitioned further on a second primary key column.  Thus we have leaf nodes occurring at several different levels of inheritance.  This allows for organic growth of the inheritance tree, but avoids the functionally ambiguous foreign keys that often result.

Simplified Write Model

If the data model requires synchronizing some aspects of it, it is very important to keep the write scenarios to a minimum.  Fewer write scenarios means fewer things that can go wrong, easier testing, more understandable failure cases, and easier troubleshooting when things go wrong.

Log-Aggregate-Snapshot Modelling

One way to greatly simplify writes is to move from representing current state in the database to representing cumulative changes in the database.  Accounting systems, for example, have a very restricted write model (usually few if any update scenarios, most important tables being append-only, etc) along with some current state (balance of the checking account) being calculated based on past changes.  This is a pattern which can be used elsewhere, and it enables other approaches (see below).

The typical objection is that representing state as a series of changes means one is storing a lot more data and that calculating state then requires expensive calculations.  The solution to this is to use periodic snapshots which can be used as roll-forward points, as well as to constrain when input may be entered.  For example we may take year-end snapshots of account balances, plus month-end snapshots of invoice balances, preventing invoices from being entered that are more than a month old, and any financial transactions from being entered into periods in which books are closed.  Similarly this allows us to delete old data without destroying our ability to track current state.
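
To make the roll-forward idea concrete, here is a sketch with hypothetical account_snapshot and journal_line tables: the current balance is the most recent snapshot plus everything entered since it.

CREATE TABLE account_snapshot (
    account_id int not null,
    snapshot_date date not null,
    balance numeric not null,
    primary key (account_id, snapshot_date)
);

CREATE TABLE journal_line (
    id bigserial primary key,
    account_id int not null,
    entry_date date not null,
    amount numeric not null
);

SELECT s.balance + coalesce(sum(l.amount), 0) AS current_balance
  FROM account_snapshot s
  LEFT JOIN journal_line l
    ON l.account_id = s.account_id AND l.entry_date > s.snapshot_date
 WHERE s.account_id = 42
   AND s.snapshot_date = (SELECT max(snapshot_date)
                            FROM account_snapshot
                           WHERE account_id = 42)
 GROUP BY s.balance;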

Log-Aggregate-Snapshot modelling trades data model complexity for write simplicity.  It is itself a power tool but it makes many other power tools safely usable.

Separation of Entry Storage and Query Storage for Append-Only Data

Another approach we can borrow from accounting, this time the paper world, is the idea that the format at point of entry can be transformed and re-saved for later use in another form.   This works well only when the data is append-only and the write transformations are well understood.  However one can use it for materialized views for inheritance trees where needed, or for breaking out data stored in nested tables where this is needed in order to enforce constraints.

For example, we may have nested tables in order to enforce subset constraints.  We can break these out on save into conventional tables where row constraints can be enforced and data more flexibly queried and optimized.
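
A sketch of that break-out, with hypothetical order tables: the nested column is the point of entry, and a trigger copies the lines into a conventional table where row constraints and indexes apply.

CREATE TYPE order_line AS (
    line_no int,
    sku text,
    qty numeric
);

CREATE TABLE orders (
    id serial primary key,
    lines order_line[] not null
);

CREATE TABLE order_line_flat (
    order_id int not null references orders(id),
    line_no int not null,
    sku text not null,
    qty numeric not null check (qty > 0),
    primary key (order_id, line_no)
);

CREATE OR REPLACE FUNCTION explode_order_lines() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
    DELETE FROM order_line_flat WHERE order_id = NEW.id;
    INSERT INTO order_line_flat (order_id, line_no, sku, qty)
    SELECT NEW.id, l.line_no, l.sku, l.qty
      FROM unnest(NEW.lines) AS l;
    RETURN NEW;
END;
$$;

CREATE TRIGGER orders_explode_lines
AFTER INSERT OR UPDATE ON orders
FOR EACH ROW EXECUTE PROCEDURE explode_order_lines();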

object_id inherited field

When using stock and leaf approaches one of the difficulties can be in tracking rows back from the stock table catalogs where this is needed into the actual tables.  Inheriting an object_id field poses some problems, but it works relatively well.  The solution typically is to do something like:

CREATE TABLE object_root (
    object_id bigserial
);

And then inherit object_root.
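
For example, a hypothetical leaf table picking up the object_id column:

CREATE TABLE asset (
    tag text not null,
    description text,
    primary key (object_id)
) INHERITS (object_root);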

 

The O/R "Hand Tools"


The following are in the hand tools category.  They can be elegantly used wherever they seem useful.  However they are ranked by increasing complexity cost.

Object methods for derived data and storage

The major complexity cost here is that as table structures change, methods can get out of date or break.  This can be mitigated by source code control and factoring of the code so that it is easy to understand the codebase.

Table Inheritance to Model Interface Inheritance

This approach trades enforcement of interface consistency for increased ramp-up time and knowledge.  It will slow down designing the database table layout, but it will speed up use of the database once designed.  On the whole this is a bit of a wash complexity-wise, but one should counsel against over-using it.

The O/R "Power Tools"


Like the hand tools above, these are ranked according to complexity costs.  Unlike the hand tools, these are specialized tools, best used rarely.  They have high complexity costs and usually require compensating designs to use safely.

Complex Data Types in Views

Complex data types in views are machine interfaces, not human interfaces.  They make it harder to access data and can confuse casual observers.  Additionally, as types change, the underlying plumbing may have to change as well.

Complex Data Types in Tables

Complex data in tables greatly increases the complexity of table constraints.  There are cases where it helps, but in general multiple inheritance is a cleaner alternative with fewer gotchas.  This is likely to be useful when moving O/R code over from Oracle or other systems that do not support multiple inheritance or the rare case where table inheritance is inadequate to solve the problem at hand.

Nested Table Storage

Currently nested table storage is of limited use except in cases where it allows otherwise impossible constraints to be modelled.  The current approaches to storage make it a poor fit for general-purpose queries, at least where the data sets are likely to get large.  Nested tables can, however, be used for small data sets where indexes might not be useful anyway, or as a point of original entry from which data is then copied into another table for actual access.

There isn't any reason why these issues can't be solved in future versions, but the fact that this is a tool that works well only with edge cases anyway means it is unlikely to be a very high priority to folks not using it.

Table Inheritance to Model Data Inheritance

This adds tremendous complexity to the schema, and requires a lot of effort to make work correctly.  Only use it if you are sure that in your specific case the complexity issues it solves for you are worth the costs.

Sunday, September 16, 2012

O/R Modelling in part 6: Polymorphism (and Gotchas) in PostgreSQL

Polymorphism is common in virtually all database management systems.  When we add an integer to a float, we expect this to just work.  Similarly various types of text fields can usually be concatenated together, etc.  In particularly extensible database systems, like PostgreSQL, polymorphism tends to lead to additional problems.  Indeed, implicit casts were cut way back a few years ago specifically because of problems they might cause.  This is an area I expect we will do quite a bit more with in LedgerSMB in the future.

The PostgreSQL Type System

PostgreSQL types can be divided into three broad categories: primitives (those supplied by PostgreSQL out of the box or by extensions), domains, and object types.  The object types include composite types and table types.  Although these are typically lumped together, we will cover table types and composite types separately because the capabilities are far more developed when tables are created.

Base Primitive Types

Base types are intended to be the basic building blocks of relations.  They include such types as int4 and text.  These cannot have  methods attached but they can be arguments in functions provided that the procedural language understands the type.  Most of these will be passed in using the textual representation if all else fails.

Polymorphic Primitive Types

Polymorphic primitives are primitives which can take the form of other base types.  The primary ones are anyelement (which is any simple base type) and anyarray which is an array filled with any simple base type.  These are mostly used in functions so that the function can apply to array operations generally.  I have not encountered them elsewhere.
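
As a small sketch of where the polymorphic primitives show up, a function declared against anyarray/anyelement works for arrays of any base type:

CREATE FUNCTION first_element(anyarray) RETURNS anyelement
LANGUAGE SQL IMMUTABLE AS $$
  SELECT $1[1]
$$;

SELECT first_element(ARRAY[3, 1, 2]);             -- returns 3
SELECT first_element(ARRAY['FR', 'ID']::text[]);  -- returns FR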

Pseudotype Primitives

Pseudotypes include triggers and a few other things.  They are mostly used as keywords, not as actual types.

SQL-Created Domain Types

Domains are the first significant type of SQL-created or user-defined type.  An SQL domain is a base type restricted to a subset of its values.  Domains can be declared NOT NULL; this applies only on storage, but unlike base types it is checked as a property of a composite type as well.
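
A minimal sketch of a domain (a hypothetical two-letter country code):

CREATE DOMAIN country_code AS char(2)
    NOT NULL
    CHECK (VALUE ~ '^[A-Z]{2}$');

-- usable anywhere a base type is, e.g. as a column or function argument
CREATE TABLE port (
    name text not null,
    country country_code
);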

SQL-Created Object Composite Types

PostgreSQL allows you to create composite types which can be used as input and output types for functions, elements in other composite types, columns in a table, and so forth.  Composite types have no inheritance and cannot inherit interfaces from each other.  Composite types cannot have any recursive elements, and this is checked deeply.  For example you cannot:

CREATE TYPE a AS (
   a int
);

CREATE TYPE b  AS (
   a a
);

 ALTER TYPE a ADD attribute b b;

An attempt to do so will result in an error:

 ERROR:  composite type a cannot be made a member of itself

SQL-Created Object Table Types 

Table types are similar to composite types except that inheritance is supported.  This means in practice that the behavior of tables is quite different than the behavior of composite types when it comes to methods, implicit casting, and the like.  Note that implicit casts are not inherited.

Type Conversion:  CREATE CAST

Values and objects can be converted between types using casts.  Casts can be explicit only, implicit on assignment, and implicit generally.  In some cases casts can be dangerous.

The overall syntax of CREATE CAST is discussed in the PostgreSQL manual.  Almost all user-created casts are likely to use an SQL function to convert one type to another.  In general explicit casts are best because this forces clarity in code and predictability at run-time.

For example see our previous country table:

or_examples=# select * from country;
 id |     name      | short_name
----+---------------+------------
  1 | France        | FR
  2 | Indonesia     | ID
  3 | Canada        | CA
  4 | United States | US
(4 rows)

Now suppose we create a constructor function:

CREATE FUNCTION country(int) RETURNS country
LANGUAGE SQL AS $$
SELECT * FROM country WHERE id = $1 $$;

We can show how this works:

or_examples=# select country(1);
    country   
---------------
 (1,France,FR)
(1 row)

or_examples=# select * from country(1);
 id |  name  | short_name
----+--------+------------
  1 | France | FR
(1 row)

Now we can:

CREATE CAST (int as country)
WITH FUNCTION country(int);

 And now we can cast an int to country:

or_examples=# select (1::country).name;
  name 
--------
 France
(1 row)

The main use here is that it allows you to pass an int anywhere a function expects a country, and we can construct the country at run-time.  Care of course must be used, the same way as with dereferencing foreign keys, if performance is to be maintained.  In theory you could also cast country to int, but in practice casting complex types to base types tends to cause more problems than it solves.

Inheritance-Based Implicit Casts

Inheritance-based implicit casts are an exception to the dangers of implicit casting.  These are not base type to base type casts, and they essentially act as slices of a table.  If your table structures use inheritance in a sane way they will work well.

Defined Implicit Casts

We can generally define implicit casts between tuples.  This is best done sparingly because any implicit cast multiplies execution pathways.  Defined implicit casts are not inherited.

To create an implicit cast, we could do as follows:

DROP CAST (int as country);

CREATE CAST (int as country)
WITH FUNCTION country(int)
AS IMPLICIT;

The problem of course is that as soon as we do this we have the potential for ambiguity.

Suppose I have a hypothetical state table, and I have functions called capital which can take arguments of country or state and return the name of the capital city.  If I make a query of "select capital(1);" then am I asking for a country capital or a state capital?  I have always avoided this situation because it can never end well (so much so that I don't actually know how PostgreSQL resolves the issue).
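
The ambiguous setup, sketched out (a hypothetical state table, a constructor and implicit cast built the same way as for country, and one capital function per type); the final query is the one I would not want to rely on:

CREATE TABLE state (
    id int primary key,
    name text not null,
    capital_city text not null
);

CREATE FUNCTION state(int) RETURNS state
LANGUAGE SQL AS $$ SELECT * FROM state WHERE id = $1 $$;

CREATE CAST (int AS state)
WITH FUNCTION state(int)
AS IMPLICIT;

CREATE FUNCTION capital(country) RETURNS text
LANGUAGE SQL AS $$ SELECT 'capital of ' || ($1).name $$;

CREATE FUNCTION capital(state) RETURNS text
LANGUAGE SQL AS $$ SELECT ($1).capital_city $$;

-- SELECT capital(1);  -- country or state?  both implicit casts could apply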

Keep implicit casts for cases where you really need them and avoid them in other cases.

Problems with Implicit Casts to Primitives

The problems with implicit casts are greatly magnified when casting to (and hence between) primitives.  We will discuss just one such problem here:  numbers to timestamps.

Currently PostgreSQL supports casting dates as timestamps, which is probably correct.  Suppose we want to take a number and turn it into a timestamp.  Maybe we are converting UNIX epochs to timestamps regularly and want to do so without an explicit cast.

create cast (double precision as timestamp with time zone) 
with function to_timestamp(double precision) 
as implicit;

For integers we also need a helper function:

CREATE FUNCTION to_timestamp(int)
returns timestamptz
language sql immutable as
$$ select $1::double precision::timestamptz; $$;

And add a similar implicit cast for integers:
create cast (int as timestamp with time zone) 
with function to_timestamp(int) 
as implicit;

And it works well enough.  However, now we have a bug factory.

or_examples=# select extract(year from '2011-01-01'::date);
 date_part
-----------
      2011
(1 row)

or_examples=# select extract(year from 2011-01-01);
 date_part
-----------
      1969
(1 row)

Why?  Because the second is an integer math expression: 2011 - 01 - 01 evaluates to the integer 2009, which our implicit cast turns into 2009 seconds after the Unix epoch (midnight UTC, January 1, 1970), displayed here as December 31, 1969 in a US time zone.  This is probably not what we had in mind.  Similarly:

or_examples=# select age('2011-01-01'::date);
      age     
---------------
 1 year 8 mons
(1 row)

or_examples=# select age(2011-01-01);
           age           
--------------------------
 42 years 8 mons 07:26:31
(1 row)

It is important to note here that the computer is doing exactly what we told it to do.  The problem however is that we now have bugs which are non-obvious to someone who is not quite fluent on the system to start with.

Another historical case that used to bite people is implicit casts of tuples to text.  This led to very counterintuitive behavior in some cases, such as the infamous tuple.name or tuple.text issues.  Basically tuple.name would return name(tuple::text), which would be a textual representation of the tuple truncated to 63 characters.  This was fixed by making the cast of tuple to text explicit in PostgreSQL 8.3.

Polymorphism is a very important feature of PostgreSQL, but it is one with many gotchas.  In general, adding explicit casts won't get you into too much trouble, but implicit casts are often asking for problems.

Thursday, September 13, 2012

Object/Relational Interlude: Messaging in PostgreSQL

PostgreSQL as an object-relational development platform doesn't end with data modelling.  The notification system is particularly powerful and worthy of mention.  This system however has some challenges and gotchas which are also worth mentioning.  The example below was early sample code bundled with LedgerSMB from 1.1 onward.  The code is not really useful in production for reasons discussed below, but it can be a basis for production systems, and it functions as an example both of how to make things work and of the problems you can encounter.

I would like to thank Marc Balmer of Micro Systems for some contributions from slides he had sent to me on this.

Basic Framework:  LISTEN/NOTIFY

PostgreSQL provides a messaging system with two commands, listen and notify.  These commands allow any process to broadcast an event to any listener.  In some older versions, listeners are listed in a system table and this information is presumed public.  In newer versions this information is not in the system catalogs and so this reduces the security exposure from the use of this method, but many areas of exposure remain and must be understood.

This approach is divided into broadcast channels.  Any user can broadcast along any channel.  Anyone can listen on any channel. The actual notifications are public.  The syntax is:

LISTEN channel; -- for the listener

and


NOTIFY channel [, payload]; -- for the broadcaster


Channels are database-specific.

A Brief Example

or_examples=# listen test;
LISTEN

In another session:

or_examples=# notify test, 'testing testing testing';
NOTIFY

In the first session, psql will only check for notifications when executing another query, so we run one:

or_examples=# select 1;
 ?column?
----------
        1
(1 row)

Asynchronous notification "test" with payload "testing testing testing" received from server process with PID 29052.

It is of course possible to send actual meaningful data in the payload but this is probably not a good idea since we can't really verify the sender without a lot of extra code.

In an actual application we can take the notification to be a sign that we need to check something in the application.  In general, I can't find any case where the payload is actually a good idea to use.  It's probably better to use tables to create a message queue.  Then the notification can be used to reduce the actual requests to the tables.

Note further that NOTIFY is raised on transaction commit, making it transaction-safe.  NOTIFY allows you to build extremely interactive systems where many applications actively communicate across the database, and cross-transaction logic can be encapsulated, from a user application's perspective, within the database layer itself.

Listen Gotchas

The Listen architecture does not provide security by itself.  This must be provided on the underlying tables.  In general it is always a mistake to trust the payload of the notification too much.

Additionally you need a way to extract only the records you want.  I will go over a badly designed case below.  One approach is a message table.  Another approach might be a message view.  Either way the listener is a separate program.

Finally it is worth noting that the application actually has to ask the connection libraries if there is a notification.  This can cause some confusion because, for example, psql only checks when a query is run.  With a Perl script, however, we might wake up to check once every minute or second, or whatever we want to do.

Notify Gotchas

Listeners require no security privileges to listen to notifications or act on them.  The listener is a separate program and could be anywhere else.  You should never trust that the NOTIFY payload is not overheard somewhere.

Additionally, it is worth remembering that NOTIFY is *always* raised on commit and cannot be removed without rolling back the transaction.  However, if the actual message is in a message table, it could be safely deleted.

Patterns

Message Queue Tables

When we add tables with messages to the system we will be able to actually use PostgreSQL as a message queue server.  For example, in 9.2 we can do something like:

CREATE TABLE message_queue_test (
    id bigserial primary key, -- a surrogate key is appropriate here, not natural data
    sender text not null default session_user,
    payload json not null,
    received bool
);

-- payload is json, which has no default btree operator class, so it
-- cannot be part of the index itself
CREATE INDEX msq_test_pending_idx 
ON message_queue_test (id, sender) 
WHERE received IS NOT TRUE;

Such a partial index cannot cover the json payload, but over time it may be useful in eliminating rows already read.  You can then keep old records around for a while for debugging purposes.  You could even have an abstract table, multiple child/"leaf" tables, and unified object-relational interfaces for these.  Indeed this may be an interesting project for the future.

Concurrent Access and Locking

Note that the standard concurrent access and locking rules apply.  If you want this to be a message queue that several different programs can read, with every message sent out only once, then you should make sure you use SELECT ... FOR UPDATE.
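
A minimal sketch of a dequeue transaction against the table above; the id in the UPDATE is whatever the SELECT returned:

BEGIN;

-- claim the oldest unread message; concurrent readers will block here
-- rather than process the same row twice
SELECT id, sender, payload
  FROM message_queue_test
 WHERE received IS NOT TRUE
 ORDER BY id
 LIMIT 1
   FOR UPDATE;

-- process the message in the application, then mark it as read
UPDATE message_queue_test SET received = true WHERE id = 1;

COMMIT;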

An Example

 In LedgerSMB I created a sample script that would show how to send an email out whenever a short part was sold, to remind someone to order.  Unfortunately this was not so well designed, but I guess it was good enough as an example.

The Trigger

CREATE OR REPLACE FUNCTION trigger_parts_short() RETURNS TRIGGER
AS
'
BEGIN
  -- the part has dropped to or below its reorder point
  IF NEW.onhand <= NEW.rop THEN
    NOTIFY parts_short;
  END IF;
  RETURN NEW;
END;
' LANGUAGE PLPGSQL;

In actual use cases, this isn't enough information to process this properly.  Please see the problems below.  The basic problem is that this notification causes the listener, when used, to email out a full parts short report to the individual(s) specified.  Depending on how it is set up, this might be a bunch of reports in a short time.  This could really use a queue table.

The Listener

I wrote this code quickly when I was still learning Perl. It's definitely quick and dirty code.  I am not proud of the clarity or maintainability, but what it does is actually trivial.

The script starts off with the basic setup, which basically sets up the configuration information and the database connections:

# config.pl is expected to define $database, $db_user, $db_passwd,
# $cycle_delay, $sendmail, the report templates, and the MAIL format
# used by write() below.
require "config.pl";

use DBI;
my $dsn = "dbi:Pg:dbname=$database";
my $dbh = DBI->connect(
    $dsn, $db_user,
    $db_passwd,
    {
        AutoCommit => 1,
        PrintError => 0,
        RaiseError => 1,
    }
);
$dbh->{pg_enable_utf8} = 1;

my $sth;

$dbh->do("LISTEN parts_short");



The next part is the main loop, which just wakes up, checks for notifications, acts on them if applicable, and if not goes back to sleep:

while (1) {    # loop infinitely
    if ( $dbh->func('pg_notifies') ) {
        &on_notify;
    }
    sleep $cycle_delay;
}

And finally what we do if we got a notification:

sub on_notify {
    # pipe a parts short report to sendmail
    open( MAIL, '|-', "$sendmail" );
    $sth = $dbh->prepare( "
        SELECT partnumber, description, onhand, rop FROM parts
        WHERE onhand <= rop
  " );
    $sth->execute;
    print MAIL $template_top;
    # one formatted line per part at or below its reorder point
    while ( ( $partnumber, $description, $avail, $rop ) = $sth->fetchrow_array )
    {
        write MAIL;
    }
    print MAIL $template_foot;
    close MAIL;
}

Problems

Aside from the fact that the code isn't the most maintainable code out there, there are a number of structural problems with the whole solution.  First, what this actually does is probably not what you actually want.  It wakes up every so often (a configurable value) and checks for notifications and if it finds them will send out a parts short report to the designated individual.  Note it won't send just the new parts, but actually all the parts short.  If you have a store with thousands of parts in stock and are nearing ROP on a few hundred, and then the holiday traffic starts before your orders come in, this will be rather annoying and likely to be intentionally filtered out by whoever is supposed to receive it.

Additionally, if you have a case where you are short many parts, a malicious user may be able to find a way to force the system to send out a parts short report every time it wakes up, which may not cause a denial of service attack but in fact may be highly annoying and prevent the actual emails from being reasonably useful.

Solutions

The best solution is to go with a queue table and have the trigger write to it.  Then the reader can read it and email out a notice about exactly what has happened.  This means more relevant information and a better signal-to-noise ratio.  A sketch of that approach follows.
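
A minimal sketch, assuming the LedgerSMB parts table with partnumber, onhand, and rop columns; the listener would then read only unprocessed queue rows instead of re-running the whole report:

CREATE TABLE parts_short_queue (
    id bigserial primary key,
    partnumber text not null,
    onhand numeric not null,
    rop numeric not null,
    created_at timestamptz not null default now(),
    processed bool not null default false
);

CREATE OR REPLACE FUNCTION trigger_parts_short() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
    IF NEW.onhand <= NEW.rop THEN
        INSERT INTO parts_short_queue (partnumber, onhand, rop)
        VALUES (NEW.partnumber, NEW.onhand, NEW.rop);
        NOTIFY parts_short;
    END IF;
    RETURN NEW;
END;
$$;

CREATE TRIGGER parts_short_notify
AFTER UPDATE OF onhand ON parts
FOR EACH ROW EXECUTE PROCEDURE trigger_parts_short();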

Next:  Polymorphism in PostgreSQL