Monday, October 15, 2012

Object/Relational Algebra 2: Function Properties

Previously I introduced the idea that functional notation allows relational algebra to play well with other mathematical disciplines, and therefore solves, at least in theory, the problem of relational algebra's limited expressiveness.  In this second part I will discuss the basic properties of functions as I see them in this approach; the next, and final, posting in this series will cover the special series join function, which provides a way to address transitive connections between tuples in a relvar.

Definition and Types of Functions


The general approach of this object-relational algebra involves applying algebra to manipulate operations on functions of sets of tuples.  A set of tuples is a relation; a named, persistent relation is a relvar, or relation variable.  The set of tuples in the relation is the domain of the function, and as with all functions, it maps each domain element to exactly one result, so every tuple yields a single value.  That value can of course be a set (including the empty set), a tuple, or even a relation.  However, it is also necessary to note that a function must return the same basic type of result for every tuple.  We can't return a simple value for one, a relation for another, and a set of values for a third.

Each element in a function can be seen as relational, in the sense of falling within the expressive boundaries of relational algebra, or non-relational, in the sense of falling outside those boundaries.  Functions themselves can then be divided into three categories:
  1. Relational functions contain only relational elements and can be expressed solely in relational algebra.
  2. Non-relational functions contain only non-relational elements and cannot be further simplified in relational algebra in any circumstance, and
  3. Semi-relational functions contain both relational and non-relational elements; these can be partially simplified using relational operations.
All types of functions have some properties in common.  If a function is a function of a relation, then it is necessarily also a function of every candidate key of that relation, and for any well-formed data that could be added, it will continue to return a single value of consistent type.

A second level of categorization concerns whether the function's output is fully determined by its domain, i.e. whether the function is mathematically self-contained.

For example, suppose we have a relvar of employee data, E, and E contains an attribute dob, representing the employee's date of birth.   Therefore, for every tuple in E, there is a dob element.   We can then have two functions:

age(E)  represents the current age of the employee at the time it is evaluated.

age2000(E) represents the age of the employee on New Year's Eve, 2000.

age2000(E) is fully self-contained: its values depend on the values stored in the relvar E and nowhere else.  age(E), however, is not self-contained.  For any given relational operation, age(E) will behave like a function and we can use it as such, but its results will change over time.  Oracle calls these deterministic and non-deterministic functions respectively.  PostgreSQL divides these up into additional categories for planning purposes: in addition to immutable functions, whose output never changes for a given input, you have stable and volatile functions, the latter not really being functions of their input at all.
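
To make this concrete in PostgreSQL terms, here is a minimal sketch; the table and function names are mine, not part of the algebra:

-- a hypothetical employee relvar with a date-of-birth attribute
CREATE TABLE employee (id int PRIMARY KEY, name text, dob date);

-- age2000(E): depends only on stored data, so it can be declared IMMUTABLE
CREATE FUNCTION age2000(e employee) RETURNS interval
LANGUAGE sql IMMUTABLE AS
$$ SELECT age(timestamp '2000-12-31', $1.dob::timestamp) $$;

-- age(E): depends on the current date, so STABLE rather than IMMUTABLE
CREATE FUNCTION current_age(e employee) RETURNS interval
LANGUAGE sql STABLE AS
$$ SELECT age($1.dob::timestamp) $$;  -- one-argument age() subtracts from current_date

SELECT e.name, age2000(e), current_age(e) FROM employee e;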

Properties of Relational Functions


I didn't really start getting excited about this until I started working with relational functions.  Once I started, though, there was no going back.  Relational functions can be expressed in relational algebra and therefore map roughly to subqueries in SQL.  Of course this mapping is not exact, but there are cases where it is helpful to look at things from this perspective.  This is particularly important when optimizing simple functions written in SQL and called in WHERE clauses.

Relational functions can be expanded algebraically in other relational operations.  For example:

Suppose we have a relvar L which represents landmarks, and has an attribute country_id which is a foreign key to country, and suppose further we have a relvar C which represents countries and has an attribute called name which represents the country's name.  We can then define the following functions:

let country(L) = σid=L.country_id(C)

let frenchl(L) = σ(country(L)).name='France'(L)

country(L) can be in-lined into frenchl(L), and this can be transformed into a subselect, and eventually (in a set- rather than bag-based approach at least) into a left join.
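
In SQL terms the same expansion looks like this (table and column names assumed for illustration):

-- frenchl(L), with country(L) as an explicit subselect:
SELECT l.*
  FROM landmark l
 WHERE (SELECT c.name FROM country c WHERE c.id = l.country_id) = 'France';

-- ...which can be flattened into a join:
SELECT l.*
  FROM landmark l
  JOIN country c ON c.id = l.country_id
 WHERE c.name = 'France';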

The set of single-value-returning relational functions is a superset of the set of functional dependencies reachable in the database through appropriate join dependencies.  Without grouping and aggregation, the two sets are identical; with grouping and aggregation, relational functions can express a strictly larger set of possible values.

 

Properties of Non-Relational Functions


Non-relational functions cannot be expressed in relational algebra and therefore must be evaluated outside the stream of relational operations themselves.  Non-relational functions must be treated according to their return types: a set-returning function can be treated as a set, a single-value-returning function can be treated like a relvar's attribute, and a relvar-returning function can be treated like a relvar itself.
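
In PostgreSQL terms, for example, the built-in generate_series() is a set-returning function that can be dropped into a query wherever a relation is expected, while a scalar function slots in wherever an attribute could appear:

-- a set-returning function treated as a relation:
SELECT * FROM generate_series(1, 3) AS s(n);

-- a single-value function treated as an attribute:
SELECT n, sqrt(n) FROM generate_series(1, 3) AS s(n);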

This field moves relational algebra from the area of solving relational queries into the area of symbolic simplification.

Next:  The special "Join Series" function.

Thursday, October 11, 2012

Object-Relational Algebra 1: Definitions and Assumptions

 Introduction

This post begins a series to try to establish a mathematical basis for object-relational database systems.  The idea here is to extend relational math in order to cover object-relational database operations rather than to model current databases mathematically.  I therefore want to start with my assumptions and definitions.

For simplicity's sake I will follow Codd's assumptions, namely that each row is unique, that row ordering is not significant, and that within each row the column names, not their order, are significant.  This establishes a definition of a relation as a set of tuples.  Everything here is then an extension of Codd's math, rather than a replacement for it.  In the programming world, boolean operations may mean something slightly different, or bags may be used instead of sets, but the goal here is not to model databases but to model data, and Codd's model is more elegant in this regard.  It is the real world that gets messy.

To Codd's assumptions I will add my own, namely that the value of any attribute is opaque to relational operations but may act as the domain for a function.  This becomes important when I get to the special Σ function and the functional dependencies on its output.  The special function Σ(R) (join series) with a θ condition gives you a way to represent self-joins of arbitrary depth in relational math, for example, and combined with functional dependencies of its output gives you a way to express queries of transitive binary closure, but that is not its most interesting use, at least to me.
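
Σ itself is developed later in this series, but the flavor of the problem is familiar from SQL: the closest widely available analogue to a self-join of arbitrary depth is a recursive common table expression.  A sketch, with the edge table and starting id assumed for illustration:

-- transitive closure over edges(parent_id, child_id), starting from node 1
WITH RECURSIVE reachable(id) AS (
    SELECT child_id FROM edges WHERE parent_id = 1
  UNION
    SELECT e.child_id
      FROM edges e
      JOIN reachable r ON e.parent_id = r.id
)
SELECT id FROM reachable;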

A second important point here is that I will diverge from some previous efforts in that I will not throw in "magic operators" designed to solve corner cases.  The goal here is to build a more expressive system that can solve more problems.  Therefore an operator that determines transitive binary closure is ruled out, and if we can't solve a problem without a magic operator, then this will be left for others.  Additionally this effort is intended to be minimalist and allow many different behaviors to be encapsulated through different approaches.  For this reason I will build directly on the work of Codd and more or less ignore other approaches that various researchers have taken.

Finally this is somewhat to be distinguished from the way SQL does things.  SQL itself is a messy (from a math standpoint) application of both relational algebra and predicate calculus, using bags instead of sets.

One important problem is choosing a name for this extension.  The term object in computer science tends to be a bit abstract and not really a good description of what is going on here.  A better description might be functional-relational algebra.  I call it object-relational algebra only because it fits in with object-relational databases.

In the posts that follow, relation and function will be given their ordinary mathematical meanings.  However, the relationship between the two is not, to my knowledge, well defined.

What is an Object Anyway?

In computer science terms, an object is a data structure (here represented as a tuple) which has imperative behavior attached to it.  Object-oriented programming is thus essentially an extension of structured programming, where behavior follows structure.  Classes define both structure and behavior, and objects instantiate and apply these rules.  Applying a model such as this to an algebra of data is somewhat problematic, because algebra is about derivation of values, not about behavior.

What we are doing here is similar, and yet different.  I haven't fleshed out a theory of inheritance vs composition (PostgreSQL supports multiple inheritance which can be used to implement either inheritance or composition, strictly speaking), and collection tables can be used to implement composition in various databases including PostgreSQL.  A theory of composition and inheritance is left to others.  From the standpoint of my model these are equivalent.

Instead of behavior, I use the same basic approach of tying structure to function in order to expand the reach of the relational model.  In computer science terms, a class is then roughly similar to a relation, and an object roughly similar to a tuple.  However, instead of behavior, the typical focus is on derived information.  Thus functions become important in a way in which they have not typically been used in relational math.

Functions of a Relation

So here, f(R) is a function of relvar R if, and only if, for every possible tuple in R, f(R) returns exactly one result (that result could, however, be a tuple or even a set).  Functional dependencies stored in the database can then be modeled as functions, but so can functional dependencies which are not stored.

The following then is not a function: f(R) = x^0.5, because it resolves to two distinct values for every value of attribute x in relation R (one positive and one negative).  However, the following is a function: f(R) = (abs(x^0.5), -1 * abs(x^0.5)), because it resolves to a tuple containing both possible answers from the previous expression, at least if x is always non-negative or imaginary results are allowed.  It could return a set instead and that would be acceptable; however, in this case the structure of the result is also set by the function, and the above function is distinct from g(R) = { abs(x^0.5), -1 * abs(x^0.5) } because the structure of the result is different.
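
A sketch of the distinction in PostgreSQL terms (the type and function names are mine); the composite type fixes the structure of the result, which is exactly what makes this a function:

-- assumes x >= 0, since sqrt() errors on negative input
CREATE TYPE root_pair AS (pos_root numeric, neg_root numeric);

CREATE FUNCTION roots(x numeric) RETURNS root_pair
LANGUAGE sql IMMUTABLE AS
$$ SELECT sqrt($1), -sqrt($1) $$;

SELECT roots(4);  -- returns the single tuple (2,-2)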

In standard relational algebra, a tuple is finitely expressive: one can only express a finite set of values from a single tuple.  However, for any given tuple, an infinite number of functions are possible, and thus, when combined with functions, a tuple becomes infinitely expressive.  Not only can all functional dependencies of the tuple's superkey be expressed as functions of the tuple, but any transformation of the tuple's values, or of the values of its functional dependencies, can be expressed as such as well.

A functional approach also allows us to dispense with the rename operation in relational algebra, since renamed relations can be seen as relation-returning functions.

Kinds of Functions

In my model, functions can be divided up into the following categories:
  1. Relational Functions can be expressed solely in relational algebra.
  2. Non-relational functions possess no non-trivial relational algebra reduction.  x^2 is non-relational.
  3. Complex functions have non-trivial relational reductions, but also non-relational components.
Assuming sufficient join dependencies in a database, every functional dependency can be expressed through relational functions.  Moreover, trivial functional dependencies can always be expressed by relational functions, and indeed by mere projection operations.  We can then define a trivial relational function as one which reduces solely to projection operations over information stored in the tuple.
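
In PostgreSQL terms, a trivial relational function is just a projection wrapped in a function (names assumed for illustration):

CREATE TABLE person (id int PRIMARY KEY, first_name text, last_name text);

-- trivial: reduces solely to a projection of a stored attribute
CREATE FUNCTION surname(p person) RETURNS text
LANGUAGE sql IMMUTABLE AS
$$ SELECT $1.last_name $$;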

Result

The resulting system essentially creates something like an object-oriented database, but one which is fundamentally relational, and in which objects behave differently than they do with mere object-oriented persistence.  While each tuple is infinitely expressive, this is possible only because of a distinction between primary stored data (Codd's algebra) and secondary (calculated) answers.  This extension, however, allows any kind of mathematics (or other logic) to be tied into relational algebra, which in turn allows relational algebra to be used alongside many other disciplines to build extremely sophisticated data models.

Forthcoming:
1: Exploring properties of relational, non-relational and complex functions
2: The special function  Σ(R) representing the output of a simple series join.

NoSQL-like Approaches to PostgreSQL, a reply to Martin Fowler

The major point of many of Martin Fowler's blog posts, repeated in his book, is that NoSQL represents a movement away from integrating on the database and towards encapsulating databases in applications and using services to integrate data.  However, the fact that this isn't the way higher-end RDBMSs are always used speaks more to the market than to the RDBMS technology itself.  For the last five years I have been building relational databases along a similar approach, using web services as an inspiration for interfaces.  This post discusses a few different approaches to doing this on PostgreSQL.  Note that other RDBMSs may differ in many of these regards.

The Promise of the Relational Database Model


Relational databases are designed to store information in application neutral formats and then present it in various ways to various applications.  The goal is, as Fowler puts it, the "integration database."  In order to facilitate this goal in a maintainable way, most RDBMS's come with solid sets of features designed to encapsulate physical storage behind application-specific presentation of the data stored.  These include views, functions, and stored procedures.  These features allow a database engineer to build applications which encapsulate the physical storage, so that the database management system essentially serves out an information model through a myriad of client-specific interfaces.  This achieves what Fowler is advocating, namely encapsulating the database in the application, but it allows for declarative access to that API.   At least in theory, there is no difference between what Fowler advocates for NoSQL and what many DBA's do with an RDBMS.  This also suggests that Fowler is on to something important, but that ditching the RDBMS may in many cases not be the answer.
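
As a minimal sketch of what that encapsulation looks like (schema invented for illustration), the physical table stays private while the view presents an application-specific interface:

-- physical storage
CREATE TABLE customer (id serial PRIMARY KEY, name text, deleted_at timestamp);

-- application-facing presentation which hides the storage details
CREATE VIEW active_customer AS
SELECT id, name FROM customer WHERE deleted_at IS NULL;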

Why the RDBMS Rarely Delivers on That Promise


For all the theory though, most RDBMS's are rarely used in this way.  I think there are several factors that typically weigh against such encapsulation:

  • Investment (expertise and license costs) in a specific RDBMS by IT departments means that application developers want to write portable code, and thus write lowest common denominator code
  • Application developers often want to ensure that all access goes through their databases and thus do not want to centralize application logic in the database, and
  • Many of the standards ensure that tools are not really optimal for this task.  The best tools are those which go beyond the standards, thus with greater implications regarding portability.
The first market factor has to do with the fact that relational database management systems are expensive in several senses.  In addition to the (often hefty) licensing costs, there is the necessity of hiring people with experience and/or training them.  Companies usually address this cost by consolidating their expertise and servers on a single RDBMS, which is why you see Oracle shops, SQL Server shops, DB2 shops, etc.  This leaves lock-in as a major cost for businesses, and it reduces the market for database-specific applications.  Application developers therefore, quite rationally, more often choose to write SQL code that runs on many different RDBMSs.  This naturally requires eschewing advanced features, like Oracle Objects or pl/pgsql functions, in favor of vanilla SQL.

The problem, of course, is that it is hard to encapsulate data when you are limited to the most basic feature set.  While views may work, making them updatable is done differently, with different consequences, on every RDBMS.  Stored procedures or functions are far worse in this regard.  Consequently the need to write portable code requires essentially dispensing with the very tools used to encapsulate data.  For this reason I don't think you can both write a database for use by multiple applications (i.e. where the internal data structures are encapsulated) and write a database that runs on multiple RDBMSs.  You have to choose.  Developers cannot be blamed for choosing the option which gives their applications the most market and lock-in.

Competing desires for application lock-in are another factor.  RDBMS vendors typically want to restrict access to a certain number of client access licenses or seat licenses, and if connections can be pooled in middleware this can help circumvent some of this limitation (it may cause contractual issues depending on the nature of the EULA, but the technical controls are easily worked around in this manner, and at some point you run into problems of definition, particularly when systems are loosely coupled).  Application developers want to sell their own licenses, and this can only be done if the connections are checked in the application layer.  Therefore it is against the interests of many application developers to ship with encapsulated database schemas.  Instead, the RDBMS is used largely as a private data store with some additional reporting capabilities.

Some RDBMS vendors actually optimize their systems for the above needs.  One feature SQLAnywhere offers is that developers like Intuit can lock a database to a specific application, banning all third party access, and a large part of MySQL's current popularity may be tied to the fact that it is well-optimized for moving unencapsulated databases that run on other DB's to it.  In particular the sql_mode setting on one hand facilitates porting code to MySQL and on the other makes it relatively unsafe for this sort of encapsulation.

However, market factors aren't the only ones pushing developers away from building databases in this manner.  The perpetual debate over stored procedures illustrates a mismatch between at least one commonly used tool and the typical use case of that tool.

Stored procedures, as the name implies, are essentially imperative constructs which take specific arguments and return records.  The idea is that they provide a basic imperative interface language to a database.  Instead of SELECT .... you issue CALL some_proc(arg1, ....);

There are several problems, however, with stored procedures as they are typically used.  The first is that they still have a significant impedance mismatch with object-oriented programming.  The data structures they return tend to be fairly rigid, so adding a new column tends to require multiple changes in code, often at least one in each layer of the program.  You can't just define your object once and then re-use it over and over in other layers of your code.

A second significant problem is that stored procedures are at their best when they are a few well-connected and modularized queries, and at their worst when they are many different queries tied together in complex ways.  This leads to limited utility in actually encapsulating the data, and in all cases the abstraction layer is not really a good match for what is typically done with it.  For these reasons stored procedures as typically used make the most sense when working outside the object-oriented approach.

Stored procedures have largely been replaced by object-relational mapping tools (ORMs), which attempt to provide a mapping between a relational interface and an object-oriented development environment.  ORMs automate basic database operations for things like insert, select, update, and delete, but they don't really provide an abstraction covering the actual data mapping between the behavioral app layer and the information model layer.  This can currently only be done in the information model itself, so ORMs are best paired with updatable views, but this comes at the cost of portable SQL code.
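
As a sketch of what pairing an ORM with an updatable view involves on PostgreSQL (9.1 and later), reusing the active_customer view from the sketch above, an INSTEAD OF trigger maps writes against the view back onto the storage table:

CREATE FUNCTION active_customer_ins() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
    -- route the view insert to the underlying table
    INSERT INTO customer (name) VALUES (NEW.name);
    RETURN NEW;
END;
$$;

CREATE TRIGGER active_customer_ins
INSTEAD OF INSERT ON active_customer
FOR EACH ROW EXECUTE PROCEDURE active_customer_ins();

The ORM can then insert into active_customer as if it were a table, without knowing anything about the storage underneath.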

Aside from these approaches, or moving to NoSQL, there are a variety of methods to encapsulate the data store inside a relational application.  These approaches require understanding both the desire for encapsulation and interfaces, and the desire to integrate with applications as a service rather than as a relatively simple persistence layer manipulated mathematically.

Service Oriented Database Architecture

For most of the last five years, I have been building LedgerSMB using an approach I call "Service Oriented Database Architecture," or SODA, which is inspired in part by RESTful web services and SOAP.  From SOAP I took the emphasis on discoverability, and from REST I took, to the extent possible, the goal of re-using everything in the database that can be re-used in order to define an API.  This approach thus uses database semantics the way REST re-uses HTTP semantics, and while there are some differences forced by the way PostgreSQL does things (every function is called via a SELECT statement), this is not the end of the world.  The goal, of course, is to build database interfaces suitable for loosely coupled application/database combinations.

 The SODA approach is based on a number of principles, namely that:

  • Access to the database is through functions, not relations,
  • Functions, to the extent possible, always return a useful result, usually in a data structure corresponding to an object,
  • Function names (within the domain of this architecture) are unique,
  • Function argument names correspond to the object properties expected, and
  • The database is responsible for its own security.
 If these are followed then the functions can be mapped, discovered, and run at run-time.  Here is a PHP class that implements such a run-time mapping:

 class DBObject
{
    protected $schema = 'public';
   
   /* function __call($procname, $args)
    * Maps object properties into an arg array and calls call_procedure
    *
    * db procedures are checked for argument names and these are stripped of 
    * the "in_" prefix.  After this is complete, a property is matched and
    * mapped in.
    */
    public function __call($procname, $args = array()){
        # PHP's __call() receives any call arguments as an array; we accept
        # an optional first argument as the ORDER BY column
        $order = isset($args[0]) ? $args[0] : null;
        # variable initializations
        $procargs = array();

        # proc data lookup
        $procquery = "
            SELECT proname, pronargs, proargnames, proargtypes FROM pg_proc 
             WHERE proname = $1
               AND pronamespace = 
                   coalesce((SELECT oid FROM pg_namespace WHERE nspname = $2), 
                        pronamespace)";
         $db = DB::getObject();
         $sth = pg_query_params($db->dbhandle, $procquery, 
                               array($procname, $this->schema));
         $procdata = pg_fetch_assoc($sth);

         if (0 == pg_num_rows($sth)){
            throw new \Exception('Function not found');
         }
         # building argument list
         preg_match('/^{(.*)}$/', $procdata['proargnames'], $matches);
          $procargnames = str_getcsv($matches[1]);
         foreach ($procargnames as $argname){
              $argname = preg_replace('/^in_/', '', $argname);
              array_push($procargs, $this->$argname);
         }

         # calling call_procedure
         return $this->call_procedure($procname, $procargs, $order);
    }
    /* function call_procedure($procname, $args = array(), $order = null)
     *
     * generates a query in the form of:
     * select * from $procname($1, $2, etc) ORDER BY colname
     * and runs it.  It returns an array of associative arrays representing the
     * output.
     */
    public function call_procedure($procname, $args = array(), $order = null){
         $results = array();
         # query generation
         $query = "select * from "
                       . pg_escape_identifier($this->schema) . "." 
                       . pg_escape_identifier($procname) . "(";
         $count = 1;
         $first = 1;
         foreach ($args as $arg){
             if (!$first){
                 $query .= ", ";
             }
             $query .= '$' . $count;
             $first = 0;
             ++ $count;
         }
         $query .= ')';
         if ($order){
            $query .= " ORDER BY " . pg_escape_identifier($order);
         }


         # execution and returning results
         $db = DB::getObject();
         $sth = pg_query_params($db->dbhandle, $query, $args);
         if (!$sth){
             return null;
         }
         for ($i = 0; $i < pg_num_rows($sth); $i++){
              print "retrieving row $i \n";
              array_push($results, pg_fetch_assoc($sth, $i));
         }
         return $results;
    }
    /* function merge(array $data)
     * merges data into the current object from an associative array
     * 
     * null or undef values are not set
     */
    public function merge($data){
        foreach ($this as $prop => $value){
             if (array_key_exists($prop, $data) and null != $data[$prop]){
                 $this->$prop = $data[$prop];
             }
        }
    }
    /* function is_allowed_role($role)
     * returns true if the user is allowed the role for the specific db 
     * i.e. $role should not include the prefix.  Otherwise it returns false
     */
    public function is_allowed_role($role){
        $db = DB::getObject();
        return $db->is_allowed_role($role);
    }
}  


The above code seems long, but what it allows, essentially, is for inheriting objects to simply declare that methods are mapped to stored procedures, with the mappings resolved automatically at the time the stored procedure is actually called.  Additionally this centralizes essentially all db access in a single file, where it can be audited for SQL injection issues and the like, and you can go on programming as if you are hitting an object-oriented database.  Of course there are times when you need to make modifications on many layers, such as when a new attribute needs to be added and stored and it isn't in the table yet, but generally these are relatively rare.

In PHP, I can have a class which checks the version and selects the appropriate stored procedure easily even if they expect different object properties as arguments:




public function save(){
     $procname = 'company__save';
     if ('1.3' == \LedgerSMB\Config\DBVERSION){
         $procname = 'company_save';
     }
     $data = $this->$procname();
     $this->merge(array_pop($data));
}



 What might a stored procedure look like?  Here is one:


CREATE OR REPLACE FUNCTION asset_dep_straight_line_yr_d
(in_asset_ids int[],  in_report_date date, in_report_id int)
RETURNS bool AS
$$
     INSERT INTO asset_report_line (asset_id, report_id, amount, department_id,
                                   warehouse_id)
     SELECT ai.id, $3,
            asset_dep__straight_line_base(
                  ai.usable_life, -- years
                  ai.usable_life -
                  get_fractional_year(coalesce(max(report_date),
                                         start_depreciation,
                                         purchase_date),
                                       coalesce(start_depreciation,
                                         purchase_date)),
                  get_fractional_year(coalesce(max(report_date),
                                         start_depreciation,
                                         purchase_date),
                                $2),
                  purchase_value - salvage_value,
                  coalesce(sum(l.amount), 0)),
            ai.department_id, ai.location_id
       FROM asset_item ai
  LEFT JOIN asset_report_line l ON (l.asset_id = ai.id)
  LEFT JOIN asset_report r ON (l.report_id = r.id)
      WHERE ai.id = ANY($1)
   GROUP BY ai.id, ai.start_depreciation, ai.purchase_date, ai.purchase_value,
            ai.salvage_value, ai.department_id, ai.location_id, ai.usable_life;

    UPDATE asset_report SET report_class = 1 WHERE id = $3;

    select true;
$$ language sql; 

Unfortunately the above has to return true because the nature of the operation does not really provide another effective approach, though if we find one, the interface will be adjusted in the next major version upgrade.
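
Called from the application, it looks like any other query; the argument values here are of course made up:

-- depreciate assets 101 and 102 as of 2012-10-31 against report 7
SELECT asset_dep_straight_line_yr_d(ARRAY[101, 102], DATE '2012-10-31', 7);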

This approach is generally nice because it is light-weight and adapts relatively well to rapidly changing environments.  However, the lack of imposed structure may be a problem in some environments.  Where more engineering is required, the other approaches below may work better.  This one works relatively well, however, if you build your API to assume a relatively loose coupling between your database and the applications hitting it.

Object-Oriented Database-level interface


Where tighter coupling is required, an object-oriented interface may be better.  In some ways this is worth avoiding because it leads to very ugly SQL queries, for example:


SELECT (save).*
FROM save(row(null, '12345', 'My Company, LTD', 232, '33-55432334')::company);


The overall issue here is that you have the possibility of multiple levels of discoverability involved.  It works very well for code generators, but not so well for the humans maintaining the code.  Note the above could be rewritten, assuming no additional arguments, as:

 SELECT (save).*
   FROM (row(null, '12345', 'My Company, LTD', 232, '33-55432334')::company).save;


The advantage to this approach is that your objects form classes whose structure is discoverable, and overloading becomes possible.  Code generators thus can work well, as the database contains all the information needed to create all boilerplate code.  The database code itself is simplified as well.  On the other hand, troubleshooting can be a bit of a pain.  It also essentially requires the use of code generators to create libraries for interoperability, which closely ties the generated libraries to the interface created.

In-database ORM and JSON (In Javascript!)


One fascinating approach I came across recently, but have very little experience with, is xTuple's in-database ORM, which is largely written in pl/v8js stored procedures.  Yes, you read that right: the stored procedures are written in Javascript.  I would invite people to check out the code and see what they think.  It is a fascinating approach, and not one I have played around with yet, but it definitely shows how far encapsulation can be pushed within PostgreSQL.

Conclusions


Encapsulating an application inside the database is not something for which one must go to NoSQL.  RDBMSs which are strongly programmable are capable of doing this now, although perhaps few if any rival PostgreSQL's flexibility in this area.  The RDBMS can then be an 'information model server' which serves out information models as requested, each of which encapsulates further data within it.  The data model can then be consumed and expanded in the Model of an MVC framework, and the impedance mismatch issues can largely be eliminated by understanding and utilizing separation of concerns to one's advantage.

 Of course none of this is to disparage NoSQL.  These products have been successfully used quite often as adjuncts to traditional RDBMS's, either preprocessing or post-processing data for later use or re-use.  Some users of polyglot storage models have found rapid increases in development pace when these are used together, with data often being piped from something like Mongo or Neo4j into PostgreSQL after essentially using these engines to transform the data, or using it to transform the data on output.   This is particularly true when processing large amounts of data which is to be processed in relatively non-relational ways, such as with graph databases, array-native databases (in the sciences) etc.  The combination between these approaches becomes very rich.

But it does suggest that the days of utilizing the RDBMS as a dumb data store for a given application are numbered in many cases.  I don't see anything challenging the primacy of the RDBMS, but that doesn't mean there is no role for the other systems as well.  If you want a dumb persistence store, you can go with a NoSQL solution and be happier.  However, the overall approach of avoiding the mismatch by encapsulating data storage inside an application is as applicable to the RDBMS environment as to any other.

Monday, October 8, 2012

Three Approaches to Object-Relational Databases: PostgreSQL, Oracle, and Informix

PostgreSQL vs Informix

Probably the closest database to Postgres in object-relational terms is Informix.  Informix in fact got its object-relational approach with the purchase of Illustra, a Postgres fork.  Illustra, however, split from Postgres before the latter adopted SQL, and so the SQL implementations are entirely independent.

Informix has perhaps had the most influence of any database software on how the industry sees object-relational databases, and so, because of the shared heritage and Informix's place in the industry, it is the first comparison to make.

Table Methods

I was not able to find information on class.method notation in the Informix documentation.  As far as I can tell, Informix requires methods to be called using function(object) syntax.  This is along the lines of Stonebraker's image processing example.  In this way the connection between structure and function feels somewhat more bolted on than it does in PostgreSQL.  However, it would be a mistake to see this as the entirety of Informix's object-relational capabilities.

Inheritance

Informix supports single inheritance for both types and tables using the UNDER supertype syntax.  UNDER, similar to INHERITS in PostgreSQL, establishes an "is-a" relationship between the supertype and the subtype.  Unlike in PostgreSQL, indexes and referential integrity are inherited in Informix, meaning that foreign keys work properly in both directions.  Single inheritance is thus quite a bit more mature on Informix than on PostgreSQL, but the lack of multiple inheritance prevents composition of tables by in-lining type attributes (which is possible on PostgreSQL but not on Informix).

This shows that Informix has a different approach to table inheritance: there is a simple use case which it supports very well and has polished thoroughly, but more complex use cases are beyond it.  In PostgreSQL, on the other hand, declarative referential integrity doesn't work across inheritance trees, and thus referential integrity requires writing your own constraint triggers.
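
A short sketch of the limitation (names assumed):

CREATE TABLE person (id int PRIMARY KEY, name text);
CREATE TABLE employee (salary numeric) INHERITS (person);

-- badge.person_id is checked only against rows physically in person;
-- a row inserted into employee will not satisfy this foreign key:
CREATE TABLE badge (id int PRIMARY KEY, person_id int REFERENCES person (id));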

Return Results

A select * from parent_table in Informix returns all attributes of all rows from the parent table and all descendant tables.  This can lead to a situation where the result set is "jagged" (i.e. where rows have different numbers of elements) because child tables add additional columns.  In this case, it is necessary to check the row definition when receiving each row.

One way to look at it is that both PostgreSQL and Informix return a set of objects, but PostgreSQL coerces them into the type asked for, while Informix returns them as they are.  Thus if you select * from person, and employee inherits person, then in Informix you automatically get all the employee information as well.

This feature is, as far as I know, unique to Informix.  I know of no other ORDBMS that allows a result set to be jagged in this way, but it highlights one important problem that arises when one starts merging object-oriented and relational approaches: what does select * mean in this case?  It isn't self-evident, and therefore Informix and PostgreSQL take different approaches.

Alternative Approaches

In the absence of a system of composing tables by in-lining types, the way multiple inheritance works in PostgreSQL, table composition in Informix requires using row types as columns.  The use of complex types in this way is much more mature in Informix than in PostgreSQL (which began to support it only recently).

Consequently composition is done primarily through member types, but this evokes an important tradeoff between ease of querying in a relational sense and rich modelling.
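
In PostgreSQL the equivalent composition through a row-type column looks like this (type and table invented for illustration):

CREATE TYPE address AS (street text, city text, country text);
CREATE TABLE employee (id int PRIMARY KEY, name text, home_address address);

-- note the parentheses required to drill into the composite column
SELECT e.name, (e.home_address).city FROM employee e;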

I personally find the SQL required to make elegant use of columns as types in this way somewhat ugly, but I recognize this is a matter of personal taste.  Still, consider the following:

SELECT print(e.name), mailing_label(e.address) from employees e;

If address and name are both complex types then whatever we are going to do is going to be a bit ugly.  There isn't much we can do about that.

PostgreSQL and Informix vs Oracle

Oracle in some ways shows some influence from Informix but it takes the approach in a somewhat different direction.  Oracle makes no bones about being a relational database first and foremost and has adopted an approach which avoids, rather than answers, questions of proper behavior.  Oracle Objects in some ways resemble Informix's approach and in some ways PostgreSQL's, but they are in general more different than the same.

Tables and Types 

Oracle Objects approaches the question of object-to-relation equivalence by allowing types to be inherited while tables cannot be.  Tables can copy type structures for their schemas, however.  Thus one generally has to create an object model first and then create tables to store those objects.  This approach sets up very different patterns and antipatterns than one would get in Informix or PostgreSQL, where tables themselves can inherit.  On one hand this (forcibly) separates data-holding tables from their abstract parents.  On the other, it makes the model harder to work with, except where complex types are used in columns of a table.

It is worth noting that types support methods in Oracle, and this simplifies things greatly.  However, I am left wondering why one would use Oracle Objects instead of virtual columns, and just treat Oracle as a relational database management system with little use of object extensions.

Object Composition Approaches

In Oracle the only approach that works is to use columns to store complex types.  Fortunately those types can have methods, so the SQL is not as ugly as it would be on Informix.  You can:

select e.name.print(), e.address.mailing_label() from employees e;

This strikes me as a bit more readable.

It seems to me that Oracle Objects have two main uses.  The first is in the logical model of the database, although this role can be taken over by ORMs to some extent.  The second and more attractive use is not their advertised purpose of modelling things like customers or invoices, but rather the creation of intelligent datatypes for columns.

For example, if I want to store images and dynamically determine what type of image we are looking for, I could still do something like:

SELECT id FROM slides s
 WHERE s.picture.is_of('sunset');

This is far cleaner SQL-wise than the equivalent syntax in Informix or PostgreSQL:

SELECT id FROM slides s
 WHERE is_of(s.picture, 'sunset'); 

This allows the same sorts of problems to be addressed as Stonebraker talks about, and it allows structure and logic to be clearly tied together, at the expense of substitutability as is found in table inheritance in both Informix and PostgreSQL.

The complexity costs are likely to be worth it in Oracle in a very different set of cases than would be the case in PostgreSQL or Informix.  These cases intersect at queries which must return data based on very complex criteria which cannot be reduced to relational algebra and predicate calculus.

Final Thoughts

The above discusses three different approaches to the question of how to encapsulate complex data into a database which may need to query based on arbitrary criteria not readily reducible to relational algebra and predicate calculus.

PostgreSQL and Informix both tightly integrate object handling far more deeply than Oracle.  Informix seems more polished in the areas it supports, but PostgreSQL has a broader support for the overall idea.

Oracle's approach is essentially to move object handling capabilities to the column level for the most part.  Yes, you can create tables of objects, but you cannot inherit them directly, and you cannot compose your object model using multiple parents without moving everything to the column level.  This makes object behavior mostly useful in the creation of "smart column types."

Each of these three ORDBMS's takes its own approach.  All three allow the development of very intelligent database models.  All three pose different complexity issues.

Wednesday, October 3, 2012

Faq: Why is LedgerSMB PostgreSQL-only?

We get asked a lot: why is LedgerSMB PostgreSQL-only?  Why not run on MySQL?  After all, since 5.0, MySQL has offered relatively decent type constraints and a database that works sufficiently well to build ACID-compliant applications.  After all, this line of reasoning goes, we'd get more market share by being able to run on a larger number of hosting providers.

This post discusses why we made the decision to go with Postgres and decline to support any other databases.  I still believe it has been the correct decision for us but it might not be the right decision for other projects.  I hope though that writing this up will help other projects weigh their options and choose appropriately.

The short version is that we decided to favor an application with many avenues for integration over an application that would run on any platform.  We chose to give our users as rich an experience, and as rich a set of opportunities for integrating this software with their business, as possible, rather than to chase as many potential customers as possible.  If the software works wonders, people will make room for it.

Our Goal

In our case, our goal has generally been to be a major part of the business infrastructure of our users.  We want to create a flexible framework which supports extension and repurposing, allowing businesses to find new uses for what we are doing.

Additionally, early on our decision was shaped by the desire to hook LedgerSMB into reporting tools written in other languages, and we generalized this to tools generally.  This has been a relatively safe bet, and with additional compatibility classes, auto-detection of db-level APIs is possible.

Finally we started as a fork of a PostgreSQL-only application and so the choice was whether to deepen our commitment to PostgreSQL only or whether to move towards portability.

Considerations

Our overall approach  was based on the idea that financial and ERP systems are best when they are open to integration, and that ERP frameworks need to be open to extension as well.  After all, ERP software typically runs a wide variety of functions in a business, and these are often connected to yet more workflows where custom apps may be developed.  If those apps can be developed against LedgerSMB this delivers greater value to the customer.

In other words, we see ERP as much about development platform and tools as it is about application functionality and flexibility in adding features is key.

Our Decision

Our decision was to solidify our usage of PostgreSQL, use it to add additional security and tune performance.  We quickly made the decision to move much of the logic into stored procedures in order to support tools written in other languages as well.

One important benefit which helped push us this way was the fact that a stored procedure interface could be centrally controlled and checked against SQL injection while ad hoc query interfaces required careful auditing.

Our Approach and Its Challenges

Our current approach is to name arguments such that they can be mapped in.  This works reasonably well with a single application; it is simple and requires relatively straightforward queries.  However, it is currently limited in two ways: namespacing, and the fact that the interface is very heavily procedural, which makes maintenance a bit of a pain.  It also means that class attributes must be manually maintained across integration classes, which is sub-optimal.

We are currently exploring the possibility of moving to a more object-oriented interface in the db which would solve the above issues.  The interface would be dynamically discovered by querying system catalogs or views we would create, with the results cached by the application.  This would allow dynamic API discovery, and classes could export the db-level API to other systems in a purely db-centric way.  It would hence be expected to cut our maintenance effort significantly.
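
The discovery step itself is plain SQL against the system catalogs; a minimal sketch of the sort of query involved:

-- discover callable interfaces: function names plus argument names and types
SELECT p.proname, p.proargnames, pg_get_function_arguments(p.oid) AS arglist
  FROM pg_proc p
  JOIN pg_namespace n ON n.oid = p.pronamespace
 WHERE n.nspname = 'public';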

A major challenge here, however, has to do with the quality of stored procedures, which currently varies quite a bit.  In general, one of the problems one faces is that application developers don't always make good stored procedure developers, because the sorts of thinking required are rather different.  However, over time we expect to be able to fix this issue.

Another key part of our strategy is that of layered APIs.  Not only are we building discoverable db-level APIs, but also RESTful web services on top of them.

Final Words

LedgerSMB chose a db-centric model because we want to support a wide range of applications written in a wide range of languages, running in a wide range of environments.  This is very specific to our approach, and it is not without challenges.   However it has already had some benefits in our PHP compatibility classes and these are expected to multiply in the future.  In the end we choose a db for many apps over an app for many db's.

Monday, October 1, 2012

Towards a simple explanation of Object-Relational Database Management Systems

A relational database management system is a piece of software which takes organized information and processes it in order to answer questions presented by other software.

In a typical relational database management system, data is arranged into tables, each of which is similar to a worksheet in an Excel workbook.  Information in these tables is matched, and processed, and returned to the user.  A skilled database architect can arrange the data so that the structure (called the schema) is easily maintained and extended, so new types of information can be stored as the need arises.

Object-relational database management systems take this system and expand it, allowing for more complex data to be stored in each column and for a wider range of calculations to be attached to the table.   The combination of more complex data and complex calculations allows one to build advanced databases that can do more than one could before.

These systems are called object-relational because they are an attempt to expand what relational database management systems can do by borrowing some ideas from object-oriented programming.  The two disciplines remain far apart, however: object-oriented programming aims to model behavior encapsulated behind interfaces, while object-relational database design seeks to extend relational math to model derived information as well as what is minimally known.