Thursday, August 30, 2012

O/R Modelling 3.2: Set/Subset Modelling with Inheritance

Despite the fact that inheritance is usually not a win for set/subset modelling, sometimes, in cases of more complex logic, it can be.  David Fetter has come up with an example where he tries to solve the problem of single parent/multiple child inheritance in a purely relational model.

In general when you get into constraints this complex, the question is no longer how to avoid complexity but how to manage it.  There are several ways in which refactoring the tables may be helpful, and table inheritance allows us to do just that (no, we don't have to lose foreign key management capabilities to do so).  This allows us to look at set/subset management differently.

Of course how we go about managing this complexity may still be a matter of taste and of the exact circumstances involved (the same solution may not be optimal everywhere).  One person may prefer lots of NULLS and complex constraints.  One person may prefer few NULLs, no inheritance, and lots of triggers.  Someone else may prefer a richer toolkit with the pitfalls that provides.  Managing complexity is more an art than a science.  In general though, we don't blame the paintbrush when the artist doesn't know how to use it.

In my experience, using NULLs to indicate type also makes it quite possible to end up with misbehavior that takes some time to troubleshoot, so I don't typically like this approach.  Again, though, I recognize that the complexity issues remain regardless of the solution.

In general the approach here takes the following steps:

  1. Decompose the table in a relational model
  2. Look for common interfaces
  3. Inherit those interfaces
Fetter's Table Structure:

I am skipping over the check constraints and unique constraints which are relatively complex here.  You should read the full post yourself to see what complexity we are managing.  However for refactoring purposes, I can't really avoid quoting his table field structure:

CREATE TABLE payment (
    payment_id BIGSERIAL PRIMARY KEY,
    payer_id INTEGER NOT NULL REFERENCES payer,
    payment_date DATE NOT NULL,
    amount NUMERIC NOT NULL CHECK(amount > 0),
    currency_id INTEGER NOT NULL REFERENCES currency,
    /* Check */
    check_number TEXT,
    check_date DATE,
    /* Credit Card */
    cc_id INTEGER REFERENCES locked_down_tight_cc_table,
    /* You wouldn't store an actual CC number here, would you?!? */
    expiry_date DATE,
        CHECK(expiry_date = DATE_TRUNC('month', expiry_date)),
    /* Manila Envelopes are pretty anonymous */
    is_manila BOOLEAN NOT NULL DEFAULT FALSE,
    -- check constraints then follow which are quite complex
    -- and partial unique constraints are added after the fact
); 
/* comments by David Fetter */
-- Comments added by Chris Travers 

Step 1:  Relational Refactoring

We can then factor this out into the following tables:
  • payment_class (new), added to categorize payments.
    • Includes an extended storage column to indicate that storage is required for a type.
  • payment (main details go here), adding payment_class_id field
  • payment_check stores check-specific fields
    • payment_class_id set to 1
  • payment_cards stores card-specific fields
    • payment_class_id set to 2

We would then have to create a custom constraint trigger for managing referential integrity between payment and the payment_check and payment_cards tables.  Note that standard foreign keys work in only one direction, so this isn't too bad.

Step 2:  Common Interfaces:

Every payment extension table requires payment_class_id and payment_id fields, which together act as a foreign key against the payment table.  This provides an interface which can go both ways and simplifies the joins.  This should then be our base table.

Step 3:  Completed redesign

Note that some foreign keys are omitted in order to allow the table to be created in this demo environment.

CREATE TABLE payment_ref (
    payment_id bigint not null,
    payment_class_id int
);

CREATE TABLE payment_class (
    id int not null unique, -- manually assigned, few
    label text primary key,
    extended_storage bool not null    
);

INSERT INTO payment_class (id, label, extended_storage)
VALUES (1, 'check', true),
       (2, 'cards', true),
       (3, 'cash', false);

CREATE TABLE payment (
    payment_id bigserial PRIMARY KEY,
    payment_class_id int NOT NULL references payment_class(id),
    -- above two columns will be primary key
    payer_id int not null, 
    -- foreign key on payer_id omitted due to demo here
    amount numeric not null check (amount > 0),
    currency_id int not null,
    -- foreign key on currency_id omitted due to demo here
    UNIQUE (payment_id, payment_class_id) -- secondary key
);

CREATE TABLE payment_check (
    check_date date not null,
    check_number text not null,
    CHECK (payment_class_id = 1),
    FOREIGN KEY (payment_class_id, payment_id) 
        REFERENCES payment (payment_class_id, payment_id)
        DEFERRABLE INITIALLY IMMEDIATE
) INHERITS (payment_ref);

CREATE TABLE payment_cards (
    cc_info int not null, 
    -- foreign key omitted so table can be created in demo environment
    CHECK (payment_class_id = 2),
    FOREIGN KEY (payment_class_id, payment_id) 
        REFERENCES payment (payment_class_id, payment_id)
        DEFERRABLE INITIALLY IMMEDIATE
) INHERITS (payment_ref);

All that is left is to create the constraint triggers and we are done.  Note inheritance here buys us a consistent interface for joins, plus a consistent interface for the constraint triggers.
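
To make this concrete, here is one way such a constraint trigger might be written.  This is a minimal sketch only: the function and trigger names are mine, and enforcing the check from the payment side with a single deferred constraint trigger is an assumption about how one might do it, not part of the original design.

CREATE OR REPLACE FUNCTION payment_extension_check() RETURNS trigger
LANGUAGE plpgsql AS
$$
BEGIN
    -- only classes flagged as needing extended storage require an extension row
    IF EXISTS (SELECT 1 FROM payment_class
                WHERE id = NEW.payment_class_id AND extended_storage)
       AND NOT EXISTS (SELECT 1 FROM payment_ref
                        WHERE payment_id = NEW.payment_id
                          AND payment_class_id = NEW.payment_class_id)
    THEN
        RAISE EXCEPTION 'payment % has no extended storage row', NEW.payment_id;
    END IF;
    RETURN NEW;
END;
$$;

CREATE CONSTRAINT TRIGGER payment_extension_check
AFTER INSERT OR UPDATE ON payment
DEFERRABLE INITIALLY DEFERRED
FOR EACH ROW EXECUTE PROCEDURE payment_extension_check();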

In essence we are using table inheritance here as a way to partition additional information to be joined to a table, where foreign keys against the total partition set are relatively uninteresting.  Note, however, that one could define an additional unique index (on payment_class (id, extended_storage)) which would allow effective foreign keys to be added to the inheritance tree.  The above approach, once constraint triggers are added (and tested), would in my experience be easier to understand and maintain than one using NULLs to indicate type, and would have more modular and maintainable check constraints.

Now the constraint triggers will still have to be created and tested and this adds significant complexity.  This may be a win where we need to maintain extensibility here, for example if we want to be open to adding gift certificates as forms of payment, or international money orders, etc.  On the other hand if we do not have to preserve this, then the inheritance-based approach adds needless complexity regarding foreign keys.

If the relational model is good enough (namely because of the lack of a need for extensibility in payment types) one can simply add an id field to each sub-table, a deferrable foreign key constraint (initially deferred) from payment to the other tables, and check constraints on whether these are NULL based on payment type.  In that case, inheritance just gets us enforcement for identical semantics for foreign keys, and a virtual catalog of payment references.

On the other hand if the purely relational model (fkeys against physical storage tables directly) is not good enough, we can build a system in which constraint triggers enforce foreign keys against an object catalog (payment_ref in this case).

The Moral of the Story

Set/Subset modelling with table partitioning does have some uses, but maintainability requires that the tables be relationally factored as much as possible first, and inherit no more than necessary.  However, sparingly used, inheritance can facilitate (rather than replace!) good relational design.  This works best when one looks at inheritance as a way of doing table partitioning with additional fields per partition.

With 9.2, the addition of NO INHERIT check constraints will allow the top level partition to be used, but before that the top level needs to be kept clear of entries.  In general, however, I am not convinced that having the top level partition be used in this case is actually a good idea.  As soon as you do this, a vanilla SELECT and a foreign key will suddenly have very different behavior.  Semantic clarity counsels for using NO INHERIT constraints to forbid all inserts into parent nodes.
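
For example, on 9.2 the parent could be locked down with something like the following (the constraint name here is just illustrative):

ALTER TABLE payment_ref
  ADD CONSTRAINT payment_ref_no_rows CHECK (false) NO INHERIT;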

Is the complexity tradeoff worth it though?  You be the judge.

Monday, August 27, 2012

PostgreSQL O/R Modelling Part 3: Table Inheritance and O/R Modelling in PostgreSQL

Note:  PostgreSQL 9.2 will allow constraints which are not inherited.  This will significantly  impact the use of inheritance, and allow for real set/subset modelling with records present in both parent and child tables.  This is a major step forward in terms of table inheritance.

PostgreSQL allows tables, but not views or complex types, to inherit table structures.  This allows for a number of additional options when doing certain types of modelling.  In this case, we will build on our inventory_item type by adding attached notes.  However we may have notes attached to other tables too, and we want to make sure that each note can be attached to only one table.

This is exactly how we use table inheritance in LedgerSMB.

Before we begin, however it is worth looking at exactly what is inherited by a child table and what is not.

The following are inherited:
  • Basic column definitions (including whether a column is nullable)
  • Column default values
  • CHECK constraints (cannot be overridden in child tables)
  • Descendants can be implicitly cast to ancestors, and
  • Table methods are inherited 
The following are not inherited:
  • Indexes
  • Unique constraints
  • Primary Keys
  • Foreign keys
  • Rules and Triggers
The Foreign Key Problems and How to Solve Them

There are two basic foreign key problems when using table inheritance on PostgreSQL. The first is that foreign keys themselves are not inherited, and the second is that a foreign key may only target a specific relation, and so rows in child tables are not valid foreign key targets.  These problems together have the same solution, which is that where these are not distinct tables for foreign key purposes, the tables should be factored in such a way as to break this information out, and failing that, a trigger-maintained materialized view will go a long ways towards addressing all of the key management issues.

For example, for files, we might place the id and class information in a separate table and then reference these on child relations.  This gives a valid foreign key target for the set of all inherited tuples.  Foreign keys can also be moved to this second table, allowing them to be centrally managed as well, though it also means that JOIN operations must include the second table.  A second option might be to omit class and use the child table's tableoid, perhaps renaming it in the materialized view as table_id.
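
As a rough illustration of that factoring (the table and column names here are hypothetical, not taken from any real schema):

CREATE TABLE file_catalog (        -- valid foreign key target for all files
    id serial primary key,
    file_class int not null,
    unique (id, file_class)
);

CREATE TABLE file_base (           -- inherited by the concrete file tables
    id int not null,
    file_class int not null,
    file_name text not null
);

CREATE TABLE file_attachment (
    CHECK (file_class = 1),
    FOREIGN KEY (id, file_class) REFERENCES file_catalog (id, file_class)
) INHERITS (file_base);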

The Primary Key Problem and How to Solve It

Similarly, because indexes are not inherited, we cannot ensure that a specific combination of column values is unique across an inheritance hierarchy.

The most obvious solution is to treat the combination of the natural primary key and the tableoid as a primary key for modelling purposes.   However, this is only really useful when the primary key is composite, with mutually exclusive constraints on child tables.  In other words, partitioning is necessary for sane set/subset modelling using table inheritance.

Where Not to Use Table Inheritance


Most of the table inheritance documentation suggests that the primary use of table inheritance is set/subset modelling, where a subset of a set has its own extended properties.  In practical use, however, a purely relational solution is usually cleaner for this specific sort of case and results in fewer key management issues (and therefore less overall complexity).

For set/subset modelling, a much cleaner solution is to use composite foreign keys.

For example, revising the example from the manual for table inheritance, a cleaner way to address the issue where you have cities and capitals is to do something like:

CREATE TABLE cities (
    name text primary key,
    altitude float, 
    population int,
    is_capital bool not null,
    unique(is_capital, name)
);

The unique constraint does nothing to the actual table.  If "name" is guaranteed to be unique, then ("is_capital", "name") is guaranteed to be unique.  What it does, however, is designate that as a secondary key.

You can then create a table of:

CREATE TABLE capitals (
    name text primary key,
    is_capital bool not null,
    state text not null,
    foreign key (name, is_capital) 
        references cities (name, is_capital)
        DEFERRABLE INITIALLY DEFERRED
);


This approach, where subset membership is part of a secondary key, allows for greater control regarding set/subset modelling.  You can then create a trigger on cities which checks whether the city exists in capitals (enforcing an insert flow of "insert into capitals, then insert into cities, then commit").
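
A minimal sketch of such a trigger might look like this (the names are mine, and a deferred constraint trigger is used so both inserts can happen in the same transaction):

CREATE OR REPLACE FUNCTION cities_capital_check() RETURNS trigger
LANGUAGE plpgsql AS
$$
BEGIN
    IF NEW.is_capital
       AND NOT EXISTS (SELECT 1 FROM capitals WHERE name = NEW.name)
    THEN
        RAISE EXCEPTION 'city % is flagged as a capital but has no capitals row',
                        NEW.name;
    END IF;
    RETURN NEW;
END;
$$;

CREATE CONSTRAINT TRIGGER cities_capital_check
AFTER INSERT OR UPDATE ON cities
DEFERRABLE INITIALLY DEFERRED
FOR EACH ROW EXECUTE PROCEDURE cities_capital_check();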

No primary or foreign key problems occur with this solution to set/subset models.   With multiple subsets you will likely wish to generalize and use a "city_class" field of type int referencing a city_class table so you can control subsets more broadly.  However in general, table inheritance is a poor match for this problem because of foreign key constraint issues.  Primary key management issues are less of an issue however (and becoming significantly easier to solve).


The one use case for set/subset modelling using inheritance is table partitioning.  It has all the problems above, and enforcing foreign keys between two partitioned tables can become very complex, very fast unless materialized views are used to proxy the enforcement.


Table Structure:  Purely Relational Approach

CREATE SCHEMA rel_examples;
CREATE TABLE rel_examples.note (
   note text not null,
   subject text not null,
   created_by int not null,
   note_class int not null,
   created_at timestamp not null,
   id int unique,
   primary key (id, note_class)
);

CREATE TABLE rel_examples.note_to_inventory_item (
   note_id int not null,
   note_class int not null check (note_class = 1),
   item_id int references inventory_item (id),
   foreign key (note_id, note_class) 
      references rel_examples.note (id, note_class)
);

CREATE TABLE rel_examples.note_to_......;

These would then be queried using something like:

SELECT n.* FROM rel_examples.note n
  JOIN rel_examples.note_to_inventory_item n2i 
       ON n2i.note_id = n.id AND n2i.note_class = n.note_class
  JOIN inventory_item i ON i.id = n2i.item_id
 WHERE i.sku = 'TEST123';

This doesn't preclude other Object-Relational approaches being used here.  I could of course add a tsvector method to make full text searches easier and apply that to note, and it would work just as well.  However, the single global table doesn't fit the data model very well, and since we are only looking at pulling notes by attachment, some performance can be gained by breaking the table up into more workable chunks which can be queried independently.

Additionally the global note table is a maintenance  nightmare.  Enforcing business rules in that case is very difficult.

Table Structure:  Object-Relational Approach

An Object-Relational approach starts off looking somewhat different:

CREATE TABLE note (
    id serial not null unique,
    note text not null,
    subject text not null,
    note_class int not null,
    created_by int not null,
    ref_key int not null,
    created_at timestamp not null
);

We'd then probably add a trigger to prevent this table from receiving rows.  Note that we cannot use a check constraint for this purpose because it will be inherited by every child table.  Typically we'd use a trigger instead of a rule because that way we can easily raise an exception.
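
Such a trigger might look something like this (a minimal sketch; the names and the error message are mine):

CREATE OR REPLACE FUNCTION note_no_direct_rows() RETURNS trigger
LANGUAGE plpgsql AS
$$
BEGIN
    RAISE EXCEPTION 'note is an abstract table; insert into a child table instead';
END;
$$;

CREATE TRIGGER note_no_direct_rows
BEFORE INSERT ON note
FOR EACH ROW EXECUTE PROCEDURE note_no_direct_rows();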

Then we'd

CREATE TABLE inventory_note (
    CHECK (note_class = 1),
    PRIMARY KEY (id, note_class),
    FOREIGN KEY (ref_key) references inventory_item(id)
) INHERITS (note);

Suppose we also create a tsvector method as such:

CREATE OR REPLACE FUNCTION tsvector(note) RETURNS tsvector
LANGUAGE SQL AS $$
SELECT tsvector($1.subject || ' ' || $1.note);
$$;

Now in my example database, I have an inventory_item with an id of 2, so I might:

INSERT INTO inventory_note
     (subject, note, note_class, created_by, created_at, ref_key)
VALUES
     ('Testing Notes', 'This is a test of the note system', 1, 1, now(), 2);

We know this worked:

or_examples=# select * from note;
 id |               note                |    subject    | note_class | created_by | ref_key |         created_at         
----+-----------------------------------+---------------+------------+------------+---------+----------------------------
  1 | This is a test of the note system | Testing Notes |          1 |          1 |       2 | 2012-08-22 01:17:27.807437
(1 row)

Note above that I selected from the parent table and child tables were pulled in automatically.  If I want to exclude these I can use SELECT * FROM ONLY note and none of these will be pulled in.

We can also note that the method is inherited:

or_examples=# select n.tsvector from inventory_note n;
                              tsvector                              
---------------------------------------------------------------------
 'Notes' 'Testing' 'This' 'a' 'is' 'note' 'of' 'system' 'test' 'the'
(1 row)

The tables could then be queried with one of two equivalent approaches:

SELECT n.* FROM note n
  JOIN inventory_item i ON n.ref_key = i.id AND n.note_class = 1
 WHERE i.sku  = 'TEST123';

or alternatively:

SELECT n.* FROM inventory_note n
  JOIN inventory_item i ON n.ref_key = i.id
 WHERE i.sku = 'TEST123';

Both queries return the same results.

Advantages of the Object-Relational Approach


The object-relational model allows us to be sure that every note is attached to something, which we cannot do gracefully in the relational model, and it allows us to ensure that two notes attached to different items are in fact different.  It also allows for simpler SQL.

On the performance side, it is worth noting that in both the query cases above, PostgreSQL will only query child tables where the constraints could be met by the parameters of the query, so if we have a dozen child tables, each of which may be very large, we only query the table we are interested in.  In other words, inherited tables are essentially naturally partitioned (and in fact people use inheritance to do table partitioning in PostgreSQL).

Multiple Inheritance

Multiple inheritance is supported, but if used care must be taken that identically named fields are actually used in both classes in compatible ways.  If they are not, then problems occur.  Multiple inheritance can however be used safely both for interface development (when emulating inheritance in composite types, see below), and where only one of the inheriting relations contains actual columns (the other might contain constraints).

Thus we could still safely do something like:

CREATE TABLE note (
    note text,
    subject text,
    ....
);

CREATE TABLE joins_inventory_item (
   inventory_id int
);

CREATE TABLE inventory_note (
    FOREIGN KEY (inventory_id) REFERENCES inventory_item(id),
    ... -- partial primary key constraints, and primary key def
) inherits (note, joins_inventory_item);

This could then be used to enforce consistency in interface design, ensuring more readable queries and the like, but it doesn't stop there.  We can use joins_inventory_item as the vehicle for following a reference.  So:

CREATE OR REPLACE FUNCTION inventory_item
(joins_inventory_item)
RETURNS inventory_item
LANGUAGE SQL AS
$$ SELECT * FROM inventory_item WHERE id = $1.inventory_id $$; 

In this case, then you can use inventory_item as a virtual pointer to the inventory item being joined as such:

CREATE TABLE inventory_barcode (
    barcode text,
    FOREIGN KEY (inventory_id) REFERENCES inventory_item(id)
) INHERITS (joins_inventory_item);

The uses for multiple inheritance however don't end there.  Multiple inheritance gives you the ability to define your database in re-usable chunks with logic following that chunk.    Alternative presentation of sets of columns then becomes manageable in a way it is not with complex types.

A More Complex Example:  Landmarks, Countries, and Notes

Another example may be where we are storing landmarks, which are attached to countries.  Our basic table schema may be:

CREATE TABLE country (
    id serial not null unique,
    name text primary key,
    short_name varchar(2) not null unique
);

CREATE TABLE country_ref (
    country_id int
);

Our traversal function:

CREATE FUNCTION country(country_ref) RETURNS country
LANGUAGE SQL STABLE AS
$BODY$ SELECT * FROM country WHERE id = $1.country_id $BODY$;

Next, the notes:

CREATE TABLE note_fields (
    note_subject text,
    note_content text
);

Like country_ref, we probably will never query note_fields themselves.  There is very little of value in querying only contents like this.  However we can add a tsvector derived value:

CREATE FUNCTION note_tsvector(note_fields)
RETURNS tsvector IMMUTABLE
LANGUAGE SQL AS $BODY$

SELECT to_tsvector('english', coalesce($1.note_subject, '') || ' ' ||
coalesce($1.note_content, ''));

$BODY$;

Now, the function name is prefixed with note_ in order to avoid conflicts with other similar functions.  This would allow multiple searchable fields to be combined in a table and then mixed by inheriting classes.

CREATE TABLE landmark (
    id serial not null unique,
    name text primary key,
    nearest_city text not null,
    foreign key (country_id) references country(id),
    CHECK (country_id IS NOT NULL),
    CHECK (note_content IS NOT NULL)
) INHERITS (note_fields, country_ref);

The actual table data will not display too well here, but anyway here are a few slices of it:

or_examples=# select id, name, nearest_city, country_id from landmark;
 id |        name        | nearest_city  | country_id
----+--------------------+---------------+------------
  1 | Eiffel Tower       | Paris         |          1
  2 | Borobudur          | Jogjakarta    |          2
  3 | CN Tower           | Toronto       |          3
  4 | Golden Gate Bridge | San Francisco |          4
(4 rows)

and the rest of the table:

or_examples=# select id, name, note_content from landmark;
 id |        name        |              note_content             
----+--------------------+----------------------------------------
  1 | Eiffel Tower       | Designed by a great bridge builder
  2 | Borobudur          | Largest Buddhist monument in the world
  3 | CN Tower           | Major Toronto Landmark
  4 | Golden Gate Bridge | Iconic suspension bridge
(4 rows)

The note_subject field is null on all records.

or_examples=# SELECT name, (l.country).name as country_name FROM landmark l;
        name        | country_name
--------------------+---------------
 Eiffel Tower       | France
 Borobudur          | Indonesia
 CN Tower           | Canada
 Golden Gate Bridge | United States
(4 rows)

Demonstrating the note_tsvector interface.  Note that this could be improved upon by creating different tsvectors for different languages.

or_examples=# select name, note_content from landmark l where
plainto_tsquery('english', 'bridge') @@ l.note_tsvector;
        name        |            note_content
--------------------+------------------------------------
 Eiffel Tower       | Designed by a great bridge builder
 Golden Gate Bridge | Iconic suspension bridge
(2 rows)

So the above example shows how the interfaces offered by two different parent tables can be invoked by a child table.  This sort of approach avoids a lot of the problems that come with storing composite types in columns (see upcoming posts) because the complex types are stored inline inside the table.


Base Tables as Catalogs

The base tables in these designs are not generally useful to query for data.  However they can simplify certain other types of operations, both manual and data-integrity related.   The uses for data integrity will be the subject of a future post.  Here we will focus on manual tasks where this sort of inheritance can help.


Suppose we use the above note structure in a program which manages customer contacts, and one individual discovers inappropriate language in some of the notes entered by another worker.  In this case it may be helpful to try to determine the scope of the problem.  Suppose the note includes language like "I told the customer to stfu."  We might want to:


SELECT *, tableoid::regclass::text as table_name
  FROM note_fields nf 
 WHERE nf.note_tsvector @@ to_tsquery([pattern]);


In this case we may be trying to determine whether to fire the offending employee, or we may have fired the employee and be trying to figure out what sort of damage control is necessary.


A test query against the current data set shows what this returns:


or_examples=# SELECT *, tableoid::regclass::text as table_name
  FROM note_fields nf
 WHERE nf.note_tsvector @@ to_tsquery('bridge');
 note_subject |            note_content            | table_name
--------------+------------------------------------+------------
              | Designed by a great bridge builder | landmark
              | Iconic suspension bridge           | landmark
(2 rows)


This sort of interface has obvious uses when trying to do set/subset modelling where the interface is inherited.  Careful design is required to make this perform adequately, however.  In the case above, we have to do a somewhat deep inspection of all the tables, but we don't necessarily require immediate answers.


In general, this sort of approach makes it somewhat difficult to store records in both parent and child tables without reducing the semantic clarity of those tables.  In this regard, the parent acts like a catalog of all children.  For this reason I remain sceptical whether, even with NO INHERIT check constraints, it will be a good idea to insert records in both parent and child tables.  NO INHERIT check constraints do, however, finally provide a useful tool for enforcing this restriction.


Contrasts with DB2, Informix, and Oracle

DB2 and Oracle have largely adopted Informix's approach to table inheritance, which supports single table inheritance only and therefore requires that complex types be stored in columns.  With multiple inheritance, if we are careful about column naming, we can actually in-line multiple complex types in the relation.  This provides at once both a more relational interface and one which admits of better object design.

Emulating Inheritance in Composite Types

Composite types do not allow inheritance in PostgreSQL, and neither do views.  In general, I would consider mixing inheritance and views to be dangerous and so would urge folks not to try to get around this limitation (which may be possible using RULEs and inherited tables).

One basic way to address this is to create a type schema and inherit tables from a table called something like is_abstract as follows:

CREATE TABLE is_abstract (CHECK (false));

No table that inherits is_abstract will ever be allowed to store rows.  Check constraints are only enforced when data is stored, so a constraint which by definition always fails denies use of the table for storing data, but it does not prevent the relation from being used as a class which can be instantiated from data stored elsewhere.

Then we can do something like

CREATE TABLE types.my_table (
    id int,
    content text
) INHERITS (is_abstract);

Once this is done, types.my_table will never be able to store rows.  You can then inherit from it, and use it to build a logical data model.  If you want to change this later and actually store data, this can be done using alter table statements, first breaking the inheritance, and second dropping the is_abstract_check constraint (see the sketch below).  Once these two are done, the inheritance tree can be used to store information.
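
Assuming the inherited constraint kept its default name, the conversion would look something like:

ALTER TABLE types.my_table NO INHERIT is_abstract;
ALTER TABLE types.my_table DROP CONSTRAINT is_abstract_check;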

Next:  Inheriting interfaces in set/subset modelling
Next Week:  Complex Types

Friday, August 24, 2012

PostgreSQL OR Modelling Part 2: Intro to Object Relational Classes in PostgreSQL

In the last post we went briefly over the considerations and concerns of object-relational programming in PostgreSQL.  Now we will put this into practice.  While LedgerSMB has not fully adopted this approach I think it is likely to be the long-term direction for the project.

In object-relational thinking, classes have properties and methods, and sets of objects are retrieved and even manipulated using a relational interface.  While relational systems look at sets of tuples, object-relational systems look at relational manipulation of sets of objects.  A table (and to a lesser extent, a view or a composite type) is hence a class and can have various forms of behavior associated with it; the rows act as objects.  This means a number of considerations change.  Some differences include:

  • SELECT * is often useful to ensure one receives a proper data type
  • Derived values can be imitated by methods, as can dereferences of keys
  • Common filter conditions can be centralized, as can common queries
In general, understanding this section is important to understanding further sections in this series.

Basic Table Structure

Our current example set will use a very simplified schema for storing inventory.  Consider the (greatly simplified) chart of accounts table:

 CREATE TABLE account (
     id int not null unique,
     control_code text primary key, -- account number
     description text not null
);

Populated by:

INSERT INTO account (id, control_code, description)
VALUES (1, '1500', 'Inventory'),
       (2, '4500', 'Sales'),
       (3, '5500', 'Purchase'); 

Of course in a real system the chart of account (and inventory) tables would be more complex.

CREATE TABLE inventory_item (
    id serial primary key,
    cogs_account_id int references account(id),
    inv_account_id int references account(id),
    income_account_id int references account(id),
    sku text not null,
    description text,
    last_cost numeric, -- null if never purchased
    sell_price numeric not null,
    active bool not null default true
);

Now, we'd also want to make sure only one active part can be associated with a sku at any given time, so we'd:

CREATE UNIQUE INDEX inventory_item_sku_idx_u
ON inventory_item (sku) WHERE active IS TRUE;

The create table statement also creates a complex type with the same structure, and thus it defines a data structure.  The idea of relations as data structures in themselves will come up again and again in this series.  It is the fact that tables are data structures which allows us to do interesting things here.

Method 1:  Derived Value called "markup"

The first method we may want to add to this table is a markup method, for calculating our markup based on current sell price and last cost.  Since this value will always be based on two other stored values, there is no sense in storing it (other than possibly in a pre-calculated index).  To do this, we:

CREATE FUNCTION markup(inventory_item) RETURNS numeric AS
$$ SELECT CASE WHEN $1.last_cost = 0 THEN NULL
               ELSE ($1.sell_price - $1.last_cost)
                    / $1.last_cost
          END;
$$ LANGUAGE SQL IMMUTABLE;

Looking through the syntax here, this is a function named "markup" which receives a single input of the table type.  It then calculates and returns a value based solely on that input.  The fact that the function always returns the same value when the same input is passed is reflected in the IMMUTABLE designation.  The planner uses this to plan queries, and PostgreSQL will not index function outputs unless they are marked immutable.

Once this is done, we can:

SELECT sku, description, i.markup from inventory_item i;

Note that you must include the table designation in calling the method.  PostgreSQL converts the i.markup into markup(i).

Of course, our table being empty, no values will be returned.  However if you try to omit the i. before markup, you will get an error.

Not only can we include this in the output we can search on the output:

SELECT sku, description from inventory_item i 
 where i.markup  < 1.5;

If we find ourselves doing a lot of such queries and need to index the values we can create an index:

CREATE INDEX inventory_item_markup_idx 
ON inventory_item (markup(inventory_item));

If you use object.method notation, you must add extra parentheses because of create index semantics [see comment added below]:

 CREATE INDEX inventory_item_markup_idx ON inventory_item ((inventory_item.markup));

Instead of adding columns and using triggers to maintain them, we can simply use functions to calculate values on the fly.  Any value you can derive directly from values already stored in the table can thus be calculated on output perhaps even with the values indexed.  This is one of the key optimizations that ORDBMS's allow.

Method 2:  Dereferencing  an Account

The above table is defined in a way that makes it easy to do standard, relational joins.  More specialized Object-Relational-friendly references will be covered in a future posting.   However suppose we want to create a link to the accounts.  We might create a method like:

CREATE OR REPLACE FUNCTION cogs_account(inventory_item) 
RETURNS account
LANGUAGE SQL AS
$$ SELECT * FROM account where id = $1.cogs_account_id $$;

We can now:

or_examples=# select (i.cogs_account).* FROM inventory_item i;

We will get an empty row with the chart of accounts structure.  This gives us the beginnings of a path system for LedgerSMB (a real reference/path system will be discussed in a future posting).

We can even:

select * from inventory_item i 
 where (i.cogs_account).control_code = '5500';

Note that in many of these cases, parentheses are necessary to ensure that the system can tell that we are talking about an object instead of a schema or other system.  This is true generally wherever complex data types are used in PostgreSQL (otherwise the i might be taken to be a schema name).  Unlike Oracle, PostgreSQL does not make assumptions of this sort, and unlike DB2 does not have a separate dereferencing operator.

Warning: Dereferencing objects in this way essentially forces a nested loop join.  It is not recommended when working with large return sets.  For example the previous has a plan (given that I already have one row in this table) as:

or_examples=# explain analyze
or_examples-# select * from inventory_item i where (i.cogs_account).control_code = '5500';
                                                 QUERY PLAN                                                 
-------------------------------------------------------------------------------------------------------------
 Seq Scan on inventory_item i  (cost=0.00..1.26 rows=1 width=145) (actual time=0.174..0.175 rows=1 loops=1)
   Filter: ((cogs_account(i.*)).control_code = '5500'::text)
 Total runtime: 0.199 ms
(3 rows)

Note that the filter is not expanded to a join.  This means that for every row filtered upon, it is executing a query to pull the resulting record from the account table.  Depending on the size of that table and the number of pages containing rows referenced, this may perform badly.  For a few rows, however, it will perform well enough.

Method 3:  Computer-discoverable save method

One application for object-relational modeling is to provide a top-level machine-discoverable API which software programs can use, creating more autonomy and such between the application and the database.  We will create here a machine-discoverable save function which only updates values that may be updated, and returns the values as saved back to the application.  We can consider this sort of loose coupling a "dialog" rather than "remote command" interface because both are assumed to be autonomous and try to provide meaningful responses to the other where appropriate.

Our function looks like this (not using writeable CTEs for backward compatibility):

CREATE FUNCTION save(in_item inventory_item)
RETURNS inventory_item
LANGUAGE PLPGSQL VOLATILE AS
$$
  DECLARE out_item inventory_item;

 BEGIN
 -- we don't want to allow accounts to change on existing items
 UPDATE inventory_item
    SET sku = in_item.sku,
        description = in_item.description,
        last_cost = in_item.last_cost,
        sell_price = in_item.sell_price,
        active = in_item.active
  WHERE id = in_item.id;

 IF FOUND THEN
     SELECT * INTO out_item FROM inventory_item
      WHERE id = in_item.id;
     RETURN out_item;
 ELSE
 INSERT INTO inventory_item
                 (cogs_account_id,
                  inv_account_id,
                  income_account_id,
                  sku,
                  description,
                  last_cost,
                  sell_price,
                  active)

          VALUES (in_item.cogs_account_id,
                  in_item.inv_account_id,
                  in_item.income_account_id,
                  in_item.sku,
                  in_item.description,
                  in_item.last_cost,
                  in_item.sell_price,
                  in_item.active);
 SELECT * INTO out_item
      FROM inventory_item
     WHERE id = currval('inventory_item_id_seq');
    RETURN out_item;

  END IF;

 END;
$$;

This function is idempotent and it returns the values as saved to the application so it can determine whether to commit or rollback the transaction.  The structure of the tuple is also discoverable using the system catalogs and so an application can actually look up how to construct a query to save the inventory item.  However it certainly cannot be inlined and it will be slow on large sets.  If you have thousands of inventory parts, doing:

SELECT i.save FROM inventory_item i;

Will be painful and do very little other than generate dead tuples.  Don't do it.

On the other hand a software program (either at code generation or run-time) can look up the structure of the type and generate a call like:

 SELECT (i.save).* 
   FROM (SELECT (row(null, 3, 1, 2, 'TEST123', 'Inventory testing item', 1, 2, true)::inventory_item).save) i;

Obviously this is a sub-optimal interface for humans but it has the advantage of discoverability for a computer.  Note the "::inventory_item" may be unnecessary in some cases, however it is required for all practical purposes on all non-trivial databases because it avoids ambiguity issues.  We really want to make sure that it is an inventory_item we are saving especially as data types may change.  This then allows us to control application entry points to the data.
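
For example, one way (of several) a program might discover the structure of the type is to query the system catalogs directly; this particular query is just my illustration, not something prescribed by the approach:

SELECT attname, format_type(atttypid, atttypmod) AS data_type
  FROM pg_attribute
 WHERE attrelid = 'inventory_item'::regclass
   AND attnum > 0
   AND NOT attisdropped
 ORDER BY attnum;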

Note that the application has no knowledge of what is actually happening under the hood of the save function.  We could be saving it in unrelated relations (and this may be a good way to deal with updateable views in an O-R paradigm where single row updates are the primary use case).

Caveats:  In general I think it is a little dangerous to mix imperative code with declarative SQL in this way.   Object-relational modelling is very different from object-oriented programming because with object-relational modelling we are modelling information, while with object-oriented programming we are encapsulating behavior.  This big difference results in endless confusion.

The most obvious way around this is to treat all SQL queries as questions and treat the transactional boundaries as imperative frames to the declarative conversation.  A human-oriented translation of the following exchange might be:

BEGIN;  -- Hello.  I have some questions for you.

SELECT (i.save).*
  FROM (SELECT (row(null, 3, 1, 2, 'TEST124', 
                'Inventory testing item 2', 1, 2, 
                true)::inventory_item).save) i;

-- Suppose I ask you to save an inventory item with the following info for me.
-- What will be saved?


 id | cogs_account_id | inv_account_id | income_account_id |   sku   |       description        | last_cost | sell_price | active
----+-----------------+----------------+-------------------+---------+--------------------------+-----------+------------+--------
  4 |               3 |              1 |                 2 | TEST124 | Inventory testing item 2 |         1 |          2 | t


-- The above information will be saved if you ask me to.

COMMIT; -- Do it.

This way of thinking about the overall framework helps prevent a lot of problems down the road.  In particular this helps establish a separation of concerns between the application and the database.  The application is responsible for imperative logic (i.e. what must be done) while the database answers declarative queries and only affirmatively stores data on commit.  Imperative changes to data only occur when the application issues the commit command.

In this regard object behavior (outside of storage which is an odd fit for the model) really doesn't belong in the database.  The database is there to provide answers to questions and update stored information when told to commit changes.  All other behavior should be handled by other applications.  In a future post we will look at some ways to broaden the ways applications can receive data from PostgreSQL.  Object-relational modelling then moves beyond the question of "what information do I have and how can I organize it to get answers" to "what derivative information can be useful and how can I add that to my otherwise properly relational database?"

Alternate Constructor inventory_item(int)

Now in many cases we may not want to have to provide the whole object definition in order to instantiate it.  In fact we may want to be able to ask the database to instantiate it for us.  This is where alternate constructors come in.  Alternate constructors furthermore can be for single objects or for sets of objects.  We will look at the single objects first, and later look at the set-based constructors.

This constructor looks like:

CREATE OR REPLACE FUNCTION inventory_item(int)
RETURNS inventory_item
LANGUAGE SQL
AS $$

SELECT * FROM inventory_item WHERE id = $1

$$;

If I have an item in my db with an id of two (saved in with the previous method call) I can:

or_examples=# select * from inventory_item(2);
 id | cogs_account_id | inv_account_id | income_account_id |   sku   |      description       | last_cost | sell_price | active
----+-----------------+----------------+-------------------+---------+------------------------+-----------+------------+--------
  2 |               3 |              1 |                 2 | TEST123 | Inventory testing item |         1 |          2 | t
(1 row)

I can also chain this together with other methods.  If all I want is the markup for item 2, I can:

or_examples=# select i.markup from inventory_item(2) i;
         markup        
------------------------
 1.00000000000000000000
(1 row)

I can even:

or_examples=# select (inventory_item(2)).markup;
         markup        
------------------------
 1.00000000000000000000
(1 row)

An application can then use something like this to traverse in-application links and retrieve new objects.

Alternate Constructor inventory_item(text)

Similarly we can have a constructor which constructs this from a  text field, looking up by active SKU:

CREATE OR REPLACE FUNCTION inventory_item(text)
RETURNS inventory_item
LANGUAGE sql AS $$

SELECT * FROM inventory_item WHERE sku = $1 AND active is true;

$$;

We can then:

SELECT (inventory_item('TEST123')).markup;

and get the same result as before.

Warning:  Once you start dealing with text and int constructors, you have the possibility of ambiguity in your queries.  For example, SELECT inventory_item('2') will run this on the text constructor instead of the integer constructor, giving you no results.  For this reason it is a very good idea to explicitly cast your inputs to the constructor.
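
For example, the following calls are unambiguous:

SELECT (inventory_item(2::int)).markup;           -- integer constructor
SELECT (inventory_item('TEST123'::text)).markup;  -- text constructor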

Set Constructor inventory_item(tsquery)

Not only can this be used to create a constructor for a single item; whole sets can be constructed this way.  For example we could create a text search constructor (and this allows the default search criteria to change over time in a centrally managed way):

CREATE OR REPLACE FUNCTION inventory_item(tsquery)
RETURNS SETOF inventory_item
LANGUAGE SQL AS $$

SELECT * FROM inventory_item WHERE description @@ $1;

$$;

This allows some relatively powerful searches to be done, without the application having to worry about exactly what is searched on.  For example, we can:

or_examples=# select * from inventory_item(plainto_tsquery('test'));
 id | cogs_account_id | inv_account_id | income_account_id |   sku   |      description       | last_cost | sell_price | active
----+-----------------+----------------+-------------------+---------+------------------------+-----------+------------+--------
  2 |               3 |              1 |                 2 | TEST123 | Inventory testing item |         1 |          2 | t
(1 row)

Here test has been found to match testing because testing is a form of the word test.

Of course if we have more than one item in the table, we get more than one result:

or_examples=# select * from inventory_item(plainto_tsquery('test'));
 id | cogs_account_id | inv_account_id | income_account_id |   sku   |       description        | last_cost | sell_price | active
----+-----------------+----------------+-------------------+---------+--------------------------+-----------+------------+--------
  2 |               3 |              1 |                 2 | TEST123 | Inventory testing item   |         1 |          2 | t
  4 |               3 |              1 |                 2 | TEST124 | Inventory testing item 2 |         1 |          2 | t
(2 rows)


These examples, I hope, provide some ideas for how one can take the object-relational concepts and apply them towards building more sophisticated, robust, and high-performance databases, as well as better interfaces for object-oriented programs.

Next Week:  Table Inheritance in PostgreSQL

Tuesday, August 21, 2012

O/R Modelling Part 1: Intro to PostgreSQL as Object-Relational Database Management System

This is a very brief intro to PostgreSQL as an object-relational database management system.   In future blog posts, we will look at more hands-on examples of these features in action.  Keep in mind these are advanced features typically used by advanced applications.

This is a very brief guide to the concepts we will be looking at more deeply in future posts, tying together in recipes and examples.   While PostgreSQL was initially designed to explore object-relational modelling possibilities, the toolkit today is somewhat different than it was initially intended, and therefore the focus of this series will be how to use PostgreSQL in an Object-Relational manner, rather than tracking the history of various components.

How is PostgreSQL "Object-Relational?"

The term Object-Relational has been applied to databases which attempt to bridge the relational and object-oriented worlds with varying degrees of success.  Bridging this gap is typically seen as desirable because object-oriented and relational models are very different paradigms and programmers often do not want to switch between them.  There are, however, fundamental differences that make this a very hard thing to do well.  The best way to think of PostgreSQL in this way is as a relational database management system with some object-oriented features.

By blending object-primitive and relational models, it is often possible to provide much more sophisticated data models than one can using the relatively limited standard types in SQL.  This can be done both as an interface between an application and database, and as intra-query logic.  In future posts I will offer specific examples of each concept and explore how PostgreSQL differs from Oracle, DB2, and Informix in this area.

PostgreSQL is a development platform in a box.  It supports stored procedures written in entirely procedural languages like PL/PGSQL or Perl without loaded modules, and more object-oriented languages like Python or Java, often through third party modules.  To be sure you can't write a graphical interface inside PostgreSQL, and it would not be a good idea to write additional network servers, such as web servers, directly inside the database.  However the environment allows you to create sophisticated interfaces for managing and transforming your data. Because it is a platform in a box the various components need to be understood as different and yet interoperable.  In fact the primary concerns of object-oriented programming are all supported by PostgreSQL, but this is done in a way that is almost, but not quite, entirely unlike traditional object oriented programming.  For this reason the "object-relational" label tends to be a frequent source of confusion.

Data storage in PostgreSQL is entirely relational, although this can be degraded using types which are not atomic, such as arrays, XML, JSON, and hstore.  Before delving into object-oriented approaches, it is important to master the relational model of databases.  For the novice, this section is therefore entirely informational.  For the advanced developer, however, it is hoped that it will prove inspirational.

In object-oriented terms, every relation is a class, but not every class is a relation.  Operations are performed on sets of objects (an object being a row), and new row structures can be created ad-hoc.  PostgreSQL is, however, a strictly typed environment and so in many cases, polymorphism requires some work.

Data Abstraction and Encapsulation in PostgreSQL


The relational model itself provides some tools for data abstraction and encapsulation, and these features are taken to quite some length in PostgreSQL.  Taken together these are very powerful tools and allow for things like calculated fields to be simulated in relations and even indexed for high performance.

Views are the primary tool here.  With views, you can create an API for your data which is abstracted from the physical storage.  Using the rules system, you can redirect inserts, updates, and deletes from the view into underlying relations, preferably using user defined functions.  Being relations, views are also classes and methods.  Views cannot simply be inherited and workarounds cause many hidden gotchas.

A second important tool here is the ability to define what appear to be calculated fields using stored procedures.  If I create a table called "employee" with three fields (first_name, middle_name, last_name) among others, and create a function called "name" which accepts a single employee argument and concatenates these together as "last_name, first_name middle_name" then if I submit a query which says: 

select e.name from employee e;

it will transform this into:

select name(e) from employee e;

This gives you a way to do calculated fields in PostgreSQL without resorting to views. Note that these can be done on views as well because views are relations.  These are not real fields though.  Without the relation reference, it will not do the transformation (so SELECT name from employee will not have the same effect).
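
For reference, a minimal sketch of that employee example might look like the following (the id column and the exact concatenation rules are my own assumptions):

CREATE TABLE employee (
    id serial primary key,
    first_name text not null,
    middle_name text,
    last_name text not null
);

CREATE FUNCTION name(employee) RETURNS text
LANGUAGE SQL IMMUTABLE AS
$$ SELECT $1.last_name || ', ' || $1.first_name
          || coalesce(' ' || $1.middle_name, '') $$;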

Messaging and Class API's in PostgreSQL


A relation is a class.  The class is accessed using SQL which defines a new data structure in its output.  This data structure, unless defined elsewhere as a relation or a complex type, cannot have methods attached to it and therefore cannot be used with the class.method syntax described above.  There are exceptions to this rule, of course, but they are beyond the scope of this introduction.  In general it is safest to assume that the output of one query, particularly one with named output fields, cannot safely be used as the input to another.

A second messaging apparatus in PostgreSQL is the LISTEN/NOTIFY framework which can be used along with triggers to issue notifications to other processes when a transaction commits.  This approach allows you to create queue tables, use triggers to move data into these tables (creating 'objects' in the process) and then issue a notification to another process when the data commits and becomes visible.  This allows for very complex and interactive environments to be built from modular pieces.
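
A bare-bones sketch of that queue-table pattern might look like this (the table, channel, and function names are illustrative only):

CREATE TABLE work_queue (
    id serial primary key,
    payload text not null,
    created_at timestamp not null default now()
);

CREATE OR REPLACE FUNCTION work_queue_notify() RETURNS trigger
LANGUAGE plpgsql AS
$$
BEGIN
    -- tell listeners which row just became available; delivered on commit
    PERFORM pg_notify('work_queue_channel', NEW.id::text);
    RETURN NEW;
END;
$$;

CREATE TRIGGER work_queue_notify
AFTER INSERT ON work_queue
FOR EACH ROW EXECUTE PROCEDURE work_queue_notify();

-- a consuming process simply issues:  LISTEN work_queue_channel;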

Polymorphism in PostgreSQL


PostgreSQL is very extensible in terms of all sorts of aspects of the database.  Not only can types be created and defined, but also operators can be defined or overloaded.

A more important polymorphism feature is the ability to cast one data type as another.  Casts can be implicit or explicit.  Implicit casts, which have largely been removed from many areas of PostgreSQL, allow PostgreSQL to cast data types when necessary to find applicable functions or operators.  Implicit casting can be dangerous because minor errors can lead to unexpected results.  '2012-05-31' is not 2012-05-31.  The latter is an integer expression that reduces to 1976.  If you create an implicit cast that turns an integer into a date being the first of the year, the lack of quoting will insert incorrect dates into your database without raising an error ('1976-01-01' instead of the intended '2012-05-31').  Implicit casts can still have some uses.
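
To see the arithmetic at work, compare:

SELECT 2012-05-31 AS unquoted,        -- integer arithmetic: 1976
       DATE '2012-05-31' AS quoted;   -- an actual date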

Inheritance in PostgreSQL


In PostgreSQL tables can inherit from other tables.  Their methods are inherited but implicit casts are not chained, nor are their indexes inherited.  This allows you develop object inheritance hierarchies in PostgreSQL.  Multiple inheritance is possible, unlike any other ORDBMS that I have looked at on the market (Oracle, DB2, and Informix all support single inheritance).

Table inheritance is an advanced concept and has many gotchas.  Please refer to the proper sections of the manual for more on this topic.  On the whole it is probably best to work with table inheritance first in areas where it is more typically used, such as table partitioning, and later look at it in terms of object-relational capabilities.

Overall the best way to look at PostgreSQL as an object-relational database is as a database which provides very good relational capabilities plus some advanced features that allow one to create object-relational systems on top of it.  These systems can then move freely between object-oriented and relational worldviews but are still more relational than object-oriented.  At any rate they bear little resemblance to object-oriented programming environments today.  With PostgreSQL this is very much a toolkit approach for object-relational databases building on a solid relational foundation.  This means that these are advanced functions which are powerful in the hands of experienced architects, but may be skipped over at first.

Next week:   An intro to object-relational "classes."  We will build a simple class that represents items in inventory.


Forthcoming Posts:
  • An intro to object-relational "classes" looking at tables with extended functionality
  • Table inheritance in PostgreSQL (including how to solve key problems)
  • Composite and row types in tables, including references to other tables
  • Interlude:  Comparison to MySQL as a way of questioning how we think about databases (no flames!)
  • Nested Data Structures (including things to avoid)
  • Polymorphism in PostgreSQL 
  • Messaging in PostgreSQL between client applications
  • General Design Patterns and Anti-Patterns

Monday, August 13, 2012

ACID as a basic building block of eventually consistent, distributed transactions

Anything worth doing is worth doing well.  It therefore follows that anything worth tracking for a business is worth tracking well.  While this involves tradeoffs which are necessarily business decisions, consistency and accuracy of data are always important considerations.

In a previous post, I looked at the possible origin of double entry accounting in stock and foil split tally sticks.  Now let's look at how financial systems of today might provide a different way of looking at distributed transactions in loosely coupled systems.  This approach recognizes that in human systems all knowledge is fundamentally local and applies this to distributed computing environments, broadly defined.

Eventual Consistency as Financial Firewall

In the non-computer world there are very few examples of tightly coupled distributed transactions.  Dancing, perhaps, comes to mind, as does a string quartet playing a piece of music.  However most important work is done using loosely coupled systems.  Loosely coupled systems done right are more robust and avoid some of the problems associated with the CAP theorem.  If the first violinist suffers a mishap part way through a quartet and is unable to continue, you probably will have to stop playing.  If the petty cash manager suddenly falls ill after you have withdrawn money from the petty cash drawer to go buy urgently needed office supplies you can continue on your way.  You might have to wait until someone else can take over before you can give your receipt and change back, however.

Such systems, however, have two things in common.  First, they are always locally consistent, and this is extremely important: the petty cash drawer and all cash vouchers are together in a consistent state.  Second, counterparties are all locally consistent as well, and transactions can be tied clearly back to those counterparties.  The party and counterparty together provide a basis for long-term conflict resolution in the form of an audit.  Eventual consistency is a property of the global system, not of any local component.

Moreover, there are times when global eventual consistency is actually desirable, but this business need is premised on absolute local consistency.  For example, if I am processing credit cards, the processing systems will be locally consistent and my accounting systems will be locally consistent, but they will not always be in sync.  My accounting department will have control over the data coming into the books.  Similarly, if my inventory is stored by a third party and shipped on my behalf, they may send me a report every day, but it is not going to hit the books until it is reviewed by a person.  Both of these approaches use global eventual consistency as a firewall against bad financial data entering the books; in other words, this is a control point where humans can review and correct problems.  In addition to the performance issues, this is a major reason why you will rarely see two-phase commit used to synchronize transactions between financial accounting systems.  Instead, data will be moved in ways which are not globally consistent and only become so after human review and approval.  The computer, in essence, is treated like a person who can make mistakes, or worse.

If this is the case internally it is even more the case between businesses.  If I run a bank, I am not giving your bank direct access to my financial database, and I am not going to touch your database.  The need for eventual consistency in transferring money between our banks will have to take the form of messages exchanged, human review, and more.

Why SQL?  Why ACID?  Why not a NoSQL ERP?

I have often said that NoSQL is an extraordinarily poor choice for ERP software.  In addition to the difficulties in doing ad hoc reporting, you have a need for absolute local consistency, which is not a goal of NoSQL software.  Without local consistency you can't audit your books, and you cannot determine where something went wrong.  ACID is not a global property of the business IT infrastructure, and it shouldn't be (if you try to make it one, you run into a dreaded brick wall called "the CAP theorem").  It is a property of local data stores, and all your internal controls are based on the assumption that data is locally consistent and knowledge is local.

Example 1: An Eventually Consistent Cash Register

The first example is an eventually consistent retail environment.  In this environment we aren't taking materials out of the store to ship, or if we do, we go down and pull them off the shelf before entering the sale.  The inventory on the shelf is thus authoritative here.  The books are also reviewed every day, and transactions are reviewed and posted.

Since nobody can pull a product off the shelf that doesn't exist, we don't have to worry about real-time stock tracking.  If we did though there would be ways to handle this.  See example 2 below.

In this example the cash register would locally run an ACID-compliant database engine and store the data locally through the day.  At the end of the day it would export the transactions it did to the accounting system, where the batch would be reviewed by a person before posting to the books.  Both the cash register and the accounting system would retain records of the transfer and synchronization, making an audit possible.  Because of the need for disconnected operation, I am considering building such a cash register for LedgerSMB 1.5.
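
A minimal sketch of what the register side might look like (all names here are hypothetical; this is one way to meet the requirement, not a description of any existing LedgerSMB code):

CREATE TABLE sync_batch (
    batch_id    bigserial PRIMARY KEY,
    exported_at timestamptz NOT NULL DEFAULT now()
);

CREATE TABLE pos_transaction (
    txn_id    bigserial PRIMARY KEY,
    txn_time  timestamptz NOT NULL DEFAULT now(),
    sku       text NOT NULL,
    qty       numeric NOT NULL,
    amount    numeric NOT NULL,
    batch_id  bigint REFERENCES sync_batch   -- NULL until exported
);

-- End-of-day export: open a batch and claim every unsent transaction in a
-- single database transaction, so a crash mid-export never loses or
-- double-sends rows.
WITH b AS (INSERT INTO sync_batch DEFAULT VALUES RETURNING batch_id)
UPDATE pos_transaction
   SET batch_id = (SELECT batch_id FROM b)
 WHERE batch_id IS NULL;

Because every exported row is stamped with its batch, either side can regenerate the batch later if the transfer needs to be audited or re-run.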

Example 2:  Real-time inventory tracking for said cash register.

So business needs change, and now the disconnected cash register needs to report inventory changes to the main ERP system on as close to a real-time basis as possible.  This is made easier if the cash register is running PostgreSQL.

So we add a trigger to the table that stores inventory movements; this trigger queues each movement for processing.  A trigger on the queue table then issues a NOTIFY to another program, which attempts to contact the ERP system.  It sends the inventory information plus the invoice number, digitally signed with a key belonging to the cash register.  The ERP stores this information and uses it for reporting and reconciliation at the end of the day, but treats it as a signed voucher checking out inventory.  At the end of the day these are reconciled and errors flagged.  If the message exchange fails, it is retried later.
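
Sketched in rough form (table, column, and channel names are hypothetical, and the signing and transmission logic lives in the external listener, not in the database):

-- Hypothetical movement table on the register.
CREATE TABLE inventory_movement (
    movement_id bigserial PRIMARY KEY,
    sku         text NOT NULL,
    qty         numeric NOT NULL,
    invoice_num text NOT NULL
);

-- Durable queue of movements awaiting transmission to the ERP.
CREATE TABLE inventory_queue (
    queue_id     bigserial PRIMARY KEY,
    movement_id  bigint NOT NULL REFERENCES inventory_movement,
    queued_at    timestamptz NOT NULL DEFAULT now(),
    sent_at      timestamptz   -- NULL until the messaging module confirms receipt
);

-- Trigger on the movement table: queue each new movement.
CREATE FUNCTION queue_inventory_movement() RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
    INSERT INTO inventory_queue (movement_id) VALUES (NEW.movement_id);
    RETURN NEW;
END;
$$;

CREATE TRIGGER queue_movement
    AFTER INSERT ON inventory_movement
    FOR EACH ROW EXECUTE PROCEDURE queue_inventory_movement();

-- Trigger on the queue: wake the messaging module, which LISTENs on this channel.
CREATE FUNCTION notify_inventory_sync() RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
    NOTIFY inventory_sync;
    RETURN NEW;
END;
$$;

CREATE TRIGGER notify_sync
    AFTER INSERT ON inventory_queue
    FOR EACH ROW EXECUTE PROCEDURE notify_inventory_sync();

If delivery fails, the rows simply stay in the queue with sent_at still NULL, and the listener (or a periodic retry) picks them up again later.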

Now, here you have three locally consistent systems: the POS, the ERP, and the messaging module.   Eventual consistency is a global property arising from them, and it is preferable, business-wise, to absolute consistency because it gives an opportunity for human oversight.

These systems draw their inspiration from the paper accounting world.  They are partition-tolerant, available, and guaranteed, absent hardware destruction, to be eventually consistent.  Not only does this get around the problems flagged in the CAP theorem, but it also provides opportunities for humans to stay in control of the business.

Unlike BASE systems, these systems, built on ACID, provide a basic framework where business process controls are consistently enforced.

Considerations

For such a system to work, a few very specific requirements must be met (ACID allows us to meet them, but we still have to design them into the system):
  • Messages must be durable on the receiving side.
  • Messages must be reproducible on the sending side (a minimal sketch of both sides follows this list).
  • Each side must be absolutely consistent at all times.
  • Sender and receiver do not require knowledge of the operations handled by the other side, just of the syntax of the messages.  The sender must have knowledge that the message was received, but need not know that it was durably stored (such knowledge can be helpful, but it is not required).
  • Humans, not machines, must supervise the overall operation of the system.
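
As a minimal sketch of the first two points (hypothetical names throughout; the payload column stands in for the signed voucher described above):

-- Receiving side: a message is durable once this row commits.  The
-- (sender_id, message_id) key makes redelivery of the same message harmless:
-- the duplicate insert fails and the receiver simply re-acknowledges.
CREATE TABLE received_message (
    sender_id   text NOT NULL,            -- e.g. which register sent it
    message_id  bigint NOT NULL,          -- assigned by the sender
    received_at timestamptz NOT NULL DEFAULT now(),
    payload     text NOT NULL,            -- the signed voucher
    PRIMARY KEY (sender_id, message_id)
);

-- Sending side keeps its own durable copy, so any message can be
-- reproduced and re-sent on demand.
CREATE TABLE sent_message (
    message_id   bigserial PRIMARY KEY,
    created_at   timestamptz NOT NULL DEFAULT now(),
    payload      text NOT NULL,
    acknowledged boolean NOT NULL DEFAULT false
);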


Summary and thoughts on the CAP theorem

In the off-line world, the CAP theorem not only imposes real limits on coordinated activities; the off-line world also suggests solutions.  Individuals are entrusted with autonomy, but controls are generally put in place to ensure both local consistency and the ability to guarantee eventual consistency down the road.  Rather than building on the BASE approach, these controls build on the fundamental requirements of ACID compliance.   Paper forms are self-contained, and are atomic for practical reasons.  They maintain consistent state.  The actors' snapshots of information are inherently limited regarding what they know about the state on the other side, providing something akin to isolation.  And finally, the whole point of a paper trail requires some degree of durability.  Thus the global system resembles the BASE approach, but all of its components are in fact ACID-compliant.

This provides a different way to think of distributed computing: functional partitions are sometimes desirable and can often be made useful.  It allows one to move from treating the CAP theorem as a hard limit to treating it as a useful reminder of what is inherently true: loosely coupled systems create more robust global environments than tightly coupled ones, and people, rather than computers, are often necessary for conflict resolution and business process enforcement.  The basic promise is that global consistency will eventually be restored, and that the system will continue to offer basic services despite partitions that form.  This is the sort of thing the BASE/NoSQL proponents are keen on offering, but their solutions make it difficult to enforce business requirements because individual applications may see only eventual consistency.  Here each application is absolutely consistent, but the environment as a whole is eventually consistent.  This matches off-line realities more closely.

Human systems, by contrast, perhaps because they are not scalable in the CAP sense (having limited communications bandwidth), have developed very sophisticated systems of eventual consistency.  These systems require absolute consistency at the node (person) level and then coordination and reconciliation between nodes.  In my accounting work the approach I have usually taken is to put the human in control but do whatever I can to simplify and streamline the human's workflow.  For example, when trying to reconcile a checking account it may be useful to get data from the bank.  This data is then matched to what's in the database using a best-guess approach (in some cases we can match on check number, but for wires and transfers we cannot, so we guess based on date and amount), and the human is left to resolve the inevitable differences as well as review and approve.
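
A simplified sketch of that best-guess matching step (the table and column names are hypothetical, not LedgerSMB's actual schema):

-- Lines imported from the bank statement.
CREATE TABLE bank_line (
    line_id      bigserial PRIMARY KEY,
    check_number text,                -- NULL for wires and transfers
    post_date    date NOT NULL,
    amount       numeric NOT NULL
);

-- Payments as recorded in the books.
CREATE TABLE ledger_item (
    item_id      bigserial PRIMARY KEY,
    check_number text,
    trans_date   date NOT NULL,
    amount       numeric NOT NULL
);

-- Match on check number when we have one; otherwise guess by amount and
-- a small date window.
SELECT b.line_id, l.item_id
  FROM bank_line b
  JOIN ledger_item l
    ON (b.check_number IS NOT NULL AND b.check_number = l.check_number)
    OR (b.check_number IS NULL
        AND b.amount = l.amount
        AND b.post_date BETWEEN l.trans_date AND l.trans_date + 3);

Anything this query does not pair up, and any row paired more than once, is left for the human doing the reconciliation to resolve, review, and approve.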

The nice thing about this model (and it differs from BASE considerably) is that you can expect absolute consistency at all times on every node, and availability of the global environment does not depend on every individual component functioning.  In the cash register example, the cash register can go down without the ERP going down, and vice versa.  This means that you can achieve basic availability and eventual consistency without sacrificing ACID on the component level.  The key, however, is being able to go back to the components and generate a transaction history on demand, so that if something failed to come through you can re-run it.

Because this is based on the ACID rather than the BASE model, I would therefore offer a cute name for this model of consistency: Locally Available and Consistent Transaction and Integrity Control ACID.  This can, of course, be referred to by the cute shorthand LACTIC ACID, or better, "the LACTIC ACID model of eventual consistency."  It is a local-knowledge model, rather than the more typical global-knowledge model, and assumes that components, like people, have knowledge only of the things they need to know, that they are capable of functioning independently in disconnected mode, and that they are capable of generating consistent, accurate pictures of the business later on demand.

This approach further relegates traditional techniques like two-phase commit to the role of tools, which can be very helpful in some environments (particularly replication for high availability) but are not needed to ensure consistency of the environment when the systems are loosely coupled.  They may still be helpful for handling some sorts of errors, but they are one tool among many.

Finally, although financial systems are not transportation vehicles, the automation paradox applies to them as well.  If humans rely on too much automation, they are inclined to trust it, and they are therefore ill-prepared to deal with problems when they arise.  In this approach, humans are integral parts of the operation of the distributed computing environment and therefore are never out of the loop.  This may sound less than desirable, but in my experience every case of embezzlement I have heard of could have been prevented by more human eyes on the books and less, rather than more, immediate consistency.  There is nothing more consistent than having a single employee with full access to the money.  Separation of duties implies less consistency between human actors, but that is precisely why it is useful.

This approach is intended to be evolutionary rather than revolutionary.  Rather than trying to create something new, it is an attempt to take existing, mature processes and apply them in new ways.