Thursday, October 11, 2012

Object-Relational Algebra 1: Definitions and Assumptions

 Introduction

This post begins a series to try to establish a mathematical basis for object-relational database systems.  The idea here is to extend relational math in order to cover object-relational database operations rather than to model current databases mathematically.  I therefore want to start with my assumptions and definitions.

For simplicity's sake I will follow with Codd's assumptions, namely that each row is unique, that the ordering is not significant, and that within each row, the columns are significant.  This establishes a definition of a relation as a set of tuples.  Everything here is then an extension of Codd's math, rather than a replacement for it.  In the programming world, boolean operations may mean something slightly different, or bags may be used instead of sets, but the goal here is not to model databases but to model data, and Codd's model is more elegant in this regard.  It is the real world that gets messy.

To Codd's assumptions I will add my own, namely that the value of any attribute is opaque to relational operations but may act as the domain for a function in this way.  This becomes important when I get to the special Σ function and the functional dependencies on its output.  The special function in my system Σ(R) (join series) with a θ condition gives you a way to represent self-joins of arbitrary depth for relational math, for example, and combined with functional dependencies of its output gives you a way to express queries of transitive binary closure, but that is not its most interesting use at least to me.

A second important point here is that I will diverge from some previous efforts in that I will not throw in "magic operators" designed to solve corner cases.  The goal here is to build a more expressive system that can solve more problems.  Therefore an operator that determines transitive binary closure is ruled out, and if we can't solve a problem without a magic operator, then this will be left for others.  Additionally this effort is intended to be minimalist and allow many different behaviors to be encapsulated through different approaches.  For this reason I will build directly on the work of Codd and more or less ignore other approaches that various researchers have taken.

Finally this is somewhat to be distinguished from the way SQL does things.  SQL itself is a messy (from a math standpoint) application of both relational algebra and predicate calculus, using bags instead of sets.

One important problem is choosing a name for this extension.  The term object in computer science tends to be a bit abstract and not really a good description of what is going on here.  A better description might be functional-relational algebra.  I call it object-relational algebra only because it fits in with object-relational databases.

In the posts that follow, relation and function will be given their ordinary mathematical meanings.  However the relationship between the two are not well defined to my knowledge.

What is an Object Anyway?

In computer science terms, an object is a data structure (here represented as a tuple) which has imperative behavior attached to it.  Object-oriented programming is thus essentially an extension to structural programming where behavior follows structure.    Classes define both structure and behavior, and objects instantiate and apply these rules.  Applying a model such as this to an algebra of data is somewhat problematic, because algebra is about derivation of values, not about behavior.

What we are doing here is similar, and yet different.  I haven't fleshed out a theory of inheritance vs composition (PostgreSQL supports multiple inheritance which can be used to implement either inheritance or composition, strictly speaking), and collection tables can be used to implement composition in various databases including PostgreSQL.  A theory of composition and inheritance is left to others.  From the standpoint of my model these are equivalent.

Instead of behavior, I use the same basic approach of tying structure to function in order order to expand the reach of the relational model.  In computer science terms, a class is then roughly similar to a relation, and an object roughly similar to a tuple.  However, instead of behavior, the typical focus is on derived information.  Thus functions become important in a way which they have not typically been used in relational math.

Functions of a Relation

So here, f(R) is a function of relvar R if, and only if, for every possible tuple in R, f(R) returns one distinct result (that result could however be a tuple or even a set).    Functional dependencies stored in the database can then be modeled as functions, but so can functional dependencies which are not stored.

The following then is not a function: f(R) =  x^0.5 because it resolves to two distinct values for every value of attribute x in relation R (one is positive and the other is negative).  However the following is a function:  f(R) = (abs(x^0.5), -1 * abs(x^0.5)) because it resolves to a tuple with both possible answers from the previous, at least if x is always positive or imaginary numbers are allowed.  It could return a set instead and that would be acceptable, however in this case the structure of the result is also set by the function, and the above function is distinct from g(R) = { abs(x^0.5), -1 * abs(x^0.5) } because the structure of the result is different.

In standard relational algebra, a tuple is finitely expressive, namely one can only express a finite set of values off a single tuple.  However, for any given tuple, an infinite number of functions are possible, and thus when combined with functions, a tuple becomes infinitely expressive.  Not only can all functional dependencies of the tuple's superkey be expressed as a function of the tuple, but any transformation of the tuple's values, or the values of functional dependencies, can be expressed as such as well.

A functional approach also allows us to dispense with the rename operation in relational algebra, since renamed relations can be seen as relation-returning functions.

Kinds of Functions

In my model, functions can be divided up into the following categories:
  1. Relational Functions can be expressed solely in relational algebra.
  2. Non-relational functions possess no non-trivial relational algebra reduction.  x^2 is non-relational.
  3. Complex Functions, relationally speaking have non-trivial relational reductions, but non-relational components too.
Assuming sufficient join dependencies in a database, every functional dependency can be expressed through relational functions.  Moreover trivial relational dependencies can always be expressed by relational functions, and indeed by mere projection operations.   We can then define a trivial relational function as one which reduces solely to project operations off information stored in the tuple.

Result

The resulting system essentially creates something like an object-oriented database but one which is fundamentally relational, and in which objects behave differently than they do with mere object-oriented persistence.   While each tuple is infinitely expressive, this is possible only because of a distinction between primary (Codd's Algebra) stored data and secondary (calculated) answers.  This extension however allows any kind of mathematics (or other logic) to be tied into relational algebra.   This allows relational algebra to be used along with many other kinds of disciplines to build extremely sophisticated data models.

Forthcoming:
1: Exploring properties of relational, non-relational and complex functions
2: The special function  Σ(R) representing the output of a simple series join.

No comments:

Post a Comment