Friday, January 13, 2012

Further Reply to Tony Marston

Tony Marston has updated his piece about unintelligent databases to include information from my earlier reply, and has also commented on another piece entitled Business Logic in the Database.  The reply and his comments on the latter get into some of the reasons why his use case is in fact narrow, and are therefore very much worth discussing here.  Additionally, some other viewpoints from the coverage of the debate in the previously mentioned posts will be discussed.

It is no surprise that Tony continues to take issue with the idea of putting business logic in the database, and in fact expands that to include things like foreign keys and check constraints.   However, many of his points are interesting and worth discussing.  My own sense is that we can relegate Tony's viewpoint largely to cases where software is delivered as a service.  As a bonus, I will also respond to his points in his article Stored Procedures are Evil.


Marston's New Points in His Reply

That has not been my experience. I have spent all of my career in writing applications which have sole access to their databases. Having multiple applications trying to perform the same function on the same database is something which I have never encountered.

Marston is right that this is usually not on the requirement sheet anywhere, though sometimes it is.  Rather, additional applications are usually added after the fact, for example to move data from one application to another, to generate reports, or to do other things entirely.

One of my customers uses LedgerSMB in a fairly automated way, and has all financial information imported through a staging schema from production applications, as well as a second application which creates and manages vendor profiles and payment information (and which eventually imports the data into LedgerSMB).  They are an exception though.  Most customers who end up with multi-app databases don't set out to build them.  Instead they have additional requirements added after the fact where new applications are clearly the right approach.  By removing the logic which prevents invalid data from being stored in the database, Marston's approach slams the door shut on such after-market development.

I disagree completely. It is implementing business logic in the database which is more time consuming, difficult and expensive. And please don't muddy the waters with NoSQL databases as they are a recent phenomena which have yet to reach full maturity and live up to their hype.

The discussion about NoSQL holds true for other schemaless designs.  If all you are using the RDBMS for is a very light-weight data store with some backup capabilities, why not use a schemaless database?

As far as this discussion is concerned, NoSQL engines in fact bring little to the table that pre-relational engines did not.  So in this regard, they are not significantly different from any other key/value store, for example Berkeley DB, which I think we can all agree has reached full maturity.

The tradeoff is quite simple really.  If all you are doing is app development with no reporting, it's far faster, development-wise, to just throw your data into key/value stores (whether pre-relational or post-relational) than it is to deal with the overhead of relational design.  On the other hand, you lose data constraints, any assurance that all stored data is meaningful, and any reasonable ad hoc reporting.  On the whole, the losses for a business application usually outweigh the gains, for reasons based on the very features Tony Marston doesn't appear to want to use (foreign keys, check constraints, views, reporting.....), and therefore the relational model wins out more often than not.

Now, my guess is that since Marston repeatedly says he doesn't do reporting but has an ERP, he is specifically talking about ad hoc reporting.  Any ERP worth its salt relies heavily on reporting generally.  So here we begin to see the narrow use case into which Marston (and admittedly some big ERPs too) falls, though I think a lot of the reason for this is licensing revenue.

Most large ERPs that I am aware of do in fact require that all access to the db go through a middleware layer.  I think there are two fundamental reasons for this.  The first is that most of these depend on large, expensive, proprietary databases, and since most businesses are more interested in consolidating servers, it is hard to insist that customers, for example, buy SQL Server instead of Oracle, or DB2 instead of SQL Server.  This means that there is a requirement to use the lowest common denominator in the RDBMS set.

The second reason is that the ERP can only enforce client access license limits in the middle tier; it cannot do this effectively at the database level.  Therefore client access license payments are usually enforced there, and no writes can be allowed except through the ERP application itself.   The goal, of course, is to build an ERP framework that sits at the center of the business application environment, so that the ERP vendor stays firmly in control of the license fees it will collect.

That neither of these applies to LedgerSMB is why we can afford to move away from this model and aggressively reuse the frameworks provided by the database and other components.

You misunderstand me. When I said "dumb" data store I meant the basic properties of any DBMS (whether relational or not) which is the ability to specify tables, columns within tables with data types and sizes, primary keys, unique keys and indexes. More advanced features, such as foreign key constraints, column constraints, triggers, stored procedures and user-defined functions were later additions. It is the use of these "advanced" features which I regard as optional and therefore I have the right to exercise that option and not use them.

Um......  Where to begin......

First, tables and columns are what define the relational model.  Non-relational databases use other ways of representing data.  Primary keys are also specific to the relational model, as, I would think, are unique keys.  That means that of the features listed as belonging to all DBMSs ("whether relational or not"), only the most general (indexes) is applicable outside the relational world, and even that term is so vague that it isn't clear to me it is meaningful outside of a more specific context.

Berkeley DB, for example, is a basic key/value store.  You can store whatever you like as a value, and it is referenced by its key.  You could store JSON, XML, YAML... whatever.   No tables, columns, unique constraints, etc.  You can just serialize your objects and put them there.

The idea that foreign key constraints are an advanced feature that should not be used would be laughable were it not so dangerous.  I have seen first hand what the refusal to declare foreign keys can do in the context of an ERP application.  I suppose it is great if you want to make money off of untangling people's data.   It is very bad though if you want people to trust the application to give the right results from an accounting perspective.    I remember a panicked call from a customer one April a few years before we forked LedgerSMB, "It's getting close to tax time and our books don't balance.  We need this fixed right now.  Can you help?"  How long had the books been out of balance?  About 8 months.  Ouch, that's gotta hurt......

One of the big developments in LedgerSMB 1.2 was the addition of foreign key constraints, and these were further improved upon in 1.3.  Why?  Because we take data integrity seriously......

 I am most definitely *NOT* advocating a return to the COBOL/VSAM/IMS paradigm. I am simply questioning the argument that because you *CAN* implement business rules in the database then you *MUST*. I don't have to, and I choose not to.

The question of course is one of tradeoffs.  How important is your data?  How sure are you that folks will not want to create other import/export routines hitting the database?  How confident are you in your code?

The more you can leverage declarative constraints at all levels of your application, the more sure you can be, mathematically, that your data is meaningful.  It really is that simple.
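
To make that concrete, here is a minimal sketch of what declarative constraints look like in PostgreSQL.  The tables and columns are invented for illustration and are not LedgerSMB's actual schema:

    -- No application, import script, or ad hoc fix can store an order line
    -- for a nonexistent part, a non-positive quantity, or a malformed
    -- currency code; the database enforces this for every caller.
    CREATE TABLE parts (
        id          serial PRIMARY KEY,
        sku         text NOT NULL UNIQUE,
        description text NOT NULL
    );

    CREATE TABLE order_lines (
        id       serial  PRIMARY KEY,
        part_id  integer NOT NULL REFERENCES parts (id),
        qty      numeric NOT NULL CHECK (qty > 0),
        currency char(3) NOT NULL CHECK (currency ~ '^[A-Z]{3}$')
    );

Whether the write comes from the main application, a reporting tool, or a one-off script, the same rules hold, with no code path to forget.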

Marston's Points (And Responses to them) in His Comments On Business Logic in the Database

In the comments on the blog entry mentioned above, "Business Logic in the Database", I think the use case Tony is talking about gets effectively narrowed to those where either revenue from middleware licenses is an issue or something equivalent is going on:
Luck has nothing to do with it. Nobody is allowed direct access to my databases without my permission, and I never grant that permission. Access is either through my application or through my web services.
 Now, thinking for a moment about what "my databases" must mean here, this can only mean software as a service.  There the customer does not control their own data, which is a compelling reason to switch to LedgerSMB ;-).   This sort of control ends, for all intents and purposes, the moment the application is deployed at a customer's site.  Therefore such an application either requires license agreements, or it requires control over the database servers.

Either way, it has no place in open source.

To this, a commenter named "Thomas" replied:
While I realize you think your systems are confined to only being accessed by your applications, in the 25 years I’ve worked with databases, I’ve never encountered of a situation where such a restriction could be enforced indefinitely.
That matches my experience too.  In fact I would say that I have never found an application deployed in production in a standard RDBMS at a customer's site where the customer found this restriction to be valuable.  If it can be enforced indefinitely it is only through legal agreements requiring it, and those are burdens....

The fundamental problem, as I have repeatedly noted in this discussion, is that the amount of logic that must be included in the database goes way up as soon as you want to support even the possibility of that second application.  Yes, that means, at a minimum, check constraints, referential integrity constraints, domains, and the like.  Depending on your application, you may well have to go to stored procedures just to get an assurance of data integrity.  Certainly that is the case when trying to enforce a rule like "all GL transactions must be balanced."
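
As a sketch of what that can look like in PostgreSQL (with invented table names, not LedgerSMB's real schema), a deferred constraint trigger can refuse to commit an unbalanced entry no matter which application wrote it:

    CREATE TABLE journal_entry (
        id          serial PRIMARY KEY,
        description text NOT NULL
    );

    CREATE TABLE journal_line (
        entry_id integer NOT NULL REFERENCES journal_entry (id),
        account  text    NOT NULL,
        amount   numeric NOT NULL   -- debits positive, credits negative
    );

    CREATE FUNCTION check_entry_balanced() RETURNS trigger
    LANGUAGE plpgsql AS $$
    BEGIN
        IF (SELECT COALESCE(sum(amount), 0) FROM journal_line
             WHERE entry_id = NEW.entry_id) <> 0
        THEN
            RAISE EXCEPTION 'journal entry % does not balance', NEW.entry_id;
        END IF;
        RETURN NULL;
    END;
    $$;

    -- Deferred to commit time, so the lines of an entry can be inserted
    -- one at a time within the transaction.
    CREATE CONSTRAINT TRIGGER journal_line_balanced
        AFTER INSERT OR UPDATE ON journal_line
        DEFERRABLE INITIALLY DEFERRED
        FOR EACH ROW EXECUTE PROCEDURE check_entry_balanced();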



A Diversion into History:  Goths, Marston, and Anti-Intellectualism


Robert Young commented:
Considering that you’re championing the failed COBOL/VSAM/IMS paradigm that Dr. Codd put a stake in, why aren’t you the one making such comments? The Goths put Europe into the Dark Ages by enforcing the pre-intelligence paradigm. Those, such as yourself, seek to return to the thrilling days of yesteryear, where each application was its own silo, and data sat in dumb files accessible only through bespoke code. Kind of the Dark Ages. Have a look at WebSocket, and see where the rest of us are going. No more disconnected client edit screens; just call the validation from the database from any client application. Complete flexibility, on both the server and client side. Separation of responsibility. And all that nice rhetoric that OO folks espouse, but seldom perform. You’re not “reinventing the wheel”, just rediscovering the square one that your grandfather used. That’s not progress.
I recognize it is unusual to see a history buff of my sort in application development but.....

 Robert Young is wrong here on most counts, but not quite all.  Most particularly, he is wrong about the Goths.  The two most important books to read on the Goths are probably Peter Heather's book in the Peoples of Europe series and Herwig Wolfram's "The Roman Empire and Its Germanic Peoples."  Both books provide a view of the Goths well outside the standard dark-ages narrative that Young describes.  In fact both authors tend to stress the continuity of Roman customs, land titles, and administration when the Goths ruled Italy and Spain.  Sure, there was some (necessary) simplification of the laws and so forth, but the decay of Italy continued, as Wolfram notes, because the concentration of wealth was so high that it prevented effective taxation, and thus Ostrogothic Italy eventually fell to the Byzantines.  Wolfram also attributes the fall of the Western Empire to the same forces he claims doomed the Goths.

In Arms and Armor of the Medieval Knight, Miles and Paddock suggest that the Migration Ages were times when the Germanic peoples spread important metallurgical technologies across Europe, including the all-important technique of pattern-welding different forms of steel into weapons of far greater quality than the Romans managed.  (The next major advancement in metallurgy would have to wait until the conversion of Scandinavia allowed  Christians to import technologies for manufacture of homogeneous steel swords and other tools from Northern Europe.)

In fact, it's pretty clear, given that the Goths arose from within the Roman Empire and that at least one Gothic rebellion sought to obtain a Roman generalship for its leader, that the Goths were far less anti-Roman than the Romans were anti-Goth.....  A lot of this had to do with the Goths' early conversion to a form of Christianity known as Arianism, after its founder, Arius of Alexandria.  This led to several centuries of religious feuding between Nicene and Arian Christianity, which is quite clear from every author of anywhere near that time.

And so we have a partial rebuttal of the master narrative that the great, progressive, scientific Roman civilization was destroyed by the backwards, barbaric, superstitious Goths.

But narratives matter including in this discussion.  Just as the history is wrong, I think Young's central point is wrong too.  I have come to the conclusion that conservatism in design choices is a good thing, and that a post-modern approach means valuing even approaches that seem at the moment to be outmoded.  The real question is not which method is best, but rather in which use cases one method or another wins out.

In other words, to me there is no progress of methodologies, innovation is a dirty word, and we should be sceptical of new, overly hyped technologies, looking at them through a lens of what has been learned in the past.  From this viewpoint, I have to say that I think accusing people of getting in the way of progress is actually paying them a compliment, for in the words of Henry Spencer, "Those who do not understand Unix are condemned to reinvent it, poorly."

Where Mr. Young is right is in the separation of responsibility point.  This point cannot be overemphasized.  The idea that a database takes on responsibility for storing meaningful data is something no database that is both non-relational and non-dedicated can manage.  The only use cases where one can avoid this are things like LDAP (which is a horror for reasons I won't go into here).  And therefore relational databases usually win out, again, for reasons pertaining to the features Marston doesn't use.

Marston's Points in "Stored Procedures Are Evil"


In this article, Marston oversimplifies the question of where and when to use stored procedures into two very extreme positions, entirely erasing any middle ground:

Use stored procedures and triggers only when it is an absolutely necessity.

vs.
Use stored procedures and triggers at every possible opportunity simply because you can.

He argues that his knowledge is better because:
You only know what you have been taught, whereas I know what I have learned.
 Well, I used to agree with Marston.  I have since come to learn that he is wrong.  Perhaps it is this change of opinion which qualifies me to look at his arguments and address them, acknowledging where he has a point.

Before I start, though, I will say that people who use stored procedures at every possible opportunity (and I have met a few) do cause problems and that such is not a viable position.  The obvious middle ground is:
Use stored procedures wherever they make sense given the functions of the database
 But that may be too nuanced for Marston.....  Presumably if we look at his arguments against stored procedures it will make more sense.

First though let's look at what he says about arguments in favor of stored procedures:

This is a common argument that many people echo without realising that it became defunct when role-based security was made available. A good DBA defines user-roles in the database, and users are added to those roles and rights are defined per role, not per user. This way, it is easy to control which users can insert / update and which users can for example select or delete or have access to views in an easy way.

There are certain cases where role-based security doesn't quite get you there.   Now these may not impact all applications, but they do affect a significant set of them.  The big one is something like "users of role X must be given permission to mark transactions as approved, but cannot mark approved transactions as unapproved."  Because the permission depends on the value being written, you cannot enforce it with role-based security alone.  In these cases, the only secure way to grant access is through a stored procedure.
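
A rough sketch of how that looks in PostgreSQL, with invented table, role, and function names:

    CREATE TABLE transactions (
        id       serial  PRIMARY KEY,
        approved boolean NOT NULL DEFAULT false
    );

    CREATE ROLE approver;

    -- The approver role is never granted UPDATE on the table; its only
    -- write path is this function, which can set approved to true but
    -- never back to false.  SECURITY DEFINER runs it with the owner's
    -- rights rather than the caller's.
    CREATE FUNCTION transaction__approve(in_id integer)
    RETURNS boolean LANGUAGE plpgsql SECURITY DEFINER AS $$
    BEGIN
        UPDATE transactions SET approved = true WHERE id = in_id;
        RETURN FOUND;
    END;
    $$;

    REVOKE EXECUTE ON FUNCTION transaction__approve(integer) FROM PUBLIC;
    GRANT  EXECUTE ON FUNCTION transaction__approve(integer) TO approver;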

He also says:
With a view it is possible to control which data is accessed on a column basis or row basis. This means that if you want user U to select only 2 or so columns from a table, you can give that user access to a view, not the underlying table. The same goes for rows in one or more tables. Create a view which shows those rows, filtering out others. Give access rights to the view, not the table, obviously using user-roles. This way you can limit access to sensitive data without having to compromise your programming model because you have to move to stored procedures.
The problem here is that with update permissions, views suffer from all the portability problems of stored procedures.   So this doesn't necessarily provide a significant win.
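
For readers who have not seen the technique Marston describes, a minimal sketch (names invented) looks like this; note that the moment you want such a view to be updatable, you are back to writing database-specific code:

    CREATE TABLE customers (
        id         serial  PRIMARY KEY,
        name       text    NOT NULL,
        cc_number  text,                           -- sensitive column
        is_private boolean NOT NULL DEFAULT false  -- sensitive rows
    );

    CREATE ROLE clerk;

    -- The clerk role sees only non-private rows, and never sees cc_number.
    CREATE VIEW visible_customers AS
        SELECT id, name FROM customers WHERE NOT is_private;

    GRANT SELECT ON visible_customers TO clerk;  -- no grant on customers itself
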
He further argues:
It is also said that stored procedures are more secure because they prevent SQL injection attacks. This argument is false for the simple reason that it is possible to have a stored procedure which concatenates strings together and therefore open itself up to sql injection attacks (generally seen in systems which use procedures and have to offer some sort of general search routine), while the use of parameterized queries removes this vulnerability as no value can end up as being part of the actually query string.
First, it's important to note that stored procedures get called through a SQL query like anything else, and therefore the underlying SQL query is still vulnerable to SQL injection attacks and these need to be addressed.  So parameterized queries still have to be used.  Similarly, Marston has a point about dynamic SQL in stored procedures (a problem which is actually more severe in some cases than he gives credit for).

This being said, it is simpler to understand where the potential issues are in stored procedures, and it is simpler to audit them for problems than it is to deal with either dynamic SQL or generated SQL based on the sorts of mappers that Marston suggests.
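
To illustrate the concession above, here is a hedged sketch (reusing the invented customers table from the previous example) of the kind of generic search routine that gets people into trouble, and the dynamic-SQL form that does not:

    -- Injectable: the caller's string is pasted into the query text.
    CREATE FUNCTION find_customers_unsafe(in_name text)
    RETURNS SETOF customers LANGUAGE plpgsql AS $$
    BEGIN
        RETURN QUERY EXECUTE
            'SELECT * FROM customers WHERE name LIKE ''%' || in_name || '%''';
    END;
    $$;

    -- Safe: the string is passed as a bound parameter via USING, so it
    -- can never become part of the query text.
    CREATE FUNCTION find_customers(in_name text)
    RETURNS SETOF customers LANGUAGE plpgsql AS $$
    BEGIN
        RETURN QUERY EXECUTE
            'SELECT * FROM customers WHERE name LIKE ''%'' || $1 || ''%'''
            USING in_name;
    END;
    $$;

The same lesson applies to the application's own call: SELECT * FROM find_customers($1) with a bound parameter, never a concatenated string.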

As for performance, Marston states:

The execution of SQL statements in stored procedures may have been faster than with dynamic SQL in the early days of database systems, but that advantage has all but disappeared in the current versions. In some cases a stored procedure may even be slower than dynamic SQL, so this argument is as dead as a Dodo.

My general experience is that the more dynamic the SQL is, the harder it is to tune, because the more abstraction layers one has to go through in order to find the problems.  ORMs are usually horrible in this regard because they typically issue a very large number of operations where a smaller number would suffice, and this is usually not fixable.   Similarly, dynamic SQL, where the whole query is created at run time, imposes additional overhead when trying to locate and correct performance issues.

I have seen horrible performance from stored procedures, but I have also seen how much easier they are to tune (when written in a maintainable way).

One thing to keep in mind though is that stored procedures are not necessarily more or less efficient.  They do take skill and care to write so that they perform well under demanding cases.  There are some additional gotchas to be aware of especially when folks who are more app authors than SQL authors start trying to write them.  However on the whole, I think this problem is manageable and the performance gains worth it.

He goes on to say:

Performance should not be the first question. My belief is that most of the time you should focus on writing maintainable code. Then use a profiler to identify hot spots and then replace only those hot spots with faster but less clear code. The main reason to do this is because in most systems only a very small proportion of the code is actually performance critical, and it's much easier to improve the performance of well factored maintainable code.

I completely agree with this btw.  Clear code is less costly to maintain than is unclear code.  And I have read horribly coded stored procedures just as I have read beautifully coded ones.  Focus on clear, maintainable code.  Keep things simple.

BTW this brings me to one clear advantage of stored procedures:  you can keep all your SQL code in different files from your higher tier programming logic.  This can be used to really help with maintainability.  Additionally I recommend the following for people writing stored procedures: "Use stored procedures, as much as possible, as named queries.  Keep them, as much as you can, to single SQL statements with named parameters."
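
A sketch of what I mean by a named query, again with invented names and reusing the customers table from the earlier example: a single statement, named parameters, and no control flow to audit:

    CREATE FUNCTION customer__search(in_name text, in_limit integer)
    RETURNS SETOF customers LANGUAGE sql STABLE AS $$
        SELECT *
          FROM customers
         WHERE name ILIKE '%' || in_name || '%'
         ORDER BY name
         LIMIT in_limit;
    $$;

The application-side call then reads like any other query, for example SELECT * FROM customer__search('smith', 25), and the SQL itself lives with the rest of the database code rather than being scattered through the upper tiers.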

His arguments against stored procedures become more interesting:

Instead of having a structure which separates concerns in a tried and trusted way - GUI, business logic and storage - you now have logic intermingling with storage, and logic on multiple tiers within the architecture. This causes potential headaches down the road if that logic has to change.

This intermingling goes both ways.  A clear stored procedure architecture, using stored procedures where they make sense, can have the effect of untangling your database queries from your application code, and thus making a multi-tiered architecture clearer and cleaner.  Of course, this isn't always the case, so common sense needs to be applied.  Again, use stored procedures where they make sense and simplify things, but don't use them just because you can, and don't avoid them just because you can.

He then goes on to talk about the maintenance issues of stored procedures:
The reason for this is that stored procedures form an API by themselves. Changing an API is not that good, it will break a lot of code in some situations. Adding new functionality or new procedures is the "best" way to extend an existing API. A set of stored procedures is no different. This means that when a table changes, or behaviour of a stored procedure changes and it requires a new parameter, a new stored procedure has to be added. This might sound like a minor problem but it isn't, especially when your system is already large and has run for some time. Every system developed runs the risk of becoming a legacy system that has to be maintained for several years. This takes a lot of time, because the communication between the developer(s) who maintain/write the stored procedures and the developer(s) who write the DAL/BL code has to be intense: a new stored procedure will be saved fine, however it will not be called correctly until the DAL code is altered. When you have Dynamic SQL in your BL at your hands, it's not a problem. You change the code there, create a different filter, whatever you like and whatever fits the functionality to implement.

I don't understand this one at all.  If the stored procedures form an API and changing it breaks things, why don't the tables form an API which breaks things when changed?  Of course, Marston talks about his own approach here, which does dynamic discovery.  But in that case.....  LedgerSMB does dynamic discovery on stored procedures.  So it isn't clear to me that this is something specific to one side or the other.

On to the next one:
Business logic in stored procedures is more work to test than the corresponding logic in the application. Referential integrity will often force you to setup a lot of other data just to be able to insert the data you need for a test (unless you're working in a legacy database without any foreign key constraints). Stored procedures are inherently procedural in nature, and hence harder to create isolated tests and prone to code duplication. Another consideration, and this matters a great deal in a sizable application, is that any automated test that hits the database is slower than a test that runs inside of the application. Slow tests lead to longer feedback cycles.
I haven't found that to be true at all.  In fact coming up with good unit tests for database-driven applications usually means some sort of sample data set anyway.  A very clear win additionally is the ability to run tests on production systems in transactions that are guaranteed to roll back.  Testing this on a live system via application logic is a lot more dangerous, and more likely to insert test data into the database by mistake.

If all the business logic is held in the database instead of the application then the database becomes the bottleneck. Once the load starts increasing the performance starts dropping. With business logic in the application it is easy to scale up simply by adding another processor or two, but that option is not readily available if all that logic is held in the database.
 This is not necessarily the case.  If what you put in stored procedures is work the database would already be doing, there is no reason to think the database becomes more of a bottleneck than it would have been anyway.

This is a big issue if you want an application where the customer can insert their own business logic, or where different logic is required by different customers. Achieving this with application code is a piece of cake, but with database logic it is a can of worms.
 He speaks authoritatively here of something he professes at the beginning of the article not to really know well.  The fact is that adding custom logic is *different* in a stored-procedure-centered application than it is in an application-code-centered one.  There are ways of doing it right and managing the problem, just as there are ways of doing it on the application side incorrectly and making a mess.

The next point made me laugh:
A big problem with database triggers is that the application does not know that they exist, therefore does not know whether they have run or not. This became a serious issue in one application (not written by me) which I was maintaining. A new DBA who was not aware of the existence of all these triggers did something which deactivated every trigger on the main database. The triggers were still there, they had not been deleted, but they had been turned off so did not fire and do what they were supposed to do. This mistake took several hours to spot and several days to fix.
What competent DBA disables triggers without figuring out what they do first?  However, for all of that, he does have a point.  Triggers can be used incorrectly, to ensure that an application operation has a desired side effect, and that is not a very maintainable way to go.  It is generally better to use triggers as a way to hook into a host application, either to enforce referential integrity or to allow a third party application to force side effects that it needs.  The main application should not depend on triggers to run correctly.  Only other applications should depend on the triggers they themselves provide, whether for information about what is going on in the host application or as additional safety measures aimed at protecting data integrity.

As for his points about version control, all I can say is "test cases, test cases, test cases, all in transactions that roll back."  You might not know which version of your stored procedures you have, but you WILL know that they are behaving the way you want them to.
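
The pattern is trivially simple; here is a sketch using the invented approval function from earlier as the thing under test:

    BEGIN;

    -- exercise the code under test
    INSERT INTO transactions DEFAULT VALUES;
    SELECT transaction__approve(currval('transactions_id_seq')::integer);

    -- a crude assertion; a real suite would use pgTAP or similar
    SELECT CASE WHEN approved THEN 'ok' ELSE 'NOT ok' END AS result
      FROM transactions
     WHERE id = currval('transactions_id_seq');

    ROLLBACK;   -- even on a production database, nothing above survives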

Finally to the point of vendor lock-in.

Here Marston has a clear point.  If you build a stored-procedure-centric application it will be difficult (though not necessarily impossible depending on various factors) to migrate to another database.  Of course, migrating applications between databases is not necessarily a walk in the park either (Oracle empty string handling comes to mind).

In general, if you are writing on an Oracle db, it is feasible to migrate stored procedures  perhaps to Fyracle and PostgreSQL, but less feasible to migrate to MS SQL.  If you are writing Java stored procedures in Oracle, then you can migrate these to PostgreSQL without too much difficulty.  If you are writing PL/Perl stored procedures in PostgreSQL, you will probably never be able to migrate them anywhere.  On the other hand if you are writing an application to ship to customers and it must run on Oracle, DB2, MS SQL, and PostgreSQL, then stored procedures are out.  In general, this comes back to "know your technology and choose it appropriately."

In conclusion, there are certainly cases where you don't want to use stored procedures, for example, because you don't want to be tied to a single RDBMS.  However this is no reason to have a blanket avoidance strategy any more than it is a good idea to put everything possible inside the db.

Marston concludes by saying that stored procedures are optional and therefore his view is not clearly wrong, but calling something optional and calling something evil are very different.  I think Marston is clearly wrong with the "evil" categorization, but generally right with the optional one.  For his specific use case, stored procedures may not work very well, and hence might be avoided.  However, generalizing that out is dangerous.

Thursday, January 12, 2012

Thoughts on what to put in the database

It seems if you ask three database developers what business logic should be put in the database, you will get (at least!) three answers.  Having read Andrew Dunstan's blog post on the subject, as well as Tony Marston's advocacy pieces of putting no business logic in the database, I will give my viewpoint.  Here it is:
All logic pertaining to storing, manipulating stored data, retrieving, and presenting data in a relational format belongs in the database provided it can be reduced to atomic calls, but all logic of providing the data to the user, accepting data from the user, and providing workflows belongs in upper layers.
Really that's it.  If you want to send an email, don't do that in the database (there are a billion reasons why). If you want to generate HTML documents, the database would not be my first choice of where to put that either.  There are a couple of other ways to look at this, though.  Here is a previous version of my view, and why I have changed my mind, slightly narrowing the field:

There is a difference between business logic inherent in the data and business logic regarding what you do with the data. Inherent logic belongs in the database.  Use-based logic belongs in the application.
This doesn't work quite as well in practice as it does in theory, because it is a bit fuzzy and arguably overinclusive.  Things like converting date formats, for example, could be argued to be inherent logic, but flooding the database with round-trip calls for this would not be an efficient use of resources.  Consequently this view only works when narrowly interpreted, and where data formatting concerns are treated as use-based logic.   In other words, we end up getting back to something like my current position.

The second issue is that this view is slightly underinclusive.  If I want to ensure that an email is always sent when certain criteria are met, that logic has to be tied to the database, and there is a correct way to do this in PostgreSQL (see below).  The actual email would not be sent from the database, but information regarding sending it would be stored and made available on transaction commit.

Remember the goal here is to have an intelligent database which exists at the center of the application environment, providing data services in a flexible yet consistent way.

Back in the LedgerSMB 1.2 days, I wrote a sample utility that would listen for updates and, where a part dropped below its reorder point, would send an email to a specified individual.  This is the correct approach for a couple of reasons, but the utility still had significant room for improvement.   There are two specific problems with the approach taken there:

  1. The trigger was supplied as part of the main database setup scripts which meant that since the trigger logic was incomplete (see below), the utility would necessarily send out duplicate information on each email.   Such utilities should actually be self-contained and supply their own triggers.
  2. The trigger was inadequate, simply raising a NOTIFY when a part became short.  The trigger should have taken the information from the new row and inserted it into a holding table which the email script could then clear.
So if we look at the way something like this should function transactionally, we get something like this:
  1. An invoice is issued, leaving fewer parts on hand than the reorder point.  A summary of the warning information is stored in a queue table for email.
  2. A notification is raised.
  3. The invoice transaction commits, making the notification visible to the listening application.
  4. The listening application receives the notification and checks the queue table.
  5. It prepares and sends the email, deleting records from the queue table as they are processed.
  6. It checks again for new records, commits its transaction, and waits for new notifications.
This means you have two asynchronous loops going on, coordinated by transactional controls on the part of the database.   The invoice could also be queued, and another loop could be used to generate a printed document to be printed automatically and sent out to the customer.  All manner of other things could be automated in such a way that other scripts which import invoices would not interrupt any of these loops.  Moreover, these can be added without disturbing the original application, simplifying testing and QA.
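
For the database side, a rough sketch of the improved design (table and trigger names invented for the example) might look like this:

    -- A simplified parts table for this sketch.
    CREATE TABLE parts (
        id            serial  PRIMARY KEY,
        sku           text    NOT NULL,
        onhand        numeric NOT NULL,
        reorder_point numeric NOT NULL
    );

    -- Queue table the email script drains; rows survive until processed.
    CREATE TABLE part_reorder_queue (
        part_id   integer   NOT NULL REFERENCES parts (id),
        onhand    numeric   NOT NULL,
        queued_at timestamp NOT NULL DEFAULT now()
    );

    CREATE FUNCTION queue_reorder_warning() RETURNS trigger
    LANGUAGE plpgsql AS $$
    BEGIN
        IF NEW.onhand < NEW.reorder_point THEN
            INSERT INTO part_reorder_queue (part_id, onhand)
                 VALUES (NEW.id, NEW.onhand);
            NOTIFY part_reorder;
        END IF;
        RETURN NULL;
    END;
    $$;

    CREATE TRIGGER parts_reorder_check
        AFTER UPDATE OF onhand ON parts
        FOR EACH ROW EXECUTE PROCEDURE queue_reorder_warning();

The listening script simply runs LISTEN part_reorder and, each time a notification arrives after an invoice commits, reads and deletes rows from part_reorder_queue in its own transaction, which is exactly the pair of asynchronous loops described above.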

Monday, January 2, 2012

A Reply to Tony Marston's advocacy of unintelligent databases

Tony Marston has published an interesting critique of my posting about why intelligent databases are helpful.  The response is thought-provoking and I suggest my readers read it.  Nonetheless I believe that the use cases where he is correct are becoming narrower over time, and so I will explain my thoughts here.

A couple of preliminary points must be made though.  There is a case to be made for intelligent databases, for NoSQL, and for all sorts of approaches.  None of these are as close to a magic bullet as some proponents would like to think, and there will be cases where each approach wins out, because design is the art of endlessly making tradeoffs.  My approach has nothing to do with what is proper, and has to do instead with preserving the ability to use the data in other ways for other applications, as I see this as the core strength of relational database management systems.  Moreover, I used to agree with Tony, but I have since changed my mind, primarily because I have begun working with data environments where the stack assumption for application design doesn't work well.  ERP is a great example, and I will go into why below.

Of course in some cases data doesn't need to be reused, and an RDBMS may still be useful.  This is usually the case where ad-hoc reporting is a larger requirement than scalability and rapid development.  In other cases where data reuse is not an issue an RDBMS really brings nothing to the table and NoSQL solutions of various sorts may be better.

My own belief is that a large number of database systems out there operate according to the model of one database to many applications.  I also think that the more that this model is understood the more that even single-application-database designers can design with this in mind or at least ask if this is the direction they want to go.

And so with the above in mind, on to the responses to specific points:

Well, isn't that what it is supposed to be? The database has always been a dumb data store, with all the business logic held separately within the application. This is an old idea which is now reinforced by the single responsibility principle. It is the application code which is responsible for the application logic while the database is responsible for the storing and retrieval of data. Modern databases also have the ability to enforce data integrity, manage concurrency control, backup, recovery and replication while also controlling data access and maintaining database security. Just because it is possible for it to do other things as well is no reason to say that it should do other things.

The above opinion works quite well in a narrow use case, namely where one application and only one application uses the database.  In this case, the database tables can be modelled more or less directly after the application's object model and then the only thing the RDBMS brings to the table is some level of ad hoc reporting, at the cost of added development time and complexity compared to some NoSQL solutions.  (NoSQL solutions are inherently single application databases and therefore are not usable where the considerations below exist.)

The problem is that when you start moving from single applications into enterprise systems, the calculation becomes very different.  It is not uncommon to have several applications sharing a single database, and the databases often must be designed to make it quick and easy to add new applications on top of the same database.

However as soon as you make this step something awful happens when you try to use the same approach used for single application databases:  because the database is based on the first application's object model, all subsequent applications must have intimate knowledge of the first application's object model, which is something of an anti-pattern.

A second major problem also appears at the same time: while a single application can probably be trusted to enforce meaningful data constraints, applications can't and shouldn't trust each other to input meaningful data.  The chance of oversights, where many applications are each responsible for checking all inputs for sanity and ensuring that only meaningful data is stored, is very high, and the consequences can be quite severe.  Therefore things like check constraints suddenly become irreplaceable when one makes the jump from one data entry application to two against the same database.


Just because a database has more features does not mean that you should bend over backwards to use them. The English language contains a huge number of words that I have never used, but does that make my communication with others unintelligible? I have used many programming languages, but I have rarely used every function that has ever been listed in the manual, but does that mean that my programs don't work? With a relational database it is possible to use views, stored procedures, triggers or other advanced functionality, but their use is not obligatory. If I find it easier to do something in my application code, then that is where I will do it.

Indeed.  However, if you aren't thinking in terms of what the RDBMS can do, but rather just in terms of it as a dumb storage layer, you will miss out on these features when they are of benefit.  Therefore it is important to be aware of the costs of this when asking "should my database be usable by more than one application?"

That question becomes surprisingly useful to ask when asking questions like "should data be able to be fed into the database by third party tools?"

Where did this idea come from? The application has to access the database at some point in time anyway, so why do people like you keep insisting that it should be done indirectly through as many intermediate layers as possible? I have been writing database applications for several decades, and the idea that when you want to read from or write to the database you should go indirectly through an intermediate component instead of directly with an SQL query just strikes me as sheer lunacy. This is a continuation of the old idea that programmers must be shielded completely from the workings of the database, mainly because SQL is not objected oriented and therefore too complicated for their closed minds. If modern programmers who write for the web are supposed to know HTML, CSS and Javascript as well as their server-side language, then why is it unreasonable for a programmer who uses a database to know the standard language for getting data in and out of a database?

The idea comes from the necessity of ensuring that the database is useful to more than one application.  Once that happens, the issue becomes one of too much intimate knowledge of data structures shared between components.

However one thing I am not saying is that SQL-type interfaces are categorically out.  It would be a perfectly reasonable approach to encapsulation to build updatable views matching the object models of each application, and indeed reporting frameworks perhaps should be built this way to the extent possible, at least where multiple applications share a database.  The problem this solves is accommodating changes to the schema required by one application but not required by a second application.  One could even use an ORM at that point.
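
As a sketch of that encapsulation approach (all names invented): the second application gets a "customer" shaped the way its object model expects, while the storage underneath stays normalized and shared:

    CREATE TABLE contact (
        id   serial PRIMARY KEY,
        name text   NOT NULL
    );

    CREATE TABLE contact_credit (
        contact_id   integer NOT NULL REFERENCES contact (id),
        credit_limit numeric NOT NULL DEFAULT 0
    );

    -- The view is the second application's "customer" object.
    CREATE VIEW app2_customer AS
        SELECT c.id, c.name, cc.credit_limit
          FROM contact c
          JOIN contact_credit cc ON cc.contact_id = c.id;

    -- An INSTEAD OF trigger makes the view writable in the shape the
    -- application expects, hiding the underlying two-table split.
    CREATE FUNCTION app2_customer_ins() RETURNS trigger
    LANGUAGE plpgsql AS $$
    BEGIN
        INSERT INTO contact (name) VALUES (NEW.name)
             RETURNING id INTO NEW.id;
        INSERT INTO contact_credit (contact_id, credit_limit)
             VALUES (NEW.id, COALESCE(NEW.credit_limit, 0));
        RETURN NEW;
    END;
    $$;

    CREATE TRIGGER app2_customer_insert
        INSTEAD OF INSERT ON app2_customer
        FOR EACH ROW EXECUTE PROCEDURE app2_customer_ins();

When one application's requirements change, you change its views, not every other application's code.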

If your programmers have to spend large amounts of time in writing boilerplate code for basic read/write/update/delete operations for your database access then you are clearly not using the right tools. A proper RAD (Rapid Application Development) toolkit should take care of the basics for you, and should even ease the pain of dealing with changes to the database structure. Using an advanced toolkit it should possible to create working read/write/update/delete transactions for a database table without having to write a single line of SQL. I have done this with my RADICORE toolkit, so why can't you? Simple JOINs can be handled automatically as well, but more complex queries need to be specified within the table class itself. I have implemented different classes for different database servers, so customers of my software can choose between MySQL, PostgreSQL, Oracle and SQL Server.

How does this work though when you have multiple applications piping heavy read/write workloads through the same database, but where each application has unique data structure requirements and hence its own object model?

The very idea that your software data structure should be used to generate your database schema clearly shows that you consider your software structure, as produced from your implementation of Object Oriented Design (OOD) to be far superior to that of the database.....

It seems I was misunderstood.  I was saying that the generation of database structures from Rails is a bad thing.  I am in complete agreement with Tony on that part I think.

You may think that it slows down development, but I do not. You still have to analyse an application's data requirements before you can go through the design process, but instead of going through Object Oriented design as well as database design I prefer to ignore OOD completely and spend my time in getting the database right. Once this has been done I can use my RAD toolkit to generate all my software classes, so effectively I have killed two birds with one stone. It also means that I don't have to waste my time with mock objects while I'm waiting for the database to be built as it is already there, so I can access the real thing instead of an approximation. As for using an ORM, I never have the problem that they were designed to solve, so I have no need of their solution.

This is an interesting perspective.   I was comparing it though to initial stages of design and development in agile methodologies.

Here's the basic thing.  Agile developers like to start out, when virtually nothing is known about the requirements, with a prototype designed to flesh those requirements out, and then refine that prototype until you get something that arguably works.

I am saying that, at least in the beginning, asking what the data relating to the topics being entered actually is, and how to model it neutrally of business rules, has a cost in developer time.  Sometimes, if a requirement to collect, say, addresses is dropped, that time will be entirely lost, so it is not really possible to say that such engineering effort is always a clear win.

I would agree though that the time spent there on looking at the data, modelling issues, and normalization is usually recouped later, so I suppose we are in agreement.


This is another area where OO people, in my humble opinion, make a fundamental mistake. If you build your classes around the database structure and have one class per table, and you understand how a database works, you should realise that there are only four basic operations that can be performed on a database table - Create, Read, Update and Delete (which is where the CRUD acronym comes from). In my framework this means that every class has an insertRecord(), getData(), updateRecord() and deleteRecord() method by default, which in turn means that I do not have to waste my time inventing unique method names which are tied to a particular class. Where others have a createCustomer(), createProduct(), createOrder() and createWhatever() method I have the ubiquitous insertRecord() method which can be applied to any object within my application. This makes use of the OO concept of polymorphism, so it should be familiar to every OO programmer. Because of this I have a single page controller for each of the methods, or combination of methods, which can be performed on an object, and I can reuse the same page controller on any object within my application. I have yet to see the same level of reuse in other application frameworks, but isn't a high level of code reuse what OOP is supposed to be about?
I am often told "but that is not the way it is done!" which is another way of saying "it is not the way I was taught". This tells me that either the critic's education was deficient, or he has a closed mind which is not open to other techniques, other methods or other possibilities.
The problem occurs when CRUD operations are not really as simple as this sounds, and where they must have complex constraints enforced from multiple data entry applications.  A good example here is posting GL transactions, where each transaction must be balanced.  This is far simpler to do in a stored procedure than anything else, because the set has certain emergent constraints that apply beyond the scope of a single record.  Also, if you are abstracting the application interface from the low-level storage, stored procedures may be necessary as part of the mapping of views to relations.
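
A sketch of what the stored procedure route looks like, assuming journal_entry and journal_line tables like the invented ones in the balanced-transaction example further up the page: the whole entry is handed over as a set, checked, and posted atomically or not at all.

    CREATE FUNCTION journal__post(
        in_description text,
        in_accounts    text[],
        in_amounts     numeric[]
    ) RETURNS integer LANGUAGE plpgsql AS $$
    DECLARE
        new_id integer;
    BEGIN
        -- The constraint is emergent on the set, not on any single row.
        IF (SELECT COALESCE(sum(a), 0) FROM unnest(in_amounts) AS a) <> 0 THEN
            RAISE EXCEPTION 'transaction does not balance';
        END IF;

        INSERT INTO journal_entry (description)
             VALUES (in_description)
          RETURNING id INTO new_id;

        INSERT INTO journal_line (entry_id, account, amount)
             SELECT new_id, in_accounts[i], in_amounts[i]
               FROM generate_subscripts(in_accounts, 1) AS i;

        RETURN new_id;
    END;
    $$;

Any client, in any language, can then call SELECT journal__post('Office supplies', ARRAY['6100','1000'], ARRAY[50.00,-50.00]) and either gets a balanced posting or an error, never half a transaction.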


I can even make changes to a table's structure, such as adding or deleting a column, or changing a column's size or type, without having to perform major surgery on my table class. I simply change the database, import the new structure into my data dictionary, and then export the changes to my application. I don't have to change any code unless the structural changes affect any business rules. My ERP application started off with 20 database tables but this has grown to over 200 over the years, so I am speaking from direct experience.

So if you are doing an ERP application and at the same time putting all business logic in your application, you do not expect any third party applications to hit your database, right?  So really you are after the same basic considerations I am arguing for; the difference is that I make the database itself the abstraction layer rather than encapsulating the database inside an abstraction layer.

If we think about it this way, it depends on what platform you are developing on.  We like to have the database be that platform, and that means treating interfaces to it as an API.  Evidently you prefer to force all access to go through your application.  I think we get a number of benefits from our approach, including language neutrality for third party components, better attention to performance on complex write operations, and more.  That there is a cost cannot be argued with, however.

I would conclude by saying that we agree on a fair bit of principles of database design.  We agree normalization is important, and I would add that this leads to application-neutral data storage.  Where we disagree is where the API/third party tool boundary should be.  I would prefer to put in the database whatever seems to belong to the general task of data integrity, storage, retrieval, and structural presentation while leaving what we do with this data to the multiple third party applications which utilize the same database.