So one of the clients I do work with is moving a large database from PostgreSQL to Hadoop. The reasons are sound -- volume and velocity are major issues for them, and PostgreSQL is not going away in their data center and in their industry there is a lot more Hadoop usage and tooling than there is PostgreSQL tooling for life science analytics (Hadoop is likely to replace both PostgreSQL and, hopefully, a massive amount of data on NFS). However this has provided an opportunity to think about big data problems and solutions and their implications. At the same time I have seen as many people moving from Hadoop to PostgreSQL as the other way around. No, LedgerSMB will never likely use Hadoop as a backend. It is definitely not the right solution to any of our problems.
Big data problems tend to fall into three categories, namely managing ever increasing volume of data, managing increasing velocity of data, and dealing with greater variety of data structure. It's worth noting that these are categories of problems, not specific problems themselves, and the problems within the categories are sufficiently varied that there is no solution for everyone. Moreover these solutions are hardly without their own significant costs. All too often I have seen programs like Hadoop pushed as a general solution without attention to these costs and the result is usually something that is overly complex and hard to maintain, may be slow, and doesn't work very well.
So the first point worth noting is that big data solutions are specialist solutions, while relational database solutions for OLTP and analytics are generalist solutions. Usually those who are smart start with the generalist solutions and move to the specialist solutions unless they know out of the box that the specialist solutions address a specific problem they know they have. No, Hadoop does not make a great general ETL platform.....
One of the key things to note is that Hadoop is built to solve all three problems simultaneously. This means that you effectively buy into a lot of other costs if you are trying to solve only one of the V problems with it.
The single largest cost comes from the solutions to the variety of data issues. PostgreSQL and other relational data solutions provide very good guarantees on the data because they enforce a lack of variety. You force a schema on write and if that is violated, you throw an error. Hadoop enforces a schema on read, and so you can store data and then try to read it, and get a lot of null answers back because the data didn't fit your expectations. Ouch. But that's very helpful when trying to make sense of a lot of non-structured data.
Now, solutions to check out first if you are faced with volume and velocity problems include Postgres-XL and similar shard/clustering solutions but these really require good data partitioning criteria. If your data set is highly interrelated, it may not be a good solution because cross-node joins are expensive. Also you wouldn't use these for smallish datasets either, certainly not if they are under a TB since the complexity cost of these solutions is not lightly undertaken either.
Premature optimization is the root of all evil and big data solutions have their place. However don't use them just because they are cool or new, or resume-building. They are specialist tools and overuse creates more problems than underuse.