I am still a novice when it comes to the technical underpinnings of databases and Hadoop. So, I thought it might be useful if I just asked if my thought on the future of databases is correct.
Basically, “Does Hadoop signal the end of the database as we know it?”
Here’s where this question comes from:
I work for Netezza, which makes blazingly fast data warehouse appliances. At the heart of the appliance is a Postgres database. But due to the appliance architecture (and, I think, its sheer speed) you don’t have to do the usual things you have to do to make databases work, such as tuning, indexing, and so forth (indeed, we have a long list of “no”s that drive database folks crazy, as in “How can you not do that?”).
That got me thinking. Our appliance has changed the need for coddling databases. Indeed, weren’t databases created to make it easier for (what back then were) slow computers to handle large amounts of data, and all the coddling is to compensate for weak hardware? Would we need databases if it didn’t matter how the data was structured, as long as we had a fast search and processing of the data?
Segue to Hadoop
Lately, at work we’ve been talking about Hadoop, hearing folks actually NOT wanting to have a structured database. And we see folks with large amounts of data in Hadoop, just throwing more processing power at the data when needed.
Following that thread, I started wondering if the evolution of tools like Hadoop might make structured databases obsolete*, that it really doesn’t matter how the data is structured, just so long as we can find it. And the processing issues are obviated by just throwing more processing nodes at it.**
So, teach me:
Where am I wrong in this thought thread? Will data always need to be structured somehow for computing purposes? How much of the structured data world can Hadoop gobble up (though the unstructured data world must be larger than the structured data world, right?)?
What do you think?
*Of course, just like folks are still using VAX, databases will really never disappear. When a technology is displaced, it usually doesn’t disappear, just gets relegated to a different niche.
**Do you still keep things in folders? I only do when I don’t have a good search tool. On my Mac, I use Spotlight to find and open anything, rather than searching through folders. Indeed, everything usually goes into one folder, unless I need to separate something for follow-up on the desktop (so, OK, folders do not go away altogether). Nonetheless, search has replaced most of what I would use folders for.
Good post.
That unfortunately makes sense.
Processing power is a cheaper solution than a well-structured model.
I suppose it depends on how much data we’re talking about, and how each search is performed, but for a fast response there will always need to be an index, and consequently a basic structure.
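To make the index point concrete, here is a rough Python sketch (not tied to Hadoop or any particular database). The records, field names, and helper functions are all made up for illustration. It compares scanning every record on each query with looking the answer up in a structure built once ahead of time.

```python
from collections import defaultdict

# Hypothetical records; the field names are illustrative only.
records = [
    {"id": 1, "city": "Boston", "amount": 40},
    {"id": 2, "city": "Austin", "amount": 15},
    {"id": 3, "city": "Boston", "amount": 25},
]

def scan(rows, city):
    """Full scan: every query touches every row, so cost grows with the data."""
    return [r["id"] for r in rows if r["city"] == city]

def build_index(rows, field):
    """Build, once, a mapping from field value to the ids of matching rows."""
    index = defaultdict(list)
    for r in rows:
        index[r[field]].append(r["id"])
    return index

def lookup(index, value):
    """Indexed lookup: a single hash lookup, independent of the row count."""
    return index.get(value, [])

city_index = build_index(records, "city")

scan_result = scan(records, "Boston")
index_result = lookup(city_index, "Boston")
```

Both calls return the same ids. The difference is that the scan gets slower as the data grows (unless you add nodes), while the lookup stays cheap because a little structure was imposed up front.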
I too rely on Spotlight, but I still keep things in folders since I often have to transfer specific sets of data.
As far as I’m aware, the human brain is erratically ‘structured’, and we seem to be doing alright.
No.
What Hadoop and other NoSQL solutions, such as Cassandra, do is let you trade certain properties of SQL databases for other properties, which are turning out to be very useful in a lot of cases.
Interestingly, there are even more structured databases, such as Neo4j, which are part of the NoSQL movement but are designed to provide very efficient graph representations.
However, data modeling isn’t the same thing as data searching; how you store the data is different from how the data is indexed, and indexing is what searching typically relies on. E.g. Cassandra is very efficient at storing very large amounts of data and at writing and reading it really fast, yet the query language used to search the contents is very similar to SQL.
Hadoop and Cassandra and Riak and MongoDB *are* databases. They just enforce different kinds of limits, and relax other kinds of limits, when compared to Postgres or MySQL. Mostly they remove the need to fix your data structure in advance at the database level, which is really useful for programmers who need to constantly adjust their schemas as the application grows. It allows for more agile development.
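As a rough illustration of that schema tradeoff, here is a small Python sketch. It uses the standard-library `sqlite3` module as a stand-in for a fixed-schema SQL database and a plain dict as a stand-in for a document-style store. The table, the field names, and the dict "store" are assumptions made up for the example, not how any of the products above actually work internally.

```python
import sqlite3

# Fixed schema: the columns are declared up front.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (id, name) VALUES (1, 'Ann')")

# A record with a new field is rejected until the schema is changed.
try:
    conn.execute(
        "INSERT INTO users (id, name, twitter) VALUES (2, 'Bob', '@bob')"
    )
    relational_needs_migration = False
except sqlite3.OperationalError:
    relational_needs_migration = True
    conn.execute("ALTER TABLE users ADD COLUMN twitter TEXT")
    conn.execute(
        "INSERT INTO users (id, name, twitter) VALUES (2, 'Bob', '@bob')"
    )

rows = conn.execute(
    "SELECT id, name, twitter FROM users ORDER BY id"
).fetchall()

# Document-style stand-in: each record carries its own set of fields,
# so a new field needs no change at the "database" level.
documents = {}
documents[1] = {"name": "Ann"}
documents[2] = {"name": "Bob", "twitter": "@bob"}
```

In the SQL case the structure is enforced by the database and has to be migrated. In the document case the structure still exists, but it lives in the application code that reads and writes the records.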
But whether this has anything to do with searchability or indexing? Nah. It’s all just a series of tradeoffs.