Monday, April 7, 2008

Database virtualization

Virtualization is everywhere and everybody has already used OS virtualization through well-known commercial or open source products.

Obviously, virtualization is not limited to operating systems; it is now quickly expanding into various other software solutions.

One could see ORM or some EII solutions as data access virtualization or, more simply, Data Virtualization. To some extent, Data Services are all about Data Virtualization, and we'll discuss this in detail in this blog in the future.

But if we go to the deeper layer of Data Virtualization, one might think we'll first have to deal with Database Virtualization. What's that? The ability to use several database engines spread over a grid. Storing data in a set of database servers raises some concerns compared to using a single database, as most current relational database technologies have been primarily designed as a single engine.

NB: some other database technologies have been designed to be natively distributed (like the Versant ODBMS, for instance), but this is not yet the common case.

Most traditional database features become problematic when thinking distributed:

  • Query engine and merge of separate result sets

  • Algorithms to determine in which database(s) to store a new record

  • Transaction management

  • Management and synchronization of a distributed database schema

  • Index management

  • Referential integrity

  • Security

  • Crash recovery

  • Replication, backups

Even administration and tuning become much more complicated.
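
To make the first item concrete, here is a minimal Python sketch of merging separate result sets. It assumes each database server has already returned its rows sorted by the same key; the sample rows and the function name merge_sorted_results are made up for illustration and do not come from any particular product.

```python
import heapq
from operator import itemgetter


def merge_sorted_results(result_sets, key):
    """Merge per-database result sets into one list sorted by `key`.

    Assumption: every input list is already sorted by `key`, as it would
    be if each server ran the same ORDER BY clause.
    """
    return list(heapq.merge(*result_sets, key=itemgetter(key)))


# Hypothetical rows returned by two servers for the same ordered query.
db1_rows = [{"id": 1, "name": "Ann"}, {"id": 4, "name": "Dan"}]
db2_rows = [{"id": 2, "name": "Bob"}, {"id": 3, "name": "Cid"}]

merged = merge_sorted_results([db1_rows, db2_rows], key="id")
print([row["id"] for row in merged])
```

Even this simple case hides real work: aggregates, joins across servers and LIMIT clauses would each need their own merge logic.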

So, what could be the potential benefits of a distributed database?

  • Lower costs (10 machines hosting 100 GB each are cheaper than one single machine hosting 1 TB). But is that true?

  • Better reliability and robustness. Could depend on how the schema is spread over the grid...

  • Any other good idea?

When splitting a schema across a set of database servers, there are basically two different partitioning strategies:

  • Vertical partitioning: Some tables (T11, T12...) are in DB1, other tables are in other DBs

  • Horizontal partitioning: the same table is stored in all the databases, and records are dispatched to the different databases based on various algorithms (business values, round robin...).
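
The two strategies can be sketched as small routing functions in Python. This is only an illustration under stated assumptions: the server names, the table-to-database map, the country rule and all function names are hypothetical, and a real system would also have to handle rebalancing and failures.

```python
import itertools

DATABASES = ["DB1", "DB2", "DB3"]  # hypothetical server names

# Vertical partitioning: each table lives in exactly one database.
TABLE_TO_DB = {"T11": "DB1", "T12": "DB1", "T21": "DB2"}


def route_vertical(table):
    """Return the database that hosts the whole table."""
    return TABLE_TO_DB[table]


# Horizontal partitioning by business value: here, a made-up rule that
# dispatches customer records by country, with a default server.
COUNTRY_TO_DB = {"FR": "DB1", "DE": "DB2"}


def route_by_country(record):
    """Pick a database from a business value carried by the record."""
    return COUNTRY_TO_DB.get(record["country"], "DB3")


def make_round_robin(databases):
    """Horizontal partitioning by round robin: cycle through the servers."""
    cycle = itertools.cycle(databases)
    return lambda: next(cycle)


next_db = make_round_robin(DATABASES)
print(route_vertical("T12"))
print(route_by_country({"id": 7, "country": "DE"}))
print([next_db() for _ in range(4)])
```

Note the trade-off: routing by business value lets a query go straight to the right server, while round robin spreads the load evenly but forces reads to ask every server.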

The database virtualization could be managed at different levels:

  • Database engine level. I'm sure RDBMS leaders are working on such projects; I'd like to find information about them.

  • Connectivity layer. This is an interesting place to manage distribution. See for instance the C-JDBC project initiated at ObjectWeb, and now continued under the Sequoia name at Continuent.

  • Data Service Platform. This one is also a good candidate, potentially in conjunction with the previous one.

  • Application level. I don't really believe in that one, as it must be a nightmare for business programmers to deal with all the complexity of distributed database programming. However, I'm sure some projects have been implemented like that.

If you are interested in Data Virtualization, DataDirect will run a series of Architect Tutorials on the subject in April.
