Monday, April 28, 2008

Looking forward to JPA2

A new article by Mike Keith (Oracle / TopLink). He introduces some of the cool new features of JPA2.

OpenXava

http://java.dzone.com/announcements/openxava-301-jpa-application-e

I like the notion of a fully automated UI tool that is not based on source-code generation, like NakedObjects, JMatter, etc.

XIC has its own solution for that: Navilis, a component bundled with Xcalia Studio. Its main goal is to quickly validate your business model and the corresponding data access layer (mappings, transformations, etc.). It is a bit like an object inspector (Smalltalk, Groovy) coupled with some tools and widgets specific to data access (a query builder, for instance). This kind of component is critical when following the MDA approach.
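
Tools in this family typically work by reflection: they walk the business classes and derive UI widgets from the fields, with no generated source. A minimal sketch of that first step (the class name and widget mapping are mine, not NakedObjects' or Navilis'):

```java
import java.lang.reflect.Field;
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal sketch of what an automatic UI tool does first: introspect a
// business class and derive widget descriptors from it (no code generation).
public class Inspector {
    // Maps each declared field name to a crude widget choice based on its type.
    public static Map<String, String> widgetsFor(Class<?> type) {
        Map<String, String> widgets = new LinkedHashMap<>();
        for (Field f : type.getDeclaredFields()) {
            Class<?> t = f.getType();
            String widget = (t == boolean.class) ? "checkbox"
                          : (Number.class.isAssignableFrom(t) || t.isPrimitive()) ? "spinner"
                          : "textfield";
            widgets.put(f.getName(), widget);
        }
        return widgets;
    }

    // A sample business class the "UI" is derived from.
    static class Customer {
        String name;
        int age;
        boolean active;
    }
}
```

The real tools obviously go much further (layout, relations, editing), but the point stands: the UI follows the model automatically, so validating the model is immediate.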

Distributed document databases

There is now a kind of global acknowledgement that RDBMSs are not well suited to all new business applications in an SOA. I've already mentioned new databases like SimpleDB or Vertica, the new project from Michael Stonebraker.

Here is a collection of links about DDDs (distributed document databases):

StrokeDB (for Ruby)

http://www.infoq.com/news/2008/04/distributed-db-strokedb

http://strokedb.com/

http://strokedb.com/euruko2008.pdf (a nice and innovative kind of presentation)

http://rashkovskii.com/articles/2008/2/14/strokedb-short-intro

CouchDB (written in Erlang)

http://www.infoq.com/news/2007/11/the-rdbms-is-not-enough

http://incubator.apache.org/couchdb/

http://incubator.apache.org/couchdb/docs/intro.html

http://incubator.apache.org/couchdb/docs/overview.html

RDDB (sub-project of CouchDB)

http://rddb.rubyforge.org/
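
The document model these engines share is easy to sketch: schemaless documents addressed by an id, with every write producing a new revision (as CouchDB's _id/_rev pair does). A toy in-memory version, purely illustrative:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the document model shared by these databases: schemaless
// key/value documents, each carrying an id and a revision counter that
// is bumped on every update (loosely modeled on CouchDB's _id/_rev).
public class DocumentStore {
    private final Map<String, Map<String, Object>> docs = new HashMap<>();
    private final Map<String, Integer> revisions = new HashMap<>();

    // Set one field of a document, creating the document if needed.
    public int put(String id, String field, Object value) {
        docs.computeIfAbsent(id, k -> new HashMap<>()).put(field, value);
        int rev = revisions.getOrDefault(id, 0) + 1;
        revisions.put(id, rev);
        return rev; // new revision number
    }

    public Object get(String id, String field) {
        Map<String, Object> d = docs.get(id);
        return d == null ? null : d.get(field);
    }

    public int revision(String id) { return revisions.getOrDefault(id, 0); }
}
```

No schema to declare, no ALTER TABLE when a document grows a new field: that flexibility is precisely what these engines trade against relational guarantees.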

Amazon adds persistent Storage to EC2

http://www.infoq.com/news/2008/04/amazon-storage

The competition between Amazon, Google and Microsoft around cloud computing is tough.

It is just raw storage space (essentially a file system) added to EC2; you can use it as you prefer, for instance to hold DBMS files. These volumes are persistent: they remain usable after the termination of an EC2 instance.

The link with S3 and SimpleDB is not so clear, except for backup usage.

http://aws.typepad.com/aws/2008/04/block-to-the-fu.html

JPOX becomes DataNucleus

http://www.theserverside.com/news/thread.tss?thread_id=49181

http://www.datanucleus.org/index.html

They will now add support for new kinds of data sources, such as LDAP, XML and Excel files, on top of the already supported RDBMSs and db4o.

But what is much more important is that they are moving towards a platform.

They are clearly going in the right direction, like TopLink or XIC. It is very exciting to see the market going where Xcalia (now DataDirect) has been investing since its inception. Having a data access platform is important because:

  • it can offer services to various environments (Java, .Net, Unix, etc.)

  • it can support various programming styles (traditional programming like Java or C#, BPM like BPEL, workflows, etc.)

  • mappings, transformations and metadata can be shared between applications

  • several standards (APIs, QLs, etc.) can be supported simultaneously

  • administration and tuning are simplified

  • the platform can progressively offer services which were historically only available in databases, like security, fault tolerance, replication, etc.

  • it is the best vehicle for data virtualization


NB: their new tagline is "Information at your Service", which is not too far from Xcalia's "Service your Data". It seems to me that the "service" part is actually still missing from their story, which makes a huge difference with the commercial solution from Xcalia.
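
The "one mapping, many stores" idea behind such a platform can be sketched as a thin store-agnostic interface that business code talks to, while the actual backend is chosen by configuration. The names below are mine, not DataNucleus APIs:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the "platform" idea: business code depends on one interface,
// while the backend (RDBMS, XML file, Excel sheet...) is swappable.
// Illustrative only; these are not DataNucleus or XIC APIs.
interface DataStore {
    void save(String id, String value);
    String load(String id);
}

// A toy backend standing in for an RDBMS.
class InMemoryRdbms implements DataStore {
    private final Map<String, String> rows = new HashMap<>();
    public void save(String id, String value) { rows.put(id, value); }
    public String load(String id) { return rows.get(id); }
}

// A toy backend standing in for an XML file store.
class XmlFileStore implements DataStore {
    private final StringBuilder xml = new StringBuilder();
    private final Map<String, String> index = new HashMap<>();
    public void save(String id, String value) {
        xml.append("<entity id='").append(id).append("'>")
           .append(value).append("</entity>");
        index.put(id, value);
    }
    public String load(String id) { return index.get(id); }
    public String asXml() { return xml.toString(); }
}
```

The value of the platform is everything that sits above this seam: shared mappings, shared metadata, shared administration, regardless of the backend.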

Thursday, April 17, 2008

Entity Framework Tutorial

A good introduction to the Entity Framework, by Julia Lerman (theDataFarm).

Another article, EDM: beyond the Basics seen on DataDeveloper.Net, by Matthieu Mezil.

And this introduction to EDM, by Michael Pizzo.

What strikes me with EF and EDM is that all the examples use the bottom-up approach (from the database schema to a business model). I suppose one can use other well-known approaches like meet-in-the-middle, but this probably requires manual changes in the EDMX files.

The generated classes inherit from technical classes in the framework (System.Data.Objects.DataClasses.EntityObject), which, IMHO, is not really clean. What is cool is the binding between the entities and the forms.

It will be interesting to see how Jasper (this project sounds promising) will hide the current "complexity" of the Entity Framework (see this blog entry from Julia about EF being complex). In this blog entry Julia writes:
I think one of the critical things I shared with them during the day was something that is also common to any LINQ queries, which is that you can very easily and unknowingly make trips to the database when you think you are just looking at only the cached objects.

EF developers can tune object graph loading (fetch plans, eager fetching) using specific APIs in their code (like Include()). Even if that is what most ORM developers also do in Java, there are better ways of doing this, through dynamically configurable fetch plans, as in XIC.
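
A toy version of what a dynamically configurable fetch plan looks like, as opposed to hard-coding Include() calls inside every query (illustrative only; this is not the XIC API):

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Toy version of a dynamically configurable fetch plan: instead of
// hard-coding Include("Orders") in the query code, the set of relations
// to eager-load is built at runtime (e.g. from configuration, per screen)
// and the data layer consults it when loading.
public class FetchPlan {
    private final Set<String> eager = new LinkedHashSet<>();

    public FetchPlan addGroup(String... relations) {
        for (String r : relations) eager.add(r);
        return this;
    }

    public boolean isEager(String relation) { return eager.contains(relation); }

    // What the data layer would do: describe the load it is about to perform.
    public String describeLoad(String entity) {
        return entity + (eager.isEmpty() ? "" : " with " + eager);
    }
}
```

The difference matters: with a plan object, the same query can load a shallow graph for a list screen and a deep graph for an edit screen without touching the query code.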

Wednesday, April 16, 2008

DataDeveloper.Net

I found this portal useful to .Net programmers dealing with Data Access.

Articles, tutorials, resources and more.

They call this a benchmark

Seen on the ADO.Net blog: this article about a performance comparison of the Entity Framework.

It complements this one and this other one. Even if we don't learn a lot about performance comparison, it brings some interesting information about the Entity Framework.

12 things about REST and WOA

This blog entry from Dion Hinchcliffe introduces the following diagram:

(BTW: this guy produces tons of excellent drawings)

Interesting to see that in the Service-oriented Era the relative position of the Data block is not clear.

In the Web-oriented Era, the Data block explodes into a constellation of inter-connected "resources", which we can see as Data Services, can't we? So Data Services Platforms (like Xcalia DAS) are at the heart of that constellation.

High-level services are combined (orchestrated together) to build processes. These coarse-grained services are themselves consuming Data Services. There is an exponential number of possible combinations of Data Services. It is almost impossible to discover which of these combinations are relevant and to maintain them over time as the business changes. That's why it is so important to have technologies that automate these combinations (see papers from IBM's Ali Arsanjani, for instance, on the GOOD method and Manners).

If the technology is able to combine Data Services dynamically, at runtime, it also becomes possible to select the best combination at any time based on cost metadata. That's like the Semantic Web but applied to Data Access.

That's exactly what the dynamic composition of Xcalia intermediation (XIC) does. See also that presentation to better understand the technology and metadata behind the dynamic composition of data services.
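
Cost-based selection among candidate combinations can be sketched very simply: each candidate carries cost metadata, and the cheapest one able to serve the request wins at runtime. The metadata model below is mine, not Xcalia's:

```java
import java.util.List;

// Sketch of cost-based dynamic composition: several combinations of data
// services can answer the same request, and the runtime picks the one
// with the lowest cost metadata. Illustrative only.
public class Composer {
    public static class Candidate {
        final String services; // e.g. a chain of data services
        final double cost;     // cost metadata attached to the combination
        public Candidate(String services, double cost) {
            this.services = services;
            this.cost = cost;
        }
    }

    // Select the cheapest candidate combination at runtime.
    public static String cheapest(List<Candidate> candidates) {
        Candidate best = null;
        for (Candidate c : candidates) {
            if (best == null || c.cost < best.cost) best = c;
        }
        return best == null ? null : best.services;
    }
}
```

Because the choice is made at call time from metadata, a newly deployed (or newly cheap) data service is picked up without changing the consuming process.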

Tuesday, April 15, 2008

Ebean by Avaje

Reading TheServerSide I've discovered yet another ORM, Ebean by Avaje.

The main positioning point is that it avoids using a management context (like the JPA EntityManager or the JDO PersistenceManager). The product takes inspiration from JPA (the mapping annotations, for instance), even though the author details a lot of problems with JPA.

The persistence layer has been specifically designed for relational databases and could not easily be ported to other data sources, as it seems to me there is no abstraction layer. The product relies on byte-code enhancement (ASM), so the author knowingly exposes himself to the fury of the Hibernate bigots :-)

The current version includes support for the AutoFetch feature already described a few weeks ago in this blog.

It seems the product is not complete yet: some mapping options are missing (inheritance strategies, for instance), there is no DDL generation and there is no query language. Most of these features are missing on purpose, or "by design" according to the author.

Their web site is well documented, clean and simple, and is interesting to read per se.

To me the product is simple, developed by an engineer with his own vision. The only supported approach is bottom-up (from the database schema to the beans). There is no support for transparent IDs, even if there is a nice text about what could be done with Oracle's ROWID to simulate an ODBMS navigational model on top of an RDBMS. I think only optimistic locking is supported. I still don't understand, looking at the examples, how transactions are managed. It seems to me there is an implicit transaction around a call to the save() method, but I've probably missed something, as this model would clearly be too limited.
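
My reading of that implicit-transaction model, sketched in a few lines (this is a guess at the semantics, not Ebean's code): save() opens a transaction, commits on success and rolls back on failure, with no context object to carry around.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a context-less persistence API: no EntityManager-like object;
// save() wraps the write in an implicit transaction (begin/commit, or
// rollback on failure). A guess at the model, not Ebean's implementation.
public class ContextlessStore {
    private final Map<String, String> table = new HashMap<>();
    private final StringBuilder log = new StringBuilder(); // traces tx boundaries

    public void save(String id, String value) {
        log.append("begin;");
        try {
            if (id == null) throw new IllegalArgumentException("no id");
            table.put(id, value);
            log.append("commit;");
        } catch (RuntimeException e) {
            log.append("rollback;");
        }
    }

    public String find(String id) { return table.get(id); }
    public String txLog() { return log.toString(); }
}
```

The limitation I pointed at is visible here: with one implicit transaction per save() call, there is no obvious way to make several writes atomic together, which is why I suspect I'm missing part of the story.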

It is not targeting enterprise applications, as too many features are missing. I rather see it as a clean "lab" for future persistence work.

The software design tree

Not directly related to data access, but I have known this funny picture for a long time, and I found it again yesterday on the GigaSpaces blogs. It truly represents how software is built, and also why delivering a successful project is still an art.

Wednesday, April 9, 2008

Google's App Engine

Google has just released App Engine, yet another framework / environment for developing and deploying (hosting) Web applications. It seems to be targeting Amazon's EC2 + SimpleDB offering. See reactions here on TSS and here on InfoQ.

This REST-based web app framework is built upon Python and claims to be designed for easy (transparent) scalability. Like Amazon's solution, the system comes with an integrated non-relational database (BigTable) with APIs and yet another query language (GQL) to access it.

All this being highly proprietary.

App Engine also includes the Django framework with its ORM layer, which might not be that helpful in this case... because the database is not relational.

Developers can try the technology preview for free (at least the first 10,000 beta testers can).

Now let's have a look at the data access layer (inspired by Django, according to the documentation): Google's doc on data access. To be honest, I'm far from impressed. I hope the automatic scalability feature delivers, otherwise I don't really see the point. Is it supposed to be simpler and more readable because it is based on Python?

Tuesday, April 8, 2008

When to use an embedded ODBMS?

I recently saw this article on TheServerSide asking "When to use an embedded ODBMS?". See also the related thread on their forums.

I have the feeling that considering not using an RDBMS now tends to be more and more acceptable to the developer community. As SOA grows, many proponents of RDBMSs now agree that they are sometimes not the best technology to use (which is not to say that RDBMSs should never be used). As the author, Rick Grehan, wrote, ODBMSs are not the only alternative: one could consider XML files, C-ISAM files, or lightweight databases like Berkeley DB (from Sleepycat).

The case of embedded applications is maybe a good one for non-relational datastores because:

  • There is no need for ad-hoc queries outside the application

  • There is no need for BI around the system

  • The cost of a full RDBMS engine would be too high in terms of disk space, CPU usage and memory footprint


The author then lists potential benefits of an ODBMS. I have to admit most of his arguments are not really valid. Basically, he is saying that schema management and evolution are much easier with an ODBMS than with an RDBMS plus ORM.

That is just a question of how the ORM layer has been designed and implemented, and it is quite possible to imagine an advanced ORM solution able to perform automatic schema evolution (there probably are some that do). The fact is that in real cases, automatic schema evolution is not often used (even when an ODBMS has been chosen) because schema evolution is in general part of a more global "version upgrade process" of the whole system (involving backups, data checking, data conversions, data re-initialization...). I agree the situation is a little different with embedded systems, which might get some benefit from automatic schema evolution (whether based on an ODBMS or on an RDBMS plus ORM). What I want to say is that automatic schema evolution is not a feature tied to ODBMSs per se; it could be available on any kind of datastore. It is just a cool feature that can be used in very specific situations.
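
Automatic schema evolution on top of an RDBMS is indeed imaginable with plain reflection: diff the class against the columns the store already knows and emit the missing ALTERs. A crude sketch (the type mapping is deliberately naive, and the Customer class is a made-up example):

```java
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Sketch of automatic schema evolution as an ORM layer could offer it:
// compare the declared fields of a class with the existing columns and
// generate the missing ALTER statements. Illustrative only.
public class SchemaEvolver {
    public static List<String> missingColumns(Class<?> type, Set<String> existing) {
        List<String> alters = new ArrayList<>();
        for (Field f : type.getDeclaredFields()) {
            if (!existing.contains(f.getName())) {
                // Crude type mapping, just for the sketch.
                String sqlType = (f.getType() == int.class) ? "INTEGER" : "VARCHAR(255)";
                alters.add("ALTER TABLE " + type.getSimpleName().toUpperCase()
                        + " ADD COLUMN " + f.getName() + " " + sqlType);
            }
        }
        return alters;
    }

    // Version 2 of a business class: "email" was added since the schema
    // was created with only "name" and "age".
    static class Customer {
        String name;
        int age;
        String email;
    }
}
```

A production version would of course handle renames, drops, type changes and data migration, which is exactly where the "version upgrade process" I mentioned takes over from any automatic mechanism.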

The author then has a second argument: even if ORM hides the complexity of managing an object model in an RDBMS, it does not remove the need for some code to be executed to manage the impedance mismatch. The author is making here the wrong assumption (which most people make) that ODBMS engines natively manage a full object-oriented model internally. In reality, ODBMS engines simulate inheritance and collections exactly as an ORM layer would, and most of this management is done in the client APIs, which gives the taste of transparency.

An ODBMS is just storage with APIs and a query language around it, able to digest object models. That storage has to be as simple, efficient and robust as possible internally. Basically you just need efficient page management, a space allocation algorithm, object IDs, storage of arbitrary tuples, indices, etc. Then on top of that simple storage you build all the necessary features of a typical database: APIs, query language, security, transaction management, crash recovery, logging, backups, replication, a network protocol...
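
The "simple storage" core described above fits in a few lines: fixed-size pages, and an object ID encoding (page, slot). This is a toy, of course, with none of the robustness concerns of a real engine:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the simple storage layer an ODBMS is built on: records are
// appended to fixed-size pages and addressed by an OID encoding
// (page number, slot). Everything else (QL, transactions, recovery...)
// is layered on top of this.
public class PageStore {
    private static final int PAGE_SIZE = 4; // records per page (tiny, for the sketch)
    private final List<List<String>> pages = new ArrayList<>();

    public long insert(String record) {
        if (pages.isEmpty() || pages.get(pages.size() - 1).size() == PAGE_SIZE) {
            pages.add(new ArrayList<>()); // allocate a fresh page
        }
        int page = pages.size() - 1;
        List<String> p = pages.get(page);
        p.add(record);
        int slot = p.size() - 1;
        return ((long) page << 32) | slot; // OID = high word page, low word slot
    }

    public String fetch(long oid) {
        int page = (int) (oid >>> 32);
        int slot = (int) (oid & 0xFFFFFFFFL);
        return pages.get(page).get(slot);
    }
}
```

The OID is the key point: it gives navigational, identity-based access without any interpretation of what the record contains, which is why the same storage can serve object models, tuples, or anything else.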

That's exactly what we did at Xcalia with the Jalisto project, contributed to ObjectWeb. It is a basic storage on top of which you can enable/disable database features. The storage itself is quite configurable, but we didn't provide any API around it, so it's up to you to wrap Jalisto in an ORM layer of your choice. As many features can be disabled (including the network layer), it could be a good solution for embedded systems. The system is comparable to db4o or Berkeley DB, but it does not impose its own set of APIs and query language; you wrap it with your favorite data access layer.

All in all, it is positive to see people starting to push ODBMSs, even if I have the feeling this article is mostly a kind of disguised db4o advertising (db4o being a good database and sane technology anyway). This potential rebirth of ODBMSs will make Xcalia's universal mapping technology even more competitive against pure ORM solutions limited to RDBMSs. It is nice to see that TopLink and JPOX are now also working on extending their mapping technology to non-relational datastores.

My bet is that the future is in data access, not in the database. We'll see several database technologies co-existing and addressing different problems, with standardized Data Services in the middle to efficiently serve new business applications, reporting tools, BI, etc.

Monday, April 7, 2008

Database virtualization

Virtualization is everywhere and everybody has already used OS virtualization through well-known commercial or open source products.

Obviously it is not limited to the OS, and virtualization is now quickly expanding into various other software solutions.

One could see ORM or some EII solutions as data access virtualization or, more simply, Data Virtualization. To some extent, Data Services are all about Data Virtualization, and we'll discuss this in detail on this blog in the future.

But if we go to the deeper layers of Data Virtualization, one might think we first have to deal with Database Virtualization. What's that? The ability to use several database engines spread over a grid. Storing data in a set of database servers raises some concerns compared to using a single database, as most current relational database technologies have been primarily designed as a single engine.

NB: some different database technologies have been designed to be natively distributed (like the Versant ODBMS for instance), but it is not yet the common case.

Most traditional database features become problematic when thinking distributed:

  • Query engine and merging of separate result sets

  • Algorithms to determine in which database(s) to store a new record

  • Transaction management

  • Management and synchronization of a distributed database schema

  • Indices management

  • Referential integrity

  • Security

  • Crash recovery

  • Replication, backups


Even administration and tuning become much more complicated.
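
To make the first bullet concrete: each database returns its own sorted partial result, and the virtualization layer has to merge them into one ordered result set, typically with a k-way merge. A sketch over integer results:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// The query-engine bullet made concrete: merging k sorted partial result
// sets (one per database) into a single ordered result set.
public class ResultMerger {
    public static List<Integer> merge(List<List<Integer>> partials) {
        // Heap entries: {value, sourceList, indexInSource}.
        PriorityQueue<int[]> heap =
                new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        for (int s = 0; s < partials.size(); s++) {
            if (!partials.get(s).isEmpty()) {
                heap.add(new int[] {partials.get(s).get(0), s, 0});
            }
        }
        List<Integer> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            merged.add(top[0]);                 // emit the smallest head
            List<Integer> src = partials.get(top[1]);
            int next = top[2] + 1;
            if (next < src.size()) {            // advance in that source
                heap.add(new int[] {src.get(next), top[1], next});
            }
        }
        return merged;
    }
}
```

And this is the easy part: aggregates, joins across partitions and LIMIT pushdown are where a distributed query engine really gets complicated.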

So, what could be the potential benefits of a distributed database?

  • Lower costs (10 machines each hosting 100 GB are cheaper than one single machine hosting 1 TB). But is that true?

  • Better reliability and robustness. Could depend on how the schema is spread over the grid...

  • Any other good idea?


When having to split a schema into a set of database servers there are basically two different partitioning strategies:

  • Vertical partitioning: Some tables (T11, T12...) are in DB1, other tables are in other DBs

  • Horizontal partitioning: the same table is stored in all the databases, and records are dispatched to the different databases based on various algorithms (business values, round robin...).
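
The two horizontal dispatching algorithms just mentioned (business values, round robin) fit in a few lines; a hypothetical sketch:

```java
// Sketch of the two horizontal dispatching algorithms: choosing a target
// database by a business value (hash) or by round robin.
public class Partitioner {
    private final int databases;
    private int nextRoundRobin = 0;

    public Partitioner(int databases) { this.databases = databases; }

    // Business-value dispatch: the same key always lands on the same DB,
    // so reads know where to look without consulting every database.
    public int byKey(String businessKey) {
        return Math.abs(businessKey.hashCode() % databases);
    }

    // Round-robin dispatch: spreads the load evenly, but reads must then
    // query all databases (or maintain a lookup index).
    public int roundRobin() {
        int db = nextRoundRobin;
        nextRoundRobin = (nextRoundRobin + 1) % databases;
        return db;
    }
}
```

The trade-off in the comments is the crux: value-based dispatch makes routing deterministic but can skew the load, while round robin balances the load but complicates lookups.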


The database virtualization could be managed at different levels:

  • Database engine level. I'm sure RDBMS leaders are working on such projects, I'd like to find information about these ones.

  • Connectivity layer. This is an interesting place to manage distribution. See for instance the C-JDBC project initiated at ObjectWeb, and now continued under the Sequoia name at Continuent.

  • Data Service Platform. This one is also a good candidate, potentially in conjunction with the previous one.

  • Application level. I don't really believe into that one, as it must be a nightmare for business programmers to deal with all the complexity of distributed database programming. However, I'm sure some projects have been implemented like that.


If you are interested in Data Virtualization, DataDirect will run a series of Architect Tutorials on the subject in April.

Database virtualization

Virtualization is everywhere and everybody has already used OS virtualization through well-known commercial or open source products.

Obviously, it is not limited to the OS, and virtualization is now quickly expanding into various other software solutions.

One could see ORM or some EII solutions as data access virtualization, or more simply Data Virtualization. To some extent, Data Services are all about Data Virtualization, and we'll discuss this in detail on this blog in the future.

But if we go down to the deeper layers of Data Virtualization, one might think we'll first have to deal with Database Virtualization. What's that? The ability to use several database engines spread over a grid. Storing data in a set of database servers raises some concerns compared to using a single database, as most current relational database technologies have been primarily designed as single engines.

NB: some database technologies have been designed to be natively distributed (like the Versant ODBMS, for instance), but that is not yet the common case.

Most traditional database features become problematic when thinking distributed:

  • Query engine and merge of separate result sets

  • Algorithms to determine which database(s) should store a new record

  • Transaction management

  • Management and synchronization of a distributed database schema

  • Indices management

  • Referential integrity

  • Security

  • Crash recovery

  • Replication, backups


Even administration and tuning become much more complicated.
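The first two bullets above boil down to a scatter-gather pattern: the query is sent to every database, each returns a locally sorted partial result, and a coordinator merges them into one globally ordered result. A minimal sketch (hypothetical class names, shards simulated as in-memory lists rather than real database connections):

```java
import java.util.*;

// Minimal sketch: a coordinator performing a k-way merge of
// locally sorted partial results coming back from several shards.
public class ScatterGatherMerge {

    public static List<Integer> merge(List<List<Integer>> partialResults) {
        // Queue entries: {value, shardIndex, positionInShard},
        // ordered by value so we always emit the global minimum next.
        PriorityQueue<int[]> pq =
            new PriorityQueue<>(Comparator.comparingInt(e -> e[0]));
        for (int s = 0; s < partialResults.size(); s++) {
            if (!partialResults.get(s).isEmpty()) {
                pq.add(new int[]{partialResults.get(s).get(0), s, 0});
            }
        }
        List<Integer> merged = new ArrayList<>();
        while (!pq.isEmpty()) {
            int[] e = pq.poll();
            merged.add(e[0]);
            List<Integer> shard = partialResults.get(e[1]);
            int next = e[2] + 1;
            if (next < shard.size()) {
                pq.add(new int[]{shard.get(next), e[1], next});
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Three shards, each holding a locally sorted slice of the data
        List<List<Integer>> shards = List.of(
            List.of(1, 4, 9), List.of(2, 3, 10), List.of(5, 6, 7));
        System.out.println(merge(shards)); // [1, 2, 3, 4, 5, 6, 7, 9, 10]
    }
}
```

Of course, a real engine also has to push down predicates, handle limits and aggregates, and survive a shard failing mid-query, which is where the complexity really lives.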

So, what could be the potential benefits of a distributed database?

  • Lower costs (10 machines each hosting 100Gb are cheaper than one single machine hosting 1Tb). But is that true?

  • Better reliability and robustness. Could depend on how the schema is spread over the grid...

  • Any other good idea?


When having to split a schema into a set of database servers there are basically two different partitioning strategies:

  • Vertical partitioning: some tables (T11, T12...) are stored in DB1, other tables in other DBs

  • Horizontal partitioning: the same table is stored in all the databases, and records are dispatched to the different databases based on various algorithms (business values, round robin...).

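The horizontal case can be sketched in a few lines. Here is a minimal, hypothetical router (the class and database names are illustrative, not from any product) that hashes a business key to pick the database holding a given record; the same key is always routed to the same database:

```java
// Minimal sketch of horizontal partitioning: a router that maps a
// business key to one database out of a fixed set. Vertical
// partitioning would instead map whole tables to fixed databases.
public class ShardRouter {
    private final String[] databases;

    public ShardRouter(String... databases) {
        this.databases = databases;
    }

    // Deterministic routing: floorMod keeps the index non-negative
    // even when hashCode() is negative.
    public String databaseFor(String businessKey) {
        int index = Math.floorMod(businessKey.hashCode(), databases.length);
        return databases[index];
    }

    public static void main(String[] args) {
        ShardRouter router = new ShardRouter("db1", "db2", "db3");
        // The same customer key always lands in the same database
        System.out.println(router.databaseFor("customer-42"));
    }
}
```

Note that such a fixed hash makes adding a database painful (most keys re-route), which is why real systems reach for directory tables or consistent hashing.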

Database virtualization could be managed at different levels:

  • Database engine level. I'm sure the RDBMS leaders are working on such projects; I'd like to find information about them.

  • Connectivity layer. This is an interesting place to manage distribution. See for instance the C-JDBC project initiated at ObjectWeb, and now continued under the Sequoia name at Continuent.

  • Data Service Platform. This one is also a good candidate, potentially in conjunction with the previous one.

  • Application level. I don't really believe in that one, as it must be a nightmare for business programmers to deal with all the complexity of distributed database programming. However, I'm sure some projects have been implemented that way.

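The appeal of the connectivity layer is that the application keeps seeing one single "virtual" database. A minimal sketch of the idea (hypothetical classes, replicas simulated as in-memory maps — not the actual C-JDBC/Sequoia API): writes are broadcast to every replica to keep them in sync, while reads are load-balanced round-robin across them:

```java
import java.util.*;

// Minimal sketch of a connectivity-layer facade: one virtual database
// backed by several replicas, with write broadcast and read balancing.
public class VirtualDatabase {
    private final List<Map<String, String>> replicas = new ArrayList<>();
    private int nextRead = 0;

    public VirtualDatabase(int replicaCount) {
        for (int i = 0; i < replicaCount; i++) {
            replicas.add(new HashMap<>());
        }
    }

    // Writes go to all replicas so any of them can serve a read
    public void write(String key, String value) {
        for (Map<String, String> replica : replicas) {
            replica.put(key, value);
        }
    }

    // Reads are spread round-robin across the replicas
    public String read(String key) {
        Map<String, String> replica = replicas.get(nextRead);
        nextRead = (nextRead + 1) % replicas.size();
        return replica.get(key);
    }

    public static void main(String[] args) {
        VirtualDatabase vdb = new VirtualDatabase(3);
        vdb.write("order:1", "shipped");
        // Consecutive reads hit different replicas but see the same value
        System.out.println(vdb.read("order:1")); // shipped
        System.out.println(vdb.read("order:1")); // shipped
    }
}
```

The hard parts hidden by this toy version are exactly the bullets listed earlier: distributed transactions so a broadcast write is atomic, and recovery when one replica falls behind.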

If you are interested in Data Virtualization, DataDirect will run a series of Architect Tutorials on the subject in April.