Thursday, February 7, 2008

Star schema

OLAP (On-Line Analytical Processing) tools are among the most common front end systems for data warehouses; they allow dynamical and multidimensional analysis to be performed against a huge amount of records in order to produce a small set of data which can be used as a dashboard for business process management by the so-called the "knowledge workers".
There are two widespread approaches to OLAP implementation: ROLAP (Relational OLAP) and MOLAP (Multidimensional OLAP). Ok, there is also an hybrid solution, called... you guessed, HOLAP (Hybrid OLAP).

Why should people use the relational model to implement a multidimensional model? There are many reasons, the most important being the diffusion of advanced RDBMSs and the expertise of IT people. Moreover, ROLAP systems don't have a "sparse data" problem, thus being far more scalable than MOLAP implementations. Unluckily, the relational - and bidimensional - model, in which we find attributes, relations and integrity constraints, has a reduced expressivity when it comes to describe the multidimensional model, in which we find facts, measures, attributes, dimensions and hierarchies. That's why we have to find a workaround, which leads us to the (notorious) star schema.

The star schema consists of one (or more) fact table(s), which represents facts, referencing many dimension tables, which represent the dimensions of analysis. Fact tables typically have a lot of columns, and newbies almost always smell the rat of a very bad use of the relational model where they should see a very good compromise instead.
One of the reasons behind the star schema is the very poor performance shown by RDBMS when they have to aggregate a huge amount of records belonging to many different tables, thus involving many expensive join operations: denormalization can then improve performance at the cost of the increased disk space required. Another way to improve performance is redundancy: you materialize derived tables (views) based on the most used aggregations to speed up typical analysis. In addition, ROLAP implementations often use surrogate keys, another feature that make newbies shrug.

The star schema can have some variations, as the snowflake schema, obtained decomposing one of more dimension tables eliminating transitive functional dependencies contained in the tables. Dimension tables which keys are imported in the fact table are called primary while the others are called... why, secondary, what else?

No comments: