How to Index a Fact Table – A Best Practice
Author: Christian Screen | 6 min read | March 16, 2010
At the base of any good BI project is a solid data warehouse or data mart. We can debate star schema versus snowflake all day long, but the main goal of this post is to call out how often the fact table, the foundation on which the star schema's house rests, gets neglected. Far too often, when I enter a new engagement and work from the client's existing infrastructure, I find poorly performing queries against their subject areas. This is unacceptable not only from a professional implementation perspective but also from a query performance one. This post will look at what the default indexing of any fact table should be, why that default indexing is needed, why composite indexes are both good and bad, and why you should care.
I owe this post to my buddies Owen and Greg. This one has been a long time in the writing and I have finally gotten around to putting fingers to keys.
Default Indexing Best-Practice
Short and sweet – all foreign key columns should have a non-clustered non-unique index.
So what does that mean?
It means that a foreign key on a fact table column, referencing the primary key of a dimension table, only gives the database engine a point-of-reference to that dimension table. It does nothing for how the data in the fact table is organized or accessed. So, after you run the foreign key (FK) DDL statement, be sure to create the non-clustered, non-unique index on that column as well.
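As a minimal sketch of that pattern (T-SQL-style syntax, with hypothetical fact/dimension table and column names), it might look something like this:

```sql
-- Hypothetical star schema objects: fact_sales referencing dim_date.
-- 1. The FK constraint gives the engine (and the data) a point-of-reference.
ALTER TABLE fact_sales
    ADD CONSTRAINT fk_fact_sales_date
    FOREIGN KEY (date_id) REFERENCES dim_date (date_id);

-- 2. The FK alone does nothing for how fact rows are accessed,
--    so add the non-clustered, non-unique index on the same column.
CREATE NONCLUSTERED INDEX ix_fact_sales_date_id
    ON fact_sales (date_id);
```

Repeat the index for every FK column on the fact table (product_id, territory_id, and so on).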
Why Is Default Fact Table Indexing Needed?
I briefly touched on this above, but in a little more detail: remember that a foreign key provides a reference, and it also lets you apply a constraint, which is one of the key reasons for using an FK in the first place. The constraint basically tells the database that any value entered in the FK column of the fact table must already exist in the referenced dimension table. This can of course be switched off in the fact table's FK DDL, but it is best left active. Again, this does nothing for performance; it is about data integrity, which is indeed very important.
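For example, in SQL Server the enforcement can be toggled per constraint (Oracle has an equivalent DISABLE/ENABLE CONSTRAINT clause); the names below are the same hypothetical ones as above:

```sql
-- Stop enforcing the FK, e.g. during a bulk load (trades integrity for speed):
ALTER TABLE fact_sales NOCHECK CONSTRAINT fk_fact_sales_date;

-- Turn it back on; WITH CHECK also re-validates the rows loaded in between.
ALTER TABLE fact_sales WITH CHECK CHECK CONSTRAINT fk_fact_sales_date;
```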
However, when you add an index on each FK column of the fact table, you give the database engine's optimizer something to work with. That turns a query from a full table scan into a proper index seek: instead of scanning the entire fact table of n rows, the engine now has a deliberate path to the data it needs.
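To make that concrete, here is the kind of ad hoc star-join query an end user might issue (hypothetical column names again); with ix_fact_sales_date_id in place, the optimizer can seek on the date filter rather than scan every fact row:

```sql
SELECT d.calendar_month,
       SUM(f.sales_amount) AS total_sales
FROM   fact_sales f
       JOIN dim_date d ON d.date_id = f.date_id
WHERE  f.date_id BETWEEN 20100101 AND 20100131   -- filter on the indexed FK column
GROUP  BY d.calendar_month;
```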
Are Composite Indexes Good or Bad?
When I first started learning about table indexing I was in the e-commerce industry. Back then we knew what every query was going to be, because it was hard-coded in the underlying Data Access Layer (DAL). For example, if we had a shopping cart web site with only a handful of pages for displaying products, we knew that the WHERE clause for the Products by Category page was quite simple and always filtered on category_id and a few other searchable metadata columns. Therefore we would create a composite index on the columns we knew were always part of the page's search criteria, as in the sketch below.
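A rough illustration of that old e-commerce pattern, with made-up column names: one composite index built for the one query shape we knew we had.

```sql
-- Static, known query path: the Products-by-Category page always filters on
-- these same columns, so a single composite index covers it.
CREATE NONCLUSTERED INDEX ix_products_category_search
    ON products (category_id, brand_id, is_active);
```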
That worked for a static query system, but it is not a best practice for Business Intelligence (BI). One of the main buzzwords around BI is ad hoc reporting, which means end users will create a myriad of queries, slicing and dicing the data whichever way they see fit to best conduct their analysis. Because of this, a predefined composite index would only add overhead to your fact table. More to the point, consider the sheer number of column combinations you would have to account for if you attempted to cover ad hoc queries with composite indexes.
Please also note that there is of course a place for composite indexes in an OLTP (transactional) database system. I won't go into great detail here, but reviewing the indexing on those source systems as it relates to your BI ETL processes can also boost ETL extraction/load performance.
Ultimately composite indexes on fact tables are bad for BI.
Why Should You Care?
First of all, you should take pride in knowing that different systems call for different indexing logic. BI is still in its infancy, but we should always understand that what works in one situation might not work in another.
Second, if you are creating a new implementation for your company or client, you will want to provide the best solution possible. Indexing is one of the items most overlooked by solution architects (unless you were a DBA in a past life), at least until performance turns out to be worse than they believe it should be. But there is also room for improvement in existing structures. Take a look at your existing data marts: did the architect substitute composite indexes for what should have been single FK column indexes? I see it all the time; the table might have a composite index on product_id, territory_id, and date_id, but when I look closer there is no index on the date_id column by itself.
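In other words, the situation I keep finding looks roughly like the first statement below, when what the date-driven ad hoc queries actually need is the second (illustrative names only):

```sql
-- What often exists: one composite index, which only helps queries that
-- filter on the leading column(s), product_id first.
CREATE NONCLUSTERED INDEX ix_fact_sales_prod_terr_date
    ON fact_sales (product_id, territory_id, date_id);

-- What is usually missing: a single-column index so a filter on date_id
-- alone can seek instead of scanning the fact table.
CREATE NONCLUSTERED INDEX ix_fact_sales_date_id
    ON fact_sales (date_id);
```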
And don't forget about the types of BI engines on the market today. If you are using a true OLAP engine like Oracle Essbase or SQL Server Analysis Services, you will know that those systems can leverage inherent strengths such as block storage, aggregate storage, and caching, which can mask a poorly developed underlying data mart. However, in a system like Oracle BI (OBIEE) or SAP BW, the underlying relational structure is paramount, since those "OLAP" tools really just re-issue SQL to the relational database in an OLAP-like fashion. All in all, it's best to prepare in advance, no matter which application will sit atop your data mart design.
Conclusion
There could be much more discussion on this, and I was hoping to put together a large-scale example to visually show some execution plans and stats, but I may throw those in at a later date with all of my copious free time : ). However, this is a straightforward exercise you can do yourself if needed. The idea behind this post was simply to share an on-the-ground headache that I keep bumping into and that is easily resolved with some upfront effort and understanding.
Just remember that you need both data integrity (FKs) and query performance (indexes), and you'll be more than halfway to a nicely performing data mart.
If your opinion differs, I would definitely like to hear what you have to say, so let us know.