Relational data warehouse fact table index | SQL Server Performance Forums

SQL Server Performance Forum – Threads Archive

Relational data warehouse fact table index

I have a fact table with 17 dimension foreign keys (all ints) and 1 degenerate dimension (bigint). 8 of the dimension foreign keys make the fact record unique (the other 9 are there for "the ride"). The degenerate dimension plus 1 of the dimension foreign keys also makes the fact record unique. Which is the best indexing strategy?: In all cases, there will be a non-clustered index for each dimension.
1)Primary clustered index on the date dimension plus a fact table big-int surrogate key (to make the record unique). Non-clustered index for each dimension (except the date dimension). 2)Primary clustered index on 8 of the dimension foreign keys make the fact record unique. 3)Primary clustered index on all 17 of the dimension foreign keys. 4)Primary clustered index on the degenerate dimension plus the 1 dimension foreign key that makes the fact record unique. 5)Primary non-clustered index on 8 of the dimension foreign keys make the fact record unique. 6)Primary non-clustered index on all 17 of the dimension foreign keys. Thanks,
Newbie
Assuming that most of your reports will query on the date column I would place a Primary Key Constraint on (4) or (5) to ensure uniqueness in your Datawarehouse. As far as indexing, (1) would be close to my choice but I would only place the Clustered Index on the Date Dimension leaving out the Surgate Key and Non-clustered indexes for all other dimension keys. Indexing is not an overnight job you need to plan and test properly.
Raulie
Thank you Raulie. Can you please clarify what "(4) or (<img src=’/community/emoticons/emotion-11.gif’ alt=’8)’ />" are?<br />For indexing, I thought that having the clustered index be unique was important – that was the reason why I thought about creating the surrogate key in the first place – to make the index unique.<br /><br />Thanks,<br />Newbie
You can place a Primary key Constraint without necessarily making it a Clustered Index this is trap that many developers fall into.
You can place a primary key Contraint to ensure uniqueness on options (4) or (5) (not 8 my bad) and not have be forced to use your one and only Clustered Index. Yes having a clustered index on a unique key is important, but only if you will be selecting from it. Assuming that most your queries will be on a date range, I would place the clustered index on the date column.
Raulie

On top of that if you don’t really need to select single row in dw environment you don’t have to have unique row identifier (pk, uk or unique index). I assume there is daily process refreshing dw data you have full control over and no other changes are allowed on your dw db. It depends on how is your load process designed, but it is possible that you don’t need pk at all.
No one said it is mandatory to place a PK on a table, by not placing a PK in your warehouse you run the chance of violating Entity Integrity, nulls, duplicates etc. Yes all of this is handled usually through the loading process but there are many times you will need to search a record by it’s PK. I guess it is a question of preference and what works for your environment. Raulie
quote:Yes all of this is handled usually through the loading process but there are many times you will need to search a record by it’s PK
That’s what I said: If you don’t need to select single row… If you are not sure better keep pk.
I think we are drifting away from the original question of what indexing strategy to use for his fact table.
Raulie
Not exactly. My point is not to put index if you are not going to use it. If don’t have functionality that will take advantage of index or pk, then don’t create one. But you have to know functionality to decide about it. So I’m not going to pick any number, just offer opinion to be taken into consideration.
The surrogate key option was based on an article published on the Microsoft site –
http://www.microsoft.com/resources/documentation/sql/2000/all/reskit/en-us/part5/c1761.mspx
Which had this to say: "The fact table must have a unique index on the primary key. There are scenarios where the primary key index should be clustered, and other scenarios where it should not. The larger the number of dimensions in the schema, the less beneficial it is to cluster the primary key index. With a large number of dimensions, it is usually more effective to create a unique clustered index on a meaningless IDENTITY column." And this article:
http://blogs.msdn.com/bi_systems/articles/164502.aspx
"If you build a clustered index on the fact table, you want it to be a small index. The multi-column logical primary key is thus eliminated from consideration. One solution is to create a surrogate primary key (bigint) on the fact table, and build a unique clustered index on it. The fact surrogate key serves no business or logical purpose. Instead, it exists to provide a nice row identifier for all other indexes. The clustered index is unlikely to be used directly in any user query. A clustered index is expensive to maintain, and may not be worth building in this case as the index is not useful for user queries.
An alternative solution is to create the surrogate primary key as above, and then create a 2-column unique clustered index on (DateKey, FactSurrogateKey). If you build the index with the DateKey in the outside or first position, it will be directly used in many queries. DateKey is chosen because it is very commonly included in user queries. " I thought the purpose of the clustered index is to provide a nice row identifier for all the non-clustered indices – thus it needs to be unique, right? So if I created a clustered index on date – it wouldn’t be unique. If I don’t create a clustered index, everything gets put into an unsorted heap, which may not be so bad right? I guess I just need to know where to start – with or without a clustered index, and if with a clustered index, if that index needs to be unique. Thanks,
Newbie
As mentioned in the article:
quote:To decide, perform some comparative benchmarking: clustered index on (DateKey, FactSurrogateKey) versus creating no clustered index on the table.

If you decide to create clustered index it is better to be unique, otherwise mssql server will add additional bytes to make the whole combination unique.
quote:Originally posted by mmarovic Not exactly. My point is not to put index if you are not going to use it.

So your saying that foreign keys will not take advantage of indexes? INDEXES ON FOREIGN KEYS WILL SPEED UP JOINS.
quote:Originally posted by Raulie Assuming that most your queries will be on a date range, I would place the clustered index on the date column.
I think my reply corresponds with what the article said. "DateKey is chosen because it is very commonly included in user queries." mmarovic what part of assuming didn’t you understand?
quote:Originally posted by mmarovic But you have to know functionality to decide about it.

I did say "Indexing is not an overnight job you need to plan and test properly."

quote:So your saying that foreign keys will not take advantage of indexes?.

Where did I say that?
quote:INDEXES ON FOREIGN KEYS WILL SPEED UP JOINS.

No objection here, but I don’t get it why are you so upset and how you came to the conclusion that I’m against indexes on foreign keys?
quote:I think my reply corresponds with what the article said. "DateKey is chosen because it is very commonly included in user queries." mmarovic what part of assuming didn’t you understand?

Again, show me where I have commented your date key recommendation.
quote:I did say "Indexing is not an overnight job you need to plan and test properly."

Right. I see you found my sentence to be critical to you post, which is not. As I said this is my standard disclaimer why I can’t give decisive answer without ifs and buts. Please, calm down and don’t read from my post more then it is actually written there.
Sure buddy….. Look the advice I originally gave the poster was something that I would do based on an assumption that most queries would be on date range. It didnt imply go out and index this and that. One thing you would want to do is Index all keys plane and simple, and choose the clustered index wisely. Most datawareshouse queries are on date range so thats why I suggested if that is going to be his case then I would Cluster the date column. Thank you Raulie
No objections.
Also, if minute precision is good enough it is better to use SmallDateTime instead of DateTime. This way you have narrower index, which means more data in a buffer, cach and that leads to better performance.
Thanks. Still a little bit confused though.
"If you decide to create clustered index it is better to be unique, otherwise mssql server will add additional bytes to make the whole combination unique."
Can you please elaborate on this.
I take it there may be 2 choices here – to either have the clustered index on the date, which for me is a smallint (surrogate key in the date dimension table) – or the clustered index on a unique combination of date plus the surrogate fact table key? The statement above implies that SQL Server can add the additional bytes automatically – which would then deem the surrogate key unneccessary, right? How many bytes will it add? Thanks,
Newbie
<blockquote id="quote"><font size="1" face="Verdana, Arial, Helvetica" id="quote">quote:<hr height="1" noshade id="quote">"If you decide to create clustered index it is better to be unique, otherwise mssql server will add additional bytes to make the whole combination unique."<br />Can you please elaborate on this.<hr height="1" noshade id="quote"></font id="quote"></blockquote id="quote"><br />Sorry, it was really poor explanation. I mean if you need to have unique identifier there are good reasons to make it (part of) clustered index. Article actually elaborated it in details. <br /><br /> <blockquote id="quote"><font size="1" face="Verdana, Arial, Helvetica" id="quote">quote:<hr height="1" noshade id="quote">I take it there may be 2 choices here – to either have the clustered index on the date, which for me is a smallint (surrogate key in the date dimension table) – or the clustered index on a unique combination of date plus the surrogate fact table key?<hr height="1" noshade id="quote"></font id="quote"></blockquote id="quote"> If you need to select single row than you have no choice but to use surrogate key. I would never make index on 8 columns to have unique identifier. <br />As author of the article said (and Raulie) combination of date and surrogate key looks like the best solution.<br /><br /> <blockquote id="quote"><font size="1" face="Verdana, Arial, Helvetica" id="quote">quote:<hr height="1" noshade id="quote">The statement above implies that SQL Server can add the additional bytes automatically – which would then deem the surrogate key unnecessary, right? <hr height="1" noshade id="quote"></font id="quote"></blockquote id="quote"><br /><br />Surrogate can be omitted if you don’t need unique row identifier. <blockquote id="quote"><font size="1" face="Verdana, Arial, Helvetica" id="quote">quote:<hr height="1" noshade id="quote"><br />How many bytes will it add?<hr height="1" noshade id="quote"></font id="quote"></blockquote id="quote"><br />Right know I really don’t have time to take a look at a book, but I believe it would be just one byte if you choose index on SmallDateTime column, since you I don’t expect many dups. I also believe that in general it could be more then one byte only if you have extremely high number of duplicates.<br /><br />Finally, to make it clear: If you don’t need unique row identifier for fact table in 99% cases date column is perfect choice for clustered index (as Raulie suggested <img src=’/community/emoticons/emotion-1.gif’ alt=’:)‘ />). However If you need unique identifier I would go with date + surrogate key.<br /><br />Hope I expressed my opinion better now. <img src=’/community/emoticons/emotion-1.gif’ alt=’:)‘ /><br />
]]>