Hi! I have a join which looks something like this: SELECT ... FROM @memberselection t INNER JOIN order o ON o.member_id = t.member_id INNER JOIN transactions t ON t.order_id = o.order_id For a large set in @memberselection this takes forever. For a selection of about 1000 members, it takes 30 second. PK on @memberselect is member_id. I have an index on order table on member_id. Member_id is VARCHAR(30). If I complement the join between order and @memberselection like below, time will drop to less than 1 second, even on larger sets! SELECT ... FROM @memberselection t INNER JOIN order o ON o.member_id = t.member_id AND o.customer_id = t.customer_id <--- Added join INNER JOIN transactions t ON t.order_id = o.order_id there is an index on order table on customer_id and customer_id is INTEGER. I thought that this big difference was because the server had an easier time dealing with INT's then VARCHARs, so I tried changing the member_id to BIGINT instead (its too large to use INT) Then time for query without customer_id dropped to 14 seconds, about half compared to VARCHAR columns. Should there really be such a big difference in processing time when joining on INT vs BIGINT columns!? I cannot see any other differences (except that the query plan when joining on customer_id needs some extra step to lookup member_id, so it should really be slower!). In both cases it uses the expected index and the indices are identical except one starts with member_id and one with customer_id. I can of course change all my joins to include customer_id, but Im interested in why this phenomenon happens in the first place... I can also mention that stats are updated and query cache cleared between each test.. cheers /Linus
The major difference would be between VARCHAR and numeric, not between INT and BIGINT. There's just more processing involved in joining on character data, that's why the common suggestion is to add a substitute key, like an identity column, to which you can refer in any FK relationship, and which is the better option for JOINs anyway. When you changed the VARCHAR number to a BIGINT, you probably had to drop some markers in the number, like hyphens. With the substitute key, the VARCHAR column can stay as it is.
Hi Thanks for the reply! In this case I did not have to loose any data (the member numbers were already numbers only), so it would be ok for me to change. Adding a substitute key would be an option but if i need to change all code i might as well just add joins on customer_id and the (performance) problem is solved... The question I really want answered is why is there such a a difference when joining on the bigint column instead of the int column... A workmate suggested it might be because the customer_id data is much nicer than the member_id data. Customer_is pretty much an unbroken sequence of numbers, whereas member_id has a lot of holes in the series (probably about every 8th number is used only). Could this be the reason? /L
You said: "In both cases it uses the expected index and the indices are identical except one starts with member_id and one with customer_id." And the best performing plan uses the index starting with member_id? Makes sense about the selectivity of member_id on your Order table.
No, I think you misunderstood me... When using only join on member_id, it will use index on member_id and query is slow. When using join on customer_id, it will use index on customer_id and query is fast.. and as a i mentioned there is a BIG difference on slow and fast here /L
Sorry about that. If the join on customer_id is performing better, then it is probably because there are more repeating values for member_id than for customer_id, so customer_id is more selective. Just goes to show you can design an index according to some very clean logic, but the actual data stored in the table can still have an unexpected impact on performance.
Hi! This seems like a valid explanation to me. But then I learned I could check index density and the one for member_id (the bad performing one) is even lower than the customer_id one (both are quite close to zero though). If it was really a problem about selectivity, shouldnt the index density indicate this? cheers /Linus
I think we're back to the initial point here - a join on numeric columns will outperform a join on alphanumeric columns. Not sure why we got sidetracked here.[]
hehe.. yeah, ok.. i buy the alphanumeric explanation, but i still dont get the 15 vs 1 second difference on BIGINT vs INT... /L