Hi, I have a table logging our website activity with about 60 M of rows. One user has for itself more than 1 M rows. When I do a count(distinct), it takes two seconds. I was wondering if there is a way to speed that up? Here's the table CREATE TABLE [dbo].[tbl_logs]( [id] [int] NOT NULL, [dateAdded] [datetime], [fk_tbl_visits_id] [int], [fk_tbl_users_id] [int] ) ON [PRIMARY] I tried different combinaison of indexes, but here the index that performed the best. CREATE NONCLUSTERED INDEX [idx_visits] ON [dbo].[tbl_logs] ( [fk_tbl_users_id] ASC, [dateAdded] ASC, [fk_tbl_visits_id] ASC )WITH (PAD_INDEX = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, IGNORE_DUP_KEY = OFF, ONLINE = OFF) ON [PRIMARY] Curiously, that clustered index was never taken by the query optimizer when the one above was created. I thought clustered indexed would be better for that query... CREATE CLUSTERED INDEX [idx_primary] ON [dbo].[tbl_logs] ( [fk_tbl_users_id] ASC, [dateAdded] ASC, [fk_tbl_visits_id] ASC )WITH (PAD_INDEX = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, IGNORE_DUP_KEY = OFF, ONLINE = OFF) ON [PRIMARY] And here's the query that return ~1 M rows. select count(distinct fk_tbl_visits_id) as nb from tbl_logs where fk_tbl_users_id = 10203 and dateadded between '2008-01-01 00:00:00' and '2008-05-31 23:59:59:999' The execution plan shows an index seek of 40% and next, a hash match taking 55%. Any idea of how to reduce the execution time? Thanks Stephane
Drop all the indexes and create clustered index on uinque column [fk_tbl_visits_id ] , nonclustered index on date column [dateAdded] , check the performances. Reference: http://searchsqlserver.techtarget.com/tip/0,289483,sid87_gci1299403,00.html# Cheers Sat
Good suggestion by Sat - except that [fk_tbl_visits_id] is probably not a unique column. Your table definition does not say which column is unique - quite possibly it's [id]. Make the unique column your clustered index, and add (non-clustered) indexes on the columns that actually interest you.
As a mater of fact, the id field is unique. If I understand well, the clustered index should be on increasing values (like date or identity) and it's better for range search. So in my case, I should create an clustered index on date since I use a lot of between? But the problem is, even though it's a date time, there are some row that has the same date so it can't be unique. So far, I've made some tests (clustered on Id, indexes on date and user id) and it's slower than two seconds (6-8 seconds in fact). But I'll keep trying. I'm sure there's a way to speed that up! Thanks Stephane
Not sure that range searches should be considered as the main argument about which index should be clustered. A non-clustered index with regular maintenance should be sufficient. The clue behind clustered indexes is that all non-clustered ones are based on them. As long as new values are added with increasing values on the clustered index, it helps keep the other indexes in good shape.
Hi, Finally, after a lots of tests, this is the fastest index (note the include column). It executes the query in 2 seconds. CREATE NONCLUSTERED INDEX [idx_date] ON [dbo].[tbl_logs] ( [fk_tbl_users_id] ASC, [dateAdded] ASC ) INCLUDE ( [fk_tbl_visits_id]) WITH (PAD_INDEX = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, IGNORE_DUP_KEY = OFF, ONLINE = OFF) ON [PRIMARY] I tried many combination of clustered (id, date which is not unique, date + id) index but it always took 6 seconds or more. I also tried separate indexes (one for user id, one for the date and one for the visit id) but it was also very long (15 seconds) I'm now short of ideas to find out how to reduce execution time. As a comparaison when I check Google Analytics, it process reports in less than a second. I know we don't have Google servers, but we don't have as much data too. Does anyone know how they query their data? Thanks Stephane
I would expect that with a clustered index on id alone, plus this non-clustered index with INCLUDE, you should get the same quick results. There is little point in having a multi-column index that starts with a unique column.