SQL Server Performance

Optimizing index for select count(distinct)

Discussion in 'SQL Server 2005 T-SQL Performance Tuning' started by Steph2004, Jun 12, 2008.

  1. Steph2004 New Member

    Hi,
    I have a table logging our website activity with about 60 M of rows. One user has for itself more than 1 M rows. When I do a count(distinct), it takes two seconds. I was wondering if there is a way to speed that up?
    Here's the table
    CREATE TABLE [dbo].[tbl_logs](
    [id] [int] NOT NULL,
    [dateAdded] [datetime],
    [fk_tbl_visits_id] [int],
    [fk_tbl_users_id] [int]
    ) ON [PRIMARY]
    I tried different combinaison of indexes, but here the index that performed the best.
    CREATE NONCLUSTERED INDEX [idx_visits] ON [dbo].[tbl_logs]
    (
    [fk_tbl_users_id] ASC,
    [dateAdded] ASC,
    [fk_tbl_visits_id] ASC
    )WITH (PAD_INDEX = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, IGNORE_DUP_KEY = OFF, ONLINE = OFF) ON [PRIMARY]
    Curiously, that clustered index was never taken by the query optimizer when the one above was created. I thought clustered indexed would be better for that query...
    CREATE CLUSTERED INDEX [idx_primary] ON [dbo].[tbl_logs]
    (
    [fk_tbl_users_id] ASC,
    [dateAdded] ASC,
    [fk_tbl_visits_id] ASC
    )WITH (PAD_INDEX = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, IGNORE_DUP_KEY = OFF, ONLINE = OFF) ON [PRIMARY]
    And here's the query that return ~1 M rows.
    select count(distinct fk_tbl_visits_id) as nb from tbl_logs
    where fk_tbl_users_id = 10203 and dateadded between '2008-01-01 00:00:00' and '2008-05-31 23:59:59:999'
    The execution plan shows an index seek of 40% and next, a hash match taking 55%.
    Any idea of how to reduce the execution time?
    Thanks
    Stephane
  2. techbabu303 New Member

    Drop all the indexes and create clustered index on uinque column [fk_tbl_visits_id ] , nonclustered index on date column [dateAdded] , check the performances.
    Reference:
    http://searchsqlserver.techtarget.com/tip/0,289483,sid87_gci1299403,00.html#
    Cheers
    Sat
  3. Adriaan New Member

    Good suggestion by Sat - except that [fk_tbl_visits_id] is probably not a unique column. Your table definition does not say which column is unique - quite possibly it's [id].
    Make the unique column your clustered index, and add (non-clustered) indexes on the columns that actually interest you.
  4. Steph2004 New Member

    As a mater of fact, the id field is unique.
    If I understand well, the clustered index should be on increasing values (like date or identity) and it's better for range search. So in my case, I should create an clustered index on date since I use a lot of between?
    But the problem is, even though it's a date time, there are some row that has the same date so it can't be unique.
    So far, I've made some tests (clustered on Id, indexes on date and user id) and it's slower than two seconds (6-8 seconds in fact). But I'll keep trying. I'm sure there's a way to speed that up!
    Thanks
    Stephane
  5. Adriaan New Member

    Not sure that range searches should be considered as the main argument about which index should be clustered. A non-clustered index with regular maintenance should be sufficient.
    The clue behind clustered indexes is that all non-clustered ones are based on them. As long as new values are added with increasing values on the clustered index, it helps keep the other indexes in good shape.
  6. Steph2004 New Member

    Hi,
    Finally, after a lots of tests, this is the fastest index (note the include column). It executes the query in 2 seconds.
    CREATE NONCLUSTERED INDEX [idx_date] ON [dbo].[tbl_logs]
    (
    [fk_tbl_users_id] ASC,
    [dateAdded] ASC
    )
    INCLUDE ( [fk_tbl_visits_id]) WITH (PAD_INDEX = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, IGNORE_DUP_KEY = OFF, ONLINE = OFF) ON [PRIMARY]
    I tried many combination of clustered (id, date which is not unique, date + id) index but it always took 6 seconds or more.
    I also tried separate indexes (one for user id, one for the date and one for the visit id) but it was also very long (15 seconds)
    I'm now short of ideas to find out how to reduce execution time.
    As a comparaison when I check Google Analytics, it process reports in less than a second. I know we don't have Google servers, but we don't have as much data too. Does anyone know how they query their data?
    Thanks
    Stephane
  7. Adriaan New Member

    I would expect that with a clustered index on id alone, plus this non-clustered index with INCLUDE, you should get the same quick results.
    There is little point in having a multi-column index that starts with a unique column.

Share This Page