SQL Server Performance

Tuning T-SQL for Table contain 25 million Records

Discussion in 'Performance Tuning for DBAs' started by sonnysingh, Aug 24, 2006.

  1. sonnysingh Member

    I have select query for this table which have 25 millions records. require response time is 5 minutes max.Conditions (where clause) using three columns from this table
    column1 - varchar
    column2 - varchar
    column3 - datetime
    I know index must create on this column either covered index or composite cluster index. I just want to know make sure that
    1. index approach to this query to speed up the query
    2. any other thing need to use in order to achieve require response
    3. data is already there so how long will take to create indexon this volume of data.

    Thanks in advance.. guys..
  2. Adriaan New Member

    How many rows are you returning? IOW, how long does it take to do

    SELECT COUNT(*) FROM table WHERE ...........
  3. sonnysingh Member

    Thanks Adrian... Like said I have the query ... it is taking too long (1:35 minutes)..I mentioned as above I still need to create index on those three fields that I have mentioned above. I am looking for suggestion on this.. such as what would be the ideal approach to achieve desire time..

    regards
  4. FrankKalis Moderator

    ???
    You write the required response time is 5 Minutes and now are 1:35 too long? Can you please post the table structure, indexes and the query in question?

    --
    Frank Kalis
    Moderator
    Microsoft SQL Server MVP
    Webmaster:http://www.insidesql.de
  5. Luis Martin Moderator

    What about inserts frequency?


    Luis Martin
    Moderator
    SQL-Server-Performance.com

    All in Love is Fair
    Stevie Wonder


    All postings are provided “AS IS” with no warranties for accuracy.



  6. sonnysingh Member

    Thanks guys

    I have asked that " I have select query for this table which have 25 millions records. required response time is 5 minutes max.

    Frank...Currently It is 1 hour and 35 minutes...There is no index on this table and that's what I am asking suggestion for.

    It is simple select query like
    SELECT columnA, columnB, columnC, columnD
    FROM Sales
    WHERE LOCID=25, SuppID=1003356, Perioddate between '02/01/2004' and '02/01/2005'

    Luis... It is select operation tuning I am looking for this table...(yeh.. Insert frequency of approx 50000 rows once in the month).

    reason I am asking co it is very large table and my prime concern is respone time of select query..

    regards


  7. Luis Martin Moderator

    Well, take a look of execution plan, but a non cluster index on Locid(I guess columnA), suppId and Pridoddat will help.
    About how long will take to create this index, I don't know. It's deppends on many factors inclusive hardware.


    Luis Martin
    Moderator
    SQL-Server-Performance.com

    All in Love is Fair
    Stevie Wonder


    All postings are provided “AS IS” with no warranties for accuracy.



  8. SQL_Guess New Member

    quote:Originally posted by sonnysingh

    I have select query for this table which have 25 millions records. require response time is 5 minutes max.Conditions (where clause) using three columns from this table
    column1 - varchar
    column2 - varchar
    column3 - datetime
    I know index must create on this column either covered index or composite cluster index. I just want to know make sure that
    1. index approach to this query to speed up the query
    2. any other thing need to use in order to achieve require response
    3. data is already there so how long will take to create indexon this volume of data.

    Thanks in advance.. guys..


    Sonny,

    Please post the where criteria for the query. Specifically: are all of the criteria going to be ='s, or are there like's or >'s etc.

    Secondly, do you know much about the selectivity of the different crieria you are likely to be applying? What you want, in your index, is for the 1st column used to cut down the number of possible matches by as much as possible. For example, lets say you where querying for all Men, with Green Eyes, born after 1900-01-01.

    Probably, Men would cut the number of rows in half (i.e. 50% of data);
    Probably, almost all your customers would have been born after 1900-01-01 ((i.e. 99% of data)
    Possibly, only 1/5th of the customer would have green eyes (i.e. 20% of data)

    Then, you would want the index on the following order:
    (EyeColour,Gender,BirthDate)
    since this would allow the 1st column to reduce the possible data by as much as possible.

    Does that help?

    what you may want to run, if you are uncertain of your selectivity for the column1,column2,column3 might be something as simple as:
    select count(*) from Table (nolock)
    select count(distinct column1) from Table (nolock)
    select count(distinct column2) from Table (nolock)
    select count(distinct column3) from Table (nolock)

    The bigger the number for the coumn, the greater the selectivity.

    there are a number of factors that can affect this, like skewed data, etc.

    HTH

    Panic, Chaos, Disorder ... my work here is done --unknown
  9. FrankKalis Moderator

    quote:Originally posted by sonnysingh


    Frank...Currently It is 1 hour and 35 minutes...There is no index on this table and that's what I am asking suggestion for.
    I guessed that, but it wans't really clear, so I thought, I'd ask.

    When you say, once a month 50,000 rows are loaded, it sounds like some kind of reporting and/or datawarehousing table.
    If that is the case, I would probably put the clustered index on (PeriodDate) or even (PeriodDate, SuppID, LOCID).

    I don't know how long it takes to (re)build the index, but I would certainly schedule this for non-working hours.


    --
    Frank Kalis
    Moderator
    Microsoft SQL Server MVP
    Webmaster:http://www.insidesql.de
  10. Roji. P. Thomas New Member

    What is your PK?
    What are other UNIQUE keys?
    What is the Selectivity of the columns LOCID, SuppID, Perioddate
    What is the RowSize?


    Roji. P. Thomas
    Microsoft SQL Server MVP
    http://toponewithties.blogspot.com
  11. sonnysingh Member

    Here it is the details..

    Rowsize
    Table Name MB Estimate Total Pages #Data Pgs #Clustered Idx Pgs #NonClustered Idx Pgs
    --------- ----------- ----------- ----------- ------------------ ---------------------
    Stock 0.22 28 1 1 26

    PK - perioddate (as suggested by Frank)
    SuppID - non-clustered index
    LOCID - non-clustered index

    Frank...It's took 1:35 hour to create cluster index only

    Query is:
    Select SuppID, LocID, perioddate
    from stock
    where SuppID=103 and LOCID=3564 and stocktypeid=65
    and perioddate >= '20050101' and perioddate < '20060201'

    This DBCC SHOWCONTIG result set:
    DBCC SHOWCONTIG scanning 'Stock' table...
    Table: 'Stock' (389888756); index ID: 1, database ID: 7
    TABLE level scan performed.
    - Pages Scanned................................: 1126313
    - Extents Scanned..............................: 141211
    - Extent Switches..............................: 141210
    - Avg. Pages per Extent........................: 8.0
    - Scan Density [Best Count:Actual Count].......: 99.70% [140790:141211]
    - Logical Scan Fragmentation ..................: 1.13%
    - Extent Scan Fragmentation ...................: 0.12%
    - Avg. Bytes Free per Page.....................: 2409.1
    - Avg. Page Density (full).....................: 70.24%
    DBCC SHOWCONTIG scanning 'Stock' table...
    Table: 'Stock' (389888756); index ID: 26, database ID: 7
    LEAF level scan performed.
    - Pages Scanned................................: 149067
    - Extents Scanned..............................: 18699
    - Extent Switches..............................: 18698
    - Avg. Pages per Extent........................: 8.0
    - Scan Density [Best Count:Actual Count].......: 99.65% [18634:18699]
    - Logical Scan Fragmentation ..................: 0.87%
    - Extent Scan Fragmentation ...................: 0.10%
    - Avg. Bytes Free per Page.....................: 2420.0
    - Avg. Page Density (full).....................: 70.10%
    DBCC SHOWCONTIG scanning 'Stock' table...
    Table: 'Stock' (389888756); index ID: 27, database ID: 7
    LEAF level scan performed.
    - Pages Scanned................................: 256395
    - Extents Scanned..............................: 32227
    - Extent Switches..............................: 32226
    - Avg. Pages per Extent........................: 8.0
    - Scan Density [Best Count:Actual Count].......: 99.45% [32050:32227]
    - Logical Scan Fragmentation ..................: 0.00%
    - Extent Scan Fragmentation ...................: 0.10%
    - Avg. Bytes Free per Page.....................: 2396.0
    - Avg. Page Density (full).....................: 70.40%
    *****************************************************************

    seems to all three indexes are fine(Scan Density-near to 100%). but still in doubt that how will select query select the index to perform at it's best speed.

    SQL_Guess.. thanks for your explanation.. really comprehensive...now you can advice me on selectivity of index on which columns should require.

    regards
  12. SQL_Guess New Member

    can you run the following and give us the results:

    select count(*) from Table (nolock)
    select count(distinct perioddate ) from Table (nolock) where perioddate >= '20050101' and perioddate < '20060201'
    select count(distinct SuppID) from Table (nolock)
    select count(distinct LOCID) from Table (nolock)
    select count(distinct stocktypeid) from Table (nolock)

    Those different numbers will help us determine the selectivity.
    Unfortunately, with such a wide range on the dates, I'm think that access via the PK would not be our preferred path.
    Also, I'm guessing that the values you've shown us for the other criteria are samples, or will those be static?




    Panic, Chaos, Disorder ... my work here is done --unknown
  13. sonnysingh Member

    Here it is the result set....

    select count(*) from stock (nolock) = 38459176
    select count(distinct perioddate ) from stock (nolock)
    where perioddate >= '20040801' and perioddate < '20060131' = 51
    select count(distinct SuppID) from stock (nolock) =1 (3:34)
    select count(distinct LocID) from stock (nolock) = 229
    select count(distinct stocktypeid) from stock (nolock) =7

    regards

  14. SQL_Guess New Member

    quote:Originally posted by sonnysingh

    Here it is the result set....

    select count(*) from stock (nolock) = 38459176
    select count(distinct perioddate ) from stock (nolock)
    where perioddate >= '20040801' and perioddate < '20060131' = 51
    51 is an unexpected date, since you said that perioddate was the PK, and was thus unique.

    quote:Originally posted by sonnysingh
    select count(distinct SuppID) from stock (nolock) =1 (3:34)
    is the 3:34 to show how long it took to run?

    quote:Originally posted by sonnysingh
    select count(distinct LocID) from stock (nolock) = 229
    select count(distinct stocktypeid) from stock (nolock) =7
    regards

    selectivity ratings (assumming even distributions):
    perioddate = 51 matching rows (since perioddate is unqiue as PK)
    LocID = (1/229) * (38459176) = 167 944 matching rows
    stocktypeid = (1/7) * (38459176) = 5 494 168 matching rows
    SuppID = (1/1) * (38459176) = 38 459 176 matching rows

    Since it seems that perioddate is the most selective (51 rows of 38459176 is smallest), then LocID, then stocktypeid. SuppId is virtually useless, since it is on every row.

    So, for your query:


    Select SuppID, LocID, perioddate
    from stock
    where SuppID=103 and LOCID=3564 and stocktypeid=65
    and perioddate >= '20050101' and perioddate < '20060201'


    I would recommend the following index:


    Create Index CIX_Stock
    on Stock
    (perioddate ,LOCID ,stocktypeid ,SuppID )


    The question is whether, with wider date ranges, the perioddate will be less selective, and how efficient it will be, since it is a range in the criteria.

    Can you create that index and see what the estimated access path is for the query?

    Panic, Chaos, Disorder ... my work here is done --unknown
  15. mmarovic Active Member

    When you reach certain number of rows, the query performance depends exclusively on number of rows selected not on number of rows in the table. What I wont to say is: Probably there won't be difference in execution of query returning 1000 rows out of 10 million and the same query returning the same number of rows (1000) out of 50 milion. At some point (when you reach 100 milion for example) the same query will become slower but not much and after that you may expect the same performance regardless of the table growth, as long as you have enough maintenance window to take care about index helth. The reason is that b-tree depth will stay the same...
  16. SQL_Guess New Member

    I think I see what you're saying - effecively, the optimizer recognizeds there isn't a significant difference between 0.00001 % and .00002%, and since at that level the b-tree is X number of levels deep, it's going to take a significant amount of rows to increase the depth of all leaf-pages, which would only add 1 level to the depth. That sounds more-or-less right?

    BTW, given the information Sonny has given us, would you agree with the proposed index?

    CiaO

    Panic, Chaos, Disorder ... my work here is done --unknown
  17. mmarovic Active Member

    quote:Originally posted by SQL_Guess

    I think I see what you're saying - effecively, the optimizer recognizeds there isn't a significant difference between 0.00001 % and .00002%, and since at that level the b-tree is X number of levels deep, it's going to take a significant amount of rows to increase the depth of all leaf-pages, which would only add 1 level to the depth. That sounds more-or-less right?
    The b-tree depth is not taken into account by query optimizer (see Joe's articles). I assumed the execution plan is the same so the only measurable difference would be when query access the table using index seek, especially in nested loop. How much difference it makes depends on the index depth (there is index_property function or something like that that will tell you what is the depth). The deeper is index the less difference it makes, as you mentioned in second part of your discussion. Also you are right, it takes exponentialy bigger number of rows to reach the next depth level, that's exactly what I had in mind.


    quote:
    BTW, given the information Sonny has given us, would you agree with the proposed index?
    I think Luis's recommendation was the best based on the information available at the time. For just this query I would go with: locId, stockTypeId, supID and datePeriod in that order.

    The reason is that we have = on the first 3 columns, first being more selective so it can be potentialy better used in other queries. We have range on date period so with condition in query we can jump to the item where: LOCID=@locID and stocktypeid=@stockTypeID and supID = @supId and datePeriod = @datePeriodFrom. Then leaf level will be scanned until item with datePeriod > @datePeriodTo or locId > @locID or stockTypeId > @stockTypeID or supID > @supID is read. That item will be filtered out and the query will be completed because all values after can't satisfy the query condition since index leafs are ordered by the columns mentioned. SuppID was included in the index to avoid bookmark lookup even though there is just one value for that column in the table.
  18. sonnysingh Member

    It is really getting intersting for me guys..It is learing curve in it.
    I have created the Cluster index(not unique) like this (as u see it is different form previous one that I have mentioned before but ofcourse according to the rule described above):

    CREATE CLUSTERED INDEX (CIDX_Stock) ON Stock
    (Locid, ItemCode, Suppid, PeriodDate)

    If I run the following query..it is take 4:36 minutes (It is different query)

    Select LocID, ItemCode, SuppID, perioddate
    from stock
    where SuppID='103' and ItemCode='10001193' and LOCID='3564'
    and perioddate >= '20040801' and perioddate < '20040930'

    Now.., if I need to run this query it's took 6:28 minutes and show 'Clustered Index Scan'.
    Select ItemCode, perioddate
    from stock
    where ItemCode='10001193'and perioddate >= '20040801' and perioddate < '20040930'

    why it's take longer time than expected (expectation is few seconds)time cos it is simple query and select coulmns are part of cluster index as well.


    Guys.. you have mentioned some of complex terms which I have to understand...

    quote:How much difference it makes depends on the index depth (there is index_property function or something like that that will tell you what is the depth). The deeper is index the less difference it makes
    .. what is mean by this???

    regards


  19. Adriaan New Member

    You have to look at the order of the columns in your clustered index. In your second query, there is no filtering on the first column, LocId, which means SQL Server has to go through each LocId branch to find the ItemCode and PeriodDate values.

    So your clustered index is covering for your first query, but not for the second. I would remove LocId from the clustered index, and add a separate index on LocId.
  20. sonnysingh Member

    Thanks Adriaan....

    I have also identity column in this table.. would be good idea to create index on identity column in this case????
  21. Adriaan New Member

    The general consensus seems to be that an identity column is the ideal choice for a clustered primary key (or clustered unique index). Any other (unique) indexes will be based on the clustered index and provide good performance.

    Assuming this is a production system, this type of change is a big thing, and should not be taken lightly.
  22. mmarovic Active Member

    Let me quote my previous answer:<br /><blockquote id="quote"><font size="1" face="Verdana, Arial, Helvetica" id="quote">quote:<hr height="1" noshade id="quote">I think Luis's recommendation was the best based on the information available at the time. For just this query I would go with: locId, stockTypeId, supID and datePeriod in that order.<br /><hr height="1" noshade id="quote"></font id="quote"></blockquote id="quote"><br />and now what Luis recommended:<br /><blockquote id="quote"><font size="1" face="Verdana, Arial, Helvetica" id="quote">quote:<hr height="1" noshade id="quote"><br />Well, take a look of execution plan, but a <b>non cluster index</b> on Locid(I guess columnA), suppId and Pridoddat will help.<br /><hr height="1" noshade id="quote"></font id="quote"></blockquote id="quote"><br />So what I had in mind is <b>non-clustered index</b> on columns mentioned in order mentioned.<br /><br />More proof I had in mind non-clustered index is: [<img src='/community/emoticons/emotion-1.gif' alt=':)' />]<br /><br /><blockquote id="quote"><font size="1" face="Verdana, Arial, Helvetica" id="quote">quote:<hr height="1" noshade id="quote"><br />SuppID was included in the index to avoid bookmark lookup even though there is just one value for that column in the table.<hr height="1" noshade id="quote"></font id="quote"></blockquote id="quote"><br /><br />Sorry for forgetting to make it explicit.<br /><br />Btw, I am talking about the ideal index for the next query:<br /><pre id="code"><font face="courier" size="2" id="code"><br />Select SuppID, LocID, perioddate <br />from stock<br />where SuppID=103 and LOCID=3564 and stocktypeid=65and perioddate &gt;= '20050101' and perioddate &lt; '20060201'<br /></font id="code"></pre id="code"><br />However, is it the really the query you had in mind? If so, why do you return suppID and LocID when you know them and their value is the same for each row returned (suppID = 103, locID = 3564)? Isn't it enough to return just periodDate?
  23. mmarovic Active Member

    quote:
    Guys.. you have mentioned some of complex terms which I have to understand...

    quote:
    --------------------------------------------------------------------------------
    How much difference it makes depends on the index depth (there is index_property function or something like that that will tell you what is the depth). The deeper is index the less difference it makes
    --------------------------------------------------------------------------------

    .. what is mean by this???
    If the b-tree had 3 levels and because of table growth it now has 4 levels it means for each index seek you will have to read 4 nodes instead of 3. That may slow down the seek by 33%. Going from 4 to 5 may slow down the seek by 25%. However, I asked Joe if he measured the difference and he told me that actual difference is even smaller for some reason.
  24. sonnysingh Member

    Thanks..mmarovic
    Let me ask this which bothering me from sometime as well. see the following two queries.

    quote:
    Select LocID, ItemCode, SuppID, perioddate
    from stock
    where SuppID='103' and ItemCode='10001193' and LOCID='3564'
    and perioddate >= '20040801' and perioddate < '20040930'


    Select ItemCode, perioddate
    from stock
    where ItemCode='10001193'and perioddate >= '20040801' and perioddate < '20040930'

    If I have create Cluster Index(not unique) on LocID, ItemCode, SuppID, perioddate columns in the table for first query.

    And then create another Index (non clustered) on ItemCode, perioddate columns for second query.. Then how query optimizer will take this..I mean will query optimizer looking for ItemCode, perioddate columns index from first index or second index.. I ma just curious about it ad It will also help me to create appropriate indexes on useful columns.

    regards
  25. CedricL New Member

    Have you given some thought to maybe creating a partitioned view on that table .That will significantly increase your response time

Share This Page