SQL Server Performance

Indexing

Discussion in 'Performance Tuning for DBAs' started by mcatet, Sep 29, 2005.

  1. mcatet New Member

    I have a table with columns C1-C10.

    PK = C1, C2, C3
    IX1 = C9
    IX2 = C1, C2, C3, C7, C8

    Typical queries are:

    SELECT C3, SUM(C7), SUM(C8)
    FROM MainTable MT
    INNER JOIN TableA A ON MT.C1 = A.C1
    INNER JOIN TableB B ON MT.C2 = B.C2
    INNER JOIN TableC C ON MT.C3 = C.C3
    WHERE A.A1 = 'A'
    AND B.B1 = 'B'
    AND C.C1 = 'C'
    GROUP BY C3

    SELECT C2, SUM(C7), SUM(C8)
    FROM MainTable MT
    INNER JOIN TableA A ON MT.C1 = A.C1
    INNER JOIN TableB B ON MT.C2 = B.C2
    INNER JOIN TableC C ON MT.C3 = C.C3
    WHERE A.A1 = 'A'
    AND B.B1 = 'B'
    AND C.C1 = 'C'
    GROUP BY C2

    SELECT C1, SUM(C7), SUM(C8)
    FROM MainTable MT
    INNER JOIN TableA A ON MT.C1 = A.C1
    INNER JOIN TableB B ON MT.C2 = B.C2
    INNER JOIN TableC C ON MT.C3 = C.C3
    WHERE A.A1 = 'A'
    AND B.B1 = 'B'
    AND C.C1 = 'C'
    GROUP BY C1

    I'm under the impression that if the PK has C1, C2, C3 and I query filtering on C1, C2, C3, C9, the PK and IX1 will work fine together.

    The question I have is: Does the covering index ever get used? Just because the query is summing C7 and C8, should they be part of an index? Does an index really help speed up a sum? It seems like they should not be part of the index.

    Thanks for your help,
    mcatet
  2. mmarovic Active Member

    quote:I'm under the impression that if PK has C1, C2, C3 and I query filtering on C1, C2, C3, C9 that the PK and IX1 will work fine together.
    I guess the filter is not c1 = a AND c2 = b AND c3 = c AND c9 = d, because c1, c2, and c3 already identify a single row. I can't tell unless you post the exact condition.

    Not asked, but it needs to be explained: if either condition
    B.B1 = 'B'
    AND C.C1 = 'C'
    is more restrictive than
    A.A1 = 'A'
    you may be better served by two non-clustered indexes, one on the c2 column and one on the c3 column. You also need them to speed up the FK constraint checks when data is inserted.
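
    For example (just a sketch, using the MainTable name and columns from the post above; the index names are made up):

    CREATE NONCLUSTERED INDEX IX_MainTable_C2 ON MainTable (C2)
    CREATE NONCLUSTERED INDEX IX_MainTable_C3 ON MainTable (C3)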


    quote:The question I have is: Does the covering index ever get used? Just because the query is summing C7 and C8 should they be part of an index? Does an index really help speed up a sum? Seems like they should not be part of the index.
    If A.A1 = 'A' filters out the most rows, then the best execution plan is one that uses that covering index. If not, with the addition of the indexes I mentioned, a better execution plan can be generated.
  3. mcatet New Member

    More detail:

    We have an application that stores retail sales data. It allows users to select certain date ranges, certain stores, and certain items to report on.

    There are many users that hit the same database.
    The following tables are populated with the the Items, Stores, Dates that a user has selected:

    zItems: (PK = SKUID, UserID) Clustered
    SKUID INT
    UserID VARCHAR(20)

    zStores: (PK = SKUID, UserID) Clustered
    StoreID INT
    UserID VARCHAR(20)

    zWDate: (PK = SKUID, UserID) Clustered
    WDate SMALLDATETIME
    UserID VARCHAR(20)

    They all contain similar data as:
    StoreID UserID
    1 Bob
    2 Bob
    3 Bob
    1 Sue
    4 Sue
    3 Jon
    4 Jon
    5 Jon

    Similarly, there are records like this in the other 2 tables holding lists of Items and Dates.

    In addition there is a sales table: (PK = WDate, StoreID, SKUID) Clustered
    WDate SMALLDATETIME
    StoreID INT
    SKUID INT
    UnitsSold SMALLINT
    ProdSales REAL
    UnitsOH SMALLINT
    SalesCode CHAR(2)
    ...

    The sales table also has these indexes:
    IX_Sales_SalesCode: (Non-Clustered)
    SalesCode

    IX_Sales_Covering: (Non-Clustered)
    WDate, StoreID, SKUID, UnitsSold, ProdSales


    This is a sample of one of our queries:


    SELECT Sales.SKUID, SUM(Sales.UnitsSold) AS UnitsSold, SUM(Sales.ProdSales) AS TotalSales
    FROM Sales
    INNER JOIN zStores ON Sales.StoreID = zStores.StoreID
    INNER JOIN zItems ON Sales.SKUID = zItems.SKUID
    INNER JOIN zWDate ON Sales.WDate = zWDate.WDate
    WHERE (zWDate.UserID = 'Jon')
    AND (zStores.UserID = 'Jon')
    AND (zItems.UserID = 'Jon')
    GROUP BY Sales.SKUID

    I tested this query using a filter on only 1 of the UserID fields. It is marginally faster if I use WHERE zStores.UserID = 'Jon'. Either of the other 2 filters (zItems or zWDate) alone is actually slower than using all three. So I'll implement the single filter by UserID from the zStores table.

    Since the PK has WDate, StoreID, SKUID, does the IX_Sales_Covering ever get used?
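
    One way to check is to capture the plan text for the query above and look for IX_Sales_Covering in the output; a rough sketch:

    SET SHOWPLAN_TEXT ON
    GO
    -- the report query from above
    SELECT Sales.SKUID, SUM(Sales.UnitsSold) AS UnitsSold, SUM(Sales.ProdSales) AS TotalSales
    FROM Sales
    INNER JOIN zStores ON Sales.StoreID = zStores.StoreID
    INNER JOIN zItems ON Sales.SKUID = zItems.SKUID
    INNER JOIN zWDate ON Sales.WDate = zWDate.WDate
    WHERE (zWDate.UserID = 'Jon')
    AND (zStores.UserID = 'Jon')
    AND (zItems.UserID = 'Jon')
    GROUP BY Sales.SKUID
    GO
    SET SHOWPLAN_TEXT OFF
    GO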

    Occasionally we filter by ProdSales (i.e. ...ProdSales > 0)

    Does this help a bit more?


  4. mmarovic Active Member

    quote:zStores: (PK = SKUID, UserID) Clustered
    StoreID INT
    UserID VARCHAR(20)

    zWDate: (PK = SKUID, UserID) Clustered
    WDate SMALLDATETIME
    UserID VARCHAR(20)
    Do you mean:

    zStores: (PK = StoreID, UserID) Clustered
    StoreID INT
    UserID VARCHAR(20)

    zWDate: (PK = WDate, UserID) Clustered
    WDate SMALLDATETIME
    UserID VARCHAR(20)
  5. mcatet New Member

    No, the PK on all three tables is both columns.

    SKUID and UserID
    StoreID and UserID
    WDate and UserID

    The sales table has a PK of 3 fields:
    WDate, StoreID and SKUID.

  6. mmarovic Active Member

    That's what I said. You had SKUID as part of the PK in each table, but in the latter two it should be as I wrote, right?
  7. mmarovic Active Member

    The whole design (table structure) looks strange to me. How do you know you haven't mixed up sales items for different users if UserID is not included in that table?
    Anyway, your covering index will probably be used, but depending on the values passed for the UserID column and on the data distribution, better performance can sometimes be gained with a covering index that has the same columns but orders them differently in the index definition.

    However, since I'm struggling to understand the business logic behind the table design, it is hard to comment on the index design.
  8. mcatet New Member

    Right, good catch on the typo.


    Without getting into the whole design of the application, I'll try to describe the usage of these tables better.

    zWDate, zStores, and zItems are Filter tables.

    A user selects report options from the UI. They select which weeks, stores and items to report on. The app populates the filter tables:

    zWDate: 1 record for every week in the date range selected (i.e. 9/24/2005, 'Jon')
    zStores: 1 record for every store selected (i.e. 1000, 'Jon')
    zItems: 1 record for every item selected (i.e. 2500, 'Jon')

    User 1 might select 1 store, 1 item and 1 week.
    User 2 might select 100 stores, 100 items and 100 weeks
    User 3 might select 1 store, all items, and 4 weeks
    and so on...

    All of these selections get inserted into the filter tables with the UserID.

    Then data is queried out of the Sales table based on the user's selections saved in the filter tables.

    So the Sales table is joined to the filter tables and the query is filtered on UserID to get the correct selections each user made.

    It works really well from a development standpoint because it is simple to manage user selections.

    However, I'm concerned that the indexes on the DB are incorrect. Specifically, that the covering index is not being used and is just wasting disk space.


  9. mmarovic Active Member

    Let's discuss 3 extreme cases.

    1. User 'Jon' selects one store ('Jon & sons'), all items and all weeks.

    One possible execution plan is to scan all rows in the Sales table and, for each row, find the related row (based on StoreID) in the zStores table. There may be a million rows in the Sales table but just 1000 rows for the specific store. That means you will search a million times for a related row in the zStores table and filter out all of them except the 1000 whose StoreID matches the 'Jon & sons' store. That is not very efficient.

    The alternative plan would be: select all rows in the zStores table that 'Jon' is interested in (here that's actually only one, the 'Jon & sons' store) and then find all rows in Sales whose StoreID belongs to that store. Without an appropriate index we would need to scan the whole Sales table and test, for each row, whether Sales.StoreID equals the StoreID of 'Jon & sons'. With an index on the Sales table starting with StoreID we can jump to the first entry (or close enough) with the appropriate StoreID. When we find it, either the index already contains all the columns we need (if it covers the query) or we have to read the actual data row for the columns not contained in the index. Each index entry carries a pointer to that row, implemented either as the clustered index key values (if the table has a clustered index) or as a pointer to the row's place in the heap. After that we scan the next 999 entries, which sit right after the previous one, and repeat the same steps. We stop scanning the index when we hit the first entry where StoreID > StoreID('Jon & sons'); since the entries are ordered by StoreID, we know every entry after that one will have a bigger StoreID than the one we are interested in. That way we read just 1000 entries for the Sales table and only one row from the zStores table. If an index starting with the StoreID column contains all the columns we need, there is no need to even touch the Sales data rows; we can just scan the index.

    2. User 'Jon' selects all stores, one item (the book "Inside Microsoft SQL Server 2000" by Kalen Delaney) and all weeks.

    The discussion is similar to the previous case, but we need an index on the Sales table starting with SKUID; an index starting with StoreID would not help. You can visualize an index entry as a string with the structure '<col1 value>-<col2 value>-...'. The complete index is a sorted list of such entries. It's a simplification, but a good enough approximation. If you need to find all entries for a specified <col1 value>, the index provides a way to jump to the first such entry (actually close enough), and then you just read subsequent entries until you find the first one different from the value you are looking for. If you want to find all rows with a specified value of col2, you have to scan the whole index, because col2 is in the middle of the string, so the entry order can't help you.

    3. User 'Jon' selects all stores, all items and only one week

    Following the same logic, your covering index (it starts with WDate) will be part of the most efficient execution plan.

    The conclusions are:

    1. Your covering index will allow the best execution plan when the condition on date is the most restrictive one.

    2. To have optimal search performance in other cases you need at least two more indexes -
    one starting with StoreID and another starting with SKUID.

    3. If the query posted is the most important/critical one for your app, then make all of them covering, which means adding indexes on:

    StoreID, WDate, SKUID, UnitsSold, ProdSales in that order and
    SKUID, WDate, StoreID, UnitsSold, ProdSales

    assuming you will rarely query without restricting the date.
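
    As T-SQL, those two extra covering indexes might look like this (the index names are made up):

    CREATE NONCLUSTERED INDEX IX_Sales_Store_Cover
        ON Sales (StoreID, WDate, SKUID, UnitsSold, ProdSales)

    CREATE NONCLUSTERED INDEX IX_Sales_SKU_Cover
        ON Sales (SKUID, WDate, StoreID, UnitsSold, ProdSales)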
  10. mcatet New Member

    Good. This makes sense and is explained well.

    Typically this is how records are selected.

    All stores (1500 of 1500)
    Some Items (2400 of 15000)
    Some Dates (13 of 104)

    If selecting 1 record (1 SKUID or 1 StoreID or 1 WDate) it is very selective.
    Selecting 1 SKUID, all stores and all dates means that there are 156,000 (1500 * 1 * 104) records.
    But a typical selection is a group, not a single value. So, as listed above, there are 46.8 million records (1500 * 2400 * 13).

    If there is a covering index with all the fields needed to avoid reading table rows for data, and it starts with SKUID because it is the most selective, how does that perform when there is a range of values?

    I can see how the index would work if I pick 1 SKU, but if I have 2400 SKUIDs that are not in order (for example, I choose SKUIDs 1, 19, 34, 59, 324, 838...) the index will not have all the SKUIDs I need in 1 block. Would it be better to have WDate (although less selective) as the first field in the index, since I know all dates in the range will be in order?

    mcatet
  11. mmarovic Active Member

    quote:If there is a covering index with all the fields needed to avoid reading table rows for data, and it starts with SKUID because it is the most selective, how does that perform when there is a range of values?

    I can see how the index would work if I pick 1 SKU, but if I have 2400 SKUIDs that are not in order (for example, I choose SKUIDs 1, 19, 34, 59, 324, 838...) the index will not have all the SKUIDs I need in 1 block. Would it be better to have WDate (although less selective) as the first field in the index, since I know all dates in the range will be in order?
    The explanation assumes the loop join algorithm. The 2400 SKUIDs will be found one by one, and then the process explained above will be repeated 2400 times. This is still much better than selecting all rows in the Sales table and, for each one, searching for the corresponding row in the zItems table to test the additional condition. There are other algorithms available, but for joining two tables of such different sizes my experience is that the loop join performs best. Anyway, it is (almost) always best to start with the most restrictive condition, so that the processing that follows is repeated for the minimal number of rows.

    On top of that, if you have a proper index on zItems, only 2400 rows will be accessed in that table.
  12. mmarovic Active Member

    Hmm, I didn't give a quite clear answer. The lookup tables (Stores, Items and Dates) are small, so the time to filter out rows from them will not impact overall performance much. Query performance depends more on how many rows will be processed after you select the row(s) in the lookup table you start with.
  13. mcatet New Member


    The application programmatically determines which join type to use based on the type of selection (how many stores, items, dates). Typically it uses a loop join. In my experience I agree that the loop join usually offers the best performance in this scenario, but some variations use a hash or merge join to get better performance. The join type is chosen based on extensive timing tests to determine the best-performing option for the user's selection.
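
    For reference, one common way to pin the join strategy per query is a hint; this is only a sketch of the idea, not necessarily how the application does it:

    SELECT Sales.SKUID, SUM(Sales.UnitsSold) AS UnitsSold, SUM(Sales.ProdSales) AS TotalSales
    FROM Sales
    INNER JOIN zStores ON Sales.StoreID = zStores.StoreID
    INNER JOIN zItems ON Sales.SKUID = zItems.SKUID
    INNER JOIN zWDate ON Sales.WDate = zWDate.WDate
    WHERE zStores.UserID = 'Jon'
    GROUP BY Sales.SKUID
    OPTION (LOOP JOIN)   -- or OPTION (HASH JOIN) / OPTION (MERGE JOIN), whichever the timing tests favor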


    Now, if I add the 2 additional covering indexes as you mentioned I will have 2 issues:

    1) Disk space. Adding 2 more multi-column indexes will use a large amount of disk space, since the Sales table has nearly 1 billion rows.
    2) Adding those indexes will take a very long time and will cause the tran log to grow significantly (perhaps filling the disk).

    First, how can I tell exactly how much disk space is used by the index? This will help me estimate how much would be used if I add the other indexes.
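
    (One rough way to get per-index figures on a SQL 2000-era server is the sysindexes view; this is only a sketch: the counts are in 8 KB pages, the clustered-index row also includes the data pages, and the numbers can be stale until DBCC UPDATEUSAGE is run.)

    SELECT name,
           indid,
           reserved * 8 AS reserved_kb,
           used * 8 AS used_kb
    FROM sysindexes
    WHERE id = OBJECT_ID('Sales')
      AND indid BETWEEN 1 AND 250   -- skip the heap (indid 0) and text/image (indid 255) rows
    ORDER BY indid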

    Second, is there anything I can do to speed up making index changes on the Sales table (and any table, for that matter)?


  14. mcatet New Member

    I'm looking at the results from sp_spaceused:

    name    rows        reserved        data            index_size     unused
    Sales   930061642   207830880 KB    111610488 KB    96220432 KB    -40 KB


    It shows the size of the data in the table is 111 GB and the size of the indexes on the table is 96 GB, a total of over 200 GB, when the size of the entire DB is only 130 GB.

    What's up with sp_spaceused?
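
    (A likely suspect is stale space accounting in sysindexes; a sketch of refreshing it before re-checking, assuming it is acceptable to run this against the server:)

    DBCC UPDATEUSAGE (0, 'Sales')   -- 0 = current database
    EXEC sp_spaceused @objname = 'Sales', @updateusage = 'TRUE'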

  15. mmarovic Active Member

    If you typically select all stores, maybe you can have an index on this column alone, just for checking the foreign key (and just in case someone selects only a few stores).

    The Sales table has almost a billion rows!!! Does anyone really need all these rows? What is the b-tree depth of the existing indexes?
    If you really need all these rows, you may consider partitioning the Sales table.

    Besides that, you may put indexes in different filegroups on separate physical drives if available.
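
    For example (a sketch; the filegroup name is a placeholder and would have to exist already):

    CREATE NONCLUSTERED INDEX IX_Sales_SKU_Cover
        ON Sales (SKUID, WDate, StoreID, UnitsSold, ProdSales)
        ON [FG_SalesIndexes]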

    I hope Derick or Joe will answer this question; they are much better at that stuff than me.
  16. mmarovic Active Member

    For b-tree depth use the function indexProperty(object_id('Sales'), '<index name>', 'indexDepth').
  17. mcatet New Member

    select indexProperty(object_id('Sales'), 'PK_Sales', 'indexDepth') Returns 3
    select indexProperty(object_id('Sales'), 'IX_Sales_Cover', 'indexDepth') Returns 4

    Unfortunately they do "need" the data. Some of the older data will be purged, but it is a fairly small percentage.

    The application performance is quite good considering the amount of data. Most of the reports run in the 12-40 second range. Some others take longer, but they are doing a lot more work. So at this point we probably do not need to worry about partitioning. We have considered it, but have not really pursued it. I've thought about testing it on one of the smaller versions of this application but haven't had time.

  18. mmarovic Active Member

    Since you have a billion-row table (I never worked on a table that size; the most I had was 400+ million rows), other things have to be taken into consideration.

    How do you populate this table?
    What is your maintenance window?
    How do you handle index fragmentation?

    I guess you don't have FKs defined, and that's the right decision for this type of processing.

    Since you are fine with the query performance, I would be conservative with changes on a table that size.

    After revisiting your table and index design and your typical query description, I think you have a pretty good design (and that conclusion is supported by the good performance you get).

    1. Queries where the condition on date is the most restrictive are covered by the primary key, because it starts with the WDate column.
    2. Queries where the condition on SKUID is the most restrictive are covered by the covering index.
    3. You said typically all stores are selected, so it looks like you can live without a covering index starting with the StoreID column.

    I think we have come full circle and concluded that your index design is pretty good.

    The only improvement I see there is that I would use smallint instead of int for SKUID and StoreID. The smaller the data type, the better the performance.
  19. mmarovic Active Member

    Sorry, I made a mistake. You actually don't have an index starting with SKUID; you have two indexes starting with WDate. In that case I would try the configuration I mentioned in the previous post.

    Your covering index is used because it is more efficient than the primary key when the condition on WDate is the most restrictive, but the clustered index can still give you good enough performance in that case.

    So to summarize my proposal:

    1. Make SKUID and StoreID smallint.
    2. Instead of the existing covering index, create a covering index starting with SKUID, so you will have enough space.
    3. Test the new index configuration in a development environment before applying it on the prod server.
    4. If the new configuration proves better, plan a day or two of downtime.
  20. mmarovic Active Member

    About the covering index, there are a few possibilities to test:

    1. SKUID, WDate, StoreID, UnitsSold, ProdSales
    2. SKUID, WDate, UnitsSold, ProdSales. That may be faster because your typical query selects all stores, so no filtering on store is typically needed, and the StoreID value is part of the clustered index, which means it is contained in the leaf level of each non-clustered index. That index will be narrower than the first one, so it may perform better for typical queries.
    3. SKUID, UnitsSold, ProdSales will probably be the slowest one, but I would test it anyway.

    To test, you need a big enough extract of the production table on a development server, and a query set from a trace run on the production server.
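
    As T-SQL, the three candidates might look like this (the names are made up; only the winner would be kept):

    CREATE NONCLUSTERED INDEX IX_Sales_SKU_Cover_A
        ON Sales (SKUID, WDate, StoreID, UnitsSold, ProdSales)

    CREATE NONCLUSTERED INDEX IX_Sales_SKU_Cover_B
        ON Sales (SKUID, WDate, UnitsSold, ProdSales)

    CREATE NONCLUSTERED INDEX IX_Sales_SKU_Cover_C
        ON Sales (SKUID, UnitsSold, ProdSales)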
  21. mcatet New Member

    I did some index testing on a dev version of the DB. I captured timings for several different index variations where I changed indexes, order, added, removed and so on.

    I found that although some of the other index variations used considerably less disk space, the original was the most effective in terms of performance. What I have yet to test is changing the column order of the original indexes (PK and Covering).

    Since I have several applications that are the same design (both DB and application), just with different data (it makes development so much easier), I need to make sure the DB structure is the same for all versions. One of my apps has more than 32,000 SKUs, so I'll have to leave SKUID alone. However, StoreID can be changed to a smallint. If a retail chain has more than 32,000 stores, I have more problems than just indexing.

    Thanks mmarovic. You have been very helpful.
