SQL Server Performance

Which has faster query performance: inner join or IN clause?

Discussion in 'T-SQL Performance Tuning for Developers' started by markandjenn, Aug 24, 2007.

  1. markandjenn New Member

    I have two tables:
    - itemloc - this table lists items and the stores in which they reside, with on-hand quantities, etc. There is a clustered index on item and loc. (12.5 million records)
    - item - this table lists all of the items with their descriptions, etc. There is a clustered index on item. (10 million records)
    I was wondering which way to query the data would give me the best-performing results. I could do a straight inner join, I could use an IN clause, or I could join one table to the other using a derived table. While I have performed various tests, the results do not provide an absolute answer, and I wanted to find out which method is fastest based on the design.
    In my tests, the inner join method was the fastest, followed by the inner join on the derived table, then the IN clause method. Is that correct?
    Right now, I am testing with very simplified examples, but should I expect the same results when querying against 12 - 15 million records?
    Below are the three queries I have executed.
    -- Straight Inner Join
    select il.item, count(loc) as locs, sum(il.on_hand) as OH
    from itemloc il
    inner join
    item i on
    il.item = i.item
    where i.item between 100000 and 200000
    group by il.item
    -- In Clause
    select il.item, count(loc) as locs, sum(il.on_hand) as OH
    from itemloc il
    where item in (
    select item
    from item
    where item between 100000 and 200000
    )
    group by il.item
    -- inner join with derived table
    select il.item, count(loc) as locs, sum(il.on_hand) as OH
    from itemloc il
    inner join (
    select item
    from item
    where item between 100000 and 200000
    ) i on
    il.item = i.item
    group by il.item
  2. simondm New Member

    Because you have a clustered index on the column you are joining to, I would expect the inner join to be quickest. But this may not always be the case! The SQL query optimiser can come to different conclusions based on the number of records, indexes, etc., so generally you have to experiment.
    With the examples you have provided, though, it wouldn't surprise me if the query plan is identical for the first two methods. You can test that by including the actual query plan.
    I personally tend to use inner joins for larger queries and IN for smaller recordsets (such as parameters in a report). When I find something performing unexpectedly I look further.
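    One minimal sketch of that comparison: run each of the three candidate queries between the ON and OFF statements below and compare the logical reads and CPU time reported in the Messages tab (or use SSMS's "Include Actual Execution Plan" instead).

    -- show I/O and timing figures for whatever runs between ON and OFF
    set statistics io on;
    set statistics time on;

    select il.item, count(loc) as locs, sum(il.on_hand) as OH
    from itemloc il
    inner join item i on il.item = i.item
    where i.item between 100000 and 200000
    group by il.item;

    set statistics io off;
    set statistics time off;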
  3. markandjenn New Member

    Unfortunately, the queries that I provided are simplified versions of the real queries. The real queries are actually executing as pass-through queries to our mainframe DB2 system, so the query plans are identical for all DB2 pass-through queries and they were not much help.
    As far as the derived table is concerned, can I assume that because the derived table does not have an index it would perform slower?
    Also, I was under the impression that if I joined 2 tables, SQL inherently has to join each record in the first table to each record in the second table and then filter the results based on the join. In other words, if I had 2 tables, one with 100 records and another with 10 records, SQL would ultimately create a table with all of the combinations of the two tables (a table with 1,000 records) and then filter down to the 10 records that match. Based on that logic, I assumed the IN clause would be faster. Maybe that is not true in the case of joining on a clustered index, and hence that is why it is faster. Is that correct?
    I appreciate your help.
  4. simondm New Member

    SQL should only join the relevant records. It will either scan through or seek to a record or records and join if applicable. It should not try to join the entire tables. This is provided the tables are all on the local server.
    I would expect the derived table to be slower for the reason you pointed out, and because it would cause additional load on tempdb - particularly if the table was large.
    In this case, are you joining to data held in DB2? SQL will copy the entire table from the remote source when you query across external sources (including SQL linked servers). If you are doing this, the IN method could offer much better performance because you control the amount of data returned from the external source.
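    Another way to control what comes back is to push the whole query to the remote side with OPENQUERY - a sketch, assuming a hypothetical linked server named DB2LINK and the table names from the earlier posts:

    -- the string is executed on the DB2 side; only the grouped result set crosses the wire
    select *
    from openquery(DB2LINK,
        'select il.item, count(il.loc) as locs, sum(il.on_hand) as OH
         from itemloc il
         inner join item i on il.item = i.item
         where i.item between 100000 and 200000
         group by il.item')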
  5. merrillaldrich New Member

    The only way to tell is to compare the execution plans. For any incoming query of any form, the optimizer will try to refactor the query into the fastest possible form that is logically equivalent. That means it will change "in" clauses into joins and subqueries into joins, rearrange things, etc. etc. as long as the rules it has at its disposal allow it to guarantee the logical equivalence of the result set. So, basically, when things are working right it makes no difference at all -- the optimizer will "rewrite" the query for you.
    The place where things get sticky is when you pass in a query you've constructed in a way that the optimizer can't apply all those rules -- because there's something in the query that makes those equivalence rules invalid. You've then, in effect, made an "unoptimizable" query. Some correlated subqueries are like that -- there is an equivalent that one could write as a join, but while the optimizer theoretically should rewrite the query, in reality it has no way to solve it and selects a slower method instead. This is also true with linked servers -- queries across linked servers often will not optimize because the optimizer can't use access methods on a remote table that it could locally. Instead it'll pull a large set over into temporary space or RAM and then run the query.

  6. markandjenn New Member

    Yes. Exactly. This is my problem. I am writing the queries through a linked server. In this case, my company's mainframe DB2 system. Both tables reside on the mainframe. Since I am using the linked server, I cannot see the true execution plan and I was soliciting advice on how to write the query. I had been experimenting with writing the query in three different ways and it seemed like linking the tables directly performed best. It seemed to increase performance by about 25%. I just found that surprising because of the size of the two tables that I was joining. I assumed that linking one table (in its entirety of 125 million records) to a subset of the other table (100,000 of the 10 million records) would have been faster, but it does not seem to be the case.

  7. Adriaan New Member

    DB2 is executing the query, so unless someone here knows DB2, you need to ask the same question on DB2 forums.
  8. alzdba Member

    From your options, I'd go for the INNER JOIN! (An in-list may materialize sub-results in the workdb (= tempdb).)
    There is another option: EXISTS.
    So if you don't need info from that table, and you only need to know whether there is matching data, use an EXISTS clause with a (correlated) subquery.
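    A sketch of that EXISTS form, using the same table and column names as the earlier examples:

    -- EXISTS only checks for a matching item row; no columns from item are returned
    select il.item, count(il.loc) as locs, sum(il.on_hand) as OH
    from itemloc il
    where exists (
        select 1
        from item i
        where i.item = il.item
        and i.item between 100000 and 200000
    )
    group by il.item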
  9. satya Moderator

    I would second Adriaan's suggestion here; as the execution is on the other provider, you have to check from that side too. Also check whether there are any network issues in getting results during this execution.
  10. alzdba Member

    Indeed, the check with the actual DB2 DBA may give a correct picture regarding what's going on.
    Sometimes when using linked servers, SQL Server may get a bit selfish and pull over more data than is actually needed by the query and then handle it locally.
    IMO, as long as you stick with SQL-92 standard coding, the guidelines are pretty much the same for SQL Server and DB2.
    Although I didn't get the full technical bits of DB2 V8 for OS/390, following common guidelines gets you at least through 90% of your queries' performance issues.
  11. satya Moderator

    .. just to add: if you are using old drivers for DB2, performance is the area most affected, so it is better to refer to the IBM site for up-to-date drivers.
