SQL Server Performance

Distinct rows with first occurance & create table

Discussion in 'T-SQL Performance Tuning for Developers' started by surendrakalekar, Apr 25, 2005.

  1. surendrakalekar New Member

    I have one tables called Codes which contain following fields.
    CODENAME(char 12), (CODEVALUE int)
    This table has no primary or unique key. This table contains near about 1 million rows. like this.
    'Business', 105
    'Business', 101
    'Business', 104
    'Market', 200
    'Market', 201
    'Employee' 340
    'Employee' 359
    'Employee' 330
    'Employee' 341
    By using this table I want to create another table (Select * into) which should contain only distict rows and the first occurance of that CODENAME.. so my output will be like this
    'Business', 105
    'Market', 200
    'Employee' 340

    How should I write the t-sql statement? I am trying with Select Into. You can suggest me any other combination in t-sql statement
  2. mmarovic Active Member

    You don't have indexes and you didn't mention any sorting criteria, so first occurence may not be what you expect since if you don't use order by any specific order is not guaranteed. If you were lucky and rows are stored in physical order they are inserted you have a chance to have a table that meets your requirements. You can use the code bellow (with no guarantee):
    create table #t(
    tID int identity(1,1),
    CodeName char(12),
    CodeValue int
    )
    --
    insert into #t(CodeName, CodeValue)
    select CodeName, CodeValue
    from Codes
    --
    create index idx_CodeName_tID on #t(CodeName, tID)
    --
    delete from a
    from #t a
    join #t b on a.CodeName = b.CodeName and a.tID > b.tID
    --
    insert into NewTable(CodeName, CodeValue)
    select CodeName, CodeValue
    from #t
    It can also be done without delete statement. Instead you can use select from temp table joined to derived table that groups by CodeName and has min(tID) aggregate.

    Edited: I had wrong join condition and index is better to create after insert.
  3. ranjitjain New Member

    i feel u should follow these steps as select into runs slower.
    1>create table table_distinct
    2> Insert Query

    insert into table_distinct
    select top 1 * from test1 where itemid='Business'
    union all
    select top 1 * from test1 where itemid='market'
    union all
    select top 1 * from test1 where itemid='employee'
    Go

  4. FrankKalis Moderator

    A TOP 1 without an ORDER BY is meaningless and not guaranteed to produce the desired result. The main flaw here is the missing PK. Since you want the first inserted row for each codename, which unfortunately is not the MIN(codevalue), you might have luck in adding an IDENTITY column to your table and see if this works:


    CREATE TABLE #showme
    (
    codename CHAR(12)
    , codevalue INT
    )
    INSERT INTO #showme
    SELECT 'Business', 105
    UNION ALL
    SELECT 'Business', 101
    UNION ALL
    SELECT 'Business', 104
    UNION ALL
    SELECT 'Market', 200
    UNION ALL
    SELECT 'Market', 201
    UNION ALL
    SELECT 'Employee', 340
    UNION ALL
    SELECT 'Employee', 359
    UNION ALL
    SELECT 'Employee', 330
    UNION ALL
    SELECT 'Employee', 341
    ALTER TABLE #showme
    ADD ID INT IDENTITY
    GO
    SELECT * FROM #showme t1
    WHERE ID=
    (SELECT MIN(ID) from #showme t2
    WHERE t1.codename = t2.codename)
    DROP TABLE #showme


    codename codevalue ID
    ------------ ----------- -----------
    Business 105 1
    Employee 340 6
    Market 200 4

    (3 row(s) affected)

    I'm not sure if this also works on a table with 1 million rows. But hopefully you've learned that *each* table should have a PRIMARY KEY.


    --
    Frank Kalis
    SQL Server MVP
    http://www.insidesql.de
  5. deepakontheweb New Member


    select distinct(codename) into #test1 from test

    --------- if you like cursor for this---------

    declare @Vcodename char(12)

    DECLARE Test_Cursor CURSOR FOR
    SELECT codename
    FROM #test1

    OPEN Test_Cursor

    FETCH NEXT FROM Test_Cursor into @Vcodename
    WHILE @@FETCH_STATUS = 0
    BEGIN
    select top 1* from test where codename =@Vcodename

    FETCH NEXT FROM Test_Cursor into @Vcodename
    END

    CLOSE Test_Cursor
    DEALLOCATE Test_Cursor


    Deepak Kumar

    --An eye for an eye and everyone shall be blind
  6. FrankKalis Moderator

    quote:
    select distinct(codename) into #test1 from test
    Nah, that would be way tooo easy here. How will you get the wanted value of codevalue with your query?

    --
    Frank Kalis
    SQL Server MVP
    http://www.insidesql.de
  7. mmarovic Active Member

    Trick question - What do you think would be result of the statement bellow?<pre id="code"><font face="courier" size="2" id="code">Insert into NewTable(CodeName, CodeValue)<br />Select CodeName, CodeValue<br />from Codes c<br />where not exists(select *<br /> from NewTable new<br /> where new.CodeName = c.CodeName)</font id="code"></pre id="code"><br /><br /><img src='/community/emoticons/emotion-5.gif' alt=';)' />
  8. FrankKalis Moderator

  9. FrankKalis Moderator

    Did you delete a post?

    On the "Active topic" page I see

    Last post
    04/25/2005 08:47:32
    by: mmarovic

    But it isn't here?

    --
    Frank Kalis
    SQL Server MVP
    http://www.insidesql.de
  10. mmarovic Active Member

    No, I updated the previous post. I just wanted to wait for more opinions. The facts I think make the question a litle bit tricky are:

    1. There are duplicates in source table.
    2. It is considered that MSSQL Server doesn't have pre-image concept (compared to ORACLE for example).

    I don't want to explain more at the moment, just to see reactions...
  11. FrankKalis Moderator

    I like your idea. But I think you would rely on the underlying storage engine.
    You have tried it before posting, I guess?

    --
    Frank Kalis
    SQL Server MVP
    http://www.insidesql.de
  12. mmarovic Active Member

    No. I'm pretty sure what is the result, but I can test it later before I post the answer. <img src='/community/emoticons/emotion-1.gif' alt=':)' /><br />
  13. ranjitjain New Member

    Hi Experts,
    I've one quetion that how to determine which record is inserted first
    out of these three records.
    'Business', 105
    'Business', 101
    'Business', 104

    as the table has no primary or any identity column and no index.
  14. Adriaan New Member

    A primary key wouldn't help, as it only enforces that there are no duplicate entries. An index only helps speeding up searches. You can use either of the two to present data in an ordered fashion, but you would use them only indirectly: you need an ORDER BY clause in your select query to do the ordering.

    An identity column would record the order in which data was entered, but it must already be in place before any data is entered into the table. If you add the column afterwards, it is not impossible that the rows are still in "chronological" order but there are no guarantees.

    You cannot build a house without bricks: no bricks, no building.
  15. mmarovic Active Member

    ranjitjain: As I mentioned in my first reply, there is no way to know for sure. You have to have identity column in the table to identify insertation order. Even in that case someone could update key column in the meantime (if key column is not identity itself). Another posibility is to have DateCreated with getDate() as default. The problem is that column can be updated. Identity column is safer because it can't be updated.

    Frank: Your answer is correct, as you already know. Obviously I was not able to trick anyone with my additional comments. Data are selected before the insert so command will copy all rows to the new table (if there is no pk/uk constraint or unique index on destination table).
  16. ranjitjain New Member

    <blockquote id="quote"><font size="1" face="Verdana, Arial, Helvetica" id="quote">quote:<hr height="1" noshade id="quote"><i>Originally posted by surendrakalekar</i><br /><br />I have one tables called Codes This table has no primary or unique key.<br />By using this table I want to create another table (Select * into) which should contain only distict rows and the first occurance of that CODENAME.. so my output will be like this<br />'Business', 105<br />'Market', 200<br />'Employee' 340<br /><hr height="1" noshade id="quote"></font id="quote"></blockquote id="quote"><br />Hi surendra i need ur comments on this that how will u determine the first inserted codename with respective value.[<img src='/community/emoticons/emotion-11.gif' alt='8)' />]<br />Is it possible with the table schema which u have currently.<br />i feel u'll not get proper order of records.[?]
  17. FrankKalis Moderator

    Mirko,
    the idea was good, but, at least this is how I explain it to me, it is an atomic transaction. So
    at the moment where EXISTS checks the table, no row is there present and so all rows are copied to the new table. Even if there were already rows, I think it would not be guaranteed that you will always get the "first" row in a group. SQL Server might opt for some other faster ways to return the rows. And that might not be equal to the order in which the rows were inserted. And think about what this would yield when there is a clustered index present.

    Ranjit,
    sorry to say this, but you're missing the fact that there is no such thing as a "first row" in a relational database. Tables are essentially unordered sets. So the term "first" might become only meaningful when you have classification criterion. Usually you use a DATETIME column to determine the first row inserted. In that case it is the one having MIN(DATETIME column). Or maybe the MIN() of a PRIMARY KEY, if you have a rather sequential PK like when using an IDENTITY column. Since in your case the MIN() does not apply, you'll have a very hard time to determine the "first row".



    --
    Frank Kalis
    SQL Server MVP
    http://www.insidesql.de
  18. mmarovic Active Member

    Frank, you are right and I new that. The solution that is semi-correct I offered in the first reply. I said I tried to trick people into the wrong answer. Oracle people I'm working with right now, inspired me to post it. They are right about MSSQL Server not having efficient implementation of pre-image concept... Anyway it's not really related to the problem we try to solve.
  19. FrankKalis Moderator

    One might question that Oracle guys have an idea of relational databases. Oracle relies *very* heavily on the physical implementation. This is surely fast and at the same moment not what Dr. Codd had in mind. [<img src='/community/emoticons/emotion-2.gif' alt=':D' />]<br /><br />--<br />Frank Kalis<br />SQL Server MVP<br /<a target="_blank" href=http://www.insidesql.de>http://www.insidesql.de</a><br />
  20. surendrakalekar New Member

    Thanks everybody for your response.
    I checked with all the options posted by all of you and finally decided to accept the solution provided by FrankKalis. For this I need to insert one more IDENTITY column in my table, to avoid identification problem of the first inserted row. Ranjit raises the same problem.
    Ranjit, there is no way in SQL Server to identify first inserted row. I think in Oracle there is RowID or something. For this we need to add either IDENTITY / DATATIME column in SQL Server. Thanks Ranjit your question force me to add either one column in my table.
    I updated my t-sql with Frank#%92s solution and given it to QA for testing. Hopefully this will be perfect solution for this scenario.
  21. Adriaan New Member

    Frank, ever been in the situation where you had to install Oracle and restore an Oracle database from a client, and all you have is an Oracle manual? That's when you start to appreciate the clearheadedness behind the SQL Server product.
  22. FrankKalis Moderator

    I've installed Oracle once so far. It were really boring two hours. But we bought two pre-configured server directly from Oracle Germany... and returned them some days later because of massive misconfiguration and performance problems. [<img src='/community/emoticons/emotion-6.gif' alt=':(' />]<br />Glad I don't deal that much with the product<br /><br />--<br />Frank Kalis<br />SQL Server MVP<br /<a target="_blank" href=http://www.insidesql.de>http://www.insidesql.de</a><br />
  23. ranjitjain New Member

    Hi surendra,
    Even i know there is no such first record in sql with the object like the one u have.
    so i raised this question for all that whatever we r trying to get will not be the perfect solution on i million rows.
    So i finally asked query on ur query.
    Anyways i think for now its the best idea to put an identity column and make a index on that identity column which will be clustered i feel by default.
  24. Max New Member

    FYI !!

    select distinct CODENAME,(select top 1 CODEVALUE from ttt where ttt.CODENAME=x.CODENAME) from ttt x
  25. surendrakalekar New Member

    Thanks Max
    Good T-SQL statement. Showing me the same result.
  26. Madhivanan Moderator

    quote:Originally posted by Max

    FYI !!

    select distinct CODENAME,(select top 1 CODEVALUE from ttt where ttt.CODENAME=x.CODENAME) from ttt x


    Perfect !!!

    I tried this but could not achecive the same

    Declare @t table(codename CHAR(12), codevalue INT)
    INSERT INTO @t select 'Business', 105
    UNION ALL SELECT 'Business', 101 UNION ALL
    SELECT 'Business', 104 UNION ALL SELECT 'Market', 200 UNION ALL
    SELECT 'Market', 201 UNION ALL SELECT 'Employee', 340 UNION ALL
    SELECT 'Employee', 359 UNION ALL SELECT 'Employee', 330 UNION ALL
    SELECT 'Employee', 341
    --select * from @t

    Select codename, codevalue
    From @t as t1
    Where (Select count(distinct codevalue) from @t Where codename = t1.codename and codevalue>=t1.codevalue)<=1
    Order By codename



    Madhivanan

    Failing to plan is Planning to fail
  27. Adriaan New Member

    I hope you are aware that this is just an needlessly complex variation of:

    SELECT CodeName, MIN(codevalue)
    FROM ttt
    GROUP BY CodeName
  28. Madhivanan Moderator

    quote:Originally posted by Adriaan

    I hope you are aware that this is just an needlessly complex variation of:

    SELECT CodeName, MIN(codevalue)
    FROM ttt
    GROUP BY CodeName


    Adriaan, I think you meant

    SELECT CodeName, Max(codevalue)
    FROM ttt
    GROUP BY CodeName



    Madhivanan

    Failing to plan is Planning to fail
  29. mmarovic Active Member

    It would be nice if somebody tests solution mentioned plus solution from my first reply. I believe my solution is still the fastest on million rows. Correlated queries are in magnitude of performance with cursors, group by is usually better but still not one to bet on if you can avoid it. Not sure about result of comparison, but I had in mind all them and choose one I posted as probably the fastest. Another criteria why I choosed that one is that it gives you best chance that it will return the first row inserted for each codeName.
  30. Adriaan New Member

    <blockquote id="quote"><font size="1" face="Verdana, Arial, Helvetica" id="quote">quote:<hr height="1" noshade id="quote"><i>Originally posted by mmarovic</i><br />best chance that it will return <i><b>the first row inserted </b></i><hr height="1" noshade id="quote"></font id="quote"></blockquote id="quote"><br />That's not a chance that I would count on![<img src='/community/emoticons/emotion-5.gif' alt=';)' />] If you must know the order of entry afterwards, you absolutely have to include a reliable sorting value on each row as it is entered - no if's or but's about it ...<br /><br />Good point about the MAX value - but it does show that this last variation on the query is needlessly complex.<br /><br />And I would love to believe that a GROUP BY with millions of rows would be slower than this last variation, except that if that is the case then the table is not properly indexed for this particular query.

Share This Page