Distinct rows with first occurance & create table | SQL Server Performance Forums

SQL Server Performance Forum – Threads Archive

Distinct rows with first occurance & create table

I have one tables called Codes which contain following fields.
CODENAME(char 12), (CODEVALUE int)
This table has no primary or unique key. This table contains near about 1 million rows. like this.
‘Business’, 105
‘Business’, 101
‘Business’, 104
‘Market’, 200
‘Market’, 201
‘Employee’ 340
‘Employee’ 359
‘Employee’ 330
‘Employee’ 341
By using this table I want to create another table (Select * into) which should contain only distict rows and the first occurance of that CODENAME.. so my output will be like this
‘Business’, 105
‘Market’, 200
‘Employee’ 340 How should I write the t-sql statement? I am trying with Select Into. You can suggest me any other combination in t-sql statement

You don’t have indexes and you didn’t mention any sorting criteria, so first occurence may not be what you expect since if you don’t use order by any specific order is not guaranteed. If you were lucky and rows are stored in physical order they are inserted you have a chance to have a table that meets your requirements. You can use the code bellow (with no guarantee):
create table #t(
tID int identity(1,1),
CodeName char(12),
CodeValue int
)

insert into #t(CodeName, CodeValue)
select CodeName, CodeValue
from Codes

create index idx_CodeName_tID on #t(CodeName, tID)

delete from a
from #t a
join #t b on a.CodeName = b.CodeName and a.tID > b.tID

insert into NewTable(CodeName, CodeValue)
select CodeName, CodeValue
from #t
It can also be done without delete statement. Instead you can use select from temp table joined to derived table that groups by CodeName and has min(tID) aggregate. Edited: I had wrong join condition and index is better to create after insert.
i feel u should follow these steps as select into runs slower.
1>create table table_distinct
2> Insert Query insert into table_distinct
select top 1 * from test1 where itemid=’Business’
union all
select top 1 * from test1 where itemid=’market’
union all
select top 1 * from test1 where itemid=’employee’
Go
A TOP 1 without an ORDER BY is meaningless and not guaranteed to produce the desired result. The main flaw here is the missing PK. Since you want the first inserted row for each codename, which unfortunately is not the MIN(codevalue), you might have luck in adding an IDENTITY column to your table and see if this works:
CREATE TABLE #showme
(
codename CHAR(12)
, codevalue INT
)
INSERT INTO #showme
SELECT ‘Business’, 105
UNION ALL
SELECT ‘Business’, 101
UNION ALL
SELECT ‘Business’, 104
UNION ALL
SELECT ‘Market’, 200
UNION ALL
SELECT ‘Market’, 201
UNION ALL
SELECT ‘Employee’, 340
UNION ALL
SELECT ‘Employee’, 359
UNION ALL
SELECT ‘Employee’, 330
UNION ALL
SELECT ‘Employee’, 341
ALTER TABLE #showme
ADD ID INT IDENTITY
GO
SELECT * FROM #showme t1
WHERE ID=
(SELECT MIN(ID) from #showme t2
WHERE t1.codename = t2.codename)
DROP TABLE #showme
codename codevalue ID
———— ———– ———–
Business 105 1
Employee 340 6
Market 200 4 (3 row(s) affected) I’m not sure if this also works on a table with 1 million rows. But hopefully you’ve learned that *each* table should have a PRIMARY KEY.

Frank Kalis
SQL Server MVP
http://www.insidesql.de


select distinct(codename) into #test1 from test ——— if you like cursor for this——— declare @Vcodename char(12) DECLARE Test_Cursor CURSOR FOR
SELECT codename
FROM #test1 OPEN Test_Cursor FETCH NEXT FROM Test_Cursor into @Vcodename
WHILE @@FETCH_STATUS = 0
BEGIN
select top 1* from test where codename [email protected] FETCH NEXT FROM Test_Cursor into @Vcodename
END CLOSE Test_Cursor
DEALLOCATE Test_Cursor
Deepak Kumar –An eye for an eye and everyone shall be blind
quote:
select distinct(codename) into #test1 from test
Nah, that would be way tooo easy here. How will you get the wanted value of codevalue with your query? —
Frank Kalis
SQL Server MVP
http://www.insidesql.de

Trick question – What do you think would be result of the statement bellow?<pre id="code"><font face="courier" size="2" id="code">Insert into NewTable(CodeName, CodeValue)<br />Select CodeName, CodeValue<br />from Codes c<br />where not exists(select *<br /> from NewTable new<br /> where new.CodeName = c.CodeName)</font id="code"></pre id="code"><br /><br /><img src=’/community/emoticons/emotion-5.gif’ alt=’;)‘ />
Äh, yes? A copy of the original table? —
Frank Kalis
SQL Server MVP
http://www.insidesql.de

Did you delete a post? On the "Active topic" page I see Last post
04/25/2005 08:47:32
by: mmarovic But it isn’t here? —
Frank Kalis
SQL Server MVP
http://www.insidesql.de

No, I updated the previous post. I just wanted to wait for more opinions. The facts I think make the question a litle bit tricky are: 1. There are duplicates in source table.
2. It is considered that MSSQL Server doesn’t have pre-image concept (compared to ORACLE for example). I don’t want to explain more at the moment, just to see reactions…
I like your idea. But I think you would rely on the underlying storage engine.
You have tried it before posting, I guess? —
Frank Kalis
SQL Server MVP
http://www.insidesql.de

No. I’m pretty sure what is the result, but I can test it later before I post the answer. <img src=’/community/emoticons/emotion-1.gif’ alt=’:)‘ /><br />
Hi Experts,
I’ve one quetion that how to determine which record is inserted first
out of these three records.
‘Business’, 105
‘Business’, 101
‘Business’, 104 as the table has no primary or any identity column and no index.

A primary key wouldn’t help, as it only enforces that there are no duplicate entries. An index only helps speeding up searches. You can use either of the two to present data in an ordered fashion, but you would use them only indirectly: you need an ORDER BY clause in your select query to do the ordering. An identity column would record the order in which data was entered, but it must already be in place before any data is entered into the table. If you add the column afterwards, it is not impossible that the rows are still in "chronological" order but there are no guarantees. You cannot build a house without bricks: no bricks, no building.
ranjitjain: As I mentioned in my first reply, there is no way to know for sure. You have to have identity column in the table to identify insertation order. Even in that case someone could update key column in the meantime (if key column is not identity itself). Another posibility is to have DateCreated with getDate() as default. The problem is that column can be updated. Identity column is safer because it can’t be updated. Frank: Your answer is correct, as you already know. Obviously I was not able to trick anyone with my additional comments. Data are selected before the insert so command will copy all rows to the new table (if there is no pk/uk constraint or unique index on destination table).
<blockquote id="quote"><font size="1" face="Verdana, Arial, Helvetica" id="quote">quote:<hr height="1" noshade id="quote"><i>Originally posted by surendrakalekar</i><br /><br />I have one tables called Codes This table has no primary or unique key.<br />By using this table I want to create another table (Select * into) which should contain only distict rows and the first occurance of that CODENAME.. so my output will be like this<br />’Business’, 105<br />’Market’, 200<br />’Employee’ 340<br /><hr height="1" noshade id="quote"></font id="quote"></blockquote id="quote"><br />Hi surendra i need ur comments on this that how will u determine the first inserted codename with respective value.[<img src=’/community/emoticons/emotion-11.gif’ alt=’8)’ />]<br />Is it possible with the table schema which u have currently.<br />i feel u’ll not get proper order of records.[?]
Mirko,
the idea was good, but, at least this is how I explain it to me, it is an atomic transaction. So
at the moment where EXISTS checks the table, no row is there present and so all rows are copied to the new table. Even if there were already rows, I think it would not be guaranteed that you will always get the "first" row in a group. SQL Server might opt for some other faster ways to return the rows. And that might not be equal to the order in which the rows were inserted. And think about what this would yield when there is a clustered index present. Ranjit,
sorry to say this, but you’re missing the fact that there is no such thing as a "first row" in a relational database. Tables are essentially unordered sets. So the term "first" might become only meaningful when you have classification criterion. Usually you use a DATETIME column to determine the first row inserted. In that case it is the one having MIN(DATETIME column). Or maybe the MIN() of a PRIMARY KEY, if you have a rather sequential PK like when using an IDENTITY column. Since in your case the MIN() does not apply, you’ll have a very hard time to determine the "first row". —
Frank Kalis
SQL Server MVP
http://www.insidesql.de

Frank, you are right and I new that. The solution that is semi-correct I offered in the first reply. I said I tried to trick people into the wrong answer. Oracle people I’m working with right now, inspired me to post it. They are right about MSSQL Server not having efficient implementation of pre-image concept… Anyway it’s not really related to the problem we try to solve.
One might question that Oracle guys have an idea of relational databases. Oracle relies *very* heavily on the physical implementation. This is surely fast and at the same moment not what Dr. Codd had in mind. [<img src=’/community/emoticons/emotion-2.gif’ alt=’:D‘ />]<br /><br />–<br />Frank Kalis<br />SQL Server MVP<br /<a target="_blank" href=http://www.insidesql.de>http://www.insidesql.de</a><br />
Thanks everybody for your response.
I checked with all the options posted by all of you and finally decided to accept the solution provided by FrankKalis. For this I need to insert one more IDENTITY column in my table, to avoid identification problem of the first inserted row. Ranjit raises the same problem.
Ranjit, there is no way in SQL Server to identify first inserted row. I think in Oracle there is RowID or something. For this we need to add either IDENTITY / DATATIME column in SQL Server. Thanks Ranjit your question force me to add either one column in my table.
I updated my t-sql with Frank#%92s solution and given it to QA for testing. Hopefully this will be perfect solution for this scenario.

Frank, ever been in the situation where you had to install Oracle and restore an Oracle database from a client, and all you have is an Oracle manual? That’s when you start to appreciate the clearheadedness behind the SQL Server product.
I’ve installed Oracle once so far. It were really boring two hours. But we bought two pre-configured server directly from Oracle Germany… and returned them some days later because of massive misconfiguration and performance problems. [<img src=’/community/emoticons/emotion-6.gif’ alt=’:(‘ />]<br />Glad I don’t deal that much with the product<br /><br />–<br />Frank Kalis<br />SQL Server MVP<br /<a target="_blank" href=http://www.insidesql.de>http://www.insidesql.de</a><br />
Hi surendra,
Even i know there is no such first record in sql with the object like the one u have.
so i raised this question for all that whatever we r trying to get will not be the perfect solution on i million rows.
So i finally asked query on ur query.
Anyways i think for now its the best idea to put an identity column and make a index on that identity column which will be clustered i feel by default.
FYI !! select distinct CODENAME,(select top 1 CODEVALUE from ttt where ttt.CODENAME=x.CODENAME) from ttt x

Thanks Max
Good T-SQL statement. Showing me the same result.
quote:Originally posted by Max FYI !! select distinct CODENAME,(select top 1 CODEVALUE from ttt where ttt.CODENAME=x.CODENAME) from ttt x

Perfect !!! I tried this but could not achecive the same Declare @t table(codename CHAR(12), codevalue INT)
INSERT INTO @t select ‘Business’, 105
UNION ALL SELECT ‘Business’, 101 UNION ALL
SELECT ‘Business’, 104 UNION ALL SELECT ‘Market’, 200 UNION ALL
SELECT ‘Market’, 201 UNION ALL SELECT ‘Employee’, 340 UNION ALL
SELECT ‘Employee’, 359 UNION ALL SELECT ‘Employee’, 330 UNION ALL
SELECT ‘Employee’, 341
–select * from @t Select codename, codevalue
From @t as t1
Where (Select count(distinct codevalue) from @t Where codename = t1.codename and codevalue>=t1.codevalue)<=1
Order By codename Madhivanan Failing to plan is Planning to fail
I hope you are aware that this is just an needlessly complex variation of: SELECT CodeName, MIN(codevalue)
FROM ttt
GROUP BY CodeName

quote:Originally posted by Adriaan I hope you are aware that this is just an needlessly complex variation of: SELECT CodeName, MIN(codevalue)
FROM ttt
GROUP BY CodeName

Adriaan, I think you meant SELECT CodeName, Max(codevalue)
FROM ttt
GROUP BY CodeName Madhivanan Failing to plan is Planning to fail
It would be nice if somebody tests solution mentioned plus solution from my first reply. I believe my solution is still the fastest on million rows. Correlated queries are in magnitude of performance with cursors, group by is usually better but still not one to bet on if you can avoid it. Not sure about result of comparison, but I had in mind all them and choose one I posted as probably the fastest. Another criteria why I choosed that one is that it gives you best chance that it will return the first row inserted for each codeName.
<blockquote id="quote"><font size="1" face="Verdana, Arial, Helvetica" id="quote">quote:<hr height="1" noshade id="quote"><i>Originally posted by mmarovic</i><br />best chance that it will return <i><b>the first row inserted </b></i><hr height="1" noshade id="quote"></font id="quote"></blockquote id="quote"><br />That’s not a chance that I would count on![<img src=’/community/emoticons/emotion-5.gif’ alt=’;)‘ />] If you must know the order of entry afterwards, you absolutely have to include a reliable sorting value on each row as it is entered – no if’s or but’s about it …<br /><br />Good point about the MAX value – but it does show that this last variation on the query is needlessly complex.<br /><br />And I would love to believe that a GROUP BY with millions of rows would be slower than this last variation, except that if that is the case then the table is not properly indexed for this particular query.
]]>