SQL Server Performance Forum – Threads Archive

Specific Design issues — need advices !!

Hi, I need specialists advice concerning sql design.
Here’s the situation :

1 sql table name "users" table but done / design the bad way…
contains 500.000 records
first 30 columns are user related stuff
last 180 columns are tinyint that indicates whether a specific user is a subscriber to the specific Newsletter (in fact the 180 columns are all our newsletters)

===> Since I’m in charge of redesigning the whole Informatin System this structure is no more possible !! Solution :
3 tables

1 "user" table (iduser, …, … etc)
1 "subscription" table (iduser,idNewsletter, optin, unsubscrbed, …)
1 "newsletter" table (idNewsletter, newsletterName, newsletterFrequency, …, …)

==> troubles :
this easy and quite fast query that allows me to select user that are subscribed to NL1 and NL2 or NL3 and NL4 . select user from users where (NL1 = 1 and NL2 = 1) or (NL3 = 1 and NL4 = 1)
–> is not really possible as the future "subscription" table could have millions of records and to get those users I’d need to inner join several times …. Moreover, I’d like a schema(method) that allows a fast "subscribers" selection based on multiple criterias(as the above query + optin, or unsubscribed … etc) What do you think about that ? Thanks in advance. Best regards,

The biggest factor here is the indexes on your Subscription table. First suggestion would be the Primary Key clustered on idNewsLetter+idUser (smaller range of values comes first) and then a non-clustered index on idUser. If that doesn’t perform well, you may try the primary key on idUser+idNewsLetter, and a non-clustered on idNewsLetter. Don’t worry too much about having large numbers of records, as long as the indexes are appropriately designed you should be okay. (And inner joins aren’t so bad, really.) Make sure the other tables have proper indexing as well! Plus you can run some tests to see if using aggregate queries perhaps helps the speed … SELECT S.idUser FROM Subscription S
WHERE S.idNewsLetter = ‘NL1’ OR S.idNewsLetter = ‘NL2’
GROUP BY S.idUser
HAVING COUNT(*) = 2
UNION
SELECT S.idUser FROM Subscription S
WHERE S.idNewsLetter = ‘NL3’ OR S.idNewsLetter = ‘NL4’
GROUP BY S.idUser HAVING COUNT(*) = 2

I agree with Adriaan, except for two points: 1. Column order in clustered index should be decided based on:
– Page splitting and fragmentation consideration: that probably favors userID being the first column
– Query pattern: More frequent queries with newsletterID as parameter favors newsletterID as the first column otherwise better have userID as first one. In each case create non-clustered index on the second column of the clustered index.

2. I would use next query as union branch: select s1.userID
from subscription s1
join subscription s2 on s1.userID = s2.userID
where s1.newsletterID = @newsletterID1 and
s2.newsletterID = @newsletterID2
That should be more efficient with index design mentioned. Also such query would be more efficient then current one on no-normalized table, because now you can’t take advantage of indexes unless you plan to create 100+ indexes on that table. Of course that would kill insert/update/delete performance on that table.

Sorry I was not able to post the complete answer as a single reply for some reason. I tried both explorer and firefox, both quick reply and reply to topic and I even tried to post it as a new topic without luck.

Thanks you for this answer. Moreover, I’ve one more question :
I’ve heard that sql server 2005 allows to create UDA (User Defined Agregates) function (via CLR and .NET integration).
I’ve thought of using such a clever cutom agregate function in order to be able to use group by and have directly subscribers to some specifics Newsletters. A kind of : select mycleverfunction(idNewsletter, ‘nl1|NL2’), iduser from subscription group by iduser – mycleverfunction(idNewsletter, ‘a mask or pattern to match subscriptions’) could return only the user subscribe to those 2 Newsletters. Then , it could avoid having to heavy UNION or INNER JOINS. What do you think of that ? Thanks for your advices. Best regards,

Such query would look less complex, but it would be less efficient as well. The problem is that you would not be able to use indexes to restrict number of rows accessed.

No UDA can help you when your underlying schema is suboptimal. —
Frank Kalis
Microsoft SQL Server MVP
http://www.insidesql.de
Heute schon gebloggt?http://www.insidesql.de/blogs

I have to bow out to Mirko on the PK design – idUser should indeed come before idNewsLetter. Any particular reason why you need an ‘unsubscribed’ column? You could also just delete the entry on your Subscription, perhaps with a history table to keep track of things.

Well, Thanks all for your advices, I’m going to do some benchmark tests to see which solution is the best for us. Concerning the ‘unsubscribed’ field …. well, I’m not sure about Marketing Director’s face when he’d learn that I deleted all those unsubscribers <img src=’/community/emoticons/emotion-1.gif’ alt=’

‘ /> … Theorically I agree, they could be deleted but practically I have to keep those records. Best regards. P.S.: I’ll try to post back my results concerning each solution. Thanks again DB gurus <img src=’/community/emoticons/emotion-5.gif’ alt=’;-)’ />

The idea (a good one) is to move subscribers to archive table, so the volume of former subscribers doesn’t affect queries supporting current subscribers processing (more frequent and important queries I guess).

Feedback is always appreciated. [<img src=’/community/emoticons/emotion-1.gif’ alt=’

‘ />]

Yes, indeed, feedback is always appreciated. It will surely help others facing the same or a similar problem. —
Frank Kalis
Microsoft SQL Server MVP
http://www.insidesql.de
Heute schon gebloggt?http://www.insidesql.de/blogs

Ok guys, Here are first results : with old structure : Q1 : SELECT * FROM Users WHERE idUser = 10 Q2 : SELECT idUser FROM Users WHERE NL1 is not null and NL1 <> 0 Q3 : (Get users subscribed for 34 and 38, but not subscribed for 42,44) ) SELECT idUser FROM Users WHERE isnull(NL34,0) <> 0 or isnull(NL38,0) <> 0 and not(isnull(NL42,0) <> 0 or isnull(NL44,0) <> 0) Q4 : Get users subscribed for 34,38, and 40) SELECT idUser FROM Users WHERE isnull(NL34,0) <> 0 and isnull(NL38,0) <> 0 and isnull(NL40,0) <> 0 results : times in sec : Q1:0 ; Q2:10 ; Q3:11; Q4:10 ———————————————————————————- with new structure and a "subscription" join table (approximatively 1.600.000 records): Q1: SELECT idNL FROM Subscriptions WHERE idUser = 10 Q2: SELECT idUser FROM Subscriptions WHERE idNL = 50 Q3: SELECT distinct(s1.idUser) AS userID FROM Subscriptions s1 where s1.idNL in (34,3<img src=’/community/emoticons/emotion-11.gif’ alt=’8)’ /> and s1.idUser not in (select s2.idUser from Subscriptions s2 where s2.idNL in (42,44)) order by s1.idUser Q4: SELECT s1.idUser, count(s1.idNL) FROM Subscriptions s1 where s1.idNL in (34,38,40) group by s1.idUser having count(s1.idNL) = 3 order by s1.idUser with No indexes : Q1:0 ; Q2:1 ; Q3:2; Q4:2 with idUser non-clustered index : Q1:0 ; Q2:2 ; Q3:2; Q4:2 with idNL non-clustered index : Q1:0 ; Q2:0 ; Q3:1; Q4:1 In all cases idUser, idNL were primary key so it was a clustered intex. What do you think? Any comment on building good schema …. ? <img src=’/community/emoticons/emotion-1.gif’ alt=’

‘ /> Thanks folks, hope it will help. Bye

Well, the increased performance is pretty obvious – what more do you want?[<img src=’/community/emoticons/emotion-1.gif’ alt=’

‘ />]

Expected results. Now you can see that normalization improves performance [<img src=’/community/emoticons/emotion-1.gif’ alt=’

‘ />]. That was half-joke. Selective denormalization might help performance for some queries, but it is often used as an excuse for bad design. <blockquote id="quote">quote:<hr height="1" noshade id="quote">Q4: SELECT s1.idUser, count(s1.idNL) FROM Subscriptions s1 where s1.idNL in (34,38,40) group by s1.idUser having count(s1.idNL) = 3 order by s1.idUser<hr height="1" noshade id="quote"></blockquote id="quote">Compare the performance with the next query:<pre id="code">select s34.IdUser from Subscriptions s34 join Subscriptions s38 on s38.IdUser = s34.IdUser join Subscriptions s40 on s40.IdUser = s34.IdUser where s34.idNL = 34 and s38.idNL = 38 and s40.idNL = 40 order by s34.idUser</pre id="code">Btw, you don’t need count(s1.idNL) in your query, it’s enough to put having criteria.

Adriaan, what do you think about Q3? Can it be improved? I think potential difference in execution plan and performance between not in, not exists and left join solution is good topic for an article.

Mirko, As you know NOT IN clauses are supposed to be bad performers, but apparently that’s not the case here. If these timings are against 1,600,000 rows of real data, then it is a nice surprise that the response is this good. You’ve got to be kidding about the article – Frank is already preparing one!

You are right. Frank: table structure described here is maybe real life like structure you can use for the article. I haven’t checked for a long time how "not in (select …)" queries are resolved. I know I had a few surprises with in (…) recently. I see many diferent possibilities using either in, exists or join depending on data distribution. I just don’t know if the latest version of query optimizer is able to take into account all of them. Also when discussing query optimizer behavior you probably want to test it against the most used sql server versions/service packs right now(2000 sp3, 2000 sp4 and 2005).

Are you able to let us know what the execution plan for Q3 is? I presume it is an left anti semi join? This is done by the merge join algorithm, which is very fast for exactly the purpose you describe here, two index scans/seeks merged together Cheers
Twan

Hi guys, I can see that this topic interests some sql’ers <img src=’/community/emoticons/emotion-5.gif’ alt=’;-)’ /> I’m going to perform some other tests and also less than a second precision tests. I get back soon. Thanks.

Pierre (thinapps) – did you clean the buffers and cache on your test server, before each query? This is necessary to get a fair comparison between queries. Otherwise the first query has to retrieve the complete data, and make up its mind about statistics, and each query after that can take advantage of that. DBCC DROPCLEANBUFFERS
DBCC FREEPROCCACHE <query 1> DBCC DROPCLEANBUFFERS
DBCC FREEPROCCACHE <query 2> DBCC DROPCLEANBUFFERS
DBCC FREEPROCCACHE <query 3> DBCC DROPCLEANBUFFERS
DBCC FREEPROCCACHE <query 4>

I am no the tester but I’ve told them to use your DBCC DROPCLEANBUFFERS
DBCC FREEPROCCACHE in order to have real tests. get back soon…

<blockquote id="quote">quote:<hr height="1" noshade id="quote">Originally posted by mmarovic You are right. Frank: table structure described here is maybe real life like structure you can use for the article. <hr height="1" noshade id="quote"></blockquote id="quote"> Sounds like a good idea. [<img src=’/community/emoticons/emotion-1.gif’ alt=’

‘ />] thinapps, I’m preparing an article where I want to discuss performance differences between EXISTS, NOT IN and LEFT JOINs. Along with a comparison about EXISTS and COUNT. Can you post some information according to this:<a target="_blank" href=http://www.aspfaq.com/etiquette.asp?id=5006>http://www.aspfaq.com/etiquette.asp?id=5006</a> – Frank Kalis Microsoft SQL Server MVP http://www.insidesql.de</a> Heute schon gebloggt?<a target="_blank" href=http://www.insidesql.de/blogs>http://www.insidesql.de/blogs</a>

Hi guys, Ok to provide you data for your article. (I’ve to find some time to do it…) We’ve run another query which is important for us and which seems to take much longer with the new structure than with the old one. with the old structure we used to do :
SELECT email, iduser, codeEmail FROM users WHERE
(isnull(NL1,0) <> 0 OR isnull(NL2,0) <> 0 OR isnull(NL3,0) <> 0)
and (isnull(NL4,0)= 0 OR isnull(NL5,0) = 0)
and mktmail=1
and etatsaisie=100
and blacklist=0 which took after having cleaned up the buffer and cache :
10 sec, 10 sec, 10, sec, 10 sec, 11 sec, 10 sec on a 5 time basis benchmark with the new structure the wuery looks like : select distinct(subs.iduser), us.email, us.codeEmail from Subscriptions as subs left loop join Users as us
on subs.iduser = us.iduser where subs.idNL in (NL1, NL2, NL3) and subs.iduser not in (
select iduser from Subscriptions where idNL in (NL4, NL5))
and us.mktmail = 1 and
us.etatsaisie = 100 and
us.blacklist = 0 which took :
23 sec, 21 sec, 17, sec, 13 sec, 19 sec, 12 sec Any idea ? Best regards. P.S.: I’ll try to prepare data for you FrankKalis. Pierre

Pierre, Not quite clear what you are doing here … Are you repeating the same query 5 times? We’re interested in the response times for the different syntax options that you have (NOT IN, LEFT JOIN, NOT EXISTS …) both against your old table structure and the new one we proposed. You are adding somewhat to the confusion by using the LOOP hint for the join – you may have done that earlier, but anyway this type of thing is usually best left to SQL Server to decide. Also, you should use an INNER JOIN instead of a LEFT JOIN on a table if you also set criteria for the outer table. We normally expect the NOT IN syntax to perform poorly, compared to NOT EXISTS syntax – so please let us know the timings for this script: DBCC DROPCLEANBUFFERS
DBCC FREEPROCCACHE select distinct(subs.iduser), us.email, us.codeEmail
from Subscriptions as subs
inner join Users as us on subs.iduser = us.iduser
where subs.idNL in (‘NL1’, ‘NL2’, ‘NL3’) and subs.iduser not in (
select iduser from Subscriptions where idNL in (‘NL4’, ‘NL5’))
and us.mktmail = 1 and
us.etatsaisie = 100 and
us.blacklist = 0 DBCC DROPCLEANBUFFERS
DBCC FREEPROCCACHE select distinct(subs.iduser), us.email, us.codeEmail
from Subscriptions as subs
inner join Users as us on subs.iduser = us.iduser
where subs.idNL in (‘NL1’, ‘NL2’, ‘NL3’) and not exists (
select iduser from Subscriptions where iduser = subs.iduser AND idNL in (‘NL4’, ‘NL5’))
and us.mktmail = 1 and
us.etatsaisie = 100 and
us.blacklist = 0

There is a number of alternative solutions but I also want to know how would the query perform with inner join, before trying to find the faster solution.

<blockquote id="quote">quote:<hr height="1" noshade id="quote">Originally posted by thinapps P.S.: I’ll try to prepare data for you FrankKalis. <hr height="1" noshade id="quote"></blockquote id="quote"> Thanks! And be assured, I’ll make them anonymous to keep their privacy. [<img src=’/community/emoticons/emotion-1.gif’ alt=’

‘ />] – Frank Kalis Microsoft SQL Server MVP http://www.insidesql.de</a> Heute schon gebloggt?<a target="_blank" href=http://www.insidesql.de/blogs>http://www.insidesql.de/blogs</a>

Hi guys, here are some results 1) DBCC DROPCLEANBUFFERS DBCC FREEPROCCACHE select distinct(subs.iduser), us.email, us.Codeemail from Subscriptions as subs inner join Users as us on subs.iduser = us.iduser where subs.idNL in (NL1, NL2, NL3) and subs.iduser not in ( select iduser from Subscriptions where idNL in (NL4, NL5)) and us.mktmail = 1 and us.etatsaisie = 100 and us.blacklist = 0 Result is 103415 rows in 21 seconds. 2) DBCC DROPCLEANBUFFERS DBCC FREEPROCCACHE select distinct(subs.iduser), us.email, us.Codeemail from Subscriptions as subs inner join Users as us on subs.iduser = us.iduser where subs.idNL in (NL1, NL2, NL3) and not exists ( select iduser from Subscriptions where iduser = subs.iduser AND idNL in (NL4, NL5)) and us.mktmail = 1 and us.etatsaisie = 100 and us.blacklist=0 Result is 103415 rows in 24 seconds. …. Write later for some new adcentures in redesign process… <img src=’/community/emoticons/emotion-5.gif’ alt=’;-)’ /> Pierre

Can you please post the execution plan?

Hi All, Sorry to inform you that I need to get back to some hot projects for my company. I’ll try to post the execution plan as soon as I have enough time to do it properly. Best regards.

]]>