SQL Server Performance Forum – Threads Archive

Table design

Hello! I am about to design a one to one message system.
What I mean with one to one is that it should not be possible for one user to send a message to multiple recievers. Let me first state what I need id – may be optional with another approach
threadId – will be used to group a specific thread
fromUserId
toUserId
fromFolder
toFolder
createdDate
toReadDate – when the reciever of the message read the message
toReplyDate – when the reciever of the message replied to the message
topic
message – text If we first start to play with the possibility of have all these just as they are (in one table). We have some possible performance issues here since we would not know where to place our clustered index. For a query like "select all messages to a specific user in a specific folder", it could look like this: select
id, threadId, fromUserId, toUserId, fromFolder, toFolder, createdDate, toReadDate, toReplyDate, topic
from
messages
where
toUserId = 10
and toFolder = 1 If we create a clustered index on toUserId, toFolder then everyone would be happy since the query would return the data really quick. But what if we need to have a list of messages that a specific user sent. We would need an index on fromUserId, fromFolder, and preferly that would be an clustered. But since we already have specified a clustered index we need to make a non clustered index and then we will notice that a bookmark lookup will occour that will drain the performance. When we have messagelist we also probably want to read a specific message and the clustered index should be on the id-column so that we can fetch the message-column without a bookmark lookup. Another approach could be to create two tables. Where we place the message-column in its own table so that we can create a clustered index on that. Another approach could be to separate the message on two rows in the message table so that the sent message has its own copy and the recieved message has its own copy. But that would pretty much duplicate the amount of rows in the table. But it would be easier for us to create a clustered index that we could use for any folder. This approach would also create some redundance for example of the topic-column. The solution could look like this. id, fromUserId, toUserId, messageDataId, folder, readDate, replyDate, topic, createdDate
where the clustered index span over fromUserId, toUserId, messageDataId, folder There are a millions of possible solutions, but from my criterias, I want you to specify your own solution that you think would be the best. An important thing is also that since this is a message system there is more inserts than the avarage table. I could guess that we have like 5% inserts and 95% selects. That means that the design cannot only be built for selects, we need to have the performance of inserts in our mind to. Thanks!

Another thing that could be imporant is that a criteria is that I often need to count the number of unread messages to a specific user.

Not sure why you would find it necessary to limit messages to only a single recipient … With regular e-mail, there is always a single sender of a message, but there can be multiple recipients. So it would seem most logical to have a parent table for the message which includes the sender, and a child table for the recipients of each message which should also have a column for the recipient having read the message. You could also add if the recipient is on TO, CC or BCC. Finding out all unread messages would be fairly easy: SELECT M.* FROM Message M
WHERE NOT EXISTS
(SELECT * FROM MessageRecipients MR
WHERE MR.MessageId = M.MessageId AND MR.Read = 1) Regarding clustered indexes, remember that they are most useful when most queries against the table are for ranges of values on the column(s) covered by that index. If most of your queries are not for ranges, then it is probably wiser to add the clustered index on the identity column, which will reduce the resources taken up by the non-clustered indexes on the same table, thereby boosting overall performance.

Yes, that is also a solution and was the first that came to me as well. The reason that I decided for a one to one message system is that it will probably be easier to get performance? I may be wrong! <img src=’/community/emoticons/emotion-1.gif’ alt=’

‘ /> During peak hours a single query like "fetch all messages for a specific user in a specific folder" can be exacuted 10000 times. That is ~150 times per minute. So we really need it to be quick. And we shall not forget the other queries "fetch the data in a specific message". With your suggestion, where should we place the threadId? And if sent to several recievers, should all recievers have their own threadId, or should all recievers have the same threadId? We will use threadId to be able to show the "last message" while you reply to a message.

Isn’t the thread the "recipient" in this case?

Not really since we want to be able to group by threadId. It should work pretty much like a newsgroup, which you can group by thread (topic).
But since topic is varchar and can also be changed when you reply, we need another identifier. Two users that are talking can have two different threads ongoing at the same time.

Yes, but can a single message belong to more than one thread? In all newsgroups that I have seen, your message gets posted to a single thread (plus you run the risk of getting flamed when you post the same message to different threads[<img src=’/community/emoticons/emotion-5.gif’ alt=’

‘ />]). A thread is a series of messages from at least one sender, with a common starting point, so I’m sticking to my assertion that the thread is indeed the recipient.

A single message cannot belong to more than one thread, but multiple messages can belong to one thread. For example topic1
reply1 topic1
topic2
reply1 topic2
reply2 topic2 Here we have two threadstarters (topic1, topic2) and we have three replies. Topic2 have two replies. Maybe threadId is the wrong word?

Just a thought. For both, MAPI structures and NNTP structures are standards defined. What about having a look at those standards how they define threads, conversation, etc… NNTP you can find here:http://www.freesoft.org/CIE/RFC/977/index.htm
And for MAPI it might be worth investigating MSDN. There are several documents available. —
Frank Kalis
Microsoft SQL Server MVP
http://www.insidesql.de
Ich unterstÃ¼tze PASS Deutschland e.V. http://www.sqlpass.de)

You have a thread that can fork out, so you get a new branch. Each message belongs to the thread, but if it is a reply to a specific message then you have a new branch, identified by a reference to the parent message. Thread (ThreadId, Originator, Subject, DateAndTime)
Message (ThreadId, MessageId, Sender, ParentMessageId, Topic, DateAndTime) A thread without a reply is a thread with just a single message.
A message without a reply has a MessageId that does not exist as a ParentMessageId.

I am not really working on a newsgroup implementation or such thing. I was just trying to give an example of how it should work. I am looking for a fairly good structure that perform really good performance.
Another imporant thing is that we will delete messages as they become old, and therefore the thread starter may be deleted. So that is why we have one single column for the thread. But I do agree that a real message hiearchy should belong with a parentMessageId, but in our case we really not have any use for that. I did some tests and it shows that so far the single table solution is the fastest one. Although I know that a solution like that is not expandable nor normalized, and that is why I hesitate. Performance is the real main issue here and normalization is not.

With proper indexes in place, there is no reason why normalized tables should perform worse. For ‘deleted’ messages, why not have a bit column to show/hide them instead of actual deletion?

Well. The ideal would be to keep everything for history and such. But then the size of the DB would grow pretty fast. We try to keep the DB small, so that it can fit into the RAM, since RAM is faster than HDD.
Currently we have 8GB and the database is 5GB. Have you made any tests that confirm that a normalized design perform equal or better than a non normalized? What I have seen is that there are ways to step aside from normalization to gain performance. Even though I have choosed the indexes very carefully, for best performance. For this example that is because with one table we do not need to perform joins, but with a more normalized design we need to perform joins, and even though there are indexes on booth pk/fk. Of course the normalized version performs really good, and I can see that there are index seek/clustered index seeks for everything. But it cannot compare to a query/table without joins where we have proper indexes as well.

Index seeks are the best performing index operations you can have – I wouldn’t worry too much, as long as there are no index scans or table scans (and even then only if they are occurring on tables with large numbers of rows). If you’re trying to keep the DB loaded in RAM in the long run, I think you’d better start looking at archiving solutions – how long do you expect it will take for the db to grow from 5GB to 8GB? And beyond? Also in that case you are clearly running SQL Server Enterprise Edition on Windows Datacenter – that’s one heck of an investment for a message board.

This is not just a message board, its part of a whole application, with many other parts. Actually we already have one "version" of the message system up and running. I am looking into ways (planning) of improve the system. Can I ask why I shouldnt worry to much if I dont have any index/table scans? When I can clearly see that other solutions will perform better, but with the loss of less normalization. <img src=’/community/emoticons/emotion-1.gif’ alt=’

‘ />

I don’t think we disagree about seek/scan performance, just that I may have misread your previous post a little[<img src=’/community/emoticons/emotion-1.gif’ alt=’

‘ />]. About the normalized/denormalized issue – denormalized may prove better performing if your client app is retrieving loads and loads of data without much filtering. This may improve the response time for an isolated user, but obviously the time that SQL Server spends on sending out unfiltered data cannot be spent on other processes. In a system with proper indexing, proper querying, and proper filtering requested by the client app, the index on a denormalized table is quite likely to be non-clustered, which is quite probably less efficient for filtering than a clustered index on a normalized child table.

]]>