Seek advice on data model architecture | SQL Server Performance Forums

SQL Server Performance Forum – Threads Archive

Seek advice on data model architecture

Our development team is entering a big project. Among other things, there is a requirement to be able to add entities and attributes easily, without any maintenance changes to the database. The idea is that the system administrator would just change the metadata via a system UI, and the new attribute must appear immediately in all the UIs. The system has data definition tables storing entities and their attributes (it is assumed that entities would have no more than 50 attributes). Assume we have these tables for metadata:
1. Data Type definitions (same as SQL Server types in this sample)

CREATE TABLE DataType (
DataTypeId [int] NOT NULL, -- PK
[Name] [nvarchar](255) NOT NULL
)

Sample row values for DataType:

DataTypeId  Name
1           int
2           varchar
3           datetime
5           money

2. Entities (this one certainly will have more columns)

CREATE TABLE Entity (
EntityId [int] NOT NULL, -- PK
[Name] [nvarchar](255) NOT NULL
)

Sample row values for Entity:

EntityId  Name
1         Bank
2         Customer
3. Attributes (virtually these are columns for entities)

CREATE TABLE Attribute (
AttributeId [int] NOT NULL, -- PK
EntityId [int] NOT NULL, -- FK to Entity
[Name] [nvarchar](255) NOT NULL,
DataTypeId [int] NOT NULL, -- FK to DataType
Length [int] NULL
)

Sample row values of Attribute:
AttributeId  EntityId  Name         DataTypeId  Length
1            1         Name         2           100
2            1         Address      2           100
3            1         Status       1           NULL
4            2         FirstName    2           20
5            2         LastName     2           20
6            2         DateOfBirth  3           NULL
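To make the metadata concrete, here is a sketch of how an entity's column list could be read back from these tables (assuming the sample rows above, where entity 1 is Bank):

```sql
SELECT a.Name AS ColumnName,
       dt.Name AS DataTypeName,
       a.Length
FROM Attribute a
INNER JOIN Entity e ON e.EntityId = a.EntityId
INNER JOIN DataType dt ON dt.DataTypeId = a.DataTypeId
WHERE e.Name = 'Bank'
ORDER BY a.AttributeId
```

For Bank this would return Name varchar(100), Address varchar(100), Status int, matching the flat table shown in option 1 below.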
So far so good. But further on we see two possible approaches:

1. Flat table model. Create standard flat tables based on the metadata. That means the system would create these tables automatically:

CREATE TABLE BankFlat (
BankId [int] NOT NULL, -- PK
Name [varchar](100) NULL,
Address [varchar](100) NULL,
Status [int] NULL
)

CREATE TABLE CustomerFlat (
CustomerId [int] NOT NULL, -- PK
FirstName [varchar](20) NULL,
LastName [varchar](20) NULL,
DateOfBirth [datetime] NULL
)
So when the system administrator adds another column to the Bank entity, the system would execute DDL against the corresponding entity table.

Pros:
-It's a natural way to store entities and attributes as tables and columns
Cons:
-DDL to support the table changes may end up being very complicated
-DDL on large tables will take too much time and block other users
-In a distributed environment with replication subscribers, table structure changes should be avoided

2. Detached attribute storage model. Create tables with a permanent structure as entity containers this way:

CREATE TABLE Bank (
BankId [int] NOT NULL — PK
) CREATE TABLE BankAttributeVarchar (
RowId [int] NOT NULL, — PK
AttributeId [INT] NOT NULL,
Value [VARCHAR(MAX)] NULL — index on this column would be created
) CREATE TABLE BankAttributeInt (
RowId [int] NOT NULL, — PK
AttributeId [INT] NOT NULL,
Value [INT] NULL — index on this column would be created
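Under this layout, storing a single Bank row means one INSERT per attribute value. A sketch with illustrative values (assuming a BankId column on the attribute tables, as the view joins imply):

```sql
INSERT INTO Bank (BankId) VALUES (1)
INSERT INTO BankAttributeVarchar (RowId, BankId, AttributeId, Value)
VALUES (1, 1, 1, 'First National')   -- Name
INSERT INTO BankAttributeVarchar (RowId, BankId, AttributeId, Value)
VALUES (2, 1, 2, '1 Main Street')    -- Address
INSERT INTO BankAttributeInt (RowId, BankId, AttributeId, Value)
VALUES (1, 1, 3, 1)                  -- Status
```

With 50 attributes this becomes 51 inserts per logical row, which is part of the write cost discussed below.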
So the idea is to create a simple master table for the entity and detail typed tables to store the entity attributes. The schema would stay steady (only new tables get created; existing tables never change). In the sample above, for Bank the attributes Name and Address are stored in BankAttributeVarchar and Status is stored in BankAttributeInt. It's clear that to get complete data rows we need views like:

CREATE VIEW BankView AS
SELECT
Bank.BankId,
Name = Attr1.Value,
Address = Attr2.Value,
Status = Attr3.Value
FROM Bank
INNER JOIN BankAttributeVarchar Attr1 ON (Attr1.BankId = Bank.BankId AND Attr1.AttributeId = 1)
INNER JOIN BankAttributeVarchar Attr2 ON (Attr2.BankId = Bank.BankId AND Attr2.AttributeId = 2)
INNER JOIN BankAttributeInt Attr3 ON (Attr3.BankId = Bank.BankId AND Attr3.AttributeId = 3)

Pros:
-The schema remains unchanged and replication subscribers won't need schema replication
-No heavy DDL scripts when adding new attributes to entities
Cons:
-It's very likely that this unusual kind of data model may end up causing performance issues on big volumes, both for inserting and selecting data
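For comparison, the DDL step that option 1 requires when an attribute is added amounts to a single generated statement. A minimal sketch of how the system might build it from the metadata (the @NewAttributeId variable and the type-name mapping are illustrative):

```sql
DECLARE @NewAttributeId int
SET @NewAttributeId = 7  -- illustrative: the attribute just added via the UI

DECLARE @sql nvarchar(max)
SELECT @sql = N'ALTER TABLE ' + QUOTENAME(e.Name + N'Flat')
            + N' ADD ' + QUOTENAME(a.Name) + N' ' + dt.Name
            + CASE WHEN a.Length IS NOT NULL
                   THEN N'(' + CAST(a.Length AS nvarchar(10)) + N')'
                   ELSE N'' END
            + N' NULL'
FROM Attribute a
INNER JOIN Entity e ON e.EntityId = a.EntityId
INNER JOIN DataType dt ON dt.DataTypeId = a.DataTypeId
WHERE a.AttributeId = @NewAttributeId

EXEC sp_executesql @sql
```

Adding a NULLable column like this is a metadata-only change in many cases; the heavy, blocking DDL mentioned in the cons arises mostly for type changes and drops on large tables.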
We developed some tests emulating an entity with 50 attributes of different types for both models above (note that BankView has 50 joins in this case!) and generated about 100,000 rows. We created indexes on (BankId, AttributeId) in all BankAttributeXxx tables to improve JOIN performance. We tested these two simple queries:

1) SELECT * FROM BankFlat WHERE BankId = 12781
2) SELECT * FROM BankView WHERE BankId = 12781
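The supporting indexes described above would look roughly like this (index names are illustrative). A covering variant with Value as an included column is one further option to benchmark, since it lets each join be resolved from the index alone:

```sql
CREATE INDEX IX_BankAttributeVarchar_Bank_Attr
ON BankAttributeVarchar (BankId, AttributeId)

-- Covering variant (SQL Server 2005 and later):
CREATE INDEX IX_BankAttributeInt_Bank_Attr
ON BankAttributeInt (BankId, AttributeId) INCLUDE (Value)
```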
The results are:
1) Usually takes <1 second to return the row
2) Usually takes 9-16 seconds to return the row

That means that on this simple selection of a single record by a PK value, the second model performs more than 10x worse. Of course, if we don't clear the cache and buffers, subsequent runs are much faster.

So these are the questions we have to consider:
1) Did we try all means of optimization for the detached attribute storage model? Can it be optimized with another set of indexes or hints?
2) Is there any chance that the detached attribute storage is more efficient for value-based searches?
3) What is likely to happen on millions of records per table?

There are some political issues about taking one of the approaches. I think that the detached attribute storage model will be a huge failure performance-wise, but there are opinions that it might be compensated by more powerful hardware. I don't believe a 10x gap can be compensated by hardware (and I expect exponential performance degradation on millions of records). The flat model has only the two weak points I listed in its cons (slow DDL blocking users and schema replication). I read that SQL Server 2005 has support for schema replication, but is it reliable? I read some posts that people had trouble using it. I would appreciate any expert advice. Thank you.
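On the value-based search question, a search in the detached model touches a single attribute table, e.g. finding banks by name (a sketch; AttributeId 1 is Name in the sample data). With an index leading on (AttributeId, Value) the seek itself can be efficient; the cost reappears when the remaining attributes of the matching rows have to be joined back together:

```sql
SELECT b.BankId
FROM Bank b
INNER JOIN BankAttributeVarchar av
    ON av.BankId = b.BankId
WHERE av.AttributeId = 1        -- Name
  AND av.Value = 'First National'
```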

I have seen designs like this before. My assessment then was that this is the stupidest idea I have seen, and my opinion has not changed. You should think about what your application needs to accomplish before trying to build a database for it.
How about relevant indexes for those tables that are involved in this JOIN clause?
Taking help from a few DMVs, you can find the root cause of the performance problem. Satya SKJ
Microsoft SQL Server MVP
Writer, Contributing Editor & Moderator
This posting is provided AS IS with no rights for the sake of knowledge sharing. Knowledge is of two kinds. We know a subject ourselves or we know where we can find information on it.
CheckMates: I am working hard on saving from disaster a product whose most important component uses a similar approach. The flexibility and generality you want to achieve is just an illusion. Attributes you add later are added because there is meaning behind them and business rules related to that meaning. You can automate adding attributes, but the application code still has to know exactly what each attribute is, because business logic has to be added. As you can see, to return all columns of what is normally one row in a table, you have to make as many joins as you have columns. It is no wonder we have performance problems with our component, and we are in the middle of rewriting it on the fly. We also have data consistency problems, because you can't apply most validation in the database (foreign keys, unique constraints and so on). Joe was probably a bit rude, but he was right (as usual). Please reconsider the design; it is not a good idea at all.
Thank you all for the answers. I generally share the opinion that the traditional design would be better. The tests were persuasive, and now the whole team agrees that performance would be an issue. The last point in support of the detached attribute storage model is that it is easy to replicate. Can anyone suggest a robust solution for deploying schema changes in a distributed environment?
IME this usually comes up when people want to have just one 'entry point' for lots of different types of entities. There are other and better ways to 'combine' data for different entities, for example in the front-end application.