Recently, I was asked my opinion on a database design schema our company developers were working on. When I heard they were using GUIDs (Globally Unique Identifiers) as PKs (primary keys), my jaw about dropped.
It turns out the reasoning behind the use of GUIDs for keys was completely legitimate, and is in fact the very problem GUIDs were developed by Microsoft to solve in the first place: the developers needed a key that would be unique across databases and database servers. The GUID certainly fulfills this need, but it comes at a cost.
The GUID is a wide column (16 bytes, to be specific), conventionally rendered as 32 hexadecimal characters (the digits 0-9 and the letters A-F). Because this column is the primary key, it will be stored in the clustered index (unless the key is specified as nonclustered), and it will also serve as the row locator in every leaf page of each nonclustered index. In addition, if a GUID is used instead of an integer identity column, the full 16-byte value must be compared for each row returned when that column appears in the WHERE clause.
If a high volume of inserts is done on these tables, the large size of the GUIDs will contribute to page splits, potentially resulting in performance problems. Another disadvantage is that primary keys are typically searched frequently, adding further server overhead. And trying to recall a GUID from memory, read one to someone verbally, or manually type a GUID value into a query is neither fun nor easy.
If you need a key that is unique across all servers, I would recommend using a composite key as the PK instead of a GUID. To do this, I would use two integer columns: the first an identity column to uniquely identify the rows within the table, and the second an integer column to uniquely distinguish the server from other servers. To ensure the integrity of the server_id, I would give the column a default value of that server's server_id, and also place a check constraint on the column to validate that the value can only be that server's server_id.
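As a sketch of that design (the table name, column names, and the server id of 42 are hypothetical, not from the original schema), the composite key with its default and check constraint might look like this:

```sql
-- Hypothetical composite-key table: an identity column plus a server id.
-- The DEFAULT and CHECK constraints pin server_id to this server's value
-- (42 in this sketch), so rows are unique across servers without a GUID.
CREATE TABLE dbo.Orders
(
    order_id   int IDENTITY(1, 1) NOT NULL,
    server_id  int NOT NULL
        CONSTRAINT DF_Orders_server_id DEFAULT (42)
        CONSTRAINT CK_Orders_server_id CHECK (server_id = 42),
    order_date datetime NOT NULL,
    CONSTRAINT PK_Orders PRIMARY KEY CLUSTERED (order_id, server_id)
);
```

The two integer columns together are 8 bytes, half the width of a single 16-byte GUID, and the identity portion keeps inserts sequential.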
I performed a test to validate my theory (results below), but I am interested in what the rest of you have found.
Testing Insert Performance
To test inserts, the first method (referred to as Test1 throughout) used NEWID() as the default on a uniqueidentifier clustered primary key. The second method (Test2) used a composite key made up of two integer columns, one an identity column and the other forced to a server id; together these columns make up the table's clustered primary key. The third method (Test3) used a function to create a sequential uniqueidentifier, built from a combination of the NEWID() function and a converted binary(6) datetime value.
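The original function for Test3 is not shown here, but one common way to build such a sequential value (sometimes called a "COMB" GUID) is to keep the first ten bytes of a fresh NEWID() and replace the last six bytes with the current datetime; because SQL Server sorts uniqueidentifier values by their trailing bytes first, values generated later sort later. A sketch, under the assumption that this is the combination the test used:

```sql
-- Sketch of a "sequential" uniqueidentifier: the first 10 bytes come from
-- NEWID(), and the last 6 bytes are the current datetime cast to binary(6).
-- SQL Server compares uniqueidentifiers starting from the trailing byte
-- group, so these values arrive in roughly ascending index order.
SELECT CAST(
           CAST(NEWID() AS binary(10)) + CAST(GETDATE() AS binary(6))
           AS uniqueidentifier
       ) AS sequential_guid;
```

Wrapped in a function or used as a column default, this produces mostly-ascending keys while retaining the cross-server uniqueness of a GUID.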
The tests followed this methodology: first, the tables were created, and 100,000 rows were inserted into the test tables Test1, Test2, and Test3. The tables were then truncated, and I inserted 4,000,000 rows into each of the test tables Test1, Test2, and Test3.
Next, I tested single-row selects against each table. For the following test, I selected 70,000 rows from each table; all of the 70,000-row select statements were range selects. To make sure I had a clean and accurate test, I first cleared the procedure cache, then the buffer cache, issued a checkpoint on the database, and then ran the select statement. If you are not sure how or why to clear the proc and buffer caches and run a checkpoint on the database, check out Books Online for these topics: DBCC FREEPROCCACHE, DBCC DROPCLEANBUFFERS, and CHECKPOINT. I did this each time I ran a query and recorded the averages here.
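Those warm-up steps translate directly into T-SQL. The usual recipe runs CHECKPOINT before DBCC DROPCLEANBUFFERS, so that dirty pages are flushed first and the buffer pool can then be fully emptied:

```sql
-- Reset caches before each timed query so every run starts cold.
DBCC FREEPROCCACHE;     -- clear cached query plans
CHECKPOINT;             -- flush dirty pages to disk, leaving all pages clean
DBCC DROPCLEANBUFFERS;  -- evict clean pages from the buffer pool
```

Note that these commands affect the whole instance (FREEPROCCACHE) or database (CHECKPOINT), so they belong on a test server, not production.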
The 100,000-row insert into the Test1 table took 79 seconds, and the 4,000,000-row insert took 45,191 seconds. The inserts into the Test2 table took 33 seconds for 100,000 rows and 1,630 seconds for 4,000,000 rows. For Test3, it took 33 seconds to insert 100,000 rows and 2,025 seconds for 4,000,000 rows.
The inserts took much longer for Test1 because GUID values are random: with the index's fill factor at 80% (the default for this test, which I did not change), the random inserts caused a great number of page splits. Because the inserts for Test2 and Test3 were sequential, the fill factor for those indexes could be set to 100%, giving better performance on both inserts and selects because fewer data pages would be used or returned. The roughly 400-second difference between Test2 and Test3 was, I believe, due to the overhead of the function that creates the sequential NEWID() value for each insert.
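For reference, the fill factor is set when an index is created or rebuilt (the index and table names below are hypothetical). A sequential key can safely use 100% because new rows always land on the last page of the index rather than splitting pages in the middle:

```sql
-- Hypothetical example: rebuild a clustered primary key with a 100% fill
-- factor. This is safe for an ever-increasing key, where inserts append
-- to the end of the index instead of forcing mid-page splits.
ALTER INDEX PK_Test2 ON dbo.Test2
REBUILD WITH (FILLFACTOR = 100);
```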