Large Data Operations in SQL Server
The test environment server is a 2-way Xeon 2.4GHz/512K cache, 533MHz system bus, and 2GB memory on a motherboard with the ServerWorks GC-LE chipset. The test database data file resides on 2x15K 18G SCSI U160 drives (OS) striped, all log files on one 10K SCSI drive, and the tempdb data file on an IDE/ATA drive. The database recovery model is set to SIMPLE, and 4GB of space is pre-allocated for both data and log. In addition, the max degree of parallelism was restricted to one CPU for most tests. The results are single query time measurements. There are some variations between measurements since there is no easy method to determine the contents of the buffer cache. An attempt was made to determine a reasonably repeatable measurement.
The following script creates a test table and populates the table with test data. As defined, this table has 8 integer columns of 4 bytes each, a char column of 10 bytes, a decimal column of 5 bytes, and three other columns of 8 bytes each for a total 71 bytes. Along with the metadata overhead, this works out to approximately 81 bytes per row, allowing 99 rows to fit in each 8KB page at 99% fill factor, with 77 bytes free.
CREATE TABLE M3C_00 (
ID int NOT NULL,
ID2 int NOT NULL,
ID3 int NOT NULL,
ID4 int NOT NULL,
ID5 int NOT NULL,
ID6 int NOT NULL,
SeqID int NOT NULL,
DistID int NOT NULL,
Value char(10) NOT NULL,
randDecimal decimal (9,4) NOT NULL,
randMoney money NOT NULL,
randDate datetime NOT NULL,
seqDate datetime NOT NULL )
The script below populates the above table with 10,000,000 rows. The ID column is a sequence from 1 to 10M, but is not explicitly defined as an identity column. The columns ID2 to ID6 are not important to this series of tests, and any values can be used. The SeqID values are sequential and DistID values are distributed. In this case, the table has 10M rows. The first 100,000 rows have SeqID value 1; the next 100K rows have SeqID value 2 and so on. The first row has DistID value 1, the next row with DistID value is row 101, and so on. All rows with a specific SeqID are in adjacent rows. All rows with a specific DistID value are 100 rows apart, which cause each row having a common DistID value to be in a different 8KB data page. This table is not clustered, so while there is no guarantee on the row placement, it does appear that the rows are inserted in sequence.
DECLARE @I int, @rowCnt int,
@p int, @sc1 int, @dv1 int
SELECT @I = 1, @rowCnt =
10000000, @p =100, @sc1 = 100000
SELECT @dv1 = @rowCnt/100000
WHILE @I <= @RowCnt BEGIN
(ID, ID2, ID3, ID4, ID5, ID6, SeqID, DistID,
Value, randDecimal, randMoney,
VALUES ( @I,
@I, @I/2, @I/4, @I/10, @I/20,
(@I-1)/@sc1 + 1, — Sequential values
(@I-1)%(@dv1) + 1, — Distributed values
>CHAR(65 + 26*rand())+CHAR(65 +
26*rand())+CHAR(65 + 26*rand())
+CONVERT(char(6),CONVERT(int,100000*(9.0*rand()+1.0)))+CHAR(65 + 26*rand()),
SET @I = @I+1
The above script takes 17min 45sec to run, averaging 9,380 row inserts per second. The M3C_00 table ends up at 105,264 pages with 95% fillfactor. Instead of regenerating the table each time a fresh data set is needed, the original data set is preserved in the above table and a copy of the original is generated with the SELECT INTO statement below.
This table has 101,012 pages, averaging 99 rows per page at 99% fillfactor for a table size of just over 789MB. The SELECT INTO takes approximately 28 seconds to run, averaging 357,000 rows inserted per second. The following two indexes are created on SeqID and DistID columns. Each index occupies 18,554 pages or 145MB.
CREATE INDEX IX_M3C_01_Seq
ON M3C_01 (SeqID) WITH SORT_IN_TEMPDB
CREATE INDEX IX_M3C_01_Dist
ON M3C_01 (DistID) WITH SORT_IN_TEMPDB
The first index takes approximately 30 seconds to create with 256M memory and the seconds takes 60 seconds.